Just wanted to write about a recent framework I developed for database replication with postgres. For readers it would give you an idea to think in this direction if you come across this problem.
We had a requirement to replicate data realtime from a postgres database from multiple machines which reside inside a firewall to a cloud server on the internet. We evaluated various technologies and tools available on market , every solution we came across requires us to open up port in the firewall. And most of it are not real time. Most of the tools that we saw in market were ETL kind of tool where you take the data in a batch and replicate it , more over it will not work across firewall. I was architecting this product , and i have to come up with a solution no matter what.I opted to write my own framework.
Im a strong believer of build the solution in mind/paper before doing the code. So i have to develop a replication system that would run onvarious machine and which would replicate data to the central server.
Im not going to mention the thought process that i put in for each design decision i have taken , but im going to mention what is the end result.
Step 1. I cracked the JDBC libary of postgres, Took the source code from Postgres opensource repository and i read the code flow of the JDBC driver of the postgres.
Static statement Vs Prepared statement… issue.
Java program would use the jdbc libary to construct a static SQL statement or a prepared statement. When it is a static SQL query you have the query in hand. But when it is a prepared statement is it actually inside teh JDBC driver code where the actual Query is prepared before sending to the native methods to postgres.
I had figured out a place where the entire query leaves the JDBC diver to the native funtions to the database. There i have written a queue to sniff all the querys that leaves the system.
For technical queries regarding sniffing the query from driver write 2 firstname.lastname@example.org
Step 2. Now that I have a queue of the sniffed query i have to ship it across to the server which is across firewall. So WebService comes to resque here. I published a webservice at the Server to accept query and the client identifier and replicate create the connection and issue the query to the database.
Step 3: So have written a engine that woudl take the queued query at the client end and ship the query to the server across across firewall through webservice. And the server end of the webservice would fire the same query on the server end.
Multiple client (Master) postgres databases were able to replicate real time data to a single database cluster on the server.
Very high level design.
After end of regrous design and implementation and performance testing, the framework that I designed and implemented effeciently replicates databases from multiple machines into the cloud server across firewall. It really scales up well….to make me happy.
Feel free to contact me[ email@example.com] if you need more insight on the technical aspects of the framework. Only technical queries invited.