None of Pentaho, Kettle, or Talend supports RRD as a data source for Business Intelligence (BI). I have designed an Enterprise Information Integration (EII) framework/layer over multiple RRD data sources. This layer allows users of EII-RRD (the solution) to aggregate data across multiple RRD files and run BI functions (average/max/min) on them.
I will attempt to give a brief description of the problem and the constraints to be considered before deciding on a solution.
A Round Robin Database (RRD) is a file-based database used to store time/value pairs. It is widely used in network management solutions (NMS) that need to record values against time, e.g. network latency every 5 minutes. The database can store averages for various time intervals, so when you update it every 5 minutes, it automatically updates the hourly or daily averages. Pretty useful in the performance management portion of NMS solutions.
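To make the consolidation idea concrete, here is a minimal sketch of how a coarser interval's average can be maintained incrementally as finer-grained samples arrive. The class and method names are my own illustration, not part of rrdtool or any real RRD library, which implement this internally.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of RRD-style consolidation: 5-minute samples are rolled up
// into an hourly average as they arrive. All names are illustrative.
class Consolidator {
    private final Map<Long, double[]> hourly = new HashMap<>(); // hour bucket -> {sum, count}

    public void addSample(long epochSeconds, double value) {
        long hourBucket = epochSeconds / 3600;            // which hour this sample falls in
        double[] acc = hourly.computeIfAbsent(hourBucket, k -> new double[2]);
        acc[0] += value;                                   // running sum
        acc[1] += 1;                                       // sample count
    }

    // Hourly average for the hour containing the given time, or NaN if empty.
    public double hourlyAverage(long epochSeconds) {
        double[] acc = hourly.get(epochSeconds / 3600);
        return acc == null ? Double.NaN : acc[0] / acc[1];
    }
}
```

So an update every 5 minutes keeps the hourly figure current without ever re-reading the raw samples.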
Now what is the problem? When you talk of Business Intelligence, it is a matter of aggregating data across multiple sources and trying to correlate it to obtain information that is useful for decision making or analysis.
So here is the problem statement: you will have RRD files recording a protocol's response time for each IP in a particular network, and you will have multiple such networks. The BI task is: given a time frame, compute the average response time of a particular protocol across all machines in all networks.
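The aggregation step of that query, averaging a metric per timestamp across many sources, can be sketched as follows. Each source here stands in for the time/value pairs read from one RRD file; the names are illustrative, not the actual framework code.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: average a metric per timestamp across several sources.
// Each source is a timestamp -> value map, as read from one RRD file.
class CrossSourceAverage {
    public static TreeMap<Long, Double> average(List<Map<Long, Double>> sources) {
        TreeMap<Long, double[]> acc = new TreeMap<>();     // timestamp -> {sum, count}
        for (Map<Long, Double> source : sources) {
            for (Map.Entry<Long, Double> e : source.entrySet()) {
                double[] a = acc.computeIfAbsent(e.getKey(), k -> new double[2]);
                a[0] += e.getValue();
                a[1] += 1;
            }
        }
        TreeMap<Long, Double> out = new TreeMap<>();       // timestamp -> average, in time order
        for (Map.Entry<Long, double[]> e : acc.entrySet()) {
            out.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return out;
    }
}
```

Note that a timestamp present in only some sources is averaged over just those sources, which is usually what you want when machines report at different times.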
Constraints to consider before designing.
The problem is daunting and computationally intensive when you consider the time and space complexity. The solution's main focus is memory complexity; the second is time. The memory complexity is a must-solve, and the time complexity should be reduced to the point where horizontal scalability can kick in when resources are limited.
Crux of the solution:
Memory: I used the virtual memory concept of design here. Consider a user query asking for the average graph from 1970 to 2008 at 5-minute intervals; imagine the memory that would be allocated, i.e. one slot for every 5-minute interval between 1970 and 2008.
In my design, the processing unit reads the time/value pairs from the RRD files and hands them over to a Virtual Memory layer. This layer promises the processing unit that it has the memory to store all the data (similar to the way the VM in an OS does), but it allocates memory only if data is actually available for a given time interval. Regardless of the range the user requests, the data will be crowded around the 2008 time frame, so the effective memory use is very low. This virtual-memory-like design (promise more, but do the work only for what is used) is something new brought into my design catalogue. It brought the memory usage down from GBs to 1 or 2 MB.
This virtual memory style of design really helped solve the memory problem; I loved it and will be using it in my future designs.
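A minimal sketch of that lazy-allocation idea (my own illustrative names, not the actual framework code): the layer accepts any time range but only allocates entries for intervals that actually hold data.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of the "virtual memory" layer: it accepts any time range but
// allocates storage only for intervals that actually receive data, so a
// query from 1970 to 2008 costs memory proportional to the populated
// intervals, not to the number of 5-minute slots in the range.
class SparseSeries {
    private final TreeMap<Long, Double> data = new TreeMap<>(); // only populated slots

    public void put(long timestamp, double value) {
        data.put(timestamp, value); // allocate on demand; nothing else is reserved
    }

    // Value for a slot, or NaN if that interval was never written.
    public double get(long timestamp) {
        Double v = data.get(timestamp);
        return v == null ? Double.NaN : v;
    }

    public int allocatedSlots() {
        return data.size();
    }

    // All populated entries inside [from, to], in time order.
    public SortedMap<Long, Double> range(long from, long to) {
        return data.subMap(from, to + 1);
    }
}
```

The caller can still iterate the full requested range and render gaps as "no data"; only the populated slots cost memory.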
The time complexity and other aspects of the problem were nothing interesting; I solved them with my usual design experience, no new learnings there.
What's in it for you?
Blog readers/architects: whenever your design for a problem allocates a huge chunk of memory in proportion to the input (or to some unbounded system parameter), and you end up using only a portion of it to actually solve the problem because of the distribution characteristics of the input data, consider this virtual memory concept of design in your bouquet of design principles. It might help.
For people who are tired of searching open source or proprietary solutions for doing BI on RRD: if you wish to get more insight into my solution or want to discuss any aspect of it, contact firstname.lastname@example.org. Only technical questions encouraged.
I just wanted to write about a recent framework I developed for database replication with Postgres. For readers, it may give you a direction to think in if you come across this problem.
We had a requirement to replicate data in real time from Postgres databases on multiple machines inside a firewall to a cloud server on the internet. We evaluated the various technologies and tools available on the market; every solution we came across required us to open up a port in the firewall, and most were not real time. Most of the tools we saw were ETL-style tools that take data in batches and replicate it, and moreover they would not work across a firewall. I was architecting this product and had to come up with a solution no matter what, so I opted to write my own framework.
I am a strong believer in building the solution in mind/on paper before writing the code. So I had to design a replication system that would run on various machines and replicate data to the central server.
I am not going to describe the thought process behind each design decision I took; I am only going to describe the end result.
Step 1: I cracked open the JDBC library of Postgres. I took the source code from the Postgres open source repository and read through the code flow of the Postgres JDBC driver.
The static statement vs. prepared statement issue.
A Java program uses the JDBC library to construct either a static SQL statement or a prepared statement. When it is a static SQL query, you have the query in hand. But when it is a prepared statement, the actual query is assembled inside the JDBC driver code before being sent on to the native methods for Postgres.
I figured out the place where the entire query leaves the JDBC driver for the native functions of the database. There I wrote a queue to sniff all the queries that leave the system.
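I cannot reproduce the driver patch here, but the queueing side can be sketched. The idea: at the point where the fully expanded SQL (static or prepared) leaves the driver, the patched code hands the string to a shared queue; the hook name and placement below are my own illustration.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the sniffing queue. In the modified driver, the point where
// the fully-expanded SQL leaves for the backend would call
// QuerySniffer.record(sql). All names here are illustrative.
class QuerySniffer {
    private static final BlockingQueue<String> QUEUE = new LinkedBlockingQueue<>(10_000);

    // Called from the patched driver; drops the query if the queue is
    // full rather than blocking the application's JDBC call.
    public static boolean record(String sql) {
        return QUEUE.offer(sql);
    }

    // Called by the shipping engine; blocks until a query is available.
    public static String take() throws InterruptedException {
        return QUEUE.take();
    }

    // Non-blocking variant; returns null when the queue is empty.
    public static String poll() {
        return QUEUE.poll();
    }

    public static int pending() {
        return QUEUE.size();
    }
}
```

The bounded queue is a deliberate choice: if the shipping side falls behind, the application's own writes are never stalled by replication.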
For technical queries regarding sniffing the query from the driver, write to email@example.com.
Step 2: Now that I had a queue of sniffed queries, I had to ship them across to the server on the other side of the firewall. A web service comes to the rescue here. I published a web service at the server to accept a query and a client identifier, create the replication connection, and issue the query to the database.
Step 3: I wrote an engine that takes the queued queries at the client end and ships them to the server across the firewall through the web service, and the server end of the web service fires the same query on the server database.
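The client-side engine described in steps 2 and 3 can be sketched as below. The web service call is abstracted behind a `BiPredicate<clientId, sql> -> success` so the loop can be shown without a real HTTP stack; the production version would POST to the server's replication endpoint instead. All names are my own illustration.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.BiPredicate;

// Sketch of the client-side shipping engine: drains the sniffed-query
// queue and ships each query, tagged with the client identifier, to the
// server-side web service (represented here by the transport function).
class ShippingEngine {
    private final String clientId;
    private final Queue<String> queue = new ArrayDeque<>();
    private final BiPredicate<String, String> transport; // (clientId, sql) -> success

    public ShippingEngine(String clientId, BiPredicate<String, String> transport) {
        this.clientId = clientId;
        this.transport = transport;
    }

    public void enqueue(String sql) {
        queue.add(sql);
    }

    // Ship everything currently queued; re-queue on failure so no query
    // is lost when the server is unreachable. Returns queries shipped.
    public int drain() {
        int shipped = 0;
        int attempts = queue.size();
        for (int i = 0; i < attempts; i++) {
            String sql = queue.poll();
            if (transport.test(clientId, sql)) {
                shipped++;
            } else {
                queue.add(sql); // retry on the next drain cycle
            }
        }
        return shipped;
    }

    public int pending() {
        return queue.size();
    }
}
```

Because the client initiates every connection outward, no inbound firewall port ever needs to be opened, which was the whole point of the design.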
With this, multiple client (master) Postgres databases were able to replicate data in real time to a single database cluster on the server.
Very high level design.
After rigorous design, implementation, and performance testing, the framework I designed and implemented efficiently replicates databases from multiple machines into the cloud server across the firewall. It really scales up well... which makes me happy.
Feel free to contact me [firstname.lastname@example.org] if you need more insight into the technical aspects of the framework. Only technical queries invited.