None of Pentaho, Kettle, or Talend supports RRD as a data source for Business Intelligence (BI). I have designed an Enterprise Information Integration (EII) framework/layer over multiple RRD data sources. This layer allows users of EII-RRD (the solution) to aggregate data across multiple RRD files and run BI functions (average/max/min) on them.
I will attempt to give a brief description of the problem and the constraints to be considered before deciding on a solution.
A Round Robin Database (RRD) is a file-based database used to store time/value pairs. It is widely used in network management solutions, where values need to be recorded against time, e.g. network latency every 5 minutes. The database can store averages for various time intervals, so that when you update it every 5 minutes it automatically rolls the data up into hourly or daily averages. Pretty useful in the performance management portion of NMS solutions.
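To make the idea concrete, here is a toy sketch of that rollup behavior: a fixed-size ring of 5-minute samples that automatically consolidates finished hours into an hourly-average archive. This is only an illustration of the concept; class and method names are mine, and real rrdtool files use a different on-disk format and consolidation machinery.

```python
from collections import deque

class MiniRRD:
    """Toy sketch of an RRD-style archive: a ring of 5-minute samples
    plus an hourly-average archive consolidated as samples arrive.
    Illustrative only -- not rrdtool's real format or API."""

    def __init__(self, primary_slots=288, hourly_slots=24):
        # deques with maxlen behave like round-robin archives:
        # once the ring is full, the oldest sample is overwritten.
        self.primary = deque(maxlen=primary_slots)   # (timestamp, value)
        self.hourly = deque(maxlen=hourly_slots)     # (hour_start, average)
        self._bucket = []                            # samples in current hour
        self._hour = None

    def update(self, timestamp, value):
        self.primary.append((timestamp, value))
        hour = timestamp - (timestamp % 3600)
        if self._hour is not None and hour != self._hour:
            # Hour rolled over: consolidate the finished bucket into
            # its hourly average, exactly the "automatic rollup" above.
            self.hourly.append((self._hour, sum(self._bucket) / len(self._bucket)))
            self._bucket = []
        self._hour = hour
        self._bucket.append(value)
```

Feeding twelve 5-minute samples and then one sample in the next hour produces one consolidated hourly average, while the primary ring stays bounded at its fixed size.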
Now, what is the problem? When you talk of Business Intelligence, it is a matter of aggregating data across multiple sources and trying to correlate it to obtain information useful for decision making or analysis.
So here comes the problem statement: you will have RRD files recording a protocol's response time for an IP on a particular network, and you will have multiple networks like that. The BI task here is: given a time frame, compute the average response time of a particular protocol across all machines, in all networks.
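As a minimal sketch of that aggregation step, assume each RRD file has already been fetched into a timestamp-to-value mapping (the function and parameter names below are illustrative, not part of any RRD API). The cross-network average is then a per-timestamp mean over whichever files have data at that point:

```python
from collections import defaultdict

def average_across_networks(fetched, start, end):
    """Given per-file {timestamp: response_time} dicts -- stand-ins for
    the fetched contents of each network's RRD -- return the average
    response time per timestamp across all files, within [start, end].
    Illustrative names; not a real RRD API."""
    sums, counts = defaultdict(float), defaultdict(int)
    for series in fetched:
        for ts, value in series.items():
            if start <= ts <= end:
                sums[ts] += value
                counts[ts] += 1
    return {ts: sums[ts] / counts[ts] for ts in sums}
```

Note that a timestamp present in only some files is averaged over just those files, which matches how gaps in individual RRDs are usually handled.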
Constraints to consider before designing:
The problem is daunting and computationally intensive when you consider its time and space complexity. The solution's main focus is to address memory complexity; the second is time. The memory complexity is a must-solve, and the time complexity should be reduced to the point where horizontal scalability can kick in when resources are limited.
Crux of the solution:
Memory: I used the virtual memory concept in the design here. Consider a user querying for the average graph from 1970 to 2008 at a 5-minute interval; imagine the memory that would have to be allocated, i.e. one slot for every 5-minute interval between 1970 and 2008.
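A quick back-of-the-envelope check shows why a dense allocation is hopeless (taking the range as 1970-01-01 to 2008-01-01 for concreteness):

```python
from datetime import datetime

# How many 5-minute slots would a dense buffer need for 1970-2008?
span = datetime(2008, 1, 1) - datetime(1970, 1, 1)
slots = int(span.total_seconds()) // 300   # one slot per 5 minutes
# Roughly four million slots -- and that is before per-object overhead
# in a managed runtime multiplies the raw cost of each (time, value) pair.
print(slots)
```

Even at a modest per-slot cost this runs to tens or hundreds of megabytes, almost all of it for intervals that will never hold data.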
In my design, the processing unit reads the time/value pairs from the RRD files and hands them over to a Virtual Memory layer. This layer promises the processing unit that it has the memory to store all the data (similar to the way virtual memory works in an OS), but it actually allocates memory only if data is available for that time interval. Regardless of the user's request, the data will almost certainly be crowded around the 2008 time frame, so the effective memory use is very small. This virtual-memory-like design (promise more, but allocate only for what is actually used) is something new in my design catalogue. It brought the memory usage down from GBs to 1 or 2 MB.
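The promise-but-allocate-lazily idea can be sketched as a dict-backed sparse series. This is my own minimal illustration of the technique described above, not the actual EII-RRD code; the class and method names are invented for the example:

```python
class SparseSeries:
    """Sketch of the 'virtual memory' layer: promise the processing unit
    a slot for every 5-minute interval in [start, end], but back the
    promise with a dict so memory is allocated only for intervals that
    actually carry data. Illustrative names, not the EII-RRD code."""

    STEP = 300  # 5-minute interval, in seconds

    def __init__(self, start, end):
        self.start, self.end = start, end  # the full promised range
        self._slots = {}                   # slot index -> value, lazily filled

    def store(self, timestamp, value):
        if not self.start <= timestamp <= self.end:
            raise ValueError("timestamp outside promised range")
        # Allocation happens here, only for slots that receive data.
        self._slots[(timestamp - self.start) // self.STEP] = value

    def average(self):
        # Aggregation walks only the allocated slots, so a 38-year
        # promise costs memory only for the intervals that have data.
        return sum(self._slots.values()) / len(self._slots)

    def allocated(self):
        return len(self._slots)
```

Even when the promised range spans 1970 to 2008, storing three samples allocates exactly three slots, which is the whole trick: the processing unit can address any interval, but memory scales with the data's actual distribution.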
This virtual-memory-like design really helped solve the memory problem; I loved it and will be using it in future designs.
The time complexity and other aspects of the problem were nothing interesting; I solved them with my usual design experience, so no new learnings there.
What's in it for you?
Blog readers/architects: whenever you have a design that allocates a huge chunk of memory in proportion to the input (or to some system parameter that has no bounds), and you end up using only a portion of it to actually solve the problem because of the distribution characteristics of the input data, consider this virtual memory concept in your bouquet of design principles. It might help.
For people who are tired of searching for open source or proprietary solutions that do BI on RRD: if you wish to get more insight into my solution or want to discuss any aspect of it, contact firstname.lastname@example.org. Only technical questions are encouraged.