None of Pentaho, Kettle or Talend supports RRD as a data source for Business Intelligence (BI). I have designed an Enterprise Information Integration (EII) framework/layer over multiple RRD data sources. This layer allows users of the EII-RRD solution to aggregate data across multiple RRD files and run BI functions (average/max/min) on them.
I will attempt to give a brief description of the problem and the constraints to be considered before deciding on a solution for it.
Round Robin Database (RRD) is a file-based database used to store time/value pairs. It is widely used in network management solutions where some value needs to be recorded against time, e.g. network latency every 5 minutes. The database stores averages for various time intervals, so that when you update it every 5 minutes it automatically updates the hourly or daily averages. Pretty useful in the performance management portion of NMS solutions.
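To make the consolidation idea concrete, here is a minimal Java sketch (my own illustration, not RRDtool's actual implementation) of rolling 5-minute latency samples up into an hourly average:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of RRD-style consolidation: raw 5-minute samples
// are rolled up into a coarser (hourly) average.
public class RrdConsolidation {

    // One time/value pair, as an RRD archive would store it.
    static class Sample {
        final long epochSeconds;
        final double value;
        Sample(long epochSeconds, double value) {
            this.epochSeconds = epochSeconds;
            this.value = value;
        }
    }

    // Average all samples that fall inside the hour starting at 'hourStart'.
    static double hourlyAverage(List<Sample> samples, long hourStart) {
        double sum = 0;
        int count = 0;
        for (Sample s : samples) {
            if (s.epochSeconds >= hourStart && s.epochSeconds < hourStart + 3600) {
                sum += s.value;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        List<Sample> latency = new ArrayList<Sample>();
        // Twelve 5-minute latency readings make up one hour.
        for (int i = 0; i < 12; i++) {
            latency.add(new Sample(i * 300L, 10.0 + i)); // 10, 11, ... 21 ms
        }
        System.out.println(hourlyAverage(latency, 0)); // average of 10..21 = 15.5
    }
}
```

A real RRD keeps a fixed number of such consolidated rows per interval, overwriting the oldest in round-robin fashion.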
Now what is the problem? When you talk of Business Intelligence, it is a matter of aggregating data across multiple sources and trying to correlate it to obtain information useful for decision making or analysis.
So here comes the problem statement: you have RRD files recording a protocol's response time for each IP in a particular network, and you have multiple such networks. The BI task is: given a time frame, grab the average response time of a particular protocol across all machines in all networks.
Constraints to consider before designing.
The problem is daunting and computationally intensive when you consider the time and space complexity. The solution's main focus is memory complexity; the second is time. Memory complexity is a must-solve, and time complexity should be reduced to the point where horizontal scalability kicks in when resources are limited.
Crux of the solution:
Memory: I used the virtual memory concept of design here. Consider a user querying for the average graph from 1970 to 2008 at a 5-minute interval, and imagine the memory that would be allocated: one slot for every 5-minute interval between 1970 and 2008.
In my design the processing unit reads the time/value pairs from the RRD files and hands them over to a Virtual Memory layer. This layer promises the processing unit that it has the memory to store all the data (similar to the way VM in an OS does), but it allocates memory only if data is actually available for that time interval. Regardless of the user's request, the data will certainly be crowded around the 2008 time frame, so the effective memory use is very small. This Virtual Memory-like design (promise more, but do the work for less) is something new brought into my design catalogue. It brought the memory usage down from GBs to 1 or 2 MB.
This virtual-memory-style design really helped solve the memory problem; I loved it and will be using it in my future designs.
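The idea can be sketched in a few lines of Java (my own illustration with invented names, not the framework's actual code): the caller is promised the full range, but storage is allocated only for slots that actually receive data.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of the "virtual memory" layer: it promises the processing unit
// room for every 5-minute slot in the requested range, but physically
// allocates an entry only when a slot actually receives data.
public class SparseSeries {
    private final long start;          // range start (epoch seconds)
    private final long end;            // range end (epoch seconds)
    private final long step = 300;     // 5-minute resolution
    private final SortedMap<Long, Double> slots = new TreeMap<Long, Double>();

    public SparseSeries(long start, long end) {
        this.start = start;
        this.end = end;
    }

    // Store a value; memory is allocated only here, on demand.
    public void put(long time, double value) {
        if (time < start || time >= end) throw new IllegalArgumentException("out of range");
        slots.put(alignToStep(time), value);
    }

    // Read any slot in the promised range; empty slots cost nothing.
    public double get(long time) {
        Double v = slots.get(alignToStep(time));
        return v == null ? Double.NaN : v;
    }

    // Physical footprint: the number of slots actually backed by memory.
    public int allocatedSlots() {
        return slots.size();
    }

    private long alignToStep(long time) {
        return time - (time % step);
    }
}
```

Even if the promised range covers every 5-minute slot from 1970 to 2008 (roughly four million slots), only the populated slots near 2008 consume memory.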
The time complexity and other aspects of the problem are nothing interesting, as I solved them with my usual design experience; no new learnings there.
What's in it for you?
Blog readers/architects: whenever you have a design that allocates a huge chunk of memory proportional to the input (or to some system parameter with no bounds), and you end up using only a portion of it to actually solve the problem because of the distribution characteristics of the input data, consider this virtual memory concept in your bouquet of design principles. It might help.
For people who are tired of searching for open source or proprietary solutions doing BI on RRD: if you wish to get more insight into my solution or want to discuss any aspect of it, contact email@example.com. Only technical questions encouraged.
Java 5 lets architects scale an application's memory behaviour based on the characteristics of the application's memory usage pattern.
Java Garbage collector basics
The default GC in Java is a serial collector, i.e. when Java decides to do a GC, your application threads are suspended until the GC thread finishes.
On a single-processor machine this type of GC is fine, but on a multi-processor machine it is a killer. Imagine your JBoss or IBM WebSphere running a banking project; there would surely be a high hardware investment with multiple processors (not less than a 12-processor machine). With this dedicated setup and serial collection, the application that ran on 12 processors stops, and only one processor is used for the GC activity; your application is halted. So the throughput of the application is directly impacted by your GC, and it worsens as the number of processors increases.
So it is a must to customize GC collection. But remember: until you understand the intricacies of the Java heap and GC, don't meddle with the GC settings; leave them at the default, because a non-expert is more likely to spoil throughput than to increase it.
See the throughput distribution in the graph below.
Java garbage collection design
What would you do if you were given the chance to design the GC? If you have a serial algorithm that sweeps all the objects in memory and then deallocates the unreferenced ones, the Big-O of the algorithm is directly proportional to the number of objects in memory. So the time complexity of the algorithm you design worsens for larger systems.
How does Sun Microsystems get around this time complexity issue?
As far as memory consumption is concerned, research has identified that young objects have the highest probability of dying first. That means a recently created object is more likely to die before an object that has survived for a while. Current GC algorithms efficiently use this property of memory usage to produce better Big-O numbers.
The entire Java heap is segregated into multiple segments to take advantage of this young-die-first fact.
The figure shows how the heap is segregated: the entire heap is separated into Young, Tenured and Perm spaces.
The GC algorithm is split into minor and major runs.
A minor run does GC only in the Young space, and a major run does GC on both the Young and Tenured spaces. The major run is the maximum time a GC could take, and we don't want it to run that often. To avoid major runs, the Java GC uses the young-die-first fact and runs GC on the Young space. If an object survives the run, it is moved to Tenured. When Tenured fills up, a major run is triggered. This means major runs are mostly avoided.
What does this imply for architects?
By intelligently manipulating the Young and Tenured sizes we can impact various characteristics of the application:
1. Frequency of GC runs.
2. Time taken for the GC to complete its run.
3. Throughput of the application.
I'm not going to explain why each is impacted; readers are expected to understand the relation at this point in the tutorial.
Java 5 provides you the ability to manipulate the relative sizes of the memory segments.
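For example, the generation sizes can be set on the command line. These are real Java 5 HotSpot flags, but the sizes below and the jar name are only illustrative, not recommendations:

```shell
# Illustrative Java 5 heap-shaping flags (sizes are examples only):
#   -Xms/-Xmx          fix the overall heap size
#   -Xmn               size of the young generation
#   -XX:SurvivorRatio  ratio of eden to each survivor space
#   -XX:MaxPermSize    upper bound for the permanent generation
java -Xms512m -Xmx512m -Xmn128m -XX:SurvivorRatio=8 -XX:MaxPermSize=64m -jar xyz.jar
```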
Yes, I agree the throughput problem of the serial collector is still open. Java 5 lets us tackle this by providing two alternative GC algorithms to the traditional serial collector:
1. Throughput collector
2. Concurrent Low Pause Collector
I will attempt to give a short description of the above collectors.
1. Throughput collector
The throughput collector is a generational collector similar to the serial collector but with multiple threads used to do the minor collection. The major collections are essentially the same as with the serial collector. By default on a host with N CPUs, the throughput collector uses N garbage collector threads in the minor collection. The number of garbage collector threads can be controlled with a command line option (see below). On a host with 1 CPU the throughput collector will likely not perform as well as the serial collector because of the additional overhead for the parallel execution (e.g., synchronization costs). On a host with 2 CPUs the throughput collector generally performs as well as the serial garbage collector and a reduction in the minor garbage collector pause times can be expected on hosts with more than 2 CPUs.
2. Concurrent Low Pause Collector
The concurrent low pause collector is a generational collector similar to the serial collector. The tenured generation is collected concurrently with this collector. This means the pause in the application is close to nil.
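Either collector is selected with a command-line flag. The flag names below are real HotSpot options, but the thread count and jar name are placeholders:

```shell
# Throughput collector: parallel minor collections, thread count adjustable.
java -XX:+UseParallelGC -XX:ParallelGCThreads=4 -jar xyz.jar

# Concurrent low pause collector: tenured space collected concurrently.
java -XX:+UseConcMarkSweepGC -jar xyz.jar
```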
Aspiring memory manipulators :)
To start with, just observe the memory consumption of your software system:
java -verbose:gc -jar xyz.jar
[GC 325407K->83000K(776768K), 0.2300771 secs]
[GC 325816K->83372K(776768K), 0.2454258 secs]
[Full GC 267628K->83769K(776768K), 1.8479984 secs]
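The same numbers are also available from inside the process through the java.lang.management API introduced in Java 5. A small sketch (the class name and the garbage-making loop are mine):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Reads the JVM's own GC counters (Java 5+), giving the same picture
// as -verbose:gc but queryable from inside the application.
public class GcStats {
    public static void main(String[] args) {
        // Create some garbage so at least one collection is likely.
        long sink = 0;
        for (int i = 0; i < 100000; i++) {
            sink += new String("garbage-" + i).length();
        }
        System.gc(); // a hint only, but it usually triggers a collection

        List<GarbageCollectorMXBean> gcs = ManagementFactory.getGarbageCollectorMXBeans();
        for (GarbageCollectorMXBean gc : gcs) {
            System.out.println(gc.getName()
                    + ": runs=" + gc.getCollectionCount()
                    + ", totalMillis=" + gc.getCollectionTime());
        }
    }
}
```

Typically you will see one bean per generation (for example a copying collector for Young and a mark-sweep collector for Tenured), which maps directly onto the minor/major runs described above.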
Gather enough understanding of the GC behaviour of your system against the hardware. Just remember: a system that is best on a single processor will be a pain on a multiprocessor, and a system that is good on a multiprocessor will be a killer on a single one. The efficiency also differs with the application characteristics. So leave it at the default until you are comfortable with the details.
So GC tuning by architects is an ever-on task throughout the lifecycle of the project. And it requires practice.
As always, only technical queries regarding GC tuning are accepted at firstname.lastname@example.org.
I just wanted to write about a recent framework I developed for database replication with Postgres. For readers, it may give you an idea to think in this direction if you come across this problem.
We had a requirement to replicate data in real time from Postgres databases on multiple machines, which reside inside a firewall, to a cloud server on the internet. We evaluated various technologies and tools available in the market; every solution we came across required us to open up a port in the firewall, and most were not real time. Most of the tools we saw were ETL-style tools where you take the data in a batch and replicate it; moreover, they would not work across the firewall. I was architecting this product and had to come up with a solution no matter what, so I opted to write my own framework.
I'm a strong believer in building the solution in your mind/on paper before writing the code. So I had to develop a replication system that would run on various machines and replicate data to a central server.
I'm not going to mention the thought process behind each design decision I took; I'm going to mention the end result.
Step 1: I cracked open the JDBC library of Postgres. I took the source code from the Postgres open source repository and read the code flow of the Postgres JDBC driver.
The static statement vs. prepared statement issue.
A Java program uses the JDBC library to construct either a static SQL statement or a prepared statement. When it is a static SQL query, you have the query in hand. But when it is a prepared statement, it is actually inside the JDBC driver code that the final query is prepared before being sent to the native methods of Postgres.
I figured out the place where the entire query leaves the JDBC driver for the native functions of the database. There I wrote a queue to sniff all the queries that leave the system.
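The queue side of that hook can be sketched like this. This is only an illustration with invented names, not the actual patched-driver code; the real hook point lives inside the Postgres JDBC driver source:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the sniffing queue (hypothetical names). The patched JDBC
// driver would call offer() at the single point where every fully
// prepared query leaves for the native layer; a shipper thread drains
// the queue so the application threads are never blocked.
public class QuerySniffer {
    // Bounded, so a dead shipper cannot exhaust the client's memory.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>(10000);

    // Called from inside the driver; must never block the caller.
    public boolean offer(String sql) {
        return queue.offer(sql); // drops the query if the queue is full
    }

    // Called by the replication engine's shipper thread.
    public String take() throws InterruptedException {
        return queue.take();
    }

    public int pending() {
        return queue.size();
    }
}
```

Because the hook sits below statement preparation, static and prepared statements both arrive here as final SQL text.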
For technical queries regarding sniffing the query from the driver, write to email@example.com.
Step 2: Now that I have a queue of sniffed queries, I have to ship them across to the server on the other side of the firewall. Web services come to the rescue here. I published a web service at the server to accept the query and the client identifier, create the connection and issue the query to the database.
Step 3: I wrote an engine that takes the queued queries at the client end and ships them to the server across the firewall through the web service. The server end of the web service fires the same query on the server database.
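The shipping engine can be sketched as below. The names are hypothetical and the web service call is stubbed behind an interface here; the real framework published an actual web service endpoint on the server:

```java
import java.util.concurrent.BlockingQueue;

// Sketch of the client-side shipping engine (hypothetical names, not the
// actual framework code). It drains the queue of sniffed queries and
// replays each one on the server through an outbound call; outbound calls
// work across the firewall because the client initiates the connection.
public class ReplicationEngine implements Runnable {

    // Stand-in for the web service published at the server.
    public interface Transport {
        void ship(String clientId, String sql) throws Exception;
    }

    private final BlockingQueue<String> sniffedQueries;
    private final Transport transport;
    private final String clientId;

    public ReplicationEngine(BlockingQueue<String> sniffedQueries,
                             Transport transport, String clientId) {
        this.sniffedQueries = sniffedQueries;
        this.transport = transport;
        this.clientId = clientId;
    }

    public void run() {
        try {
            while (true) {
                String sql = sniffedQueries.take();   // blocks until a query arrives
                try {
                    transport.ship(clientId, sql);    // server fires the same query
                } catch (Exception e) {
                    sniffedQueries.offer(sql);        // crude retry: put it back
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();       // shut down cleanly
        }
    }
}
```

The client identifier lets the single server-side endpoint route each replayed query to the right database in the cluster.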
Multiple client (master) Postgres databases were able to replicate real-time data to a single database cluster on the server.
Very high level design.
After rigorous design, implementation and performance testing, the framework I designed and implemented efficiently replicates databases from multiple machines into the cloud server across the firewall. It really scales up well, to my delight.
Feel free to contact me [firstname.lastname@example.org] if you need more insight into the technical aspects of the framework. Only technical queries invited.
Sun has done excellent work integrating remote management into the JVM. They have built SNMP-agent-like capability into the JVM: an SNMP OID is analogous to an MBean, and an SNMP MIB is analogous to the MBean Server.
This beautiful capability allows you to connect to the JVM and monitor the crimes we have committed in the code with regard to runtime memory/CPU and thread utilization. I became a lover of this feature, and I have always wanted to build this capability into the applications we create. This feature is a real bliss for architects.
Head on Jump
To have first-hand experience with JVM instrumentation, just follow these simple steps:
1. Have Java 5.0 installed on your system and create a Java program that runs until you forcefully stop it. If you have a framework with thread pools, resource management, etc., it is a good example.
2. Run your Java program with this additional parameter:
java -Dcom.sun.management.jmxremote -jar xyz.jar
This parameter publishes the MBeanServer in the JVM as an RMI resource for JConsole to connect to.
3. Start JConsole and connect to your program in the connection dialogue box.
JConsole will allow you to monitor the memory and threads used, identify deadlocks, etc.
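Beyond the built-in metrics, you can expose your own application numbers to JConsole by registering an MBean on the platform MBeanServer. A minimal sketch (the bean and attribute names are illustrative):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// The management interface; standard MBeans require the "<Name>MBean" suffix.
interface RequestStatsMBean {
    long getRequestCount();
}

// The implementation whose attribute JConsole can read live.
class RequestStats implements RequestStatsMBean {
    private volatile long requestCount;
    public long getRequestCount() { return requestCount; }
    public void increment() { requestCount++; }
}

public class JmxDemo {
    public static void main(String[] args) throws Exception {
        RequestStats stats = new RequestStats();
        // Register on the same server that jmxremote exposes to JConsole.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(stats, new ObjectName("myapp:type=RequestStats"));

        stats.increment();
        // The attribute is now visible locally and in JConsole's MBeans tab.
        Object count = server.getAttribute(
                new ObjectName("myapp:type=RequestStats"), "RequestCount");
        System.out.println("RequestCount = " + count); // prints 1
    }
}
```

Once registered, the bean appears under "myapp" in JConsole's MBeans tab alongside the JVM's own memory and threading beans.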
See the performance scaling for an Enterprise Information Integration framework that I recently wrote. This is a heavy ETL/EII tool that aggregates data from multiple databases, and you would definitely need some performance numbers for its production run. The JConsole tool helps you prove the robustness of your framework by showing the memory and thread occupancy and how controlled they are amidst heavy load on the framework.
Developing RIA with Flex is fun.
What and why is RIA?
Like the fashion industry, technology also swings back and forth. Initially, when client-server architecture was introduced, the industry moved towards desktop-based rich applications that talk to the server. Later, the architecture matured and the industry wanted many clients to access the application, so it (somewhat ridiculously) relabelled the desktop application as the "thick client" and moved towards the thin client approach of HTML and other dynamic web technologies. Now the industry has started missing the richness of the traditional desktop-based thick client, and it has also got bored with the HTML request/response model. This gave birth to RIA: Flash and Silverlight.
Their aim is to develop a thin-yet-thick client that can run in the browser and still hide all the boring request/response stuff that web clients were experiencing.
Flex was the forerunner in this market.
Flex development nightmares?
Developing a sample Flex program with Flex Builder is an easy-going task. But the real nightmare kicks in when the application grows big and you want to build a real production system out of it. The front-end MXML becomes really messy: you end up with a single monolithic MXML file where you keep writing ActionScript and MXML. You can separate the ActionScript into its own file and split the MXML into multiple files, but you will hit a point where you cry for a framework.
Cairngorm would be your rescue.
What is cairngorm?
Cairngorm tries to bring the traditional software engineering best practices into Flex application development.
It is a way to fit your
1. Business Delegate
2. Service Locator
3. Model Locator
4. Front Controller
kinds of patterns into your Flex development. The time to adopt the Cairngorm framework is a bit more, but once it is done and you start fitting in the pieces, it really pays off.
The following picture gives the architecture of a Cairngorm-based Flex application.
For more details on Cairngorm, follow some of my favourite links below.
http://sujitreddyg.wordpress.com/category/flex-and-blazeds/ A good fellow developer on the same lines.
By default, Fedora Core 6 systems come with old Java software installed, so you need to install a newer version of the Java Runtime Environment to enjoy all the Java applications out there. In this quick guide, I will show you how to update/install your Java environment.
Let’s begin by downloading the latest version of JRE (Java Runtime Environment) from here. Just click on the Download link where it says Java Runtime Environment (JRE) 5.0 Update, then you’ll need to accept the license and download the Linux self-extracting file.
WARNING: Please remember to always replace the xx from the jre-1_5_0_xx-linux-i586.bin file with the latest version. At the moment of this guide’s writing, the latest version was 09, so the file should look like this: jre-1_5_0_09-linux-i586.bin
After you have finished downloading the file, you need to move it into the /opt folder. Open a console and type:
mv jre-1_5_0_xx-linux-i586.bin /opt
Now, you will need to make this file executable so you can extract it. Follow the commands below:
cd /opt – so you can go into the /opt directory
chmod +x jre-1_5_0_xx-linux-i586.bin
And now, let’s run the executable file with the following command:
You’ll be prompted with the License Agreement, hit space until you are asked if you agree or not. Type Yes and the extraction process will begin. After the extraction process is finished, just remove the binary file with the following command:
rm -rf jre-1_5_0_xx-linux-i586.bin
Now, let’s put the Java plugin into your browser’s plugin folder. Konqueror, Firefox and Mozilla browsers will all look into the same folder, for plugins. So type the following command:
ln -s /opt/jre1.5.0_xx/plugin/i386/ns7/libjavaplugin_oji.so /usr/lib/mozilla/plugins/libjavaplugin_oji.so
Well, now you need to make the Java executable available for the whole system, so you can run all the Java applications you encounter. Create the file /etc/profile.d/java.sh (the standard Fedora location for system-wide environment settings) with your preferred text editor:
Now paste the following options into the file, remember to enter a carriage return after these lines, then save it. Remember to replace the xx with the latest version you have downloaded:
export JAVA_HOME=/opt/jre1.5.0_xx
export PATH=$JAVA_HOME/bin:$PATH
Now, type the following command to make that file available:
source /etc/profile.d/java.sh
Then type this command to see if the path is correct:
which java
You will see something like this: /opt/jre1.5.0_09/bin/java
Then type these commands:
/usr/sbin/alternatives --install /usr/bin/java java /opt/jre1.5.0_xx/bin/java 2
/usr/sbin/alternatives --config java
After you have entered the last command, you'll be asked to choose which Java software you want for your system. Just press the 2 key and hit Enter.
And finally, just type this command to see if everything looks good and your system has a new Java Environment:
/usr/sbin/alternatives --display java
And you can also type this command to see the version of your Java Runtime Environment:
java -version
Mine looks like this:
java version "1.5.0_09"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_09-b01)
Java HotSpot(TM) Client VM (build 1.5.0_09-b01, mixed mode, sharing)
You should now be able to run most of the Java applications out there, with commands like:
java -jar application.jar