Data analysis remains a vital part of the data management industry. Being able to see the data helps the human mind understand its underlying meaning. Data visualization tools provide this functionality in many forms, from the graphing tool inside Excel to the embedded visualizer in the graph database InfiniteGraph.
Three open source data analysis tools with visualization capabilities are worth noting; each provides a unique take on this fascinating sector within the data management profession. Considering the price is right – free – interested professionals will find the time spent downloading and playing with these tools to be worthwhile.
The R Project for Statistical Computing
R provides a full-featured software environment for statistical computing and data manipulation, enhanced by graphics and visualization capabilities. The software is distributed as source code under the GNU General Public License and runs on UNIX, Mac, and Windows operating systems. R draws on the S statistical programming language and is considered a modern implementation of that venerable FORTRAN alternative.
Much code written for S runs unaltered under R. Given S’s continued popularity in statistics research, R provides an open source alternative for those interested in exploring this area. The software easily produces publication-quality graphics, including mathematical symbols and formulas.
The R environment running on Mac OS X
R’s features include integrated data handling and storage, operators for array and matrix calculations, a large set of data analysis tools, and the display and print graphics functionality mentioned above. It also includes a simple programming language with conditionals, loops, and recursive functions, along with input and output statements. A data editor and data object browser add to the overall convenience.
Users needing more computing horsepower can call functions written in C, C++, and FORTRAN that have been compiled and linked into R. R data objects can also be manipulated directly from C code.
R supports a package framework encapsulating graphics and formula functionality developed by others in the CRAN (Comprehensive R Archive Network) community. This allows new users to come up to speed rather quickly, in addition to easily extending the R environment. Interested users can spend a full afternoon browsing the list of available packages.
Data professionals with a statistical focus and a measure of programming ability should check out the R environment. It is a valuable tool that shows off the creative worth of a vibrant open source development community. Descended from S, and to some extent from FORTRAN, R expands on the legacy of those seminal statistical computing languages.
Weka Brings Advanced Visual Data Mining to Java
Weka is a Java-based collection of machine learning algorithms for data mining, available under the GNU General Public License. Primarily developed by the Machine Learning Group at New Zealand’s University of Waikato, Weka has the robust user community typical of open source software. Weka stands for Waikato Environment for Knowledge Analysis. It is also the name of a bird native to New Zealand.
The analytics software company, Pentaho, is also a major sponsor of the development behind Weka; they provide commercial licenses for companies wanting to use the tool in their own proprietary software. Pentaho’s Business Intelligence software leverages Weka for data mining and predictive analysis functionality.
Weka’s data mining algorithms can be called from Java code or applied directly to a dataset. Weka requires at least Java version 1.4, and more recent versions of the product require either 1.5 or 1.6. Weka is compatible with Linux, Windows, and Mac OS X. While Weka works most easily within the Java environment, enterprising Windows users can also use Weka with the .NET Framework.
In addition to its data visualization capabilities, Weka allows for data pre-processing, classification, clustering, regression, and association rule mining. It uses JDBC to connect to relational database sources; it can also read CSV files. Because it cannot do multi-table data mining, Weka normally works with a single table or the result of a relational database query.
The software’s interface features both a windowed Explorer and a command line interface, each with similar functionality. The Explorer has separate panels for the data mining utilities mentioned earlier, in addition to the data visualizer.
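To make the idea of association rules more concrete, here is a minimal Python sketch of the two measures usually attached to a rule: support and confidence. This is not Weka code and does not use Weka's API; the transaction data and the function names `support` and `confidence` are invented purely for illustration.

```python
from typing import FrozenSet, List

# Hypothetical market-basket data, invented for illustration only.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"milk"}),
]


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
    transactions: List[FrozenSet[str]],
) -> float:
    """Support of (antecedent and consequent together) divided by support of antecedent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # The rule never applies, so report zero confidence.
    return support(antecedent | consequent, transactions) / base
```

For the rule "bread implies milk" in this toy data, bread appears in three of four baskets and bread with milk in two of four, so the confidence works out to two thirds. A data mining tool searches for rules whose support and confidence exceed chosen thresholds, at a scale where a hand calculation is no longer practical.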
Weka’s Explorer Interface in Action
One of the better ways to learn the Weka platform is the book Data Mining: Practical Machine Learning Tools and Techniques, published by Morgan Kaufmann. It provides a general overview of data mining and includes sections on applying that knowledge with Weka. The University of Waikato also hosts a wiki specific to the data mining framework.
Data professionals with an interest in furthering their data mining or analytics knowledge should make the effort to download, install, and explore the world of Weka. Its active support community is available to answer any questions or provide insight.
Gephi – an Interactive Open Graph Platform
Gephi fosters detailed analysis of graph networks with its high-end interactive functionality. Freely available under the GNU GPL, Gephi is compatible with Windows, Linux, and Mac OS X. The tool is often described as “Photoshop for graphs.”
Plotting data points with Gephi
Developed in Java using the NetBeans framework, Gephi enhancements are managed by the Gephi Consortium, a group formed as the French equivalent of a non-profit company. Its members include SciencesPo Medialab as well as Neo Technology, the folks behind the Neo4j graph database. Google also sponsors Gephi, having included it in its Google Summer of Code student development initiative.
Gephi’s ultimate goal is to facilitate the job of the data analyst by providing an intuitive and colorful way to visualize the patterns lurking within graphed data. It is useful for the practice of Exploratory Data Analysis and serves nicely as a complement to traditional statistical analytics. Gephi imports data in the CSV and GEXF formats and can also connect to data sources for real-time data analysis.
Built on an OpenGL rendering engine, Gephi includes fast graph visualization capabilities, allowing for easier pattern discovery within large data sets. It handles graphs containing up to 50,000 nodes and one million edges. Gephi also features tools for dynamic filtering and real-time graph manipulation.
Gephi’s state-of-the-art layout algorithms, including force-based and multi-level approaches, support the efficiency and quality of graph layout. The user is able to change the palette and other settings while a layout is running, enhancing the overall analytical efficacy of their work.
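As a rough illustration of the CSV route, the Python sketch below turns a small edge list into CSV text. It assumes the common convention of an edge table with `Source`, `Target`, and `Weight` header columns, which Gephi's spreadsheet import is generally understood to recognize; the node names, weights, and the helper name `edges_to_csv` are made up for this example.

```python
import csv
import io

# Hypothetical edges as (source, target, weight), invented for illustration.
EDGES = [
    ("alice", "bob", 1.0),
    ("bob", "carol", 2.0),
    ("carol", "alice", 1.0),
]


def edges_to_csv(edges):
    """Return the edge list as CSV text with a Source,Target,Weight header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])  # assumed column names
    writer.writerows(edges)
    return buffer.getvalue()
```

Saving the returned text to a file such as `edges.csv` gives a starting point for an import. The exact import options vary between Gephi versions, so the column names are worth checking against the product documentation.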
Analyzing social graphs is a core function of Gephi’s statistics and metrics framework. The tool offers community detection in conjunction with indicators for closeness, betweenness, clustering coefficient, PageRank, and more. Gephi also provides innovative dynamic network analysis utilities.
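For readers new to these indicators, the following Python sketch computes two of them, the local clustering coefficient and closeness centrality, on a tiny made-up friendship graph. It is a plain textbook-style illustration, not Gephi's implementation, and Gephi may normalize its values differently; the graph and function names are assumptions for the example.

```python
from collections import deque

# Hypothetical undirected friendship graph as an adjacency map.
GRAPH = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}


def clustering_coefficient(graph, node):
    """Share of a node's neighbor pairs that are themselves connected."""
    neighbors = list(graph[node])
    k = len(neighbors)
    if k < 2:
        return 0.0  # Fewer than two neighbors means no pairs to check.
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in graph[neighbors[i]]
    )
    return 2.0 * links / (k * (k - 1))


def closeness(graph, start):
    """(reachable nodes - 1) / sum of shortest-path lengths, found by BFS.

    Assumes an unweighted graph; only nodes reachable from `start` count.
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in graph[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0
```

In this toy graph, node "a" sits one step from everyone and gets the highest closeness, while node "b" has the highest clustering coefficient because its two friends also know each other. A tool like Gephi computes such values for every node and maps them to color or size.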
Full graph export capability to the SVG and PDF formats is provided, in addition to the ability to edit text labels and colors for final output. Export presets can also be saved for future use. Gephi also supports its own plug-in framework to both extend functionality and share tools within the community.
Those interested in learning more about Gephi should check out the full collection of tutorials in addition to the product’s wiki page. Anyone currently working in the graph database space should consider downloading Gephi and having a look. It is a worthy tool for the data analyst.
As data visualization gains recognition as a necessary discipline for the data professional, these three free open source tools provide a pathway to learning more about it. Each product also has its own unique capabilities of interest to those working with data.