Click to learn more about author Mark Hensley.
As an adjunct instructor at University of Redlands, I teach several courses in the fields of business and technology. In that capacity I get to meet a lot of bright young people with a keen interest in data and Analytics. Unfortunately, what little money most students have gets eaten up by tuition, living expenses and pizza (not necessarily in that order).
The result is that students looking to experiment with data and Analytics don’t have a lot of money left over to buy expensive proprietary software. In a bid to help those students, and others with limited financial resources, get started with analytics, I’d like to highlight a series of Open Source Tools that offer the same (or better) functionality as proprietary alternatives.
In particular, I’ll cover three areas: databases; extract, transform and load (ETL); and Machine Learning. You’ll notice that I’ve avoided discussing visualizations. That’s because while visualizations (graphs, charts, etc.) are often what people associate with Analytics, they’re typically the last piece of the puzzle. Data Visualization is also a complex topic and one that’s better dealt with on its own in a separate piece.
At the risk of stating the obvious, Analytics are not possible without data. As a result, practitioners are faced with two options: relying on IT to provide the data or becoming self-sufficient. As someone who’s been around the block, I strongly encourage you to choose the latter – IT has other priorities and you may find yourself waiting for a long time. You need to be able to extract data from relational and nonrelational databases on your own, so learning SQL is a must!
- Apache Hadoop: Named after a toy elephant belonging to the son of one of its creators, Hadoop is a framework for the distributed processing of large data sets across clusters of computers – often running on commodity hardware. The success of Hadoop has spawned a number of vendors like Cloudera, Hortonworks and MapR that offer enterprise distributions of the technology.
- MariaDB: Michael “Monty” Widenius, the main author of the original MySQL database, and others have started a new project in the form of MariaDB. Used by the likes of Google, Wikipedia and WordPress, MariaDB offers a plethora of storage engines, plugins and other tools.
- MongoDB: A NoSQL system, MongoDB offers a series of features including ad hoc queries, indexing, replication, load balancing, file storage and aggregation. The key value propositions of the database are the ease with which users can access and analyze stored data and its high degree of availability and scalability.
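To make the "learning SQL is a must" point concrete, here is a minimal sketch of pulling an aggregate out of a relational database yourself. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for a server such as MariaDB; the `orders` table, its columns and the sample rows are all made up for illustration.

```python
import sqlite3

# In-memory database as a stand-in for a real server; the table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 14.5), ("carol", 50.0)],
)

# Total spend per customer, largest first.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()
for customer, total in totals:
    print(f"{customer}: {total:.2f}")

conn.close()
```

The same SELECT ... GROUP BY pattern should carry over to most relational databases, although connection details and some SQL dialect features differ from one product to the next.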
Data is rarely in the format necessary to conduct analysis. Having access to (and knowing how to use) an ETL tool can greatly decrease the amount of time you spend exporting, cleaning and normalizing data. ETL further allows you to automate manual data-related activities such as deduping. While not as sexy as some other topics, ETL is critical in order to conduct Analytics at scale.
- Talend: Talend’s key value proposition is the hundreds of connectors and components that allow users to import data from Hadoop, Spark, Excel, CSV files and more. I highlighted the consequences of having data get siloed in my previous piece. Solutions like Talend help to ensure that data freely circulates between your data warehouse and enterprise systems.
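As a rough illustration of what an ETL job does, the sketch below extracts rows from a CSV export, normalizes and dedupes them, and loads the result into a database table. It is not how Talend works internally; it only uses Python's standard library, and the sample data, the `contacts` table and the function names are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: inconsistent casing, stray whitespace, one duplicate.
RAW_CSV = """name,email
Ada Lovelace, ADA@Example.com
ada lovelace,ada@example.com
Grace Hopper,grace@example.com
"""

def extract(text):
    """Read CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Normalize fields and drop rows whose email has already been seen."""
    seen = set()
    clean = []
    for row in rows:
        email = row["email"].strip().lower()
        if email in seen:
            continue  # duplicate contact, skip it
        seen.add(email)
        clean.append({"name": row["name"].strip().title(), "email": email})
    return clean

def load(rows, conn):
    """Write the cleaned rows into a target table."""
    conn.execute("CREATE TABLE contacts (name TEXT, email TEXT UNIQUE)")
    conn.executemany("INSERT INTO contacts VALUES (:name, :email)", rows)

conn = sqlite3.connect(":memory:")
cleaned = transform(extract(RAW_CSV))
load(cleaned, conn)
loaded = conn.execute("SELECT name, email FROM contacts ORDER BY name").fetchall()
print(loaded)
```

Keeping the three stages as separate functions makes it easier to swap the source or target later, which is essentially the job a dedicated ETL tool's connectors do for you.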
As an analyst, you can’t possibly create models for every single scenario. The automation of Data Modeling is therefore a great place to deploy Machine Learning. Other uses of Machine Learning extend to data categorization and include identifying and recommending useful offerings to customers.
- Java: Java is usually thought of as a general purpose programming language but has particular relevance for Machine Learning. That’s because there is a wide variety of Machine Learning libraries that you can use in Java. While I can’t name them all here, a few that spring to mind are RapidMiner, which provides a GUI and Java API for developing applications; Spark’s MLlib, a massive library of Machine Learning algorithms; and Weka, another powerful library of algorithms with a particular emphasis on Data Mining.
- Python: Like Java, Python is a general purpose programming language whose relevance for Machine Learning lies in its libraries. Notable Python Machine Learning libraries include the Google-incubated TensorFlow, a library for building and training neural networks; Caffe, a library for Machine Learning in vision applications; and scikit-learn, Python’s Machine Learning module for clustering, classification, regression and other tasks.
- R: Available under the GNU General Public License, R is a programming language and environment heavily used by the Data Science community to develop statistical software and perform data analysis. As someone who’s not a developer, I like R for its simplicity and effectiveness, its large collection of tools for data analysis and the ease with which it can be extended via additional packages.
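To give a feel for the data categorization use case mentioned above, here is a minimal k-nearest-neighbors classifier written in plain Python. In practice you would more likely reach for one of the libraries listed here; this sketch avoids third-party dependencies so it runs anywhere. The customer features, segment labels and the `classify` function are hypothetical, and the features are left unscaled to keep the example short.

```python
import math
from collections import Counter

# Hypothetical training data: (monthly_visits, avg_basket_usd) -> customer segment.
TRAINING = [
    ((1, 10.0), "occasional"),
    ((2, 12.0), "occasional"),
    ((3, 9.0), "occasional"),
    ((10, 40.0), "loyal"),
    ((12, 35.0), "loyal"),
    ((11, 45.0), "loyal"),
]

def classify(point, training=TRAINING, k=3):
    """Return the majority label among the k training points closest to `point`."""
    # Sort labeled examples by Euclidean distance to the new point (math.dist needs Python 3.8+).
    by_distance = sorted(training, key=lambda item: math.dist(point, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(classify((2, 11.0)))
print(classify((9, 38.0)))
```

A new customer is assigned the segment that most of its nearest labeled neighbors belong to, which is the same idea a library classifier applies with far more care around scaling, tie-breaking and performance.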
Do you agree with my recommendations? Are there others that I missed? Let me know in the comments below!