Pragmatic Polyglot Data Analysis

Click to learn more about author Steve Miller.

Yesterday was a pretty fast-paced day for me in New York. In the morning I took a walk from my midtown hotel to the Koch Queensboro Bridge (Simon and Garfunkel fans will know it as the 59th St Bridge) where I ducked speedy bicyclists crossing over and back to Queens on a gorgeous day.

Then I took on the rapid-fire Polyglot Data Analysis using Jupyter Notebooks tutorial by data scientist Laurent Gautier.

JupyterCon is the official conference dedicated to Jupyter Notebooks, an “open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.” I’ve had the good fortune to use Jupyter Notebooks for almost four years now, having been smitten by a Strata presentation from originator Fernando Perez about a tool called IPython that morphed into Project Jupyter.

The appeal of Jupyter revolves around its ability to provide a browser-based, interactive development environment that can organize all data analysis components, including access, munging, exploration, visualization, analysis, and modeling, alongside text and equation markdown. A notebook with such soup-to-nuts computation can readily be shared among analysts/scientists and provide the foundation for demonstrating reproducibility. I routinely “exchange” notebooks with peers to showcase programming and analysis techniques. And my technical blogs are often just the html output from “executed” notebooks.

There are over 40 languages, or kernels, supported by Jupyter, enough to overwhelm even the nerdiest of data scientists. I primarily use Python, R, Scala, Julia, and WPS, a SAS clone. As much as I enjoy devising notebooks with the different languages, though, I often rue the inability to combine the best features of each. This is where Laurent Gautier, the pragmatic polyglot, comes in.

In a fast-paced three-hour presentation, Gautier presented a series of Python notebooks that climaxed in a meaty, polyglot data science analysis. His point of departure in the first notebooks was SQL in Python with sqlite3, which is fine by me, though I generally prefer PostgreSQL as the database engine.
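For readers who haven’t tried it, here is a minimal sketch of what SQL in Python with sqlite3 can look like. It is not taken from Gautier’s notebooks; the trips table, its columns, and its values are made up purely for illustration.

```python
import sqlite3

# In-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Hypothetical table and data, invented for this example.
conn.execute("CREATE TABLE trips (borough TEXT, riders INTEGER)")
conn.executemany(
    "INSERT INTO trips (borough, riders) VALUES (?, ?)",
    [("Manhattan", 120), ("Queens", 80), ("Manhattan", 60), ("Queens", 40)],
)

# Aggregate in SQL; sqlite3 hands back each row as a plain tuple.
totals = conn.execute(
    "SELECT borough, SUM(riders) FROM trips GROUP BY borough ORDER BY borough"
).fetchall()
conn.close()

print(totals)
```

Since sqlite3 ships with Python’s standard library, a cell like this should run in a Python notebook without a database server, which is presumably part of its appeal as a teaching starting point.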

Next up was a demonstration of the Python package Pandas for handling rectangular data in dataframes, akin to SQL tables. As readers of this blog know, I’m a big Pandas fan.
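To make the “akin to SQL tables” comparison concrete, here is a small sketch of a Pandas dataframe being summarized the way a SQL GROUP BY would do it. Again, this is my own illustration rather than code from the tutorial, and the trips data is hypothetical.

```python
import pandas as pd

# Hypothetical rectangular data, invented for this example.
trips = pd.DataFrame(
    {
        "borough": ["Manhattan", "Queens", "Manhattan", "Queens"],
        "riders": [120, 80, 60, 40],
    }
)

# Rough equivalent of: SELECT borough, SUM(riders) FROM trips GROUP BY borough
totals = trips.groupby("borough", as_index=False)["riders"].sum()

print(totals)
```

Passing as_index=False keeps borough as an ordinary column, so the result reads like the rows a SQL query would return.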

The fourth notebook introduced R’s version of dataframes along with its powerful dplyr manipulation package. The R code was run in a Python notebook using R “magic” capabilities provided by Gautier’s powerful rpy2 Python library. Laurent’s a big advocate of R graphics, so notebook five used magic as well to demonstrate R’s lattice and ggplot2 visualizations in the Python kernel.

Notebook six demo’d more complex rpy2 interoperability, combining the power of dplyr and ggplot2 functions against Pandas data. Heady stuff.

Notebook seven introduced big data capabilities with Spark and the PySpark Python library. At this point there’s SQL, Python, Pandas, R, dplyr, ggplot2, rpy2, Spark, and PySpark code in a single Python notebook. Truly multilingual. The final notebook used all the polyglot capabilities to assemble a hefty series of maps and graphs.

Pragmatic Polyglot Data Analysis really resonated with me. I’ve used each of the components Gautier presented and often combined them in simple ways. But now I feel I can plan more complex polyglot workflows from the outset when architecting data analysis solutions. And Jupyter Notebooks will assuredly be the development glue for those future solutions.
