JupyterCon and Data Science Analysis 2017

By Steve Miller

The inaugural JupyterCon is now in the books. A larger-than-expected turnout of 700 data scientists, business analysts, researchers, educators, developers, core project contributors, and tool creators descended on NYC August 22-25 for in-depth training, insightful keynotes, networking events, and practical talks exploring the Project Jupyter platform.

Jupyter Notebooks is an “open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.” I’ve had the good fortune to use Jupyter Notebooks for almost four years now, having been smitten by a Strata presentation from originator Fernando Pérez about a tool called IPython that later morphed into Project Jupyter. As was evident from the conference, Jupyter Notebooks is increasingly becoming the lingua franca of data science analysis and computation.

The appeal of Jupyter revolves around its ability to provide a comprehensive, browser-based, interactive development environment that organizes all data analysis components, including access, wrangling, exploration, visualization, analysis, and modeling, alongside text and equation markdown. A notebook with such soup-to-nuts computation capabilities can readily be shared among analysts and scientists and provide the foundation for demonstrating reproducibility. I routinely “exchange” notebooks with peers to showcase programming and analysis techniques. And my technical blogs are often just the HTML output from “executed” notebooks.
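
As a rough illustration of that soup-to-nuts flow, a single notebook cell can cover data access, wrangling, and visualization in a handful of lines. The sketch below is generic, not from any conference talk, and the file and column names are hypothetical placeholders.

```python
# Minimal sketch of one notebook cell spanning access, wrangling, and
# visualization; "sales.csv" and its columns are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv", parse_dates=["date"])         # data access
monthly = (df.assign(month=df["date"].dt.to_period("M"))    # wrangling
             .groupby("month")["revenue"].sum())
ax = monthly.plot(kind="bar", figsize=(8, 4))               # visualization
ax.set_ylabel("revenue")
plt.tight_layout()
plt.show()
```

Rendering the executed notebook to a standalone page, typically with something like jupyter nbconvert --to html notebook.ipynb, is essentially how those blog posts are produced.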

UC Berkeley’s Pérez kicked off the keynotes by recounting his early involvement with interactive Python computing as a physics grad student, work that led to the development of IPython in 2001 and ultimately to Project Jupyter. The end goal was “Human-centered, interactive, computing and science”.

What has been engineered is a set of low-level standards: 1) a messaging protocol and a notebook document format; 2) reusable libraries that implement them; 3) user-facing applications such as IPython, Jupyter Notebook/Lab, and JupyterHub; and 4) services that make them accessible, such as nbviewer and try.jupyter. All in support of an open ecosystem.
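
The notebook document format in that stack is ordinary JSON, and the nbformat reference library reads and writes it directly. A small sketch, with placeholder cell contents:

```python
# Sketch of the notebook-format layer via the nbformat reference library;
# the resulting .ipynb is plain JSON readable by Notebook, Lab, nbviewer, etc.
import nbformat
from nbformat.v4 import new_notebook, new_markdown_cell, new_code_cell

nb = new_notebook()
nb.cells = [
    new_markdown_cell("## A programmatically built notebook"),
    new_code_cell("print('hello from a generated cell')"),
]
nbformat.write(nb, "demo.ipynb")
```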

Jupyter’s ascendance has been exponential, growing from humble beginnings to an estimated 6-8 million worldwide users today. A Fall 2017 data science class at Berkeley driven by Jupyter has enrolled 1,200 students from over 60 disciplines. The next-generation notebook interface, JupyterLab, is about to release an alpha version, and JupyterHub, which “can be used to serve notebooks to a class of students, a corporate data science group, or a scientific research group”, is available now.
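
To give a flavor of that deployment model, a JupyterHub configuration for a small class might look roughly like the sketch below, assuming a single server with local system accounts; the user names and paths are hypothetical.

```python
# jupyterhub_config.py: a minimal sketch for a small class on one server,
# assuming local system accounts; user names and paths are hypothetical.
c = get_config()  # provided by JupyterHub when it loads this file

c.JupyterHub.ip = "0.0.0.0"                   # listen on all interfaces
c.JupyterHub.port = 8000                      # public port students connect to
c.Authenticator.admin_users = {"instructor"}  # hypothetical admin account
c.Spawner.notebook_dir = "~/notebooks"        # each user's working directory
```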

Harvard’s Demba Ba affirmed the rise of Jupyter in upper-echelon academia, delineating the successes of both undergraduate and graduate data science courses at Harvard driven by Jupyter in the AWS Cloud. Ba’s response to a student’s question, “So you would have loved to take your own course?”: “Yes, absolutely!”

One of the common themes at the conference was interoperability among notebook languages. Laurent Gautier’s Wednesday tutorial, Pragmatic Polyglot Data Analysis, combined Python, SQL, and R within individual Python notebooks. Pandas originator Wes McKinney took that thinking one step further in his talk, Data Science Without Borders. Over the next 10 years, he envisions a shared language front-end for data science, making specific language silos much smaller, along with a shared data science computation runtime. McKinney’s work with RStudio’s Hadley Wickham on the Apache Arrow project is a significant start down this path.
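
To sketch that polyglot flavor, the two notebook cells below mix Python, SQL, and R, assuming the ipython-sql and rpy2 extensions are installed; this is not Gautier’s actual tutorial code, and the database, table, and column names are hypothetical.

```python
# Cell 1: Python plus SQL via the ipython-sql extension (hypothetical table).
%load_ext sql
%load_ext rpy2.ipython
%sql sqlite:///demo.db
result = %sql SELECT name, value FROM measurements
df = result.DataFrame()            # back into pandas for the next cell
```

```python
%%R -i df
# Cell 2: the same pandas DataFrame handed to R through rpy2's cell magic.
summary(lm(value ~ factor(name), data = df))
```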

Lorena Barba’s Design for Reproducibility articulated another important conference theme. For Barba, an academic researcher, reproducible means that “authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results.” Indeed, “the core problem we are trying to solve is the collaborative creation of reproducible computational narratives.” Along those lines, many of O’Reilly’s new data science books are written in/generated by Jupyter Notebooks.
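
In that spirit, one common way to re-create a shared analysis end to end is to re-execute the notebook programmatically. A minimal sketch using nbformat and nbconvert follows; the file name is hypothetical.

```python
# Re-run a shared notebook top to bottom and save the executed copy;
# "analysis.ipynb" is a hypothetical file name.
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

nb = nbformat.read("analysis.ipynb", as_version=4)
ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
ep.preprocess(nb, {"metadata": {"path": "."}})    # execute every cell in order
nbformat.write(nb, "analysis.executed.ipynb")     # results re-created
```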

Mathematician Rachel Thomas enthralled the audience with her talk, How Jupyter Notebook Helped Us Teach Deep Learning to 100,000 Students. The point of departure for her deep learning classes, co-sponsored by fast.ai and the University of San Francisco, is that students needn’t be math-heavy to progress in DL: “if you can code, you can do deep learning.” The many intriguing student applications of deep learning she presented mostly emphasize practical understanding and code over theory. Noted one top student: “I personally fell into the habit of watching the lectures too much and googling definitions/concepts/etc too much, without running the code. At first I thought that I should read the code quickly and then spend time researching the theory behind it… In retrospect, I should have spent the majority of my time on the actual code in the notebooks instead, in terms of running it and seeing what goes into it and what comes out of it.”

Finally, I was excited to see the demos from JupyterLab: The next-generation Jupyter frontend, by Brian Granger, Chris Colbert, and Ian Rose. Model-view separation and Google Docs-style real-time collaboration were particularly noteworthy. “JupyterLab is the next generation user interface for Project Jupyter. It offers all the familiar building blocks of the classic Jupyter Notebook (notebook, terminal, text editor, file browser, rich outputs, etc.) in a flexible and powerful user interface that can be extended through third party extensions that access our public APIs. Eventually, JupyterLab will replace the classic Jupyter Notebook.”

It’s hard not to be excited for the futures of both Jupyter Notebooks and JupyterCon. I look forward to working with JupyterLab and attending the conference in NYC next summer.

 
