
Data Science Education: Stanford University’s Program


The Stanford Data Science education program is small, with competitive admission requirements. It is administered by the university’s Statistics Department. Guenther Walther, Chairman of the Statistics Department, was asked what students can do to get into the program. He responded:

“The admissions process is competitive, indeed. We had close to 400 applicants for the eight spots in our MS program, with a number of extremely well qualified applicants that could easily have made the cut at a PhD program at a first rate university. The key to a successful application is strong skills in math, statistics, and computing.”

The reason for such heavy math requirements is explained by Irwan Bello, a former student in the program. He wrote:

“The program is small and competitive and prepares its students to join the industry as Data Scientists, or continue on to their PhD in Statistics, Computational Mathematical Engineering, or Computer Science. The program has an academic approach to Data Science and is relatively focused on theory. There is much more mathematical/stochastic/statistical theory than needed for most Data Scientist positions and it’s possible to graduate without knowing about SQL or common Business Intelligence metrics if one doesn’t have industry experience. However, these are easy to learn by oneself, or during internships.”


The Data Science education program focuses on developing strong statistical, programming, and computational skills. A basic Data Science education is provided by electives from within the program and from related areas. The Master’s Degree requires a total of 45 units, 36 of which must be taken for a letter grade. Students must maintain a GPA of 3.0 or better, and classes must be at the 200 level or higher. Students in the Data Science program are not required to satisfy the other course requirements for the MS in Statistics, but must meet the specific requirements of the Data Science program. The program has no thesis requirement.

Flagship Projects

  • DeepDive

DeepDive is a trained Data Management system designed to locate and extract useful information from what is called dark data: unstructured data hidden in images, figures, text, and tables. DeepDive translates this unstructured data into structured data (SQL tables) and then integrates it with a structured database.
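
To make the idea concrete, here is a minimal sketch in Python of turning unstructured sentences into rows of a SQL table. It is purely illustrative: the sample sentences, the regular expression, the extract_mentions helper, and the affiliation table are assumptions for this example and do not reflect DeepDive’s actual interface.

```python
import re
import sqlite3

# Toy "dark data": unstructured sentences that hide a relation we want as rows.
documents = [
    "Christopher Re is a professor at Stanford.",
    "Guenther Walther is a professor at Stanford.",
]

# Illustrative pattern: "<person> is a professor at <org>."
PATTERN = re.compile(r"^(?P<person>[A-Z][\w. ]+?) is a professor at (?P<org>[A-Z]\w+)\.$")

def extract_mentions(text):
    """Return (person, org) pairs found in one document (hypothetical helper)."""
    match = PATTERN.match(text)
    return [match.group("person", "org")] if match else []

# The structured side: a plain SQL table that the extractions are loaded into.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE affiliation (person TEXT, org TEXT)")
for doc in documents:
    conn.executemany("INSERT INTO affiliation VALUES (?, ?)", extract_mentions(doc))

print(conn.execute("SELECT person, org FROM affiliation").fetchall())
```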

DeepDive systems can be used by people lacking expertise in Machine Learning. DeepDive allows users to perform extractions, integrate unstructured data, and make predictions within a single system. Users can build end-to-end pipelines while focusing only on the sections that matter to them. Earlier pipeline-based systems required their developers to build their own extractors, integration code, and other components, with no understanding of how those changes altered the quality of the data. This aspect of DeepDive is described as the key to providing higher-quality data more quickly than its predecessors.

DeepDive is led by Christopher Ré, an assistant professor in Stanford’s Computer Science Department. Ré received the SIGMOD 2010 Jim Gray Dissertation Award for his accomplishments in probabilistic Data Management. After graduating from the University of Washington with a PhD, he went to work at the University of Wisconsin and then, in 2013, moved on to Stanford. His work on a join algorithm focused on worst-case running time won “best paper” at PODS 2012, and he also helped develop a feature engineering framework that won “best paper” at SIGMOD 2014.

  • Secure Analytics on the Internet of Things

The goal of this long-term project is to develop a secure system of end-to-end analytics for processing Big Data from the Internet of Things (IoT). To accomplish this, new computational models of cryptography are being explored and developed, drawing on a broad range of disciplines such as analytics, cryptography, networking, and security. It is hoped the research will produce new IoT applications capable of analyzing unstructured Big Data. Data would be encrypted at the source, with novel algorithms accepting the encrypted data as input and passing encrypted results along as output. The end user, with the proper key, can then decrypt the data, making the actual results of the computations available.
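
As a rough illustration of computing on encrypted data, the following sketch uses a toy Paillier-style additively homomorphic scheme: readings are encrypted at the source, an untrusted party combines the ciphertexts, and only the key holder can decrypt the total. The tiny primes make it insecure, and it is not one of the project’s actual algorithms.

```python
import random
from math import gcd

# Toy Paillier-style additively homomorphic encryption.
# The tiny primes are for illustration only; this is NOT secure.
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)      # lcm(p-1, q-1)

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)               # modular inverse (Python 3.8+)

def encrypt(m):
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Sensor readings are encrypted at the source...
readings = [12, 7, 30]
ciphertexts = [encrypt(m) for m in readings]

# ...an untrusted aggregator multiplies ciphertexts, which adds the plaintexts...
encrypted_sum = 1
for c in ciphertexts:
    encrypted_sum = (encrypted_sum * c) % n2

# ...and only the holder of the private key sees the result.
print(decrypt(encrypted_sum))                     # prints 49
```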

A Stanford dormitory was modified for the project. A network of smart water taps, shower heads, and other water fixtures was installed in the hope of promoting water conservation. These devices use Bluetooth Low Energy wireless technology to report how much water was used, at what temperature, and when. This information is then transmitted to students’ phones. The research focuses on four topics:

  • The data collection network and its equipment
  • The engine for analyzing streams of data
  • Developing algorithms designed to learn about water usage
  • Exploring intervention options leading to water conservation
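
A minimal sketch of the stream-analysis step is shown below, assuming each reading arrives as a (fixture, liters, temperature, timestamp) record; the field names, sample values, and the per-fixture daily totals are illustrative and not the project’s actual engine.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical reading format: (fixture_id, liters, temp_c, iso_timestamp).
stream = [
    ("shower_2a", 41.5, 38.0, "2015-10-01T07:42:00"),
    ("tap_2a", 1.2, 21.0, "2015-10-01T07:50:00"),
    ("shower_2a", 55.0, 39.5, "2015-10-02T07:40:00"),
]

def daily_usage(readings):
    """Total liters per (fixture, day), a toy stand-in for the stream-analysis engine."""
    totals = defaultdict(float)
    for fixture, liters, _temp_c, ts in readings:
        day = datetime.fromisoformat(ts).date()
        totals[(fixture, day)] += liters
    return dict(totals)

for (fixture, day), liters in sorted(daily_usage(stream).items()):
    print(f"{fixture} {day}: {liters:.1f} L")
```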

  • Mapping the “Social Genome”

This project is designed to develop tools for predicting how language and networks work together. The long-term goal is to find ways engineers, scientists, and community leaders can work together more efficiently using the internet. To this end, models of network structure and language are being studied (consider the difference between a standard email and a hard-copy letter). The original plan focused on three levels of social interaction: individuals, groups, and societies. Natural language processing and social network analysis are used to provide a model of:

  • What has been said within the community
  • Who said it
  • How the information was transmitted
  • How the transmission affected the network structure
  • How the evolving structure was affected by linguistic expressions

There are plans to create statistical models using online social networks (Facebook, Reddit, Twitter) and the hyperlink networks of various news outlets, political groups, labs, and corporations.
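
The following sketch gives the flavor of combining network structure with language: a reshare graph answers how far a post travels, and a crude word-count feature stands in for the natural language side. The posts, edges, and the reach helper are invented for illustration and are not the project’s models.

```python
from collections import Counter, deque

# Toy record of who said what and who reshared whom (invented for illustration).
posts = {
    "alice": "save water in the dorm",
    "bob": "dorm water data looks interesting",
    "carol": "save water save money",
}
reshares = [("alice", "bob"), ("bob", "carol")]   # (source, resharer) edges

def reach(user, edges):
    """How many accounts a post reaches by following reshare edges (breadth-first search)."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
    seen, queue = {user}, deque([user])
    while queue:
        for nxt in adj.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) - 1

# A crude language feature: which words appear in the farthest-reaching posts.
word_reach = Counter()
for user, text in posts.items():
    for word in set(text.split()):
        word_reach[word] += reach(user, reshares)

print(reach("alice", reshares))        # 2: alice -> bob -> carol
print(word_reach.most_common(3))
```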

  • Data Science for Personalized Medicine

The use of Big Data has allowed massive amounts of health information to be collected at unprecedented levels. Information about genomes, transcriptomes, and other microbiological studies is combined with data from sensors, wearable devices, and medical records to provide an accurate assessment of an individual’s health. This digital assessment of a person’s body makes predicting their future health much more realistic and advances the ideal of personalized medicine. A major challenge is collecting, securely storing, and analyzing what is considered private information.

The goal of this project is to design a system capable of dealing with this challenge. To accomplish this, the researchers will:

  • Develop new algorithms for sampling, for the automatic replacement of missing data (sketched below), and for joint processing
  • Build a unique framework to store and manage complex data in a secure fashion
  • Develop new ways to analyze complex, unstructured data
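
As a minimal sketch of the missing-data item above, the following fills gaps in toy health records with per-field means; mean imputation is a deliberately simple stand-in for the project’s algorithms, and the record format is an assumption for this example.

```python
import math

# Toy health records; None marks a missing measurement (format is an assumption).
records = [
    {"heart_rate": 61.0, "steps": 8400.0},
    {"heart_rate": None, "steps": 10200.0},
    {"heart_rate": 74.0, "steps": None},
]

def mean_impute(rows):
    """Fill each missing field with the mean of that field's observed values."""
    fields = {f for row in rows for f in row}
    means = {}
    for f in fields:
        observed = [row[f] for row in rows if row.get(f) is not None]
        means[f] = sum(observed) / len(observed) if observed else math.nan
    return [
        {f: (row[f] if row.get(f) is not None else means[f]) for f in fields}
        for row in rows
    ]

for row in mean_impute(records):
    print(row)
```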

Hands-on Data Science Classes

Though Irwan Bello suggested it is possible to graduate without knowing SQL or common Business Intelligence metrics, Stanford’s Data Science program does offer a Machine Learning class and a business class designed to imitate the experience of investment management.

The Machine Learning class teaches the most current and effective Machine Learning (ML) techniques. Students are taught the basics of Machine Learning and provided with the hands-on knowledge needed to quickly and efficiently apply these skills to new problems. This course offers a broad introduction to ML, data mining, and statistical pattern recognition. Additionally, students learn about Silicon Valley’s best innovation practices for Machine Learning. The following concepts are emphasized.

  • Supervised learning (neural networks, kernels, support vector machines)
  • Unsupervised learning (clustering, dimensionality reduction, and Deep Learning)
  • The course also teaches how to apply learning algorithms in areas such as smart robots, text understanding, medical informatics, computer vision, audio processing, and database mining
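
For a sense of the supervised-learning material listed above, here is a minimal sketch that fits a support vector machine to toy data; it assumes scikit-learn and NumPy are available, and the dataset and model settings are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy two-feature dataset: label is 1 when the features sum above 1.0.
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = (X.sum(axis=1) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A support vector machine with an RBF kernel, one of the supervised methods listed above.
model = SVC(kernel="rbf").fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```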

The Stanford Data Science education program also offers hands-on experience in the Real-time Analysis and Investment Lab (RAIL). This facility attempts to imitate the experience of investment management. It is equipped with 24 workstations providing a comprehensive set of applications used by money managers, investment banks, and hedge funds. Students use these real-world applications to perform hands-on exercises and work through assigned case studies. This experience helps them learn about accounting, finance, and investment management.

Stanford Artificial Intelligence Laboratory (SAIL)

On Sept. 4, 2015, Stanford announced the creation of the SAIL-Toyota Center for AI Research, a unique research center funded by Toyota ($25 million) to support the development of AI technologies. The joint venture includes Toyota, Stanford, and MIT (a parallel research center is being built at MIT). These organizations share the long-term objectives of reducing traffic accidents and casualties and of assisting drivers in new and varied ways. The center’s theme is “Human-Centered Artificial Intelligence for Future Intelligent Vehicles and Beyond.” The Stanford-Toyota Research Center brings together researchers from a variety of fields, with a focus on developing innovative solutions and algorithms.

Currently, Fei-Fei Li is the Director of SAIL (and of Stanford’s Vision Lab). She has worked with colleagues and students to build smart algorithms that allow computers and robots to see and think. Fei-Fei Li joined Stanford as an assistant professor in 2009 and was promoted to associate professor, with tenure, in 2012.

Ms. Li believes computer vision is the key to developing Artificial Intelligence and that Stanford’s Data Science education program is aiding in the development. “Understanding vision and building visual systems is really understanding intelligence,” Ms. Li stated. “And by see, I mean to understand, not just to record pixels.”

 
