Data scientists can add another tool to their toolset today: GraphLab has launched GraphLab Create 1.0, which bundles everything from tools for data cleaning and engineering to state-of-the-art machine learning and predictive analytics capabilities.
Think of it, company execs say, as the single platform that data scientists or engineers can leverage to unleash their creativity in building new data products, enabling them to write code at scale on their own laptops. The driving concept behind the solution, they say, is to make large-scale machine learning and predictive analytics easy enough that companies won’t have to hire huge teams of data scientists and engineers and build the big hardware infrastructures that lie behind many of today’s Big Data-intensive products. And the data scientists and engineers who do use it won’t need to be experts in machine-learning algorithms – just experienced enough to write Python code.
“Predictive analytics and machine learning let you make a more direct and interactive impact on company value,” says GraphLab CEO Carlos Guestrin, who began developing the technology as an open source offering some years ago as a professor at Carnegie Mellon. Recommender systems, for example, can influence what goes into a shopper’s cart before they leave a web site. “Realtime interactive performance can make a huge difference, but to make that happen you have to bring in a complex talent team that is hard to acquire and piece together a lot of different technologies that don’t talk to each other. With this new product we provide value from the original inspiration, when someone has the idea for a new data product, all the way to production.”
Putting It Together
Key components to deliver this include the ability to bring together graph, tabular, text and even image data for use by the application the user wants to create. The company says it’s the first to apply advanced machine learning to all these types of data sets in a single platform. “Our innovation in machine learning is more around decreasing the expertise you need to deal with the algorithm itself,” Guestrin says, homing in on developing algorithms that require minimal tooling.
Another big innovation, Guestrin says, is its “series of scalable data structures that we call a frame. It is a scalable data frame so that you can go beyond the memory limits of a computer.” On a laptop with just 4 or 8 gigabytes of memory, for example, it’s possible to process hundreds of millions of rows of data using the hard drive, thanks to the product’s “smart way of laying things out so that we can quickly access and iterate over data that doesn’t fit in memory.”
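To make the out-of-memory idea concrete, here is a minimal sketch of disk-backed processing using only Python's standard library. It is not GraphLab Create's API or its on-disk layout, neither of which is described here; the file name, column names and helper functions are hypothetical. It only illustrates the general technique of streaming rows from the hard drive and keeping a small running aggregate in memory instead of loading the whole table.

```python
import csv
import os
import tempfile
from collections import defaultdict


def write_sample_csv(path, n_rows):
    """Write a synthetic (user_id, amount) table to disk, one row at a time."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id", "amount"])
        for i in range(n_rows):
            writer.writerow([i % 5, i])


def streaming_group_sum(path, key_col, value_col):
    """Sum value_col per key_col without loading the whole file into memory."""
    totals = defaultdict(float)
    with open(path, newline="") as f:
        # DictReader yields rows lazily, so memory use depends on the
        # number of distinct keys, not on the number of rows in the file.
        for row in csv.DictReader(f):
            totals[row[key_col]] += float(row[value_col])
    return dict(totals)


def demo(n_rows=1000):
    """Build a temporary CSV (hypothetical data) and aggregate it from disk."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "events.csv")
        write_sample_csv(path, n_rows)
        return streaming_group_sum(path, "user_id", "amount")


if __name__ == "__main__":
    print(demo())
```

Raising `n_rows` makes the file on disk grow while the in-memory state stays at five running totals. A product like the one described would presumably add a storage layout tuned for fast repeated iteration, which this sketch does not attempt.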
At any point, he says, users can visualize their data and move from prototype to production-level scale by running the same code on a Hadoop or Amazon EC2 cluster. “You don’t need an engineering team to translate a data scientist’s idea into Java and build all the hardware,” he says. “Now it can be done easily by one individual.”
It’s the road to getting value fast from the expertise you do hire, says Johnnie Konstantas, VP of marketing. Not every company, after all, can afford a slew of data science professionals, even if the talent were more readily available. “You should be able to get value with a couple of learned individuals, whether data scientists or software engineers, if you have the right platform that makes it easy for them to build those apps.” The reality today, Guestrin says, is that the complexities in the toolsets that data scientists have to use keep them from being as productive in the organization as they want to be.
GraphLab competes at some level with consulting companies that offer to solve clients’ issues using machine learning, as well as other startups in the machine learning platform space, such as Skytree. It also overlaps a bit with features found in Trifacta’s technology (that company is profiled in a story at The Semantic Web Blog’s sister site, Dataversity). But, says Guestrin (who sits on Trifacta’s technical advisory board), “they have done something fantastic,… but mostly they focus on the earlier parts of the pipeline, while we focus on what it takes to analyze data and make a service out of it.” Guestrin says Trifacta’s CEO sits on GraphLab’s tech advisory board, as well.
More directly, GraphLab competes with open source projects like R, Apache Mahout and MLlib. But open source solutions, Guestrin says, can have issues related to performance and brittleness. “We and others using our systems have found our tools to be more scalable, robust and mature,” he says. Version 1.0 of GraphLab Create will be officially available next week, when the third annual GraphLab conference takes place in San Francisco.