Driving Greater SQL Scalability and Flexibility with Machine Learning

It’s time to dispel some myths surrounding SQL. That’s the message from MemSQL, a scalable real-time Data Warehouse that is designed to ingest and transform millions of events per day, while simultaneously analyzing billions of rows of data using standard SQL.

As that description suggests, the notion that there’s no such thing as scalable SQL doesn’t hold up, according to Gary Orenstein, Senior VP for Products at MemSQL. One of the oft-cited reasons for moving from SQL to NoSQL is concern that SQL solutions can’t scale, Orenstein says. But today, there’s a renewed awareness that it is possible to scale SQL, partly thanks to Google’s Cloud Spanner, a globally distributed relational database service that counts horizontal scaling among its features.

“It’s challenging and not for the faint of heart, but it is technically possible,” he says. The perception that SQL suffers from scalability failings comes from the trials and tribulations people experienced over the decades with single-node databases that in fact didn’t scale well. “But that’s not true anymore,” says Orenstein. It’s possible to add nodes on industry-standard hardware to scale out for analyzing petabytes of data and supporting millions of users.

Going hand in hand with the scalability myth is the idea that SQL isn’t flexible. IT leaders generally also cite this as a reason for moving to NoSQL from SQL. As an example, Orenstein says that a knock made against SQL is that it’s not possible to make changes to a table while the database is online.

Not so, says Orenstein, who points to “MemSQL reshaping that norm with commands like Online Alter Table,” and to support for performing transaction processing and real-time analysis on semi-structured and unstructured data streams (like JSON documents) in the same environment as structured data.
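As a rough illustration of the two ideas in that claim, changing a populated table's schema and keeping JSON documents next to structured columns, here is a minimal sketch. It uses Python's built-in SQLite module purely as a stand-in; it is not MemSQL, and it says nothing about how MemSQL keeps the table online during the change. The table and column names (`events`, `payload`, `region`) are hypothetical.

```python
import json
import sqlite3

# SQLite is only a stand-in here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)"
)
conn.execute(
    "INSERT INTO events (user_id, payload) VALUES (?, ?)",
    (42, json.dumps({"action": "click", "page": "/pricing"})),
)

# Schema change on a table that already holds data.
# Existing rows pick up the column's default value.
conn.execute("ALTER TABLE events ADD COLUMN region TEXT DEFAULT 'unknown'")

conn.execute(
    "INSERT INTO events (user_id, payload, region) VALUES (?, ?, ?)",
    (7, json.dumps({"action": "view", "page": "/docs"}), "emea"),
)


def actions_by_region(connection):
    """Return (region, action) pairs, decoding the JSON payload in Python."""
    rows = connection.execute("SELECT region, payload FROM events ORDER BY id")
    return [(region, json.loads(payload)["action"]) for region, payload in rows]


result = actions_by_region(conn)
```

The row inserted before the schema change and the row inserted after it are read back through the same query, which is the property the "knock against SQL" says is missing.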

Spurning the Myths, Raising the Stakes

If the above SQL concerns were true, MemSQL would have a hard time fulfilling a mission that is centered on supporting the operational side of data activity – on helping the frontline of the business and the applications that power it on a daily basis, as Orenstein explains it. At the highest level, he says, MemSQL reunites the segmented transactional database and analytical Data Warehouse worlds, “to provide instant analytics on data that is up to date to the last click or event or current moment of business.”

That requires MemSQL to focus on three things: real-time transactionality to maintain the state of the application (i.e., the business); performance, leveraging memory-optimized tables to keep up with demands to deliver analytics in a timely way (i.e., fast answers to drive the business forward); and scaling systems in a distributed way to keep up with customers (i.e., the business’ growth).

It continues to push these goals in its latest release, MemSQL 6. The first of its three primary product pillars, Orenstein says, is extensibility, a term that refers to in-database functions that provide custom processing – stored procedures, user-defined functions and user-defined aggregates. MPSQL (Massively Parallel Structured Query Language) was developed for extensibility, with benefits that include the ability to centralize processes in the database across multiple applications, the performance of embedded functions, and the potential to create new machine learning functions.
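To make "in-database functions" concrete, the sketch below registers a user-defined function and a user-defined aggregate and then calls both from a SQL query. It again uses Python's built-in SQLite module as a stand-in and does not show MPSQL syntax; the function names (`c_to_f`, `value_range`) and the `readings` table are hypothetical.

```python
import sqlite3


def celsius_to_fahrenheit(celsius):
    """Scalar user-defined function: one value in, one value out."""
    return celsius * 9.0 / 5.0 + 32.0


class ValueRange:
    """User-defined aggregate: max minus min over a group of rows."""

    def __init__(self):
        self.lo = None
        self.hi = None

    def step(self, value):
        # Called once per row in the group; NULLs are skipped.
        if value is None:
            return
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def finalize(self):
        # Called once per group to produce the aggregate result.
        if self.lo is None:
            return None
        return self.hi - self.lo


conn = sqlite3.connect(":memory:")
conn.create_function("c_to_f", 1, celsius_to_fahrenheit)
conn.create_aggregate("value_range", 1, ValueRange)

conn.execute("CREATE TABLE readings (sensor TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 20.0), ("a", 25.0), ("b", 30.0)],
)

rows = conn.execute(
    "SELECT sensor, value_range(c_to_f(temp_c)) "
    "FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
```

Because the logic is registered once with the database connection, any query can reuse it, which is the centralization benefit the article describes.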

Businesses that have invested in other database solutions with their own approaches to in-database functions, and that want to move to a more modern solution, spend less money, or both, can now be welcomed with open arms by MemSQL, Orenstein says. “It’s a great way to take legacy apps nearing the end of their useful existence on older platforms to a more modern platform,” he says.

The second pillar targets query processing performance. Highlights here include conducting operations on encoded data and compressed data for very rapid scans – up to one billion rows per second per core, the company says – and taking advantage of Intel advancements with Single Instruction, Multiple Data (SIMD). The CPU can complete multiple data operations in a single instruction, essentially vectorizing and parallel processing a query across multiple channels. “For some queries that means they can sometimes be 80 times faster” than in the previous version, he says.
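The idea of operating directly on encoded data can be sketched with dictionary encoding, one common column encoding; whether MemSQL uses this exact scheme is an assumption. The predicate is evaluated once per distinct value, and the scan then touches only small integer codes. Pure Python does not use SIMD, so this shows the data layout only; in a real engine the final loop is the part that vector instructions would process many codes at a time. All names here are hypothetical.

```python
from array import array


def dictionary_encode(values):
    """Replace each value with a small integer code into a sorted dictionary."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    codes = array("I", (code_of[value] for value in values))
    return dictionary, codes


def count_matching(dictionary, codes, predicate):
    """Count rows satisfying the predicate without decoding any row."""
    # Evaluate the (possibly expensive) predicate once per distinct value.
    matches = bytearray(1 if predicate(value) else 0 for value in dictionary)
    # The scan itself is a tight loop over integer codes.
    return sum(matches[code] for code in codes)


column = ["us", "de", "us", "fr", "de", "us"]
dictionary, codes = dictionary_encode(column)
hits = count_matching(dictionary, codes, lambda country: country in {"us", "fr"})
```

With six rows and three distinct values, the predicate runs three times rather than six; on columns with billions of rows and few distinct values the saving grows accordingly.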

Enhancing online operations for business front lines is the third product pillar. “A front line system has to support mission-critical workloads – it can’t go down, so we have enhancements to keep it online,” providing resiliency against network and machine failures, he says.

Matters of Machine Learning

Regarding the extensibility pillar, Orenstein expanded on its ties to Machine Learning. By giving database developers the ability to create custom processing functions, MemSQL lets them program machine learning algorithms natively in SQL, in place. As an example, they could build an algorithm that uses machine learning to determine how data best clusters together.

“The benefit is you do Machine Learning right there with operational data and in a position to serve the data in realtime, and you can connect all your favorite BI tools,” he says.

Given the number of people likely familiar with SQL in an organization, which could reach to the thousands, enabling Machine Learning in SQL could have broad impact when it comes to closing the gap between data science and production applications. “By bringing Machine Learning to SQL we foresee enabling a much larger portion of an enterprise organization to be able to take advantage of these techniques,” he says.

“This is built-in functionality to take advantage of Machine Learning with a simpler architecture, a faster architecture and one that is constantly able to use the most up-to-date data.”


Real-time Machine Learning scoring is possible, too. Take the example of an energy company MemSQL worked with that runs a large drilling operation and wanted to know whether a drill bit is stable or has reached the end of its usable life. Before MemSQL, the company would send the drill bit’s sensor data to an overnight batch process on a traditional static system and get a report on its state in the morning, he says; by then, the operator was practically getting on with the day on a best-guess basis. With MemSQL, the model was exported from SAS into MemSQL in PMML (Predictive Model Markup Language) format so that as sensor information came in, the data was scored in real time.

“So, second by second the operator knows how the drill bit is doing, and can stop using it just before it breaks and needs an expensive repair,” he says. The use cases for real-time Machine Learning scoring extend to many industries, including financial services for trading and fraud monitoring and media and communications for understanding viewership in a way that can benefit both the audience and advertising partners.
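The scoring pattern described here, loading an exported model once and applying it to each reading as it arrives, can be sketched as follows. The XML is a simplified fragment modeled on PMML's `RegressionModel` element (real PMML files carry a namespace, a data dictionary and more); the sensor names, coefficients and alert threshold are invented for illustration and have nothing to do with the energy company's actual model.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment in the style of a PMML regression model.
PMML_FRAGMENT = """
<RegressionModel functionName="regression">
  <RegressionTable intercept="0.5">
    <NumericPredictor name="vibration" coefficient="2.0"/>
    <NumericPredictor name="temperature" coefficient="0.01"/>
  </RegressionTable>
</RegressionModel>
"""


def load_model(xml_text):
    """Parse the intercept and per-feature coefficients from the fragment."""
    table = ET.fromstring(xml_text).find("RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("NumericPredictor")
    }
    return intercept, coefficients


def score(model, reading):
    """Linear score: intercept plus coefficient times value for each feature."""
    intercept, coefficients = model
    return intercept + sum(
        coefficient * reading[name] for name, coefficient in coefficients.items()
    )


def score_stream(model, readings, threshold):
    """Yield (score, alert) for each reading as it arrives."""
    for reading in readings:
        value = score(model, reading)
        yield value, value >= threshold


model = load_model(PMML_FRAGMENT)
incoming = [
    {"vibration": 0.1, "temperature": 50.0},
    {"vibration": 2.0, "temperature": 90.0},
]
alerts = [alert for _, alert in score_stream(model, incoming, threshold=4.0)]
```

Because `score_stream` is a generator, each reading is scored the moment it is consumed rather than after a batch has accumulated, which is the contrast with the overnight process.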

Consider it the latest step in the company’s efforts to help customers capture and make use of their mountains of data in as rapid a way as possible.

Photo Credit: Wright Studio/Shutterstock.com
