by Charles Roe
DATAVERSITY™ recently interviewed Damian Black, the CEO of SQLstream, Inc. Damian will be giving a session at the NoSQL Now! Conference in San Jose, CA from August 20-22, 2013. The session is titled “Streaming Hadoop HBase Case Study: Turbo-charging Hadoop for real-time machine data analytics.”
The Speaker Spotlight Column (and its parallel venture the Sponsor Spotlight Column) is an ongoing project that focuses on highlighting several of the central issues represented at the many Data Management conferences produced by DATAVERSITY.
The interview focused on Damian Black's work and history within the industry, with particular emphasis on his presentation at the upcoming conference:
DATAVERSITY (DV): What are you going to discuss during your session at NoSQL Now! (Please be specific about the most important theme you’ll be discussing and the benefits the audience will obtain)?
Damian Black (DB): The session explores the challenge of accelerating Hadoop to process live, high-velocity unstructured data streams in order to deliver the low-latency, streaming operational intelligence demanded by today’s real-time businesses. Hadoop has been the driving force behind Big Data analytics, but as the technology hits the mainstream, many industries are seeking to go a step further and eliminate latency from their businesses completely. With the emergence of both streaming and the SQL language as key components of Big Data architectures, this presentation discusses how best to utilize the strengths of Hadoop in real-time operational intelligence applications using streaming SQL queries, a new concept for Big Data.
DV: What is really important about such a topic in terms of the current state of NoSQL (and databases in general), and how is the industry going to transform moving into the future?
DB: There are two important emerging Big Data technology trends. The first, SQL, may be a surprise to some, but is inevitable as Big Data storage technologies seek wider, mainstream adoption. SQL as a query language does not mandate an underlying RDBMS. SQL is emerging as an additional query language layer for Big Data storage platforms rather than a replacement for existing NoSQL platforms.
The second is streaming operational intelligence for high-velocity data, in particular from log data. Even a year ago, streaming was not considered a particularly mainstream technology. That perception has now changed dramatically, with the realization that Hadoop and NoSQL have the same inherent flaws for real-time, low-latency analytics as previous RDBMS platforms. Storing data, even in-memory, introduces latency, and once stored, queries must be re-executed continuously as new data arrive. This is adequate for some use cases, but for applications such as cyber-security, telecommunications and the Internet of Everything, we believe streaming data management offers the only technically viable and cost-effective solution.
DV: In terms of your current organization, what is the principal focus of your work? How does your topic tie into that work?
DB: The SQLstream vision is simple – to transform the way in which real-time data are processed. Machine data generated by servers, networks and sensors contain valuable insights into transactions, performance and fraud, for example, but are only useful if acted on in real time. True real-time data management platforms have been a niche area, with expensive, bespoke development and systems highly tailored to specific applications.
Our mission is to change that, to deliver a true SQL-standards based streaming Data Management platform that turns high velocity machine data into real-time operational intelligence before the data are stored. Our current focus is the interoperability of streaming SQL queries over Hadoop and NoSQL storage platforms such as HBase. SQLstream is a massively parallel, continuous SQL query execution engine that can be deployed as a streaming query language extension to Hadoop, interfacing with existing query and storage APIs.
We chose SQL because it is the only standards-based data query language; it is well defined and understood globally; it offers low-cost, rapid development; and, most importantly, it is ideally suited to dynamic optimization of streaming query execution over a massively distributed server infrastructure.
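To make the concept concrete, the incremental evaluation at the heart of a streaming query engine can be sketched in a few lines of Python. This is purely illustrative (the class and its methods are hypothetical, not SQLstream's API): a continuous windowed aggregate is updated as each event arrives, instead of storing the data and re-executing a query over it.

```python
from collections import deque

class SlidingWindowAverage:
    """Conceptual sketch of a continuous windowed aggregate:
    each arriving event updates the result incrementally,
    rather than re-querying a data store from scratch.
    (Hypothetical illustration, not SQLstream's engine.)"""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def on_event(self, timestamp, value):
        # Evict events that have fallen out of the time window.
        while self.events and self.events[0][0] < timestamp - self.window:
            _, old_value = self.events.popleft()
            self.total -= old_value
        self.events.append((timestamp, value))
        self.total += value
        # Emit the current aggregate immediately, with no
        # store-then-requery latency.
        return self.total / len(self.events)

agg = SlidingWindowAverage(window_seconds=60)
print(agg.on_event(0, 10.0))    # 10.0
print(agg.on_event(30, 20.0))   # 15.0
print(agg.on_event(90, 30.0))   # first event evicted -> 25.0
```

Because expired events are evicted as new ones arrive, memory stays bounded by the window size rather than growing with the full history of the stream.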
DV: Please tell us a little about yourself and your history in the industry, past work experience, and how you got started in the data profession?
DB: I was a top Computer Science graduate in the UK and then set out on a career in the high tech software sector, with senior management positions at HP, XACCT and Followap. It was during this time that the idea for SQLstream emerged. After managing the introduction of several very successful but architecturally different event management platforms, I realized there must be a better way, a standards-based data management platform capable of supporting a wide range of real-time applications.
I am also the author of 11 US patents covering the broad area of distributed, relational streaming, and was a finalist in the 1995 International Management Challenge. I am currently co-founder and CEO of SQLstream, and although British by birth, I now live in the San Francisco Bay Area in California.
DV: What is the biggest challenge happening in your particular area of the industry at this time?
DB: From a business perspective, the biggest challenge is the total cost of real-time performance for solutions built on Big Data storage platforms. Until now, Big Data has been considered primarily as a technology, but as that technology seeks adoption by mainstream enterprises, cost becomes a much more important consideration. Ultimately there are latency constraints imposed by the technology platforms, and there is also a cost tipping point beyond which adding more, bigger and faster servers makes little impact on the real-time performance of the system.
From a wider technical perspective, data collection architecture is also a challenge for many organizations. The core SQLstream platform can scale to many tens of millions of records per second by adding more servers, but there are often cost and bandwidth constraints on the collection and backhaul of the data. We are now getting to the point where data volume and velocity exceed the capacity of Gigabit Ethernet links, for example.
DV: How is such a change influencing your job?
DB: In response to this trend, streaming operational intelligence is emerging as the next phase of Big Data, enabling true real-time performance from structured and unstructured machine data and serving wider and wider audiences. Delivering real-time end-to-end traveller journey times, identity theft, fraud and cyber-security alerting, real-time game scoring, and real-time promotions are all examples where customers and consumers benefit from real-time, streaming operational intelligence.
With our platform in place, we have focused recently on improving the total cost of performance for streaming (greater throughput and lower latency with less hardware), and on extending our range of data integration adapters and agents, both for data collection and for downstream streaming integration with Hadoop, other Hadoop-based and RDBMS storage platforms, data warehouses and operational systems. In particular, we have introduced a new remote-agent log file collection architecture to address the issue of backhaul bandwidth. Our remote log agents offer resilience (in the event of communication link failure) and configurable filtering and analytics capabilities. This makes it possible to do processing at source, and can dramatically reduce the volume of data to be processed.
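The effect of at-source processing can be sketched as follows. This is a hypothetical illustration in Python (not SQLstream's actual agent code): the agent filters log lines against a pattern and keeps summary counts, so only the matching records plus a small aggregate need to cross the backhaul link.

```python
import re

def filter_and_aggregate(log_lines, pattern=r"ERROR|WARN"):
    """Hypothetical sketch of at-source processing: forward only
    the log lines that match a filter, plus summary counts, so
    far less data crosses the backhaul link."""
    matcher = re.compile(pattern)
    forwarded = []
    counts = {"total": 0, "matched": 0}
    for line in log_lines:
        counts["total"] += 1
        if matcher.search(line):
            counts["matched"] += 1
            forwarded.append(line)
    return forwarded, counts

lines = [
    "2013-08-22 09:30:01 INFO  request served in 12ms",
    "2013-08-22 09:30:02 ERROR timeout talking to HBase",
    "2013-08-22 09:30:03 INFO  request served in 9ms",
    "2013-08-22 09:30:04 WARN  queue depth above threshold",
]
forwarded, counts = filter_and_aggregate(lines)
print(len(forwarded), counts)  # 2 {'total': 4, 'matched': 2}
```

Here half the lines are dropped at the source; in production log streams, where routine records vastly outnumber alerts, the reduction in backhaul volume is typically far larger.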
DV: How have your job, and/or the work you are doing at your organization, altered in the past 12 months? How do you expect it will change in the next 1-2 years?
DB: The original vision is substantially in place, to build and maintain a massively scalable, distributed stream computing platform that uses standard SQL for stream analysis and integration. Over the past year, we have focused on intelligent log data collection and analysis, as well as visualization tools for high velocity data. Over the next two years we expect to see streaming data management adopted as a de facto component of all Big Data architectures, the emergence of SQL as a compatible partner to NoSQL, a greater proportion of data to be held in memory rather than persisted, and staggering new levels of data velocity with the growth of the Internet of Everything.
DV: Are there any other emerging technologies (such as Big Data, Semantics, Cloud Computing, etc.) that are going to affect your job in the future?
DB: There is some really interesting work going on at the minute around semantic analysis, machine learning and predictive analytics. However, we believe the foundations of data management will change completely with the emergence of all in-memory, massively distributed computing platforms. Larry Ellison announced this as the future of Data Management at last year’s Oracle OpenWorld Conference, and I would tend to agree.
DV: What is something noteworthy about yourself, outside of the work environment, which you would like to tell the conference attendees and our readers that they may not know?
DB: I try to put the same level of passion, commitment and enthusiasm into everything I do. I am a keen skier and have skied back-country with former national team and Olympic skiers, but recently discovered mountain biking. However, I have found that managing a Big Data startup in Silicon Valley is pretty straightforward compared with riding down steep, rutted cattle tracks with seasoned mountain bikers. On my last such attempt I fractured a rib and probably my wrist, but at least I got to buy a new bike!
If you are interested in attending Damian’s session at NoSQL Now!, please see the conference schedule at: http://nosql2013.dataversity.net/agenda.cfm?confid=74&scheduleDay=PRINT
His session is on Thursday, August 22nd at 9:30am.
About NoSQL Now!:
NoSQL Now! is an educational conference and exhibit focused on the emerging field of NoSQL technologies. NoSQL (Not Only SQL) refers to the new breed of databases that are not based on the traditional relational database model, including document stores, key-value stores, columnar databases, XML databases and graph databases. The NoSQL Now! Conference is designed to educate developers, data managers and architects on how these new technologies work, the applications they are best suited for, and how to deploy them. The upcoming 2013 event is expected to draw over 800 attendees. Additional details are available at http://www.NoSQLNow.com.