Using a Data Lake Engine to Provide Self-Service Insights


Understanding and fulfilling customer needs is the key to business success, and customer data is the foundation upon which that success is built. Yet accessing and analyzing that data almost always depends on data engineers and other IT staff, while decision-makers wait to receive insights. One way to skip the wait and deliver data directly to end-users is to create an internal Data-as-a-Service (DaaS) model that enables access to enterprise data no matter where it resides, without the assistance of IT staff.


According to the DAMA DMBoK2, there are two models for Data-as-a-Service: one model uses data from outside the company, and the second model uses the company’s internal data, presented as a “service” via the IT department to internal data consumers. External DaaS uses data licensed from a vendor and provided on demand, rather than being stored and maintained by the licensing organization. A common example of this type of Data-as-a-Service is information on the securities sold through a stock exchange and their associated prices. The internal model of DaaS uses the concept of “service” within an organization to provide a company’s own enterprise data or data services to various functions, people, and operational systems.

Daniel Newman, in a Forbes post entitled “Data as a Service: The Big Opportunity for Business,” said that most companies with onsite data storage and analysis “are hard-pressed to keep up with increasing demand for data-driven insights.” DaaS offers data streams tailored to client needs, saving valuable time and effort, he said. When companies have access to the data they need in an easy-to-use format, leveraging that data as an asset becomes much easier and less time-consuming.

Tomer Shiran, co-founder and CEO of Dremio, says that the goal is to make it possible for companies to finally become data-driven, striving toward the “Holy Grail of analytics, to ask any question of the data at any time, regardless of how big the data is or what system it’s stored in.” Shiran believes accessibility to analytics should be similar to a utility: “Just like you can tap into electricity or open the faucet at home and you have water. You don’t have to worry about it.” The reality, he said, is that companies don’t have all their data in one place, so they are far from being able to access and analyze their data easily.

Scattered Data and the Burden on IT

Considering that many companies see their data as their main differentiating asset, they should be able to take advantage of it, Shiran said, but for most companies that’s impossible. With data scattered across multiple systems, accessing it for analytics becomes too complex and overwhelming, and the skillsets needed to organize it and run queries on it aren’t there.

IT staff today are forced to copy and move data from the lake to data warehouses, cubes, BI extracts, and aggregation tables in order to gain enough performance to be able to ask questions of it, Shiran said. But doing so also dramatically shrinks the scope of data available for analysis. “The as-yet unrealized goal is to be able to ask questions on all of the data, regardless of where it is and still get an extremely fast response.”

End-users don’t understand or don’t care about the difference between an Oracle database and a directory of parquet files on S3, he said. “The only way this is ever going to work is if you can ask questions on the data where it is, and increasingly that’s in data lake storage.”

To users on the business side, a data set is a data set, and they just want to easily add new sources and experience fast response times, regardless of whether they’re querying a single source or multiple sources. “People don’t want to go through a travel agent anymore. They want to be independent and free to move quickly.”

The workplace has evolved to where analysts on the business side are very knowledgeable about the use of data and want to be able to explore all of it and ask their own questions. “These folks no longer want to just see a print-out on their desk in the morning. They want to go and do it themselves.”

Hadoop and Vendor Lock-In Challenges

Hadoop-based data lakes ultimately became difficult for companies to create, maintain, and use, he said, so the people who got the most value out of them were the developers and technical staff.

“Dremio started by thinking that if you could start all over with a clean slate and make it radically easier and faster to query data lake storage and other sources, it would be magical.”

The “clean slate” mentality allowed them to see the wisdom in capitalizing on current technology trends in the industry, such as cloud adoption, and in particular the trend towards landing and storing all types of data in cloud-based data lake storage like AWS S3 and Microsoft ADLS. And the dramatically increasing volume of that data means it is becoming ever less practical to copy, transform, and move it into data warehouses. It was starting to become clear that all companies, not just startups and technology companies, would be leveraging the public cloud in a big way, Shiran said, so they wanted to build on that trend.

They also wanted to have an open approach, where companies could choose the clouds they wanted and easily migrate between them. “A lot of companies do have a multi-cloud strategy. Being able to utilize that same technology both for your on-premise data lake and your cloud-based data lake is equally important.”

A problem they wanted to avoid was vendor lock-in, a trend they’d been hearing about from companies over the last decade. Being locked into specific vendors or into a specific kind of data warehouse with skyrocketing costs has been a pain point for customers, he said. “Our focus as a company has been to innovate in a way that allows the customer to use other compute engines and other tools with their data.”

Data Lake Engine

A modern system has to be able to support data independence and innovation by providing quick, accessible answers to user requests, no matter where the data resides. Dremio combines data lake storage with its purpose-built data lake engine, providing flexibility and control for data architects, and self-service for data consumers, Shiran said. With the data lake engine, data consumers perform their analytics directly against the data lake, at full interactive performance. All data remains in place, as the data lake engine eliminates data copies and moves.

The data lake engine provides a user-generated semantic layer with an integrated, searchable catalog that indexes all metadata so business users can easily make sense of all their data. It can connect to any BI or data science tool and looks just like a relational database. Data curation in a standard SQL virtual context allows fast, easy, and cost-effective filtering, transformation, joining, and aggregation of data from one or more sources, all without any involvement from IT and data engineering teams.
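
The idea of curating data in a “SQL virtual context” can be illustrated with a small, generic sketch. The example below is not Dremio’s API; it uses Python’s built-in sqlite3 module as a stand-in, and the source names (reservations, feedback), table names, and sample rows are all hypothetical. Two attached in-memory databases play the role of separate source systems, and a view plays the role of a virtual dataset that filters, joins, and aggregates across them without copying any rows.

```python
import sqlite3


def build_demo_connection() -> sqlite3.Connection:
    """Create two hypothetical 'sources' and a virtual dataset over them."""
    conn = sqlite3.connect(":memory:")

    # Two attached in-memory databases stand in for separate source systems.
    conn.execute("ATTACH DATABASE ':memory:' AS reservations")
    conn.execute("ATTACH DATABASE ':memory:' AS feedback")

    conn.execute(
        "CREATE TABLE reservations.bookings "
        "(customer_id INTEGER, cruise TEXT, fare REAL)"
    )
    conn.execute(
        "CREATE TABLE feedback.surveys (customer_id INTEGER, rating INTEGER)"
    )
    conn.executemany(
        "INSERT INTO reservations.bookings VALUES (?, ?, ?)",
        [(1, "Caribbean", 1200.0), (2, "Caribbean", 900.0), (3, "Alaska", 1500.0)],
    )
    conn.executemany(
        "INSERT INTO feedback.surveys VALUES (?, ?)",
        [(1, 5), (2, 3), (3, 4)],
    )

    # The view stores only the query, not the data. It is a TEMP view because
    # SQLite only allows temporary views to span attached databases.
    conn.execute(
        """
        CREATE TEMP VIEW cruise_summary AS
        SELECT b.cruise      AS cruise,
               COUNT(*)      AS bookings,
               SUM(b.fare)   AS revenue,
               AVG(s.rating) AS avg_rating
        FROM reservations.bookings AS b
        JOIN feedback.surveys AS s ON s.customer_id = b.customer_id
        WHERE s.rating >= 3
        GROUP BY b.cruise
        """
    )
    return conn


def cruise_summary(conn: sqlite3.Connection) -> list:
    """Query the virtual dataset as if it were an ordinary table."""
    return conn.execute(
        "SELECT cruise, bookings, revenue, avg_rating "
        "FROM cruise_summary ORDER BY cruise"
    ).fetchall()


if __name__ == "__main__":
    for row in cruise_summary(build_demo_connection()):
        print(row)
```

To the consumer, cruise_summary looks like one relational table, even though its rows are assembled at query time from two separate sources.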

Data architects maintain complete control: Sensitive data can be masked, row and column-level permissions can be set, and role-based control ensures smooth access to whatever end-users need. Data lineage is built-in, with relationships between data sources, virtual datasets, and queries maintained in Dremio’s data graph, showing exactly where each dataset came from.
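
As a rough illustration of how masking combines with row- and column-level permissions, consider the minimal sketch below. It is a generic example in plain Python, not how Dremio implements access control; the AccessPolicy class, the mask function, and the sample columns are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, object]


@dataclass(frozen=True)
class AccessPolicy:
    """Hypothetical per-role policy: which columns and rows a role may see."""
    visible_columns: frozenset[str]
    masked_columns: frozenset[str] = frozenset()
    row_filter: Callable[[Row], bool] = lambda row: True


def mask(value: object) -> str:
    """Hide all but the last four characters of a sensitive value."""
    text = str(value)
    return "*" * max(len(text) - 4, 0) + text[-4:]


def apply_policy(rows: list[Row], policy: AccessPolicy) -> list[Row]:
    """Return only the rows and columns the policy allows, masking as needed."""
    result = []
    for row in rows:
        if not policy.row_filter(row):  # row-level permission
            continue
        result.append(
            {
                col: mask(val) if col in policy.masked_columns else val
                for col, val in row.items()
                if col in policy.visible_columns  # column-level permission
            }
        )
    return result


if __name__ == "__main__":
    customers = [
        {"name": "Ann", "region": "EU", "card": "4111111111111111"},
        {"name": "Bob", "region": "US", "card": "5500000000000004"},
    ]
    # A hypothetical role that sees only EU customers, with card numbers masked.
    eu_analyst = AccessPolicy(
        visible_columns=frozenset({"name", "card"}),
        masked_columns=frozenset({"card"}),
        row_filter=lambda row: row["region"] == "EU",
    )
    print(apply_policy(customers, eu_analyst))
```

In a real engine these rules would be enforced inside the query layer rather than in application code, so that every tool connecting to the data sees the same restricted view.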

Shiran used Royal Caribbean Cruise Line as an example of a company that uses DaaS to provide a personalized experience for its customers. “They have created a very modern data architecture in the cloud, on Azure, and they have data in a couple dozen different systems feeding into Azure Data Lake Storage,” he said, ranging from property management, to their casino, to their reservation systems.

Customer behavior is captured in the period before booking a cruise and as customers shop for their trip, and this is combined with information about what they do on the cruise, as well as the feedback that the customers provide after their cruise. That comprehensive data collection process provides a much deeper understanding of their customers, allowing Royal Caribbean, for example, to send a retired couple a different targeted cruise offer than the offer they would send to a family with four young children.

Massive Change Begets Opportunity

“We are in the midst of a massive change due to the rise of the public cloud, and with the resulting separation of compute and storage,” Shiran said. In the past, with Hadoop clusters, the compute ran on the storage because at the time, networking was the biggest concern. “It was the shuffle speed, and wondering if I had enough networking bandwidth to actually make these big queries work.” Now with the cloud, networking is no longer an issue, and because storage is offered as a service, the compute is separated. “So now you see this opportunity for companies to choose the best tool for the job.”

In an interview on Sourceforge, Shiran said that DaaS is a paradigm for making data easy to discover, curate, share, and analyze no matter where it is being managed, no matter how big it is, and no matter what tool is used for analysis or visualization. DaaS integrates several functional areas into a single, scalable, and self-service solution. By adopting the DaaS paradigm, companies can make their data consumers more self-sufficient and independent, while making their data engineers more productive.

“Companies need to be data-driven in order to survive in the world that we live in now, but unless it’s easy, that’s just not going to happen,” Shiran said.

