Today, we’re seeing more companies embrace cloud-based technologies to deliver superior customer experiences. A common underlying architectural pattern is the open data lakehouse. That is no surprise – open data lakehouses can easily handle digital-era data types that traditional data warehouses were not designed for.
Data warehouses are great at storing and analyzing the tables and schemas that represent traditional business processes: products, sales transactions, accounts, and other structured data. Open data lakehouses can additionally store and analyze semi-structured and unstructured data, such as click-stream data, sensor data, geospatial data, and media files. Analysis is performed via traditional SQL queries and ML/AI programming frameworks. On top of this flexibility, the open data lakehouse offers these capabilities with free, open-source packages and open data formats. But unlike the data warehouse, open data lakehouses don’t come as one integrated platform. Instead, they are assembled from best-of-breed open-source software (OSS) components that deliver query execution, transactional support, and security.
In this article, we’ll look at how companies are building the open data lakehouse to augment the data warehouse. The open data lakehouse is a more flexible stack that addresses the high costs, lock-in, and limitations of the traditional data warehouse. Specifically, we’ll look at how companies are securing the open data lakehouse, including the initial challenges and their open-source solutions.
The open data lakehouse consists of low-cost, scalable data lake storage (e.g., AWS S3), database-like data management functionality (e.g., Apache Hudi, Apache Iceberg), open data formats (e.g., Apache Parquet, ORC), governance/security (e.g., Apache Ranger, AWS Lake Formation), ML and AI frameworks (e.g., TensorFlow, PyTorch), and SQL query processing engines (e.g., Presto). On top of this stack sit your reporting and dashboarding tools along with your data science, ML, and AI tools.
While this article will focus on security, it’s important to note that SQL query capabilities, ML and AI frameworks, and transactional support can all be added to your data lake. Many companies are evolving to this architecture for the reasons listed above – lower cost, more flexibility, and better price-performance than the data warehouse paradigm.
Implementing Data Security: The Data Platform Team
As the data lake has become widely used, digital-native companies are more closely managing the data security and governance of their diverse data sets and their corresponding use. Controlling who has access to what data and what permissions a user might have is critical. For the teams working on data lakehouse security, the organization typically consists of the data platform owner, the data practitioner (i.e., data analyst, data scientist, data engineer), and the security administrator. For the purposes of this article, we’ll focus on the data platform owner and the data practitioner.
When it comes to data lakehouse security, there are three key areas that need to be addressed:
- Multi-user support
- Role-based access control
- Audit support
In the last year, we’ve seen a pronounced effort around building technologies that address these areas for the data lakehouse. Before, it was a challenge to address these security requirements – the data platform team would have to custom-build and manage these policies on their own. As companies grow, their data and the number of users who need access to that data increase dramatically. Keeping up with that scale from a security perspective was very hard; many times, it meant sharing access credentials across teams or just giving everyone access to everything in the lake.
Now, as more proprietary and personal data is being stored and more data practitioners work on the data lakehouse, security needs to be much tighter. Below, we’ll dive into these three key security areas and why they’re important.
Data practitioners need access to the computing clusters that the data platform owner provisions for them, which is why identity and access management and authorization are important. Multi-user support within an open data lakehouse architecture makes this possible, so it’s a critical component of security. Instead of making everyone a data platform owner, it means giving individual users their own credentials, with narrower rights, to specific clusters. This reduces “key-person” risk across teams. Ultimately, the data platform team wants easy management of a set of users. Sharing credentials across an organization doesn’t meet today’s security requirements.
Role-based access control (RBAC), which sets authorization levels for an organization’s users, is the next critical piece of security. Users need to be authenticated and authorized in a unified way – you want to make sure the right people within your organization have the right access to the right data. Some of the more common RBAC technologies we see in the open data lakehouse stack are Apache Ranger and AWS Lake Formation. Both offer fine-grained access control for your data, giving data platform owners more control over who can access what data.
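To make the role-based idea concrete, here is a minimal sketch in plain Python. It is not how Apache Ranger or AWS Lake Formation implement policies; the role names, table names, and the `is_allowed` helper are all hypothetical and exist only to show permissions being attached to roles rather than to individual users.

```python
from dataclasses import dataclass, field

# Hypothetical role definitions: each role maps a table name to the set of
# actions it may perform on that table. All names are illustrative only.
ROLE_PERMISSIONS = {
    "analyst": {"sales.orders": {"select"}},
    "data_engineer": {
        "sales.orders": {"select", "insert"},
        "web.clickstream": {"select", "insert"},
    },
}


@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)


def is_allowed(user: User, table: str, action: str) -> bool:
    """Return True if any of the user's roles grants `action` on `table`."""
    return any(
        action in ROLE_PERMISSIONS.get(role, {}).get(table, set())
        for role in user.roles
    )


alice = User("alice", {"analyst"})
bob = User("bob", {"data_engineer"})

print(is_allowed(alice, "sales.orders", "select"))     # True
print(is_allowed(alice, "web.clickstream", "select"))  # False
print(is_allowed(bob, "web.clickstream", "insert"))    # True
```

Changing what an analyst may do then means editing one role entry instead of touching every analyst’s account, which is the main operational appeal of the role-based approach.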
Audit support allows for the centralized auditing of user access based on permission levels. Apache Ranger, for example, keeps an audit trail: when users interact with data, it tracks what they did. It’s also important to be able to track when users request access to data and whether those requests are approved or denied based on permission levels.
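As a rough illustration of what an audit trail captures, the sketch below records every access request together with its outcome. It is written in plain Python under stated assumptions: the `GRANTS` table, the `request_access` function, and the event field names are hypothetical and do not reflect the audit schema of Apache Ranger or any other product.

```python
import json
from datetime import datetime, timezone

# Hypothetical grants: user -> set of tables the user may read.
GRANTS = {"alice": {"sales.orders"}}

# In a real deployment, events would be shipped to a central audit store.
audit_log = []


def request_access(user: str, table: str) -> bool:
    """Decide a read request and record the decision as an audit event."""
    allowed = table in GRANTS.get(user, set())
    audit_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "resource": table,
        "action": "select",
        "result": "ALLOWED" if allowed else "DENIED",
    })
    return allowed


request_access("alice", "sales.orders")  # permitted by GRANTS
request_access("alice", "hr.salaries")   # not granted, so denied

# Print each event as one JSON line.
for event in audit_log:
    print(json.dumps(event))
```

Note that denied requests are logged as well as approved ones; a log that only records successful reads cannot show who tried to reach data they were not entitled to.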
Key Technologies to Enable Data Security
We’ve touched on a few technologies, so let’s dive a little deeper into them. When it comes to securing your data in the data lakehouse, there are three technologies to dive into: Apache Ranger, AWS Lake Formation, and Presto.
Apache Ranger is an open-source framework that allows users to manage data security across the data lake. One of the big benefits of Ranger is its open and pluggable architecture, meaning it can be used across clouds, on-prem, or in hybrid environments and can be integrated with various compute and query engines including Presto, Google BigQuery, Azure HDInsight, and many more. Apache Ranger gives you unified data access governance and security for your data.
AWS Lake Formation is an Amazon service that makes it easy to set up a secure data lake in a matter of days. For AWS users, this service integrates readily into an existing stack and is typically the go-to choice. Lake Formation provides the governance layer for AWS S3, and setup is simple – users define their data sources and the access and security policies they want to apply, and they’re up and running.
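For readers who script their AWS setup, the sketch below shows roughly what defining one such access policy can look like from Python. It assumes the boto3 SDK and its Lake Formation client’s `grant_permissions` call; the role ARN, database name, table name, and both helper functions are placeholders invented for this example. Only the request is built and printed here, so the sketch runs without AWS credentials.

```python
def build_select_grant(principal_arn: str, database: str, table: str) -> dict:
    """Build keyword arguments for Lake Formation's GrantPermissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],
    }


def apply_grant(grant: dict) -> None:
    """Send the grant to AWS. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the sketch runs without AWS access

    boto3.client("lakeformation").grant_permissions(**grant)


# Placeholder ARN, database, and table names (assumptions for illustration).
grant = build_select_grant(
    "arn:aws:iam::123456789012:role/analyst",
    "sales",
    "orders",
)
print(grant)
# apply_grant(grant)  # uncomment to apply against a real AWS account
```

Keeping the request construction separate from the call that applies it makes the policy easy to review or test before anything changes in the account.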
Presto is an open-source SQL query engine for the data lakehouse. It’s used for interactive, ad hoc analytics on data as well as the common reporting and dashboarding use cases. It runs at scale at some of the top digital companies like Meta/Facebook, Uber, ByteDance, and Twitter. With Presto, data platform owners get built-in multi-user support for their Presto clusters (which access the data in the data lake to run queries). Presto makes it easy to control who has access to what data. If you use a Presto managed service, you can leverage pre-built integrations with Apache Ranger and/or AWS Lake Formation to take advantage of the security and governance those technologies provide as well.
Securing data in the data lakehouse has become even more important as more companies look to augment their cloud data warehouse with insights from their lake. With all the benefits the data lakehouse offers, including lower cost, more flexibility, better scale, and greater openness, digital-native companies want to leverage it more than ever before. With the fine-grained access control and governance capabilities on the market today, data lakehouse security can now be on par with that of the data warehouse, and it is possible to architect a fully secured data lakehouse.