By Andrew Brust.
Today’s leading Cloud Platforms include numerous components for storing, processing and analyzing large volumes of data. All the basics are there: storage, batch processing and analysis, streaming data processing, data pipelining, data warehousing, BI and even AI. But while it’s great to have all those raw components, how do you tie them together into a comprehensive architecture? The parts may be great, but they still must be assembled into a whole.
In this post we’ll explore the various data processing and management components available to you on the Amazon Web Services (AWS) platform. We’ll discuss what’s possible with each of them and review various options for tying them together. Then we will contrast this with using an end-to-end platform that sits atop and leverages the best of these components, integrating them seamlessly in the background.
A Bit of Background
An important thing to keep in mind is that the Public Cloud gives you building blocks – Amazon calls them “primitives” – that deliver the functionality and the customizability necessary to build very sophisticated solutions. But what the Cloud doesn’t give you is the fit and finish, the white glove treatment, or the turnkey experience.
The Cloud provides a suite of products that customers can integrate together – but it doesn’t provide a ready-to-run solution. Each of these products/components is often best-of-breed in its own niche, designed and optimized for a specific purpose.
In taking this approach, the Cloud providers create a challenge for data teams – not an unsolvable one, but a challenge nonetheless. How you address this challenge can make the difference between project success and failure. We will now acquaint you with the problem and its underpinnings and set you on the right path for a sensible solution that mitigates risk and frustration, and prevents failure.
Stars in Alignment, for the Cloud
For context, it’s important to understand the current Cloud adoption imperative, as it puts in focus the motivation for building successful Cloud Analytics solutions.
A number of industry trends have combined to create the market demand we see today for Public Cloud solutions. Among them:
- Paying for what you need: The Cloud works on a combination of elastic resource deployment and utility-based pricing. Rather than having to lay out significant capital funds to acquire technology infrastructure for your heaviest intermittent workloads, the Cloud lets you use operating expense funds to pay for just the resources you need. This applies to both computing and storage resources.
- Externally-borne data: An increasing share of enterprise customers’ data originates off-premises. This data needs to be collected and consolidated into a single location – one that needn’t necessarily be on-premises. By extension, the Analytics infrastructure and software that will process this data needn’t run on-premises, either. Cloud storage can be an ideal place to land the data, and Cloud Platforms may be the best place to run the processing and Analytics on that data.
- Rapid obsolescence cycles: Innovation in hardware infrastructure – whether in storage, memory, or processing (CPU or GPU) – proceeds at a rapid pace. Hardware purchased now can begin to obsolesce in under a year, which makes customer ownership and physical installation of such infrastructure unattractive. Since rapid upgrades are desirable – and sometimes necessary for competitive reasons – renting (in the Cloud) is better than owning (on-premises).
Taken together, these factors seem like a perfect storm – in a good way. The Cloud has never been readier to run Analytics workloads and customers have never been readier to run Analytics workloads in the Cloud.
But as ideal as this may seem, it raises pressure and expectations around implementing complex technology in Cloud environments that are still relatively immature. That’s a perfect storm of its own – in a bad way. The risk of project failure is acute.
Analytics in the Cloud isn’t a flawed strategy, but relying exclusively on first-party Analytics components is often a recipe for disappointment, if not disaster. In the next section we’ll explore what those components are and what integrations between them AWS provides. In subsequent sections, we’ll point out why they’re usually not sufficient on their own.
What’s in the Cloud Analytics Stack?
Let’s now level-set and define the major components of a Public Cloud Analytics stack. Since all of the major Public Cloud providers have a dizzying array of products, in the spirit of not overwhelming, we will point out only the most salient ones on offer. The list below is a manifest of these major components, described in general terms, with the relevant AWS product in parentheses.
- Object storage (Simple Storage Service – S3)
- SQL over object storage (Athena)
- NoSQL Database Management (DynamoDB)
- Relational Database Management (Relational Data Service – RDS)
- Data Warehouse (Redshift)
- Hadoop and Spark cluster services (Elastic MapReduce – EMR)
- Data Transformation/ETL (Data Pipeline, Glue)
- Streaming data processing (Kinesis)
- Business Intelligence and Data Visualization (QuickSight)
The above list has a total of nine components – and a total of 10 AWS products. That may seem like a lot – but this is just a minimalist list. For example, products like Amazon Elasticsearch Service (search-based Analytics), Amazon Neptune (NoSQL Graph Database) and Amazon SageMaker (Machine Learning) have been omitted from the list.
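To make “SQL over object storage” concrete, here is a minimal sketch of the kind of DDL Athena uses to project a SQL table over files sitting in an S3 bucket. The bucket, database, and column names below are illustrative assumptions, not details from AWS documentation or this article.

```python
# Sketch: composing Athena-style DDL that overlays a SQL table on S3 objects.
# Bucket, database, and schema are hypothetical.

def athena_external_table_ddl(database: str, table: str, bucket: str, prefix: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement for CSV files under an S3 prefix."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        "  order_id string,\n"
        "  amount double,\n"
        "  order_date date\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION 's3://{bucket}/{prefix}/'"
    )

ddl = athena_external_table_ddl("salesdb", "orders", "my-data-lake", "raw/orders")
print(ddl)
```

Once a table is defined this way, ordinary SELECT statements run directly against the S3 objects – no data loading step required. In a real deployment the DDL would typically be submitted through the Athena console or an AWS SDK.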
Integrating These on Amazon
Amazon services are integrated through a collection of what might be called “bilateral interfaces.” In other words, rather than every service integrating with every other one, Amazon has implemented specific integration pairs, with some services more commonly integrated than others.
Most services, for example, can work natively with Amazon S3. This is central to Amazon’s strategy of encouraging customers to use S3 as their “Data Lake.” For instance:
- Elastic MapReduce can reference s3://-based URLs in almost any context where it would do so with hdfs://-based URLs to resources in the Hadoop distributed file system
- Elastic MapReduce components Hive, Spark SQL and Impala can each create external tables from files stored in S3
- Other systems, like DynamoDB, Redshift and Aurora, simply have built-in data import/load facilities, which can load data directly from files in S3 buckets right into their respective stores
- Amazon Kinesis Firehose can load streaming data directly into S3
Other integration pairs exist as well. For example:
- Amazon Kinesis Firehose can load streaming data directly into Redshift
- Redshift’s COPY command can load data directly from DynamoDB
- QuickSight can ingest data from S3, Redshift, RDS, Aurora, Athena and EMR
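Several of the load paths above boil down to variants of Redshift’s COPY command. Here is a minimal sketch of COPY statements for an S3 source and a DynamoDB source; the table names, bucket, and IAM role ARN are hypothetical placeholders.

```python
# Sketch: Redshift COPY statements for two of the sources discussed above.
# Table names, bucket, and IAM role ARN are hypothetical.

def copy_from_s3(table: str, bucket: str, prefix: str, iam_role: str) -> str:
    """COPY rows into a Redshift table from CSV files under an S3 prefix."""
    return (
        f"COPY {table}\n"
        f"FROM 's3://{bucket}/{prefix}/'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS CSV"
    )

def copy_from_dynamodb(table: str, ddb_table: str, iam_role: str) -> str:
    """COPY rows into a Redshift table directly from a DynamoDB table.
    READRATIO caps how much of the DynamoDB table's read capacity COPY may use."""
    return (
        f"COPY {table}\n"
        f"FROM 'dynamodb://{ddb_table}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "READRATIO 50"
    )

role = "arn:aws:iam::123456789012:role/RedshiftCopyRole"
s3_copy = copy_from_s3("sales", "my-data-lake", "exports/sales", role)
ddb_copy = copy_from_dynamodb("sales", "sales_items", role)
```

The two statements differ only in the FROM clause – one more illustration of S3 and DynamoDB being among the most “blessed” integration endpoints.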
The above lists are not comprehensive, but they illustrate a pattern: most AWS Analytics services integrate with S3, and many integrate with Redshift and DynamoDB. These are the three most “blessed” services in the Analytics stack and the ones most commonly integrated with. But what about other permutations?
For example, what if we wanted to load data from Kinesis Streams into a table in DynamoDB or Aurora? Or, more simply, what if you wanted to replicate data from DynamoDB into Aurora in real time? Such pairings are possible, but they require a lot of work. In the latter case, customers must compose a connection using Kinesis Firehose and AWS’ serverless compute service, Lambda. This is not for the faint of heart.
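To give a flavor of the work involved: any Lambda-based replication of DynamoDB changes needs, at minimum, code that unmarshals DynamoDB’s typed attribute format into flat relational rows before anything can be inserted into Aurora. Below is a hypothetical sketch of just that translation step; the attribute names are invented, and in a real function the resulting rows would still have to be written to Aurora with a database client.

```python
# Sketch: flattening a DynamoDB-style change record into a row suitable for
# a parameterized INSERT into a relational table. Attribute names are
# invented; writing the rows to Aurora (via a MySQL/PostgreSQL client)
# is omitted.

def unmarshal(attr: dict):
    """Convert one DynamoDB typed value ({'S': ...}, {'N': ...}, ...) to Python."""
    (type_code, value), = attr.items()
    if type_code == "S":
        return value
    if type_code == "N":
        return float(value) if "." in value else int(value)
    if type_code == "BOOL":
        return value
    if type_code == "NULL":
        return None
    raise ValueError(f"unhandled DynamoDB type: {type_code}")

def rows_from_stream_event(event: dict):
    """Extract new item images from a batch of DynamoDB change records."""
    rows = []
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            rows.append({key: unmarshal(val) for key, val in image.items()})
    return rows

sample_event = {
    "Records": [
        {
            "eventName": "INSERT",
            "dynamodb": {"NewImage": {"id": {"S": "a1"}, "qty": {"N": "3"}}},
        }
    ]
}
rows = rows_from_stream_event(sample_event)
print(rows)  # [{'id': 'a1', 'qty': 3}]
```

And this is only the data-shaping half of the job – error handling, retries, batching, and the actual Aurora writes all still remain, which is why stitching these services together yourself is not for the faint of heart.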
In the next post, we’ll look at options and approaches for leveraging these components with third-party solutions for an agile, end-to-end experience.