Rethinking Extract Transform Load (ETL) Designs

Click to learn more about author Aditi Raiter.

Are you in a work environment where streaming architecture is not yet implemented across all IT systems? Have you ever been in a situation when you had to represent the ETL team by being up late for L3 support only to find out that one of your team members had migrated a non-performant SQL to PROD? Are you an expert handling technical questions from a large build team leaving you no time to manage code designs/reviews for all projects? You are not alone – this article details out a strategy to overcome such concerns.

In a typical work environment, there are change control processes to request work from technical teams. A new change request enters the ETL queue, then requirements are gathered, quoted, and sent off to the build (development) team. If the build team has questions, they get help from seniors, or else they code common ETL best-practice tasks themselves. Most of them end up doing these tasks:

Manually copy source DB data model on target environments. Add ETL identifiers to metadata
Manually identify incremental change data capture strategy at table level
Manually set up error handling mechanism on file creates/DB updates
Manually set up audit logs
Manually set up performance strategies by say bulk loads, parallel processes
Manually set up recoverability/re-runnability mechanism by logging incremental data updates in target database based on source database time zone dates
Set up mechanism to load history.

Common team challenges in the above typical ETL build process:

No set ETL development standards on how to design performance, logging, commenting, naming, audit, error handling, etc. Team members usually look at older designs or code it per their understanding
Inability to identify issues with the older ETL design functions
Inability to allocate time for detailed code review sessions for all projects
Inability to accommodate all UAT testing with IT maintainability test scenarios along with business test cases
Quick-fix development approach in most situations
Interest in loading huge historical data set without holistically considering implications for retrievals with table partitioning for all reporting use cases

These issues typically land in a complex ETL builds mesh:

Although the code has ETL best practices implemented, many loads end up with the below discrepancies:

Modularization scope does not remain consistent. Multiple functionalities end up being clubbed together in one module occasionally
Multiple FTP server locations are set up per request to load and drop files to be picked up by application integration tools
Missing comments on business rules
Multiple log tables on a schema implemented for different projects
Name variations in ETL build types:

1. Load to “Stage/Raw/Detail/Executor” tables from source tables

2. KPI builder jobs load to “Aggregators/KPI” tables

3. Loads from/to webservices follow varied table name conventions

4. Module naming conventions are also not consistent

You can imagine the effort spent to maintain these loads! It is not usable or maintainable.

To summarize, the main challenge is the ETL build team’s time is spent more as a technician rather than a mentor.These challenges can be overcome by improving maturity around ETL builds.

ETL Maturity Level

When it comes to platform maturity levels, most of the ETL platforms are managed. One of the key steps to a stable standardized platform is by implementing benchmarked functionality modules or build templates.

Recommended ETL Build Designs

The concept is to standardize all IT processes involving error handling, audit logging, performance strategies while loading into persistent stage tables, and re-runnability on failure log tables, and to build a mechanism to load history.

*Proposed tasks for traditional ETL builds*

Possible ETL Build Approach

The idea of configuring and modularizing is not new, but segregating all enterprise ETL builds into standard functionality modules is unique.

We will talk about the proposed approach in the subsequent sections.

Build and Agree Atomic Template Types

Atomic template types can be built using metadata injection step in Pentaho Data Integrator (PDI) – community version (open-source ETL tool). Metadata injection is the dynamic ETL alternative to scaling robust applications in an agile environment.

It is used to build configurable transformation templates to get the outputs desired.

This mapping logic can look very similar on other tools. Below is an example on Pentaho Community open-source version.

*Source:* *Jens Bleuel About Kettle (PDI)*

Build Standard Functionality Types with Atomic Templates

Standard functionality types are the common functionalities used to build ETL. Some examples include:

Standard schema log update module
Shared connection file read module
Create and build FTP file module
Source tables Extractor Module (Configuration Table for Source DB/Tables)
Load to table based on CDC strategy module
Bulk insert only load to PSA table module
KPI aggregator (parameterized count) module
Tables to Table De-normalizer module

Implement Configurable Master Job That Ties Standard Functionalities

Chaining standard functionality modules in one job is not maintainable. Build a master job task that reads functionality modules from the configuration table. Build config tables for standard functionality modules (Jobs) and wrapper job parameters. Log the rows read and rows loaded for each functional module.

I would like acknowledge Veselin Davidov for a discussion on configurable PDI processes. Let’s take the idea to implement a large-scale enterprise use case for a global company.

Enterprise Use Case – PDI File Generator

Let’s take an example of an ETL use case where the requirement is to build an ETL load that generates a file. For such use cases, build a one-time file generator load that can accept configurable inputs indicating the desired file to be created.

This starts with designing a table for some of a configurable input as below:

Source connection database type, name
Sources select, from and where query
Source File type, File name
Transformation Rules
Incremental Insert field ID/name
Incremental Update field IDs/name
Incremental Delete Log ID/name
Target connection database type, name, table name
Source File type, File name

The next step is to design and build functionality modules. Here are the main ones:

Log Before File Create Module

This module is used to create audit logs for the process before a file can be generated. The module scans through a list of files configured in filegenerator_table_to_file. It calculates the next expected date and time to build a configured file and logs a data pull start date. If the expected run time of the file is greater than the load run date, a file create status field is marked continue. If not, this is marked canceled. The name of file to be created is logged in a file name field.

Build and FTP Files

This module is used to build a file and FTP it to a configured location. The below sample also caters to an additional requirement to store the file in different locations along with FTP to an internal server.

The Get Report Data step builds a file with a fully configurable code. The select statement is formed by reading parameters from the configuration table xml to connect to a database to build a file for the date range calculated. One option to use configurable connections is by utilizing a shared connection xml file. The load reads connection name as a parameter from the shared file.

The above step can be replaced with a popular idea of using the metadata injection step as proposed by Diethard Steiner and a few others.

Wrapper Job

PDI wrappers generally have a re-runnability feature that can update logs before and after a process runs.

Configuration table for reading next core module:

The below configurable transformation reads core modules from the configuration table above.

The below execute next module reads the module name from the previous transformation and executes it.

Once the load is built and reviewed, any new requirement to configure FTP file generation use cases is a matter of few steps. These are the following:

Configure a file with an insert statement to the configuration table
Create folders on FTP server to drop files
Optionally create folders on additional servers to drop file copy
Any new configuration will need to be tested to see if the select statement configured creates the files needed in the right location with proper log entries

Achievement Areas

These are some of the success areas with the process:

Reduced time by about 90% for implementing similar new requests
Reduced support documentation times by 90% for incorporating changes to standard modules
Easily repeatable PDI functionality that can be configured to run for other use cases
Easier logging by recording rows read, processed at standard functionality level
Reduced PROD issues maintenance time by logging errors at the standard functionality levels
Reduced Knowledge Transfer times by including comments for agreed business logic
Recorded “how to configure” videos to scale trainer/originator

Closing Thoughts

Traditional ETL tools still have their place when it comes to batch daily KPI reporting or for real-time sensor data reporting catering near real-time use cases. Data analytics factories need not always be the Amazons and Googles of the world to demonstrate value generation. ETL build times can be drastically reduced using simple configurable designs for timely valuable insights.

To streamline the overall goal of improving predictability of ETL builds, a little change in mindset can do the trick:

Start with a generic and scalable build thought process that can repurpose configurable standard functionalities for simple and cost effective designs
Start with implementing readable code that reduces knowledge transition times
Look for optimal designs to maintain reduced number of objects in source code repository
Follow at least one time a stringent review process for core functionality modules
Think efficiently to implement future performance scenarios

Data Topics