
Data Warehouses Should Stage Source Data

By Michael Blaha

A data warehouse is a database that is dedicated to data analysis and reporting. It combines data from multiple operational applications and provides one location for decision-support data. A warehouse should have one staging table for each source table or file.

Warehouse Data Flow

As the diagram shows, day-to-day operational databases and external sources provide the raw data. Special logic monitors the data sources for changes and prepares periodic extracts of new data.

Staging tables accumulate the history of input data from the periodic extracts. The staging tables do not modify the input data. They merely accumulate it along with some system metadata. There is one staging table for each source table or file.

Data may then flow to an operational data store (ODS). The purpose of the ODS is to integrate overlapping source data. The ODS provides a single source of normalized operational data. Thus, if there are two sources of account information, there are two staging tables but only one account table in the ODS.

Ultimately the data reaches the data warehouse, either directly from staging or via an ODS. The warehouse restructures the data into facts and dimensions – a format suitable for data mining and decision-support queries. For example, an account dimension would combine one or more account data source tables.
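
As a sketch of what that restructuring produces (table and column names here are illustrative, not taken from the article), a minimal star schema might pair an account dimension with a transaction fact table. SQLite via Python's sqlite3 module is used only to keep the example runnable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Illustrative star schema: an account dimension plus a fact table
# of account transactions keyed to it by a surrogate key.
cur.executescript("""
CREATE TABLE DIM_ACCOUNT (
    account_key   INTEGER PRIMARY KEY,  -- surrogate key
    account_id    TEXT,                 -- business key from the source
    account_name  TEXT,
    source_system TEXT                  -- which source supplied the row
);
CREATE TABLE FACT_TRANSACTION (
    transaction_key  INTEGER PRIMARY KEY,
    account_key      INTEGER REFERENCES DIM_ACCOUNT(account_key),
    transaction_date TEXT,
    amount           REAL
);
""")

cur.execute("INSERT INTO DIM_ACCOUNT VALUES (1, 'A-100', 'Checking', 'web_app')")
cur.execute("INSERT INTO FACT_TRANSACTION VALUES (1, 1, '2024-01-15', 250.0)")

# A typical decision-support query: total amount per account.
rows = cur.execute("""
    SELECT d.account_name, SUM(f.amount)
    FROM FACT_TRANSACTION f JOIN DIM_ACCOUNT d USING (account_key)
    GROUP BY d.account_name
""").fetchall()
print(rows)  # [('Checking', 250.0)]
```

The dimension carries descriptive attributes; the fact table carries the measures that analysts aggregate.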

[Figure: warehouse data flow from operational sources through staging tables and an ODS to the data warehouse]

An Example

Consider a customer account application. Customers can use a web interface to view and edit their own data. Here’s a table from the application.

[Figure: the customer account source table]

Here’s the corresponding staging table. The ‘STG’ prefix in the table name denotes a staging table. ‘CA’ indicates that the staging table is for the customer account application. The last portion of the staging table name is the name of its source table.

[Figure: the staging table for the customer account source table]

The staging table adds several system attributes to the source schema.

  • record_key – the primary key of the staging table. The record_key is a surrogate key, which is a unique generated integer.
  • effective_date – when the staging record takes effect.
  • expiration_date – when the staging record no longer applies. A current record is indicated with an expiration_date far into the future, such as 12/31/9999.
  • checksum_indicator – an error check. If checksum_indicator matches the computed checksum, the record is unlikely to be corrupted.
  • batch_load_key – the processing batch for the staging record.
  • current_row_flag – true if the staging record is the most recent copy of the source record. The current record is also indicated with a future expiration_date.

There can be multiple staging records for each operational primary key due to the accumulation of history. For example, a series of staging records, each with its own record_key, can share the same customer_account_ID. The effective_date and expiration_date indicate the timespan of a staging record.
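
A minimal sketch of this accumulation, assuming an illustrative customer account source with customer_account_ID and email columns (the system attributes follow the article; everything else is hypothetical):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical staging table: 'STG' prefix, 'CA' for the customer account
# application, then the source table name. The source columns
# (customer_account_ID, email) are illustrative.
cur.execute("""
CREATE TABLE STG_CA_CUSTOMER_ACCOUNT (
    record_key          INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_account_ID TEXT,
    email               TEXT,
    effective_date      TEXT,
    expiration_date     TEXT,
    checksum_indicator  TEXT,
    batch_load_key      INTEGER,
    current_row_flag    INTEGER
)""")

FAR_FUTURE = "9999-12-31"  # a current record has a far-future expiration_date

def load(acct_id, email, load_date, batch):
    """Accumulate one extract record: expire the prior current row, insert the new one."""
    cur.execute("""UPDATE STG_CA_CUSTOMER_ACCOUNT
                   SET expiration_date = ?, current_row_flag = 0
                   WHERE customer_account_ID = ? AND current_row_flag = 1""",
                (load_date, acct_id))
    checksum = hashlib.md5(f"{acct_id}|{email}".encode()).hexdigest()
    cur.execute("""INSERT INTO STG_CA_CUSTOMER_ACCOUNT
                   (customer_account_ID, email, effective_date, expiration_date,
                    checksum_indicator, batch_load_key, current_row_flag)
                   VALUES (?, ?, ?, ?, ?, ?, 1)""",
                (acct_id, email, load_date, FAR_FUTURE, checksum, batch))

# Two periodic extracts for the same operational key yield two staging rows.
load("C-42", "old@example.com", "2024-01-01", 1)
load("C-42", "new@example.com", "2024-02-01", 2)

rows = cur.execute("""SELECT effective_date, expiration_date, current_row_flag
                      FROM STG_CA_CUSTOMER_ACCOUNT ORDER BY record_key""").fetchall()
print(rows)  # [('2024-01-01', '2024-02-01', 0), ('2024-02-01', '9999-12-31', 1)]
```

Note that the source data itself is never modified: each load only expires the prior current row and appends a new one, so the full history survives.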

Benefits of Staging Data

The staging tables provide several benefits.

  • Baseline. The staging tables serve as a buffer to isolate the warehouse from the source systems. This isolation can become important when warehouse processing is interrupted and has to be restarted.
  • Auditability. The data sources evolve according to operational needs. The staging tables capture source data at the time of each extract. Auditability is important when there is a question of lineage for a warehouse data element. Staging tables permit strict traceability from user analytics back through to source data.
  • Prototyping. Developers can use SQL to access staging data and test different warehouse structures. Such code enables agile warehouse development. Simple SQL can be left as is. Complex SQL can be hardened into ETL once it is vetted by the business.
  • Warehouse evolution. Developers can reprocess warehouse data upon restructuring. This is especially helpful as the warehouse evolves with agile development.
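
The prototyping benefit can be sketched as follows. Assuming two hypothetical staging tables for overlapping account sources (names and columns are our own illustration), a developer can merge their current rows with plain SQL to try out a candidate account dimension before committing to ETL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two hypothetical staging tables for overlapping account sources.
cur.executescript("""
CREATE TABLE STG_CA_ACCOUNT (account_id TEXT, name TEXT, current_row_flag INTEGER);
CREATE TABLE STG_BILLING_ACCOUNT (account_id TEXT, name TEXT, current_row_flag INTEGER);
INSERT INTO STG_CA_ACCOUNT VALUES ('A-1', 'Alpha', 1), ('A-1', 'Alfa', 0);
INSERT INTO STG_BILLING_ACCOUNT VALUES ('A-2', 'Beta', 1);
""")

# Prototype an account dimension by merging the current rows of both sources.
rows = cur.execute("""
    SELECT account_id, name, 'CA' AS source
    FROM STG_CA_ACCOUNT WHERE current_row_flag = 1
    UNION ALL
    SELECT account_id, name, 'BILLING'
    FROM STG_BILLING_ACCOUNT WHERE current_row_flag = 1
    ORDER BY account_id
""").fetchall()
print(rows)  # [('A-1', 'Alpha', 'CA'), ('A-2', 'Beta', 'BILLING')]
```

Once the business vets a query like this, it can be hardened into ETL; until then it remains cheap, disposable SQL.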

Techniques for Staging Data

There are two techniques for staging data. One approach is to write ETL code to load staging data.

Another approach is to use automation. A2Bdata is the only such tool of which we are aware. A2Bdata has two aspects. It generates staging schema by reading the database catalog for source tables. In addition it can load periodic extracts of source data into staging tables.
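
The catalog-driven generation of staging schema can be sketched in a few lines. This is our own illustration of the general technique, not A2Bdata's implementation; the source table and application code are hypothetical, and SQLite's catalog (PRAGMA table_info) stands in for a real database catalog:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE CUSTOMER_ACCOUNT (customer_account_ID TEXT, email TEXT)")

# System attributes appended to every generated staging table (per the article).
SYSTEM_COLUMNS = [
    ("record_key", "INTEGER PRIMARY KEY"),
    ("effective_date", "TEXT"),
    ("expiration_date", "TEXT"),
    ("checksum_indicator", "TEXT"),
    ("batch_load_key", "INTEGER"),
    ("current_row_flag", "INTEGER"),
]

def generate_staging_ddl(source_table, app_code):
    """Read the catalog for a source table and emit staging-table DDL."""
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    cols = cur.execute(f"PRAGMA table_info({source_table})").fetchall()
    source_cols = [(name, ctype) for _, name, ctype, *_ in cols]
    all_cols = SYSTEM_COLUMNS[:1] + source_cols + SYSTEM_COLUMNS[1:]
    body = ",\n    ".join(f"{name} {ctype}" for name, ctype in all_cols)
    return f"CREATE TABLE STG_{app_code}_{source_table} (\n    {body}\n)"

ddl = generate_staging_ddl("CUSTOMER_ACCOUNT", "CA")
cur.execute(ddl)  # the generated DDL is valid SQL
print(ddl)
```

Generating the staging schema this way keeps it mechanically in sync with the source tables as they evolve.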

In Conclusion

Staging tables provide a buffer between the warehouse and the data sources. There is one staging table for each source table or file. Staging tables duplicate source schema and add system attributes. Staging tables hold a history of data from their operational counterparts. Updates in operational data lead to new staging records.

Sadly, many warehouse initiatives lack staging data. There is no good reason for omitting staging data in a substantial deployment. Staging data delivers the benefits listed above; the only downside is the effort of creating and maintaining it, which can be mitigated with automation.
