Data Architecture and Data Science: What Is the Intersection?

Data Science, in practice, should ultimately combine the best practices of information technology, analytics, and business. Data Architecture, in turn, enables data scientists to analyze and share data throughout the enterprise for strategic decision-making. Without a sound Data Architecture in place, data scientists are severely limited in their ability to develop and productionize data models. This is the primary point of intersection between Data Architecture and Data Science.

However, both Data Science and Data Architecture specialists need to have a sound understanding of business issues before they can design a model-development and testing environment for business use. An IBM developer explores the architectural thinking embedded in Data Science.

According to ScienceDirect, Data Architecture accomplishes the following two goals for enterprise Data Science teams:

It allows “strategic development” of data models by “insulating the data from the business as well as the technology process.”
It provisions an “environmental foundation” for ensuing model-development activities with approval from the data owner.

Thus, it is logical to assume that the data architect and the data scientist play complementary roles in an enterprise Data Science team.

The Data Architect and the Data Scientist: Complementary Roles

Though Data Science and Data Architecture have multiple cross-over points in actual practice, the data architect is more of an authority on hardware technologies, while the data scientist is an expert in mathematics, statistics, or software technologies. The data architect translates business requirements into technology requirements, defines data standards and principles, and builds the model-development framework for data scientists to use. Data scientists apply principles of computer science, mathematics, and statistics to build models.

An Enterprise Data Architecture is multi-layered, typically beginning at the data-source layer and ending at the “information delivery layer.” Diverse experts may therefore be involved in architecting the various layers of a complex Data Architecture, which includes the underlying hardware, operating system, data storage, and the data warehouse. The modern data architect is often a multi-skilled individual, with expertise in data warehouses, relational databases, NoSQL, streaming data flows, containers, serverless computing, and microservices. Although newer technologies are surfacing on the data-technology landscape every day, technology vendors are still waiting to see their widespread adoption in businesses.

At the outer layer of information delivery, the data scientist is certainly in charge. This DZone article about Data Science for Modern Data Architecture explains how Data Science controls predictive analytics.

Data Privacy Act: Data Architecture Provisions Secure Storage for Data and Models

In the post-GDPR era, Data Architecture has the additional role of provisioning secure storage for both historical data and built models so that they can be audited periodically. This makes Data Architecture even more significant for Data Science practice, and it is the second point of intersection between Data Architecture and Data Science.

The data assets in an organization will cease to be assets unless data-privacy issues are integrated into the Data Architecture framework. Moreover, data version control will soon become a standard feature of enterprise Data Architecture. As much as it triggers a sense of excitement among modern data scientists, the Data Privacy Regulation also signals a new era of additional compliance for Data Science practice.
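To make the idea of versioned data and models for audit purposes more concrete, here is a minimal sketch in Python. It is only an illustration, not a description of any particular product: the `AuditRecord` type, the `make_audit_record` helper, and the sample bytes are all hypothetical, and the sketch assumes that a content hash (SHA-256) is an acceptable way to identify a version.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical audit entry linking a model to the exact data it was built from."""
    dataset_version: str  # SHA-256 digest of the training data bytes
    model_version: str    # SHA-256 digest of the serialized model bytes
    purpose: str          # free-text note for the auditor


def content_hash(payload: bytes) -> str:
    """Return a hex digest that changes whenever the payload changes."""
    return hashlib.sha256(payload).hexdigest()


def make_audit_record(data: bytes, model: bytes, purpose: str) -> AuditRecord:
    """Build an audit entry from raw data bytes and serialized model bytes."""
    return AuditRecord(content_hash(data), content_hash(model), purpose)


# Sample values below are made up for illustration only.
record = make_audit_record(b"id,age\n1,34\n", b"model-weights-v1", "periodic audit")
print(json.dumps(asdict(record), indent=2))
```

Because the version identifiers are derived from the content itself, an auditor could later recompute the hashes from the stored data and model and confirm that neither has changed since the record was written.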

Big Data Architecture and Data Science: Where Is the Point of Intersection?

Why Big Data Architects Are Not Data Scientists stresses that software skills are not enough to build solid big data development architectures (referred to as “infrastructure” in the post). A lot has matured in the hardware-technology space, and data scientists are rarely equipped to handle the advanced hardware (environmental) requirements of typical big data projects.

Big data projects are generally deployed on cloud systems, and big data architects are expected to understand both big data technology frameworks and hardware environments to be effective in actual, result-driven projects. These senior team members are often brought in to convince clients during pre-project buy-in sessions. Big data architects have a unique blend of superior statistical skills, programming language skills, and presentation skills, in addition to a sound understanding of hardware environments. Big Data Architecture is another situation where Data Science and traditional Data Architecture (data engineering) intersect.

The Data Engineer as the “Architect” of Data Science Teams

An Altexsoft post describes the Data Engineer’s role:

“In a multidisciplinary team that includes data scientists, BI engineers, and data engineers, the role of the data engineer is mostly to ensure the quality and availability of the data.”

The data engineer ensures that the gathered data is prepared for analysis and that the analytics infrastructure is ready for use by data scientists. In that sense, the data engineer plays the role of “chief architect,” readying the data environment for further analysis. This may be thought of as a point of intersection between Data Science and Data Architecture.
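As a rough illustration of what "ensuring the quality and availability of the data" might look like in code, the Python sketch below checks a batch of records before it is handed to data scientists. The field names in `REQUIRED_FIELDS` and the `readiness_report` function are hypothetical, and a real pipeline would likely apply many more rules than this one.

```python
from typing import Any

# Hypothetical fields that the analysis is assumed to depend on.
REQUIRED_FIELDS = ("customer_id", "signup_date")


def readiness_report(rows: list[dict[str, Any]]) -> dict[str, Any]:
    """Count missing required values and decide whether the batch is usable."""
    missing = {field: 0 for field in REQUIRED_FIELDS}
    for row in rows:
        for field in REQUIRED_FIELDS:
            # Treat an absent key, None, or an empty string as missing.
            if row.get(field) in (None, ""):
                missing[field] += 1
    return {
        "row_count": len(rows),
        "missing": missing,
        # Ready only if there is data and no required value is missing.
        "ready": len(rows) > 0 and all(count == 0 for count in missing.values()),
    }


# Made-up sample batch: the second row lacks a signup date.
batch = [
    {"customer_id": 1, "signup_date": "2020-01-01"},
    {"customer_id": 2, "signup_date": ""},
]
print(readiness_report(batch))
```

A report like this gives the data engineer a simple gate: a batch that is not marked ready can be held back and repaired before any model is trained on it.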

A KDnuggets post describing the explicit differences between data scientists and data engineers clarifies that although the data scientist is primarily the “data analyst” and the data engineer is chiefly responsible for preparing the data pipelines for analysis, there is some overlap between the two roles, especially in implementing machine learning algorithms in the production stage.

The Next-Generation Data Architecture for the Explorer Data Scientist

On-demand Data Architectures will enable “ad hoc or on-demand” data access for “exploratory” Data Science in the future. In exploratory use cases, the data scientist will want to access data “whenever” and from “wherever” (the platform of choice). In next-generation Data Architectures, concepts borrowed from “self-service data preparation” and “data virtualization” will be used by citizen data scientists and business analysts.

At the DATAVERSITY® Data Architecture Summit, Donna Burbank, a leading Data Strategy expert with more than 20 years of experience, made a presentation titled Emerging Trends in Data Architecture, in which she warned that technology was changing so rapidly that it may become “challenging to keep up with the latest innovations in Data Architecture.” She also hosts a monthly webinar series on the same topic.

The Healthy Data Science Organization Framework

The Healthy Data Science Organization Framework is a set of guiding principles on the data-analysis process that data scientists can use to cultivate and nurture a healthy analytics mindset. The framework is designed to help the organization develop a better understanding of its business, data generation, data modeling, model deployment, and overall Data Management practices.

