So You Want to be a Data Architect?


Being a data architect requires a good understanding of the cloud, databases in general, and the applications and programs used to maximize their potential. A fully functional data architect understands all the phases of Data Modeling, from conceptualization to database optimization. They also understand that continuing education is part of the job.

Typically, a data architect has a degree in information technology, computer science, computer engineering, or a similar field. Like an architect who creates homes or buildings, a data architect develops a blueprint representing a data system that supports an organization’s short-term and long-term goals.

A data architect should know how to:

  • Design models of data processing that implement the intended business model.
  • Develop diagrams representing key data entities and their relationships.
  • Generate a list of components needed to build the designed system.

Until recently, organizations often built architectures of a fairly standard format and called them data warehouses. However, new technologies have dramatically altered the way businesses gather information and serve their customers. Instead of reacting to events after the fact, businesses must now anticipate or predict their customers’ needs, and shifts in the market, as a way to optimize outcomes and profits. Businesses that don’t upgrade their legacy data dumps will suffer gradually decreasing profits due to their slowness and inefficiencies.

Discussing data architecture, Donna Burbank, the Managing Director at Global Data Strategy, said:

“Data Architecture, in its broadest sense, asks, ‘What are we trying to do as a business?’ And then from all the diverse technologies, ‘What’s the best fit for that purpose and how do they work together?’ What’s unique about data is that it’s partly a business role and partly a technology role. At many of the companies I visit, the first thing I do is draw a picture of their existing architecture, and you’ll see the spaghetti diagram. Then, when we’re done, there is a nice, clean Data Architecture.”

A good data architect understands their goal is to maximize the flow of data from consumers to the website, and back again. The architecture filters, defines, and stores data by using certain types of databases, programs, and applications. Data Architecture should support the organization’s goals and provide a common language for the people using it. Security, Data Governance, and the organization’s business philosophies are also considered when creating an architectural design for processing data. Ideally, a system’s architecture should help in making business decisions. The design may include an operational data store, which handles nontraditional data operations such as real-time operational reporting and refining unstructured data. The most necessary (and most requested) skills for data architects are Data Modeling and database design.

Data Modeling

A data model is a group of concepts organized into data relationships, data constraints, and data semantics. Most data models also include a set of basic operations for manipulating data in the database. Data Modeling is considered the first step in designing a database. It considers the data contained in the database (its content), the relationships between data items, and restrictions on the data. These concepts are presented broadly, and do not include implementation details. The process of data modeling creates a formal (or semi-formal) representation of the database structure.
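As a rough illustration of these ideas, the sketch below describes a tiny conceptual model in Python: two entities, one relationship, and a "required attribute" constraint. The names (Entity, Relationship, violations, Customer, Order) are hypothetical and chosen only for this example; it is a minimal sketch of the concept, not a modeling tool.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A data entity and its attributes (content only, no storage details)."""
    name: str
    attributes: list[str]
    required: set[str] = field(default_factory=set)  # constraint: must be present


@dataclass
class Relationship:
    """A named link between two entities with a cardinality such as '1:N'."""
    source: str
    target: str
    cardinality: str


def violations(entity: Entity, record: dict) -> list[str]:
    """Return the required attributes that are missing or empty in a record."""
    return sorted(a for a in entity.required if record.get(a) in (None, ""))


# Hypothetical entities, used only for illustration.
customer = Entity("Customer", ["customer_id", "name", "email"], {"customer_id", "name"})
order = Entity("Order", ["order_id", "customer_id", "total"], {"order_id", "customer_id"})
places = Relationship("Customer", "Order", "1:N")  # one customer places many orders

print(violations(customer, {"customer_id": 1, "name": ""}))  # ['name']
```

Nothing here says how or where the data is stored, which is the point: the model captures content, relationships, and restrictions before any implementation decisions are made.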

It is necessary to determine the purpose of the database, how it will be used, and who will be using it. If the database is complex or used by several different people, the design should include how and when people can use the database. Ideally, a Data Modeling project will develop its own mission statement, which can be referred to during the design process. These statements provide a focus that is communicated to all other personnel and keeps everyone on the same page.

Database Design

There are two basic principles used to guide the design of a database. The first treats redundant data (also called duplicate information) as wasteful. It wastes space and increases the chance of inconsistencies and errors (one version gets updated, the other doesn’t). The second states that the accuracy and completeness of data improve overall efficiency. Any reports based on inaccurate data from the database will contain the same incorrect information. Consequently, any decisions made using those reports could do more damage than good.

A properly designed database offers access to accurate, up-to-date information. Because an efficient design is essential to the success of a business, investing time to thoroughly research the needs of a database design is a good idea. A good database design includes:

  • Reducing redundant data by dividing all the data into subject-based tables.
  • Ensuring the accuracy and integrity of the information.
  • Supporting the data processing goals of the business.
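The first two points can be sketched with Python’s built-in sqlite3 module. The tables and columns (customers, orders, email, total) are assumptions made up for this example. Customer details live in one subject-based table, orders in another, and constraints guard accuracy and integrity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical subject-based tables: one for customers, one for orders.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    total       REAL NOT NULL CHECK (total >= 0)
);
""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'ada@example.com')")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 25.0), (11, 1, 40.0)])

# The email is stored in exactly one row, so one update is seen by every order.
conn.execute("UPDATE customers SET email = 'ada@example.org' WHERE customer_id = 1")

rows = conn.execute("""
    SELECT o.order_id, c.email
    FROM orders AS o JOIN customers AS c USING (customer_id)
    ORDER BY o.order_id
""").fetchall()
print(rows)  # [(10, 'ada@example.org'), (11, 'ada@example.org')]

# An order that points at a nonexistent customer is rejected by the database.
try:
    conn.execute("INSERT INTO orders VALUES (12, 99, 5.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

Had the email been copied onto every order row, the update would have needed to touch each copy, which is exactly the "one version gets updated, the other doesn’t" risk that redundant data creates.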

Enterprise Data Architecture

An enterprise data architecture model is basically a “strategic design model” that acts as the foundation for achieving the business’s goals. Many enterprise data models currently being used have been tailored specifically to the needs of the organization, including the use of metadata and Data Governance. The shift to enterprise data models is driven by six key business needs:

  • Democratize data (data sharing, security, quality, and governance).
  • Handle massive amounts of data in real time.
  • Support a self-service philosophy for customers and clients.
  • Shift to predictive analytics.
  • Provide greater responsiveness to online users.
  • Plan for the future (new data sources, new applications).

Cloud-Based Data Lakes

At the core of modern enterprise data architecture is the concept of integrating cloud-based data lakes.

Organizations are often blocked from using data by incompatible formats and the limitations of an old database. As a consequence, cloud-based data lakes are quickly replacing data warehouses. (One of the “continuing education” responsibilities of a data architect is to monitor the current developments within the cloud computing community.) Hybrid clouds are also becoming popular.

Data lakes, unlike data warehouses, will store all data types: unstructured, semi-structured, and structured. In a data lake, data is stored in its raw format. Because of the way data lakes are designed, data doesn’t need to be defined while being captured. Instead, it is defined when it is read, an approach often called schema-on-read. A data lake can store data from relational sources (from a database) and non-relational sources (such as social media and IoT devices). An up-front ETL (extract, transform, load) step is not required, streamlining the process of making data available for analysis. Cloud-based data lakes are extremely scalable and can support large amounts of data for a reasonable price. There is a strong possibility the data architect will be communicating and working with a more specialized cloud architect during the set-up of a cloud account.
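The idea of defining data only when it is read can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular data lake product works: the raw records, the read_with_schema helper, and the order_schema mapping are all hypothetical.

```python
import json

# Raw events captured as-is; the records do not need to share one shape.
raw_lines = [
    '{"source": "crm", "customer_id": 1, "total": "25.50"}',
    '{"source": "iot", "device": "sensor-7", "temp_c": 21.4}',
    '{"source": "crm", "customer_id": 2, "total": "40"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep records with every field, cast each value."""
    for line in lines:
        record = json.loads(line)
        if all(name in record for name in schema):
            yield {name: cast(record[name]) for name, cast in schema.items()}


# Hypothetical schema chosen by the analyst at query time, not at capture time.
order_schema = {"customer_id": int, "total": float}
orders = list(read_with_schema(raw_lines, order_schema))
print(orders)  # [{'customer_id': 1, 'total': 25.5}, {'customer_id': 2, 'total': 40.0}]
```

The sensor record is stored alongside the order records without complaint and is simply skipped by this particular read. A different schema could pick it up later without reloading anything.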

The Responsibilities of a Data Architect

While no specific path exists for becoming a data architect, a potential candidate needs extensive skills. Typically, a data architect will come with a degree in computer science, IT, or a similar field. Hands-on experience can be gained from entry-level IT jobs in database administration or programming. Years of experience are typically necessary to become a data architect. If one has the experience and skills, but lacks the degree, IBM offers a certification process that might be used in place of the degree.

A strong understanding of RDBMS and SQL systems, analytics platforms, Java and Python, ETL, Hadoop, Spark, YARN, Kafka, and other tools is necessary. A “big data architect” must have expertise in popular Hadoop distribution platforms, such as Hortonworks, Cloudera, and MapR.

Grant Case, a Senior Analytics Architect, and one of the authors of Data Architecture Basics, shared:

“Data architects can’t just focus on optimizing the technology to the exclusion of all else. If they’re not basing plans on business concerns in addition to the scale and cost constraints, then they’re not creating a truly robust data architecture.”

Craig Statchuk, a big data architect at IBM, offered some words of wisdom to people considering the field:

“The good part is you start most days in the new big data world. This includes everything within the role of a big data architect — someone who fulfills the needs of the entire enterprise beyond IT. In effect, this role is about taking care of more users in more places. Therefore, the pro is that most days you’ll start with a clean slate. You may not know what the day holds for you, but by lunchtime, you’ll have a long list of things to work on, to create, and hopefully resolve in a short period of time. There’s a lot of value placed on immediate results.”
