Data curation is highly focused on maintaining and managing metadata, rather than the database itself. Consequently, much of data curation involves communicating with data users and tracking which services or articles are actually used. Data curators not only create, manage, and maintain data, but may also help determine best practices for working with it. Data curators often work with the data in a visual format, such as charts or a dashboard, and store “objects” with attached metadata, rather than files.
The data curator bridges the worlds of Information Technology (IT) and Data Science/Business Intelligence. Massive amounts of data may be readily available, but if it is not cataloged and curated correctly, it is essentially useless. The IT department would have trouble locating and providing requested data, and data scientists, wanting to create informative and accurate reports, would end up working with the wrong data. As organizations evolve in their use of data, a data curator becomes a necessity.
The use and research of big data is still relatively new, having started in 2005 with the introduction of Hadoop. Consequently, the development of new positions to handle new responsibilities continues as the field matures. In the near future, the new position of data curator will become a necessity for some organizations. Without a data curator, data scientists and data analysts spend huge amounts of their time doing organizational work, instead of finding, preparing, and optimizing data for analysis.
A Philosophy of Organization
The pre-digital card catalogs used in libraries a few decades ago provide a good example of metadata. Essentially, metadata is data that offers information about other data. Generally speaking, metadata supplies the how, when, what, where, and why of data. Metadata is a brief amount of information, used in a cataloging system, that provides the most basic facts in summary form, making the data easier to find and track.
An active data dictionary is a centralized metadata repository, integrated with the DBMS, that provides information about data relationships, origin, usage, and format. A data dictionary used only by designers, researchers, and administrators, and not part of the DBMS software, is called a passive data dictionary; these are updated manually, with no automatic synchronization with the DBMS. A data dictionary is often organized in a spreadsheet format, with each attribute listed as a row and each column labeled as an element. Common elements included in a data dictionary are:
- Attribute Name: Each attribute is given a unique identifier (an attribute is a specification defining a feature of an object).
- Optional/Required: Indicates whether the information is required before a record can be saved.
- Attribute Type: Defines the type of data allowed in a field (e.g., date/time, text, numeric, enumerated list, boolean, or unique identifier).
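The elements above can be sketched as a small, machine-checkable data dictionary. This is a minimal illustration, not a real DBMS feature: the `customers` attributes and the `validate_record` helper are invented for the example, but each row carries exactly the three elements listed above.

```python
# A toy data dictionary for a hypothetical "customers" table.
# Each row records the attribute name, whether it is required,
# and the type of data the field accepts.
from datetime import date

data_dictionary = [
    {"attribute": "customer_id", "required": True,  "type": int},
    {"attribute": "signup_date", "required": True,  "type": date},
    {"attribute": "email",       "required": True,  "type": str},
    {"attribute": "nickname",    "required": False, "type": str},
]

def validate_record(record: dict) -> list[str]:
    """Check a record against the data dictionary; return any problems found."""
    problems = []
    for entry in data_dictionary:
        name = entry["attribute"]
        value = record.get(name)
        if value is None:
            if entry["required"]:
                problems.append(f"missing required attribute: {name}")
        elif not isinstance(value, entry["type"]):
            problems.append(f"wrong type for {name}: expected {entry['type'].__name__}")
    return problems

# "signup_date" is required but absent, so validation flags it.
print(validate_record({"customer_id": 42, "email": "a@example.com"}))
# → ['missing required attribute: signup_date']
```

In practice the dictionary would live alongside (or inside) the DBMS, but the principle is the same: the metadata describes the data, and tooling can enforce that description.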
As big data research expands, data catalogs have grown in popularity. Data catalogs develop the concept of organizing metadata by acting as both a search engine and a wiki (a server program allowing users to collaborate in creating the content for a website), and make it easier for analysts to locate the data they need.
A data catalog is available to any user as a first stop during data research and is normally hosted in the cloud or on an on-premises server. It automatically indexes data systems and, acting partly as a search engine, crawls through databases and BI systems to find the data being sought.
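The "search engine" side of a data catalog can be sketched as an index of dataset metadata matched against query words. The dataset names, owners, and tags below are invented for illustration; a real catalog would crawl live databases and BI systems rather than a hard-coded list.

```python
# A toy metadata index standing in for a crawled data catalog.
catalog = [
    {"name": "sales_2023",   "owner": "finance",   "tags": ["revenue", "quarterly", "orders"]},
    {"name": "web_clicks",   "owner": "marketing", "tags": ["clickstream", "web", "sessions"]},
    {"name": "patient_labs", "owner": "clinical",  "tags": ["labs", "hospital", "results"]},
]

def search(query: str) -> list[str]:
    """Return names of datasets whose metadata mentions any query word."""
    words = set(query.lower().split())
    hits = []
    for entry in catalog:
        haystack = {entry["name"], entry["owner"], *entry["tags"]}
        if words & haystack:          # any overlap between query and metadata
            hits.append(entry["name"])
    return hits

print(search("quarterly revenue"))  # → ['sales_2023']
```

The point is that the analyst searches metadata, not the data itself; the catalog only has to answer "where is the data I need?", which is why good curation of that metadata matters.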
The data curator is a person who takes the organization of metadata to the next level and works with data dictionaries and data catalogs. The curator needs to have a good understanding of the systems storing the data, and the tools available for processing the data. Up-to-date knowledge about datasets, databases, and data curation is necessary. The data curator also understands the various types of analysis performed, as well as the expectations of data scientists and management. Ultimately, the data curator helps data scientists to be more productive.
Data Curators Streamline the Analytics Process
Data curators fill the gap between data scientists and data engineers. They typically understand the data and the analytics workloads better than the data engineers do, because they work more closely with management and marketing.
Data scientists find meaning in data, but rely on IT to provide the data. It is normal for data scientists to begin an analytics project by initiating a work request with IT. The request describes the data required for the project, as well as detailed formatting requirements, update frequencies, and the tools they need to perform the analysis. IT then assigns the request to a data engineer, who checks for any additional requirements, and then finds the requested data.
However, if the data isn’t organized, there is often a fair amount of confusion as data scientists attempt to communicate their needs to the IT department. Data engineers come with an understanding of infrastructure, and data scientists understand the meaning of the data, but without organized data, the two groups have trouble communicating their needs. The data curator provides a system that allows IT and data scientists to work together smoothly and efficiently (most of the time).
Tools for Data Curators
As organizations adapt to include big data, data curators become a necessity in making organizations and individuals more efficient and productive. They provide a service within the organization. Data curators have a variety of tools and websites available for their work:
- Digital Curation Resources: A catalog of tools for digital curators and data creators.
- DCC Tools: A collection of curation and Data Management tools.
- OpenRefine: A free, open-source tool for working with complicated, messy data: cleaning it, transforming it between formats, extending it with web services, and linking it to external databases.
- The DMPTool: A free, open-source, online application for creating the Data Management plans that funding agencies require during grant proposal submission.
- The Qualitative Data Repository (QDR): Curates, preserves, publishes, and promotes the download of digital data in the social sciences. The repository provides guidance for managing, citing, and using qualitative data.
- re3data.org: A registry for finding and sharing data across more than 2,000 research data repositories.
Data Curation vs. Content Curation
Data curation involves organizing the data of a business, hospital, or some other organization. Content curation, on the other hand, involves gathering relevant, useful information from other websites and sharing it by way of links, to improve on the visitors’ experience.
Content curation provides links to other articles or resources. It “refers” visitors to articles or information of interest. It’s a simple way to provide engaging material that was created on another website. Curating content allows a website to cover a much broader range of topics with minimal effort. Curated content can be combined with an introduction or an opinion.
Metadata and Machine Learning
Metadata extraction and metadata insights lay the foundation for machine learning (ML) models. Once a model has been adequately trained, it can be used to provide faster searches and responses. A search done using a traditional, hierarchical, “file” scheme is inefficient and clumsy, because a file-based approach to finding data carries essentially no metadata. Metadata-driven data curation, by comparison, is remarkably efficient.
Data curation manages data as objects and provides an exceptional option for the storage of unstructured data. An object storage platform takes the totality of the data, whether it is a document, an image, a video, or any other unstructured data, and stores it as a single object. The metadata travels with that object, carrying descriptive information about it alongside the data itself.
Metadata is anchored within captured data, or objects. As a result, object storage enables “versioning” — an important feature in Machine Learning training. Using this unique feature to store objects, data scientists can version their data, allowing their collaborators to reproduce the results later. This versioning feature helps shorten research time and obtain desired results faster. It also promotes reproducible machine learning pipelines, as well as validating data reliability.
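The object-plus-metadata and versioning ideas above can be sketched with a toy in-memory store. This is an illustrative sketch, not a real object-storage API: the `ObjectStore` class and its `put`/`get` methods are invented here, but the behavior (each write creates a new version, older versions stay retrievable) mirrors the versioning feature described above.

```python
# A toy versioned object store: each put() adds a new version of an
# object together with its metadata, and older versions remain
# retrievable, so a collaborator can reproduce earlier results.
import hashlib

class ObjectStore:
    def __init__(self):
        self._versions = {}   # key -> list of (version_id, metadata, data)

    def put(self, key: str, data: bytes, metadata: dict) -> str:
        """Store a new version of the object; return its version id."""
        version_id = hashlib.sha256(data).hexdigest()[:12]
        self._versions.setdefault(key, []).append((version_id, metadata, data))
        return version_id

    def get(self, key: str, version_id=None):
        """Fetch the latest version, or a specific pinned version."""
        versions = self._versions[key]
        if version_id is None:
            return versions[-1]
        for v in versions:
            if v[0] == version_id:
                return v
        raise KeyError(version_id)

store = ObjectStore()
v1 = store.put("train.csv", b"a,b\n1,2\n",       {"source": "export", "rows": 1})
v2 = store.put("train.csv", b"a,b\n1,2\n3,4\n",  {"source": "export", "rows": 2})
print(store.get("train.csv")[1]["rows"])       # latest version → 2
print(store.get("train.csv", v1)[1]["rows"])   # pinned earlier version → 1
```

A data scientist who records the version id used for a training run can hand that id to a collaborator, who then retrieves bit-for-bit the same data, which is what makes the pipeline reproducible.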