Data management is faced with a large and growing lexical challenge. Many terms are created that do not have roots, prefixes, or suffixes applicable to data management. Many terms are used, misused, abused, and abandoned when their usefulness is over. Many terms are undefined or multi-defined and have no denotative definitions that are readily accepted. People are throwing out improper and undefined terms, and then promoting those terms without any idea what they really mean.
The most inappropriate term of late is ‘unstructured data.’ The term originated with database technicians, who used it for data that could not be processed by a Structured Query Language (SQL). Since the operative term in SQL is ‘Structured,’ data that could not be processed by SQL must be unstructured, and hence the term ‘unstructured data.’ In other words, the data were unstructured with respect to SQL! To make the situation worse, data that were considered highly structured, meaning more structured than the tabular data processed by SQL, were renamed ‘semi-structured data’ so the term would fall between structured data and unstructured data.
The term ‘unstructured data’ has been used prolifically over the past several years, but has not been defined with a comprehensive, denotative, readily accepted definition. Like many other terms in data resource management, ‘unstructured data’ is part of a large and growing lexical challenge.
The first major problem with the term ‘unstructured data’ is that the definition is based on what the data are not, rather than what the data are. In other words, ‘unstructured data’ are data that are not readily processed by SQL. No wonder data management has a lexical challenge. Imagine if all data in the data resource were defined based on what they are not, rather than what they are. Today’s problem of data with minimal meaning would result in a gigantic problem trying to determine the real meaning based on a definition of what the data are not.
The second major problem with the term ‘unstructured data’ is that the people who created the term and are using the term haven’t consulted Webster’s dictionary, or any dictionary for that matter. The term ‘unstructured’ means without structure or having no structure. In fact, ‘unstructured data’ do have a structure, but it’s a structure that cannot be processed by SQL. Therefore, the term is profoundly wrong and should be abandoned.
The question then becomes: what should the term or terms be for data that are structured in a manner that cannot be processed by SQL? The best place to start is with Webster’s dictionary, or a similar reputable dictionary, and to develop terms that reflect the intricacy of the data structure, as determined by the definable relationships within the data.
Unstructured means not structured, having few formal requirements, or not having a patterned organization; without structure, having no structure, or structureless. Extending that definition to unstructured data means that the data are not structured, have few formal requirements, or do not have a patterned organization. In other words, unstructured data are an amorphous mess without any structure. They have no definable relationships.
Structured means something arranged in a definite pattern or organization; manner of construction; the arrangement of particles or parts in a substrate or body, arrangement or interrelation of parts as dominated by the general character of the whole; the aggregate of elements of an entity in their relationships to each other, the composition of conscious experience with its elements and their combination.
Structured data are more intricately structured than unstructured data. They are structured by tables, rows, and columns that are stored in traditional database management systems and can be readily processed by structured query languages. They are often referred to as tabular data.
For example, employee data with employees, their attributes, and their paychecks with attributes are structured data. Similarly, students taking classes and receiving degrees, motor pool vehicles with trips and maintenance, and facilities with rooms and equipment are typical structured data.
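The employee and paycheck example can be sketched in a few lines of code. The following is only an illustration, assuming Python's built-in sqlite3 module and hypothetical table names, column names, and sample rows that are not drawn from any real system. It shows tabular data, with rows and columns, being readily processed by a structured query language.

```python
import sqlite3


def build_employee_db():
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    # Each table is a set of rows and columns: the tabular structure.
    conn.execute(
        "CREATE TABLE employee ("
        "employee_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE paycheck ("
        "paycheck_id INTEGER PRIMARY KEY, "
        "employee_id INTEGER NOT NULL REFERENCES employee(employee_id), "
        "amount REAL NOT NULL)"
    )
    # Sample rows invented for illustration only.
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    conn.executemany(
        "INSERT INTO paycheck VALUES (?, ?, ?)",
        [(10, 1, 1000.0), (11, 1, 1200.0), (12, 2, 900.0)],
    )
    return conn


def total_pay_by_employee(conn):
    """Follow the employee-to-paycheck relationship with a SQL join."""
    query = (
        "SELECT e.name, SUM(p.amount) "
        "FROM employee e JOIN paycheck p ON p.employee_id = e.employee_id "
        "GROUP BY e.name ORDER BY e.name"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(total_pay_by_employee(build_employee_db()))
```

The relationship between an employee and a paycheck is a single definable relationship carried by one shared column, which is why a structured query language handles it so readily.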
The term ‘semi-structured data’ was a loosely defined term that represented a data structure between structured data and unstructured data. Taken literally, it means the data are partially structured and partially unstructured, a definition that is basically meaningless. An excellent replacement term that provides more meaning is ‘highly structured data.’
Highly structured data are more intricately structured than structured data and cannot be readily processed by structured query languages. They have more intricate definable relationships that often stretch the capabilities of structured query languages. Highly structured data can be analyzed to reduce that intricate structure to simpler structures for processing by structured query languages and tools, or by other languages and tools.
For example, documents contain text that is richly structured at the physical, grammatical, and semantic levels. The physical structure consists of chapters, sections, paragraphs, sentences, and so on, with an overlying structure of pages. The grammatical structure is subjects, verbs, adverbs, adjectives, prepositions, and so on, that form sentences. The semantic structure is the meaning, foundation, introduction, presentation, conclusion, precedent, and so on, that are portrayed by the physical and grammatical structure.
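The physical level of that structure can be made visible with a small sketch. The following assumes Python, a hypothetical function name, and two deliberately naive rules: blank lines separate paragraphs, and a period, exclamation point, or question mark followed by whitespace ends a sentence. It recovers only part of the physical structure and says nothing about the grammatical or semantic levels, which is one reason such data stretch simple query tools.

```python
import re


def physical_structure(text):
    """Split text into paragraphs, and each paragraph into sentences.

    Naive assumptions: paragraphs are separated by blank lines, and a
    sentence ends at '.', '!' or '?' followed by whitespace.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        for paragraph in paragraphs
    ]


if __name__ == "__main__":
    sample = "Data have structure. Text does too!\n\nIs that surprising?"
    for number, sentences in enumerate(physical_structure(sample), start=1):
        print(number, sentences)
```

The result is a nested list, paragraphs containing sentences, rather than a flat table of rows and columns. That nesting is a simple picture of the more intricate relationships described above.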
Complex means composed of two or more parts; having a bound form; hard to separate, analyze, or solve; a whole made up of complicated or interrelated parts; a composite made up of distinct parts; intricate as having many complexly interrelating parts or elements.
Complex structured data are more intricately structured than highly structured data. They are any data that are composed of two or more intricate, complicated, and interrelated parts, that may include relationships between relationships, and cannot be interpreted by structured query languages. These complex structures can be broken down into the individual component structures that are more easily processed.
For example, voice is text with tonal inflections which are more intricately structured than pure text. Add the video of the person behind the voice with all the non-verbal mannerisms, such as body movements, facial expressions, eye movements, and so on, which are more intricately structured than the voice. Moving from text, to voice, to video is moving up the sequence of more intricately structured data.
Weather is another example of complex structured data. The known relationships, such as jet streams, ocean currents, and other drivers of weather and climate, are very intricate. The new field of cosmoclimatology, with coronal mass ejections, solar proton events, gamma ray bursts, inter-stellar clouds, galactic superwaves with an electromagnetic pulse, solar flares, galactic cosmic rays, and so on, involves unknown relationships that likely drive weather and climate. Even the Earth’s reversing magnetic field is believed to have a relationship to weather and climate.
Terms like poly-structured data and multi-structured data have been used in place of complex structured data, but with reference to database management systems. They are not used with reference to an organization’s data resource. Those terms are poor replacements for complex structured data.
Ultra means going beyond others or beyond due limit; extreme; beyond the range or limits of; transcending; beyond what is ordinary, proper, or moderate; extreme; excessive.
Ultra-structured data are the most intricately structured data with interactions and relationships that are near or beyond the limits of human comprehension. These ultra-structured data need to be broken down into simpler structures to be more easily processed.
For example, social interactions, biochemical interactions, neurology, genomics, quantum mechanics, and so on, are ultra-structured data. The genetic code in DNA, how that code is interpreted to trigger cell differentiation, how it is interpreted to produce amino acids and proteins, and so on, are intricate relationships at the limits of human understanding. Stephen Wolfram in A New Kind of Science makes the case that cellular automata could be the key to breaking the genetic code, which would make the relationships understandable.
The progression from unstructured, to structured, highly structured, complex structured, and ultra-structured data forms a continuum of data structuring. The terms presented above are labels for broad groupings of that continuum that have fuzzy boundaries depending on a person’s interpretation. However, the broad groupings are an excellent way to discuss the increase in degree of data structuring, the relationships involved in data structuring, and the methods of processing those different degrees of data structuring.
In addition, a set of data can move between the broad groupings depending on how well the relationships are comprehended and understood. For example, the data about weather and climate, or the data about the DNA code, could move from the ultra-structured data grouping to the complex structured data grouping based on an increased understanding of the relationships.
Data management professionals must create and promote the use of proper terms that are comprehensively and denotatively defined and readily accepted. The current state of a large and growing lexical challenge is not appropriate for developing and promoting a formal, certified, recognized, and respected data management profession. Only by developing and promoting the proper terms can a formal data management profession ever be developed. An excellent place to start developing the proper terms is with the different degrees of data structuring.