by Angela Guess
Bradley Fordham recently wrote an article for Art+Data in which he argued, “When it comes to analytics, in particular for product ideation and optimization, listening to what the data does not say is often as important as listening to what it does. There can be various types of ‘silences’ in data that we must get past to take the right actions. Here I will focus on the most common. Frequently very large data sets will have a proportionately small number of items that will not ‘parse’ (be converted from raw data into meaningful observations with semantics or meaning) in the standard way. A common response is to ignore them under the assumption there are too few to really matter.”
He continued, “The problem is that oftentimes these items fail to parse for similar reasons and therefore bear relationships to each other. So, even though it may only be .1% of the overall population, it is a coherent sub-population that could be telling us something if we took the time to fix the syntactic problems. Do not allow syntactically inconsistent data to be silent. In real data sets, we often find semantic discrepancies (differences in meaning) from one item to the next where we expect similarity. A common example is ‘omission values’. Some items may have a zero, some may have the special value NULL, some may have blanks, some may have user-entered values such as ‘?’ or ‘N/A’. Do these all mean the same thing or not for our analysis? Another place semantic gaps often form is in the relationships between data items/records.”
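
To make the two "silences" Fordham describes more concrete, here is a minimal Python sketch. It is not taken from his article. The marker set, the function names (`omission_kind`, `shape`, `audit`) and the sample values are all hypothetical, and it assumes the field being checked is meant to hold a number. Rather than discarding awkward values, it counts each kind of omission separately and groups the values that fail to parse by their rough shape, so a coherent sub-population becomes visible.

```python
import re
from collections import Counter

# Hypothetical markers treated as "omitted". Whether they really mean the
# same thing is an analysis decision that code alone cannot settle.
OMISSION_MARKERS = {"", "?", "n/a", "null"}


def omission_kind(value):
    """Return a label for the kind of omission, or None if a value is present."""
    if value is None:
        return "NULL"
    text = str(value).strip()
    if text.lower() in OMISSION_MARKERS:
        return "blank" if text == "" else text.upper()
    if text in {"0", "0.0"}:
        # A zero may be a real measurement or a stand-in for "missing".
        return "zero"
    return None


def shape(raw):
    """Reduce a raw value to a rough pattern: digits become 9, letters become a."""
    pattern = re.sub(r"\d", "9", str(raw))
    return re.sub(r"[A-Za-z]", "a", pattern)


def audit(raw_values):
    """Count omission kinds and group unparseable values by shape."""
    omissions = Counter()
    failures = Counter()
    for raw in raw_values:
        kind = omission_kind(raw)
        if kind is not None:
            omissions[kind] += 1
            continue
        try:
            float(raw)
        except ValueError:
            # Keep the failure instead of silently dropping it.
            failures[shape(raw)] += 1
    return omissions, failures


values = ["12.5", "0", None, "", "?", "N/A", "1,200", "3,450", "7"]
omissions, failures = audit(values)
print(dict(omissions))
print(dict(failures))
```

In this made-up sample, the two values that fail to parse share the same shape (a thousands separator), which is the kind of related failure the quote warns against ignoring. The omission counts are kept apart by kind so that the question "do these all mean the same thing?" can be answered deliberately rather than by default.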