By Jay Zaidi
Because of the proliferation of software tools within companies, end users tend to cringe when there is talk of yet another tool they should adopt for data analysis, data validation, and data transformation. They are comfortable with the office productivity and data processing tools already at their disposal, so the initial reaction to change is negative. Those tools are relatively easy to use and familiar, since most users have years of hands-on experience with them.
At a recent meeting with internal customers from the technology, business, and operations groups, I was extolling the virtues of a commercial Data Quality (DQ) tool that we had rolled out across the enterprise, doing my best to convince them to give it a try. After I was done, a member of the operations group asked me the following question:
“Why should I use this new Enterprise Data Quality tool, when I can carry out the same data analysis and data quality tasks using my favorite data analysis tool(s) (e.g. Microsoft Excel, Microsoft Access, Toad, SQL*Plus, SAS, Tableau, TIBCO Spotfire, Qlikview, etc.) and can get the job done much faster?”
I would have asked the same question if I had spent years with a particular set of tools and was now being asked to learn a new one and change the way I processed data. I thanked the audience member for his question, realizing it was probably on the minds of others in the room as well. The pressure was on! This was my opportunity to educate them on the “What” and the “Why”, since I firmly believe that behavior change can’t be forced; it must happen voluntarily. I pondered my response, gathered my thoughts, and then made the following points to the audience, hoping that by the time I was done I would have convinced them to switch tools:
- Data Profiling/Forensics: DQ tools give insight into data patterns and provide statistics about profiled data – minimum and maximum values, standard deviation, inferred data types, null and not-null counts, and more – out of the box, without writing a single line of code. This is a powerful data analysis capability that helps users spot outliers and anomalies. In addition, custom rules can be executed against data columns. (A minimal sketch of what it takes to hand-roll this kind of profile appears after this list.)
- Data Analysis across Heterogeneous Sources: The most commonly used formats are XML, plain text files, delimited files, and data residing in relational databases. Analyzing these varied formats with office productivity tools requires some level of pre-processing, mapping, and even custom coding. Most of the industry-leading data quality tools, however, are format agnostic and can analyze data in its native format. This capability reduces time-to-market and the level of effort required to process data.
- Out-of-the-Box Data Quality Functions: Data quality tools ship with many data quality rules and data transformation functions (e.g. data standardization, de-duplication, cleansing, and normalization), and users can develop custom functions as well. Examples of out-of-the-box functions include address, name, and product standardization. This reduces development and testing effort and makes these functions available to every end user.
- Data Quality Scorecarding and Trending: Most tools offer some form of data quality scorecarding and trending reports out of the box. Scorecards can be assigned thresholds and color coded (red, yellow, or green) to highlight data quality hotspots. Custom reports can be developed by querying the data quality metrics stored in the tool’s results repository.
- Data Quality Results Repository, Exception Reporting and Alerts: Current and historical results of data quality activities are stored in the tool’s results repository. This data can be used to create exception reports and alerts.
- Data Quality Rules Repository: In a typical n-tier application, data quality edits end up embedded in XML schemas, user interface code, the application tier, or the database tier. Maintaining these edits is not trivial, and things tend to fall through the cracks as the application evolves. Externalizing data quality rules from code is a best practice that I emphasize to my internal customers: it facilitates maintenance, provides transparency, promotes re-use, supports version control, and simplifies audits. Teams are no longer forced to mine XML schemas, Java, SQL, PL/SQL, stored procedures, database packages, and other software components to find edit-related logic – an extremely resource-intensive and time-consuming exercise. Most data quality tools provide a rules repository where rules can be stored and re-used as required. (A small illustration of rules treated as data rather than code follows this list.)
- Data Matching/Reconciliation: I am bombarded by requests from teams that want to reconcile data between a file and a data store, between two files, or across multiple data stores. Today they do this manually or with custom routines. All of the mature data quality tools provide robust matching capabilities that can be used for reconciliation. (A bare-bones example of what such a reconciliation involves appears after this list.)
- Automation: Data quality jobs can be automated and scheduled to run via standard job scheduling tools such as Autosys. Most data quality tools support such integration or have native scheduling capabilities.
- Performance and Scalability: The leading data quality tools are built to process millions of records and to scale as performance requirements grow.
- Data Governance, Standards and Policies: By adopting a data quality tool, firms can develop policies and standards for rule definition, data quality web services, and the like. This promotes consistent implementation, strengthens data controls, provides transparency into data quality rules, supports good data governance, reduces costs, and streamlines operations.
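To make the profiling point concrete, here is a minimal sketch of what analysts end up hand-rolling today to get the statistics a DQ tool produces automatically. It uses Python and pandas, and the file name and columns are placeholders, not a real data set.

```python
# Hand-rolled column profile -- the statistics a DQ tool surfaces with no code.
# "customers.csv" is a placeholder data set for illustration only.
import pandas as pd

df = pd.read_csv("customers.csv")

profile = pd.DataFrame({
    "inferred_type": df.dtypes.astype(str),
    "null_count": df.isna().sum(),
    "not_null_count": df.notna().sum(),
    "distinct_values": df.nunique(),
})

# Min/max/standard deviation only apply to numeric columns.
numeric = df.select_dtypes("number")
profile["min"] = numeric.min()
profile["max"] = numeric.max()
profile["std_dev"] = numeric.std()

print(profile)
```

Even this simple profile has to be written, tested, and maintained for every data set; the tool provides it out of the box and adds pattern analysis on top.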
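The rules repository point is easiest to see with a toy example. The sketch below treats rules as data that can be stored, versioned, and re-used, rather than logic buried across application tiers; the rule names, columns, and thresholds are all assumptions for illustration.

```python
# Hypothetical externalized data quality rules: defined as data, applied generically.
import pandas as pd

rules = [
    {"name": "ssn_not_null", "column": "ssn", "check": lambda s: s.notna()},
    {"name": "age_in_range", "column": "age", "check": lambda s: s.between(0, 120)},
]

df = pd.read_csv("customers.csv")  # placeholder data set
for rule in rules:
    failing = (~rule["check"](df[rule["column"]])).sum()
    print(f"{rule['name']}: {failing} failing rows")
```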
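And for the reconciliation point, this is roughly what teams build by hand today to match a file against a database extract; a DQ tool’s matching engine replaces it with configurable, reusable matching. The key and column names here are hypothetical.

```python
# Bare-bones reconciliation of two extracts on a hypothetical "account_id" key.
import pandas as pd

file_df = pd.read_csv("positions_file.csv")    # placeholder: data received as a file
db_df = pd.read_csv("positions_extract.csv")   # placeholder: extract from a data store

merged = file_df.merge(db_df, on="account_id", how="outer",
                       suffixes=("_file", "_db"), indicator=True)

only_one_side = merged[merged["_merge"] != "both"]
balance_breaks = merged[(merged["_merge"] == "both") &
                        (merged["balance_file"] != merged["balance_db"])]

print(f"{len(only_one_side)} unmatched rows, {len(balance_breaks)} balance breaks")
```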
As I ran through my list, I noticed several heads nodding in agreement and heard some oohs and aahs. My message was resonating with the audience. Experience has taught me that building awareness and educating end users removes “mental barriers”: human beings tend to push back on ideas they don’t understand, don’t agree with, or that simply don’t resonate with them. My goal was to overcome the audience’s initial skepticism and highlight the following benefits:
- Increase productivity
- Eliminate key-person dependencies
- Re-use and share work done by other team members within their department (why re-invent the wheel?)
- Eliminate redundant data quality checks
- Leverage pre-built functions instead of writing custom code
- Proactively monitor data quality via scorecards
- Automate data analysis and data quality processes
- Profile data to conduct forensics – identify data anomalies and outliers
- Leverage the tool’s other data analysis and data transformation capabilities
I did emphasize that tools are merely “enablers” and cannot solve data quality problems on their own. Users remain in the driver’s seat and must leverage a tool’s capabilities to build solutions for their specific business and data quality requirements. I believe I succeeded in convincing the audience to switch to the data quality tool. What do you think?