Most businesses now regard data as a major asset, or even “the new oil”. Yet many still struggle to manage its quality. A recent O’Reilly report, The State of Data Quality in 2020, concludes that data quality is going to get worse before it gets better. While common problems such as unifying multiple sources and handling missing data remain challenges, organizations are starting to adopt data governance policies with C-level support.
Common Issues Cause Big Problems
Business users are well aware of how frustrating it is to work with an Excel spreadsheet full of blank cells where data should be. It’s equally challenging when different abbreviations or terms are used for the same thing, such as “New York” vs “NY”. There’s no way to run a correct automatic tally without first standardizing the fields or adding them up manually.
Now imagine doing this at a much larger scale, with multiple, larger, and possibly unfamiliar data sources. This is why data scientists are estimated to spend roughly 80% of their time cleaning data rather than developing algorithms.
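To make the tally problem concrete, here is a minimal sketch of the kind of cleanup involved. The records, alias table, and function are hypothetical, but they show why inconsistent labels and blanks defeat an automatic count:

```python
# Hypothetical cleanup step: unify inconsistent state labels
# ("New York" vs "NY") and keep missing values visibly missing
# so a tally cannot silently go wrong.

STATE_ALIASES = {"new york": "NY", "ny": "NY", "california": "CA", "ca": "CA"}

def normalize_state(value):
    """Map a free-text state entry to its postal abbreviation, or None."""
    if value is None or not value.strip():
        return None  # a blank cell stays blank rather than joining a count
    return STATE_ALIASES.get(value.strip().lower())

records = ["New York", "NY", "  ny ", "", "California"]
cleaned = [normalize_state(r) for r in records]
print(cleaned)  # ['NY', 'NY', 'NY', None, 'CA']
```

Only after a pass like this do "New York" and "NY" land in the same bucket; at scale, the alias table itself becomes the hard part to build and maintain.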
When acquiring new data sources, organizations should ask what kind of processing is done to ensure standardization. Patent data, for instance, carries many nuances and standards that differ from one patent office to another, all of which our team knows well. This allows our customers to spend their high-value time on the activities that produce the largest rewards.
The Basics of Data Governance
To fix the kinds of issues mentioned above, the right policies need to be in place to guide an organization’s data collection. A good framework includes rules, processes, and procedures that are followed consistently.
Using the “New York” example above, a rule could specify that all states use their postal abbreviations (NY), that state information be collected on a web form with a drop-down selection to eliminate manual entry, and that any record combining NY with a zip code from outside New York state be flagged or removed.
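A governance rule like this one is straightforward to express in code. The sketch below is illustrative only: it checks a single state, and the zip range is simplified (real New York zips run roughly 10001–14975, with a few outliers a production table would also cover):

```python
# Illustrative validation rule: require the postal abbreviation
# and flag records whose zip code falls outside the state's range.

NY_ZIP_RANGE = range(10001, 14976)  # simplified; a real table covers every state

def validate_record(state, zip_code):
    """Return (ok, reason); records failing a rule get flagged for review."""
    if state != "NY":
        return False, "state must be the postal abbreviation NY"
    if int(zip_code) not in NY_ZIP_RANGE:
        return False, "zip code is outside New York state"
    return True, "ok"

print(validate_record("NY", "10027"))  # (True, 'ok')
print(validate_record("NY", "90210"))  # flagged: California zip code
```

The point is less the code than the policy behind it: the rule is written down once, applied the same way to every record, and produces an auditable reason whenever a record is rejected.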
At IFI, we’ve spent years developing rules to process and combine more than 100 data sources into a consistent format. We also have quality control techniques that catch problems before they go live. The CLAIMS Direct platform can also be leveraged to manage other types of data and combine them with patent data. For example, we currently track 2,245 name variations for the BASF corporation. Even unusual names such as GASF AKTIENGESELLSCHAFT, BEE AA ESU EFU AG, and BADIS CHE ANVIN & SODA FABRIK AKTIENGESELLSCHAFT are normalized to BASF SE.
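At its simplest, this kind of name normalization amounts to a curated lookup from each known variant to a canonical form. The sketch below is a toy version holding just three of the variants mentioned above; it is not IFI's actual implementation, which tracks thousands of variants per company:

```python
# Toy name-normalization table: map known variants to a canonical form.

CANONICAL_NAMES = {
    "GASF AKTIENGESELLSCHAFT": "BASF SE",
    "BEE AA ESU EFU AG": "BASF SE",
    "BADIS CHE ANVIN & SODA FABRIK AKTIENGESELLSCHAFT": "BASF SE",
}

def normalize_assignee(name):
    # Fall back to the raw name so unknown entries are never silently lost;
    # in practice they would be queued for human review and table updates.
    return CANONICAL_NAMES.get(name.strip().upper(), name)

print(normalize_assignee("Bee Aa Esu Efu AG"))  # BASF SE
```

The hard work is not the lookup but curating the table: discovering new variants, deciding which canonical entity each belongs to, and keeping the mapping current as companies merge and rename.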
AI as an Incentive and an Assistant
The alluring possibilities of using AI to discover new insights are compelling business leaders to get more serious about data quality. Clean data is a must when training AI models and an ongoing need once they are deployed: basing decisions on a faulty model can have serious, even dangerous, consequences.
In the face of the coronavirus, many companies are touting the success of their models in identifying treatments or vaccines for the virus. But experts are sounding alarm bells that much of the data feeding those models hasn’t been properly vetted, which may lead to erroneous results.
Conversely, AI is also helping to clean up data problems: more than 40% of the O’Reilly survey respondents said they are using it in some way. Some companies build supervised machine learning tools in-house to correct inconsistencies. Commercial tools are also available, although they are usually part of a larger enterprise data governance solution.
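The in-house tools mentioned above are typically trained machine learning models; as a far simpler stand-in that captures the same idea, the sketch below uses fuzzy string matching from Python's standard library to snap a messy entry to a known vocabulary. The vocabulary and cutoff here are hypothetical:

```python
# Simple stand-in for ML-assisted cleanup: fuzzy-match messy entries
# against a list of known-good values from the standard library.
import difflib

KNOWN_STATES = ["NY", "NEW YORK", "NEW JERSEY", "NEW MEXICO"]

def suggest_correction(entry, cutoff=0.6):
    """Return the closest known value, or None if nothing is close enough."""
    matches = difflib.get_close_matches(entry.upper(), KNOWN_STATES,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(suggest_correction("New Yrok"))  # NEW YORK
```

A supervised model plays the same role at much larger scale, learning from labeled examples which corrections are safe to apply automatically and which should be routed to a human.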