In today’s world, data has become the lifeblood of science, business, and government. It is essential to ensure that data are accurate, complete, and up-to-date. Without accurate data, discovering something novel, building a product or service around that discovery, and monetizing it all become harder. In some cases bad data are outright dangerous: in the financial sector, for example, they can destabilize the entire system. Therefore, it is crucial to understand how to repair your data.
There are two ways to repair data quality problems: addressing pre-existing issues and preventing new problems before they happen.
Addressing Pre-Existing Issues
The first approach is to address pre-existing issues. This can be time-consuming, expensive, and demanding work. The first step is to understand the provenance of all data, what they truly mean, and how good they are. In parallel, the data must be cleaned: going through them to find and correct errors, or at least eliminate the bad records from further analyses. A complete rinse, wash, and scrub may prove infeasible for Big Data. An alternative is to complete the rinse, wash, scrub cycle for a small sample, repeat critical analyses using these “validated” data, and compare the results. This alternative must be used with extreme caution.
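To make the sample-and-compare idea concrete, here is a minimal sketch in Python using pandas. The file name, the column names ("order_id", "amount"), and the cleaning rules are assumptions for illustration only; the point is the pattern of cleaning a small sample, repeating a key analysis, and comparing results.

```python
import pandas as pd

# A minimal sketch of the sample-and-compare approach. The file name, column
# names ("order_id", "amount"), and cleaning rules are illustrative assumptions.

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply simple 'rinse, wash, scrub' rules and return the validated rows."""
    out = df.drop_duplicates(subset="order_id")                        # remove duplicate records
    out = out[pd.to_numeric(out["amount"], errors="coerce").notna()]   # drop non-numeric amounts
    out = out.assign(amount=out["amount"].astype(float))
    return out[(out["amount"] > 0) & (out["amount"] < 1_000_000)]      # drop out-of-range values

df = pd.read_csv("orders.csv")                                # full data set (placeholder path)
sample = df.sample(n=min(10_000, len(df)), random_state=42)   # small, affordable sample

raw_mean = pd.to_numeric(sample["amount"], errors="coerce").mean()
validated_mean = clean(sample)["amount"].mean()

# A material gap between the two suggests the full-data analysis cannot be trusted.
drift = abs(validated_mean - raw_mean) / validated_mean
print(f"raw={raw_mean:.2f}  validated={validated_mean:.2f}  drift={drift:.1%}")
```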
However, cleaning up erroneous data is not enough. The sheer quantity of new data being created and collected grows too rapidly for cleanup efforts to keep pace. Over the long term, the only way to deal with data quality problems is to prevent them.
Preventing Problems That Haven’t Happened Yet
The second approach is to prevent problems that haven’t happened yet. This is where the scientific traditions of “getting close to the data” and “building quality in” are most instructive for Big Data practitioners. It means taking care to design experiments, define terms, and understand end-to-end data collection. It also means building controls (such as calibrating test equipment) into data collection, identifying and eliminating the root causes of error, and upgrading equipment at every opportunity. Keeping error logs and subjecting data to the scrutiny of peers are also essential steps.
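As an illustration of building controls into data collection and keeping an error log, the following is a minimal Python sketch. The record fields ("sensor_id", "reading", "timestamp"), the calibrated range, and the log file name are assumptions, not a prescribed implementation.

```python
import logging
from datetime import datetime

# A minimal sketch of an ingest-time control. The record fields and the
# calibrated range below are illustrative assumptions.
logging.basicConfig(filename="data_errors.log", level=logging.WARNING)

def validate(record: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    if not record.get("sensor_id"):
        errors.append("missing sensor_id")
    reading = record.get("reading")
    if not isinstance(reading, (int, float)) or not (-50.0 <= reading <= 150.0):
        errors.append(f"reading outside calibrated range: {reading!r}")
    try:
        datetime.fromisoformat(record.get("timestamp", ""))
    except ValueError:
        errors.append("malformed timestamp")
    return errors

def ingest(record: dict, store: list) -> bool:
    """Accept a record only if it passes every control; otherwise log and reject."""
    errors = validate(record)
    if errors:
        logging.warning("rejected %r: %s", record, "; ".join(errors))
        return False
    store.append(record)
    return True

store: list[dict] = []
ingest({"sensor_id": "A1", "reading": 72.4, "timestamp": "2024-01-15T09:30:00"}, store)  # accepted
ingest({"sensor_id": "", "reading": 999.0, "timestamp": "not-a-date"}, store)            # rejected, logged
```

Rejecting a bad record at the point of collection, with a logged reason, is what allows root-cause analysis later: the error log becomes the raw material for finding and fixing the upstream source.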
To adapt these traditions to their circumstances, organizations should consider the following:
- Specify the different needs of people who use the data.
- Assign managers to cross-functional processes and important external suppliers.
- Ensure data creators understand what is expected.
- Measure quality and build in controls that stop errors in their tracks (see the metrics sketch after this list).
- Apply Six Sigma and other methods to get at root causes.
- Recognize that everyone touches data and can impact quality, so engage them in the effort.
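To show what “measure quality” can look like in practice, here is a minimal Python sketch that computes completeness and validity rates for a batch of records. The field names and the validity rule for "age" are hypothetical.

```python
# A minimal sketch of per-field quality metrics. Field names and the validity
# rule for "age" are illustrative assumptions, not prescriptions from the text.

def quality_metrics(records: list[dict], required_fields: list[str]) -> dict:
    """Compute completeness per required field and a simple validity rate."""
    total = len(records)
    metrics = {}
    for field in required_fields:
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        metrics[f"{field}_completeness"] = present / total if total else 0.0
    valid = sum(1 for r in records
                if isinstance(r.get("age"), int) and 0 <= r["age"] <= 120)
    metrics["age_validity"] = valid / total if total else 0.0
    return metrics

records = [
    {"name": "Ada", "email": "ada@example.com", "age": 36},
    {"name": "", "email": None, "age": 430},   # two gaps and an impossible age
]
print(quality_metrics(records, ["name", "email"]))
# {'name_completeness': 0.5, 'email_completeness': 0.5, 'age_validity': 0.5}
```

Tracked over time, metrics like these show whether prevention efforts are working and which fields deserve root-cause attention first.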
Once you get the hang of it, none of the work to prevent errors is particularly difficult. But too many organizations don’t muster the effort. There are dozens of reasons — excuses, really — from the belief that “if it is in the computer, it must be the responsibility of IT,” to a lack of communication across silos, to blind acceptance of the status quo.
It is time for senior leaders to get very edgy about data quality, get the managerial accountabilities right, and demand improvement. For bad data don’t just bedevil Big Data. They foul up everything they touch, adding costs to operations, angering customers, and making it more difficult to make good decisions. The symptoms are sometimes acute, but the underlying problem is chronic. It demands an urgent and comprehensive response, especially by those hoping to succeed with Big Data.
Conclusion
Data quality is a critical issue that cannot be overlooked, particularly in the era of Big Data. Addressing pre-existing data quality problems is time-consuming and expensive, but it must be done to ensure accurate analysis and decision-making. Preventing data quality problems from occurring in the first place requires a cultural shift within organizations, where everyone is engaged in the effort to improve data quality.
By adopting the best practices of scientists and applying them to their own circumstances, businesses can create a culture of data quality that prevents errors from occurring and ensures that the data they use for analysis and decision-making are accurate and reliable. Ultimately, the effort put into improving data quality will pay dividends in better decision-making, reduced costs, and increased customer satisfaction.
At Anyon Consulting, we understand that improving data quality can be a daunting task, which is why we offer a range of services to help you achieve your data quality goals. Our experienced consultants can help you with data profiling, data cleansing, data integration, and other techniques to improve the accuracy and reliability of your data. We can also work with you to develop data quality metrics and monitor the effectiveness of your data quality program over time. With our help, you can ensure that your data are of the highest quality and use them confidently to drive your business forward. Contact us today to learn more about how we can help with your data quality needs.