Big Data: Garbage In, Garbage Out
Robin Woolen, MBA, IGP has worked in the field of information lifecycle management since 1994 with a specialty in strategic consulting focused on enterprise-scale information management.
Big Data. One of the most desired and overused buzzwords in all of business today. Every organization in the public and private sectors is chasing the perceived benefits of Big Data and Business Intelligence. There is no doubt that a veritable treasure trove of customer data is waiting to reveal insights into where an organization should go to satisfy its customers’ future needs, or even to generate revenue directly from the marketing potential this data represents when sold to third-party buyers. These are just a few of the reasons organizations are racing to build data warehouses or lakes to start collecting data and begin fishing for these benefits.
The problem comes in the word “racing.” The CrowdFlower 2016 Big Data Scientist Survey found that data scientists typically spend 60% of their time cleaning and organizing data sets, 19% collecting data sets, and a mere 21% of their time actually analyzing data. It should not be surprising that cleaning and organizing data sets is the least favored part of a data scientist’s daily work, and yet that is exactly where they spend the most time. Why, you might ask? Because the data sources they are working from are unreliable. There’s an old saying in the information management world: “garbage in, garbage out.”
You have to have good, clean, trusted data before you can believe the results of any analysis. That should be obvious, but the fact remains that the people actually dealing with this information today spend the majority of their time finding and fixing raw data before they can even get to producing anything of value for their organization. They can’t do otherwise, because basing a decision on bad data could be disastrous. If you’ve been following this series with any regularity, you can probably already guess that I am going to suggest that a lot of this time could be saved if these organizations had a strong information governance program in place.
A strong information governance program actively removes redundant, obsolete and transient data from the system, which improves the quality of information stored in the data lake or warehouse. It also makes the data sets needed for analysis faster and easier to identify. This means information governance will have a major impact on 79% of your data scientists’ daily work life: the combined share of their time spent cleaning, organizing and collecting data sets. Forget the fact that happy employees produce more, and consider that anyone with the title of “Data Scientist” is not working for minimum wage. There is real value in getting these people producing as quickly as possible.
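To make the idea concrete, here is a minimal sketch of what removing redundant and obsolete records from a feed might look like before it reaches a data lake. The record layout, field names and retention cutoff are all hypothetical assumptions for illustration, not a prescription for any particular system.

```python
from datetime import date

# Hypothetical records: each row has an id, a payload, and a last-modified date.
records = [
    {"id": 1, "payload": "customer A", "modified": date(2016, 3, 1)},
    {"id": 1, "payload": "customer A", "modified": date(2016, 3, 1)},   # redundant duplicate
    {"id": 2, "payload": "customer B", "modified": date(2009, 6, 15)},  # obsolete record
    {"id": 3, "payload": "customer C", "modified": date(2016, 5, 20)},
]

def remove_rot(rows, cutoff):
    """Drop exact duplicates and records last modified before the retention cutoff."""
    seen = set()
    clean = []
    for row in rows:
        key = (row["id"], row["payload"], row["modified"])
        if key in seen or row["modified"] < cutoff:
            continue  # redundant or obsolete: exclude from the warehouse feed
        seen.add(key)
        clean.append(row)
    return clean

clean_records = remove_rot(records, cutoff=date(2015, 1, 1))
# Only the two current, unique records remain.
```

In practice this kind of rule would be driven by the retention schedule in the governance program rather than a hard-coded date, but the principle is the same: the cleanup happens once, upstream, instead of in every analyst’s workflow.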
Another hot commodity in today’s business world is the data scientists themselves. These are highly educated theoreticians and statistical analysts who are in short supply at this particular time, which makes them very expensive. They specialize in creating “what if” scenarios that pull data from a variety of sources to find relationships your average person wouldn’t even consider. Is this important work? Absolutely. However, I question whether the average organization’s information governance is really in a position to support a lot of theoretical activity waiting on unknown results. This is not to say data scientists aren’t worth it, because they are; but if they are spending 60% of their time cleaning up data rather than running analyses, how much money is the organization really willing to spend?
The good news is that technology has advanced to the point where an organization can deploy user-friendly systems that allow the average staff member to do a lot of business intelligence analysis on their own. This can make more financial sense to the powers that be who want to get into Big Data and Business Intelligence but don’t want to hire dedicated resources for the job. Yes, it does mean more work for everyone else, but there is a lot of value to be found when you give the people who work in the business a tool that lets them analyze these various data streams on their own. It is always better to stack up quick wins to prove something works and build out a capability from there.
Once again, the results you get are only as good as the underlying data. It is critical that an organization take the time to get rid of any redundant, obsolete and transient data before linking a data stream to any shiny new system. You don’t want to bring trash into the new house. Preparation is the key.