“Dirty data” is a term that’s often thrown around, but it can be a nebulous term.
Often, what people mean by “dirty data” is that the data isn’t set up in the way we need it to be for our particular business purposes.
So, for example, someone might say that a field is broken because we have 50% NULL values in that field. But there could also be a good reason for a field to have 50% NULLs.
Let’s say your dataset is a collection of customer surveys. The website form could be designed in a way that a nonresponse to a question results in a NULL value in the resulting data table. So, a NULL value could show you that the customer skipped that particular question.
But, the argument against that type of form design would be that you can’t tell the difference between a customer’s intentional nonresponse and an error with the form design of that question. Instead, you could require a response from the user, but include an option like “N/A” or “Decline to answer.”
This is why it’s important for everyone in the data chain to think categorically. The business user requesting the creation of the survey, the web developer that creates the form, and the analyst responsible for studying the results.
A Simple Example of Thinking Categorically
Let’s look at a very simple example: a shopping list.
- 2 bottles of soda
- 2 lbs of bacon (should be on sale)
- White socks
- Baby carrots (should be on sale)
- 4 frozen pizzas
As you can see, this was written stream of consciousness. It gets the job done, but look at how it could be structured:
In this table, each row represents an item. And each item has certain attributes: department, quantity, and whether or not the item is on sale.
When you start thinking of data as a collection of items that have attributes, it naturally makes you think of what other attributes might be useful to describe the items. Whereas the list is geared towards thinking vertically; your goal is simply to list items.
This is why you might not think of Department as a valuable attribute when writing your list. And the result is a list of food items and clothing items that are mixed in with each other. With the table, it was easy to organize the list by department.
The table format also forces you to consider whether a value is necessary for every attribute of every item. With the list, the person writing it didn’t think to specify how many socks they needed. But the table makes you consider what quantity you need for every item.
This format can look more cumbersome; it does take up more space. But almost like learning a second language, the more you think in this way, the faster you get at honing in on the piece of data you need. And the benefit is more clarity and specificity.
Thinking Categorically with Big Data
When you’re dealing with millions of items spread across a complex structure of many different tables, the need to think categorically becomes that much more important.
Let’s say you’re working on a table of advertising campaigns. You might be creating/modifying the table structure, or you might be creating/modifying the means by which data is inserted into that table. Either way, it’s important to consider the following questions:
- What constitutes a record in this table?
- What are the necessary attributes that apply to each record?
- For each of the attributes, is a value required for every record? Or is it acceptable to have NULL values?
- For each of the attributes, should we have any datatype constraints? Should we limit the possible values to a range or a list of choices?
Now, let’s say you’re creating a survey like the one described earlier. Should we create a record in our Campaign table to track this? It’s not technically advertising if we’re asking for customer feedback. But it’s possible for our Campaign table to track more than just advertising campaigns, if that is what the business needs. So, if we do add it, we should probably have an attribute like Campaign Type, so that we can classify this particular campaign as a customer survey instead of an advertising campaign.
What if we intend to send out emails on different dates spread across the next three months: does that represent one record in the Campaign table? Or multiple records for each email send date? Or a record for each month that the emails are sent? What if we’re sending these emails to customers from different business units? Should we have separate records in the Campaign table to represent each business unit?
If you extrapolate these questions to all of the attributes on the Campaign table, you can see how difficult it is to prevent dirty data. Dirty data stems from the creation of data that isn’t structured how the business needs it to be. And avoiding that requires us to think categorically at every step of the development process.