Welcome to our Data Dictionary series, where we continue to explore the terms you see when reading news stories and blog posts about Big Data.
You can read Part 1 of the series here.
Understanding Big Data means grasping the meaning behind the many terms that those in the tech industry toss around when talking about data collection, analysis, interpretation and the like. For those still getting used to the jargon, here are more terms and concepts that you’ll want to understand.
Dirty data: Data that has not yet been cleaned and is filled with duplications, inconsistencies and inaccuracies.
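To make the idea concrete, here is a minimal sketch of cleaning a small batch of dirty data. The records, field names and the 0 to 120 age range are all made up for illustration; real cleaning rules depend on the data set.

```python
# Hypothetical customer records showing typical dirty-data problems.
raw_records = [
    {"name": "Ada Lovelace", "email": "ada@example.com", "age": "36"},
    {"name": "ada lovelace ", "email": "ADA@example.com", "age": "36"},  # duplicate with inconsistent formatting
    {"name": "Alan Turing", "email": "alan@example.com", "age": "-5"},   # inaccurate value
    {"name": "Grace Hopper", "email": "grace@example.com", "age": "85"},
]


def clean(records):
    """Normalize formatting, drop implausible ages and remove duplicates."""
    seen_emails = set()
    cleaned = []
    for record in records:
        name = record["name"].strip().title()
        email = record["email"].strip().lower()
        age = int(record["age"])
        if not 0 <= age <= 120:   # assumed plausibility rule
            continue
        if email in seen_emails:  # treat a repeated email as a duplicate
            continue
        seen_emails.add(email)
        cleaned.append({"name": name, "email": email, "age": age})
    return cleaned


cleaned_records = clean(raw_records)
for record in cleaned_records:
    print(record)
```

Only the first Ada Lovelace record and the Grace Hopper record survive: the second Ada record is a duplicate once the email is normalized, and the negative age is rejected.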
Data lake: A repository that holds large amounts of raw data in its original form. Not to be confused with a data warehouse, where large amounts of data also are stored, but typically only after the data has been cleaned, structured and integrated with other data. Information in a data lake is usually not yet in a form that businesses can use directly to make decisions.
Data unification: The process by which data from different sources is integrated and cleaned (i.e., duplicates and inaccuracies are removed). This is currently a major trend among large businesses that have enormous amounts of data “siloed” in different areas.
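ETL: This stands for extract, transform and load. In this process, data is extracted from one or more source systems, transformed into a consistent, cleaned format and loaded into a destination such as a data warehouse, where it can be used in analytics and in reports.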
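The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV text stands in for a source system, an in-memory SQLite database stands in for the destination, and the table and column names are invented.

```python
import csv
import io
import sqlite3

# Hypothetical export from a source system.
SOURCE_CSV = """order_id,amount_usd,country
1,19.99,us
2,5.00,ca
3,42.50,us
"""


def extract(text):
    """Extract: read raw rows out of the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types, store money as cents, normalize country codes."""
    return [
        (int(r["order_id"]), round(float(r["amount_usd"]) * 100), r["country"].upper())
        for r in rows
    ]


def load(rows, connection):
    """Load: write the transformed rows into the destination table."""
    connection.execute(
        "CREATE TABLE orders (order_id INTEGER, amount_cents INTEGER, country TEXT)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), connection)

# Once loaded, the data can be queried for a report.
totals = connection.execute(
    "SELECT country, SUM(amount_cents) FROM orders GROUP BY country ORDER BY country"
).fetchall()
print(totals)
```

Keeping each step in its own function makes it easy to swap the source or the destination without touching the transformation rules.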
Byte sizes: This is one that confuses many. Here are four commonly mentioned byte sizes.
- A gigabyte contains 1 billion bytes
- A terabyte contains 1,000 gigabytes, or 1 trillion bytes
- A zettabyte contains 1 billion terabytes, or 1 sextillion bytes
- A yottabyte contains 1,000 zettabytes, or 1 septillion bytes
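The relationships in this list are easier to check when the sizes are written as powers of ten. A small sketch, with the constant names chosen here for illustration:

```python
# Decimal (SI) byte units expressed as powers of ten.
GIGABYTE = 10**9
TERABYTE = 10**12
ZETTABYTE = 10**21
YOTTABYTE = 10**24

gigabytes_per_terabyte = TERABYTE // GIGABYTE
terabytes_per_zettabyte = ZETTABYTE // TERABYTE
zettabytes_per_yottabyte = YOTTABYTE // ZETTABYTE

print(gigabytes_per_terabyte)    # 1000
print(terabytes_per_zettabyte)   # 1000000000
print(zettabytes_per_yottabyte)  # 1000
```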
Metadata: Data that describes other data, typically summarizing basic facts about it. A well-known example is the metadata on Word documents, such as the date created, the date last modified and the author.
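File systems keep similar metadata for every file. The sketch below reads a few of those facts with Python's standard library; the helper name describe_file and the temporary file are just for illustration.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Return basic metadata about a file, not the file's contents."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    }


# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as handle:
    handle.write(b"The data itself.")
    path = handle.name

metadata = describe_file(path)
os.remove(path)
print(metadata)
```

The file's contents are the data; its name, size and modification time are the metadata.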
Database as a service: A database is a collection of organized data. A database as a service is typically a database hosted in a cloud and sold on a metered basis to other companies. Azure SQL Database on Microsoft Azure and Amazon Relational Database Service (Amazon RDS) are examples of this.
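Infrastructure as a service: Refers to computing hardware, such as servers, storage and networking, that an outside vendor provides and manages, typically as a cloud-based service. For Big Data, this infrastructure handles data collection and storage, and the same vendor often offers database as a service as well.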
Business intelligence: The use of data collection and analysis to create a data-driven decision-making process for a business.
Taxonomy: Classification of collected data into predetermined categories. This makes specific sets of data easier to access and retrieve.
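As a rough sketch of the idea, the example below sorts support tickets into categories that are defined up front. The categories, keywords and tickets are hypothetical, and keyword matching is only one simple way to assign a category.

```python
from collections import defaultdict

# Hypothetical taxonomy: every category is defined in advance by keywords.
TAXONOMY = {
    "billing": {"invoice", "refund", "payment"},
    "technical": {"error", "crash", "login"},
}


def classify(text):
    """Return the first category whose keywords appear in the text."""
    words = set(text.lower().split())
    for category, keywords in TAXONOMY.items():
        if words & keywords:
            return category
    return "uncategorized"


tickets = [
    "Refund for duplicate payment",
    "App crash on login",
    "Where is your office",
]

by_category = defaultdict(list)
for ticket in tickets:
    by_category[classify(ticket)].append(ticket)

# Retrieval is now a simple lookup by category.
print(by_category["billing"])
```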