A Deep Dive into the Importance of Data Design, Quality Assurance, and the Urgent Need for Defined Roles in the Data Industry

The neglect of data design and data quality is a pervasive issue within the data industry, and it hinders efforts to extract value from data. Data science focuses on making data useful, while data engineering is concerned with making data usable. The art of creating good data in the first place, however, is often overlooked, and without quality data inputs, data scientists and data engineers can do little to help.

Furthermore, even when data is available, its quality matters: collecting low-quality data leads to a futile battle with the fundamental principle of “Garbage In, Garbage Out”. It is crucial to invest in good data before starting any project, rather than focusing on fancy algorithms and models or hiring a parade of data scientists.

The importance of data quality is widely recognized within the industry, as it is the foundation of the entire multibillion-dollar data/AI/ML/statistics/analytics industry. However, it remains unclear who is responsible for the design, collection, curation, and documentation of high-quality datasets. Data engineers, statisticians, researchers, UX designers, and product managers have all claimed responsibility, leading to confusion and inconsistency. Data quality seems to be an “everybody’s job” that ends up being nobody’s job.

The lack of attention given to data quality and data leadership is concerning, as it diminishes the value of the data science profession. Neglecting these two critical prerequisites is disheartening, and the absence of a specific job title or community exacerbates the issue. The career goes by many names (data quality professional, data designer, data curator, data collector, data steward, dataset engineer, data excellence expert) but has no agreed one, which makes it challenging to search for candidates with the required skills.

Data quality work is sometimes dismissed as if it were comparable to data labeling jobs such as mindless data entry or survey collection. It is not, and the dismissal is unfair to the kind of genius that successful data collection requires.

Therefore, it is essential to establish a specific job title for the person responsible for designing, collecting, curating, and documenting high-quality datasets. Without such a role, it is hard to find candidates who excel across the required symphony of skills, and data quality remains “everybody’s job” and nobody’s responsibility.

Policies and guidelines for data collection, usage, and access vary depending on the organization, industry, and country. However, there are some common principles and best practices that most organizations follow:

Data collection policies: Organizations should clearly define what data they collect, how they collect it, and why they collect it. Data collection policies should also specify who has the authority to collect data and under what circumstances.

Data usage policies: Organizations should establish guidelines for how data can be used within the organization. These guidelines should specify who has access to the data, how the data can be used, and what the consequences are for misuse or unauthorized access.

Data access policies: Organizations should define who has access to data and under what circumstances. Access to sensitive data should be restricted to only those who need it to perform their job functions, and all access should be logged and audited.

Data retention policies: Organizations should establish guidelines for how long data should be retained and how it should be disposed of when it is no longer needed. These policies should also specify how data should be archived for long-term storage.

Data security policies: Organizations should establish guidelines for how data should be protected from unauthorized access, use, and disclosure. These policies should include physical security measures, network security measures, and access controls.

Privacy policies: Organizations should establish guidelines for how they collect, store, use, and share personal information. These policies should comply with applicable privacy laws and regulations, such as the GDPR in the European Union or CCPA in California.

Data governance policies: Organizations should establish guidelines for how they manage data as a strategic asset. Data governance policies should specify who is responsible for data management, how data quality is maintained, how data is integrated across systems, and how data is used to support decision-making.

Overall, policies and guidelines for data collection, usage, and access are critical for ensuring that organizations collect and use data responsibly, protect sensitive data from unauthorized access, and comply with applicable laws and regulations.
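To make the data access policy above more concrete, here is a minimal sketch in Python of role-based access with an audit trail. Everything in it is an assumption for illustration: the dataset names, the role names, the `ACCESS_POLICY` table, and the in-memory `audit_log` are hypothetical, and a real organization would rely on its own identity and logging systems.

```python
from datetime import datetime, timezone

# Hypothetical policy table: which roles may read which dataset.
ACCESS_POLICY = {
    "customer_transactions": {"data_engineer", "data_scientist"},
    "customer_personal_info": {"data_steward"},
}

# Illustrative in-memory audit trail; a real system would use durable storage.
audit_log = []


def can_access(role, dataset):
    """Return True only if the policy explicitly grants the role access."""
    return role in ACCESS_POLICY.get(dataset, set())


def request_access(user, role, dataset):
    """Check the policy and record every attempt, allowed or denied."""
    allowed = can_access(role, dataset)
    audit_log.append(
        {
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "dataset": dataset,
            "allowed": allowed,
        }
    )
    return allowed


if __name__ == "__main__":
    print(request_access("ana", "data_scientist", "customer_transactions"))
    print(request_access("ana", "data_scientist", "customer_personal_info"))
    print(len(audit_log), "access attempts logged")
```

The sketch denies by default: a dataset or role that is missing from the table gets no access. Denied attempts are logged as well, since those are often the ones an audit cares about.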

Let’s say a retail company is collecting customer transaction data for its online store. The company is interested in analyzing this data to gain insights into customer behavior and improve its sales. However, it notices that the data is incomplete and contains errors. For instance, some transactions are missing information about the product purchased, while others have incorrect information about the price or quantity.

To improve the data quality, the company assigns the following tasks to its team:

Data Engineer: The data engineer is responsible for ensuring that the data is accurate, complete, and consistent. They start by creating a data pipeline that automatically collects transaction data from the company’s database and applies data cleansing techniques such as data profiling and data validation. They also establish rules for data entry and ensure that the data is standardized to reduce errors and improve consistency.

Data Scientist: The data scientist is responsible for analyzing the data and extracting insights. They work closely with the data engineer to ensure that the data is clean and accurate. They use statistical methods and machine learning algorithms to identify patterns and trends in the data, and they create predictive models to forecast future customer behavior. They also work with the person responsible for the data to ensure that any data privacy or security issues are addressed.

Person responsible for the data: The person responsible for the data is its gatekeeper. They ensure that the data is being collected and used in a legal and ethical manner. They work with the data engineer and data scientist to ensure that the data is clean, accurate, and secure. They also set policies and guidelines for data collection, usage, and access, and they ensure that any sensitive information is protected.

Through the contributions of these three roles, the retail company can improve the quality of its transaction data, enabling it to make better decisions and improve customer satisfaction.
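The validation step in the data engineer's pipeline could be sketched roughly as follows. This is only an illustration under stated assumptions: the field names (`product_id`, `price`, `quantity`) and the rules themselves are hypothetical, since the example does not specify the company's actual schema.

```python
def validate_transaction(tx):
    """Return a list of problems found in one transaction record (a dict)."""
    problems = []

    # Rule 1: the purchased product must be recorded.
    if not tx.get("product_id"):
        problems.append("missing product_id")

    # Rule 2: price must be a positive number (booleans are rejected).
    price = tx.get("price")
    if isinstance(price, bool) or not isinstance(price, (int, float)) or price <= 0:
        problems.append("invalid price")

    # Rule 3: quantity must be a positive whole number.
    quantity = tx.get("quantity")
    if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity <= 0:
        problems.append("invalid quantity")

    return problems


def split_valid_rejected(transactions):
    """Separate clean records from those that break at least one rule."""
    valid, rejected = [], []
    for tx in transactions:
        problems = validate_transaction(tx)
        if problems:
            rejected.append((tx, problems))
        else:
            valid.append(tx)
    return valid, rejected


if __name__ == "__main__":
    # Hypothetical sample records mirroring the errors described in the example.
    sample = [
        {"product_id": "SKU-1", "price": 19.99, "quantity": 2},
        {"product_id": None, "price": 5.00, "quantity": 1},
        {"product_id": "SKU-3", "price": -4.50, "quantity": 0},
    ]
    valid, rejected = split_valid_rejected(sample)
    print(len(valid), "valid;", len(rejected), "rejected")
    for tx, problems in rejected:
        print(tx, "->", problems)
```

Rules like these only catch values that are missing or impossible. A price that is wrong but plausible passes every check, which is one reason the quality of the data has to be designed in at collection time rather than repaired afterwards.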