
Data Quality Rules: enforcing reliability of datasets. Data Quality Assurance using AWS Glue DataBrew

In today's data-driven world, maintaining the quality and integrity of your data is paramount. Ensuring that organizations' datasets are accurate, consistent and complete is crucial for effective decision-making and operational efficiency. Our upcoming eBook, "Data Quality No-Code Automation with AWS Glue DataBrew: A Proof of Concept," provides practical strategies and tools to help you achieve top-notch data quality.

In this blog post, we're excited to share a preview from our eBook that guides you through creating data quality rules in AWS Glue DataBrew, using HR datasets as an example to enhance their reliability. Following these steps ensures your data is clean, consistent and ready for analysis.

Stay tuned for the release of our eBook, and don't miss out - sign up now to join the waiting list and be among the first to access this valuable resource. 

Data Quality Rules

In modern data architecture, the adage "garbage in, garbage out" holds true, emphasizing the critical importance of data quality in ensuring the reliability and effectiveness of analytical and machine-learning processes. Challenges arise from integrating data from diverse sources, encompassing issues of volume, velocity and veracity. Therefore, while unit testing applications is commonplace, ensuring the veracity of incoming data is equally vital, as it can significantly impact application performance and outcomes.

The introduction of data quality rules in AWS Glue DataBrew addresses these challenges head-on. DataBrew, a visual data preparation tool tailored for analytics and machine learning, provides a robust framework for profiling and refining data quality. Central to this framework is the concept of a "ruleset", a collection of rules that compare various data metrics against predefined benchmarks.

Utilize AWS Glue DataBrew to establish a comprehensive set of data quality rules tailored to the organization's specific requirements. These rules will encompass various aspects such as missing or incorrect values, changes in data distribution affecting ML models, erroneous aggregations impacting business decisions and incorrect data types with significant repercussions, particularly in financial or scientific contexts.

Employ DataBrew's intuitive interface to create and deploy rulesets, consolidating the defined data quality rules into actionable entities. These rulesets serve as a foundation for automating data quality checks and ensuring adherence to predefined standards across diverse datasets. We walk through all of these steps in detail in the eBook.

After defining the data quality rulesets, the subsequent step involves crafting specific data quality rules and checks to ensure the integrity and accuracy of a dataset, which is the focus of this blog post. AWS Glue DataBrew allows for the creation of multiple rules within a ruleset, and each rule can include various checks, tailored to address particular data quality concerns. This flexible structure enables the user to take a comprehensive approach to validating and cleansing data.
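If you prefer to script this setup rather than click through the console, the same ruleset structure is exposed through the DataBrew API. The snippet below is only a minimal sketch using boto3's create_ruleset call: the ruleset name, dataset ARN and check expression are placeholder assumptions, and the exact CheckExpression strings that the console generates for a given check can be inspected by calling describe_ruleset on a ruleset you built in the UI.

```python
# Minimal sketch (not a full implementation): creating a DataBrew ruleset with boto3.
# The ruleset name, dataset ARN and check expression below are placeholders --
# export a console-built ruleset with describe_ruleset to see the exact expressions.
import boto3

databrew = boto3.client("databrew")

databrew.create_ruleset(
    Name="hr-data-quality-ruleset",  # hypothetical name
    TargetArn="arn:aws:databrew:eu-west-1:123456789012:dataset/hr-dataset",  # replace with your dataset ARN
    Rules=[
        {
            "Name": "Check Positive Values",
            # Illustrative expression only; substitution variables are resolved
            # via the SubstitutionMap (":col1" -> column reference, ":val1" -> value).
            "CheckExpression": ":col1 >= :val1",
            "SubstitutionMap": {":col1": "`Emp ID`", ":val1": "0"},
        },
    ],
)

# Inspect the stored rules (and their exact CheckExpression strings):
print(databrew.describe_ruleset(Name="hr-data-quality-ruleset")["Rules"])
```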

In this phase of our PoC, we focus on implementing a set of precise data quality rules and the respective checks that correspond to common data issues often encountered in human resources datasets. These rules are designed not only to identify errors, but also to enforce consistency and reliability across a dataset.

Row Count Verification

Rule: Ensure the total row count matches the expected figures to verify no data is missing or excessively duplicated.

Accurately verifying the row count in our dataset is essential for ensuring data completeness and reliability. In AWS Glue DataBrew, setting up a rule to confirm the correct total row count ensures that no records are missing or inadvertently duplicated during data processing. This check is crucial for the integrity of any subsequent analyses or operations.

To set up this check, you will need to follow these steps within the DataBrew console under your designated data quality ruleset:

  1. Initiate a New Rule
  • Begin by navigating to the section of the DataBrew console where you can manage your data quality rules. Click on ‘Add a new rule’ to start the process of creating a rule focused on the row count.
  • Assign a descriptive name to this rule that indicates its purpose, such as ‘Check Record Count’. This naming helps to easily identify the rule’s function in ensuring data accuracy.
  2. Configure the Rule Scope
  • Set the ‘Data quality check scope’ to ‘Individual check for each column’. While checking the row count might seem to be an overall dataset check, this setting ensures that the rule is evaluated with the proper scope and triggers correctly.
  • Opt for ‘All data quality checks are met (AND)’ under ‘Rule success criteria’. This selection specifies that for the rule to pass, the row count must exactly match the expected number without deviation.
  3. Define the Check for Row Count
  • Within the rule, select ‘Number of rows’ from the options under ‘Data quality checks’. This particular check focuses directly on quantifying the dataset’s rows.
  • For the condition, choose ‘Is equals’ to set the precise expectation for the row count.
  • Enter the value ‘5000’ as the exact number of rows expected in the dataset. This figure should reflect the anticipated row count based on your data acquisition parameters or initial dataset size estimations.

By implementing this rule, you establish a robust verification process for the row count, which plays a critical role in maintaining the data's integrity. It ensures that the dataset loaded into AWS Glue DataBrew is complete and that no data loss or duplication issues affect the quality of your information. This rule is an integral part of our data quality framework, supporting reliable data-driven decision-making.
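As a quick sanity check outside of DataBrew, the same condition can be expressed in a few lines of pandas. This is only a local sketch that assumes the HR data is available as a CSV file (the file name below is hypothetical); in the PoC itself the check runs as part of the DataBrew validation.

```python
import pandas as pd

EXPECTED_ROWS = 5000  # expected row count from the rule above

# Hypothetical local copy of the HR dataset; adjust the path to your environment.
df = pd.read_csv("hr_dataset.csv")

assert len(df) == EXPECTED_ROWS, f"Expected {EXPECTED_ROWS} rows, found {len(df)}"
```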

Duplicate Row Check

Rule: Identify and remove any duplicate records to maintain dataset uniqueness.

Ensuring the uniqueness of data within our dataset is crucial for maintaining the accuracy and reliability of any analysis derived from it. To effectively identify and eliminate any duplicate rows in our dataset, we employ a structured approach within AWS Glue DataBrew. This process involves setting up a specific rule dedicated to detecting duplicates. To begin, access your previously defined data quality ruleset in the DataBrew console. From here, you will add a new rule tailored to address duplicate entries.

  1. Initiate a New Rule
  • Navigate to the interface where the existing rules are displayed and select ‘Add another rule’ to start defining a new condition focused on duplicates.
  • For the rule’s identification, label it descriptively to reflect its purpose. A suggested name might be ‘Check Duplicate Rows’, which clearly states the rule’s function.
  2. Configure the Rule Scope
  • Set the ‘Data quality check scope’ to ‘Individual check for each column’. This setting allows the rule to evaluate each column independently, ensuring comprehensive coverage across all data points.
  • Define the ‘Rule success criteria’ by selecting ‘All data quality checks are met (AND)’. This criterion ensures that the rule only passes if all checks within it confirm the absence of duplicates, reinforcing data integrity.
  3. Establish the Check for Duplicates
  • Under the first check, titled ‘Check 1’, choose ‘Duplicate rows’ from the list of available data quality checks. This selection specifically targets duplicate records within the dataset.
  • For the condition to assess the check, use ‘Is equals’. This condition will be used to evaluate the result of the duplicate check against a predefined value.
  • Specify the value as ‘0’ and select ‘rows’ from the dropdown menu. This setting means that the rule will only pass if there are zero duplicate rows found, aligning with our criteria for data quality.

By meticulously configuring this rule, we ensure that our dataset is thoroughly scanned for any duplicate entries, and any found are flagged for review or automatic handling, depending on the broader data governance strategies in place. Implementing this rule is a key step towards certifying that our data remains pristine and that all analyses conducted are based on accurate and reliable information.
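The same zero-duplicates condition is easy to reproduce locally. The following pandas sketch (again assuming a hypothetical CSV export of the dataset) flags fully identical rows, which is what the ‘Duplicate rows’ check counts.

```python
import pandas as pd

df = pd.read_csv("hr_dataset.csv")  # hypothetical local copy of the HR dataset

# df.duplicated() marks every repeated occurrence of a fully identical row.
duplicate_rows = int(df.duplicated().sum())

assert duplicate_rows == 0, f"Found {duplicate_rows} duplicate rows"
```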

Uniqueness of Key Identifiers

Rule: Confirm that each Employee ID, email address and SSN is unique across all records, preventing identity overlaps.

Ensuring the uniqueness of key identifiers such as Employee ID, email address and SSN is crucial for preventing identity overlaps and for maintaining the accuracy of any analysis derived from the dataset. To enforce this, we set up a dedicated rule within AWS Glue DataBrew that verifies the uniqueness of these columns. To begin, access your previously defined data quality ruleset in the DataBrew console. From here, you will add a new rule tailored to these key identifier fields.

  1. Initiate a New Rule
  • Access the DataBrew console and locate your active data quality ruleset. Here, select 'Add another rule' to begin defining a new rule aimed at checking the uniqueness of specified columns.
  • Provide a meaningful name to this new rule, such as 'Check Unique Values'. This name assists in identifying the purpose of the rule within the broader data quality management framework.
  2. Configure the Rule Scope
  • For the 'Data quality check scope,' choose 'Common checks for selected columns.' This setting allows the rule to focus on the uniqueness check across multiple specific columns, rather than the entire dataset or individual columns in isolation.
  • Select 'All data quality checks are met (AND)' for the 'Rule success criteria.' This configuration ensures that for the rule to pass, all specified columns must meet the uniqueness criterion without exception.
  3. Select Columns to Check
  • Under 'Selected columns,' proceed to choose the specific columns to include in this uniqueness check. Select the columns labeled 'Emp ID,' 'e mail,' and 'SSN' from your dataset. These fields are critical identifiers that should not be duplicated within the dataset.
  4. Configure the Uniqueness Check
  • Under 'Check 1,' select 'Unique values' from the list of available data quality checks. This option will assess whether each entry in the specified columns is unique.
  • Set the condition to 'Is equals' and enter the value '100'. Then, choose '% (percent) rows' from the dropdown menu. This setting requires that 100% of the rows in each selected column contain unique values, affirming the absolute uniqueness required for these identifiers.

By diligently configuring this rule, you ensure that critical personal and professional identifiers such as Employee ID, email and SSN are uniquely assigned to individual records, enhancing the reliability and accuracy of your dataset. This step is crucial for maintaining the quality of your data and ensuring that all analyses derived from this dataset are based on correct and non-duplicative information.
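For comparison, a local pandas sketch of the same uniqueness condition is shown below; the column names come from the rule definition above and the CSV file name is, as before, a placeholder.

```python
import pandas as pd

df = pd.read_csv("hr_dataset.csv")  # hypothetical local copy of the HR dataset

# Column names taken from the rule definition above.
for column in ["Emp ID", "e mail", "SSN"]:
    unique_pct = df[column].nunique(dropna=False) / len(df) * 100
    assert unique_pct == 100, f"{column}: only {unique_pct:.2f}% of values are unique"
```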

Non-null Critical Fields

Rule: Employee ID and phone numbers must not contain null values, ensuring complete data for essential contact information.

For the integrity and completeness of our human resources dataset, it is imperative to ensure that certain critical fields, specifically Employee IDs and phone numbers, are always populated. A null value in these fields could indicate incomplete data capture or processing errors, which could lead to inaccuracies in employee management and communication efforts.

  1. Initiate a New Rule
  • In the DataBrew console within your project’s context, navigate to the section where data quality rules are managed and click on ‘Add another rule’. Doing this starts the process for defining a new validation rule.
  • Name the rule descriptively to reflect its purpose, for example, 'Check Not Null Columns'. This helps to easily recognize the rule’s function in subsequent management and audit processes.
  2. Configure the Rule Scope
  • Set the 'Data quality check scope' to 'Common checks for selected columns.' This setting allows the rule to simultaneously evaluate multiple specified columns under a unified criterion.
  • Choose 'All data quality checks are met (AND)' for the 'Rule success criteria'. This ensures that the rule will only pass if all conditions are met across the selected columns, confirming no null values are present.
  3. Select Columns to Check
  • Under 'Selected columns', choose the columns that need validation for non-null values, namely 'Emp ID' and 'Phone No'. These fields are crucial for maintaining operational contact and identity verification within the dataset.
  4. Configure the Non-null Check
  • For 'Check 1', select 'Value is not missing' from the options in the data quality checks. This check will verify that each entry in the specified columns contains data.
  • Set the condition to 'Greater than equals' and specify the threshold as '100'. Choose '% (percent) rows' from the dropdown menu. This configuration demands that 100% of the rows for each selected column meet the condition of not having null values.

By configuring this rule, you ensure that no records in the dataset have null values in the Employee ID and phone number fields, reinforcing the completeness and usability of your HR data. This step is crucial in maintaining high-quality, actionable data that supports effective HR management and operational processes.
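A local equivalent of this completeness check can again be sketched in pandas, under the same assumption of a hypothetical CSV copy of the dataset.

```python
import pandas as pd

df = pd.read_csv("hr_dataset.csv")  # hypothetical local copy of the HR dataset

# Mirrors the 'Greater than equals' 100 '% (percent) rows' condition from the rule above.
for column in ["Emp ID", "Phone No"]:
    non_null_pct = df[column].notna().mean() * 100
    assert non_null_pct >= 100, f"{column}: {100 - non_null_pct:.2f}% of values are missing"
```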

Validation of Numerical Data

Rule: Employee IDs should be integers, and the age field should not contain negative values, maintaining logical data integrity.

  1. Initiate a New Rule
  • Navigate to the 'Rules' section within the DataBrew project environment and select 'Add another rule' to initiate the creation of a new data quality rule.
  • Provide a clear and descriptive name for this rule, such as 'Check Positive Values'. This naming helps in quickly identifying the rule's purpose and ensures clarity in data quality reports.
  2. Configure the Rule Scope
  • Set the 'Data quality check scope' to 'Individual check for each column.' This allows the rule to apply specific checks to each column independently, ensuring that each field meets the set criteria without being influenced by other data columns.
  • Select 'All data quality checks are met (AND)' for the 'Rule success criteria'. This setting confirms that every condition specified under this rule must be satisfied for the rule to pass, ensuring comprehensive validation.
  3. Establish the Check for Numeric Values
  • Under 'Check 1', choose 'Numeric values' from the list of available data quality checks. This selection specifies that the rule should focus on validating numeric data within the chosen columns.
  • For the specific columns to validate, choose 'Emp ID' from the dropdown menu initially. This step is crucial as it directs the rule to apply the following conditions specifically to the Employee ID field.
  • Set the condition to 'Greater than equals'. This condition ensures that the rule checks whether the values are greater than or equal to the threshold you will define next.
  • For the value, select 'Custom value' and enter '0'. This configuration mandates that the Employee IDs must be zero or positive, effectively excluding any negative numbers which are not valid for this field.

By implementing this rule, you will effectively ensure that critical numeric fields such as Employee ID and age do not contain negative values, thus upholding the logical consistency and reliability of your dataset. This proactive approach in data validation is integral to maintaining high data quality standards necessary for accurate and reliable HR analytics and operations.
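To illustrate what this rule enforces, here is a local pandas sketch of the same condition for the Employee ID column (the age field can be checked with the identical pattern); the CSV file name is, once again, a placeholder.

```python
import pandas as pd

df = pd.read_csv("hr_dataset.csv")  # hypothetical local copy of the HR dataset

# Coerce to numeric: anything that is not a valid number becomes NaN.
emp_id = pd.to_numeric(df["Emp ID"], errors="coerce")

assert emp_id.notna().all(), "Non-numeric values found in 'Emp ID'"
assert (emp_id >= 0).all(), "Negative values found in 'Emp ID'"
```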

There are even more data quality rules to set, but we will extend this topic further in the eBook: "Data Quality No-Code Automation with AWS Glue DataBrew: A Proof of Concept", where we present the entire data quality process. We will demonstrate profile job configuration, data quality validation and how to clean the dataset.

This eBook will be available soon, offering you the insights and tools necessary to maximize the potential of your datasets and more. Ensure your data is accurate, reliable and ready for impactful analysis. Click here to join the waiting list.

