Data quality testing

Data quality testing is a crucial component of data observability, ensuring that data meets standards of accuracy, consistency, completeness, and reliability before being used in business operations or analytics. In practice, this means validating data against predefined rules, such as checking for duplicates, verifying formats, enforcing referential integrity, and confirming that required fields are populated. Effective data quality testing builds trust in data and enables confident, data-driven decisions.

This post explores the data observability workflow at incident.io, addressing common challenges encountered and their solutions.

Data observability workflow at incident.io

incident.io relies on dbt’s native testing features for data observability, running transformations and tests in the same pipeline across both production and CI. Custom data tests and schema test overrides provide the flexibility to meet our specific requirements.

What challenges did we face?

Relationship tests

Overview

One powerful schema test is the relationship test, which verifies that every value in a column has a matching value in a column of another table. However, timing issues in ingestion pipelines can lead to “flaky” test failures.
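Conceptually, a relationships test compiles to a query along these lines, where any returned row counts as a failure (the model and column names here are illustrative):

```sql
-- Child rows whose foreign key has no matching row in the parent table.
select child.organisation_id
from {{ ref('incidents') }} as child
left join {{ ref('organisations') }} as parent
    on child.organisation_id = parent.id
where child.organisation_id is not null
  and parent.id is null
```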

Challenge

Fluctuations in data ingestion timing meant that related tables were not always equally fresh: a child table could briefly reference rows that hadn’t yet landed in its parent table, causing relationship tests to fail even though the data became consistent shortly afterwards.

Solution

To address this, we adopted a pragmatic solution: a relationship test is marked as failed only if a value is missing from its parent table in two consecutive runs. This minimizes false failures caused by timing discrepancies. Two pieces made this possible:

  • Using dbt’s store_failures feature to log the missing column values from relationship tests, enabled by the macro below:

```sql
{% macro should_store_failures() %}
    {% set config_store_failures = config.get('store_failures') %}
    {# In deployed environments, store failures by default for any
       error-severity test that doesn't set store_failures explicitly. #}
    {% if target.name not in ["test", "dev"]
        and config_store_failures is none
        and config.get("severity") == "error" %}
        {% set config_store_failures = true %}
    {% endif %}
    {# Otherwise, fall back to the global --store-failures flag. #}
    {% if config_store_failures is none %}
        {% set config_store_failures = flags.STORE_FAILURES %}
    {% endif %}
    {% do return(config_store_failures) %}
{% endmacro %}
```
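Because project macros take precedence over dbt’s built-in ones, defining `should_store_failures()` like this overrides the default behavior: error-severity tests running outside local environments store their failures automatically, without every test having to set `store_failures` itself.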

  • Customizing dbt’s default relationship test to report a value only if it was also missing in the previous test run, as sketched below.

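As a rough sketch of that customization, with the recurrence check expressed as an inner join (the model, column, and relation names here are illustrative, not our actual schema):

```sql
with current_failures as (

    -- Child values with no match in the parent table on this run.
    select child.organisation_id as missing_value
    from {{ ref('incidents') }} as child
    left join {{ ref('organisations') }} as parent
        on child.organisation_id = parent.id
    where child.organisation_id is not null
      and parent.id is null

),

previous_failures as (

    -- Failures logged by the previous run via store_failures;
    -- the relation name here is hypothetical.
    select missing_value
    from analytics_audit.previous_relationship_failures

)

-- Only values missing on two consecutive runs count as failures.
select current_failures.missing_value
from current_failures
inner join previous_failures
    on current_failures.missing_value = previous_failures.missing_value
```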
On the first test run there are no stored failures to compare against, so the `is_recurrent` flag is false for every value and nothing is reported. To turn recurring failures into alerts, we then parse dbt’s artifacts; here’s a Python helper function that parses a manifest directory:

```python
import json
from typing import Any, Dict, List


def parse_manifest(manifest_dir: str) -> List[Dict[str, Any]]:
    """Extract one payload per node from dbt's manifest.json artifact."""
    results = []
    with open(f"{manifest_dir}/manifest.json") as manifest_file:
        manifest: Dict[str, Any] = json.load(manifest_file)
    invocation_id: str = manifest["metadata"]["invocation_id"]
    for node, value in manifest["nodes"].items():
        node_payload: Dict[str, Any] = {
            "invocation_id": invocation_id,
            "config": value["config"],
            "unique_id": node,
            # Fully qualified name of the table this node materializes.
            "database_id": f"{value['database']}.{value['schema']}.{value.get('alias', value['name'])}",
            "path": value["path"],
        }
        results.append(node_payload)
    return results
```
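By default dbt writes `manifest.json` (alongside `run_results.json`) to its `target/` directory, so calling `parse_manifest("target")` after a run yields one payload per node, including the fully qualified table it materializes.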

To make alerts actionable, we enrich the incident.io payload with ready-to-run queries against the failing tables, using the table names produced by the helper above, and trigger an incident via the incident.io API. The snippet below demonstrates how:

```python
# Unique IDs of the tests that failed on the latest run.
failed_tests: List[str] = [
    node["unique_id"]
    for node in parse_run_results(run_results_dir)
    if node["status"] == "fail"
]

# Fully qualified names of the tables behind those failed tests.
database_ids: List[str] = [
    node["database_id"]
    for node in parse_manifest(manifest_dir)
    if node["unique_id"] in failed_tests
]

# Attach ready-to-run queries to the alert payload built elsewhere.
payload["metadata"]["queries"] = "; ".join(
    [f"SELECT * FROM {database_id}" for database_id in database_ids]
)
```
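The snippet above assumes a `parse_run_results` helper analogous to `parse_manifest`. A minimal sketch, based on the structure of dbt’s `run_results.json` artifact and reusing the imports from earlier:

```python
def parse_run_results(run_results_dir: str) -> List[Dict[str, Any]]:
    """Extract the ID and status of every node executed by dbt."""
    with open(f"{run_results_dir}/run_results.json") as run_results_file:
        run_results: Dict[str, Any] = json.load(run_results_file)
    # Each entry in "results" covers one executed node (tests included)
    # with a final status such as "pass", "fail", or "error".
    return [
        {"unique_id": result["unique_id"], "status": result["status"]}
        for result in run_results["results"]
    ]
```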

Additionally, we integrated this pipeline with a custom alert source in incident.io, so failed tests raise incidents directly in the platform. This customization allows for a more streamlined incident management process within incident.io.
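For completeness, delivering that payload could look something like the sketch below; the endpoint, environment variables, and auth scheme are placeholders rather than incident.io’s actual alert API:

```python
import os

import requests

# Placeholder endpoint and token for the configured alert source;
# substitute the real URL and authentication for your setup.
alert_source_url = os.environ["ALERT_SOURCE_URL"]
api_token = os.environ["INCIDENT_API_TOKEN"]

response = requests.post(
    alert_source_url,
    headers={"Authorization": f"Bearer {api_token}"},
    json=payload,  # the alert body assembled above
    timeout=10,
)
response.raise_for_status()
```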

In conclusion, while our focus was on addressing common data observability challenges, it’s crucial to adapt solutions to suit the specific needs and data stacks of individual organizations. For more insights on data quality testing and related topics, feel free to subscribe with your email to receive regular updates.
