Airflow Data Quality Provider #69413

Draft
gopidesupavan wants to merge 2 commits into apache:main from gopidesupavan:airflow-dq-provider

@gopidesupavan

Member

Adds a new apache-airflow-providers-dq provider for DbApiHook-based data
quality checks.

Airflow already has SQL check operators, and many users rely on them for data
quality today. This provider does not replace that path; it adds a small
DQRule / RuleSet layer for checks that need stable rule identity, persisted
history, and a connection to Airflow assets. That makes quality results easier
to inspect over time, lets downstream asset consumers gate on recent quality,
and also gives LLM-assisted workflows one schema to generate when proposing
checks from table context. Execution still goes through existing DbApiHook
connections.

Ships:

  • DQRule and RuleSet models for named data quality rules.
  • Built-in SQL checks for common table and column checks, executed through
    common.sql / DbApiHook, plus custom_sql for database-specific or more
    complex checks.
  • DQCheckOperator and the @task.dq_check TaskFlow decorator.
  • A configurable results backend under [dq] results_path for task, run, and
    rule-level history.
  • A read-only API and minimal UI plugin for viewing task/run results and rule
    history.
  • Experimental asset helpers, asset_quality() and require_quality(), that
    attach provider-owned quality metadata to assets without changing Airflow
    core.
  • Documentation and example Dags covering end-to-end usage with and without
    LLM-generated rules.
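
To make the rule shape concrete, here is a minimal sketch of named rules executed over a plain DB-API connection. It is not the provider's API: `SketchRule`, `null_count_rule`, and `run_rules` are hypothetical names, the pass/fail threshold is an assumption, and `sqlite3` stands in for whatever connection a DbApiHook would return.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRule:
    """Hypothetical stand-in for a named rule; NOT the provider's DQRule."""

    rule_id: str           # stable identity, usable as a history key
    sql: str               # query returning a single numeric value
    max_value: float = 0   # assumption: rule passes when value <= max_value


def null_count_rule(table: str, column: str) -> SketchRule:
    """Build a 'no NULLs in column' rule with a deterministic id."""
    return SketchRule(
        rule_id=f"{table}.{column}.not_null",
        sql=f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL",
    )


def run_rules(conn, rules):
    """Run each rule through a DB-API cursor and collect result dicts."""
    results = []
    for rule in rules:
        cur = conn.cursor()
        cur.execute(rule.sql)
        value = cur.fetchone()[0]
        cur.close()
        results.append(
            {"rule_id": rule.rule_id, "value": value, "passed": value <= rule.max_value}
        )
    return results


# In-memory table with one NULL amount, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, None), (3, 4.0)])

rules = [
    null_count_rule("orders", "id"),
    null_count_rule("orders", "amount"),
    # Custom SQL check: count rows that violate the expectation.
    SketchRule(
        rule_id="orders.amount.non_negative",
        sql="SELECT COUNT(*) FROM orders WHERE amount < 0",
    ),
]
results = run_rules(conn, rules)
for result in results:
    print(result)
```

Because each rule carries its own `rule_id`, the same identifier can key history records across runs regardless of which task executed it.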

This first version is intentionally small. It focuses on a deterministic rule
shape, SQL execution through common.sql, persisted results, and lightweight
visibility in the Airflow UI. It is not trying to be a full data quality
platform in the first drop.

Design decisions:

  • Results are stored through an object-storage/local-file backend instead of
    adding new metadata DB tables in the first provider drop. This keeps the
    provider self-contained, avoids Airflow core migrations, and lets deployments
    choose a durable store such as S3/GCS/local files via [dq] results_path.
    The backend stores keyed JSON records for task runs, task instances, and
    per-rule history so the UI can read common views without scanning unrelated
    runs.
  • Asset support is implemented by extending assets with provider-owned metadata,
    not by changing Airflow core. Static quality configuration is attached to
    Asset.extra["airflow.dq"]; runtime summaries are attached to asset events
    under extra["airflow.dq.result"]. This lets users try asset quality gating
    now, while leaving room to discuss deeper asset integration later if the
    provider gets traction.
  • The first release starts with DbApiHook / SQL execution because Airflow
    already has strong provider coverage through common.sql. File and
    object-store data checks are left for a later iteration.
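
The first two decisions can be illustrated with a small sketch, under stated assumptions. The key layout (`runs/...` and `rules/...`), the summary fields, and the helper names `write_record`, `persist_run`, and `quality_gate` are all hypothetical; only the `airflow.dq.result` key comes from the description above. A local temporary directory stands in for the configured results path.

```python
import json
import tempfile
from pathlib import Path


def write_record(root: Path, key: str, record: dict) -> Path:
    """Store one JSON record under a slash-separated key."""
    path = root / f"{key}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, sort_keys=True))
    return path


def read_record(root: Path, key: str) -> dict:
    """Load a record by key without scanning any other run."""
    return json.loads((root / f"{key}.json").read_text())


def persist_run(root: Path, dag_id: str, run_id: str, task_id: str, results: list) -> dict:
    """Write one task-level record plus one history record per rule."""
    summary = {
        "total": len(results),
        "failed": sum(not r["passed"] for r in results),
    }
    write_record(
        root,
        f"runs/{dag_id}/{run_id}/{task_id}",
        {"summary": summary, "results": results},
    )
    for r in results:
        # Per-rule history is keyed by rule id, so it can be listed directly.
        write_record(root, f"rules/{r['rule_id']}/{run_id}", r)
    return summary


def quality_gate(event_extra: dict, max_failed: int = 0) -> bool:
    """Assumed gating policy: pass only if a summary exists and failures are within budget."""
    summary = event_extra.get("airflow.dq.result")
    return summary is not None and summary["failed"] <= max_failed


root = Path(tempfile.mkdtemp())
results = [
    {"rule_id": "orders.id.not_null", "passed": True},
    {"rule_id": "orders.amount.not_null", "passed": False},
]
summary = persist_run(root, "sales_dag", "run_1", "dq_check", results)

# Shape of the runtime summary as it might ride on an asset event's extra.
event_extra = {"airflow.dq.result": summary}
history = sorted(p.name for p in (root / "rules" / "orders.amount.not_null").glob("*.json"))
print(summary, quality_gate(event_extra), history)
```

Keying records by run and by rule is what allows a task view and a rule-history view to each read a single prefix; whether the provider uses this exact layout is not stated in this description.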

Possible later iteration:

  • File/object-store based checks, where Airflow reads data from S3/GCS/local
    files or other object stores and runs quality rules directly against that
    data. This PR deliberately starts with the DbApiHook path first.
  • OpenLineage integration for data quality facets.

Was generative AI tooling used to co-author this PR?
  • Yes

Generated-by: following the guidelines



@gopidesupavan

gopidesupavan commented Jul 5, 2026

Member Author

LLM-generated rules, executed via the dq provider

[Screenshot 2026-07-05 at 13 29 04] [Screenshot 2026-07-05 at 13 29 15]

Overall view of the rules executed by the task and the run history

[Screenshot 2026-07-05 at 13 34 42] [Screenshot 2026-07-05 at 13 34 57]
