Airflow Data Quality Provider #69413
Draft
gopidesupavan wants to merge 2 commits into
Adds a new apache-airflow-providers-dq provider for DbApiHook-based data quality checks.

Airflow already has SQL check operators, and many users rely on them for data
quality today. This provider does not replace that path; it adds a small
DQRule/RuleSet layer for checks that need stable rule identity, persisted
history, and a connection to Airflow assets. That makes quality results easier
to inspect over time, lets downstream asset consumers gate on recent quality,
and also gives LLM-assisted workflows one schema to generate when proposing
checks from table context. Execution still goes through existing
DbApiHook connections.
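To make "stable rule identity" concrete, here is a minimal sketch of what a named rule and its evaluation could look like. It is not the provider's API: `SketchRule`, `run_rules`, the `max_value` field, and the `orders` table are hypothetical names chosen for illustration, and an in-memory `sqlite3` connection stands in for a real DbApiHook connection.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRule:
    """Hypothetical stand-in for a named rule; not the provider's DQRule."""

    rule_id: str      # stable identity, so results can be compared across runs
    sql: str          # query expected to return a single numeric value
    max_value: float  # the rule passes when the value is <= max_value


def run_rules(conn, rules):
    """Run each rule's SQL and return one result dict per rule."""
    results = []
    for rule in rules:
        value = conn.execute(rule.sql).fetchone()[0]
        results.append(
            {"rule_id": rule.rule_id, "value": value, "passed": value <= rule.max_value}
        )
    return results


# sqlite3 is only a stand-in here for a connection obtained from a DbApiHook.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, None), (3, 5.0)])

rules = [
    SketchRule("orders.amount.null_count", "SELECT COUNT(*) FROM orders WHERE amount IS NULL", 0),
    SketchRule("orders.id.duplicate_count", "SELECT COUNT(*) - COUNT(DISTINCT id) FROM orders", 0),
]
results = run_rules(conn, rules)
```

Because each result carries the same `rule_id` on every run, a history for one rule can be assembled without parsing task logs.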
Ships:
DQRule and RuleSet models for named data quality rules.
common.sql/DbApiHook, plus custom_sql for database-specific or more complex checks.
DQCheckOperator and the @task.dq_check TaskFlow decorator.
[dq] results_path for task, run, and rule-level history.
history.
asset_quality() and require_quality(), that attach provider-owned quality
metadata to assets without changing Airflow core.
LLM-generated rules.
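As a rough illustration of the asset gating idea, the sketch below uses plain dictionaries in place of an asset's extra and an asset event's extra. The keys `airflow.dq` and `airflow.dq.result` come from this description; the fields `min_pass_rate` and `pass_rate` and the function `is_quality_ok` are hypothetical and are not the provider's `require_quality()`.

```python
# Plain dicts stand in for Asset.extra and for an asset event's extra.
asset_extra = {"airflow.dq": {"min_pass_rate": 0.95}}  # hypothetical field name


def is_quality_ok(asset_extra: dict, event_extra: dict) -> bool:
    """Return True when the latest recorded quality result meets the configured bar."""
    config = asset_extra.get("airflow.dq", {})
    result = event_extra.get("airflow.dq.result")
    if result is None:
        # No recorded quality result: treat the gate as not satisfied.
        return False
    # Hypothetical default: require a perfect pass rate when none is configured.
    return result["pass_rate"] >= config.get("min_pass_rate", 1.0)


good_event = {"airflow.dq.result": {"pass_rate": 1.0}}
bad_event = {"airflow.dq.result": {"pass_rate": 0.5}}
```

A downstream consumer could call such a check before doing its own work, and skip or fail when the most recent event does not clear the threshold.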
This first version is intentionally small. It focuses on a deterministic rule
shape, SQL execution through common.sql, persisted results, and lightweight
visibility in the Airflow UI. It is not trying to be a full data quality
platform in the first drop.
Design decisions:
adding new metadata DB tables in the first provider drop. This keeps the
provider self-contained, avoids Airflow core migrations, and lets deployments
choose a durable store such as S3/GCS/local files via
[dq] results_path. The backend stores keyed JSON records for task runs, task
instances, and per-rule history so the UI can read common views without
scanning unrelated runs.
not by changing Airflow core. Static quality configuration is attached to
Asset.extra["airflow.dq"]; runtime summaries are attached to asset events
under extra["airflow.dq.result"]. This lets users try asset quality gating
now, while leaving room to discuss deeper asset integration later if the
provider gets traction.
DbApiHook / SQL execution because Airflow already has strong provider coverage
through common.sql. File and object-store data checks are left for a later
iteration.
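The keyed JSON records mentioned in the design decisions could be laid out roughly as follows. This is a sketch under assumptions: the key scheme (`runs/...`, `rules/...`), the helper names `write_record` and `read_record`, and the record fields are all hypothetical, and a temporary local directory stands in for whatever `[dq] results_path` points to.

```python
import json
import tempfile
from pathlib import Path


def write_record(root: Path, key: str, record: dict) -> Path:
    """Store one JSON record under a slash-separated key (hypothetical layout)."""
    path = root / f"{key}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, sort_keys=True))
    return path


def read_record(root: Path, key: str) -> dict:
    """Load the JSON record stored under a key."""
    return json.loads((root / f"{key}.json").read_text())


# A temporary directory stands in for the configured results path.
root = Path(tempfile.mkdtemp())

# Hypothetical keys: one per task run, one per rule and run date.
run_key = "runs/orders_dag/2024-01-01/check_orders"
rule_key = "rules/orders.amount.null_count/2024-01-01"
write_record(root, run_key, {"passed": 1, "failed": 1})
write_record(root, rule_key, {"value": 1, "passed": False})

# Per-rule history is a listing of one prefix; unrelated runs are never read.
rule_dir = root / "rules" / "orders.amount.null_count"
rule_history = sorted(p.stem for p in rule_dir.glob("*.json"))
```

Keying by rule as well as by run is what lets a history view read a single prefix instead of scanning every run.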
Possible later iteration:
files or other object stores and runs quality rules directly against that
data. This PR deliberately starts with the DbApiHook path first.

Was generative AI tooling used to co-author this PR?
Generated-by: following the guidelines