fix: restore column-vs-literal comparison in isGreaterThan/isLessThan family (#227)#273
Open
nikolauspschuetz wants to merge 1 commit into
Conversation
Author
|
Ready for review. Restores column-vs-literal comparison across the |
…ly (awslabs#227) Deequ 2.0.x's Check.isLessThan/isLessThanOrEqualTo/isGreaterThan/ isGreaterThanOrEqualTo forward columns = List(columnA, columnB) to satisfies, which makes Deequ require both operands to be existing columns. This regressed the long-supported column-vs-literal usage (e.g. isGreaterThanOrEqualTo("cluster_size", "1")), failing with 'Input data does not include column 1!' (issue awslabs#227). Route the comparator family through Deequ's satisfies with an empty columns list (the pre-2.0 behaviour), building the SQL predicate in the wrapper. columnB may now be a column name or a SQL literal/expression. Column-vs-column comparisons are unchanged. Adds regression tests for column-vs-literal comparisons.
50fc96c to
83c28e2
Compare
Author
|
Thanks for the automated review pass. The current revision addresses these findings:
Resolving the threads accordingly. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
isGreaterThanOrEqualTo(and the rest of the comparator family) regressed for column-vs-literal comparisons. This used to work:but now fails with the constraint message
Input data does not include column 1!— the second operand is interpreted strictly as a column name. Reported in #227.Root cause
This rode in with the bundled Deequ jar upgrade to 2.0.x (PyDeequ is pinned to
com.amazon.deequ:deequ:2.0.8-...viapydeequ/configs.py). In Deequ 1.2.x the comparators built a plain SQL predicate ("<colA> >= <colB>") and passed nocolumnslist, so a literal second operand worked. In Deequ 2.0.x the comparators passcolumns = List(columnA, columnB), and Deequ validates that every entry is a real dataframe column — so literals fail. The PyDeequ wrappers themselves never changed; they just forward to the regressed Scala methods.Fix
Route the whole comparator family (
isLessThan,isLessThanOrEqualTo,isGreaterThan,isGreaterThanOrEqualTo) through Deequ's publicsatisfies(...)with an emptycolumnslist — exactly the pre-2.0 behavior. Column-vs-column comparisons are unchanged; column-vs-literal/expression works again. Applied across the entire family for consistency.Note: as with Deequ's own
satisfies,columnBis treated as a Spark SQL expression, so string literals must be quoted by the caller (e.g."'foo'"). This matches the original 1.2.x semantics.Tests
Added to
tests/test_checks.py:test_comparator_against_literal— column-vs-literal for all four comparators, expecting Success.test_fail_comparator_against_literal— a failing literal comparison, expecting Failure.Validated against real Spark 3.5 / Deequ 2.0.8: the 2 new tests plus all 8 existing column-vs-column comparator tests pass (10 passed). CI will exercise the full pyspark 3.1/3.2/3.3/3.5 matrix.
Closes #227