Skip to content

fix: make binningUdf/maxBins optional in hasNumberOfDistinctValues/hasHistogramValues (#81)#274

Open
nikolauspschuetz wants to merge 1 commit into
awslabs:masterfrom
nikolauspschuetz:fix/issue-81-optional-binning-args
Open

fix: make binningUdf/maxBins optional in hasNumberOfDistinctValues/hasHistogramValues (#81)#274
nikolauspschuetz wants to merge 1 commit into
awslabs:masterfrom
nikolauspschuetz:fix/issue-81-optional-binning-args

Conversation

@nikolauspschuetz

Copy link
Copy Markdown

Problem

hasNumberOfDistinctValues and hasHistogramValues force the caller to pass binningUdf and maxBins, but Deequ's Scala API defines both with defaults. So a natural call:

check.hasNumberOfDistinctValues("col", lambda x: x == 6)

raises TypeError: ... missing 2 required positional arguments: 'binningUdf' and 'maxBins'. Reported in #81.

Fix

Make binningUdf and maxBins optional (=None). When omitted, the wrapper substitutes Deequ's own defaults pulled from the JVM via the ...$default$N accessors — the same pattern PyDeequ already uses in satisfies():

  • binningUdf default → Scala Option.empty
  • maxBins default → Histogram.MaximumAllowedDetailBins

Pulling defaults from the live jar (rather than hardcoding) keeps them correct across Deequ versions. Fully backward-compatible: existing positional calls are unchanged.

Tests

Added to tests/test_checks.py:

  • test_hasNumberOfDistinctValues_without_binning_args
  • test_hasHistogramValues_without_binning_args

The pre-existing positional-arg tests are left untouched to prove backward compatibility. Validated against real Spark 3.5 / Deequ 2.0.8 (8 passed). CI runs the full pyspark matrix.

Closes #81

hasNumberOfDistinctValues and hasHistogramValues forced callers to pass
binningUdf and maxBins, although Deequ's Scala API marks both as optional
with defaults (binningUdf=None, maxBins=Histogram.MaximumAllowedDetailBins).
Default to the Scala-side defaults via the apply$default$N accessors when
the args are omitted, mirroring the existing satisfies() pattern. Existing
positional callers remain backward compatible.

Fixes awslabs#81

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found.


Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: 416310f3) — may not be fully accurate. Reply if this doesn't help.

@nikolauspschuetz nikolauspschuetz marked this pull request as ready for review June 27, 2026 17:30
@nikolauspschuetz

Copy link
Copy Markdown
Author

Ready for review. Makes binningUdf/maxBins optional on hasNumberOfDistinctValues/hasHistogramValues (Deequ's Scala API already defaults them), using the same $default$N accessor pattern as satisfies. Backward-compatible. Validated locally on Spark 3.5: full test_checks.py → 88 passed. cc @sudsali @chenliu0831 — would appreciate your review. Closes #81.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

hasNumberOfDistinctValues and hasHistogramValues require arguments that should be optional?

1 participant