feat: add Distance analyzer for categorical feature drift (L-infinity, chi-squared) (#164) by nikolauspschuetz · Pull Request #276 · awslabs/python-deequ

nikolauspschuetz · 2026-06-26T18:57:19Z

Problem

Deequ's Distance analyzer (com/amazon/deequ/analyzers/Distance.scala) computes distribution distance for feature-drift detection (L-infinity and chi-squared), but it was not exposed in PyDeequ. Requested in #164 (labeled good first issue / help wanted / feature request).

What this adds

A Distance helper plus a CategoricalDistanceMethod enum exposing categorical distance:

from pydeequ.analyzers import Distance, CategoricalDistanceMethod

dist = Distance(spark)
# distribution dicts as produced by the Histogram analyzer: {category: count}
d = dist.categoricalDistance({"a": 5, "b": 5}, {"a": 10},
                             method=CategoricalDistanceMethod.LInfinity)

Supports both L-infinity and chi-squared, with the full alpha / Yates / Cochran parameters.
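For readers unfamiliar with the metric, the raw L-infinity statistic can be sketched in pure Python as the largest absolute difference between the two distributions' relative frequencies. This is an illustration only, not Deequ's or PyDeequ's code: the function name `l_infinity_distance` is hypothetical, it assumes non-empty inputs with positive totals, and it ignores the low-sample correction and the alpha parameter.

```python
from typing import Mapping


def l_infinity_distance(dist1: Mapping[str, int], dist2: Mapping[str, int]) -> float:
    """Raw L-infinity distance between two {category: count} distributions.

    Hypothetical helper for illustration; assumes both totals are positive.
    """
    total1 = sum(dist1.values())
    total2 = sum(dist2.values())
    # Compare relative frequencies over the union of categories;
    # a category missing from one side counts as frequency 0.
    categories = set(dist1) | set(dist2)
    return max(
        abs(dist1.get(c, 0) / total1 - dist2.get(c, 0) / total2)
        for c in categories
    )


# Same inputs as the PR example: frequencies are {a: 0.5, b: 0.5} vs {a: 1.0}.
print(l_infinity_distance({"a": 5, "b": 5}, {"a": 10}))  # prints 0.5
```

Whether the JVM call returns this same value depends on the correction settings, so treat the number above as the uncorrected statistic only.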

Design note (please scrutinize)

Deequ's Distance is a plain object with static-style methods, not an Analyzer subclass, so it cannot go through AnalysisRunBuilder.addAnalyzer(...).run(). I exposed it faithfully as a helper rather than faking analyzer integration.

categoricalDistance requires a Scala mutable.Map[String, Long]. py4j auto-unboxes individual java.lang.Long back to Python ints (re-entering as Integer), which makes Deequ's e._2.toDouble throw ClassCastException. To keep values genuinely Long-typed, each dict is round-tripped through a one-row Spark DataFrame with a MapType(StringType, LongType) column and collected JVM-side. Tradeoff: a small Spark job per distribution — acceptable for a drift helper, and the only reliable path found. Open to a lighter approach if reviewers prefer one.

Scope: categorical distance only. numericalDistance requires constructing a JVM QuantileNonSample[Double] with no clean Python path and is intentionally left out (matches the issue).

Tests

Added to tests/test_analyzers.py: test_Distance_categorical_LInfinity and test_Distance_categorical_Chisquare. Validated against the live Deequ 2.0.8 jar on Spark 3.5 (2 passed; full analyzer suite 27 passed / 22 pre-existing xfails, no regressions). Docs updated in docs/analyzers.md.

Closes #164

github-actions

Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: 416310f3) — may not be fully accurate. Reply if this doesn't help.

nikolauspschuetz · 2026-06-27T17:30:42Z

Ready for review. Adds a Distance helper (categorical L-infinity + chi-squared) for feature-drift detection, exposing Deequ's Distance object faithfully (it's not an Analyzer, so it's a helper rather than addAnalyzer). Validated against the live Deequ 2.0.8 jar on Spark 3.5: test_analyzers.py → 27 passed / 22 xfailed. See the PR body for a reviewer note on the py4j Long-boxing handling. cc @sudsali @chenliu0831 @rdsharma26 — would appreciate your review. Closes #164.

Expose Deequ's com.amazon.deequ.analyzers.Distance categorical distance (L-infinity and chi-squared) in PyDeequ. Distance is a Scala object with static-style methods rather than an Analyzer subclass, so it is wrapped as a Distance helper class (not via addAnalyzer) that bridges two {category: count} distributions to the JVM and returns the numeric distance.

- Add Distance class with categoricalDistance(distribution1, distribution2, correctForLowNumberOfSamples, method, alpha, Yates/Cochran thresholds)
- Add CategoricalDistanceMethod enum (LInfinity, Chisquare)
- Build the required Scala mutable.Map[String, Long] via explicit Long-boxing (java.lang.Long[] array -> genericWrapArray -> zip -> toMap), with NO Spark job. Array element slots preserve Long boxing JVM-side and no value is read back into Python, so the typing survives (py4j otherwise auto-unboxes per-value Longs to int, triggering a ClassCastException in Deequ's e._2.toDouble). Uses only core Scala 2.12 stdlib (no ambient Java->Scala implicits), so it is portable across all supported Spark builds 3.1-3.5.
- Guard against empty distributions with a clear ValueError
- Document in docs/analyzers.md and add tests covering L-infinity, chi-squared, single-category, empty-dict ValueError, and the invalid-method path

Numerical distance is intentionally out of scope: it requires a JVM QuantileNonSample[Double] with no convenient Python construction path.

The alpha argument to categoricalDistance flows through LInfinityMethod(scala.Option.apply(alpha)) -- the only Option[Double] bridge in the Distance wrapper -- and was previously untested. Assert that two different alpha significance levels yield different distances, proving the value is genuinely consumed JVM-side.
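To see why two alpha values would be expected to give different distances, here is a small pure-Python sketch. The coefficient c(alpha) = sqrt(-ln(alpha / 2) / 2) is the standard two-sample Kolmogorov-Smirnov critical-value formula. How the correction is applied to the raw statistic (subtracting the threshold and clamping at zero) is an assumption made for illustration, not a statement about Deequ's implementation, and both function names are hypothetical.

```python
import math


def ks_critical_value(alpha: float) -> float:
    """Two-sample KS coefficient c(alpha) = sqrt(-ln(alpha / 2) / 2)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    return math.sqrt(-math.log(alpha / 2.0) / 2.0)


def corrected_distance(raw: float, n: int, m: int, alpha: float) -> float:
    """Hypothetical KS-style correction of a raw L-infinity statistic.

    ASSUMPTION: the threshold c(alpha) * sqrt((n + m) / (n * m)) is
    subtracted from the raw value and the result is clamped at 0.
    n and m are the total counts of the two distributions.
    """
    threshold = ks_critical_value(alpha) * math.sqrt((n + m) / (n * m))
    return max(0.0, raw - threshold)


# A stricter significance level gives a larger threshold, hence a smaller result.
loose = corrected_distance(0.5, n=1000, m=1000, alpha=0.05)
strict = corrected_distance(0.5, n=1000, m=1000, alpha=0.01)
print(loose > strict)  # prints True
```

Under this assumption, a test comparing two alpha values only discriminates when the sample sizes are large enough that neither result is clamped to zero.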

github-actions

No issues found.

nikolauspschuetz · 2026-06-29T17:20:01Z

Thanks for the automated review pass. The current revision addresses these findings:

- Map construction: _to_scala_mutable_long_map no longer uses createDataFrame/collectAsList. It builds the mutable.HashMap[String, Long] from JVM Long[] arrays via genericWrapArray/zip/toMap, so there is no per-call Spark job and no reliance on a java.util.Map -> Scala implicit conversion.
- Empty distributions: guarded. categoricalDistance raises ValueError if either distribution dict is empty.
- correctForLowNumberOfSamples docstring: matches Deequ semantics. True returns the raw/unscaled statistic, False returns the normalized (KS-corrected L-infinity / chi-squared) result.
- Seq.canBuildFrom (Scala 2.12): documented as a known constraint in the method docstring; all supported Spark builds (3.1-3.5) ship Scala 2.12.
- Tests: added test_Distance_categorical_single_category, test_Distance_categorical_empty_dict_raises, and test_Distance_categorical_invalid_method_raises (plus an alpha test) covering the edge/error paths.

Resolving the threads accordingly.
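The input guards described above can be sketched as a small pre-flight check that runs before anything crosses into the JVM. This is an illustration under stated assumptions, not the PR's code: `Method` is a hypothetical stand-in for CategoricalDistanceMethod, `validate_inputs` is a hypothetical name, and raising ValueError for an invalid method is an assumption (the thread confirms ValueError only for empty dicts).

```python
from enum import Enum
from typing import Mapping


class Method(Enum):
    """Hypothetical stand-in for CategoricalDistanceMethod."""
    LInfinity = "LInfinity"
    Chisquare = "Chisquare"


def validate_inputs(dist1: Mapping[str, int], dist2: Mapping[str, int], method: object) -> None:
    """Reject bad arguments in Python so failures surface as clear errors."""
    # Empty dicts would otherwise fail later with an opaque JVM-side error.
    if not dist1 or not dist2:
        raise ValueError("both distributions must be non-empty {category: count} dicts")
    # ASSUMPTION: an unsupported method is also reported as ValueError.
    if not isinstance(method, Method):
        raise ValueError(f"unsupported distance method: {method!r}")


validate_inputs({"a": 5, "b": 5}, {"a": 10}, Method.LInfinity)  # passes silently

try:
    validate_inputs({}, {"a": 10}, Method.LInfinity)
except ValueError as err:
    print(f"rejected: {err}")
```

Checking on the Python side keeps the error message readable, since an exception raised inside the JVM would reach the caller wrapped in a py4j error.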

github-actions Bot reviewed Jun 26, 2026

Comment thread pydeequ/analyzers.py

Comment thread pydeequ/analyzers.py Outdated

Comment thread pydeequ/analyzers.py Outdated

Comment thread tests/test_analyzers.py

nikolauspschuetz marked this pull request as ready for review June 27, 2026 17:30

nikolauspschuetz force-pushed the feat/issue-164-distance-analyzer branch from 8368609 to 57cdd69 Compare June 27, 2026 17:43

github-actions Bot reviewed Jun 27, 2026

Comment thread pydeequ/analyzers.py

Comment thread pydeequ/analyzers.py

github-actions Bot reviewed Jun 28, 2026

nikolauspschuetz wants to merge 2 commits into awslabs:master from nikolauspschuetz:feat/issue-164-distance-analyzer
