feat: add Distance analyzer for categorical feature drift (L-infinity, chi-squared) (#164) #276

Open: nikolauspschuetz wants to merge 2 commits into awslabs:master from nikolauspschuetz:feat/issue-164-distance-analyzer

@nikolauspschuetz

Problem

Deequ's Distance analyzer (com/amazon/deequ/analyzers/Distance.scala) computes distribution distance for feature-drift detection (L-infinity and chi-squared), but it was not exposed in PyDeequ. Requested in #164 (labeled good first issue / help wanted / feature request).

What this adds

A Distance helper plus a CategoricalDistanceMethod enum exposing categorical distance:

from pydeequ.analyzers import Distance, CategoricalDistanceMethod

dist = Distance(spark)
# distribution dicts as produced by the Histogram analyzer: {category: count}
d = dist.categoricalDistance({"a": 5, "b": 5}, {"a": 10},
                             method=CategoricalDistanceMethod.LInfinity)

Supports both L-infinity and chi-squared, with the full alpha / Yates / Cochran parameters.
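
For intuition, the L-infinity distance between two categorical distributions can be sketched in plain Python with no JVM involved. This is a minimal sketch of the textbook definition (the largest absolute gap between relative frequencies over the union of categories). The name `linf_distance` is hypothetical, and the code is not Deequ's implementation, which may additionally apply a low-sample correction.

```python
from typing import Dict


def linf_distance(dist1: Dict[str, int], dist2: Dict[str, int]) -> float:
    """Largest absolute difference in relative frequency across all categories.

    Hypothetical helper for illustration only; not part of PyDeequ or Deequ.
    """
    n = sum(dist1.values())
    m = sum(dist2.values())
    if n == 0 or m == 0:
        raise ValueError("both distributions need at least one observation")
    # A category missing from one side counts as frequency 0 on that side.
    categories = set(dist1) | set(dist2)
    return max(abs(dist1.get(c, 0) / n - dist2.get(c, 0) / m) for c in categories)


# Same inputs as the PyDeequ call above: "a" is 0.5 vs 1.0, "b" is 0.5 vs 0.0.
raw = linf_distance({"a": 5, "b": 5}, {"a": 10})
print(raw)  # 0.5
```

The value returned by the wrapped Deequ call may differ from this raw figure, depending on correctForLowNumberOfSamples and alpha.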

Design note (please scrutinize)

Deequ's Distance is a plain object with static-style methods, not an Analyzer subclass, so it cannot go through AnalysisRunBuilder.addAnalyzer(...).run(). I exposed it faithfully as a helper rather than faking analyzer integration.

categoricalDistance requires a Scala mutable.Map[String, Long]. py4j auto-unboxes individual java.lang.Long back to Python ints (re-entering as Integer), which makes Deequ's e._2.toDouble throw ClassCastException. To keep values genuinely Long-typed, each dict is round-tripped through a one-row Spark DataFrame with a MapType(StringType, LongType) column and collected JVM-side. Tradeoff: a small Spark job per distribution — acceptable for a drift helper, and the only reliable path found. Open to a lighter approach if reviewers prefer one.

Scope: categorical distance only. numericalDistance requires constructing a JVM QuantileNonSample[Double] with no clean Python path and is intentionally left out (matches the issue).

Tests

Added to tests/test_analyzers.py: test_Distance_categorical_LInfinity and test_Distance_categorical_Chisquare. Validated against the live Deequ 2.0.8 jar on Spark 3.5 (2 passed; full analyzer suite 27 passed / 22 pre-existing xfails, no regressions). Docs updated in docs/analyzers.md.

Closes #164

@github-actions github-actions Bot left a comment

Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: 416310f3) — may not be fully accurate. Reply if this doesn't help.

Comment thread pydeequ/analyzers.py
Comment thread pydeequ/analyzers.py Outdated
Comment thread pydeequ/analyzers.py Outdated
Comment thread tests/test_analyzers.py
@nikolauspschuetz nikolauspschuetz marked this pull request as ready for review June 27, 2026 17:30
@nikolauspschuetz

Author

Ready for review. Adds a Distance helper (categorical L-infinity + chi-squared) for feature-drift detection, exposing Deequ's Distance object faithfully (it's not an Analyzer, so it's a helper rather than addAnalyzer). Validated against the live Deequ 2.0.8 jar on Spark 3.5: test_analyzers.py → 27 passed / 22 xfailed. See the PR body for a reviewer note on the py4j Long-boxing handling. cc @sudsali @chenliu0831 @rdsharma26 — would appreciate your review. Closes #164.

Expose Deequ's com.amazon.deequ.analyzers.Distance categorical distance
(L-infinity and chi-squared) in PyDeequ. Distance is a Scala object with
static-style methods rather than an Analyzer subclass, so it is wrapped as
a Distance helper class (not via addAnalyzer) that bridges two
{category: count} distributions to the JVM and returns the numeric distance.

- Add Distance class with categoricalDistance(distribution1, distribution2,
  correctForLowNumberOfSamples, method, alpha, Yates/Cochran thresholds)
- Add CategoricalDistanceMethod enum (LInfinity, Chisquare)
- Build the required Scala mutable.Map[String, Long] via explicit Long-boxing
  (java.lang.Long[] array -> genericWrapArray -> zip -> toMap), with NO Spark
  job. Array element slots preserve Long boxing JVM-side and no value is read
  back into Python, so the typing survives (py4j otherwise auto-unboxes per-
  value Longs to int, triggering a ClassCastException in Deequ's e._2.toDouble).
  Uses only core Scala 2.12 stdlib (no ambient Java->Scala implicits), so it is
  portable across all supported Spark builds 3.1-3.5.
- Guard against empty distributions with a clear ValueError
- Document in docs/analyzers.md and add tests covering L-infinity, chi-squared,
  single-category, empty-dict ValueError, and the invalid-method path

Numerical distance is intentionally out of scope: it requires a JVM
QuantileNonSample[Double] with no convenient Python construction path.
@nikolauspschuetz nikolauspschuetz force-pushed the feat/issue-164-distance-analyzer branch from 8368609 to 57cdd69 Compare June 27, 2026 17:43

@github-actions github-actions Bot left a comment

Comment thread pydeequ/analyzers.py
Comment thread pydeequ/analyzers.py
The alpha argument to categoricalDistance flows through
LInfinityMethod(scala.Option.apply(alpha)) -- the only Option[Double]
bridge in the Distance wrapper -- and was previously untested. Assert
that two different alpha significance levels yield different distances,
proving the value is genuinely consumed JVM-side.
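
Why two alpha values should give different distances can be sketched with the two-sample Kolmogorov-Smirnov critical value, c(alpha) * sqrt((n + m) / (n * m)) with c(alpha) = sqrt(-ln(alpha / 2) / 2). The sketch below assumes the normalized distance is the raw L-infinity distance minus that threshold, floored at zero. That form is an assumption made for illustration, not something confirmed against Deequ's source here, and both function names are hypothetical.

```python
import math


def ks_critical_coefficient(alpha: float) -> float:
    """c(alpha) from the two-sample Kolmogorov-Smirnov test."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return math.sqrt(-0.5 * math.log(alpha / 2.0))


def corrected_linf(raw: float, n: int, m: int, alpha: float) -> float:
    """Raw distance minus the KS threshold, floored at zero.

    Assumed form of the correction, for illustration only.
    """
    threshold = ks_critical_coefficient(alpha) * math.sqrt((n + m) / (n * m))
    return max(0.0, raw - threshold)


# A smaller alpha means a larger threshold, hence a smaller corrected distance.
strict = corrected_linf(0.5, n=1000, m=1000, alpha=0.01)
loose = corrected_linf(0.5, n=1000, m=1000, alpha=0.10)
print(strict < loose)  # True
```

Under this assumption, any test comparing two alpha levels needs samples large enough that neither corrected distance is clamped to zero; otherwise both values collapse to 0.0 and the comparison proves nothing.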

@github-actions github-actions Bot left a comment

No issues found.



@nikolauspschuetz

Author

Thanks for the automated review pass. The current revision addresses these findings:

  • Map construction: _to_scala_mutable_long_map no longer uses createDataFrame/collectAsList. It builds the mutable.HashMap[String, Long] from JVM Long[] arrays via genericWrapArray/zip/toMap, so there is no per-call Spark job and no reliance on a java.util.Map -> Scala implicit conversion.
  • Empty distributions: guarded. categoricalDistance raises ValueError if either distribution dict is empty.
  • correctForLowNumberOfSamples docstring: now matches Deequ semantics. True returns the raw/unscaled statistic; False returns the normalized (KS-corrected L-infinity / chi-squared) result.
  • Seq.canBuildFrom (Scala 2.12): documented as a known constraint in the method docstring; all supported Spark builds (3.1-3.5) ship Scala 2.12.
  • Tests: added test_Distance_categorical_single_category, test_Distance_categorical_empty_dict_raises, and test_Distance_categorical_invalid_method_raises (plus an alpha test) covering the edge/error paths.
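
The kind of Python-side guard described in these bullets can be sketched as follows. Only the empty-dict ValueError is stated in this PR; the type and 64-bit range checks are additional assumptions, and `check_distribution` is a hypothetical name rather than the PR's actual code.

```python
from typing import Mapping

LONG_MAX = 2**63 - 1  # largest value a JVM Long can hold


def check_distribution(name: str, dist: Mapping[str, int]) -> None:
    """Validate a {category: count} dict before bridging it to the JVM.

    Hypothetical helper for illustration only.
    """
    if not dist:
        raise ValueError(f"{name} must not be empty")
    for category, count in dist.items():
        if not isinstance(category, str):
            raise TypeError(f"{name}: category {category!r} is not a str")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(count, bool) or not isinstance(count, int):
            raise TypeError(f"{name}: count for {category!r} is not an int")
        if not 0 <= count <= LONG_MAX:
            raise ValueError(f"{name}: count for {category!r} is out of Long range")


check_distribution("distribution1", {"a": 5, "b": 5})  # passes silently

try:
    check_distribution("distribution2", {})
except ValueError as err:
    print(err)  # distribution2 must not be empty
```

Failing fast in Python keeps the error message readable; without a guard, a bad input would more likely surface as a py4j-wrapped JVM exception.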

Resolving the threads accordingly.

Development

Successfully merging this pull request may close these issues.

Distance analyzer for detecting feature drift with PyDeequ
