feat: add Distance analyzer for categorical feature drift (L-infinity, chi-squared) (#164)#276
Open
nikolauspschuetz wants to merge 2 commits into
Open
Conversation
Author
|
Ready for review. Adds a |
Expose Deequ's com.amazon.deequ.analyzers.Distance categorical distance
(L-infinity and chi-squared) in PyDeequ. Distance is a Scala object with
static-style methods rather than an Analyzer subclass, so it is wrapped as
a Distance helper class (not via addAnalyzer) that bridges two
{category: count} distributions to the JVM and returns the numeric distance.
- Add Distance class with categoricalDistance(distribution1, distribution2,
correctForLowNumberOfSamples, method, alpha, Yates/Cochran thresholds)
- Add CategoricalDistanceMethod enum (LInfinity, Chisquare)
- Build the required Scala mutable.Map[String, Long] via explicit Long-boxing
(java.lang.Long[] array -> genericWrapArray -> zip -> toMap), with NO Spark
job. Array element slots preserve Long boxing JVM-side and no value is read
back into Python, so the typing survives (py4j otherwise auto-unboxes per-
value Longs to int, triggering a ClassCastException in Deequ's e._2.toDouble).
Uses only core Scala 2.12 stdlib (no ambient Java->Scala implicits), so it is
portable across all supported Spark builds 3.1-3.5.
- Guard against empty distributions with a clear ValueError
- Document in docs/analyzers.md and add tests covering L-infinity, chi-squared,
single-category, empty-dict ValueError, and the invalid-method path
Numerical distance is intentionally out of scope: it requires a JVM
QuantileNonSample[Double] with no convenient Python construction path.
8368609 to
57cdd69
Compare
The alpha argument to categoricalDistance flows through LInfinityMethod(scala.Option.apply(alpha)) -- the only Option[Double] bridge in the Distance wrapper -- and was previously untested. Assert that two different alpha significance levels yield different distances, proving the value is genuinely consumed JVM-side.
Author
|
Thanks for the automated review pass. The current revision addresses these findings:
Resolving the threads accordingly. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Deequ's
Distanceanalyzer (com/amazon/deequ/analyzers/Distance.scala) computes distribution distance for feature-drift detection (L-infinity and chi-squared), but it was not exposed in PyDeequ. Requested in #164 (labeled good first issue / help wanted / feature request).What this adds
A
Distancehelper plus aCategoricalDistanceMethodenum exposing categorical distance:Supports both L-infinity and chi-squared, with the full alpha / Yates / Cochran parameters.
Design note (please scrutinize)
Deequ's
Distanceis a plain object with static-style methods, not anAnalyzersubclass, so it cannot go throughAnalysisRunBuilder.addAnalyzer(...).run(). I exposed it faithfully as a helper rather than faking analyzer integration.categoricalDistancerequires a Scalamutable.Map[String, Long]. py4j auto-unboxes individualjava.lang.Longback to Python ints (re-entering asInteger), which makes Deequ'se._2.toDoublethrowClassCastException. To keep values genuinelyLong-typed, each dict is round-tripped through a one-row Spark DataFrame with aMapType(StringType, LongType)column and collected JVM-side. Tradeoff: a small Spark job per distribution — acceptable for a drift helper, and the only reliable path found. Open to a lighter approach if reviewers prefer one.Scope: categorical distance only.
numericalDistancerequires constructing a JVMQuantileNonSample[Double]with no clean Python path and is intentionally left out (matches the issue).Tests
Added to
tests/test_analyzers.py:test_Distance_categorical_LInfinityandtest_Distance_categorical_Chisquare. Validated against the live Deequ 2.0.8 jar on Spark 3.5 (2 passed; full analyzer suite 27 passed / 22 pre-existing xfails, no regressions). Docs updated indocs/analyzers.md.Closes #164