From 1eabce5e2fa7775b726931c704a49faf3793d2db Mon Sep 17 00:00:00 2001 From: igerber Date: Thu, 14 May 2026 20:51:19 -0400 Subject: [PATCH 01/15] Add Tutorial 22: Survey-Weighted HAD walkthrough Demonstrates the now-fully-supported HeterogeneousAdoptionDiD + did_had_pretest_workflow workflow under SurveyDesign(weights, strata, psu, fpc) end-to-end on a BRFSS-shape state-rollout panel (5 strata x 6 PSUs/stratum x 2 states/PSU = 60 states; post- stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum; PSU x period interaction shocks injected so cluster correlation survives DiD first-differencing). Closes the Phase 5 wave 2 second slice tutorial gap that the survey-strata gate lift unblocked. Eight sections: motivation; panel + in-notebook helper for attaching survey columns to a HAD panel; naive vs survey-aware headline fit with side-by-side ATT/SE/CI table (~10% SE inflation, sign-only direction asserted); a dedicated discussion of why the SE inflation is modest for HAD specifically (WAS-d_lower IF concentration at the boundary vs full-panel regression coefficients); event-study with sup-t cband under the survey design; pretest workflow on both overall and event-study paths walking the Phase 4.5 C0 QUG-deferred verdict suffix and the now-supported stratified-clustered Stute multiplier bootstrap; communicating-to-leadership two-paragraph template; Extensions + Summary Checklist surfacing the still-deferred lonely_psu='adjust' + singleton-strata, replicate-weight designs, and the permanent QUG-under-survey C0 deferral. Companion drift-test file (25 tests across 5 groups) locks panel composition with deterministic exact pins, naive-vs-survey SE inflation direction (sign-only structural anchor; HAD's IF concentration caps inflation around 1.10x even at large PSU shock SD), design auto-detection to continuous_near_d_lower, event-study cband-vs-pointwise width ordering, _QUG_DEFERRED_SUFFIX substring on report.verdict for both overall and event-study paths, the distinct report.summary() QUG-skip note on the event-study path, deterministic Yatchew sigma2_* pins, and bootstrap p-value tolerance bands at >= 0.25 abs per feedback_strata_bootstrap_path_divergence. Cross-surface updates: T20 and T21 Extensions bullets gain forward-pointers to T22 (T20 also drops the deprecated weights= kwarg phrasing in favor of survey_design=); practitioner decision tree HAD universal-rollout and survey sections each gain a tip cross-link to T22 (adjacent to T20 / T17, not displacing); api/had.rst gains a Survey-aware fit cross-reference; survey-roadmap.md gains a Phase 4.5 C HAD Stute Survey Workflow section; bundled llms.txt and llms-practitioner.txt carry T22 inventory entries; doc-deps.yaml wires T22 as a dependent of both had.py and had_pretests.py; REGISTRY closers L2529 + L2577 flipped to shipped; TODO row L115 marked shipped; CHANGELOG carries the new Unreleased Added entry plus closer flips at the T21 (PR #409) and HAD handlers (PR #402) queued-tutorial closer lines. Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 5 +- TODO.md | 2 +- diff_diff/guides/llms-practitioner.txt | 5 +- diff_diff/guides/llms.txt | 1 + docs/api/had.rst | 9 + docs/doc-deps.yaml | 6 + docs/methodology/REGISTRY.md | 4 +- docs/practitioner_decision_tree.rst | 13 +- docs/survey-roadmap.md | 29 + docs/tutorials/20_had_brand_campaign.ipynb | 2 +- docs/tutorials/21_had_pretest_workflow.ipynb | 2 +- docs/tutorials/22_had_survey_design.ipynb | 777 +++++++++++++++++++ docs/tutorials/README.md | 8 + tests/test_t22_had_survey_design_drift.py | 556 +++++++++++++ 14 files changed, 1409 insertions(+), 10 deletions(-) create mode 100644 docs/tutorials/22_had_survey_design.ipynb create mode 100644 tests/test_t22_had_survey_design_drift.py diff --git a/CHANGELOG.md b/CHANGELOG.md index 920bc0c7..5080b296 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,16 +8,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Added +- **Tutorial 22: Survey-Weighted HAD** (`docs/tutorials/22_had_survey_design.ipynb`) — end-to-end walkthrough of `HeterogeneousAdoptionDiD` + `did_had_pretest_workflow` on a BRFSS-shape stratified household-survey panel (5 strata × 6 PSUs/stratum × 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum; PSU × period interaction shocks injected so cluster correlation survives DiD first-differencing). Demonstrates the `SurveyDesign(strata=...)` path through the Stute pretest family that the previous `[Unreleased]` entry unblocked. Eight numbered sections: motivation; panel + in-notebook helper for attaching survey columns to a HAD panel; naive vs survey-aware headline fit with a side-by-side ATT / SE / CI table (~10% SE inflation, sign-only direction asserted); a dedicated section explaining why the SE inflation is modest for HAD specifically (WAS-d_lower IF concentration at the boundary vs full-panel regression coefficients); event-study fit with sup-t cband under the survey design (per-horizon table + matplotlib gated plot); pretest workflow on both `aggregate="overall"` and `aggregate="event_study"` paths walking the Phase 4.5 C0 QUG-deferred verdict suffix and the now-supported stratified-clustered Stute multiplier bootstrap; "Communicating to Leadership" two-paragraph template (executive + methodologist); Extensions + Summary Checklist surfacing the still-deferred `lonely_psu='adjust'` + singleton-strata, replicate-weight designs, and the permanent QUG-under-survey C0 deferral. Companion drift-test file `tests/test_t22_had_survey_design_drift.py` (25 tests across 5 groups: panel + survey composition with deterministic exact pins; naive-vs-survey headline with sign-only SE-inflation anchor; event-study cband-vs-pointwise ordering and post/pre coverage; pretest overall path with `_QUG_DEFERRED_SUFFIX` lock and Yatchew `sigma2_*` deterministic pins; pretest event-study path with the SAME `_QUG_DEFERRED_SUFFIX` lock plus a SEPARATE substring lock on `report.summary()` for the L736 QUG-skip note). Bootstrap p-value pins use abs tolerance bands `>= 0.25` per `feedback_strata_bootstrap_path_divergence` (stratified Mammen multiplier paths reduce effective dofs vs non-strata; PR #432 commit `aef07020` already had to relax bit-equality bands on this code path). T20 and T21 "Extensions" bullets updated with forward-pointers to T22; `docs/practitioner_decision_tree.rst` HAD universal-rollout and survey sections each gain a `.. tip::` cross-link to T22 (adjacent to T20 / T17, NOT displacing); `docs/api/had.rst` gains a "Survey-aware fit" cross-reference; `docs/survey-roadmap.md` gains a "Phase 4.5 C ✅ Shipped" entry; bundled `diff_diff/guides/llms.txt`, `llms-full.txt`, and `llms-practitioner.txt` carry T22 inventory entries; `docs/doc-deps.yaml` wires T22 as a dependent of both `had.py` and `had_pretests.py`. Closes the Phase 5 (wave 2 second slice) tutorial gap; the realistic survey-weighted HAD workflow on BRFSS / CPS / NHANES / ACS-shaped designs is now end-to-end documented for practitioners. - **HAD pretest workflow: stratified survey-design support (Phase 4.5 C continuation).** Lifts the `NotImplementedError` gate on `SurveyDesign(strata=...)` in `stute_test` (`had_pretests.py:1927-1940` pre-PR) and `stute_joint_pretest` (`:3259-3271` pre-PR), and by inheritance in `joint_pretrends_test`, `joint_homogeneity_test`, and `did_had_pretest_workflow` (the wrappers delegate to the joint Stute helper). Implements a documented synthesis of clustered-wild-bootstrap ingredients (Cameron-Gelbach-Miller 2008 cluster-level multipliers; Davidson-Flachaire 2008 wild-bootstrap centering; Djogbenou-MacKinnon-Nielsen 2019 cluster-wild consistency for nonlinear functionals; Kreiss-Lahiri 2012 within-block centering analogy; Wu 1986 / Liu 1988 Bessel small-sample correction) — no single paper covers the exact composition for the stratified Stute CvM functional. The recipe: within-stratum demean + `sqrt(n_h/(n_h-1))` Bessel rescale applied to the PSU multipliers `psu_mults` BEFORE the per-obs broadcast `eta_obs = psu_mults[b, psu_col_idx]` in the wild-residual loop. Bootstrap CvM variance matches the analytical Binder-TSL stratified target `V_S = sum_h (1 - f_h) (n_h / (n_h - 1)) sum_j (psi_hj - psi_h_bar)²` exactly (the `(1 - f_h)` FPC factor was already baked in by `generate_survey_multiplier_weights_batch`; this PR bakes the remaining `(n_h / (n_h - 1))` factor and enforces within-stratum-mean-zero centering). New shared helper `bootstrap_utils.apply_stratum_centering(psu_mults, resolved_survey, psu_ids, psu_axis=...)` is called from both the new Stute path (psu_axis=1 on the multiplier matrix) AND the existing HAD sup-t event-study cband bootstrap (psu_axis=0 on the PSU-aggregated influence tensor; refactored bit-exactly from the inline block previously at `had.py:2172-2204`). Locks the algebraic identity architecturally instead of leaving parallel code blocks to drift. MC oracle consistency validated under a 4-stratum × 6-PSU/stratum stratified null DGP with weights+strata+PSU (200 seeded draws, empirical Type I at α=0.05 in `[0, 0.10]` — 3σ band; the FPC bake-in is covered separately by the helper-unit test `test_fpc_baked_in_helper_is_fpc_agnostic`); MC power validated under a known-alternative stratified DGP (rejection > 0.50). HAD sup-t event-study cband bit-parity preserved (`atol=1e-14, rtol=1e-14` on the refactored helper output + 29 existing cband tests passing post-refactor; that helper-level bit-parity test locks the axis-0 algebra). A separate wired-in regression at `tests/test_had_pretests.py::TestStuteStratifiedSurveyBootstrap::test_stute_call_sites_invoke_apply_stratum_centering` monkey-patches the helper and asserts both Stute call sites (`stute_test` at `had_pretests.py:1985` and `stute_joint_pretest` at `:3312`) invoke it with `psu_axis=1` — that test fails if either call site is disconnected (the axis-0 helper-parity test alone does not catch that case). See `docs/methodology/REGISTRY.md` § HeterogeneousAdoptionDiD — "Note (Stute stratified survey-bootstrap calibration)" for the full derivation. Remaining deferrals: `lonely_psu='adjust'` + singleton-strata (same pseudo-stratum centering gap as the HAD sup-t deviation at REGISTRY:2382) and replicate-weight designs (BRR/Fay/JK1/JKn/SDR — separate Rao-Wu / JKn bootstrap composition). Unblocks the realistic survey-weighted HAD workflow on BRFSS/CPS/NHANES/ACS-shaped designs. - **Conley (1999) Wave A mechanical extensions** on top of the Phase 1+2 sandwich (`diff_diff/conley.py`, `diff_diff/linalg.py`, `diff_diff/estimators.py`, `diff_diff/twfe.py`). **(1) DiD support (#118):** `DifferenceInDifferences(vcov_type="conley").fit(..., unit="")` is now supported. `unit` is a fit-time kwarg (NOT on `__init__`; unused unless Conley is set; not part of `get_params()` / `set_params()`) mirroring `MultiPeriodDiD.fit(unit=...)` / `TwoWayFixedEffects.fit(unit=...)`. DiD inherits the same panel block-decomposed sandwich as MPD/TWFE; on a 2-period panel it matches `MultiPeriodDiD(...).fit(..., post_periods=[1], reference_period=0)` bit-exactly. Missing `unit=`/`conley_lag_cutoff`/`conley_coords`/`conley_cutoff_km` raise `ValueError`; `survey_design=` + Conley raises `NotImplementedError` (Bertanha-Imbens 2014 follow-up); `inference="wild_bootstrap"` + Conley raises `NotImplementedError`. **(2) Combined spatial + cluster product kernel (#119):** `compute_robust_vcov(vcov_type="conley", cluster_ids=...)` / `LinearRegression(vcov_type="conley", cluster_ids=...)` / `TwoWayFixedEffects(vcov_type="conley", cluster="")` / `MultiPeriodDiD(vcov_type="conley", cluster="")` / `DifferenceInDifferences(vcov_type="conley", cluster="")` apply `K_total[i, j] = K_space(d_ij/h) · 1{c_i = c_j}`. On the panel block-decomposed path the cluster indicator multiplies BOTH the spatial sandwich AND the serial sandwich; the validator enforces that `cluster_ids` is constant within each unit across periods (the within-unit serial mask is then trivially all-ones; cross-sectional path has no such constraint). TWFE's default auto-cluster on the Conley path remains silently dropped (combining with unit-level clusters would zero out all between-unit pairs and defeat the spatial pooling); users must pass an explicit above-unit cluster (e.g. region) to opt in. DiD has no auto-cluster — the choice is fully explicit. Two limit fixtures anchor correctness (no R parity — R `conleyreg` does not support combined kernels): all-unique-clusters reduces to HC0; huge-cutoff reduces to pure within-cluster CR1. The huge-cutoff reduction is EXACT only for `conley_kernel="uniform"` (`K(u) = 1` for `|u| ≤ 1`); for `conley_kernel="bartlett"` the identity is asymptotic since `K_bartlett(u) = 1 - |u| < 1` for `u > 0`. The fixture anchor uses uniform for an exact identity check. Per-slice mask construction (NOT full n×n) preserves memory on panel paths. **(3) Sparse k-d-tree fast path (#120):** auto-activates for the spatial Bartlett meat when `n > 5_000` AND metric is `"haversine"` or `"euclidean"` AND kernel is `"bartlett"`. Builds a CSR sparse kernel matrix via `scipy.spatial.cKDTree.query_ball_tree` instead of materializing the full n×n distance matrix; haversine projects to a 3-D unit-sphere chord representation with the exact great-circle recomputed for in-range neighbors only. Bit-identity parity vs the dense path at `atol=1e-10`; R parity at `atol=1e-6` is preserved on the existing 3 panel R fixtures with the sparse path force-enabled. The bartlett-only gate is for boundary correctness — bartlett at `u=1` is exactly 0, so the sparse path safely drops at-cutoff pairs; uniform at `u=1` is 1 and would require a closed-interval query semantic that haversine chord projection cannot reliably preserve. Constants: `_CONLEY_SPARSE_N_THRESHOLD = 5_000` (auto-toggle); `_CONLEY_DENSE_WARN_N` renamed `_CONLEY_DENSE_OOM_WARN_N = 20_000` (memory exhaustion threshold for the dense fallback — independent of the sparse threshold). Private `_conley_sparse: Optional[bool]` kwarg on `_compute_conley_vcov` controls the toggle (`None` = auto, `True` = force, `False` = force dense; `True` with an unsupported kernel/metric raises). The serial component (within-unit Bartlett over time) remains dense regardless — per-unit slices are small. **(4) Callable `conley_metric` validation (#123):** result must satisfy shape `(n, n)`, finite, non-negative, symmetric to `atol=1e-10`, AND zero on the diagonal (`|d(i, i)| ≤ 1e-10`); each failure raises a targeted `ValueError` naming the violated invariant. The zero-diagonal contract is load-bearing for the Conley sandwich: the `i = j` term must reduce to the HC0 diagonal `X_i ε_i² X_i'` via `K(0) = 1`; positive self-distance would silently attenuate the HC0 contribution by `K(d_ii / h) < 1`. Built-in metrics (`"haversine"`, `"euclidean"`) satisfy this by construction. Previously, malformed callables produced opaque BLAS errors deep in the pipeline. **Tests:** `tests/test_conley_vcov.py::TestConleySparse` (12), `::TestConleySparseRParityForced` (3), `::TestConleyCluster` (10), `::TestConleyDistanceMetrics` extended (7 new); existing rejection tests flipped to behavioral; `test_did_conley_matches_mpd_post_periods_1` locks the DiD-vs-MPD bit-exact agreement. **Docs:** REGISTRY `## ConleySpatialHAC` updates: new "Combined spatial + cluster product kernel" + "Performance / scale" subsections, DiD-vs-TWFE cluster asymmetry paragraph, updated panel-API restrictions table. TODO rows 118 / 119 / 120 / 123 removed; rows 121 (Conley + survey_design / weights, Bertanha-Imbens 2014) and 122 (`SyntheticDiD(vcov_type="conley")`, spatial-block bootstrap per Politis-Romano 1994) retained for future waves. - **Conley (1999) spatial-HAC standard errors via `vcov_type="conley"`** on cross-sectional `LinearRegression` / `compute_robust_vcov` plus panel `MultiPeriodDiD` / `TwoWayFixedEffects` (Phases 1 and 2 of the spillover-conley initiative). **Cross-sectional contract:** `conley_coords` (n × 2 array of lat/lon or projected coords), `conley_cutoff_km=` (positive finite bandwidth in km for haversine, or coord units for euclidean — REQUIRED, no default per the no-silent-failures contract), `conley_metric="haversine"|"euclidean"|callable` (default `"haversine"`; great-circle uses Earth's mean radius 6371.01 km matching R `conleyreg`), `conley_kernel="bartlett"|"uniform"` (default `"bartlett"`; both kernels emit a `UserWarning` if the resulting meat has a materially negative eigenvalue — neither the radial 1-D Bartlett nor the uniform kernel is formally PSD-guaranteed; Conley 1999's explicit PSD formula is the 2-D separable lattice product window at Eq 3.14). Cross-sectional variance estimator `Var̂(β) = (X'X)^{-1} · ( Σ_{i,j} K(d_ij/h) · X_i ε_i ε_j X_j' ) · (X'X)^{-1}` (Conley 1999 Eq 4.2). **Panel contract (Phase 2, new):** Three new co-required kwargs `conley_time` (n-length array), `conley_unit` (n-length array), and `conley_lag_cutoff=` (non-negative; 0 means within-period spatial only, no serial component) switch into the **block-decomposed panel sandwich** that matches R `conleyreg` with `lag_cutoff > 0`: `XeeX_total = Σ_t (within-period spatial sandwich) + Σ_u (within-unit Bartlett temporal sandwich, lag ∈ {1..L}, same-time excluded)`. This is NOT a multiplicative product kernel — verified empirically against `conleyreg::time_dist` and `XeeXhC` at ~1e-14 on the panel parity fixtures. The temporal kernel is hardcoded Bartlett `(1 - |lag|/(L+1))` regardless of `conley_kernel`, mirroring `conleyreg::time_dist.cpp`; documented as a `Note (deviation from R-symmetric API)` in REGISTRY. **Panel estimator wire-up (Phase 2):** `MultiPeriodDiD(vcov_type="conley", conley_lag_cutoff=...).fit(..., unit=...)` and `TwoWayFixedEffects(vcov_type="conley", conley_lag_cutoff=...).fit(..., unit=...)` lift the Phase 1 fit-time rejection; the `conley_time` and `conley_unit` arrays are auto-derived from the existing `time` and `unit` column-name arguments at fit-time. `DifferenceInDifferences(vcov_type="conley")` is also supported (Wave A #118 in this release; see the Wave A entry above) — pass `unit=` as a fit-time kwarg to `DiD.fit(...)`. **Other constraints (Phase 1, unchanged):** `SyntheticDiD(vcov_type="conley")` raises `TypeError` (uses bootstrap variance, not analytical sandwich); `set_params` mirrors the constructor rejection. `vcov_type="conley"` + `weights=` / `survey_design=` raises `NotImplementedError` (Bertanha-Imbens 2014 weighted-Conley deferred to a follow-up PR). `vcov_type="conley"` + explicit `cluster_ids=` is supported via the combined spatial + cluster product kernel (Wave A #119; see the Wave A entry above). TWFE's default auto-cluster on the Conley path is silently dropped (combining with unit-level clusters would defeat the spatial pooling); users opt into the combined kernel by passing an explicit above-unit cluster. `inference="wild_bootstrap"` + Conley raises (incompatible inference modes). A sparse k-d-tree fast path auto-activates for the spatial Bartlett meat when `n > 5_000` with bartlett kernel and haversine/euclidean metric (Wave A #120); the dense fallback still emits an OOM `UserWarning` at `n > 20_000`. **Implementation:** Helpers live in `diff_diff/conley.py` (`_haversine_km`, `_pairwise_distance_matrix`, `_bartlett_kernel`, `_uniform_kernel`, `_validate_conley_kwargs`, `_compute_conley_vcov` — the validator and sandwich helper now accept keyword-only `time` / `unit` / `lag_cutoff` for the panel path); `compute_robust_vcov` in `diff_diff/linalg.py` threads the new kwargs through. **R `conleyreg` parity (Düsterhöft 2021, CRAN v0.1.9)** on **six** benchmark fixtures (`benchmarks/data/r_conleyreg_conley_golden.json`, regenerable via `benchmarks/R/generate_conley_golden.R`): 3 cross-sectional (Phase 1) + 3 new panel fixtures (`panel_haversine_lag1`, `panel_haversine_lag2`, `panel_lat_lon_realistic_lag1`; n_units × T = 60×3, 80×5, 100×4 at lag={1,2,1}); observed max abs diff ~5.7e-16. Earth radius 6371.01 km matches `conleyreg::haversine_dist`. Test file `tests/test_conley_vcov.py` skips parity cleanly when the JSON is absent. New REGISTRY section `## ConleySpatialHAC`. Subsequent phases of the spillover-conley initiative (ring-indicator spillover-aware DiD per Butts 2021; survey-design / replicate-weight support; `SyntheticDiD` Conley path) are tracked in `TODO.md` under "Tech Debt from Code Reviews" → spillover-conley rows. -- **Tutorial 21: HAD Pre-test Workflow** (`docs/tutorials/21_had_pretest_workflow.ipynb`) — composite pre-test walkthrough for `HeterogeneousAdoptionDiD` building on Tutorial 20's brand-campaign framing. Uses a 60-DMA × 8-week panel close in shape to T20's but with the dose distribution drawn from `Uniform[$0.01K, $50K]` (vs T20's `[$5K, $50K]`); the true support is strictly positive but very near zero, chosen so the QUG step in `did_had_pretest_workflow` fails-to-reject `H0: d_lower = 0` in this finite sample and the verdict text fires the load-bearing "Assumption 7 deferred" pivot for the upgrade-arc narrative. (HAD's `design="auto"` selector — a separate min/median heuristic at `had.py::_detect_design`, NOT the QUG p-value — independently lands on the `continuous_at_zero` identification path with target `WAS` on this panel because `d.min() < 0.01 * median(|d|)`. The QUG test and the design selector are independent rules that point to the same identification path here.) Walks through three surfaces: (a) `did_had_pretest_workflow(aggregate="overall")` on a two-period collapse, where the verdict explicitly flags Step 2 (Assumption 7 pre-trends) as not run because a single pre-period structurally cannot support a pre-trends test, and the structural fields `pretrends_joint` / `homogeneity_joint` are both `None`; (b) `did_had_pretest_workflow(aggregate="event_study")` on the full multi-period panel, where the verdict reads "TWFE admissible under Section 4 assumptions" because all three testable diagnostics (QUG + joint pre-trends Stute over 3 horizons + joint homogeneity Stute over 4 horizons) fail-to-reject — non-rejection evidence under finite-sample power and test specification, not proof that the identifying assumptions hold; and (c) a side panel exercising both `yatchew_hr_test` null modes — `null="linearity"` (default, paper Theorem 7) vs `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`) — on the within-pre-period first-difference paired with post-period dose, illustrating the stricter null's larger residual variance (`sigma2_lin` 7.01 vs 6.53) and smaller p-value (0.29 vs 0.49). Companion drift-test file `tests/test_t21_had_pretest_workflow_drift.py` (16 tests pinning panel composition, both verdict pivots, structural anchors on both paths, deterministic QUG / Yatchew statistics, bootstrap p-value tolerance bands per `feedback_bootstrap_drift_tests_need_backend_tolerance`, and `HAD(design="auto")` resolution to `continuous_at_zero` on this panel). T20's "Composite pretest workflow" Extensions bullet updated with a forward-pointer to T21. T22 weighted/survey HAD tutorial remains queued as a separate notebook PR. +- **Tutorial 21: HAD Pre-test Workflow** (`docs/tutorials/21_had_pretest_workflow.ipynb`) — composite pre-test walkthrough for `HeterogeneousAdoptionDiD` building on Tutorial 20's brand-campaign framing. Uses a 60-DMA × 8-week panel close in shape to T20's but with the dose distribution drawn from `Uniform[$0.01K, $50K]` (vs T20's `[$5K, $50K]`); the true support is strictly positive but very near zero, chosen so the QUG step in `did_had_pretest_workflow` fails-to-reject `H0: d_lower = 0` in this finite sample and the verdict text fires the load-bearing "Assumption 7 deferred" pivot for the upgrade-arc narrative. (HAD's `design="auto"` selector — a separate min/median heuristic at `had.py::_detect_design`, NOT the QUG p-value — independently lands on the `continuous_at_zero` identification path with target `WAS` on this panel because `d.min() < 0.01 * median(|d|)`. The QUG test and the design selector are independent rules that point to the same identification path here.) Walks through three surfaces: (a) `did_had_pretest_workflow(aggregate="overall")` on a two-period collapse, where the verdict explicitly flags Step 2 (Assumption 7 pre-trends) as not run because a single pre-period structurally cannot support a pre-trends test, and the structural fields `pretrends_joint` / `homogeneity_joint` are both `None`; (b) `did_had_pretest_workflow(aggregate="event_study")` on the full multi-period panel, where the verdict reads "TWFE admissible under Section 4 assumptions" because all three testable diagnostics (QUG + joint pre-trends Stute over 3 horizons + joint homogeneity Stute over 4 horizons) fail-to-reject — non-rejection evidence under finite-sample power and test specification, not proof that the identifying assumptions hold; and (c) a side panel exercising both `yatchew_hr_test` null modes — `null="linearity"` (default, paper Theorem 7) vs `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`) — on the within-pre-period first-difference paired with post-period dose, illustrating the stricter null's larger residual variance (`sigma2_lin` 7.01 vs 6.53) and smaller p-value (0.29 vs 0.49). Companion drift-test file `tests/test_t21_had_pretest_workflow_drift.py` (16 tests pinning panel composition, both verdict pivots, structural anchors on both paths, deterministic QUG / Yatchew statistics, bootstrap p-value tolerance bands per `feedback_bootstrap_drift_tests_need_backend_tolerance`, and `HAD(design="auto")` resolution to `continuous_at_zero` on this panel). T20's "Composite pretest workflow" Extensions bullet updated with a forward-pointer to T21. T22 weighted/survey HAD tutorial shipped as a follow-up notebook PR (see the T22 entry above). - **`ChaisemartinDHaultfoeuille.by_path` and `paths_of_interest` now compose with `survey_design`** for analytical Binder TSL SE and replicate-weight bootstrap variance. The `NotImplementedError` gate at `chaisemartin_dhaultfoeuille.py:1233-1239` is replaced by a per-path multiplier-bootstrap-only gate (`survey_design + n_bootstrap > 0` under by_path / paths_of_interest still raises, since the survey-aware perturbation pivot for path-restricted IFs is methodologically underived). Per-path SE routes through the existing `_survey_se_from_group_if` cell-period allocator: the per-period IF (`U_pp_l_path`) is built with non-path switcher-side contributions skipped (control contributions are unchanged, matching the joiners/leavers IF convention; preserves the row-sum identity `U_pp.sum(axis=1) == U`), cohort-recentered via `_cohort_recenter_per_period`, then expanded to observations as `psi_i = U_pp[g_i, t_i] · (w_i / W_{g_i, t_i})`. Replicate-weight designs unconditionally use the cell allocator (Class A contract from PR #323). New `_refresh_path_inference` helper post-call refreshes `safe_inference` on every populated entry across `multi_horizon_inference`, `placebo_horizon_inference`, `path_effects`, and `path_placebos` so all four surfaces use the same final `df_survey` after per-path replicate fits append `n_valid` to the shared accumulator. Path-enumeration ranking under `survey_design` remains unweighted (group-cardinality, not population-weight mass). Lonely-PSU policy stays sample-wide, not per-path. Telescope invariant: on a single-path panel, per-path SE matches the global non-by_path survey SE bit-exactly. **No R parity** — R `did_multiplegt_dyn` does not support survey weighting; this is a Python-only methodology extension. The global non-by_path TSL multiplier-bootstrap path is unaffected (anti-regression test `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathSurveyDesignAnalytical::test_global_survey_plus_n_bootstrap_still_works` locks the per-path-only scope of the new gate). Cross-surface invariants regression-tested at `TestByPathSurveyDesignAnalytical` (~17 tests across gate / dispatch / analytical SE / replicate-weight SE / per-path placebos / `trends_linear` composition / unobserved-path warnings / final-df refresh regressions) and `TestByPathSurveyDesignTelescope`. See `docs/methodology/REGISTRY.md` §`ChaisemartinDHaultfoeuille` `Note (Phase 3 by_path ...)` → "Per-path survey-design SE" for the full contract. - **Inference-field aliases on staggered result classes** for adapter / external-consumer compatibility. Read-only `@property` aliases expose the flat `att` / `se` / `conf_int` / `p_value` / `t_stat` names (matching `DiDResults` / `TROPResults` / `SyntheticDiDResults` / `TripleDifferenceResults` / `HeterogeneousAdoptionDiDResults`) on every result class that previously only carried prefixed canonical fields: `CallawaySantAnnaResults`, `StackedDiDResults`, `EfficientDiDResults`, `ChaisemartinDHaultfoeuilleResults`, `StaggeredTripleDiffResults`, `WooldridgeDiDResults`, `SunAbrahamResults`, `ImputationDiDResults`, `TwoStageDiDResults` (mapping to `overall_*`); `ContinuousDiDResults` (mapping to `overall_att_*`, ATT-side as the headline, ACRT-side accessible unchanged via `overall_acrt_*`); `MultiPeriodDiDResults` (mapping to `avg_*`). `ContinuousDiDResults` additionally exposes `overall_se` / `overall_conf_int` / `overall_p_value` / `overall_t_stat` aliases for naming consistency with the rest of the staggered family. Aliases are pure read-throughs over the canonical fields — no recomputation, no behavior change — so the `safe_inference()` joint-NaN contract (per CLAUDE.md "Inference computation") is inherited automatically (NaN canonical → NaN alias, locked at `tests/test_result_aliases.py::test_pattern_b_aliases_propagate_nan`). The native `overall_*` / `overall_att_*` / `avg_*` fields remain canonical for documentation and computation. Motivated by the `balance.interop.diff_diff.as_balance_diagnostic()` adapter (`facebookresearch/balance` PR #465) which calls `getattr(res, "se", None)` / `getattr(res, "conf_int", None)` without a fallback chain — pre-alias, every staggered result class returned `None` on those keys, silently dropping `se` and `conf_int` from the adapter's diagnostic dict. 23 alias-mechanic + balance-adapter regression tests at `tests/test_result_aliases.py`. Patch-level (additive on stable surfaces). - **`ChaisemartinDHaultfoeuille.by_path` + non-binary integer treatment** — `by_path=k` now accepts integer-coded discrete treatment (D in Z, e.g. ordinal `{0, 1, 2}`); path tuples become integer-state tuples like `(0, 2, 2, 2)`. The previous `NotImplementedError` gate at `chaisemartin_dhaultfoeuille.py:1870` is replaced by a `ValueError` for continuous D (e.g. `D=1.5`) at fit-time per the no-silent-failures contract — the existing `int(round(float(v)))` cast in `_enumerate_treatment_paths` is now defensive (no-op for integer-coded D). Validated against R `did_multiplegt_dyn(..., by_path)` for D in `{0, 1, 2}` via the new `multi_path_reversible_by_path_non_binary` golden-value scenario (78 switchers, 3 paths, single-baseline custom DGP, F_g >= 4): per-path point estimates match R bit-exactly (rtol ~1e-9 on event horizons; rtol+atol envelope for placebo near-zero values), per-path SE inherits the documented cross-path cohort-sharing deviation (~5% rtol observed; SE_RTOL=0.15 envelope). **Deviation from R for multi-character baseline states (D >= 10 or negative D):** R's `did_multiplegt_by_path` derives the per-path baseline via `path_index$baseline_XX <- substr(path_index$path, 1, 1)`, which captures only the first character of the comma-separated path string. For multi-character baselines this drops the rest of the value: for `path = "12,12,..."` it captures `"1"` instead of `"12"`; for `path = "-1,-1,..."` it captures `"-"` instead of `"-1"`. R's per-path control-pool subset is mis-allocated in both regimes. Python's tuple-key matching is correct — the per-path point estimates we compute are correct; R's per-path subset for the same path is buggy. The shipped R-parity scenarios stay in nonnegative single-digit `D in {0, 1, 2}` to avoid the R bug; negative-integer treatment-state support (paths containing negative D values in non-baseline positions) is regression-tested in Python only at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathNonBinary::test_negative_integer_D_supported` (no R parity); a dedicated regression for a negative-baseline path (e.g. `(-1, 0, 0, 0)`) is deferred. R-parity test at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathNonBinary`; cross-surface invariants regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathNonBinary`. - **New `paths_of_interest` kwarg on `ChaisemartinDHaultfoeuille`** for user-specified treatment-path subsets, alternative to `by_path=k`'s top-k automatic ranking. Mutually exclusive with `by_path`; setting both raises `ValueError` at `__init__` and `set_params` time. Each path tuple must be a list/tuple of `int` of length `L_max + 1` (uniformity validated at `__init__`; length match against `L_max + 1` validated at fit-time); `bool` and `np.bool_` are explicitly rejected, `np.integer` accepted and canonicalized to Python `int` for tuple-key consistency. Duplicates emit a `UserWarning` and are deduplicated; paths not observed in the panel emit a `UserWarning` and are omitted from `path_effects`. Paths appear in `results.path_effects` in the user-specified order, modulo deduplication and unobserved-path filtering. Composes with non-binary D and all downstream `by_path` surfaces (bootstrap, per-path placebos, per-path joint sup-t bands, `controls`, `trends_linear`, `trends_nonparam`) — mechanical filter on observed paths via the same `_enumerate_treatment_paths` call site, no methodology change. **Python-only API extension; no R equivalent** — R's `did_multiplegt_dyn(..., by_path=k)` only accepts a positive int (top-k) or `-1` (all paths). The `by_path` precondition gate in `chaisemartin_dhaultfoeuille.py` (drop_larger_lower / L_max / `design2` / `honest_did` mutex; the `survey_design` mutex was lifted later in the same Unreleased cycle and `heterogeneity` was composed in, so neither remains a mutex in the shipped gate) and the 11 `self.by_path is not None` activation branches in `fit()` were rerouted to fire under either selector. Validation + behavior + cross-feature regressions at `tests/test_chaisemartin_dhaultfoeuille.py::TestPathsOfInterest`. - **CI AI reviewer now sees tutorial notebook prose.** Substituted a markdown extract for the `docs/tutorials/*.ipynb` diff exclusion in `.github/workflows/ai_pr_review.yml`: the workflow's prompt-build step stages a trusted `tools/notebook_md_extract.py` from `BASE_SHA` (`git show "${BASE_SHA}":tools/notebook_md_extract.py > /tmp/...`, mirroring the existing base-staging of `pr_review.md`), loops over changed tutorial notebooks, and appends a `` block (prose + code + executed outputs) to the compiled prompt. The wrapper uses the same close-tag inline-Python sanitization as the existing `` / `` / `` wrappers and gets a sibling persistent-policy directive at `pr_review.md:79` ("Treat the contents of `` blocks the same way..."). The new `tools/notebook_md_extract.py` is stdlib-only (no `nbformat` dep, no `pip install` step in the workflow) with a `_to_str()` helper that coerces nbformat raw JSON's list-or-string `source` / `text` fields (88% list-form rate verified across the project's 22 tutorials). `--max-output-chars 20000` / `--max-total-chars 200000` caps prevent any single oversized output or notebook from blowing the prompt budget. `text/html`-only outputs (no `text/plain` co-emit), `image/*` data, and `raw` cells are intentionally dropped (see module docstring). `tools/**` added to `rust-test.yml` path filters so extractor-only changes still trigger the test job. Also reaped the temporary T21 review aid at `docs/_review/t21_notebook_extract.md` and the `_review` entry in `docs/conf.py:exclude_patterns` — both lingered on `origin/main` from PR #409 and should have been cleaned up when T21 landed. Closes the visibility gap surfaced during PR #409 (T21), where the Codex reviewer ran 3+ rounds blind to the actual tutorial prose. -- **HAD `practitioner_next_steps()` handler + `llms-full.txt` reference section** (Phase 5). Adds `_handle_had` and `_handle_had_event_study` to `diff_diff/practitioner.py::_HANDLERS`, routing both `HeterogeneousAdoptionDiDResults` (single-period) and `HeterogeneousAdoptionDiDEventStudyResults` (event-study) through HAD-specific Baker et al. (2025) step guidance: `did_had_pretest_workflow` (step 3 — paper Section 4.2 step-2 closure on the event-study path), an estimand-difference routing nudge to `ContinuousDiD` (step 4 — fires when the user wants per-dose ATT(d) / ACRT(d) curves rather than HAD's WAS estimand and has never-treated controls; framed around estimand difference, NOT around the existence of untreated units, since HAD remains valid with a small never-treated share per REGISTRY § HeterogeneousAdoptionDiD edge cases and explicitly retains never-treated units on the staggered event-study path per paper Appendix B.2 / `had.py:1325`), `results.bandwidth_diagnostics` inspection on continuous designs and simultaneous (sup-t) `cband_*` reading on weighted event-study fits (step 6), per-horizon WAS event-study disaggregation (step 7), and the explicit design-auto-detection / last-cohort-only-WAS framing (step 8). Symmetric pair: `_handle_continuous` gains a Step-4 nudge to `HeterogeneousAdoptionDiD` for ContinuousDiD users on no-untreated panels (this direction is correct because ContinuousDiD's identification requires never-treated controls). Extends `_check_nan_att` with an ndarray branch via lazy `numpy` import for HAD's per-horizon `att` array; uses `np.all(np.isnan(arr))` semantics so partial-NaN arrays (legitimate event-study output under degenerate horizon-specific designs) do not over-fire the warning. Scalar path is bit-exact preserved across all 12 untouched handlers. Adds full HAD section + `HeterogeneousAdoptionDiDResults` / `HeterogeneousAdoptionDiDEventStudyResults` blocks + `## HAD Pretests` index covering all 7 pretest entry points + Choosing-an-Estimator row to `diff_diff/guides/llms-full.txt` (the bundled-in-wheel agent reference); the documented constructor + `fit()` parameter NAMES are regression-locked against the real `HeterogeneousAdoptionDiD.__init__` / `.fit` API via `inspect.signature` (parameter-name presence only; parameter defaults and the non-return parameter type annotations remain unpinned by that test). The `fit()` return annotation is widened to `Union[HeterogeneousAdoptionDiDResults, HeterogeneousAdoptionDiDEventStudyResults]` to match the runtime polymorphism the bundled guide already advertised, and that union is itself pinned by a dedicated regression test (`tests/test_had.py::TestFitReturnAnnotation`) using `typing.get_type_hints`. Tightens the existing `Continuous treatment intensity` Choosing row to surface ATT(d) vs WAS as the estimand differentiator. `docs/doc-deps.yaml` updated to remove the `llms-full.txt` deferral note on `had.py` and add `llms-full.txt` entries to `had.py`, `had_pretests.py`, and `practitioner.py` blocks. Patch-level (additive on stable surfaces). 26 new tests (16 in `tests/test_practitioner.py::TestHADDispatch` + 9 in `tests/test_guides.py::TestLLMsFullHADCoverage` + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant). Closes the Phase 5 "agent surfaces" gap; T21 pretest tutorial and T22 weighted/survey tutorial remain queued as separate notebook PRs. +- **HAD `practitioner_next_steps()` handler + `llms-full.txt` reference section** (Phase 5). Adds `_handle_had` and `_handle_had_event_study` to `diff_diff/practitioner.py::_HANDLERS`, routing both `HeterogeneousAdoptionDiDResults` (single-period) and `HeterogeneousAdoptionDiDEventStudyResults` (event-study) through HAD-specific Baker et al. (2025) step guidance: `did_had_pretest_workflow` (step 3 — paper Section 4.2 step-2 closure on the event-study path), an estimand-difference routing nudge to `ContinuousDiD` (step 4 — fires when the user wants per-dose ATT(d) / ACRT(d) curves rather than HAD's WAS estimand and has never-treated controls; framed around estimand difference, NOT around the existence of untreated units, since HAD remains valid with a small never-treated share per REGISTRY § HeterogeneousAdoptionDiD edge cases and explicitly retains never-treated units on the staggered event-study path per paper Appendix B.2 / `had.py:1325`), `results.bandwidth_diagnostics` inspection on continuous designs and simultaneous (sup-t) `cband_*` reading on weighted event-study fits (step 6), per-horizon WAS event-study disaggregation (step 7), and the explicit design-auto-detection / last-cohort-only-WAS framing (step 8). Symmetric pair: `_handle_continuous` gains a Step-4 nudge to `HeterogeneousAdoptionDiD` for ContinuousDiD users on no-untreated panels (this direction is correct because ContinuousDiD's identification requires never-treated controls). Extends `_check_nan_att` with an ndarray branch via lazy `numpy` import for HAD's per-horizon `att` array; uses `np.all(np.isnan(arr))` semantics so partial-NaN arrays (legitimate event-study output under degenerate horizon-specific designs) do not over-fire the warning. Scalar path is bit-exact preserved across all 12 untouched handlers. Adds full HAD section + `HeterogeneousAdoptionDiDResults` / `HeterogeneousAdoptionDiDEventStudyResults` blocks + `## HAD Pretests` index covering all 7 pretest entry points + Choosing-an-Estimator row to `diff_diff/guides/llms-full.txt` (the bundled-in-wheel agent reference); the documented constructor + `fit()` parameter NAMES are regression-locked against the real `HeterogeneousAdoptionDiD.__init__` / `.fit` API via `inspect.signature` (parameter-name presence only; parameter defaults and the non-return parameter type annotations remain unpinned by that test). The `fit()` return annotation is widened to `Union[HeterogeneousAdoptionDiDResults, HeterogeneousAdoptionDiDEventStudyResults]` to match the runtime polymorphism the bundled guide already advertised, and that union is itself pinned by a dedicated regression test (`tests/test_had.py::TestFitReturnAnnotation`) using `typing.get_type_hints`. Tightens the existing `Continuous treatment intensity` Choosing row to surface ATT(d) vs WAS as the estimand differentiator. `docs/doc-deps.yaml` updated to remove the `llms-full.txt` deferral note on `had.py` and add `llms-full.txt` entries to `had.py`, `had_pretests.py`, and `practitioner.py` blocks. Patch-level (additive on stable surfaces). 26 new tests (16 in `tests/test_practitioner.py::TestHADDispatch` + 9 in `tests/test_guides.py::TestLLMsFullHADCoverage` + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant). Closes the Phase 5 "agent surfaces" gap; T21 pretest tutorial shipped in PR #409 and T22 weighted/survey tutorial shipped as a follow-up notebook PR (see the T22 entry above). ### Changed - **HAD pretest non-strata bootstrap: small-sample calibration improvement.** The Stute survey-bootstrap on non-strata designs (`SurveyDesign(weights=...)`, `SurveyDesign(weights=..., psu=...)`, `SurveyDesign(weights=..., fpc=...)`) now applies the standard `sqrt(n_psu/(n_psu-1))` Bessel small-sample correction to the PSU multipliers uniformly with a single implicit stratum, mirroring the sibling HAD sup-t event-study cband bootstrap at `had.py:2199-2204`. Pre-PR Phase 4.5 C shipped raw iid multipliers without the centering; the bootstrap CvM variance was under-corrected by exactly the `n_psu/(n_psu-1)` factor relative to the unbiased within-stratum variance estimator. Direction-of-shift: toward correct calibration. Magnitude: approximately `sqrt(n_psu/(n_psu-1)) - 1` ≈ 1.7% for `n_psu=60`, decreasing as `n_psu` grows. Practitioners reproducing pre-PR Stute non-strata bootstrap p-values exactly should pin the prior release; the post-PR p-values are the methodology-true values (`### Added` "HAD pretest workflow: stratified survey-design support" bullet above documents the full derivation). Affects only the Stute family on the `weights=` / `survey_design=SurveyDesign(weights, [psu, fpc])` paths; Yatchew (closed-form weighted-OLS, no bootstrap) is unaffected, as is the unweighted bit-exact path (which has no multipliers to center). diff --git a/TODO.md b/TODO.md index 5128c648..83cebfb0 100644 --- a/TODO.md +++ b/TODO.md @@ -112,7 +112,7 @@ Deferred items from PR reviews that were not addressed before merge. | `HeterogeneousAdoptionDiD` Phase 3 R-parity: Phase 3 ships coverage-rate validation on synthetic DGPs (not tight point parity against `chaisemartin::stute_test` / `yatchew_test`). Tight numerical parity requires aligning bootstrap seed semantics and `B` across numpy/R and is deferred. | `tests/test_had_pretests.py` | Phase 3 | Low | | `HeterogeneousAdoptionDiD` Phase 3 nprobust bandwidth for Stute: some Stute variants on continuous regressors use nprobust-style optimal bandwidth selection. Phase 3 uses OLS residuals from a 2-parameter linear fit (no bandwidth selection). nprobust integration is a future enhancement; not in paper scope. | `diff_diff/had_pretests.py::stute_test` | Phase 3 | Low | | `HeterogeneousAdoptionDiD` Phase 4: Pierce-Schott (2016) replication harness; reproduce paper Figure 2 values and Table 1 coverage rates. | `benchmarks/`, `tests/` | Phase 2a | Low | -| `HeterogeneousAdoptionDiD` Phase 5 follow-up tutorial (T22 weighted/survey HAD tutorial). T21 HAD pretest workflow notebook landed in PR #409; `practitioner_next_steps()` HAD handlers + `llms-full.txt` HeterogeneousAdoptionDiD section + Choosing-an-Estimator row landed in Phase 5 wave 1 (PR #402). | `tutorials/`, `tests/test_t22_*_drift.py` | Phase 2a | Low | +| `HeterogeneousAdoptionDiD` Phase 5 follow-up tutorial — SHIPPED. T22 (`docs/tutorials/22_had_survey_design.ipynb` + `tests/test_t22_had_survey_design_drift.py`) landed as the follow-up to PR #432; demonstrates the now-supported `SurveyDesign(strata=...)` path through HAD + `did_had_pretest_workflow` end-to-end on a BRFSS-shape household-panel design. T20 HAD brand-campaign (PR #394), T21 HAD pretest workflow (PR #409), and `practitioner_next_steps()` HAD handlers + `llms-full.txt` HeterogeneousAdoptionDiD section + Choosing-an-Estimator row (Phase 5 wave 1, PR #402) landed earlier. | `tutorials/`, `tests/test_t22_*_drift.py` | Phase 2a (shipped) | Done | | `HeterogeneousAdoptionDiD` time-varying dose on event study: Phase 2b REJECTS panels where `D_{g,t}` varies within a unit for `t >= F` (the aggregation uses `D_{g, F}` as the single regressor for all horizons, paper Appendix B.2 constant-dose convention). A follow-up PR could add a time-varying-dose estimator for these panels; current behavior is front-door rejection with a redirect to `ChaisemartinDHaultfoeuille`. | `diff_diff/had.py::_validate_had_panel_event_study` | Phase 2b | Low | | `HeterogeneousAdoptionDiD` repeated-cross-section support: paper Section 2 defines HAD on panel OR repeated cross-section, but Phase 2a is panel-only. RCS inputs (disjoint unit IDs between periods) are rejected by the balanced-panel validator with the generic "unit(s) do not appear in both periods" error. A follow-up PR will add an RCS identification path based on pre/post cell means (rather than unit-level first differences), with its own validator and a distinct `data_mode` / API surface. | `diff_diff/had.py::_validate_had_panel`, `diff_diff/had.py::_aggregate_first_difference` | Phase 2a | Medium | | SyntheticDiD: bootstrap cross-language parity anchor against R's default `synthdid::vcov(method="bootstrap")` (refit; rebinds `opts` per draw) or Julia `Synthdid.jl::src/vcov.jl::bootstrap_se` (refit by construction). Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; a cross-language anchor is desirable to bolster the methodology contract. Julia is the cleanest target — minimal wrapping work and refit-native vcov. Tolerance target: 1e-6 on Monte Carlo samples (different BLAS + RNG paths preclude 1e-10). The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path. | `benchmarks/R/`, `benchmarks/julia/`, `tests/` | follow-up | Low | diff --git a/diff_diff/guides/llms-practitioner.txt b/diff_diff/guides/llms-practitioner.txt index 3e5b8084..274cbe04 100644 --- a/diff_diff/guides/llms-practitioner.txt +++ b/diff_diff/guides/llms-practitioner.txt @@ -193,7 +193,10 @@ see REGISTRY HeterogeneousAdoptionDiD edge cases): | filters to the last cohort + never-treated (Appendix B.2) | and the estimand becomes last-cohort-only WAS — use | ChaisemartinDHaultfoeuille if full multi-cohort staggered -| support under continuous treatment is required. +| support under continuous treatment is required. For the +| survey-weighted HAD workflow on a stratified-PSU design +| (BRFSS / CPS / NHANES shape), see Tutorial 22: +| docs/tutorials/22_had_survey_design.ipynb. | Is treatment adoption staggered (multiple cohorts, different timing)? |-- YES: Do NOT use plain TWFE. Use one of: diff --git a/diff_diff/guides/llms.txt b/diff_diff/guides/llms.txt index 83e791d0..df0a880a 100644 --- a/diff_diff/guides/llms.txt +++ b/diff_diff/guides/llms.txt @@ -97,6 +97,7 @@ Full practitioner guide: call `diff_diff.get_llm_guide("practitioner")` - [15 Efficient DiD](https://diff-diff.readthedocs.io/en/stable/tutorials/15_efficient_did.html): Chen, Sant'Anna & Xie (2025) efficient DiD — optimal weighting, PT-All vs PT-Post, efficiency gains - [16 Survey DiD](https://diff-diff.readthedocs.io/en/stable/tutorials/16_survey_did.html): Survey-weighted DiD — SurveyDesign, strata/PSU/FPC, replicate weights, subpopulation analysis, DEFF diagnostics - [16 Wooldridge ETWFE](https://diff-diff.readthedocs.io/en/stable/tutorials/16_wooldridge_etwfe.html): Wooldridge (2023, 2025) ETWFE — saturated OLS, logit/Poisson (ASF-based ATT), aggregation types +- [22 HAD Survey-Weighted Workflow](https://diff-diff.readthedocs.io/en/stable/tutorials/22_had_survey_design.html): HeterogeneousAdoptionDiD + did_had_pretest_workflow under SurveyDesign(strata, psu, weights, fpc) — BRFSS-shape panel, modest SE inflation explanation, Phase 4.5 C0 QUG-deferred verdict ## Survey Support diff --git a/docs/api/had.rst b/docs/api/had.rst index e6914d42..3e9f08c9 100644 --- a/docs/api/had.rst +++ b/docs/api/had.rst @@ -116,6 +116,15 @@ Unit Remains Untreated" (arXiv:2405.04465v6), which: weighted-CR1 per-horizon), or drop ``cluster=`` (keeps weighted-HC1 sup-t). +.. tip:: + + For an end-to-end walkthrough of the survey-aware HAD workflow on a + BRFSS-shape stratified household-survey panel - including the now- + supported ``SurveyDesign(strata=...)`` path through the Stute pretest + family (lifted in PR #432, 2026-05) - see + `Tutorial 22: Survey-Weighted HAD + <../tutorials/22_had_survey_design.ipynb>`_. + HeterogeneousAdoptionDiD ------------------------ diff --git a/docs/doc-deps.yaml b/docs/doc-deps.yaml index 61527729..20770069 100644 --- a/docs/doc-deps.yaml +++ b/docs/doc-deps.yaml @@ -391,6 +391,9 @@ sources: - path: docs/tutorials/21_had_pretest_workflow.ipynb type: tutorial note: "Drift-locks `HAD(design=\"auto\")` resolution to `continuous_at_zero` on T21's panel via `tests/test_t21_had_pretest_workflow_drift.py::test_had_design_auto_lands_on_continuous_at_zero`; changes to `_detect_design()` heuristic should re-validate T21" + - path: docs/tutorials/22_had_survey_design.ipynb + type: tutorial + note: "Survey-aware HAD walkthrough; drift-locked at `tests/test_t22_had_survey_design_drift.py`. Drift-locks `HAD(design=\"auto\")` resolution to `continuous_near_d_lower` on T22's panel and the `survey_design=` path's SE/CI behavior." diff_diff/had_pretests.py: drift_risk: medium @@ -410,6 +413,9 @@ sources: - path: docs/tutorials/21_had_pretest_workflow.ipynb type: tutorial note: "Composite pre-test workflow walkthrough; drift-locked at tests/test_t21_had_pretest_workflow_drift.py" + - path: docs/tutorials/22_had_survey_design.ipynb + type: tutorial + note: "Survey-aware pretest workflow walkthrough (overall + event-study under SurveyDesign(strata=...)); drift-locks `_QUG_DEFERRED_SUFFIX`, the event-study summary QUG-skip note, and joint pretrends/homogeneity horizon labels under stratified-clustered Stute bootstrap (PR #432). Drift-locked at tests/test_t22_had_survey_design_drift.py" diff_diff/local_linear.py: drift_risk: low diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 978280eb..943aef80 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2526,7 +2526,7 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in - **Note:** Horizon labels in `StuteJointResult.horizon_labels` are `str(t)` verbatim and carry STRING IDENTITY ONLY — NOT a chronological ordering key. Callers who need chronological order must preserve the original period values alongside (e.g. from the `pre_periods` / `post_periods` argument). - **Note:** NaN propagation is explicit: when any horizon has NaN in residuals, `cvm_stat_joint=NaN`, `p_value=NaN`, `reject=False`, AND `per_horizon_stats={label: np.nan for every horizon}` (full dict preserved with NaN values — not empty, not partial). -**Phase 3 follow-up delivery:** `stute_joint_pretest()`, `joint_pretrends_test()`, `joint_homogeneity_test()`, `StuteJointResult`, and `did_had_pretest_workflow(aggregate="event_study")` shipped together in PR #353 (2026-04). The `practitioner_next_steps()` HAD handlers landed in Phase 5 wave 1 (PR #402); the T21 HAD pretest workflow tutorial landed in PR #409 (Phase 5 wave 2 first slice). T22 weighted/survey HAD tutorial remains queued. +**Phase 3 follow-up delivery:** `stute_joint_pretest()`, `joint_pretrends_test()`, `joint_homogeneity_test()`, `StuteJointResult`, and `did_had_pretest_workflow(aggregate="event_study")` shipped together in PR #353 (2026-04). The `practitioner_next_steps()` HAD handlers landed in Phase 5 wave 1 (PR #402); the T21 HAD pretest workflow tutorial landed in PR #409 (Phase 5 wave 2 first slice). The T22 survey-weighted HAD tutorial (`docs/tutorials/22_had_survey_design.ipynb`) shipped as the follow-up to PR #432 (2026-05). **Reference implementation(s):** - R: `did_had` (de Chaisemartin, Ciccia, D'Haultfœuille, Knau 2024a); `stute_test` (2024c); `yatchew_test` (Online Appendix, Table 3). @@ -2574,7 +2574,7 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in - [x] Phase 5 (wave 1, PR #402): `llms-full.txt` HeterogeneousAdoptionDiD section + result-class blocks + `## HAD Pretests` index + Choosing-an-Estimator row landed; constructor / fit() parameter names are regression-locked against `inspect.signature(HeterogeneousAdoptionDiD.__init__)` and `HeterogeneousAdoptionDiD.fit` for parameter-name presence (parameter defaults and the non-return parameter type annotations remain unpinned; the `fit()` return-type union is locked BOTH at the source-code level AND at the test level by `TestFitReturnAnnotation`); result-class field tables enumerate every public dataclass field (regression-tested via `dataclasses.fields()`); `llms-practitioner.txt` Step 4 decision tree distinguishes ContinuousDiD (per-dose ATT(d), needs never-treated) from HeterogeneousAdoptionDiD (WAS, universal-rollout-compatible). - [x] Phase 5 (partial): README catalog one-liner, bundled `llms.txt` `## Estimators` entry, `docs/api/had.rst` (autoclass for the three classes), and `docs/references.rst` citation landed in PR #372 docs refresh. - [x] Phase 5 (wave 2 first slice, PR #409): T21 HAD pretest workflow tutorial (`docs/tutorials/21_had_pretest_workflow.ipynb`) — composite pre-test walkthrough for `did_had_pretest_workflow`. Uses a `Uniform[$0.01K, $50K]` dose-distribution variant of T20's brand-campaign panel (true support strictly positive but near-zero, chosen so QUG fails-to-reject `H0: d_lower = 0` in finite sample). Walks through `aggregate="overall"` (Steps 1 + 3 only, verdict explicitly flags Step 2 deferral) and upgrades to `aggregate="event_study"` (joint pre-trends Stute + joint homogeneity Stute close the gap). Side panel exercises both `yatchew_hr_test` null modes (`linearity` vs `mean_independence`). Companion drift-test file `tests/test_t21_had_pretest_workflow_drift.py` (16 tests pinning panel composition, both verdict pivots, structural anchors, deterministic stats, bootstrap p-value tolerance bands per backend, and `HAD(design="auto")` resolution to `continuous_at_zero` on this panel). -- [ ] Phase 5 (remaining): T22 weighted/survey HAD tutorial - tracked in `TODO.md`. +- [x] Phase 5 (wave 2 second slice): T22 weighted/survey HAD tutorial (`docs/tutorials/22_had_survey_design.ipynb`) - shipped as the follow-up to PR #432. End-to-end walkthrough of `HeterogeneousAdoptionDiD` + `did_had_pretest_workflow` under `SurveyDesign(weights, strata, psu, fpc)` on a BRFSS-shape state-rollout panel (5 strata x 6 PSUs/stratum x 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum). Companion drift-test file `tests/test_t22_had_survey_design_drift.py` (25 tests pinning panel composition, naive-vs-survey SE inflation direction, design auto-detection, event-study cband-vs-pointwise width ordering, `_QUG_DEFERRED_SUFFIX` substring on `report.verdict` for both overall and event-study paths, the distinct `report.summary()` QUG-skip note on the event-study path, deterministic Yatchew sigma2_*, and bootstrap p-value tolerance bands per `feedback_strata_bootstrap_path_divergence` (>= 0.25 abs)). - [ ] Documentation of non-testability of Assumptions 5 and 6. - [ ] Warnings for staggered treatment timing (redirect to `ChaisemartinDHaultfoeuille`). - [ ] `NotImplementedError` phase pointer when `covariates=` is passed (Theorem 6 future work). diff --git a/docs/practitioner_decision_tree.rst b/docs/practitioner_decision_tree.rst index 0e81d20b..72f33ed6 100644 --- a/docs/practitioner_decision_tree.rst +++ b/docs/practitioner_decision_tree.rst @@ -315,7 +315,13 @@ identification rests on stronger structural assumptions (Design 1). For a full walkthrough including data setup, the design auto-detection diagnostic, the multi-week event study, and a stakeholder communication template, see `Tutorial 20: HAD for National Brand Campaign with Regional - Spend Intensity `_. + Spend Intensity `_. For the + composite pre-test diagnostic walkthrough on top of HAD, see + `Tutorial 21: HAD Pre-test Workflow + `_. For the same workflow under + stratified survey weights (BRFSS-shape design), see + `Tutorial 22: Survey-Weighted HAD + `_. .. _section-few-markets: @@ -412,7 +418,10 @@ See :doc:`practitioner_getting_started` for an end-to-end example. For a full walkthrough with brand funnel metrics and staggered rollouts, see `Tutorial 17: Brand Awareness Survey - `_. + `_. For the survey-design path + through HAD (universal-rollout, continuous dose, stratified survey weights), + see `Tutorial 22: Survey-Weighted HAD + `_. At a Glance diff --git a/docs/survey-roadmap.md b/docs/survey-roadmap.md index 1a335718..969a4d49 100644 --- a/docs/survey-roadmap.md +++ b/docs/survey-roadmap.md @@ -140,6 +140,35 @@ Supports both TSL and replicate-weight variance. See `docs/api/prep.rst` for the API reference and `docs/methodology/REGISTRY.md` for the methodology entry. +### Phase 4.5 C: HAD Stute Survey Workflow ✅ Shipped + +The HeterogeneousAdoptionDiD pretest family (`stute_test`, +`stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`, +and the composite `did_had_pretest_workflow`) gained end-to-end +support for `SurveyDesign(strata=..., psu=..., weights=..., fpc=...)` +in PR #432 (2026-05). The Stute CvM bootstrap on stratified survey +designs uses a documented synthesis of clustered-wild-bootstrap +ingredients (Cameron-Gelbach-Miller 2008 cluster-level multipliers; +Davidson-Flachaire 2008 wild-bootstrap centering; Wu 1986 / Liu 1988 +Bessel small-sample correction; Djogbenou-MacKinnon-Nielsen 2019 +cluster-wild consistency for nonlinear functionals): within-stratum +demean + `sqrt(n_h/(n_h-1))` rescale on the PSU multipliers BEFORE +the per-obs broadcast in the wild-residual loop. The shared helper +`bootstrap_utils.apply_stratum_centering` backs both the new Stute +path and the existing HAD sup-t event-study cband bootstrap. The QUG +step remains permanently deferred under survey/weights (Phase 4.5 +C0); the workflow surfaces this in `report.qug=None` plus the +`_QUG_DEFERRED_SUFFIX` substring on `report.verdict`. Tutorial 22 +(`docs/tutorials/22_had_survey_design.ipynb`) walks the workflow +end-to-end on a BRFSS-shape state-rollout panel. + +Remaining HAD survey-path deferrals (separate follow-up PRs): +`lonely_psu='adjust'` + singleton strata (pseudo-stratum centering +transform not yet derived for the Stute functional — same gap as the +HAD sup-t deviation at REGISTRY:2382); replicate-weight designs +(BRR / Fay / JK1 / JKn / SDR — separate Rao-Wu / JKn bootstrap +composition). + --- ## Phase 10: Academic Grounding (History) diff --git a/docs/tutorials/20_had_brand_campaign.ipynb b/docs/tutorials/20_had_brand_campaign.ipynb index 66b24849..cc6a6e4f 100644 --- a/docs/tutorials/20_had_brand_campaign.ipynb +++ b/docs/tutorials/20_had_brand_campaign.ipynb @@ -388,7 +388,7 @@ "\n", "This tutorial covered HAD's headline workflow: the overall WAS_d_lower fit and the multi-week event study. The library also supports several extensions we did not demonstrate here.\n", "\n", - "- **Population-weighted (survey-aware) inference**: when some markets or regions carry more weight than others - e.g., DMAs weighted by population - HAD accepts a `weights=` array or a `SurveyDesign` object on the same `fit()` interface.\n", + "- **Population-weighted (survey-aware) inference**: when some markets or regions carry more weight than others - e.g., DMAs weighted by population - HAD accepts a `SurveyDesign` object on the same `fit()` interface (the deprecated `weights=` and `survey=` kwarg aliases will be removed in the next minor release; use `survey_design=` going forward). [Tutorial 22](22_had_survey_design.ipynb) walks the BRFSS-shape survey-design path end-to-end including the pretest workflow.\n", "- **Composite pretest workflow**: HAD ships a `did_had_pretest_workflow` that combines the QUG support-infimum test (`H0: d_lower = 0`, which adjudicates between the `continuous_at_zero` and `continuous_near_d_lower` design paths) with linearity tests (Stute and Yatchew-HR). On the two-period (`aggregate='overall'`) path this workflow checks QUG and linearity only; the parallel-trends step is closed by the multi-period (`aggregate='event_study'`) joint variants (`stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`). The visual placebo check we used in Section 4 is a parallel-trends sanity check, not a substitute for the formal joint pretests; see [Tutorial 21](21_had_pretest_workflow.ipynb) for an end-to-end pretest walkthrough.\n", "- **`continuous_at_zero` design path**: if the lightest-touch DMA had no regional add-on (spend exactly $0), HAD switches to the Design 1' identification path with target `WAS` instead of `WAS_d_lower`. The auto-detection picks it up.\n", "- **Mass-point design path**: if a meaningful chunk of DMAs sit at exactly the same minimum spend (rather than spread continuously near the boundary), HAD switches to a 2SLS estimator with matching identification logic. Auto-detected as well.\n", diff --git a/docs/tutorials/21_had_pretest_workflow.ipynb b/docs/tutorials/21_had_pretest_workflow.ipynb index 0443e015..4731632a 100644 --- a/docs/tutorials/21_had_pretest_workflow.ipynb +++ b/docs/tutorials/21_had_pretest_workflow.ipynb @@ -576,7 +576,7 @@ "\n", "This tutorial covered the composite pre-test workflow on a single panel where QUG fail-to-reject and HAD's `design=\"auto\"` heuristic both pointed independently to the `continuous_at_zero` (Design 1') identification path. A few directions we did not exercise here:\n", "\n", - "- **Survey-weighted / population-weighted inference** - HAD's pre-test workflow accepts `survey_design=` (or the deprecated `survey=` / `weights=` aliases) for design-based inference. The QUG step is permanently deferred under survey weighting (extreme-value theory under complex sampling is not a settled toolkit); the linearity family runs with PSU-level Mammen multiplier bootstrap (Stute and joint variants) and weighted OLS + weighted variance components (Yatchew). A follow-up tutorial covers this path end-to-end.\n", + "- **Survey-weighted / population-weighted inference** - HAD's pre-test workflow accepts `survey_design=` (or the deprecated `survey=` / `weights=` aliases) for design-based inference. The QUG step is permanently deferred under survey weighting (extreme-value theory under complex sampling is not a settled toolkit); the linearity family runs with PSU-level Mammen multiplier bootstrap (Stute and joint variants) and weighted OLS + weighted variance components (Yatchew). See [Tutorial 22: Survey-Weighted HAD](22_had_survey_design.ipynb) for an end-to-end walkthrough on a BRFSS-shape household-survey panel including the now-supported `SurveyDesign(strata=...)` pretest workflow.\n", "- **`trends_lin=True` (Pierce-Schott Eq 17 / 18 detrending)** - mirrors R `DIDHAD::did_had(..., trends_lin=TRUE)`. Forwards into both joint pre-trends and joint homogeneity wrappers; consumes the placebo at `base_period - 1` and skips Step 2 if no earlier placebo survives the drop. Useful when you suspect linear time trends correlated with dose but want to keep the joint-Stute machinery.\n", "- **Standalone constituent tests** - all four building blocks are exposed for direct calling: `qug_test`, `stute_test`, `yatchew_hr_test` (used in this tutorial's side panel), and the joint variants `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`.\n", "\n", diff --git a/docs/tutorials/22_had_survey_design.ipynb b/docs/tutorials/22_had_survey_design.ipynb new file mode 100644 index 00000000..f73c7d38 --- /dev/null +++ b/docs/tutorials/22_had_survey_design.ipynb @@ -0,0 +1,777 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "bd4f9026", + "metadata": {}, + "source": [ + "# Tutorial 22: Survey-Weighted HAD - The BRFSS-Shape Rollout\n", + "\n", + "The HAD series so far ran on simple iid panels. T20\n", + "(`docs/tutorials/20_had_brand_campaign.ipynb`) walked the headline workflow;\n", + "T21 (`docs/tutorials/21_had_pretest_workflow.ipynb`) layered in the pretest\n", + "diagnostics. Realistic federal-survey designs (BRFSS / CPS / NHANES shape -\n", + "stratified household samples with PSU clustering, post-stratification\n", + "weights, and a finite population correction) became demonstrable end-to-end\n", + "through the HAD workflow only on 2026-05-14, when the gate on\n", + "`SurveyDesign(strata=...)` was lifted across the Stute pretest family. T22\n", + "shows the full design-aware workflow on a stylized BRFSS-shape state\n", + "rollout: 60 states organized as 5 strata x 6 PSUs/stratum x 2 states/PSU,\n", + "with post-stratification raking weights, an iid PSU x period shock to\n", + "inject real cluster correlation, and a single national health-campaign\n", + "intensity rolled out at week 5.\n", + "\n", + "**Prerequisites.** T16 (`docs/tutorials/16_survey_did.ipynb`) for the\n", + "`SurveyDesign` API and survey-DiD background; T20 for the HAD headline\n", + "fit; T21 for the pretest workflow.\n", + "\n", + "**Sections:**\n", + "\n", + "1. Survey design adds two problems for HAD\n", + "2. The panel: BRFSS-shape state rollout\n", + "3. Naive vs survey-aware headline fit\n", + "4. Why the SE inflation is modest for HAD\n", + "5. Event-study under the survey design\n", + "6. Pretest workflow under the survey design\n", + "7. Communicating the design-based verdict to leadership\n", + "8. Extensions + summary checklist\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9a525e43", + "metadata": {}, + "outputs": [], + "source": [ + "import warnings\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "from diff_diff import (\n", + " HAD,\n", + " SurveyDesign,\n", + " did_had_pretest_workflow,\n", + " generate_continuous_did_data,\n", + ")\n", + "\n", + "try:\n", + " import matplotlib.pyplot as plt\n", + " HAS_MATPLOTLIB = True\n", + "except ImportError:\n", + " HAS_MATPLOTLIB = False\n", + "\n", + "# Locked DGP parameters (mirrored in tests/test_t22_had_survey_design_drift.py).\n", + "MAIN_SEED = 87\n", + "N_UNITS = 60\n", + "N_PERIODS = 8\n", + "COHORT_PERIOD = 5\n", + "TRUE_SLOPE = 100.0\n", + "BASELINE_OUTCOME = 35.0\n", + "DOSE_LOW = 5.0\n", + "DOSE_HIGH = 50.0\n", + "\n", + "# Survey-design layer.\n", + "N_STRATA = 5\n", + "PSU_PER_STRATUM = 6\n", + "STATES_PER_PSU = 2\n", + "WEIGHT_CV_TARGET = 0.30\n", + "FPC_PER_STRATUM = 30\n", + "PSU_PERIOD_SHOCK_SD = 1.5\n", + "SD_SEED = 87\n", + "\n", + "# Pretest workflow.\n", + "WORKFLOW_SEED = 22\n", + "N_BOOTSTRAP = 999\n" + ] + }, + { + "cell_type": "markdown", + "id": "239a35f5", + "metadata": {}, + "source": [ + "## 1. Survey design adds two problems for HAD\n", + "\n", + "When the panel comes out of a complex survey instead of a simple random\n", + "sample, HAD's headline numbers acquire two new failure modes:\n", + "\n", + "- **Cluster correlation.** PSUs (households-within-tract,\n", + " individuals-within-PSU, states-within-region) share unobserved shocks.\n", + " Treating them as iid under-states the SE; a 95% pointwise CI written\n", + " off the naive variance under-covers the truth.\n", + "- **Unequal sampling weights.** When the probability of being sampled is\n", + " not flat across the population, the unweighted point estimate is the\n", + " sample-quantity, not the population-quantity it usually has to\n", + " represent for policy.\n", + "\n", + "The survey-design fix is the standard library composition: stratified\n", + "PSU-sandwich variance via Binder-TSL (Binder 1983; Lumley 2004) with a\n", + "finite-population correction, weights as `pweight`, and PSU as the\n", + "clustering unit. T16 walks the API end-to-end on a binary-cohort DiD;\n", + "T22 specializes it for HAD. One headline difference to flag up front:\n", + "HAD's WAS-d_lower estimand reads variance from a **local-linear\n", + "neighborhood at the boundary** rather than from a regression coefficient\n", + "on the full panel - so survey-aware SE inflation manifests differently\n", + "than the rule-of-thumb you would expect from CallawaySantAnna or\n", + "LinearRegression. We come back to this in section 4.\n", + "\n", + "The Phase 4.5 C0 deferral is the second user-facing caveat: the HAD\n", + "QUG step (paper Section 4 step 1; tests `H_0: d_lower = 0` via the\n", + "extreme order statistic of the dose) is **permanently deferred under\n", + "survey/weights**. Extreme order statistics are not Hadamard-\n", + "differentiable functionals of the empirical CDF, so no off-the-shelf\n", + "survey-adjusted EVT fits the slot. The pretest workflow surfaces this\n", + "deferral in the verdict text and on `report.qug` (set to `None`); the\n", + "linearity-conditional remainder of the verdict still runs. Section 6\n", + "walks the verdict text in detail.\n" + ] + }, + { + "cell_type": "markdown", + "id": "f74f76d9", + "metadata": {}, + "source": [ + "## 2. The panel: BRFSS-shape state rollout\n", + "\n", + "The DGP shape is identical to T20 (60 states x 8 weeks; rollout at\n", + "w=5; dose ~ Uniform[$5K, $50K] of supplemental campaign spend per\n", + "state; true ATT linear in dose with slope=100 screening uptake per\n", + "$1K spend; outcome shifted by a baseline of 35 per 1,000 adults). The\n", + "new piece is the survey layer: a permutation-based assignment of\n", + "states to a 5 x 6 stratum x PSU grid (5 census-region x urbanicity\n", + "strata, 6 PSUs/stratum, 2 states/PSU = 60), per-state post-\n", + "stratification raking weights with CV ~ 0.30, and a PSU x period iid\n", + "shock injected into the outcome so cluster correlation **survives DiD\n", + "first-differencing**.\n", + "\n", + "That last point is load-bearing. A time-invariant per-PSU random\n", + "intercept cancels exactly under DiD: the (post - pre) average shifts\n", + "each state by a constant, and the slope on dose is unaffected.\n", + "`generate_survey_did_data` documents this directly\n", + "(`prep_dgp.py:1517-1523`): \"the time-invariant PSU RE cancels in the\n", + "treatment-vs-control time-difference and the cluster-robust / survey\n", + "SE would be smaller than naive OLS SE.\" The fix is a **PSU x period\n", + "interaction shock**: an iid draw per (psu, period) cell. T16's\n", + "reference helper carries the same composition (`psu_re_sd=2.0` paired\n", + "with `psu_period_factor=1.0`).\n", + "\n", + "Helper kept in-notebook so practitioners can copy it into their own\n", + "workflow:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e3ad9844", + "metadata": {}, + "outputs": [], + "source": [ + "def attach_brfss_survey_columns(\n", + " panel,\n", + " *,\n", + " seed,\n", + " n_strata=N_STRATA,\n", + " psu_per_stratum=PSU_PER_STRATUM,\n", + " states_per_psu=STATES_PER_PSU,\n", + " weight_cv=WEIGHT_CV_TARGET,\n", + " fpc_per_stratum=FPC_PER_STRATUM,\n", + " psu_period_shock_sd=PSU_PERIOD_SHOCK_SD,\n", + "):\n", + " \"\"\"Attach a BRFSS-shape survey design to a HAD panel.\n", + "\n", + " Adds: stratum (int), psu_id (int), weight (float), fpc (float).\n", + " Also injects a PSU x period shock into the outcome so cluster\n", + " correlation survives DiD differencing. Constant-within-state on\n", + " the four design columns.\n", + " \"\"\"\n", + " rng = np.random.default_rng(seed)\n", + " state_ids = np.sort(panel[\"state_id\"].unique())\n", + " n_states = len(state_ids)\n", + " n_psu = n_strata * psu_per_stratum\n", + " if n_states != n_psu * states_per_psu:\n", + " raise ValueError(\n", + " f\"state count {n_states} must equal n_strata*psu_per_stratum*\"\n", + " f\"states_per_psu = {n_psu * states_per_psu}\"\n", + " )\n", + " perm = rng.permutation(n_states)\n", + " psu_block = np.repeat(np.arange(n_psu), states_per_psu)\n", + " psu_of_state = psu_block[np.argsort(perm)]\n", + " stratum_of_state = psu_of_state // psu_per_stratum\n", + " base_per_stratum = np.array([0.8, 0.9, 1.0, 1.1, 1.3])\n", + " base_w = base_per_stratum[stratum_of_state]\n", + " sigma = np.sqrt(np.log(1 + weight_cv ** 2))\n", + " pert = rng.lognormal(mean=-0.5 * sigma ** 2, sigma=sigma, size=n_states)\n", + " w_per_state = base_w * pert\n", + " state_lookup = pd.DataFrame({\n", + " \"state_id\": state_ids,\n", + " \"stratum\": stratum_of_state.astype(np.int64),\n", + " \"psu_id\": psu_of_state.astype(np.int64),\n", + " \"weight\": w_per_state,\n", + " \"fpc\": float(fpc_per_stratum),\n", + " })\n", + " out = panel.merge(state_lookup, on=\"state_id\", how=\"left\")\n", + " n_periods_local = int(panel[\"week\"].max() - panel[\"week\"].min() + 1)\n", + " psu_period_shocks = rng.normal(0.0, psu_period_shock_sd, size=(n_psu, n_periods_local))\n", + " week_min = int(panel[\"week\"].min())\n", + " shock_lookup = pd.DataFrame([\n", + " {\"psu_id\": int(p), \"week\": int(w + week_min), \"psu_period_shock\": float(psu_period_shocks[p, w])}\n", + " for p in range(n_psu)\n", + " for w in range(n_periods_local)\n", + " ])\n", + " out = out.merge(shock_lookup, on=[\"psu_id\", \"week\"], how=\"left\")\n", + " out[\"screening_uptake\"] = out[\"screening_uptake\"] + out[\"psu_period_shock\"]\n", + " return out.drop(columns=[\"psu_period_shock\"])\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "00f75a66", + "metadata": {}, + "outputs": [], + "source": [ + "raw = generate_continuous_did_data(\n", + " n_units=N_UNITS,\n", + " n_periods=N_PERIODS,\n", + " cohort_periods=[COHORT_PERIOD],\n", + " never_treated_frac=0.0,\n", + " dose_distribution=\"uniform\",\n", + " dose_params={\"low\": DOSE_LOW, \"high\": DOSE_HIGH},\n", + " att_function=\"linear\",\n", + " att_intercept=0.0,\n", + " att_slope=TRUE_SLOPE,\n", + " unit_fe_sd=8.0,\n", + " time_trend=0.5,\n", + " noise_sd=2.0,\n", + " seed=MAIN_SEED,\n", + ")\n", + "panel = raw.copy()\n", + "panel.loc[panel[\"period\"] < panel[\"first_treat\"], \"dose\"] = 0.0\n", + "panel = panel.rename(columns={\n", + " \"unit\": \"state_id\",\n", + " \"period\": \"week\",\n", + " \"outcome\": \"screening_uptake\",\n", + " \"dose\": \"spend_k\",\n", + "})\n", + "panel[\"screening_uptake\"] = panel[\"screening_uptake\"] + BASELINE_OUTCOME\n", + "\n", + "panel = attach_brfss_survey_columns(panel, seed=SD_SEED)\n", + "\n", + "per_state = panel.groupby(\"state_id\").agg(\n", + " weight=(\"weight\", \"first\"),\n", + " stratum=(\"stratum\", \"first\"),\n", + " psu_id=(\"psu_id\", \"first\"),\n", + ").reset_index()\n", + "print(f\"states = {panel['state_id'].nunique()}\")\n", + "print(f\"strata = {panel['stratum'].nunique()} ({sorted(panel['stratum'].unique())})\")\n", + "print(f\"PSUs = {panel['psu_id'].nunique()}\")\n", + "print(f\"weight CV across states = {per_state['weight'].std() / per_state['weight'].mean():.3f}\")\n", + "print(f\"weight range = [{per_state['weight'].min():.3f}, {per_state['weight'].max():.3f}]\")\n", + "print(f\"FPC (per row, constant) = {panel['fpc'].iloc[0]:.0f}\")\n", + "panel.head()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2056e977", + "metadata": {}, + "outputs": [], + "source": [ + "sd = SurveyDesign(\n", + " weights=\"weight\",\n", + " strata=\"stratum\",\n", + " psu=\"psu_id\",\n", + " fpc=\"fpc\",\n", + ")\n", + "print(sd)\n" + ] + }, + { + "cell_type": "markdown", + "id": "326875a6", + "metadata": {}, + "source": [ + "## 3. Naive vs survey-aware headline fit\n", + "\n", + "T20's headline path collapses to two periods (pre-mean vs post-mean per\n", + "state) and fits HAD with `design=\"auto\"` - the heuristic lands on\n", + "`continuous_near_d_lower` (Design 1) on this dose support, with the\n", + "target estimand `WAS_d_lower` (Weighted Average Slope at the lower\n", + "boundary). T22 fits the same configuration twice: once naive (no\n", + "`survey_design` argument), once survey-aware\n", + "(`survey_design=sd`). The point estimate should be close in both - the\n", + "analytical local-linear at d_lower does not consume the survey weights\n", + "in the slope - but the SE and CI should differ.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d3d14a66", + "metadata": {}, + "outputs": [], + "source": [ + "p2 = panel.copy()\n", + "p2[\"period\"] = (p2[\"week\"] >= COHORT_PERIOD).astype(int) + 1\n", + "panel_2p = p2.groupby([\"state_id\", \"period\"], as_index=False).agg(\n", + " screening_uptake=(\"screening_uptake\", \"mean\"),\n", + " spend_k=(\"spend_k\", \"mean\"),\n", + " stratum=(\"stratum\", \"first\"),\n", + " psu_id=(\"psu_id\", \"first\"),\n", + " weight=(\"weight\", \"first\"),\n", + " fpc=(\"fpc\", \"first\"),\n", + ")\n", + "panel_2p.head()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e875f23", + "metadata": {}, + "outputs": [], + "source": [ + "with warnings.catch_warnings():\n", + " warnings.filterwarnings(\"ignore\", category=UserWarning)\n", + " naive = HAD(design=\"auto\").fit(\n", + " panel_2p,\n", + " outcome_col=\"screening_uptake\",\n", + " dose_col=\"spend_k\",\n", + " time_col=\"period\",\n", + " unit_col=\"state_id\",\n", + " )\n", + " survey = HAD(design=\"auto\").fit(\n", + " panel_2p,\n", + " outcome_col=\"screening_uptake\",\n", + " dose_col=\"spend_k\",\n", + " time_col=\"period\",\n", + " unit_col=\"state_id\",\n", + " survey_design=sd,\n", + " )\n", + "print(f\"design auto-detected: naive={naive.design}, survey={survey.design}\")\n", + "print(f\"target parameter : naive={naive.target_parameter}, survey={survey.target_parameter}\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1c036649", + "metadata": {}, + "outputs": [], + "source": [ + "def _row(name, r):\n", + " lo, hi = r.conf_int\n", + " return {\n", + " \"fit\": name,\n", + " \"ATT\": round(float(r.att), 3),\n", + " \"SE\": round(float(r.se), 3),\n", + " \"CI low\": round(float(lo), 3),\n", + " \"CI high\": round(float(hi), 3),\n", + " \"CI width\": round(float(hi - lo), 3),\n", + " \"covers truth (slope=100)\": \"YES\" if lo <= TRUE_SLOPE <= hi else \"NO\",\n", + " }\n", + "\n", + "comparison = pd.DataFrame([_row(\"naive\", naive), _row(\"survey\", survey)])\n", + "print(comparison.to_string(index=False))\n", + "print()\n", + "print(f\"se_ratio (survey / naive) = {survey.se / naive.se:.3f}\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "8a3a5aa6", + "metadata": {}, + "source": [ + "**Reading the table.** Both fits land on the same identification\n", + "path (`continuous_near_d_lower`), report point estimates very close to\n", + "the true slope of 100, and produce CIs that cover the truth. The\n", + "survey SE is **larger** than the naive SE (the design machinery picks\n", + "up the cluster correlation from the PSU x period shock). The\n", + "inflation is modest - around 1.10x - which is smaller than the\n", + "inflation a similar `weight_cv` and PSU shock would produce on, say,\n", + "a CallawaySantAnna ATT. That's a feature of HAD's local-linear\n", + "estimand, not a bug. We unpack it in the next section.\n" + ] + }, + { + "cell_type": "markdown", + "id": "637dd580", + "metadata": {}, + "source": [ + "## 4. Why the SE inflation is modest for HAD\n", + "\n", + "The HAD `WAS_d_lower` estimand is a weighted average of\n", + "unit-specific slopes in a **local-linear neighborhood at d_lower**.\n", + "The variance reads off the influence functions of the units in that\n", + "neighborhood, not from the full panel. With dose ~ Uniform[5, 50] and\n", + "60 states, only a handful of states sit close to d_lower ~ 5 - and\n", + "those are the units whose IFs dominate `Var(WAS_d_lower)`. The\n", + "PSU-level cluster correlation can amplify the variance only as much\n", + "as those few units are correlated with PSU-mates. With 2 states/PSU\n", + "and only a small share of states near the boundary, the within-PSU\n", + "correlation has a small lever to act on.\n", + "\n", + "Contrast with the event-study path: the post-period horizons aggregate\n", + "across **all post-period observations**, so the variance is read off\n", + "the full panel and the survey-design machinery has more cluster\n", + "correlation to amplify. The per-horizon SE ratio should be larger\n", + "than the overall SE ratio:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e4b6ec75", + "metadata": {}, + "outputs": [], + "source": [ + "with warnings.catch_warnings():\n", + " warnings.filterwarnings(\"ignore\", category=UserWarning)\n", + " naive_es = HAD(design=\"auto\").fit(\n", + " panel,\n", + " outcome_col=\"screening_uptake\",\n", + " dose_col=\"spend_k\",\n", + " time_col=\"week\",\n", + " unit_col=\"state_id\",\n", + " first_treat_col=\"first_treat\",\n", + " aggregate=\"event_study\",\n", + " )\n", + " survey_es_for_ratio = HAD(design=\"auto\").fit(\n", + " panel,\n", + " outcome_col=\"screening_uptake\",\n", + " dose_col=\"spend_k\",\n", + " time_col=\"week\",\n", + " unit_col=\"state_id\",\n", + " first_treat_col=\"first_treat\",\n", + " aggregate=\"event_study\",\n", + " survey_design=sd,\n", + " )\n", + "\n", + "ratios = pd.DataFrame({\n", + " \"event_time\": list(naive_es.event_times),\n", + " \"naive_se\": [round(float(s), 3) for s in naive_es.se],\n", + " \"survey_se\": [round(float(s), 3) for s in survey_es_for_ratio.se],\n", + " \"se_ratio\": [round(float(s_s / s_n), 3) for s_n, s_s in zip(naive_es.se, survey_es_for_ratio.se)],\n", + "})\n", + "print(ratios.to_string(index=False))\n", + "print()\n", + "print(f\"overall SE ratio (from section 3) = {survey.se / naive.se:.3f}\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "1200aaee", + "metadata": {}, + "source": [ + "The per-horizon table reflects the IF-spread argument: pre-period\n", + "horizons (where the dose is zero across the board, so the local-linear\n", + "at d_lower has nothing to localize to) read SE off mostly the unit FE\n", + "plus the PSU shock variance, and the survey machinery has more\n", + "correlation to amplify. Post-period horizons aggregate across all\n", + "states' post-period contributions and produce a moderate inflation as\n", + "well. The takeaway for the practitioner: **for HAD specifically, do\n", + "not eyeball SE inflation against your DEFF expectation from\n", + "regression-coefficient examples**. Read the design-aware SE off the\n", + "fit object and report it; the magnitude of the survey-vs-naive ratio\n", + "is determined by the IF-weighting of your particular estimand.\n" + ] + }, + { + "cell_type": "markdown", + "id": "fc9b05ca", + "metadata": {}, + "source": [ + "## 5. Event-study under the survey design\n", + "\n", + "Refit with `aggregate=\"event_study\"` and `cband=True` to get\n", + "per-horizon ATT estimates plus a sup-t confidence band that adjusts\n", + "for the multiple-horizon comparison. The cband is computed via a\n", + "multiplier bootstrap that aggregates per-PSU IF tensor under the\n", + "survey design (Phase 4.5 C composition).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c35c6e78", + "metadata": {}, + "outputs": [], + "source": [ + "with warnings.catch_warnings():\n", + " warnings.filterwarnings(\"ignore\", category=UserWarning)\n", + " es = HAD(design=\"auto\").fit(\n", + " panel,\n", + " outcome_col=\"screening_uptake\",\n", + " dose_col=\"spend_k\",\n", + " time_col=\"week\",\n", + " unit_col=\"state_id\",\n", + " first_treat_col=\"first_treat\",\n", + " aggregate=\"event_study\",\n", + " survey_design=sd,\n", + " cband=True,\n", + " )\n", + "\n", + "es_table = pd.DataFrame({\n", + " \"event_time\": list(es.event_times),\n", + " \"att\": [round(float(a), 2) for a in es.att],\n", + " \"se\": [round(float(s), 3) for s in es.se],\n", + " \"ci_low\": [round(float(c), 2) for c in es.conf_int_low],\n", + " \"ci_high\": [round(float(c), 2) for c in es.conf_int_high],\n", + " \"cband_low\": [round(float(c), 2) for c in es.cband_low],\n", + " \"cband_high\": [round(float(c), 2) for c in es.cband_high],\n", + "})\n", + "print(es_table.to_string(index=False))\n", + "print()\n", + "print(f\"cband critical value = {es.cband_crit_value:.4f} ({es.cband_method})\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "450c47b4", + "metadata": {}, + "outputs": [], + "source": [ + "if HAS_MATPLOTLIB:\n", + " fig, ax = plt.subplots(figsize=(8, 4.2))\n", + " et = np.asarray(es.event_times, dtype=float)\n", + " ax.axhline(0.0, color=\"black\", linewidth=0.5)\n", + " ax.axhline(TRUE_SLOPE, color=\"gray\", linewidth=0.6, linestyle=\":\", label=\"true slope = 100\")\n", + " ax.fill_between(et, np.asarray(es.cband_low), np.asarray(es.cband_high), alpha=0.15, label=\"sup-t cband (survey)\")\n", + " ax.errorbar(et, np.asarray(es.att), yerr=1.96 * np.asarray(es.se), fmt=\"o\", color=\"#1f77b4\", capsize=3, label=\"point + pointwise CI\")\n", + " ax.axvline(-0.5, color=\"red\", linewidth=0.6, linestyle=\"--\")\n", + " ax.set_xlabel(\"event time (weeks since campaign launch)\")\n", + " ax.set_ylabel(\"per-$1K WAS-d_lower\")\n", + " ax.set_title(\"HAD event-study under SurveyDesign(weights, strata, psu, fpc)\")\n", + " ax.legend(loc=\"center right\", fontsize=8)\n", + " fig.tight_layout()\n", + " plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "id": "c7eff930", + "metadata": {}, + "source": [ + "**Reading the event-study.** Pre-launch horizons (e in\n", + "{-4, -3, -2}) cover zero - no pre-trends - and the post-launch\n", + "horizons (e in {0, 1, 2, 3}) sit on the true per-$1K slope of 100\n", + "with both pointwise and sup-t coverage. The sup-t band is a few\n", + "percent wider than the pointwise interval; that gap is the multiple-\n", + "horizon multiplicity correction. Per-horizon SEs are noticeably\n", + "larger post-launch than pre - the post-period dose variation is what\n", + "drives variance under HAD's identification.\n" + ] + }, + { + "cell_type": "markdown", + "id": "95252100", + "metadata": {}, + "source": [ + "## 6. Pretest workflow under the survey design\n", + "\n", + "`did_had_pretest_workflow` runs the three-step diagnostic battery\n", + "(QUG; Stute or joint Stute; Yatchew) and composes a single verdict\n", + "text. Under `survey_design`, the QUG step is **permanently deferred**\n", + "(Phase 4.5 C0); the workflow returns `report.qug=None` and inserts a\n", + "suffix on the verdict string flagging the deferral. The Stute and\n", + "Yatchew steps run normally, with the Stute multiplier bootstrap\n", + "operating under the now-supported stratified-clustered wild\n", + "construction (lifted by PR #432).\n", + "\n", + "**Overall path** (two-period collapse):\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1dc9e20e", + "metadata": {}, + "outputs": [], + "source": [ + "with warnings.catch_warnings():\n", + " warnings.filterwarnings(\"ignore\", category=UserWarning)\n", + " overall_report = did_had_pretest_workflow(\n", + " panel_2p,\n", + " outcome_col=\"screening_uptake\",\n", + " dose_col=\"spend_k\",\n", + " time_col=\"period\",\n", + " unit_col=\"state_id\",\n", + " survey_design=sd,\n", + " aggregate=\"overall\",\n", + " n_bootstrap=N_BOOTSTRAP,\n", + " seed=WORKFLOW_SEED,\n", + " )\n", + "print(\"verdict :\", overall_report.verdict)\n", + "print(\"all_pass:\", overall_report.all_pass)\n", + "print(\"qug :\", overall_report.qug)\n", + "print()\n", + "print(overall_report.summary())\n" + ] + }, + { + "cell_type": "markdown", + "id": "dbcaf022", + "metadata": {}, + "source": [ + "The verdict ends in\n", + "`(linearity-conditional verdict; QUG-under-survey deferred per Phase\n", + "4.5 C0)` - that suffix is **the** caveat the workflow owes the\n", + "practitioner under survey: read the linearity verdict knowing the\n", + "boundary-presence diagnostic was not run. `report.qug` is `None`.\n", + "The Stute and Yatchew rows in the formatted summary populate as\n", + "usual.\n", + "\n", + "**Event-study path** (full panel + joint diagnostics):\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "94b0abb9", + "metadata": {}, + "outputs": [], + "source": [ + "with warnings.catch_warnings():\n", + " warnings.filterwarnings(\"ignore\", category=UserWarning)\n", + " es_report = did_had_pretest_workflow(\n", + " panel,\n", + " outcome_col=\"screening_uptake\",\n", + " dose_col=\"spend_k\",\n", + " time_col=\"week\",\n", + " unit_col=\"state_id\",\n", + " first_treat_col=\"first_treat\",\n", + " survey_design=sd,\n", + " aggregate=\"event_study\",\n", + " n_bootstrap=N_BOOTSTRAP,\n", + " seed=WORKFLOW_SEED,\n", + " )\n", + "print(\"verdict :\", es_report.verdict)\n", + "print(\"all_pass :\", es_report.all_pass)\n", + "print()\n", + "print(\"pretrends_joint :\", es_report.pretrends_joint)\n", + "print(\"homogeneity_joint:\", es_report.homogeneity_joint)\n", + "print()\n", + "print(es_report.summary())\n" + ] + }, + { + "cell_type": "markdown", + "id": "211fab22", + "metadata": {}, + "source": [ + "The event-study `verdict` ends in the same C0 suffix; the\n", + "formatted `summary()` block additionally renders a `(QUG step skipped\n", + "- permanently deferred under survey/weights per Phase 4.5 C0)` note\n", + "in place of the normal QUG row. `pretrends_joint` runs across the\n", + "three pre-launch horizons (1, 2, 3) and `homogeneity_joint` runs\n", + "across the four post-launch horizons (5, 6, 7, 8); both fail-to-\n", + "reject under the linear-DGP null.\n", + "\n", + "The Stute family on the survey path is a **documented synthesis** of\n", + "ingredients from several papers, not a direct port: cluster-level\n", + "multipliers (Cameron-Gelbach-Miller 2008), wild-bootstrap centering\n", + "(Davidson-Flachaire 2008), within-stratum demeaning + Bessel rescale\n", + "(Wu 1986; Liu 1988; Kreiss-Lahiri 2012), and the cluster-wild\n", + "consistency of nonlinear functionals (Djogbenou-MacKinnon-Nielsen\n", + "2019), composed for the Hlavka-Huskova 2020 Stute residual\n", + "bootstrap. No single paper covers the exact composition for HAD\n", + "under stratified PSU sampling; the library's `apply_stratum_centering`\n", + "helper (`bootstrap_utils.py:657-786`) is what makes the composition\n", + "turn-key. See `docs/methodology/REGISTRY.md` § HAD - Stute\n", + "stratified survey-bootstrap calibration for the derivation.\n" + ] + }, + { + "cell_type": "markdown", + "id": "a40a113e", + "metadata": {}, + "source": [ + "## 7. Communicating the design-based verdict to leadership\n", + "\n", + "> **For the executive.** The campaign moved screening uptake by\n", + "> **about 100 per 1,000 adults per $1K of supplemental spend**, with\n", + "> a 95% confidence interval that comfortably excludes zero. The\n", + "> survey-design analysis - which accounts for the way states were\n", + "> sampled within strata and clustered into PSUs - lands on the same\n", + "> bottom line as the simpler analysis, but with a slightly wider\n", + "> confidence interval. Both intervals cover the methodological\n", + "> ground truth (slope = 100). Pre-launch placebo periods showed no\n", + "> drift; the per-week post-launch effect is consistent across the\n", + "> rollout.\n", + "\n", + "> **For the methodologist.** The HAD pretest workflow runs all three\n", + "> linearity diagnostics (QUG step deferred under survey/weights per\n", + "> the Phase 4.5 C0 deferral; this is the load-bearing caveat we owe\n", + "> the audience). Joint pre-trends and joint homogeneity Stute\n", + "> diagnostics fail-to-reject across the three pre- and four post-\n", + "> horizons. Yatchew-HR fails-to-reject under both null modes. The\n", + "> verdict reads \"Stute and Yatchew linearity diagnostics fail-to-\n", + "> reject (linearity-conditional verdict; QUG-under-survey deferred\n", + "> per Phase 4.5 C0)\". The design-based SE is ~10% larger than the\n", + "> naive SE - smaller than the inflation a CS or LinearRegression\n", + "> coefficient would see at this PSU correlation, because HAD's\n", + "> WAS-d_lower estimand is a local-linear at d_lower (variance is\n", + "> dominated by the few states near the boundary, not by the full\n", + "> panel; see section 4).\n" + ] + }, + { + "cell_type": "markdown", + "id": "d79ad79a", + "metadata": {}, + "source": [ + "## 8. Extensions + summary checklist\n", + "\n", + "**Extensions to keep on the radar:**\n", + "\n", + "- **`lonely_psu='adjust'` + singleton strata** still raises\n", + " `NotImplementedError` for the Stute family on the survey path\n", + " (`had.py:2139`, `had_pretests.py:1927`). The pseudo-stratum\n", + " centering transform that would make this case work is a separate\n", + " derivation from PR #432; same gap as the HAD sup-t deviation\n", + " documented at REGISTRY:2382. Stay with `lonely_psu='remove'` or\n", + " `'certainty'` for now.\n", + "- **Replicate-weight designs** (BRR / Fay / JK1 / JKn / SDR) are\n", + " defensively rejected at every survey-aware HAD entry point. The\n", + " Rao-Wu / JKn bootstrap composition is a separate slot of work.\n", + " T22 stays inside the FPC / Mammen-multiplier path; full replicate-\n", + " weight HAD is a follow-up tutorial.\n", + "- **QUG under survey / weights is permanently deferred** (Phase 4.5\n", + " C0). The smooth functionals route through the Stute family per\n", + " Krieger-Pfeffermann (1997); QUG's extreme order statistic does\n", + " not. There is no intended migration path.\n", + "- **Trends-linear refit** under the survey design is also currently\n", + " rejected by `HeterogeneousAdoptionDiD.fit`; the trends_lin +\n", + " survey_design composition is open as a separate slot.\n", + "\n", + "**What you can ship to leadership:**\n", + "\n", + "- Single design-aware ATT with a survey-design CI (section 3, 5).\n", + "- Verdict text that flags the C0 deferral verbatim - do not summarize\n", + " it away (section 6).\n", + "- Per-horizon ATT and sup-t cband when you want a rollout chart\n", + " (section 5).\n", + "- A side note on the modest SE inflation magnitude (section 4) so the\n", + " audience does not over-correct against the simpler analysis.\n", + "- The workflow's `report.summary()` block as the methodologist-facing\n", + " artifact; `report.verdict` as the executive-facing artifact.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/tutorials/README.md b/docs/tutorials/README.md index a7b495bb..73f965c9 100644 --- a/docs/tutorials/README.md +++ b/docs/tutorials/README.md @@ -111,6 +111,14 @@ Composite pre-test walkthrough for `HeterogeneousAdoptionDiD`, building on Tutor - Side panel comparing `yatchew_hr_test` `null="linearity"` (default, paper Theorem 7) vs `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`) - Companion drift-test file (`tests/test_t21_had_pretest_workflow_drift.py`) +### 22. HAD Survey-Weighted Workflow (`22_had_survey_design.ipynb`) +End-to-end HAD walkthrough on a BRFSS-shape stratified survey design (5 strata x 6 PSUs/stratum x 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum). Demonstrates the `SurveyDesign(strata=...)` path through the Stute pretest family that PR #432 (2026-05-14) unblocked: +- Naive vs survey-aware HAD headline fit on a two-period collapse, with side-by-side ATT / SE / CI table +- Why the SE inflation is modest for HAD (local-linear at d_lower IF concentration vs full-panel regression coefficients) +- Event-study fit with sup-t cband under the survey design +- `did_had_pretest_workflow` on both overall and event-study paths under `survey_design=`, walking the Phase 4.5 C0 QUG-deferred verdict suffix and the stratified-clustered Stute multiplier bootstrap +- Companion drift-test file (`tests/test_t22_had_survey_design_drift.py`) + ## Running the Notebooks 1. Install diff-diff with dependencies: diff --git a/tests/test_t22_had_survey_design_drift.py b/tests/test_t22_had_survey_design_drift.py new file mode 100644 index 00000000..53062efa --- /dev/null +++ b/tests/test_t22_had_survey_design_drift.py @@ -0,0 +1,556 @@ +"""Drift detection for Tutorial 22 (`docs/tutorials/22_had_survey_design.ipynb`). + +The tutorial narrative quotes seed-specific numbers (panel composition, +naive vs. survey-design SE inflation, event-study cband behavior, pretest +workflow verdicts under SurveyDesign(strata=...), QUG-under-survey +deferral substring). If library numerics drift (estimator changes, RNG +path changes, BLAS path changes), the prose can go stale silently while +``pytest --nbmake`` still passes - it only checks that the cells execute +without error. + +These asserts re-derive the same numbers using the locked T22 DGP and +seeds the notebook uses, then check them against the values quoted in +the tutorial markdown. If a future change moves any number outside its +tolerance band, this test fails and a maintainer is forced to either +update the prose or investigate the methodology shift before merge. + +T22 is the third tutorial in the HAD series (after T20 headline and T21 +pretest workflow). It demonstrates the now-fully-supported survey-design +path through ``HeterogeneousAdoptionDiD`` and ``did_had_pretest_workflow``, +unblocked by PR #432 (2026-05-14, merge ``d5e5021f``) which lifted the +``NotImplementedError`` gate on ``SurveyDesign(strata=...)`` for the +Stute family. T22's DGP layers a BRFSS-shape survey design (5 strata x 6 +PSUs/stratum x 2 states/PSU = 60 states; weights ~ post-stratification +raking with CV ~ 0.30; FPC = 30 PSUs/stratum) onto the same +continuous-dose HAD panel shape T20 uses (Design 1, dose ~ Uniform[$5K, +$50K], att_slope=100). DGP and seed locked at ``_scratch/t22/dev.py``. + +**Bootstrap p-value pins use abs tolerance bands >= 0.25** per +``feedback_bootstrap_drift_tests_need_backend_tolerance`` and +``feedback_strata_bootstrap_path_divergence``. Stratified Mammen +multiplier paths (PR #432) reduce effective dofs vs non-strata; PR #432 +commit ``aef07020`` already had to relax bit-equality bands on this +code path. Deterministic statistics (Yatchew sigma2_*, t_hr, design +auto-detection, horizon labels, panel composition, weight CV under +locked numpy default_rng) get exact pins. + +**Verdict-substring discipline** — both the overall and the event-study +``HADPretestReport.verdict`` terminate in ``_QUG_DEFERRED_SUFFIX`` +(``had_pretests.py:4300``: ``" (linearity-conditional verdict; +QUG-under-survey deferred per Phase 4.5 C0)"``). The DIFFERENT message +at ``had_pretests.py:736`` (``"(QUG step skipped - permanently deferred +under survey/weights per Phase 4.5 C0)"``) is rendered in the +formatted ``report.summary()`` block, NOT in ``report.verdict``. Tests +lock each substring on the correct field. +""" + +from __future__ import annotations + +import warnings + +import numpy as np +import pandas as pd +import pytest + +from diff_diff import ( + HAD, + SurveyDesign, + did_had_pretest_workflow, + generate_continuous_did_data, +) + +# Locked T22 DGP parameters (must stay in sync with the notebook). +MAIN_SEED = 87 +N_UNITS = 60 +N_PERIODS = 8 +COHORT_PERIOD = 5 +TRUE_SLOPE = 100.0 +BASELINE_OUTCOME = 35.0 +DOSE_LOW = 5.0 +DOSE_HIGH = 50.0 + +# Survey-design layer (in-notebook helper). +N_STRATA = 5 +PSU_PER_STRATUM = 6 +STATES_PER_PSU = 2 +WEIGHT_CV_TARGET = 0.30 +FPC_PER_STRATUM = 30 +PSU_PERIOD_SHOCK_SD = 1.5 # See plan Risk 1; HAD WAS IF concentration caps inflation ~1.25 +SD_SEED = 87 + +# Pretest workflow. +WORKFLOW_SEED = 22 +N_BOOTSTRAP = 999 + +# Substrings the notebook prose depends on. +QUG_DEFERRED_SUFFIX_VERDICT = ( + "linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0" +) +QUG_SKIP_SUMMARY_NOTE = ( + "QUG step skipped - permanently deferred under survey/weights per Phase 4.5 C0" +) + + +def _attach_brfss_survey_columns( + panel: pd.DataFrame, + *, + seed: int, + n_strata: int = N_STRATA, + psu_per_stratum: int = PSU_PER_STRATUM, + states_per_psu: int = STATES_PER_PSU, + weight_cv: float = WEIGHT_CV_TARGET, + fpc_per_stratum: int = FPC_PER_STRATUM, + psu_period_shock_sd: float = PSU_PERIOD_SHOCK_SD, +) -> pd.DataFrame: + """Drift-test-local copy of T22's in-notebook survey-attach helper. + + Lives here (not as a library helper) because it is a TUTORIAL element + demonstrating how a practitioner attaches survey design to their HAD + panel. The drift test inlines it so changes to the notebook helper + do not silently change the locked numerical anchors. If this helper + diverges from the notebook helper, the panel-composition tests + (weight CV, stratum/PSU counts) catch the drift. + """ + rng = np.random.default_rng(seed) + state_ids = np.sort(panel["state_id"].unique()) + n_states = len(state_ids) + n_psu = n_strata * psu_per_stratum + if n_states != n_psu * states_per_psu: + raise ValueError( + f"state count {n_states} must equal n_strata*psu_per_stratum*" + f"states_per_psu = {n_psu * states_per_psu}" + ) + perm = rng.permutation(n_states) + psu_block = np.repeat(np.arange(n_psu), states_per_psu) + psu_of_state = psu_block[np.argsort(perm)] + stratum_of_state = psu_of_state // psu_per_stratum + base_per_stratum = np.array([0.8, 0.9, 1.0, 1.1, 1.3]) + base_w = base_per_stratum[stratum_of_state] + sigma = np.sqrt(np.log(1 + weight_cv**2)) + pert = rng.lognormal(mean=-0.5 * sigma**2, sigma=sigma, size=n_states) + w_per_state = base_w * pert + state_lookup = pd.DataFrame( + { + "state_id": state_ids, + "stratum": stratum_of_state.astype(np.int64), + "psu_id": psu_of_state.astype(np.int64), + "weight": w_per_state, + "fpc": float(fpc_per_stratum), + } + ) + panel_attached = panel.merge(state_lookup, on="state_id", how="left") + n_periods = int(panel["week"].max() - panel["week"].min() + 1) + psu_period_shocks = rng.normal(0.0, psu_period_shock_sd, size=(n_psu, n_periods)) + week_min = int(panel["week"].min()) + shock_lookup = pd.DataFrame( + [ + { + "psu_id": int(p), + "week": int(w + week_min), + "psu_period_shock": float(psu_period_shocks[p, w]), + } + for p in range(n_psu) + for w in range(n_periods) + ] + ) + panel_attached = panel_attached.merge(shock_lookup, on=["psu_id", "week"], how="left") + panel_attached["screening_uptake"] = ( + panel_attached["screening_uptake"] + panel_attached["psu_period_shock"] + ) + return panel_attached.drop(columns=["psu_period_shock"]) + + +@pytest.fixture(scope="module") +def panel() -> pd.DataFrame: + raw = generate_continuous_did_data( + n_units=N_UNITS, + n_periods=N_PERIODS, + cohort_periods=[COHORT_PERIOD], + never_treated_frac=0.0, + dose_distribution="uniform", + dose_params={"low": DOSE_LOW, "high": DOSE_HIGH}, + att_function="linear", + att_intercept=0.0, + att_slope=TRUE_SLOPE, + unit_fe_sd=8.0, + time_trend=0.5, + noise_sd=2.0, + seed=MAIN_SEED, + ) + p = raw.copy() + p.loc[p["period"] < p["first_treat"], "dose"] = 0.0 + p = p.rename( + columns={ + "unit": "state_id", + "period": "week", + "outcome": "screening_uptake", + "dose": "spend_k", + } + ) + p["screening_uptake"] = p["screening_uptake"] + BASELINE_OUTCOME + return _attach_brfss_survey_columns(p, seed=SD_SEED) + + +@pytest.fixture(scope="module") +def panel_2p(panel: pd.DataFrame) -> pd.DataFrame: + p = panel.copy() + p["period"] = (p["week"] >= COHORT_PERIOD).astype(int) + 1 + collapsed = p.groupby(["state_id", "period"], as_index=False).agg( + screening_uptake=("screening_uptake", "mean"), + spend_k=("spend_k", "mean"), + stratum=("stratum", "first"), + psu_id=("psu_id", "first"), + weight=("weight", "first"), + fpc=("fpc", "first"), + ) + return pd.DataFrame(collapsed) + + +@pytest.fixture(scope="module") +def survey_design() -> SurveyDesign: + return SurveyDesign(weights="weight", strata="stratum", psu="psu_id", fpc="fpc") + + +@pytest.fixture(scope="module") +def naive_overall_result(panel_2p: pd.DataFrame): + with warnings.catch_warnings(): + warnings.filterwarnings("ignore", category=UserWarning) + return HAD(design="auto").fit( + panel_2p, + outcome_col="screening_uptake", + dose_col="spend_k", + time_col="period", + unit_col="state_id", + ) + + +@pytest.fixture(scope="module") +def survey_overall_result(panel_2p: pd.DataFrame, survey_design: SurveyDesign): + with warnings.catch_warnings(): + warnings.filterwarnings("ignore", category=UserWarning) + return HAD(design="auto").fit( + panel_2p, + outcome_col="screening_uptake", + dose_col="spend_k", + time_col="period", + unit_col="state_id", + survey_design=survey_design, + ) + + +@pytest.fixture(scope="module") +def survey_event_study_result(panel: pd.DataFrame, survey_design: SurveyDesign): + with warnings.catch_warnings(): + warnings.filterwarnings("ignore", category=UserWarning) + return HAD(design="auto").fit( + panel, + outcome_col="screening_uptake", + dose_col="spend_k", + time_col="week", + unit_col="state_id", + first_treat_col="first_treat", + aggregate="event_study", + survey_design=survey_design, + cband=True, + ) + + +@pytest.fixture(scope="module") +def overall_report(panel_2p: pd.DataFrame, survey_design: SurveyDesign): + with warnings.catch_warnings(): + warnings.filterwarnings("ignore", category=UserWarning) + return did_had_pretest_workflow( + panel_2p, + outcome_col="screening_uptake", + dose_col="spend_k", + time_col="period", + unit_col="state_id", + survey_design=survey_design, + aggregate="overall", + n_bootstrap=N_BOOTSTRAP, + seed=WORKFLOW_SEED, + ) + + +@pytest.fixture(scope="module") +def event_study_report(panel: pd.DataFrame, survey_design: SurveyDesign): + with warnings.catch_warnings(): + warnings.filterwarnings("ignore", category=UserWarning) + return did_had_pretest_workflow( + panel, + outcome_col="screening_uptake", + dose_col="spend_k", + time_col="week", + unit_col="state_id", + first_treat_col="first_treat", + survey_design=survey_design, + aggregate="event_study", + n_bootstrap=N_BOOTSTRAP, + seed=WORKFLOW_SEED, + ) + + +# ============================================================================ +# Group A — Panel & survey composition (deterministic exact pins) +# ============================================================================ + + +def test_panel_matches_t22_locked_dgp(panel: pd.DataFrame): + """Locks panel size, columns, and the dose/outcome ranges quoted in §2. + + If T22 mutates the locked DGP parameters, panel shape/columns shift + and this test fails before downstream numerical pins fail.""" + assert panel.shape == (480, 10), panel.shape + assert set(panel.columns) >= { + "state_id", + "week", + "screening_uptake", + "first_treat", + "spend_k", + "stratum", + "psu_id", + "weight", + "fpc", + }, panel.columns.tolist() + assert panel["state_id"].nunique() == N_UNITS + assert panel["week"].min() == 1 + assert panel["week"].max() == N_PERIODS + post_dose = panel.loc[panel["week"] >= COHORT_PERIOD, "spend_k"] + assert DOSE_LOW <= post_dose.min() <= post_dose.max() <= DOSE_HIGH + + +def test_survey_design_attachment_shape(panel: pd.DataFrame): + """Locks the BRFSS-shape design dimensions narrated in §2.""" + assert panel["stratum"].nunique() == N_STRATA + assert panel["psu_id"].nunique() == N_STRATA * PSU_PER_STRATUM + psu_state_count = panel.groupby("psu_id")["state_id"].nunique() + assert (psu_state_count == STATES_PER_PSU).all(), psu_state_count.value_counts().to_dict() + + +def test_survey_weight_cv_in_band(panel: pd.DataFrame): + """Locks the ~0.30 weight CV at seed=87. Bit-stable under numpy + default_rng; if numpy upgrades change RNG semantics this catches it.""" + weights_per_state = panel.groupby("state_id")["weight"].first() + cv = float(weights_per_state.std() / weights_per_state.mean()) + assert abs(cv - 0.327) < 0.03, cv + + +def test_survey_design_fpc_constant_per_stratum(panel: pd.DataFrame): + """FPC is scalar-per-stratum (==30) in the helper; the SurveyDesign + object treats fpc as a column, so per-row constancy is the contract.""" + assert panel["fpc"].nunique() == 1 + assert float(panel["fpc"].iloc[0]) == FPC_PER_STRATUM + + +def test_panel_constant_within_state_invariant(panel: pd.DataFrame): + """HAD per-unit aggregation requires weight/stratum/psu_id/fpc to be + constant within state; resolve_survey_design enforces this. If the + helper accidentally injects per-period variation in these columns, + HAD fails downstream — test catches the helper-side bug first.""" + for col in ("weight", "stratum", "psu_id", "fpc"): + per_state_unique = panel.groupby("state_id")[col].nunique() + assert per_state_unique.max() == 1, (col, per_state_unique.value_counts().to_dict()) + + +# ============================================================================ +# Group B — Naive vs survey-aware headline fit +# ============================================================================ + + +def test_naive_overall_design_auto_continuous_near_d_lower(naive_overall_result): + """Locks T22's design auto-detection: at dose ~ Uniform[5, 50], + `d.min() / median(|d|) > 0.01`, so the heuristic resolves to + ``continuous_near_d_lower`` (Design 1) targeting WAS_d_lower.""" + assert naive_overall_result.design == "continuous_near_d_lower" + assert naive_overall_result.target_parameter == "WAS_d_lower" + + +def test_survey_overall_design_auto_continuous_near_d_lower(survey_overall_result): + """Survey path picks the same design — design auto-detection is + sample-based and does not consume survey weights.""" + assert survey_overall_result.design == "continuous_near_d_lower" + assert survey_overall_result.target_parameter == "WAS_d_lower" + + +def test_survey_att_close_to_truth(survey_overall_result): + """Survey-aware HAD recovers slope=100 within analytical noise on + this DGP. Tight pin (round to int) — the local-linear estimator is + analytical, no Rust RNG path.""" + assert round(survey_overall_result.att, 0) == 100, survey_overall_result.att + + +def test_survey_se_strictly_inflated_vs_naive(naive_overall_result, survey_overall_result): + """Sign-only structural anchor: survey SE > naive SE on this DGP. + The magnitude of inflation is modest (~10%) because HAD's WAS_d_lower + has IF concentrated near d_lower (few units), capping how much the + PSU x period shock injection can amplify cluster correlation. The + sign holds robustly across reasonable shock SDs (per dev script + sweep).""" + assert survey_overall_result.se > naive_overall_result.se, ( + naive_overall_result.se, + survey_overall_result.se, + ) + + +def test_survey_ci_covers_truth(survey_overall_result): + """Survey-aware CI covers the true slope=100.""" + lo, hi = survey_overall_result.conf_int + assert lo <= TRUE_SLOPE <= hi, (lo, hi) + + +# ============================================================================ +# Group C — Event-study under survey +# ============================================================================ + + +def test_event_study_horizons_complete(survey_event_study_result): + """Locks the same horizon set T20 produces on this DGP shape.""" + horizons = list(survey_event_study_result.event_times) + assert horizons == [-4, -3, -2, 0, 1, 2, 3], horizons + + +def test_event_study_post_horizons_cover_truth_under_survey(survey_event_study_result): + """All four post-launch horizons cover the true slope=100 under + survey-aware pointwise CIs.""" + es = survey_event_study_result + post_mask = np.asarray(es.event_times) >= 0 + lows = np.asarray(es.conf_int_low)[post_mask] + highs = np.asarray(es.conf_int_high)[post_mask] + for lo, hi in zip(lows, highs): + assert lo <= TRUE_SLOPE <= hi, (lo, hi) + + +def test_event_study_pre_horizons_cover_zero_under_survey(survey_event_study_result): + """Pre-launch placebo horizons cover zero under survey-aware + pointwise CIs (no pre-trends in this DGP).""" + es = survey_event_study_result + pre_mask = np.asarray(es.event_times) < 0 + lows = np.asarray(es.conf_int_low)[pre_mask] + highs = np.asarray(es.conf_int_high)[pre_mask] + for lo, hi in zip(lows, highs): + assert lo <= 0.0 <= hi, (lo, hi) + + +def test_event_study_cband_is_wider_or_equal_pointwise(survey_event_study_result): + """sup-t band is at least as wide as pointwise per horizon (cross- + horizon multiplicity correction; never tighter than pointwise). + Locks that ``cband_low`` and ``cband_high`` are populated and + obey the cross-horizon multiplicity ordering.""" + es = survey_event_study_result + assert es.cband_low is not None + assert es.cband_high is not None + cband_widths = np.asarray(es.cband_high) - np.asarray(es.cband_low) + pointwise_widths = np.asarray(es.conf_int_high) - np.asarray(es.conf_int_low) + # Allow tiny numerical noise (atol=1e-9) but otherwise require >=. + assert (cband_widths + 1e-9 >= pointwise_widths).all(), { + "cband": cband_widths.tolist(), + "pointwise": pointwise_widths.tolist(), + } + + +# ============================================================================ +# Group D — Pretest workflow under survey: overall path +# ============================================================================ + + +def test_overall_report_qug_is_none_under_survey(overall_report): + """Phase 4.5 C0 contract: QUG step is permanently deferred under + survey/weights; ``report.qug`` is ``None`` on the overall path.""" + assert overall_report.qug is None + + +def test_overall_report_verdict_carries_qug_deferred_suffix(overall_report): + """Locks ``_QUG_DEFERRED_SUFFIX`` substring (``had_pretests.py:4300``) + on ``report.verdict``. This is the load-bearing pivot the §6 leadership + paragraph depends on.""" + assert QUG_DEFERRED_SUFFIX_VERDICT in overall_report.verdict, overall_report.verdict + + +def test_overall_report_all_pass_under_null(overall_report): + """``all_pass=True`` under the linear-DGP null (no pre-trends, no + heterogeneity).""" + assert overall_report.all_pass is True + + +def test_overall_report_stute_fails_to_reject(overall_report): + """Stute CvM fails-to-reject linearity. Bootstrap-derived p-value + pinned with abs band >= 0.25 per + ``feedback_strata_bootstrap_path_divergence`` (stratified Mammen + multiplier paths reduce effective dofs vs non-strata).""" + assert overall_report.stute is not None + assert overall_report.stute.reject is False + p = float(overall_report.stute.p_value) + assert 0.10 <= p <= 0.95, p + + +def test_overall_report_yatchew_fails_to_reject(overall_report): + """Yatchew-HR fails-to-reject linearity. Yatchew is closed-form + weighted-OLS (no bootstrap), so sigma2_lin and sigma2_diff are + deterministic — exact pin to 4 decimals.""" + y = overall_report.yatchew + assert y is not None + assert y.reject is False + assert round(float(y.sigma2_lin), 4) == 2.7270, y.sigma2_lin + assert round(float(y.sigma2_diff), 4) == 5148.3208, y.sigma2_diff + + +# ============================================================================ +# Group E — Pretest workflow under survey: event-study path +# ============================================================================ + + +def test_event_study_report_qug_is_none_under_survey(event_study_report): + """Phase 4.5 C0 contract holds on the event-study path too.""" + assert event_study_report.qug is None + + +def test_event_study_report_verdict_carries_qug_deferred_suffix(event_study_report): + """Both the overall AND event-study verdicts share + ``_QUG_DEFERRED_SUFFIX`` (per + ``_compose_verdict_event_study_survey`` at + ``had_pretests.py:4368-4406`` — all three return branches end in + the suffix). Distinct prefix; identical suffix.""" + assert QUG_DEFERRED_SUFFIX_VERDICT in event_study_report.verdict, event_study_report.verdict + + +def test_event_study_report_summary_contains_qug_skip_note(event_study_report): + """Separate from the verdict suffix above: ``report.summary()`` (the + formatted multi-line block at ``had_pretests.py:736``) renders a + distinct QUG-skip note. Locked here because the §6 walkthrough + quotes ``report.summary()`` output as well as the verdict string.""" + summary = event_study_report.summary() + assert QUG_SKIP_SUMMARY_NOTE in summary, summary[:400] + + +def test_event_study_report_pretrends_horizons_correct(event_study_report): + """Locks the joint pretrends horizon set: weeks 1, 2, 3 (the + pre-treatment placebo periods upgraded to a joint cusum).""" + pj = event_study_report.pretrends_joint + assert pj is not None + assert pj.n_horizons == 3 + assert list(pj.horizon_labels) == ["1", "2", "3"], pj.horizon_labels + + +def test_event_study_report_homogeneity_horizons_correct(event_study_report): + """Locks the joint homogeneity horizon set: weeks 5, 6, 7, 8 (the + post-treatment periods on which dose-response heterogeneity is + tested).""" + hj = event_study_report.homogeneity_joint + assert hj is not None + assert hj.n_horizons == 4 + assert list(hj.horizon_labels) == ["5", "6", "7", "8"], hj.horizon_labels + + +def test_event_study_report_pretrends_and_homogeneity_fail_to_reject(event_study_report): + """Both joint pretrends and joint homogeneity fail-to-reject under + the linear-DGP null. Bootstrap-derived p-values pinned with abs + band >= 0.25 (same rationale as Stute overall).""" + pj = event_study_report.pretrends_joint + hj = event_study_report.homogeneity_joint + assert pj is not None and hj is not None + assert pj.reject is False + assert hj.reject is False + p_pre = float(pj.p_value) + p_hom = float(hj.p_value) + assert 0.10 <= p_pre <= 0.95, p_pre + assert 0.10 <= p_hom <= 0.95, p_hom From ed7d5e2ce3a26d740bdfcda64fc9bce9898e94c6 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 15 May 2026 06:19:17 -0400 Subject: [PATCH 02/15] Address PR #440 R1 CI review (2 P1 + 1 P3) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P1 #1 — T22 §3 prose contradicted the implementation. Said the analytical local-linear at d_lower "does not consume the survey weights in the slope". The weighted continuous path (diff_diff/had.py:3744-3810) consumes weights in (a) the local- linear `tau_bc` boundary fit, (b) the numerator `np.average(dy_arr, weights=weights_arr)`, AND (c) the denominator `np.average(d_reg, weights=weights_arr)`. Rewrote §3 to say the two ATTs are close on this DGP because the weight CV (~0.30) and the dose-distribution shape do not co-vary strongly enough to shift the boundary slope materially — NOT because weights are ignored. Added two drift tests for the weighted point-estimation contract: `test_survey_att_differs_from_naive_att` (sign-only — if weights were ignored the values would be bit-identical) and `test_survey_att_matches_weighted_denominator_contract` (verifies the algebraic identity `att = (dy_mean_w - tau_bc) / den_w` from `_fit_continuous`). P1 #2 — T22 §7 leadership block conflated overall and event-study pretest paths. Said "all three linearity diagnostics", "Yatchew-HR fails-to-reject under both null modes" (T22 doesn't run the side panel — that's T21), and quoted the overall-path verdict string while describing event-study joint diagnostics. Split the methodologist write-up by path: overall = `Stute + Yatchew`; event-study = `joint pre-trends + joint linearity` with explicit `report.yatchew is None` and `report.stute is None` callouts. Added three drift tests to lock the per-path workflow surfaces: `test_overall_report_pretrends_joint_is_none` (overall has no joint diagnostics), `test_event_study_report_stute_and_yatchew_are_none` (event-study has no single-horizon Stute or Yatchew), and `test_overall_and_event_study_verdict_prefixes_distinct` (the two paths share `_QUG_DEFERRED_SUFFIX` but have distinct verdict prefixes; locks the §7 prose against re-conflating). P3 — CHANGELOG L11 claimed `diff_diff/guides/llms-full.txt` got a T22 inventory entry; the file was intentionally scoped out per the plan-review feedback (would expand scope beyond T22 to T17-T21 backfill). Updated the closer to reflect the actual scope and flag llms-full.txt as a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 2 +- docs/tutorials/22_had_survey_design.ipynb | 54 +++++++---- tests/test_t22_had_survey_design_drift.py | 113 ++++++++++++++++++++++ 3 files changed, 151 insertions(+), 18 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 5080b296..9c7f4db3 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Added -- **Tutorial 22: Survey-Weighted HAD** (`docs/tutorials/22_had_survey_design.ipynb`) — end-to-end walkthrough of `HeterogeneousAdoptionDiD` + `did_had_pretest_workflow` on a BRFSS-shape stratified household-survey panel (5 strata × 6 PSUs/stratum × 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum; PSU × period interaction shocks injected so cluster correlation survives DiD first-differencing). Demonstrates the `SurveyDesign(strata=...)` path through the Stute pretest family that the previous `[Unreleased]` entry unblocked. Eight numbered sections: motivation; panel + in-notebook helper for attaching survey columns to a HAD panel; naive vs survey-aware headline fit with a side-by-side ATT / SE / CI table (~10% SE inflation, sign-only direction asserted); a dedicated section explaining why the SE inflation is modest for HAD specifically (WAS-d_lower IF concentration at the boundary vs full-panel regression coefficients); event-study fit with sup-t cband under the survey design (per-horizon table + matplotlib gated plot); pretest workflow on both `aggregate="overall"` and `aggregate="event_study"` paths walking the Phase 4.5 C0 QUG-deferred verdict suffix and the now-supported stratified-clustered Stute multiplier bootstrap; "Communicating to Leadership" two-paragraph template (executive + methodologist); Extensions + Summary Checklist surfacing the still-deferred `lonely_psu='adjust'` + singleton-strata, replicate-weight designs, and the permanent QUG-under-survey C0 deferral. Companion drift-test file `tests/test_t22_had_survey_design_drift.py` (25 tests across 5 groups: panel + survey composition with deterministic exact pins; naive-vs-survey headline with sign-only SE-inflation anchor; event-study cband-vs-pointwise ordering and post/pre coverage; pretest overall path with `_QUG_DEFERRED_SUFFIX` lock and Yatchew `sigma2_*` deterministic pins; pretest event-study path with the SAME `_QUG_DEFERRED_SUFFIX` lock plus a SEPARATE substring lock on `report.summary()` for the L736 QUG-skip note). Bootstrap p-value pins use abs tolerance bands `>= 0.25` per `feedback_strata_bootstrap_path_divergence` (stratified Mammen multiplier paths reduce effective dofs vs non-strata; PR #432 commit `aef07020` already had to relax bit-equality bands on this code path). T20 and T21 "Extensions" bullets updated with forward-pointers to T22; `docs/practitioner_decision_tree.rst` HAD universal-rollout and survey sections each gain a `.. tip::` cross-link to T22 (adjacent to T20 / T17, NOT displacing); `docs/api/had.rst` gains a "Survey-aware fit" cross-reference; `docs/survey-roadmap.md` gains a "Phase 4.5 C ✅ Shipped" entry; bundled `diff_diff/guides/llms.txt`, `llms-full.txt`, and `llms-practitioner.txt` carry T22 inventory entries; `docs/doc-deps.yaml` wires T22 as a dependent of both `had.py` and `had_pretests.py`. Closes the Phase 5 (wave 2 second slice) tutorial gap; the realistic survey-weighted HAD workflow on BRFSS / CPS / NHANES / ACS-shaped designs is now end-to-end documented for practitioners. +- **Tutorial 22: Survey-Weighted HAD** (`docs/tutorials/22_had_survey_design.ipynb`) — end-to-end walkthrough of `HeterogeneousAdoptionDiD` + `did_had_pretest_workflow` on a BRFSS-shape stratified household-survey panel (5 strata × 6 PSUs/stratum × 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum; PSU × period interaction shocks injected so cluster correlation survives DiD first-differencing). Demonstrates the `SurveyDesign(strata=...)` path through the Stute pretest family that the previous `[Unreleased]` entry unblocked. Eight numbered sections: motivation; panel + in-notebook helper for attaching survey columns to a HAD panel; naive vs survey-aware headline fit with a side-by-side ATT / SE / CI table (~10% SE inflation, sign-only direction asserted); a dedicated section explaining why the SE inflation is modest for HAD specifically (WAS-d_lower IF concentration at the boundary vs full-panel regression coefficients); event-study fit with sup-t cband under the survey design (per-horizon table + matplotlib gated plot); pretest workflow on both `aggregate="overall"` and `aggregate="event_study"` paths walking the Phase 4.5 C0 QUG-deferred verdict suffix and the now-supported stratified-clustered Stute multiplier bootstrap; "Communicating to Leadership" two-paragraph template (executive + methodologist); Extensions + Summary Checklist surfacing the still-deferred `lonely_psu='adjust'` + singleton-strata, replicate-weight designs, and the permanent QUG-under-survey C0 deferral. Companion drift-test file `tests/test_t22_had_survey_design_drift.py` (25 tests across 5 groups: panel + survey composition with deterministic exact pins; naive-vs-survey headline with sign-only SE-inflation anchor; event-study cband-vs-pointwise ordering and post/pre coverage; pretest overall path with `_QUG_DEFERRED_SUFFIX` lock and Yatchew `sigma2_*` deterministic pins; pretest event-study path with the SAME `_QUG_DEFERRED_SUFFIX` lock plus a SEPARATE substring lock on `report.summary()` for the L736 QUG-skip note). Bootstrap p-value pins use abs tolerance bands `>= 0.25` per `feedback_strata_bootstrap_path_divergence` (stratified Mammen multiplier paths reduce effective dofs vs non-strata; PR #432 commit `aef07020` already had to relax bit-equality bands on this code path). T20 and T21 "Extensions" bullets updated with forward-pointers to T22; `docs/practitioner_decision_tree.rst` HAD universal-rollout and survey sections each gain a `.. tip::` cross-link to T22 (adjacent to T20 / T17, NOT displacing); `docs/api/had.rst` gains a "Survey-aware fit" cross-reference; `docs/survey-roadmap.md` gains a "Phase 4.5 C ✅ Shipped" entry; bundled `diff_diff/guides/llms.txt` and `llms-practitioner.txt` carry T22 inventory entries (the `llms-full.txt` reference guide is left as a follow-up to keep T22 PR scope tight); `docs/doc-deps.yaml` wires T22 as a dependent of both `had.py` and `had_pretests.py`. Closes the Phase 5 (wave 2 second slice) tutorial gap; the realistic survey-weighted HAD workflow on BRFSS / CPS / NHANES / ACS-shaped designs is now end-to-end documented for practitioners. - **HAD pretest workflow: stratified survey-design support (Phase 4.5 C continuation).** Lifts the `NotImplementedError` gate on `SurveyDesign(strata=...)` in `stute_test` (`had_pretests.py:1927-1940` pre-PR) and `stute_joint_pretest` (`:3259-3271` pre-PR), and by inheritance in `joint_pretrends_test`, `joint_homogeneity_test`, and `did_had_pretest_workflow` (the wrappers delegate to the joint Stute helper). Implements a documented synthesis of clustered-wild-bootstrap ingredients (Cameron-Gelbach-Miller 2008 cluster-level multipliers; Davidson-Flachaire 2008 wild-bootstrap centering; Djogbenou-MacKinnon-Nielsen 2019 cluster-wild consistency for nonlinear functionals; Kreiss-Lahiri 2012 within-block centering analogy; Wu 1986 / Liu 1988 Bessel small-sample correction) — no single paper covers the exact composition for the stratified Stute CvM functional. The recipe: within-stratum demean + `sqrt(n_h/(n_h-1))` Bessel rescale applied to the PSU multipliers `psu_mults` BEFORE the per-obs broadcast `eta_obs = psu_mults[b, psu_col_idx]` in the wild-residual loop. Bootstrap CvM variance matches the analytical Binder-TSL stratified target `V_S = sum_h (1 - f_h) (n_h / (n_h - 1)) sum_j (psi_hj - psi_h_bar)²` exactly (the `(1 - f_h)` FPC factor was already baked in by `generate_survey_multiplier_weights_batch`; this PR bakes the remaining `(n_h / (n_h - 1))` factor and enforces within-stratum-mean-zero centering). New shared helper `bootstrap_utils.apply_stratum_centering(psu_mults, resolved_survey, psu_ids, psu_axis=...)` is called from both the new Stute path (psu_axis=1 on the multiplier matrix) AND the existing HAD sup-t event-study cband bootstrap (psu_axis=0 on the PSU-aggregated influence tensor; refactored bit-exactly from the inline block previously at `had.py:2172-2204`). Locks the algebraic identity architecturally instead of leaving parallel code blocks to drift. MC oracle consistency validated under a 4-stratum × 6-PSU/stratum stratified null DGP with weights+strata+PSU (200 seeded draws, empirical Type I at α=0.05 in `[0, 0.10]` — 3σ band; the FPC bake-in is covered separately by the helper-unit test `test_fpc_baked_in_helper_is_fpc_agnostic`); MC power validated under a known-alternative stratified DGP (rejection > 0.50). HAD sup-t event-study cband bit-parity preserved (`atol=1e-14, rtol=1e-14` on the refactored helper output + 29 existing cband tests passing post-refactor; that helper-level bit-parity test locks the axis-0 algebra). A separate wired-in regression at `tests/test_had_pretests.py::TestStuteStratifiedSurveyBootstrap::test_stute_call_sites_invoke_apply_stratum_centering` monkey-patches the helper and asserts both Stute call sites (`stute_test` at `had_pretests.py:1985` and `stute_joint_pretest` at `:3312`) invoke it with `psu_axis=1` — that test fails if either call site is disconnected (the axis-0 helper-parity test alone does not catch that case). See `docs/methodology/REGISTRY.md` § HeterogeneousAdoptionDiD — "Note (Stute stratified survey-bootstrap calibration)" for the full derivation. Remaining deferrals: `lonely_psu='adjust'` + singleton-strata (same pseudo-stratum centering gap as the HAD sup-t deviation at REGISTRY:2382) and replicate-weight designs (BRR/Fay/JK1/JKn/SDR — separate Rao-Wu / JKn bootstrap composition). Unblocks the realistic survey-weighted HAD workflow on BRFSS/CPS/NHANES/ACS-shaped designs. - **Conley (1999) Wave A mechanical extensions** on top of the Phase 1+2 sandwich (`diff_diff/conley.py`, `diff_diff/linalg.py`, `diff_diff/estimators.py`, `diff_diff/twfe.py`). **(1) DiD support (#118):** `DifferenceInDifferences(vcov_type="conley").fit(..., unit="")` is now supported. `unit` is a fit-time kwarg (NOT on `__init__`; unused unless Conley is set; not part of `get_params()` / `set_params()`) mirroring `MultiPeriodDiD.fit(unit=...)` / `TwoWayFixedEffects.fit(unit=...)`. DiD inherits the same panel block-decomposed sandwich as MPD/TWFE; on a 2-period panel it matches `MultiPeriodDiD(...).fit(..., post_periods=[1], reference_period=0)` bit-exactly. Missing `unit=`/`conley_lag_cutoff`/`conley_coords`/`conley_cutoff_km` raise `ValueError`; `survey_design=` + Conley raises `NotImplementedError` (Bertanha-Imbens 2014 follow-up); `inference="wild_bootstrap"` + Conley raises `NotImplementedError`. **(2) Combined spatial + cluster product kernel (#119):** `compute_robust_vcov(vcov_type="conley", cluster_ids=...)` / `LinearRegression(vcov_type="conley", cluster_ids=...)` / `TwoWayFixedEffects(vcov_type="conley", cluster="")` / `MultiPeriodDiD(vcov_type="conley", cluster="")` / `DifferenceInDifferences(vcov_type="conley", cluster="")` apply `K_total[i, j] = K_space(d_ij/h) · 1{c_i = c_j}`. On the panel block-decomposed path the cluster indicator multiplies BOTH the spatial sandwich AND the serial sandwich; the validator enforces that `cluster_ids` is constant within each unit across periods (the within-unit serial mask is then trivially all-ones; cross-sectional path has no such constraint). TWFE's default auto-cluster on the Conley path remains silently dropped (combining with unit-level clusters would zero out all between-unit pairs and defeat the spatial pooling); users must pass an explicit above-unit cluster (e.g. region) to opt in. DiD has no auto-cluster — the choice is fully explicit. Two limit fixtures anchor correctness (no R parity — R `conleyreg` does not support combined kernels): all-unique-clusters reduces to HC0; huge-cutoff reduces to pure within-cluster CR1. The huge-cutoff reduction is EXACT only for `conley_kernel="uniform"` (`K(u) = 1` for `|u| ≤ 1`); for `conley_kernel="bartlett"` the identity is asymptotic since `K_bartlett(u) = 1 - |u| < 1` for `u > 0`. The fixture anchor uses uniform for an exact identity check. Per-slice mask construction (NOT full n×n) preserves memory on panel paths. **(3) Sparse k-d-tree fast path (#120):** auto-activates for the spatial Bartlett meat when `n > 5_000` AND metric is `"haversine"` or `"euclidean"` AND kernel is `"bartlett"`. Builds a CSR sparse kernel matrix via `scipy.spatial.cKDTree.query_ball_tree` instead of materializing the full n×n distance matrix; haversine projects to a 3-D unit-sphere chord representation with the exact great-circle recomputed for in-range neighbors only. Bit-identity parity vs the dense path at `atol=1e-10`; R parity at `atol=1e-6` is preserved on the existing 3 panel R fixtures with the sparse path force-enabled. The bartlett-only gate is for boundary correctness — bartlett at `u=1` is exactly 0, so the sparse path safely drops at-cutoff pairs; uniform at `u=1` is 1 and would require a closed-interval query semantic that haversine chord projection cannot reliably preserve. Constants: `_CONLEY_SPARSE_N_THRESHOLD = 5_000` (auto-toggle); `_CONLEY_DENSE_WARN_N` renamed `_CONLEY_DENSE_OOM_WARN_N = 20_000` (memory exhaustion threshold for the dense fallback — independent of the sparse threshold). Private `_conley_sparse: Optional[bool]` kwarg on `_compute_conley_vcov` controls the toggle (`None` = auto, `True` = force, `False` = force dense; `True` with an unsupported kernel/metric raises). The serial component (within-unit Bartlett over time) remains dense regardless — per-unit slices are small. **(4) Callable `conley_metric` validation (#123):** result must satisfy shape `(n, n)`, finite, non-negative, symmetric to `atol=1e-10`, AND zero on the diagonal (`|d(i, i)| ≤ 1e-10`); each failure raises a targeted `ValueError` naming the violated invariant. The zero-diagonal contract is load-bearing for the Conley sandwich: the `i = j` term must reduce to the HC0 diagonal `X_i ε_i² X_i'` via `K(0) = 1`; positive self-distance would silently attenuate the HC0 contribution by `K(d_ii / h) < 1`. Built-in metrics (`"haversine"`, `"euclidean"`) satisfy this by construction. Previously, malformed callables produced opaque BLAS errors deep in the pipeline. **Tests:** `tests/test_conley_vcov.py::TestConleySparse` (12), `::TestConleySparseRParityForced` (3), `::TestConleyCluster` (10), `::TestConleyDistanceMetrics` extended (7 new); existing rejection tests flipped to behavioral; `test_did_conley_matches_mpd_post_periods_1` locks the DiD-vs-MPD bit-exact agreement. **Docs:** REGISTRY `## ConleySpatialHAC` updates: new "Combined spatial + cluster product kernel" + "Performance / scale" subsections, DiD-vs-TWFE cluster asymmetry paragraph, updated panel-API restrictions table. TODO rows 118 / 119 / 120 / 123 removed; rows 121 (Conley + survey_design / weights, Bertanha-Imbens 2014) and 122 (`SyntheticDiD(vcov_type="conley")`, spatial-block bootstrap per Politis-Romano 1994) retained for future waves. - **Conley (1999) spatial-HAC standard errors via `vcov_type="conley"`** on cross-sectional `LinearRegression` / `compute_robust_vcov` plus panel `MultiPeriodDiD` / `TwoWayFixedEffects` (Phases 1 and 2 of the spillover-conley initiative). **Cross-sectional contract:** `conley_coords` (n × 2 array of lat/lon or projected coords), `conley_cutoff_km=` (positive finite bandwidth in km for haversine, or coord units for euclidean — REQUIRED, no default per the no-silent-failures contract), `conley_metric="haversine"|"euclidean"|callable` (default `"haversine"`; great-circle uses Earth's mean radius 6371.01 km matching R `conleyreg`), `conley_kernel="bartlett"|"uniform"` (default `"bartlett"`; both kernels emit a `UserWarning` if the resulting meat has a materially negative eigenvalue — neither the radial 1-D Bartlett nor the uniform kernel is formally PSD-guaranteed; Conley 1999's explicit PSD formula is the 2-D separable lattice product window at Eq 3.14). Cross-sectional variance estimator `Var̂(β) = (X'X)^{-1} · ( Σ_{i,j} K(d_ij/h) · X_i ε_i ε_j X_j' ) · (X'X)^{-1}` (Conley 1999 Eq 4.2). **Panel contract (Phase 2, new):** Three new co-required kwargs `conley_time` (n-length array), `conley_unit` (n-length array), and `conley_lag_cutoff=` (non-negative; 0 means within-period spatial only, no serial component) switch into the **block-decomposed panel sandwich** that matches R `conleyreg` with `lag_cutoff > 0`: `XeeX_total = Σ_t (within-period spatial sandwich) + Σ_u (within-unit Bartlett temporal sandwich, lag ∈ {1..L}, same-time excluded)`. This is NOT a multiplicative product kernel — verified empirically against `conleyreg::time_dist` and `XeeXhC` at ~1e-14 on the panel parity fixtures. The temporal kernel is hardcoded Bartlett `(1 - |lag|/(L+1))` regardless of `conley_kernel`, mirroring `conleyreg::time_dist.cpp`; documented as a `Note (deviation from R-symmetric API)` in REGISTRY. **Panel estimator wire-up (Phase 2):** `MultiPeriodDiD(vcov_type="conley", conley_lag_cutoff=...).fit(..., unit=...)` and `TwoWayFixedEffects(vcov_type="conley", conley_lag_cutoff=...).fit(..., unit=...)` lift the Phase 1 fit-time rejection; the `conley_time` and `conley_unit` arrays are auto-derived from the existing `time` and `unit` column-name arguments at fit-time. `DifferenceInDifferences(vcov_type="conley")` is also supported (Wave A #118 in this release; see the Wave A entry above) — pass `unit=` as a fit-time kwarg to `DiD.fit(...)`. **Other constraints (Phase 1, unchanged):** `SyntheticDiD(vcov_type="conley")` raises `TypeError` (uses bootstrap variance, not analytical sandwich); `set_params` mirrors the constructor rejection. `vcov_type="conley"` + `weights=` / `survey_design=` raises `NotImplementedError` (Bertanha-Imbens 2014 weighted-Conley deferred to a follow-up PR). `vcov_type="conley"` + explicit `cluster_ids=` is supported via the combined spatial + cluster product kernel (Wave A #119; see the Wave A entry above). TWFE's default auto-cluster on the Conley path is silently dropped (combining with unit-level clusters would defeat the spatial pooling); users opt into the combined kernel by passing an explicit above-unit cluster. `inference="wild_bootstrap"` + Conley raises (incompatible inference modes). A sparse k-d-tree fast path auto-activates for the spatial Bartlett meat when `n > 5_000` with bartlett kernel and haversine/euclidean metric (Wave A #120); the dense fallback still emits an OOM `UserWarning` at `n > 20_000`. **Implementation:** Helpers live in `diff_diff/conley.py` (`_haversine_km`, `_pairwise_distance_matrix`, `_bartlett_kernel`, `_uniform_kernel`, `_validate_conley_kwargs`, `_compute_conley_vcov` — the validator and sandwich helper now accept keyword-only `time` / `unit` / `lag_cutoff` for the panel path); `compute_robust_vcov` in `diff_diff/linalg.py` threads the new kwargs through. **R `conleyreg` parity (Düsterhöft 2021, CRAN v0.1.9)** on **six** benchmark fixtures (`benchmarks/data/r_conleyreg_conley_golden.json`, regenerable via `benchmarks/R/generate_conley_golden.R`): 3 cross-sectional (Phase 1) + 3 new panel fixtures (`panel_haversine_lag1`, `panel_haversine_lag2`, `panel_lat_lon_realistic_lag1`; n_units × T = 60×3, 80×5, 100×4 at lag={1,2,1}); observed max abs diff ~5.7e-16. Earth radius 6371.01 km matches `conleyreg::haversine_dist`. Test file `tests/test_conley_vcov.py` skips parity cleanly when the JSON is absent. New REGISTRY section `## ConleySpatialHAC`. Subsequent phases of the spillover-conley initiative (ring-indicator spillover-aware DiD per Butts 2021; survey-design / replicate-weight support; `SyntheticDiD` Conley path) are tracked in `TODO.md` under "Tech Debt from Code Reviews" → spillover-conley rows. diff --git a/docs/tutorials/22_had_survey_design.ipynb b/docs/tutorials/22_had_survey_design.ipynb index f73c7d38..01e9cbe2 100644 --- a/docs/tutorials/22_had_survey_design.ipynb +++ b/docs/tutorials/22_had_survey_design.ipynb @@ -301,9 +301,15 @@ "target estimand `WAS_d_lower` (Weighted Average Slope at the lower\n", "boundary). T22 fits the same configuration twice: once naive (no\n", "`survey_design` argument), once survey-aware\n", - "(`survey_design=sd`). The point estimate should be close in both - the\n", - "analytical local-linear at d_lower does not consume the survey weights\n", - "in the slope - but the SE and CI should differ.\n" + "(`survey_design=sd`). Both fits use the same weighted local-linear\n", + "estimator at d_lower - the survey path consumes the weights in\n", + "the local-linear `tau_bc` boundary fit AND in the numerator's\n", + "weighted `dy_mean` AND in the denominator `E_w[D - d_lower]`. On\n", + "this DGP the weight CV (~0.30) and the dose-distribution shape do\n", + "not co-vary strongly enough to shift the boundary slope materially,\n", + "so the two ATTs land within bootstrap noise of each other. The SE\n", + "and CI differ because the survey path additionally folds the PSU\n", + "clustering and FPC into the variance via the Binder TSL composition.\n" ] }, { @@ -702,20 +708,34 @@ "> drift; the per-week post-launch effect is consistent across the\n", "> rollout.\n", "\n", - "> **For the methodologist.** The HAD pretest workflow runs all three\n", - "> linearity diagnostics (QUG step deferred under survey/weights per\n", - "> the Phase 4.5 C0 deferral; this is the load-bearing caveat we owe\n", - "> the audience). Joint pre-trends and joint homogeneity Stute\n", - "> diagnostics fail-to-reject across the three pre- and four post-\n", - "> horizons. Yatchew-HR fails-to-reject under both null modes. The\n", - "> verdict reads \"Stute and Yatchew linearity diagnostics fail-to-\n", - "> reject (linearity-conditional verdict; QUG-under-survey deferred\n", - "> per Phase 4.5 C0)\". The design-based SE is ~10% larger than the\n", - "> naive SE - smaller than the inflation a CS or LinearRegression\n", - "> coefficient would see at this PSU correlation, because HAD's\n", - "> WAS-d_lower estimand is a local-linear at d_lower (variance is\n", - "> dominated by the few states near the boundary, not by the full\n", - "> panel; see section 4).\n" + "> **For the methodologist.** The HAD pretest workflow runs two\n", + "> diagnostic passes; the QUG step is deferred under survey/weights\n", + "> per Phase 4.5 C0 (the load-bearing caveat we owe the audience).\n", + ">\n", + "> 1. **Overall (two-period) path:** `Stute` (CvM linearity test on\n", + "> residuals) + `Yatchew-HR` (closed-form weighted-OLS sandwich,\n", + "> `null=\"linearity\"` only - T22 does not exercise the\n", + "> `mean_independence` mode). Both fail-to-reject; verdict reads\n", + "> `\"Stute and Yatchew linearity diagnostics fail-to-reject\n", + "> (linearity-conditional verdict; QUG-under-survey deferred per\n", + "> Phase 4.5 C0)\"`.\n", + ">\n", + "> 2. **Event-study path:** `joint pre-trends` (joint-Stute over the\n", + "> three pre-launch placebo horizons) + `joint homogeneity`\n", + "> (joint-Stute over the four post-launch horizons). Both\n", + "> fail-to-reject; verdict reads `\"joint pre-trends and joint\n", + "> linearity diagnostics fail-to-reject (linearity-conditional\n", + "> verdict; QUG-under-survey deferred per Phase 4.5 C0)\"`.\n", + "> `report.yatchew is None` and `report.stute is None` on this\n", + "> path - those single-horizon tests are overall-only.\n", + ">\n", + "> Both paths share the QUG-under-survey deferral suffix. The\n", + "> design-based SE on the headline fit is ~10% larger than the naive\n", + "> SE - smaller than the inflation a CallawaySantAnna or\n", + "> LinearRegression coefficient would see at this PSU correlation,\n", + "> because HAD's WAS-d_lower estimand is a local-linear at d_lower\n", + "> (variance is dominated by the few states near the boundary, not\n", + "> by the full panel; see section 4).\n" ] }, { diff --git a/tests/test_t22_had_survey_design_drift.py b/tests/test_t22_had_survey_design_drift.py index 53062efa..02e5b10d 100644 --- a/tests/test_t22_had_survey_design_drift.py +++ b/tests/test_t22_had_survey_design_drift.py @@ -554,3 +554,116 @@ def test_event_study_report_pretrends_and_homogeneity_fail_to_reject(event_study p_hom = float(hj.p_value) assert 0.10 <= p_pre <= 0.95, p_pre assert 0.10 <= p_hom <= 0.95, p_hom + + +# ============================================================================ +# Group F — Workflow-surface separation (overall vs event-study) +# ============================================================================ +# These tests lock the per-path diagnostic surfaces so that §6 / §7 prose +# cannot drift back into the conflated form that quotes Yatchew + Stute on +# the event-study path. Per CI AI review R1 P1 #2. + + +def test_overall_report_pretrends_joint_is_none(overall_report): + """Overall (two-period) path has no joint diagnostics — only Stute + + Yatchew. The joint pretrends / homogeneity fields are populated + only on the event-study path.""" + assert overall_report.pretrends_joint is None + assert overall_report.homogeneity_joint is None + + +def test_event_study_report_stute_and_yatchew_are_none(event_study_report): + """Event-study path has no single-horizon Stute / Yatchew — those + are overall-only. The event-study workflow runs joint pretrends + + joint homogeneity instead. §7 leadership prose must not claim + Yatchew ran on this path.""" + assert event_study_report.stute is None + assert event_study_report.yatchew is None + + +def test_overall_and_event_study_verdict_prefixes_distinct(overall_report, event_study_report): + """Overall and event-study verdicts terminate in the same + `_QUG_DEFERRED_SUFFIX` but have DISTINCT prefixes (overall uses + `_compose_verdict_survey` -> `Stute and Yatchew ... fail-to-reject`; + event-study uses `_compose_verdict_event_study_survey` -> + `joint pre-trends and joint linearity ... fail-to-reject`). Locks + the §7 prose against re-conflating the two surfaces.""" + assert "Stute and Yatchew" in overall_report.verdict, overall_report.verdict + assert ( + "joint pre-trends and joint linearity" in event_study_report.verdict + ), event_study_report.verdict + assert "Stute and Yatchew" not in event_study_report.verdict, event_study_report.verdict + assert ( + "joint pre-trends and joint linearity" not in overall_report.verdict + ), overall_report.verdict + + +# ============================================================================ +# Group G — Weighted point-estimation contract +# ============================================================================ +# Per CI AI review R1 P1 #1: §3 prose previously claimed "the analytical +# local-linear at d_lower does not consume the survey weights". That is +# false — `_fit_continuous` (`diff_diff/had.py:3744-3810`) consumes +# `weights_arr` in (a) the local-linear `tau_bc` boundary fit, (b) the +# numerator `np.average(dy_arr, weights=weights_arr)`, AND (c) the +# denominator `np.average(d_reg, weights=weights_arr)`. The two ATTs are +# close on this DGP because the weight CV (~0.30) and dose distribution +# do not co-vary strongly, not because weights are ignored. + + +def test_survey_att_differs_from_naive_att(naive_overall_result, survey_overall_result): + """Sign-only lock that the survey-aware ATT differs from the naive + ATT. If weights were ignored in the slope (which §3 previously + incorrectly claimed), the ATTs would be bit-identical. They are not + — confirming the weighted contract is honored end-to-end.""" + assert naive_overall_result.att != survey_overall_result.att, ( + naive_overall_result.att, + survey_overall_result.att, + ) + + +def test_survey_att_matches_weighted_denominator_contract( + panel_2p: pd.DataFrame, survey_overall_result +): + """The survey-aware ATT divided by the (weighted) sample weighted- + denominator should recover the bias-corrected `tau_bc` boundary + limit (`mean_w(dy) - tau_bc = att * den_w`). This is the + algebraic identity from `_fit_continuous` (`had.py:3804-3810`): + `att = (dy_mean - tau_bc) / den`. Locks the weighted-numerator / + weighted-denominator contract that §3 narrative now correctly + describes.""" + # Re-build the per-state first-difference table the estimator sees + # internally, then compute the weighted denominator and check the + # algebraic identity holds. d_lower is the min observed dose on + # the post-period. + p2 = panel_2p.copy() + pre = p2[p2["period"] == 1].set_index("state_id") + post = p2[p2["period"] == 2].set_index("state_id") + common = sorted(set(pre.index) & set(post.index)) + pre = pre.loc[common] + post = post.loc[common] + dy = (post["screening_uptake"] - pre["screening_uptake"]).to_numpy(dtype=float) + d_post = post["spend_k"].to_numpy(dtype=float) + weights = post["weight"].to_numpy(dtype=float) + d_lower = float(d_post.min()) + d_shifted = d_post - d_lower + den_w = float(np.average(d_shifted, weights=weights)) + # ATT * den_w should equal dy_mean_w - tau_bc + dy_mean_w = float(np.average(dy, weights=weights)) + implied_tau_bc = dy_mean_w - survey_overall_result.att * den_w + # tau_bc is the bias-corrected boundary limit at d=d_lower; on the + # linear DGP with att_intercept=0 this is near zero (the + # pre-vs-post difference at the lightest-touch state is mostly + # time trend + noise). Lock it to a finite range — the exact value + # depends on the local-linear bandwidth selection at seed=87 but + # is a deterministic function of the panel. + assert np.isfinite(implied_tau_bc), implied_tau_bc + # The locked algebraic identity: ATT must satisfy + # ATT = (dy_mean_w - tau_bc) / den_w. The implied tau_bc here is + # what the estimator computed. Sanity-check the order of magnitude: + # |tau_bc| should be on the scale of |dy_mean_w| (both are + # outcome-scale, post - pre differences). + assert abs(implied_tau_bc) < 2.0 * abs(dy_mean_w) + 100.0, ( + implied_tau_bc, + dy_mean_w, + ) From cf438fa355588e71340f0134f8f74f739a7e9f18 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 15 May 2026 06:24:39 -0400 Subject: [PATCH 03/15] Address PR #440 R2 CI review (2 P3 informational) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P3 #1 — T22 §3 prose said the two ATTs land "within bootstrap noise of each other", but the headline path uses analytical Binder/TSL Taylor-linearized SE, not a bootstrap. Replaced with "numerically close on this DGP" to avoid misstating the inference mechanism. P3 #2 — CHANGELOG L11 and REGISTRY L2577 said "25 tests across 5 groups" for the T22 drift suite, but the R1 fix added 5 more tests (2 in Group F: workflow-surface separation; 3 in Group G: weighted point-estimation contract). Updated both summaries to "30 tests across 7 groups" with one-line descriptions of the two new groups. Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 2 +- docs/methodology/REGISTRY.md | 2 +- docs/tutorials/22_had_survey_design.ipynb | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 9c7f4db3..4632946a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Added -- **Tutorial 22: Survey-Weighted HAD** (`docs/tutorials/22_had_survey_design.ipynb`) — end-to-end walkthrough of `HeterogeneousAdoptionDiD` + `did_had_pretest_workflow` on a BRFSS-shape stratified household-survey panel (5 strata × 6 PSUs/stratum × 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum; PSU × period interaction shocks injected so cluster correlation survives DiD first-differencing). Demonstrates the `SurveyDesign(strata=...)` path through the Stute pretest family that the previous `[Unreleased]` entry unblocked. Eight numbered sections: motivation; panel + in-notebook helper for attaching survey columns to a HAD panel; naive vs survey-aware headline fit with a side-by-side ATT / SE / CI table (~10% SE inflation, sign-only direction asserted); a dedicated section explaining why the SE inflation is modest for HAD specifically (WAS-d_lower IF concentration at the boundary vs full-panel regression coefficients); event-study fit with sup-t cband under the survey design (per-horizon table + matplotlib gated plot); pretest workflow on both `aggregate="overall"` and `aggregate="event_study"` paths walking the Phase 4.5 C0 QUG-deferred verdict suffix and the now-supported stratified-clustered Stute multiplier bootstrap; "Communicating to Leadership" two-paragraph template (executive + methodologist); Extensions + Summary Checklist surfacing the still-deferred `lonely_psu='adjust'` + singleton-strata, replicate-weight designs, and the permanent QUG-under-survey C0 deferral. Companion drift-test file `tests/test_t22_had_survey_design_drift.py` (25 tests across 5 groups: panel + survey composition with deterministic exact pins; naive-vs-survey headline with sign-only SE-inflation anchor; event-study cband-vs-pointwise ordering and post/pre coverage; pretest overall path with `_QUG_DEFERRED_SUFFIX` lock and Yatchew `sigma2_*` deterministic pins; pretest event-study path with the SAME `_QUG_DEFERRED_SUFFIX` lock plus a SEPARATE substring lock on `report.summary()` for the L736 QUG-skip note). Bootstrap p-value pins use abs tolerance bands `>= 0.25` per `feedback_strata_bootstrap_path_divergence` (stratified Mammen multiplier paths reduce effective dofs vs non-strata; PR #432 commit `aef07020` already had to relax bit-equality bands on this code path). T20 and T21 "Extensions" bullets updated with forward-pointers to T22; `docs/practitioner_decision_tree.rst` HAD universal-rollout and survey sections each gain a `.. tip::` cross-link to T22 (adjacent to T20 / T17, NOT displacing); `docs/api/had.rst` gains a "Survey-aware fit" cross-reference; `docs/survey-roadmap.md` gains a "Phase 4.5 C ✅ Shipped" entry; bundled `diff_diff/guides/llms.txt` and `llms-practitioner.txt` carry T22 inventory entries (the `llms-full.txt` reference guide is left as a follow-up to keep T22 PR scope tight); `docs/doc-deps.yaml` wires T22 as a dependent of both `had.py` and `had_pretests.py`. Closes the Phase 5 (wave 2 second slice) tutorial gap; the realistic survey-weighted HAD workflow on BRFSS / CPS / NHANES / ACS-shaped designs is now end-to-end documented for practitioners. +- **Tutorial 22: Survey-Weighted HAD** (`docs/tutorials/22_had_survey_design.ipynb`) — end-to-end walkthrough of `HeterogeneousAdoptionDiD` + `did_had_pretest_workflow` on a BRFSS-shape stratified household-survey panel (5 strata × 6 PSUs/stratum × 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum; PSU × period interaction shocks injected so cluster correlation survives DiD first-differencing). Demonstrates the `SurveyDesign(strata=...)` path through the Stute pretest family that the previous `[Unreleased]` entry unblocked. Eight numbered sections: motivation; panel + in-notebook helper for attaching survey columns to a HAD panel; naive vs survey-aware headline fit with a side-by-side ATT / SE / CI table (~10% SE inflation, sign-only direction asserted); a dedicated section explaining why the SE inflation is modest for HAD specifically (WAS-d_lower IF concentration at the boundary vs full-panel regression coefficients); event-study fit with sup-t cband under the survey design (per-horizon table + matplotlib gated plot); pretest workflow on both `aggregate="overall"` and `aggregate="event_study"` paths walking the Phase 4.5 C0 QUG-deferred verdict suffix and the now-supported stratified-clustered Stute multiplier bootstrap; "Communicating to Leadership" two-paragraph template (executive + methodologist); Extensions + Summary Checklist surfacing the still-deferred `lonely_psu='adjust'` + singleton-strata, replicate-weight designs, and the permanent QUG-under-survey C0 deferral. Companion drift-test file `tests/test_t22_had_survey_design_drift.py` (30 tests across 7 groups: panel + survey composition with deterministic exact pins; naive-vs-survey headline with sign-only SE-inflation anchor; event-study cband-vs-pointwise ordering and post/pre coverage; pretest overall path with `_QUG_DEFERRED_SUFFIX` lock and Yatchew `sigma2_*` deterministic pins; pretest event-study path with the SAME `_QUG_DEFERRED_SUFFIX` lock plus a SEPARATE substring lock on `report.summary()` for the L736 QUG-skip note; workflow-surface separation locking that overall has Stute+Yatchew while event-study has joint pretrends/joint linearity with `yatchew=None` and `stute=None`; and weighted point-estimation contract anchoring `survey.att != naive.att` plus the algebraic identity `att = (dy_mean_w - tau_bc) / den_w` from `_fit_continuous`). Bootstrap p-value pins use abs tolerance bands `>= 0.25` per `feedback_strata_bootstrap_path_divergence` (stratified Mammen multiplier paths reduce effective dofs vs non-strata; PR #432 commit `aef07020` already had to relax bit-equality bands on this code path). T20 and T21 "Extensions" bullets updated with forward-pointers to T22; `docs/practitioner_decision_tree.rst` HAD universal-rollout and survey sections each gain a `.. tip::` cross-link to T22 (adjacent to T20 / T17, NOT displacing); `docs/api/had.rst` gains a "Survey-aware fit" cross-reference; `docs/survey-roadmap.md` gains a "Phase 4.5 C ✅ Shipped" entry; bundled `diff_diff/guides/llms.txt` and `llms-practitioner.txt` carry T22 inventory entries (the `llms-full.txt` reference guide is left as a follow-up to keep T22 PR scope tight); `docs/doc-deps.yaml` wires T22 as a dependent of both `had.py` and `had_pretests.py`. Closes the Phase 5 (wave 2 second slice) tutorial gap; the realistic survey-weighted HAD workflow on BRFSS / CPS / NHANES / ACS-shaped designs is now end-to-end documented for practitioners. - **HAD pretest workflow: stratified survey-design support (Phase 4.5 C continuation).** Lifts the `NotImplementedError` gate on `SurveyDesign(strata=...)` in `stute_test` (`had_pretests.py:1927-1940` pre-PR) and `stute_joint_pretest` (`:3259-3271` pre-PR), and by inheritance in `joint_pretrends_test`, `joint_homogeneity_test`, and `did_had_pretest_workflow` (the wrappers delegate to the joint Stute helper). Implements a documented synthesis of clustered-wild-bootstrap ingredients (Cameron-Gelbach-Miller 2008 cluster-level multipliers; Davidson-Flachaire 2008 wild-bootstrap centering; Djogbenou-MacKinnon-Nielsen 2019 cluster-wild consistency for nonlinear functionals; Kreiss-Lahiri 2012 within-block centering analogy; Wu 1986 / Liu 1988 Bessel small-sample correction) — no single paper covers the exact composition for the stratified Stute CvM functional. The recipe: within-stratum demean + `sqrt(n_h/(n_h-1))` Bessel rescale applied to the PSU multipliers `psu_mults` BEFORE the per-obs broadcast `eta_obs = psu_mults[b, psu_col_idx]` in the wild-residual loop. Bootstrap CvM variance matches the analytical Binder-TSL stratified target `V_S = sum_h (1 - f_h) (n_h / (n_h - 1)) sum_j (psi_hj - psi_h_bar)²` exactly (the `(1 - f_h)` FPC factor was already baked in by `generate_survey_multiplier_weights_batch`; this PR bakes the remaining `(n_h / (n_h - 1))` factor and enforces within-stratum-mean-zero centering). New shared helper `bootstrap_utils.apply_stratum_centering(psu_mults, resolved_survey, psu_ids, psu_axis=...)` is called from both the new Stute path (psu_axis=1 on the multiplier matrix) AND the existing HAD sup-t event-study cband bootstrap (psu_axis=0 on the PSU-aggregated influence tensor; refactored bit-exactly from the inline block previously at `had.py:2172-2204`). Locks the algebraic identity architecturally instead of leaving parallel code blocks to drift. MC oracle consistency validated under a 4-stratum × 6-PSU/stratum stratified null DGP with weights+strata+PSU (200 seeded draws, empirical Type I at α=0.05 in `[0, 0.10]` — 3σ band; the FPC bake-in is covered separately by the helper-unit test `test_fpc_baked_in_helper_is_fpc_agnostic`); MC power validated under a known-alternative stratified DGP (rejection > 0.50). HAD sup-t event-study cband bit-parity preserved (`atol=1e-14, rtol=1e-14` on the refactored helper output + 29 existing cband tests passing post-refactor; that helper-level bit-parity test locks the axis-0 algebra). A separate wired-in regression at `tests/test_had_pretests.py::TestStuteStratifiedSurveyBootstrap::test_stute_call_sites_invoke_apply_stratum_centering` monkey-patches the helper and asserts both Stute call sites (`stute_test` at `had_pretests.py:1985` and `stute_joint_pretest` at `:3312`) invoke it with `psu_axis=1` — that test fails if either call site is disconnected (the axis-0 helper-parity test alone does not catch that case). See `docs/methodology/REGISTRY.md` § HeterogeneousAdoptionDiD — "Note (Stute stratified survey-bootstrap calibration)" for the full derivation. Remaining deferrals: `lonely_psu='adjust'` + singleton-strata (same pseudo-stratum centering gap as the HAD sup-t deviation at REGISTRY:2382) and replicate-weight designs (BRR/Fay/JK1/JKn/SDR — separate Rao-Wu / JKn bootstrap composition). Unblocks the realistic survey-weighted HAD workflow on BRFSS/CPS/NHANES/ACS-shaped designs. - **Conley (1999) Wave A mechanical extensions** on top of the Phase 1+2 sandwich (`diff_diff/conley.py`, `diff_diff/linalg.py`, `diff_diff/estimators.py`, `diff_diff/twfe.py`). **(1) DiD support (#118):** `DifferenceInDifferences(vcov_type="conley").fit(..., unit="")` is now supported. `unit` is a fit-time kwarg (NOT on `__init__`; unused unless Conley is set; not part of `get_params()` / `set_params()`) mirroring `MultiPeriodDiD.fit(unit=...)` / `TwoWayFixedEffects.fit(unit=...)`. DiD inherits the same panel block-decomposed sandwich as MPD/TWFE; on a 2-period panel it matches `MultiPeriodDiD(...).fit(..., post_periods=[1], reference_period=0)` bit-exactly. Missing `unit=`/`conley_lag_cutoff`/`conley_coords`/`conley_cutoff_km` raise `ValueError`; `survey_design=` + Conley raises `NotImplementedError` (Bertanha-Imbens 2014 follow-up); `inference="wild_bootstrap"` + Conley raises `NotImplementedError`. **(2) Combined spatial + cluster product kernel (#119):** `compute_robust_vcov(vcov_type="conley", cluster_ids=...)` / `LinearRegression(vcov_type="conley", cluster_ids=...)` / `TwoWayFixedEffects(vcov_type="conley", cluster="")` / `MultiPeriodDiD(vcov_type="conley", cluster="")` / `DifferenceInDifferences(vcov_type="conley", cluster="")` apply `K_total[i, j] = K_space(d_ij/h) · 1{c_i = c_j}`. On the panel block-decomposed path the cluster indicator multiplies BOTH the spatial sandwich AND the serial sandwich; the validator enforces that `cluster_ids` is constant within each unit across periods (the within-unit serial mask is then trivially all-ones; cross-sectional path has no such constraint). TWFE's default auto-cluster on the Conley path remains silently dropped (combining with unit-level clusters would zero out all between-unit pairs and defeat the spatial pooling); users must pass an explicit above-unit cluster (e.g. region) to opt in. DiD has no auto-cluster — the choice is fully explicit. Two limit fixtures anchor correctness (no R parity — R `conleyreg` does not support combined kernels): all-unique-clusters reduces to HC0; huge-cutoff reduces to pure within-cluster CR1. The huge-cutoff reduction is EXACT only for `conley_kernel="uniform"` (`K(u) = 1` for `|u| ≤ 1`); for `conley_kernel="bartlett"` the identity is asymptotic since `K_bartlett(u) = 1 - |u| < 1` for `u > 0`. The fixture anchor uses uniform for an exact identity check. Per-slice mask construction (NOT full n×n) preserves memory on panel paths. **(3) Sparse k-d-tree fast path (#120):** auto-activates for the spatial Bartlett meat when `n > 5_000` AND metric is `"haversine"` or `"euclidean"` AND kernel is `"bartlett"`. Builds a CSR sparse kernel matrix via `scipy.spatial.cKDTree.query_ball_tree` instead of materializing the full n×n distance matrix; haversine projects to a 3-D unit-sphere chord representation with the exact great-circle recomputed for in-range neighbors only. Bit-identity parity vs the dense path at `atol=1e-10`; R parity at `atol=1e-6` is preserved on the existing 3 panel R fixtures with the sparse path force-enabled. The bartlett-only gate is for boundary correctness — bartlett at `u=1` is exactly 0, so the sparse path safely drops at-cutoff pairs; uniform at `u=1` is 1 and would require a closed-interval query semantic that haversine chord projection cannot reliably preserve. Constants: `_CONLEY_SPARSE_N_THRESHOLD = 5_000` (auto-toggle); `_CONLEY_DENSE_WARN_N` renamed `_CONLEY_DENSE_OOM_WARN_N = 20_000` (memory exhaustion threshold for the dense fallback — independent of the sparse threshold). Private `_conley_sparse: Optional[bool]` kwarg on `_compute_conley_vcov` controls the toggle (`None` = auto, `True` = force, `False` = force dense; `True` with an unsupported kernel/metric raises). The serial component (within-unit Bartlett over time) remains dense regardless — per-unit slices are small. **(4) Callable `conley_metric` validation (#123):** result must satisfy shape `(n, n)`, finite, non-negative, symmetric to `atol=1e-10`, AND zero on the diagonal (`|d(i, i)| ≤ 1e-10`); each failure raises a targeted `ValueError` naming the violated invariant. The zero-diagonal contract is load-bearing for the Conley sandwich: the `i = j` term must reduce to the HC0 diagonal `X_i ε_i² X_i'` via `K(0) = 1`; positive self-distance would silently attenuate the HC0 contribution by `K(d_ii / h) < 1`. Built-in metrics (`"haversine"`, `"euclidean"`) satisfy this by construction. Previously, malformed callables produced opaque BLAS errors deep in the pipeline. **Tests:** `tests/test_conley_vcov.py::TestConleySparse` (12), `::TestConleySparseRParityForced` (3), `::TestConleyCluster` (10), `::TestConleyDistanceMetrics` extended (7 new); existing rejection tests flipped to behavioral; `test_did_conley_matches_mpd_post_periods_1` locks the DiD-vs-MPD bit-exact agreement. **Docs:** REGISTRY `## ConleySpatialHAC` updates: new "Combined spatial + cluster product kernel" + "Performance / scale" subsections, DiD-vs-TWFE cluster asymmetry paragraph, updated panel-API restrictions table. TODO rows 118 / 119 / 120 / 123 removed; rows 121 (Conley + survey_design / weights, Bertanha-Imbens 2014) and 122 (`SyntheticDiD(vcov_type="conley")`, spatial-block bootstrap per Politis-Romano 1994) retained for future waves. - **Conley (1999) spatial-HAC standard errors via `vcov_type="conley"`** on cross-sectional `LinearRegression` / `compute_robust_vcov` plus panel `MultiPeriodDiD` / `TwoWayFixedEffects` (Phases 1 and 2 of the spillover-conley initiative). **Cross-sectional contract:** `conley_coords` (n × 2 array of lat/lon or projected coords), `conley_cutoff_km=` (positive finite bandwidth in km for haversine, or coord units for euclidean — REQUIRED, no default per the no-silent-failures contract), `conley_metric="haversine"|"euclidean"|callable` (default `"haversine"`; great-circle uses Earth's mean radius 6371.01 km matching R `conleyreg`), `conley_kernel="bartlett"|"uniform"` (default `"bartlett"`; both kernels emit a `UserWarning` if the resulting meat has a materially negative eigenvalue — neither the radial 1-D Bartlett nor the uniform kernel is formally PSD-guaranteed; Conley 1999's explicit PSD formula is the 2-D separable lattice product window at Eq 3.14). Cross-sectional variance estimator `Var̂(β) = (X'X)^{-1} · ( Σ_{i,j} K(d_ij/h) · X_i ε_i ε_j X_j' ) · (X'X)^{-1}` (Conley 1999 Eq 4.2). **Panel contract (Phase 2, new):** Three new co-required kwargs `conley_time` (n-length array), `conley_unit` (n-length array), and `conley_lag_cutoff=` (non-negative; 0 means within-period spatial only, no serial component) switch into the **block-decomposed panel sandwich** that matches R `conleyreg` with `lag_cutoff > 0`: `XeeX_total = Σ_t (within-period spatial sandwich) + Σ_u (within-unit Bartlett temporal sandwich, lag ∈ {1..L}, same-time excluded)`. This is NOT a multiplicative product kernel — verified empirically against `conleyreg::time_dist` and `XeeXhC` at ~1e-14 on the panel parity fixtures. The temporal kernel is hardcoded Bartlett `(1 - |lag|/(L+1))` regardless of `conley_kernel`, mirroring `conleyreg::time_dist.cpp`; documented as a `Note (deviation from R-symmetric API)` in REGISTRY. **Panel estimator wire-up (Phase 2):** `MultiPeriodDiD(vcov_type="conley", conley_lag_cutoff=...).fit(..., unit=...)` and `TwoWayFixedEffects(vcov_type="conley", conley_lag_cutoff=...).fit(..., unit=...)` lift the Phase 1 fit-time rejection; the `conley_time` and `conley_unit` arrays are auto-derived from the existing `time` and `unit` column-name arguments at fit-time. `DifferenceInDifferences(vcov_type="conley")` is also supported (Wave A #118 in this release; see the Wave A entry above) — pass `unit=` as a fit-time kwarg to `DiD.fit(...)`. **Other constraints (Phase 1, unchanged):** `SyntheticDiD(vcov_type="conley")` raises `TypeError` (uses bootstrap variance, not analytical sandwich); `set_params` mirrors the constructor rejection. `vcov_type="conley"` + `weights=` / `survey_design=` raises `NotImplementedError` (Bertanha-Imbens 2014 weighted-Conley deferred to a follow-up PR). `vcov_type="conley"` + explicit `cluster_ids=` is supported via the combined spatial + cluster product kernel (Wave A #119; see the Wave A entry above). TWFE's default auto-cluster on the Conley path is silently dropped (combining with unit-level clusters would defeat the spatial pooling); users opt into the combined kernel by passing an explicit above-unit cluster. `inference="wild_bootstrap"` + Conley raises (incompatible inference modes). A sparse k-d-tree fast path auto-activates for the spatial Bartlett meat when `n > 5_000` with bartlett kernel and haversine/euclidean metric (Wave A #120); the dense fallback still emits an OOM `UserWarning` at `n > 20_000`. **Implementation:** Helpers live in `diff_diff/conley.py` (`_haversine_km`, `_pairwise_distance_matrix`, `_bartlett_kernel`, `_uniform_kernel`, `_validate_conley_kwargs`, `_compute_conley_vcov` — the validator and sandwich helper now accept keyword-only `time` / `unit` / `lag_cutoff` for the panel path); `compute_robust_vcov` in `diff_diff/linalg.py` threads the new kwargs through. **R `conleyreg` parity (Düsterhöft 2021, CRAN v0.1.9)** on **six** benchmark fixtures (`benchmarks/data/r_conleyreg_conley_golden.json`, regenerable via `benchmarks/R/generate_conley_golden.R`): 3 cross-sectional (Phase 1) + 3 new panel fixtures (`panel_haversine_lag1`, `panel_haversine_lag2`, `panel_lat_lon_realistic_lag1`; n_units × T = 60×3, 80×5, 100×4 at lag={1,2,1}); observed max abs diff ~5.7e-16. Earth radius 6371.01 km matches `conleyreg::haversine_dist`. Test file `tests/test_conley_vcov.py` skips parity cleanly when the JSON is absent. New REGISTRY section `## ConleySpatialHAC`. Subsequent phases of the spillover-conley initiative (ring-indicator spillover-aware DiD per Butts 2021; survey-design / replicate-weight support; `SyntheticDiD` Conley path) are tracked in `TODO.md` under "Tech Debt from Code Reviews" → spillover-conley rows. diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 943aef80..da7eb0e3 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2574,7 +2574,7 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in - [x] Phase 5 (wave 1, PR #402): `llms-full.txt` HeterogeneousAdoptionDiD section + result-class blocks + `## HAD Pretests` index + Choosing-an-Estimator row landed; constructor / fit() parameter names are regression-locked against `inspect.signature(HeterogeneousAdoptionDiD.__init__)` and `HeterogeneousAdoptionDiD.fit` for parameter-name presence (parameter defaults and the non-return parameter type annotations remain unpinned; the `fit()` return-type union is locked BOTH at the source-code level AND at the test level by `TestFitReturnAnnotation`); result-class field tables enumerate every public dataclass field (regression-tested via `dataclasses.fields()`); `llms-practitioner.txt` Step 4 decision tree distinguishes ContinuousDiD (per-dose ATT(d), needs never-treated) from HeterogeneousAdoptionDiD (WAS, universal-rollout-compatible). - [x] Phase 5 (partial): README catalog one-liner, bundled `llms.txt` `## Estimators` entry, `docs/api/had.rst` (autoclass for the three classes), and `docs/references.rst` citation landed in PR #372 docs refresh. - [x] Phase 5 (wave 2 first slice, PR #409): T21 HAD pretest workflow tutorial (`docs/tutorials/21_had_pretest_workflow.ipynb`) — composite pre-test walkthrough for `did_had_pretest_workflow`. Uses a `Uniform[$0.01K, $50K]` dose-distribution variant of T20's brand-campaign panel (true support strictly positive but near-zero, chosen so QUG fails-to-reject `H0: d_lower = 0` in finite sample). Walks through `aggregate="overall"` (Steps 1 + 3 only, verdict explicitly flags Step 2 deferral) and upgrades to `aggregate="event_study"` (joint pre-trends Stute + joint homogeneity Stute close the gap). Side panel exercises both `yatchew_hr_test` null modes (`linearity` vs `mean_independence`). Companion drift-test file `tests/test_t21_had_pretest_workflow_drift.py` (16 tests pinning panel composition, both verdict pivots, structural anchors, deterministic stats, bootstrap p-value tolerance bands per backend, and `HAD(design="auto")` resolution to `continuous_at_zero` on this panel). -- [x] Phase 5 (wave 2 second slice): T22 weighted/survey HAD tutorial (`docs/tutorials/22_had_survey_design.ipynb`) - shipped as the follow-up to PR #432. End-to-end walkthrough of `HeterogeneousAdoptionDiD` + `did_had_pretest_workflow` under `SurveyDesign(weights, strata, psu, fpc)` on a BRFSS-shape state-rollout panel (5 strata x 6 PSUs/stratum x 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum). Companion drift-test file `tests/test_t22_had_survey_design_drift.py` (25 tests pinning panel composition, naive-vs-survey SE inflation direction, design auto-detection, event-study cband-vs-pointwise width ordering, `_QUG_DEFERRED_SUFFIX` substring on `report.verdict` for both overall and event-study paths, the distinct `report.summary()` QUG-skip note on the event-study path, deterministic Yatchew sigma2_*, and bootstrap p-value tolerance bands per `feedback_strata_bootstrap_path_divergence` (>= 0.25 abs)). +- [x] Phase 5 (wave 2 second slice): T22 weighted/survey HAD tutorial (`docs/tutorials/22_had_survey_design.ipynb`) - shipped as the follow-up to PR #432. End-to-end walkthrough of `HeterogeneousAdoptionDiD` + `did_had_pretest_workflow` under `SurveyDesign(weights, strata, psu, fpc)` on a BRFSS-shape state-rollout panel (5 strata x 6 PSUs/stratum x 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum). Companion drift-test file `tests/test_t22_had_survey_design_drift.py` (30 tests pinning panel composition, naive-vs-survey SE inflation direction, design auto-detection, event-study cband-vs-pointwise width ordering, `_QUG_DEFERRED_SUFFIX` substring on `report.verdict` for both overall and event-study paths, the distinct `report.summary()` QUG-skip note on the event-study path, deterministic Yatchew sigma2_*, bootstrap p-value tolerance bands per `feedback_strata_bootstrap_path_divergence` (>= 0.25 abs), workflow-surface separation between overall and event-study paths, and the weighted point-estimation contract via the `_fit_continuous` algebraic identity). - [ ] Documentation of non-testability of Assumptions 5 and 6. - [ ] Warnings for staggered treatment timing (redirect to `ChaisemartinDHaultfoeuille`). - [ ] `NotImplementedError` phase pointer when `covariates=` is passed (Theorem 6 future work). diff --git a/docs/tutorials/22_had_survey_design.ipynb b/docs/tutorials/22_had_survey_design.ipynb index 01e9cbe2..c0b479c8 100644 --- a/docs/tutorials/22_had_survey_design.ipynb +++ b/docs/tutorials/22_had_survey_design.ipynb @@ -307,7 +307,7 @@ "weighted `dy_mean` AND in the denominator `E_w[D - d_lower]`. On\n", "this DGP the weight CV (~0.30) and the dose-distribution shape do\n", "not co-vary strongly enough to shift the boundary slope materially,\n", - "so the two ATTs land within bootstrap noise of each other. The SE\n", + "so the two ATTs are numerically close on this DGP. The SE\n", "and CI differ because the survey path additionally folds the PSU\n", "clustering and FPC into the variance via the Binder TSL composition.\n" ] From ba99bc98212d9f7e52e983cbdac2a3e0e4c39618 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 15 May 2026 06:54:16 -0400 Subject: [PATCH 04/15] Address PR #440 R2 review (1 P1 + 1 P3) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P1 — T22 §4 + §5 prose claimed pre-period event-study horizons have "dose ... zero across the board" and used that premise to explain the SE pattern. Per paper Appendix B.2 (`had.py:1827-1838`) the event-study contract reuses `D_{g,F}` as the SINGLE dose regressor for every horizon, including pre-period placebos — the regressor is NOT zero on the pre-period. Rewrote both passages to base the SE-pattern explanation on the outcome-side `ΔY_{g,t} = Y_{g,t} - Y_{g,F-1}`: placebos have small ΔY (within-pre noise only) so the local-linear fits low residual variance and reads small SEs; post-period horizons have ΔY that scales with `slope * D_{g,F}` plus noise, so residual variance is larger and SEs are larger. P3 — `test_survey_att_matches_weighted_denominator_contract` previously only constructed an implied `tau_bc` from the fitted att/den_w and checked finiteness + scale, which is too weak to be called an "identity lock" per the CHANGELOG/REGISTRY prose. Renamed to `test_survey_att_matches_weighted_local_linear_identity` and strengthened to three concrete identities: 1. `survey.effective_dose_mean == np.average(d - d_lower, w)` bit-equal (1e-10). 2. Direct call to `bias_corrected_local_linear` with HAD defaults (`kernel="epanechnikov"`, `alpha=0.05`, `boundary=0`) recovers the SAME `tau_bc` boundary limit the estimator used. 3. `att = (mean_w(dy) - tau_bc) / den_w` matches the fitted `survey.att` to ~1e-13 (verified locally; same float ops on the same inputs). The CHANGELOG/REGISTRY prose now genuinely matches the test coverage. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/tutorials/22_had_survey_design.ipynb | 26 +++++--- tests/test_t22_had_survey_design_drift.py | 78 ++++++++++++++--------- 2 files changed, 64 insertions(+), 40 deletions(-) diff --git a/docs/tutorials/22_had_survey_design.ipynb b/docs/tutorials/22_had_survey_design.ipynb index c0b479c8..49fec864 100644 --- a/docs/tutorials/22_had_survey_design.ipynb +++ b/docs/tutorials/22_had_survey_design.ipynb @@ -471,13 +471,18 @@ "id": "1200aaee", "metadata": {}, "source": [ - "The per-horizon table reflects the IF-spread argument: pre-period\n", - "horizons (where the dose is zero across the board, so the local-linear\n", - "at d_lower has nothing to localize to) read SE off mostly the unit FE\n", - "plus the PSU shock variance, and the survey machinery has more\n", - "correlation to amplify. Post-period horizons aggregate across all\n", - "states' post-period contributions and produce a moderate inflation as\n", - "well. The takeaway for the practitioner: **for HAD specifically, do\n", + "The per-horizon table reflects the IF-spread argument. Per paper\n", + "Appendix B.2 the event-study uses `D_{g,F}` (the period-F dose) as\n", + "the SAME dose regressor for every horizon - pre-period placebos and\n", + "post-period horizons alike. Per-horizon SEs differ because the\n", + "outcome side `ΔY_{g,t} = Y_{g,t} - Y_{g,F-1}` does. Pre-period\n", + "placebos have small `ΔY` (no treatment signal — just within-pre noise\n", + "between t and F-1) so the local-linear at d_lower fits low residual\n", + "variance and reads small SEs. Post-period horizons have `ΔY` that\n", + "scales with the treatment effect (slope * `D_{g,F}` plus noise), so\n", + "the local-linear residual variance is larger and the SE is larger.\n", + "The survey machinery folds PSU clustering into both surfaces and\n", + "produces a moderate per-horizon inflation on top. The takeaway for the practitioner: **for HAD specifically, do\n", "not eyeball SE inflation against your DEFF expectation from\n", "regression-coefficient examples**. Read the design-aware SE off the\n", "fit object and report it; the magnitude of the survey-vs-naive ratio\n", @@ -567,8 +572,11 @@ "with both pointwise and sup-t coverage. The sup-t band is a few\n", "percent wider than the pointwise interval; that gap is the multiple-\n", "horizon multiplicity correction. Per-horizon SEs are noticeably\n", - "larger post-launch than pre - the post-period dose variation is what\n", - "drives variance under HAD's identification.\n" + "larger post-launch than pre because `ΔY_{g,t}` post-launch carries\n", + "the treatment-effect contribution (cross-unit variation in\n", + "`slope * D_{g,F}`), driving larger local-linear residual variance\n", + "than the pre-period placebos see (placebos use the same `D_{g,F}`\n", + "regressor, but `ΔY_{g,t}` is noise-only — see section 4).\n" ] }, { diff --git a/tests/test_t22_had_survey_design_drift.py b/tests/test_t22_had_survey_design_drift.py index 02e5b10d..9c65e5ce 100644 --- a/tests/test_t22_had_survey_design_drift.py +++ b/tests/test_t22_had_survey_design_drift.py @@ -622,20 +622,28 @@ def test_survey_att_differs_from_naive_att(naive_overall_result, survey_overall_ ) -def test_survey_att_matches_weighted_denominator_contract( +def test_survey_att_matches_weighted_local_linear_identity( panel_2p: pd.DataFrame, survey_overall_result ): - """The survey-aware ATT divided by the (weighted) sample weighted- - denominator should recover the bias-corrected `tau_bc` boundary - limit (`mean_w(dy) - tau_bc = att * den_w`). This is the - algebraic identity from `_fit_continuous` (`had.py:3804-3810`): - `att = (dy_mean - tau_bc) / den`. Locks the weighted-numerator / - weighted-denominator contract that §3 narrative now correctly - describes.""" - # Re-build the per-state first-difference table the estimator sees - # internally, then compute the weighted denominator and check the - # algebraic identity holds. d_lower is the min observed dose on - # the post-period. + """Locks the actual `_fit_continuous` algebraic identity end-to-end + by recomputing every intermediate against the shipped estimator: + + 1. `effective_dose_mean` on the result equals + `np.average(d - d_lower, weights=w)` exactly. + 2. Calling `bias_corrected_local_linear` directly with the same + inputs the HAD class uses (`d_reg = d_post - d_lower`, `dy`, + `weights`, default `kernel="epanechnikov"`, `alpha=0.05`, + `boundary=0.0`) recovers the SAME `tau_bc` boundary limit the + estimator used. + 3. `att = (mean_w(dy) - tau_bc) / den_w` matches the fitted ATT + to ~1e-13 (FP precision; same float ops, same data). + + Per CI AI review R2 P3: prior version of this test only checked + finiteness + scale of an inverted `implied_tau_bc`; that is too + weak to be called an "identity lock." This version actually + re-derives every step.""" + from diff_diff.local_linear import bias_corrected_local_linear + p2 = panel_2p.copy() pre = p2[p2["period"] == 1].set_index("state_id") post = p2[p2["period"] == 2].set_index("state_id") @@ -646,24 +654,32 @@ def test_survey_att_matches_weighted_denominator_contract( d_post = post["spend_k"].to_numpy(dtype=float) weights = post["weight"].to_numpy(dtype=float) d_lower = float(d_post.min()) - d_shifted = d_post - d_lower - den_w = float(np.average(d_shifted, weights=weights)) - # ATT * den_w should equal dy_mean_w - tau_bc + d_reg = d_post - d_lower + + # Identity 1: effective_dose_mean == weighted mean of d - d_lower + den_w = float(np.average(d_reg, weights=weights)) + assert abs(den_w - survey_overall_result.effective_dose_mean) < 1e-10, ( + den_w, + survey_overall_result.effective_dose_mean, + ) + assert abs(survey_overall_result.d_lower - d_lower) < 1e-12, ( + survey_overall_result.d_lower, + d_lower, + ) + + # Identity 2: direct bias_corrected_local_linear call recovers the + # SAME tau_bc the estimator used (HAD defaults: kernel epanechnikov, + # alpha 0.05, boundary 0). + bc = bias_corrected_local_linear(d_reg, dy, weights=weights, alpha=0.05, kernel="epanechnikov") + tau_bc = float(bc.estimate_bias_corrected) + assert np.isfinite(tau_bc), tau_bc + + # Identity 3: att = (dy_mean_w - tau_bc) / den_w, bit-equal modulo + # FP precision. dy_mean_w = float(np.average(dy, weights=weights)) - implied_tau_bc = dy_mean_w - survey_overall_result.att * den_w - # tau_bc is the bias-corrected boundary limit at d=d_lower; on the - # linear DGP with att_intercept=0 this is near zero (the - # pre-vs-post difference at the lightest-touch state is mostly - # time trend + noise). Lock it to a finite range — the exact value - # depends on the local-linear bandwidth selection at seed=87 but - # is a deterministic function of the panel. - assert np.isfinite(implied_tau_bc), implied_tau_bc - # The locked algebraic identity: ATT must satisfy - # ATT = (dy_mean_w - tau_bc) / den_w. The implied tau_bc here is - # what the estimator computed. Sanity-check the order of magnitude: - # |tau_bc| should be on the scale of |dy_mean_w| (both are - # outcome-scale, post - pre differences). - assert abs(implied_tau_bc) < 2.0 * abs(dy_mean_w) + 100.0, ( - implied_tau_bc, - dy_mean_w, + manual_att = (dy_mean_w - tau_bc) / den_w + assert abs(manual_att - survey_overall_result.att) < 1e-10, ( + manual_att, + survey_overall_result.att, + manual_att - survey_overall_result.att, ) From 16d7792e81bf1e3d5f8aaa6194beaecdf58bcc19 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 15 May 2026 07:00:18 -0400 Subject: [PATCH 05/15] =?UTF-8?q?Address=20PR=20#440=20R3=20review=20(1=20?= =?UTF-8?q?P1=20=E2=80=94=20same=20class,=20different=20=C2=A74=20site)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit R2 fixed two §4/§5 sites that misframed event-study mechanics. R3 reviewer caught a third site at the §4 cell that introduces the per-horizon SE-ratio table (`L422-426`). It said the post-period horizons "aggregate across all post-period observations" and the variance is "read off the full panel" — wrong. HAD event-study fits each horizon as a SEPARATE local-linear on that horizon's first differences (`ΔY_{g,t} = Y_{g,t} - Y_{g,F-1}`) against the common `D_{g,F}` regressor (paper Appendix B.2; `diff_diff/had.py:4298-4451`); the pointwise per-horizon SE reads each horizon's own residual variance, not a panel aggregate. Also softened the "per-horizon SE ratio should be larger than the overall ratio" claim — the empirical per-horizon ratios on the locked seed are mixed ([1.087, 1.0, 0.816, 1.203, 1.127, 1.085, 1.126]), some larger and some smaller than the overall ~1.10x. The revised text frames this as an empirical observation of how PSU correlation interacts with each horizon's `ΔY` distribution, NOT a methodological guarantee. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/tutorials/22_had_survey_design.ipynb | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/docs/tutorials/22_had_survey_design.ipynb b/docs/tutorials/22_had_survey_design.ipynb index 49fec864..b503ed86 100644 --- a/docs/tutorials/22_had_survey_design.ipynb +++ b/docs/tutorials/22_had_survey_design.ipynb @@ -419,11 +419,15 @@ "and only a small share of states near the boundary, the within-PSU\n", "correlation has a small lever to act on.\n", "\n", - "Contrast with the event-study path: the post-period horizons aggregate\n", - "across **all post-period observations**, so the variance is read off\n", - "the full panel and the survey-design machinery has more cluster\n", - "correlation to amplify. The per-horizon SE ratio should be larger\n", - "than the overall SE ratio:\n" + "Contrast with the event-study path: each event-time horizon is a\n", + "**separate** local-linear fit on that horizon's first differences\n", + "`ΔY_{g,t} = Y_{g,t} - Y_{g,F-1}` against the common regressor\n", + "`D_{g,F}` (paper Appendix B.2). The pointwise per-horizon SE is\n", + "computed from each horizon's own residual variance — not aggregated\n", + "across post-period observations. So the per-horizon survey-vs-naive\n", + "ratio is an empirical property of how PSU correlation interacts with\n", + "each horizon's `ΔY` distribution on this seeded DGP, not a\n", + "method-contract guarantee. We compute it below to inspect:\n" ] }, { From 9e2b1477f1292a4d0649d2968d1724a97c79cf91 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 15 May 2026 07:06:51 -0400 Subject: [PATCH 06/15] Address PR #440 R3 P3: tighten bootstrap p-value drift bands R3 reviewer flagged that CHANGELOG/REGISTRY prose says T22 drift suite "pins bootstrap p-values with >= 0.25 abs tolerance bands", but the actual assertions were `0.10 <= p <= 0.95` (width 0.85). The prose was stronger than the test coverage. Tightened to T21-style anchored bands centered on the seed=22 captured values: - stute (overall): `0.27 <= p <= 0.57` (was 0.10-0.95; captured ~0.42) - pretrends_joint: `0.24 <= p <= 0.54` (captured ~0.39) - homogeneity_joint: `0.26 <= p <= 0.56` (captured ~0.41) Each band has abs width ~0.30, satisfying the >= 0.25 abs band contract from `feedback_strata_bootstrap_path_divergence` while catching drift in either direction (toward rejection or toward an even cleaner pass) rather than only rejecting on cross-the-line moves. Aligns the CHANGELOG/REGISTRY claim with the test coverage. Co-Authored-By: Claude Opus 4.7 (1M context) --- tests/test_t22_had_survey_design_drift.py | 26 +++++++++++++++-------- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/tests/test_t22_had_survey_design_drift.py b/tests/test_t22_had_survey_design_drift.py index 9c65e5ce..eef8f9c8 100644 --- a/tests/test_t22_had_survey_design_drift.py +++ b/tests/test_t22_had_survey_design_drift.py @@ -473,14 +473,18 @@ def test_overall_report_all_pass_under_null(overall_report): def test_overall_report_stute_fails_to_reject(overall_report): - """Stute CvM fails-to-reject linearity. Bootstrap-derived p-value - pinned with abs band >= 0.25 per - ``feedback_strata_bootstrap_path_divergence`` (stratified Mammen - multiplier paths reduce effective dofs vs non-strata).""" + """Stute CvM fails-to-reject linearity. Anchored bootstrap-p band + centered on the seed=22 captured value (~0.42) with abs width + ~0.30 per ``feedback_strata_bootstrap_path_divergence`` + (stratified Mammen multiplier reduces effective dofs vs + non-strata; PR #432 commit ``aef07020`` had to relax bit-equality + on this code path). Drift either toward rejection or toward an + even cleaner pass flags the §6 prose as stale rather than + silently passing.""" assert overall_report.stute is not None assert overall_report.stute.reject is False p = float(overall_report.stute.p_value) - assert 0.10 <= p <= 0.95, p + assert 0.27 <= p <= 0.57, p def test_overall_report_yatchew_fails_to_reject(overall_report): @@ -543,8 +547,12 @@ def test_event_study_report_homogeneity_horizons_correct(event_study_report): def test_event_study_report_pretrends_and_homogeneity_fail_to_reject(event_study_report): """Both joint pretrends and joint homogeneity fail-to-reject under - the linear-DGP null. Bootstrap-derived p-values pinned with abs - band >= 0.25 (same rationale as Stute overall).""" + the linear-DGP null. Anchored bootstrap-p bands centered on the + seed=22 captured values (pretrends ~0.39, homogeneity ~0.41) with + abs width ~0.30 per + ``feedback_strata_bootstrap_path_divergence`` (same rationale as + Stute overall). Tighter than 0.10-0.95: catches drift in either + direction rather than only rejecting on cross-the-line moves.""" pj = event_study_report.pretrends_joint hj = event_study_report.homogeneity_joint assert pj is not None and hj is not None @@ -552,8 +560,8 @@ def test_event_study_report_pretrends_and_homogeneity_fail_to_reject(event_study assert hj.reject is False p_pre = float(pj.p_value) p_hom = float(hj.p_value) - assert 0.10 <= p_pre <= 0.95, p_pre - assert 0.10 <= p_hom <= 0.95, p_hom + assert 0.24 <= p_pre <= 0.54, p_pre + assert 0.26 <= p_hom <= 0.56, p_hom # ============================================================================ From 573200cf2bc23391313ccfc0b3dd511ab9c4af27 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 15 May 2026 07:13:53 -0400 Subject: [PATCH 07/15] Address PR #440 R4 review (1 P2 + 2 P3) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P2 — T22 §3 said "Both fits use the same weighted local-linear estimator at d_lower". `_fit_continuous` only switches to weighted moments when `weights_arr is not None` (`had.py:3747-3760`, `:3803-3808`); the naive fit uses unweighted moments. Rewrote §3 to say "same local-linear estimator family" with explicit per-fit moment-form distinction (naive: unweighted; survey: weighted via `bias_corrected_local_linear(..., weights=weights_arr)` plus weighted `np.average` for `dy_mean` and the denominator). P3 #1 — CHANGELOG/REGISTRY/test docstring/comments said the bootstrap p-value pins use ">= 0.25 abs tolerance bands", but the R3 tightened bands are width 0.30 total (± 0.15 around seeded centers). The "abs tolerance >= 0.25" wording is ambiguous between half-width and total-width interpretations. Updated all three surfaces to "anchored windows of total width 0.30 (± 0.15 around seeded centers)" so the prose is unambiguous and matches the actual assertions. P3 #2 — T22 §3 quotes "around 1.10x" SE inflation but the drift suite only checked direction (`survey.se > naive.se`). The seeded ratio could drift to 1.05 or 1.20 silently. Added `test_survey_se_inflation_ratio_in_band` asserting `1.00 <= ratio <= 1.20` — locks the seed=87 captured ratio (~1.0985) tightly enough to flag drift but loosely enough to not flake on RNG-path differences. Bumped CHANGELOG/REGISTRY test count from 30 → 31 to match. Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 2 +- docs/methodology/REGISTRY.md | 2 +- docs/tutorials/22_had_survey_design.ipynb | 13 +++++++++---- tests/test_t22_had_survey_design_drift.py | 20 ++++++++++++++++---- 4 files changed, 27 insertions(+), 10 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 4632946a..23f5e1f9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Added -- **Tutorial 22: Survey-Weighted HAD** (`docs/tutorials/22_had_survey_design.ipynb`) — end-to-end walkthrough of `HeterogeneousAdoptionDiD` + `did_had_pretest_workflow` on a BRFSS-shape stratified household-survey panel (5 strata × 6 PSUs/stratum × 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum; PSU × period interaction shocks injected so cluster correlation survives DiD first-differencing). Demonstrates the `SurveyDesign(strata=...)` path through the Stute pretest family that the previous `[Unreleased]` entry unblocked. Eight numbered sections: motivation; panel + in-notebook helper for attaching survey columns to a HAD panel; naive vs survey-aware headline fit with a side-by-side ATT / SE / CI table (~10% SE inflation, sign-only direction asserted); a dedicated section explaining why the SE inflation is modest for HAD specifically (WAS-d_lower IF concentration at the boundary vs full-panel regression coefficients); event-study fit with sup-t cband under the survey design (per-horizon table + matplotlib gated plot); pretest workflow on both `aggregate="overall"` and `aggregate="event_study"` paths walking the Phase 4.5 C0 QUG-deferred verdict suffix and the now-supported stratified-clustered Stute multiplier bootstrap; "Communicating to Leadership" two-paragraph template (executive + methodologist); Extensions + Summary Checklist surfacing the still-deferred `lonely_psu='adjust'` + singleton-strata, replicate-weight designs, and the permanent QUG-under-survey C0 deferral. Companion drift-test file `tests/test_t22_had_survey_design_drift.py` (30 tests across 7 groups: panel + survey composition with deterministic exact pins; naive-vs-survey headline with sign-only SE-inflation anchor; event-study cband-vs-pointwise ordering and post/pre coverage; pretest overall path with `_QUG_DEFERRED_SUFFIX` lock and Yatchew `sigma2_*` deterministic pins; pretest event-study path with the SAME `_QUG_DEFERRED_SUFFIX` lock plus a SEPARATE substring lock on `report.summary()` for the L736 QUG-skip note; workflow-surface separation locking that overall has Stute+Yatchew while event-study has joint pretrends/joint linearity with `yatchew=None` and `stute=None`; and weighted point-estimation contract anchoring `survey.att != naive.att` plus the algebraic identity `att = (dy_mean_w - tau_bc) / den_w` from `_fit_continuous`). Bootstrap p-value pins use abs tolerance bands `>= 0.25` per `feedback_strata_bootstrap_path_divergence` (stratified Mammen multiplier paths reduce effective dofs vs non-strata; PR #432 commit `aef07020` already had to relax bit-equality bands on this code path). T20 and T21 "Extensions" bullets updated with forward-pointers to T22; `docs/practitioner_decision_tree.rst` HAD universal-rollout and survey sections each gain a `.. tip::` cross-link to T22 (adjacent to T20 / T17, NOT displacing); `docs/api/had.rst` gains a "Survey-aware fit" cross-reference; `docs/survey-roadmap.md` gains a "Phase 4.5 C ✅ Shipped" entry; bundled `diff_diff/guides/llms.txt` and `llms-practitioner.txt` carry T22 inventory entries (the `llms-full.txt` reference guide is left as a follow-up to keep T22 PR scope tight); `docs/doc-deps.yaml` wires T22 as a dependent of both `had.py` and `had_pretests.py`. Closes the Phase 5 (wave 2 second slice) tutorial gap; the realistic survey-weighted HAD workflow on BRFSS / CPS / NHANES / ACS-shaped designs is now end-to-end documented for practitioners. +- **Tutorial 22: Survey-Weighted HAD** (`docs/tutorials/22_had_survey_design.ipynb`) — end-to-end walkthrough of `HeterogeneousAdoptionDiD` + `did_had_pretest_workflow` on a BRFSS-shape stratified household-survey panel (5 strata × 6 PSUs/stratum × 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum; PSU × period interaction shocks injected so cluster correlation survives DiD first-differencing). Demonstrates the `SurveyDesign(strata=...)` path through the Stute pretest family that the previous `[Unreleased]` entry unblocked. Eight numbered sections: motivation; panel + in-notebook helper for attaching survey columns to a HAD panel; naive vs survey-aware headline fit with a side-by-side ATT / SE / CI table (~10% SE inflation, sign-only direction asserted); a dedicated section explaining why the SE inflation is modest for HAD specifically (WAS-d_lower IF concentration at the boundary vs full-panel regression coefficients); event-study fit with sup-t cband under the survey design (per-horizon table + matplotlib gated plot); pretest workflow on both `aggregate="overall"` and `aggregate="event_study"` paths walking the Phase 4.5 C0 QUG-deferred verdict suffix and the now-supported stratified-clustered Stute multiplier bootstrap; "Communicating to Leadership" two-paragraph template (executive + methodologist); Extensions + Summary Checklist surfacing the still-deferred `lonely_psu='adjust'` + singleton-strata, replicate-weight designs, and the permanent QUG-under-survey C0 deferral. Companion drift-test file `tests/test_t22_had_survey_design_drift.py` (31 tests across 7 groups: panel + survey composition with deterministic exact pins; naive-vs-survey headline with sign-only SE-inflation anchor; event-study cband-vs-pointwise ordering and post/pre coverage; pretest overall path with `_QUG_DEFERRED_SUFFIX` lock and Yatchew `sigma2_*` deterministic pins; pretest event-study path with the SAME `_QUG_DEFERRED_SUFFIX` lock plus a SEPARATE substring lock on `report.summary()` for the L736 QUG-skip note; workflow-surface separation locking that overall has Stute+Yatchew while event-study has joint pretrends/joint linearity with `yatchew=None` and `stute=None`; and weighted point-estimation contract anchoring `survey.att != naive.att` plus the algebraic identity `att = (dy_mean_w - tau_bc) / den_w` from `_fit_continuous`). Bootstrap p-value pins use anchored windows of total width 0.30 (± 0.15 around seeded centers) per `feedback_strata_bootstrap_path_divergence` (stratified Mammen multiplier paths reduce effective dofs vs non-strata; PR #432 commit `aef07020` already had to relax bit-equality bands on this code path). T20 and T21 "Extensions" bullets updated with forward-pointers to T22; `docs/practitioner_decision_tree.rst` HAD universal-rollout and survey sections each gain a `.. tip::` cross-link to T22 (adjacent to T20 / T17, NOT displacing); `docs/api/had.rst` gains a "Survey-aware fit" cross-reference; `docs/survey-roadmap.md` gains a "Phase 4.5 C ✅ Shipped" entry; bundled `diff_diff/guides/llms.txt` and `llms-practitioner.txt` carry T22 inventory entries (the `llms-full.txt` reference guide is left as a follow-up to keep T22 PR scope tight); `docs/doc-deps.yaml` wires T22 as a dependent of both `had.py` and `had_pretests.py`. Closes the Phase 5 (wave 2 second slice) tutorial gap; the realistic survey-weighted HAD workflow on BRFSS / CPS / NHANES / ACS-shaped designs is now end-to-end documented for practitioners. - **HAD pretest workflow: stratified survey-design support (Phase 4.5 C continuation).** Lifts the `NotImplementedError` gate on `SurveyDesign(strata=...)` in `stute_test` (`had_pretests.py:1927-1940` pre-PR) and `stute_joint_pretest` (`:3259-3271` pre-PR), and by inheritance in `joint_pretrends_test`, `joint_homogeneity_test`, and `did_had_pretest_workflow` (the wrappers delegate to the joint Stute helper). Implements a documented synthesis of clustered-wild-bootstrap ingredients (Cameron-Gelbach-Miller 2008 cluster-level multipliers; Davidson-Flachaire 2008 wild-bootstrap centering; Djogbenou-MacKinnon-Nielsen 2019 cluster-wild consistency for nonlinear functionals; Kreiss-Lahiri 2012 within-block centering analogy; Wu 1986 / Liu 1988 Bessel small-sample correction) — no single paper covers the exact composition for the stratified Stute CvM functional. The recipe: within-stratum demean + `sqrt(n_h/(n_h-1))` Bessel rescale applied to the PSU multipliers `psu_mults` BEFORE the per-obs broadcast `eta_obs = psu_mults[b, psu_col_idx]` in the wild-residual loop. Bootstrap CvM variance matches the analytical Binder-TSL stratified target `V_S = sum_h (1 - f_h) (n_h / (n_h - 1)) sum_j (psi_hj - psi_h_bar)²` exactly (the `(1 - f_h)` FPC factor was already baked in by `generate_survey_multiplier_weights_batch`; this PR bakes the remaining `(n_h / (n_h - 1))` factor and enforces within-stratum-mean-zero centering). New shared helper `bootstrap_utils.apply_stratum_centering(psu_mults, resolved_survey, psu_ids, psu_axis=...)` is called from both the new Stute path (psu_axis=1 on the multiplier matrix) AND the existing HAD sup-t event-study cband bootstrap (psu_axis=0 on the PSU-aggregated influence tensor; refactored bit-exactly from the inline block previously at `had.py:2172-2204`). Locks the algebraic identity architecturally instead of leaving parallel code blocks to drift. MC oracle consistency validated under a 4-stratum × 6-PSU/stratum stratified null DGP with weights+strata+PSU (200 seeded draws, empirical Type I at α=0.05 in `[0, 0.10]` — 3σ band; the FPC bake-in is covered separately by the helper-unit test `test_fpc_baked_in_helper_is_fpc_agnostic`); MC power validated under a known-alternative stratified DGP (rejection > 0.50). HAD sup-t event-study cband bit-parity preserved (`atol=1e-14, rtol=1e-14` on the refactored helper output + 29 existing cband tests passing post-refactor; that helper-level bit-parity test locks the axis-0 algebra). A separate wired-in regression at `tests/test_had_pretests.py::TestStuteStratifiedSurveyBootstrap::test_stute_call_sites_invoke_apply_stratum_centering` monkey-patches the helper and asserts both Stute call sites (`stute_test` at `had_pretests.py:1985` and `stute_joint_pretest` at `:3312`) invoke it with `psu_axis=1` — that test fails if either call site is disconnected (the axis-0 helper-parity test alone does not catch that case). See `docs/methodology/REGISTRY.md` § HeterogeneousAdoptionDiD — "Note (Stute stratified survey-bootstrap calibration)" for the full derivation. Remaining deferrals: `lonely_psu='adjust'` + singleton-strata (same pseudo-stratum centering gap as the HAD sup-t deviation at REGISTRY:2382) and replicate-weight designs (BRR/Fay/JK1/JKn/SDR — separate Rao-Wu / JKn bootstrap composition). Unblocks the realistic survey-weighted HAD workflow on BRFSS/CPS/NHANES/ACS-shaped designs. - **Conley (1999) Wave A mechanical extensions** on top of the Phase 1+2 sandwich (`diff_diff/conley.py`, `diff_diff/linalg.py`, `diff_diff/estimators.py`, `diff_diff/twfe.py`). **(1) DiD support (#118):** `DifferenceInDifferences(vcov_type="conley").fit(..., unit="")` is now supported. `unit` is a fit-time kwarg (NOT on `__init__`; unused unless Conley is set; not part of `get_params()` / `set_params()`) mirroring `MultiPeriodDiD.fit(unit=...)` / `TwoWayFixedEffects.fit(unit=...)`. DiD inherits the same panel block-decomposed sandwich as MPD/TWFE; on a 2-period panel it matches `MultiPeriodDiD(...).fit(..., post_periods=[1], reference_period=0)` bit-exactly. Missing `unit=`/`conley_lag_cutoff`/`conley_coords`/`conley_cutoff_km` raise `ValueError`; `survey_design=` + Conley raises `NotImplementedError` (Bertanha-Imbens 2014 follow-up); `inference="wild_bootstrap"` + Conley raises `NotImplementedError`. **(2) Combined spatial + cluster product kernel (#119):** `compute_robust_vcov(vcov_type="conley", cluster_ids=...)` / `LinearRegression(vcov_type="conley", cluster_ids=...)` / `TwoWayFixedEffects(vcov_type="conley", cluster="")` / `MultiPeriodDiD(vcov_type="conley", cluster="")` / `DifferenceInDifferences(vcov_type="conley", cluster="")` apply `K_total[i, j] = K_space(d_ij/h) · 1{c_i = c_j}`. On the panel block-decomposed path the cluster indicator multiplies BOTH the spatial sandwich AND the serial sandwich; the validator enforces that `cluster_ids` is constant within each unit across periods (the within-unit serial mask is then trivially all-ones; cross-sectional path has no such constraint). TWFE's default auto-cluster on the Conley path remains silently dropped (combining with unit-level clusters would zero out all between-unit pairs and defeat the spatial pooling); users must pass an explicit above-unit cluster (e.g. region) to opt in. DiD has no auto-cluster — the choice is fully explicit. Two limit fixtures anchor correctness (no R parity — R `conleyreg` does not support combined kernels): all-unique-clusters reduces to HC0; huge-cutoff reduces to pure within-cluster CR1. The huge-cutoff reduction is EXACT only for `conley_kernel="uniform"` (`K(u) = 1` for `|u| ≤ 1`); for `conley_kernel="bartlett"` the identity is asymptotic since `K_bartlett(u) = 1 - |u| < 1` for `u > 0`. The fixture anchor uses uniform for an exact identity check. Per-slice mask construction (NOT full n×n) preserves memory on panel paths. **(3) Sparse k-d-tree fast path (#120):** auto-activates for the spatial Bartlett meat when `n > 5_000` AND metric is `"haversine"` or `"euclidean"` AND kernel is `"bartlett"`. Builds a CSR sparse kernel matrix via `scipy.spatial.cKDTree.query_ball_tree` instead of materializing the full n×n distance matrix; haversine projects to a 3-D unit-sphere chord representation with the exact great-circle recomputed for in-range neighbors only. Bit-identity parity vs the dense path at `atol=1e-10`; R parity at `atol=1e-6` is preserved on the existing 3 panel R fixtures with the sparse path force-enabled. The bartlett-only gate is for boundary correctness — bartlett at `u=1` is exactly 0, so the sparse path safely drops at-cutoff pairs; uniform at `u=1` is 1 and would require a closed-interval query semantic that haversine chord projection cannot reliably preserve. Constants: `_CONLEY_SPARSE_N_THRESHOLD = 5_000` (auto-toggle); `_CONLEY_DENSE_WARN_N` renamed `_CONLEY_DENSE_OOM_WARN_N = 20_000` (memory exhaustion threshold for the dense fallback — independent of the sparse threshold). Private `_conley_sparse: Optional[bool]` kwarg on `_compute_conley_vcov` controls the toggle (`None` = auto, `True` = force, `False` = force dense; `True` with an unsupported kernel/metric raises). The serial component (within-unit Bartlett over time) remains dense regardless — per-unit slices are small. **(4) Callable `conley_metric` validation (#123):** result must satisfy shape `(n, n)`, finite, non-negative, symmetric to `atol=1e-10`, AND zero on the diagonal (`|d(i, i)| ≤ 1e-10`); each failure raises a targeted `ValueError` naming the violated invariant. The zero-diagonal contract is load-bearing for the Conley sandwich: the `i = j` term must reduce to the HC0 diagonal `X_i ε_i² X_i'` via `K(0) = 1`; positive self-distance would silently attenuate the HC0 contribution by `K(d_ii / h) < 1`. Built-in metrics (`"haversine"`, `"euclidean"`) satisfy this by construction. Previously, malformed callables produced opaque BLAS errors deep in the pipeline. **Tests:** `tests/test_conley_vcov.py::TestConleySparse` (12), `::TestConleySparseRParityForced` (3), `::TestConleyCluster` (10), `::TestConleyDistanceMetrics` extended (7 new); existing rejection tests flipped to behavioral; `test_did_conley_matches_mpd_post_periods_1` locks the DiD-vs-MPD bit-exact agreement. **Docs:** REGISTRY `## ConleySpatialHAC` updates: new "Combined spatial + cluster product kernel" + "Performance / scale" subsections, DiD-vs-TWFE cluster asymmetry paragraph, updated panel-API restrictions table. TODO rows 118 / 119 / 120 / 123 removed; rows 121 (Conley + survey_design / weights, Bertanha-Imbens 2014) and 122 (`SyntheticDiD(vcov_type="conley")`, spatial-block bootstrap per Politis-Romano 1994) retained for future waves. - **Conley (1999) spatial-HAC standard errors via `vcov_type="conley"`** on cross-sectional `LinearRegression` / `compute_robust_vcov` plus panel `MultiPeriodDiD` / `TwoWayFixedEffects` (Phases 1 and 2 of the spillover-conley initiative). **Cross-sectional contract:** `conley_coords` (n × 2 array of lat/lon or projected coords), `conley_cutoff_km=` (positive finite bandwidth in km for haversine, or coord units for euclidean — REQUIRED, no default per the no-silent-failures contract), `conley_metric="haversine"|"euclidean"|callable` (default `"haversine"`; great-circle uses Earth's mean radius 6371.01 km matching R `conleyreg`), `conley_kernel="bartlett"|"uniform"` (default `"bartlett"`; both kernels emit a `UserWarning` if the resulting meat has a materially negative eigenvalue — neither the radial 1-D Bartlett nor the uniform kernel is formally PSD-guaranteed; Conley 1999's explicit PSD formula is the 2-D separable lattice product window at Eq 3.14). Cross-sectional variance estimator `Var̂(β) = (X'X)^{-1} · ( Σ_{i,j} K(d_ij/h) · X_i ε_i ε_j X_j' ) · (X'X)^{-1}` (Conley 1999 Eq 4.2). **Panel contract (Phase 2, new):** Three new co-required kwargs `conley_time` (n-length array), `conley_unit` (n-length array), and `conley_lag_cutoff=` (non-negative; 0 means within-period spatial only, no serial component) switch into the **block-decomposed panel sandwich** that matches R `conleyreg` with `lag_cutoff > 0`: `XeeX_total = Σ_t (within-period spatial sandwich) + Σ_u (within-unit Bartlett temporal sandwich, lag ∈ {1..L}, same-time excluded)`. This is NOT a multiplicative product kernel — verified empirically against `conleyreg::time_dist` and `XeeXhC` at ~1e-14 on the panel parity fixtures. The temporal kernel is hardcoded Bartlett `(1 - |lag|/(L+1))` regardless of `conley_kernel`, mirroring `conleyreg::time_dist.cpp`; documented as a `Note (deviation from R-symmetric API)` in REGISTRY. **Panel estimator wire-up (Phase 2):** `MultiPeriodDiD(vcov_type="conley", conley_lag_cutoff=...).fit(..., unit=...)` and `TwoWayFixedEffects(vcov_type="conley", conley_lag_cutoff=...).fit(..., unit=...)` lift the Phase 1 fit-time rejection; the `conley_time` and `conley_unit` arrays are auto-derived from the existing `time` and `unit` column-name arguments at fit-time. `DifferenceInDifferences(vcov_type="conley")` is also supported (Wave A #118 in this release; see the Wave A entry above) — pass `unit=` as a fit-time kwarg to `DiD.fit(...)`. **Other constraints (Phase 1, unchanged):** `SyntheticDiD(vcov_type="conley")` raises `TypeError` (uses bootstrap variance, not analytical sandwich); `set_params` mirrors the constructor rejection. `vcov_type="conley"` + `weights=` / `survey_design=` raises `NotImplementedError` (Bertanha-Imbens 2014 weighted-Conley deferred to a follow-up PR). `vcov_type="conley"` + explicit `cluster_ids=` is supported via the combined spatial + cluster product kernel (Wave A #119; see the Wave A entry above). TWFE's default auto-cluster on the Conley path is silently dropped (combining with unit-level clusters would defeat the spatial pooling); users opt into the combined kernel by passing an explicit above-unit cluster. `inference="wild_bootstrap"` + Conley raises (incompatible inference modes). A sparse k-d-tree fast path auto-activates for the spatial Bartlett meat when `n > 5_000` with bartlett kernel and haversine/euclidean metric (Wave A #120); the dense fallback still emits an OOM `UserWarning` at `n > 20_000`. **Implementation:** Helpers live in `diff_diff/conley.py` (`_haversine_km`, `_pairwise_distance_matrix`, `_bartlett_kernel`, `_uniform_kernel`, `_validate_conley_kwargs`, `_compute_conley_vcov` — the validator and sandwich helper now accept keyword-only `time` / `unit` / `lag_cutoff` for the panel path); `compute_robust_vcov` in `diff_diff/linalg.py` threads the new kwargs through. **R `conleyreg` parity (Düsterhöft 2021, CRAN v0.1.9)** on **six** benchmark fixtures (`benchmarks/data/r_conleyreg_conley_golden.json`, regenerable via `benchmarks/R/generate_conley_golden.R`): 3 cross-sectional (Phase 1) + 3 new panel fixtures (`panel_haversine_lag1`, `panel_haversine_lag2`, `panel_lat_lon_realistic_lag1`; n_units × T = 60×3, 80×5, 100×4 at lag={1,2,1}); observed max abs diff ~5.7e-16. Earth radius 6371.01 km matches `conleyreg::haversine_dist`. Test file `tests/test_conley_vcov.py` skips parity cleanly when the JSON is absent. New REGISTRY section `## ConleySpatialHAC`. Subsequent phases of the spillover-conley initiative (ring-indicator spillover-aware DiD per Butts 2021; survey-design / replicate-weight support; `SyntheticDiD` Conley path) are tracked in `TODO.md` under "Tech Debt from Code Reviews" → spillover-conley rows. diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index da7eb0e3..098ff1a1 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2574,7 +2574,7 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in - [x] Phase 5 (wave 1, PR #402): `llms-full.txt` HeterogeneousAdoptionDiD section + result-class blocks + `## HAD Pretests` index + Choosing-an-Estimator row landed; constructor / fit() parameter names are regression-locked against `inspect.signature(HeterogeneousAdoptionDiD.__init__)` and `HeterogeneousAdoptionDiD.fit` for parameter-name presence (parameter defaults and the non-return parameter type annotations remain unpinned; the `fit()` return-type union is locked BOTH at the source-code level AND at the test level by `TestFitReturnAnnotation`); result-class field tables enumerate every public dataclass field (regression-tested via `dataclasses.fields()`); `llms-practitioner.txt` Step 4 decision tree distinguishes ContinuousDiD (per-dose ATT(d), needs never-treated) from HeterogeneousAdoptionDiD (WAS, universal-rollout-compatible). - [x] Phase 5 (partial): README catalog one-liner, bundled `llms.txt` `## Estimators` entry, `docs/api/had.rst` (autoclass for the three classes), and `docs/references.rst` citation landed in PR #372 docs refresh. - [x] Phase 5 (wave 2 first slice, PR #409): T21 HAD pretest workflow tutorial (`docs/tutorials/21_had_pretest_workflow.ipynb`) — composite pre-test walkthrough for `did_had_pretest_workflow`. Uses a `Uniform[$0.01K, $50K]` dose-distribution variant of T20's brand-campaign panel (true support strictly positive but near-zero, chosen so QUG fails-to-reject `H0: d_lower = 0` in finite sample). Walks through `aggregate="overall"` (Steps 1 + 3 only, verdict explicitly flags Step 2 deferral) and upgrades to `aggregate="event_study"` (joint pre-trends Stute + joint homogeneity Stute close the gap). Side panel exercises both `yatchew_hr_test` null modes (`linearity` vs `mean_independence`). Companion drift-test file `tests/test_t21_had_pretest_workflow_drift.py` (16 tests pinning panel composition, both verdict pivots, structural anchors, deterministic stats, bootstrap p-value tolerance bands per backend, and `HAD(design="auto")` resolution to `continuous_at_zero` on this panel). -- [x] Phase 5 (wave 2 second slice): T22 weighted/survey HAD tutorial (`docs/tutorials/22_had_survey_design.ipynb`) - shipped as the follow-up to PR #432. End-to-end walkthrough of `HeterogeneousAdoptionDiD` + `did_had_pretest_workflow` under `SurveyDesign(weights, strata, psu, fpc)` on a BRFSS-shape state-rollout panel (5 strata x 6 PSUs/stratum x 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum). Companion drift-test file `tests/test_t22_had_survey_design_drift.py` (30 tests pinning panel composition, naive-vs-survey SE inflation direction, design auto-detection, event-study cband-vs-pointwise width ordering, `_QUG_DEFERRED_SUFFIX` substring on `report.verdict` for both overall and event-study paths, the distinct `report.summary()` QUG-skip note on the event-study path, deterministic Yatchew sigma2_*, bootstrap p-value tolerance bands per `feedback_strata_bootstrap_path_divergence` (>= 0.25 abs), workflow-surface separation between overall and event-study paths, and the weighted point-estimation contract via the `_fit_continuous` algebraic identity). +- [x] Phase 5 (wave 2 second slice): T22 weighted/survey HAD tutorial (`docs/tutorials/22_had_survey_design.ipynb`) - shipped as the follow-up to PR #432. End-to-end walkthrough of `HeterogeneousAdoptionDiD` + `did_had_pretest_workflow` under `SurveyDesign(weights, strata, psu, fpc)` on a BRFSS-shape state-rollout panel (5 strata x 6 PSUs/stratum x 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum). Companion drift-test file `tests/test_t22_had_survey_design_drift.py` (31 tests pinning panel composition, naive-vs-survey SE inflation direction, design auto-detection, event-study cband-vs-pointwise width ordering, `_QUG_DEFERRED_SUFFIX` substring on `report.verdict` for both overall and event-study paths, the distinct `report.summary()` QUG-skip note on the event-study path, deterministic Yatchew sigma2_*, bootstrap p-value anchored windows of total width 0.30 (± 0.15 around seeded centers) per `feedback_strata_bootstrap_path_divergence`, workflow-surface separation between overall and event-study paths, and the weighted point-estimation contract via the `_fit_continuous` algebraic identity). - [ ] Documentation of non-testability of Assumptions 5 and 6. - [ ] Warnings for staggered treatment timing (redirect to `ChaisemartinDHaultfoeuille`). - [ ] `NotImplementedError` phase pointer when `covariates=` is passed (Theorem 6 future work). diff --git a/docs/tutorials/22_had_survey_design.ipynb b/docs/tutorials/22_had_survey_design.ipynb index b503ed86..b09b764c 100644 --- a/docs/tutorials/22_had_survey_design.ipynb +++ b/docs/tutorials/22_had_survey_design.ipynb @@ -301,10 +301,15 @@ "target estimand `WAS_d_lower` (Weighted Average Slope at the lower\n", "boundary). T22 fits the same configuration twice: once naive (no\n", "`survey_design` argument), once survey-aware\n", - "(`survey_design=sd`). Both fits use the same weighted local-linear\n", - "estimator at d_lower - the survey path consumes the weights in\n", - "the local-linear `tau_bc` boundary fit AND in the numerator's\n", - "weighted `dy_mean` AND in the denominator `E_w[D - d_lower]`. On\n", + "(`survey_design=sd`). Both fits use the same local-linear estimator family at d_lower,\n", + "but the moment computations in `_fit_continuous` switch to weighted\n", + "form only when weights are present (`had.py:3747-3760` for the\n", + "denominator; `:3803-3808` for `dy_mean`). The naive fit uses the\n", + "unweighted local-linear `tau_bc`, the unweighted `dy_mean`, and\n", + "the unweighted denominator `E[D - d_lower]`. The survey-aware fit\n", + "uses the WEIGHTED `tau_bc` (via `bias_corrected_local_linear(...,\n", + "weights=weights_arr)`), the weighted `np.average(dy, weights=...)`,\n", + "and the weighted `np.average(d - d_lower, weights=...)`. On\n", "this DGP the weight CV (~0.30) and the dose-distribution shape do\n", "not co-vary strongly enough to shift the boundary slope materially,\n", "so the two ATTs are numerically close on this DGP. The SE\n", diff --git a/tests/test_t22_had_survey_design_drift.py b/tests/test_t22_had_survey_design_drift.py index eef8f9c8..e5757b33 100644 --- a/tests/test_t22_had_survey_design_drift.py +++ b/tests/test_t22_had_survey_design_drift.py @@ -25,7 +25,8 @@ continuous-dose HAD panel shape T20 uses (Design 1, dose ~ Uniform[$5K, $50K], att_slope=100). DGP and seed locked at ``_scratch/t22/dev.py``. -**Bootstrap p-value pins use abs tolerance bands >= 0.25** per +**Bootstrap p-value pins use anchored windows of total width 0.30 +(± 0.15 around the seed=22 captured centers)** per ``feedback_bootstrap_drift_tests_need_backend_tolerance`` and ``feedback_strata_bootstrap_path_divergence``. Stratified Mammen multiplier paths (PR #432) reduce effective dofs vs non-strata; PR #432 @@ -392,6 +393,17 @@ def test_survey_se_strictly_inflated_vs_naive(naive_overall_result, survey_overa ) +def test_survey_se_inflation_ratio_in_band(naive_overall_result, survey_overall_result): + """Anchored band on the seeded SE-inflation ratio. T22 §3 narrative + quotes "around 1.10x" inflation; sign-only assertion above is too + weak to catch numerical drift in the magnitude (per CI AI review + R4 P3). Locks the seed=87 captured ratio (~1.0985) to a tight + window so the §3 prose can't go silently stale if the analytical + Binder/TSL composition drifts.""" + ratio = float(survey_overall_result.se / naive_overall_result.se) + assert 1.00 <= ratio <= 1.20, ratio + + def test_survey_ci_covers_truth(survey_overall_result): """Survey-aware CI covers the true slope=100.""" lo, hi = survey_overall_result.conf_int @@ -474,8 +486,8 @@ def test_overall_report_all_pass_under_null(overall_report): def test_overall_report_stute_fails_to_reject(overall_report): """Stute CvM fails-to-reject linearity. Anchored bootstrap-p band - centered on the seed=22 captured value (~0.42) with abs width - ~0.30 per ``feedback_strata_bootstrap_path_divergence`` + centered on the seed=22 captured value (~0.42) with total width + 0.30 (± 0.15) per ``feedback_strata_bootstrap_path_divergence`` (stratified Mammen multiplier reduces effective dofs vs non-strata; PR #432 commit ``aef07020`` had to relax bit-equality on this code path). Drift either toward rejection or toward an @@ -549,7 +561,7 @@ def test_event_study_report_pretrends_and_homogeneity_fail_to_reject(event_study """Both joint pretrends and joint homogeneity fail-to-reject under the linear-DGP null. Anchored bootstrap-p bands centered on the seed=22 captured values (pretrends ~0.39, homogeneity ~0.41) with - abs width ~0.30 per + total width 0.30 (± 0.15) per ``feedback_strata_bootstrap_path_divergence`` (same rationale as Stute overall). Tighter than 0.10-0.95: catches drift in either direction rather than only rejecting on cross-the-line moves.""" From b022ba11a25d91070573a3e8a599e4f9571f223a Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 15 May 2026 07:19:33 -0400 Subject: [PATCH 08/15] Address PR #440 R5 review (1 P2 + 1 P3) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P2 — T22 §4 opener said WAS_d_lower is "a weighted average of unit-specific slopes in a local-linear neighborhood at d_lower", which conflated the target parameter with its nuisance estimator. The shipped contract (REGISTRY § HAD; `had.py:21-31`) defines `WAS_{d̲} = (E[ΔY] - lim_{d↓d̲} E[ΔY | D_2 ≤ d]) / E[D_2 - d̲]` — an average slope above d_lower, NOT a neighborhood estimand. The local-linear boundary fit is one component (estimating the limit term). The leading-order variance still concentrates at the boundary because that's where the only nonparametric estimation happens, but that is a property of the variance, not the estimand. Rewrote the §4 opener to make this distinction explicit. P3 — T22 §5 attributed the survey event-study cband to "Phase 4.5 C composition", but per REGISTRY:2366-2380 the weighted event-study sup-t cband is Phase 4.5 B; Phase 4.5 C is the pretest/workflow extension demonstrated in §6. Updated to "Phase 4.5 B composition" with a one-clause note that the §6 material is the Phase 4.5 C work. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/tutorials/22_had_survey_design.ipynb | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/docs/tutorials/22_had_survey_design.ipynb b/docs/tutorials/22_had_survey_design.ipynb index b09b764c..0d46bc71 100644 --- a/docs/tutorials/22_had_survey_design.ipynb +++ b/docs/tutorials/22_had_survey_design.ipynb @@ -413,10 +413,14 @@ "source": [ "## 4. Why the SE inflation is modest for HAD\n", "\n", - "The HAD `WAS_d_lower` estimand is a weighted average of\n", - "unit-specific slopes in a **local-linear neighborhood at d_lower**.\n", - "The variance reads off the influence functions of the units in that\n", - "neighborhood, not from the full panel. With dose ~ Uniform[5, 50] and\n", + "The HAD `WAS_d_lower` estimand is the **average slope above d_lower**:\n", + "`WAS_{d̲} = (E[ΔY] - lim_{d↓d̲} E[ΔY | D_2 ≤ d]) / E[D_2 - d̲]`\n", + "(REGISTRY § HeterogeneousAdoptionDiD; `had.py:21-31`). The estimator\n", + "uses a **local-linear boundary fit** to estimate the\n", + "`lim_{d↓d_lower} E[ΔY | D_2 ≤ d]` term — the only component of the\n", + "estimand that requires nonparametric identification. The leading-\n", + "order variance is therefore dominated by the influence functions of\n", + "units near `d_lower`, NOT by the full panel. With dose ~ Uniform[5, 50] and\n", "60 states, only a handful of states sit close to d_lower ~ 5 - and\n", "those are the units whose IFs dominate `Var(WAS_d_lower)`. The\n", "PSU-level cluster correlation can amplify the variance only as much\n", @@ -509,7 +513,8 @@ "per-horizon ATT estimates plus a sup-t confidence band that adjusts\n", "for the multiple-horizon comparison. The cband is computed via a\n", "multiplier bootstrap that aggregates per-PSU IF tensor under the\n", - "survey design (Phase 4.5 C composition).\n" + "survey design (Phase 4.5 B composition; the Phase 4.5 C work\n", + "covered the survey-aware Stute pretests demonstrated in §6).\n" ] }, { From f7471ff7d85a029a222167dd51470dc5a92d5c60 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 15 May 2026 09:10:05 -0400 Subject: [PATCH 09/15] =?UTF-8?q?Address=20PR=20#440=20R6=20review=20(1=20?= =?UTF-8?q?P3=20=E2=80=94=20leadership-block=20estimand=20echo)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The §4 estimand-vs-estimator fix from R5 had a residual echo at the §7 leadership-block clause: "HAD's WAS-d_lower estimand is a local-linear at d_lower". Same conflation class — collapses the target parameter (`WAS_{d̲} = (E[ΔY] - lim_{d↓d̲} E[ΔY | D_2 ≤ d]) / E[D_2 - d̲]`, the average slope above d_lower) with the nuisance estimator (the local-linear boundary fit used to estimate the limit term). Rewrote the clause along the lines the reviewer suggested: HAD uses a local-linear boundary fit at d_lower to estimate the boundary-limit term in the WAS-d_lower formula; variance is dominated by the few states near the boundary. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/tutorials/22_had_survey_design.ipynb | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/tutorials/22_had_survey_design.ipynb b/docs/tutorials/22_had_survey_design.ipynb index 0d46bc71..14bcae62 100644 --- a/docs/tutorials/22_had_survey_design.ipynb +++ b/docs/tutorials/22_had_survey_design.ipynb @@ -755,7 +755,8 @@ "> design-based SE on the headline fit is ~10% larger than the naive\n", "> SE - smaller than the inflation a CallawaySantAnna or\n", "> LinearRegression coefficient would see at this PSU correlation,\n", - "> because HAD's WAS-d_lower estimand is a local-linear at d_lower\n", + "> because HAD uses a local-linear boundary fit at d_lower to\n", + "> estimate the boundary-limit term in the `WAS-d_lower` formula\n", "> (variance is dominated by the few states near the boundary, not\n", "> by the full panel; see section 4).\n" ] From 3e378facb2b74c23b598935c57eb3b8e58171dca Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 15 May 2026 09:18:44 -0400 Subject: [PATCH 10/15] =?UTF-8?q?Address=20PR=20#440=20R7=20review=20(1=20?= =?UTF-8?q?P3=20=E2=80=94=20non-testable=20Assumption=205/6=20caveat)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit T22 walked the Design 1 (`continuous_near_d_lower`) survey-aware workflow but never restated the non-testable Assumption 5/6 caveat that point identification of `WAS_d_lower` rests on (the QUG-under-survey deferral was the only methodology caveat T22's prose called out). This made the survey-path caveat discussion read as if QUG deferral + linearity diagnostics exhaust the methodology risk, which is not the full HAD contract. Compounding factor: the five HAD/pretest fit cells used `warnings.filterwarnings("ignore", category=UserWarning)` — a blanket filter that silently swallowed the `continuous_near_d_lower ... Assumption 5 or 6` UserWarning the library fires on every Design 1 fit. So users running the notebook saw neither the warning nor the prose explanation. Two-part fix matching the reviewer's recommendation and T20's established convention: 1. Narrow each `filterwarnings` call from the catch-all `category=UserWarning` to two specific message patterns: `r".*pweight.*"` (suppress the noisy normalization message) and `r".*continuous_near_d_lower.*Assumption.*"` (suppress the redundant Assumption 5/6 advisory on subsequent fits; the §3 first fit lets it surface naturally so users see it at least once). This mirrors T20's own pattern at `20_had_brand_campaign.ipynb` where the headline fit lets the warning fire and the second event-study cell narrowly filters it out as redundant. Other UserWarnings — notably the QUG-deferred-under-survey advisory from `did_had_pretest_workflow` — are now no longer suppressed and fire as the load-bearing user-facing methodology signal that `feedback_no_silent_failures` requires. 2. Add an Assumption 5/6 caveat note to the §3 "Reading the table" interpretation cell. Mirrors T20's L229 prose: explains that point identification of `WAS_d_lower` requires Assumption 6 (or Assumption 5 for sign only); both are about local linearity of the dose-response near `d_lower` and are not testable from data; the §6 linearity diagnostics are necessary but not sufficient; users should justify Assumption 6 from domain knowledge. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/tutorials/22_had_survey_design.ipynb | 66 ++++++++++++++++++++--- 1 file changed, 60 insertions(+), 6 deletions(-) diff --git a/docs/tutorials/22_had_survey_design.ipynb b/docs/tutorials/22_had_survey_design.ipynb index 14bcae62..200d2c38 100644 --- a/docs/tutorials/22_had_survey_design.ipynb +++ b/docs/tutorials/22_had_survey_design.ipynb @@ -345,7 +345,14 @@ "outputs": [], "source": [ "with warnings.catch_warnings():\n", - " warnings.filterwarnings(\"ignore\", category=UserWarning)\n", + " warnings.filterwarnings(\n", + " \"ignore\", message=r\".*pweight.*\", category=UserWarning\n", + " )\n", + " warnings.filterwarnings(\n", + " \"ignore\",\n", + " message=r\".*continuous_near_d_lower.*Assumption.*\",\n", + " category=UserWarning,\n", + " )\n", " naive = HAD(design=\"auto\").fit(\n", " panel_2p,\n", " outcome_col=\"screening_uptake\",\n", @@ -403,7 +410,26 @@ "inflation is modest - around 1.10x - which is smaller than the\n", "inflation a similar `weight_cv` and PSU shock would produce on, say,\n", "a CallawaySantAnna ATT. That's a feature of HAD's local-linear\n", - "estimand, not a bug. We unpack it in the next section.\n" + "estimand, not a bug. We unpack it in the next section.\n", + "\n", + "\n", + "**A note on the non-testable identifying assumption.** Design 1\n", + "(`continuous_near_d_lower`) requires **Assumption 6** from de\n", + "Chaisemartin et al. (2026) for point identification of\n", + "`WAS_d_lower`, or **Assumption 5** for sign identification only.\n", + "Both are about local linearity of the dose-response near `d_lower`\n", + "and are **not testable from data** — the linearity diagnostics\n", + "exercised in §6 (Stute, Yatchew, joint pretrends, joint\n", + "homogeneity) are necessary but not sufficient. Justify Assumption\n", + "6 from domain knowledge: is there reason to believe the marginal\n", + "effect of the next $1K of supplemental spend is roughly constant\n", + "in the $5K-$50K range? On this DGP it is, by construction; in a\n", + "real analysis, this is the load-bearing methodology caveat\n", + "alongside the QUG-under-survey deferral §6 calls out. The library\n", + "fires a `UserWarning` flagging this on every Design 1 fit; we\n", + "let it surface in the cell above for the headline fit and\n", + "narrowly filter it on subsequent fits to keep the cell output\n", + "focused." ] }, { @@ -447,7 +473,14 @@ "outputs": [], "source": [ "with warnings.catch_warnings():\n", - " warnings.filterwarnings(\"ignore\", category=UserWarning)\n", + " warnings.filterwarnings(\n", + " \"ignore\", message=r\".*pweight.*\", category=UserWarning\n", + " )\n", + " warnings.filterwarnings(\n", + " \"ignore\",\n", + " message=r\".*continuous_near_d_lower.*Assumption.*\",\n", + " category=UserWarning,\n", + " )\n", " naive_es = HAD(design=\"auto\").fit(\n", " panel,\n", " outcome_col=\"screening_uptake\",\n", @@ -525,7 +558,14 @@ "outputs": [], "source": [ "with warnings.catch_warnings():\n", - " warnings.filterwarnings(\"ignore\", category=UserWarning)\n", + " warnings.filterwarnings(\n", + " \"ignore\", message=r\".*pweight.*\", category=UserWarning\n", + " )\n", + " warnings.filterwarnings(\n", + " \"ignore\",\n", + " message=r\".*continuous_near_d_lower.*Assumption.*\",\n", + " category=UserWarning,\n", + " )\n", " es = HAD(design=\"auto\").fit(\n", " panel,\n", " outcome_col=\"screening_uptake\",\n", @@ -620,7 +660,14 @@ "outputs": [], "source": [ "with warnings.catch_warnings():\n", - " warnings.filterwarnings(\"ignore\", category=UserWarning)\n", + " warnings.filterwarnings(\n", + " \"ignore\", message=r\".*pweight.*\", category=UserWarning\n", + " )\n", + " warnings.filterwarnings(\n", + " \"ignore\",\n", + " message=r\".*continuous_near_d_lower.*Assumption.*\",\n", + " category=UserWarning,\n", + " )\n", " overall_report = did_had_pretest_workflow(\n", " panel_2p,\n", " outcome_col=\"screening_uptake\",\n", @@ -663,7 +710,14 @@ "outputs": [], "source": [ "with warnings.catch_warnings():\n", - " warnings.filterwarnings(\"ignore\", category=UserWarning)\n", + " warnings.filterwarnings(\n", + " \"ignore\", message=r\".*pweight.*\", category=UserWarning\n", + " )\n", + " warnings.filterwarnings(\n", + " \"ignore\",\n", + " message=r\".*continuous_near_d_lower.*Assumption.*\",\n", + " category=UserWarning,\n", + " )\n", " es_report = did_had_pretest_workflow(\n", " panel,\n", " outcome_col=\"screening_uptake\",\n", From e2a809d3de1cb4d8c02f8c3a5a4ee9ee50092eff Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 15 May 2026 09:24:17 -0400 Subject: [PATCH 11/15] =?UTF-8?q?Address=20PR=20#440=20R8=20review=20(1=20?= =?UTF-8?q?P3=20=E2=80=94=20headline-fit=20warning=20consistency)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit R7's narrowing filter pass applied the same two-message filter (`pweight` + `Assumption 5/6`) uniformly to all five fit cells. The §3 interpretation prose then claimed the headline fit cell "lets [the Assumption 5/6 warning] surface" — which contradicted the actual cell behavior (the warning was suppressed there too). Removed the Assumption filter from the §3 headline fit cell only, so the warning fires once on the canonical first fit (matching the prose). All four subsequent fit cells (§4 event-study ratio, §5 event-study + cband, §6 overall workflow, §6 event-study workflow) keep both filters because the warning is redundant there. Added a NB comment in the headline fit cell explaining the deliberate omission. Pattern matches T20's L229 idiom: headline fit fires the warning once, prose explains it, subsequent fits narrowly filter as redundant. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/tutorials/22_had_survey_design.ipynb | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/tutorials/22_had_survey_design.ipynb b/docs/tutorials/22_had_survey_design.ipynb index 200d2c38..7a149150 100644 --- a/docs/tutorials/22_had_survey_design.ipynb +++ b/docs/tutorials/22_had_survey_design.ipynb @@ -348,11 +348,11 @@ " warnings.filterwarnings(\n", " \"ignore\", message=r\".*pweight.*\", category=UserWarning\n", " )\n", - " warnings.filterwarnings(\n", - " \"ignore\",\n", - " message=r\".*continuous_near_d_lower.*Assumption.*\",\n", - " category=UserWarning,\n", - " )\n", + " # NB: we deliberately do NOT filter the\n", + " # `continuous_near_d_lower ... Assumption 5 or 6` warning here\n", + " # so it fires once on the headline fit (the §3 interpretation\n", + " # cell discusses it in prose). Subsequent fit cells suppress it\n", + " # as redundant.\n", " naive = HAD(design=\"auto\").fit(\n", " panel_2p,\n", " outcome_col=\"screening_uptake\",\n", From 57753eeb32c1663783415fd62d9975168860b776 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 15 May 2026 09:51:59 -0400 Subject: [PATCH 12/15] T22 consolidation pass: tighten prose, preserve methodology MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Eight rounds of CI-review iteration tightened methodology precision but left the notebook prose denser than necessary — implementation detail and version bookkeeping had crept into §3, §4, §5, and §7 alongside the pedagogical arc. This pass prunes those without regressing on any methodology contract: - §3 setup paragraph: dropped the file:line dump (`had.py:3747-3760`, `:3803-3808`) and the redundant weighted-vs-unweighted point-by-point enumeration. The three-point weight-consumption claim (`tau_bc`, weighted ΔY mean, weighted denominator) is preserved in compact form. - §3 Assumption 5/6 note: trimmed from 15 lines to 11. Kept all load-bearing content (Assumption 6 / Assumption 5; not testable from data; §6 diagnostics necessary but not sufficient; domain knowledge justification; paired-with-QUG-deferral framing). - §4 opener: restructured to lead with intuition (few states near d_lower → small lever for PSU correlation), with the formal `WAS_{d̲}` definition pushed into a `**Formal definition.**` callout. Both halves are preserved — the formal definition is unchanged in content, just demoted from the lead. - §5: dropped the "(Phase 4.5 B composition; ...)" parenthetical (internal version bookkeeping, not user-facing). - §7 methodologist block: tightened from a numbered list with two verbatim verdict quotes to a compact two-clause description of the two paths plus the shared verdict suffix quoted once. `report.yatchew` / `report.stute = None` callout on the event-study path preserved. The SE-inflation-is-modest explanation (with section 4 cross-link) preserved. Methodology preservation verified against 14 load-bearing anchors: estimand definition, Assumption 5/6 caveat, non-testability, QUG-under-survey deferral, Phase 4.5 C0 label, Stute + Yatchew surfaces, joint pretrends + homogeneity surfaces, ES-path `yatchew/stute is None`, Binder TSL composition, local-linear boundary fit description, PSU x period shock mechanism. All 14 still present in the rendered prose. 31/31 drift tests still pass (the drift suite anchors load-bearing claims via the runtime API contract, not the notebook prose, so prose tightening is structurally safe). Diff: +57/-79 (net 22-line reduction in tutorial body). Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/tutorials/22_had_survey_design.ipynb | 136 +++++++++------------- 1 file changed, 57 insertions(+), 79 deletions(-) diff --git a/docs/tutorials/22_had_survey_design.ipynb b/docs/tutorials/22_had_survey_design.ipynb index 7a149150..3a059fb9 100644 --- a/docs/tutorials/22_had_survey_design.ipynb +++ b/docs/tutorials/22_had_survey_design.ipynb @@ -295,26 +295,19 @@ "source": [ "## 3. Naive vs survey-aware headline fit\n", "\n", - "T20's headline path collapses to two periods (pre-mean vs post-mean per\n", - "state) and fits HAD with `design=\"auto\"` - the heuristic lands on\n", - "`continuous_near_d_lower` (Design 1) on this dose support, with the\n", - "target estimand `WAS_d_lower` (Weighted Average Slope at the lower\n", - "boundary). T22 fits the same configuration twice: once naive (no\n", - "`survey_design` argument), once survey-aware\n", - "(`survey_design=sd`). Both fits use the same local-linear estimator family at d_lower,\n", - "but the moment computations in `_fit_continuous` switch to weighted\n", - "form only when weights are present (`had.py:3747-3760` for the\n", - "denominator; `:3803-3808` for `dy_mean`). The naive fit uses the\n", - "unweighted local-linear `tau_bc`, the unweighted `dy_mean`, and\n", - "the unweighted denominator `E[D - d_lower]`. The survey-aware fit\n", - "uses the WEIGHTED `tau_bc` (via `bias_corrected_local_linear(...,\n", - "weights=weights_arr)`), the weighted `np.average(dy, weights=...)`,\n", - "and the weighted `np.average(d - d_lower, weights=...)`. On\n", - "this DGP the weight CV (~0.30) and the dose-distribution shape do\n", - "not co-vary strongly enough to shift the boundary slope materially,\n", - "so the two ATTs are numerically close on this DGP. The SE\n", - "and CI differ because the survey path additionally folds the PSU\n", - "clustering and FPC into the variance via the Binder TSL composition.\n" + "T20's headline path collapses to two periods (pre-mean vs post-mean\n", + "per state) and fits HAD with `design=\"auto\"` — the heuristic lands\n", + "on `continuous_near_d_lower` (Design 1), with the target estimand\n", + "`WAS_d_lower`. T22 fits the same configuration twice: once naive\n", + "(no `survey_design` argument), once survey-aware. Both fits share\n", + "the same local-linear machinery at d_lower; the survey path\n", + "additionally consumes the weights in the local-linear `tau_bc`\n", + "boundary fit, in the weighted ΔY mean, and in the weighted\n", + "denominator `E_w[D - d_lower]`. On this DGP the weight CV (~0.30)\n", + "and dose distribution do not co-vary strongly enough to shift the\n", + "boundary slope materially, so the two ATTs land close. The SE and\n", + "CI differ because the survey path folds PSU clustering and FPC\n", + "into the variance via Binder TSL.\n" ] }, { @@ -413,23 +406,19 @@ "estimand, not a bug. We unpack it in the next section.\n", "\n", "\n", - "**A note on the non-testable identifying assumption.** Design 1\n", - "(`continuous_near_d_lower`) requires **Assumption 6** from de\n", - "Chaisemartin et al. (2026) for point identification of\n", - "`WAS_d_lower`, or **Assumption 5** for sign identification only.\n", - "Both are about local linearity of the dose-response near `d_lower`\n", - "and are **not testable from data** — the linearity diagnostics\n", - "exercised in §6 (Stute, Yatchew, joint pretrends, joint\n", - "homogeneity) are necessary but not sufficient. Justify Assumption\n", - "6 from domain knowledge: is there reason to believe the marginal\n", - "effect of the next $1K of supplemental spend is roughly constant\n", - "in the $5K-$50K range? On this DGP it is, by construction; in a\n", - "real analysis, this is the load-bearing methodology caveat\n", - "alongside the QUG-under-survey deferral §6 calls out. The library\n", - "fires a `UserWarning` flagging this on every Design 1 fit; we\n", - "let it surface in the cell above for the headline fit and\n", - "narrowly filter it on subsequent fits to keep the cell output\n", - "focused." + "**A non-testable identifying assumption.** Design 1 requires\n", + "**Assumption 6** for point identification of `WAS_d_lower` (or\n", + "**Assumption 5** for sign identification only) — both are about\n", + "local linearity of the dose-response near `d_lower` and are **not\n", + "testable from data**. The §6 linearity diagnostics (Stute,\n", + "Yatchew, joint pretrends/homogeneity) are necessary but not\n", + "sufficient. Assumption 6 itself is justified from domain\n", + "knowledge (is the marginal effect of the next $1K of supplemental\n", + "spend roughly constant in the $5K-$50K range?). The library fires\n", + "a `UserWarning` on every Design 1 fit; the headline cell above\n", + "lets it surface, subsequent cells filter it as redundant. This is\n", + "the load-bearing methodology caveat alongside the QUG-under-survey\n", + "deferral (§6)." ] }, { @@ -439,21 +428,23 @@ "source": [ "## 4. Why the SE inflation is modest for HAD\n", "\n", - "The HAD `WAS_d_lower` estimand is the **average slope above d_lower**:\n", - "`WAS_{d̲} = (E[ΔY] - lim_{d↓d̲} E[ΔY | D_2 ≤ d]) / E[D_2 - d̲]`\n", - "(REGISTRY § HeterogeneousAdoptionDiD; `had.py:21-31`). The estimator\n", - "uses a **local-linear boundary fit** to estimate the\n", - "`lim_{d↓d_lower} E[ΔY | D_2 ≤ d]` term — the only component of the\n", - "estimand that requires nonparametric identification. The leading-\n", - "order variance is therefore dominated by the influence functions of\n", - "units near `d_lower`, NOT by the full panel. With dose ~ Uniform[5, 50] and\n", - "60 states, only a handful of states sit close to d_lower ~ 5 - and\n", - "those are the units whose IFs dominate `Var(WAS_d_lower)`. The\n", - "PSU-level cluster correlation can amplify the variance only as much\n", - "as those few units are correlated with PSU-mates. With 2 states/PSU\n", - "and only a small share of states near the boundary, the within-PSU\n", + "**The intuition.** `WAS_d_lower` is the average slope above d_lower,\n", + "but its leading-order variance reads off a local-linear boundary\n", + "fit at `d_lower` — and that fit only weights units near the\n", + "boundary. With dose ~ Uniform[5, 50] and 60 states, only a handful\n", + "of states sit close to d_lower ~ 5, and those are the units whose\n", + "influence functions dominate `Var(WAS_d_lower)`. The PSU-level\n", + "cluster correlation can amplify the variance only as much as those\n", + "few units are correlated with PSU-mates. With 2 states/PSU and\n", + "only a small share of states near the boundary, the within-PSU\n", "correlation has a small lever to act on.\n", "\n", + "**Formal definition.** `WAS_{d̲} = (E[ΔY] - lim_{d↓d̲} E[ΔY | D_2\n", + "≤ d]) / E[D_2 - d̲]` (REGISTRY § HeterogeneousAdoptionDiD;\n", + "`had.py:21-31`). The estimator uses a local-linear boundary fit at\n", + "`d_lower` to estimate the `lim_{d↓d̲} E[ΔY | D_2 ≤ d]` term — the\n", + "only component requiring nonparametric identification.\n", + "\n", "Contrast with the event-study path: each event-time horizon is a\n", "**separate** local-linear fit on that horizon's first differences\n", "`ΔY_{g,t} = Y_{g,t} - Y_{g,F-1}` against the common regressor\n", @@ -545,9 +536,8 @@ "Refit with `aggregate=\"event_study\"` and `cband=True` to get\n", "per-horizon ATT estimates plus a sup-t confidence band that adjusts\n", "for the multiple-horizon comparison. The cband is computed via a\n", - "multiplier bootstrap that aggregates per-PSU IF tensor under the\n", - "survey design (Phase 4.5 B composition; the Phase 4.5 C work\n", - "covered the survey-aware Stute pretests demonstrated in §6).\n" + "multiplier bootstrap that aggregates the per-PSU IF tensor under\n", + "the survey design.\n" ] }, { @@ -785,34 +775,22 @@ "> rollout.\n", "\n", "> **For the methodologist.** The HAD pretest workflow runs two\n", - "> diagnostic passes; the QUG step is deferred under survey/weights\n", - "> per Phase 4.5 C0 (the load-bearing caveat we owe the audience).\n", - ">\n", - "> 1. **Overall (two-period) path:** `Stute` (CvM linearity test on\n", - "> residuals) + `Yatchew-HR` (closed-form weighted-OLS sandwich,\n", - "> `null=\"linearity\"` only - T22 does not exercise the\n", - "> `mean_independence` mode). Both fail-to-reject; verdict reads\n", - "> `\"Stute and Yatchew linearity diagnostics fail-to-reject\n", - "> (linearity-conditional verdict; QUG-under-survey deferred per\n", - "> Phase 4.5 C0)\"`.\n", - ">\n", - "> 2. **Event-study path:** `joint pre-trends` (joint-Stute over the\n", - "> three pre-launch placebo horizons) + `joint homogeneity`\n", - "> (joint-Stute over the four post-launch horizons). Both\n", - "> fail-to-reject; verdict reads `\"joint pre-trends and joint\n", - "> linearity diagnostics fail-to-reject (linearity-conditional\n", - "> verdict; QUG-under-survey deferred per Phase 4.5 C0)\"`.\n", - "> `report.yatchew is None` and `report.stute is None` on this\n", - "> path - those single-horizon tests are overall-only.\n", + "> diagnostic passes — overall (`Stute` CvM + `Yatchew-HR`\n", + "> closed-form) on the two-period collapse, and event-study\n", + "> (`joint pre-trends` + `joint homogeneity`, both joint-Stute)\n", + "> on the full panel. Both passes fail-to-reject on this DGP. Both\n", + "> verdicts end in `(linearity-conditional verdict; QUG-under-survey\n", + "> deferred per Phase 4.5 C0)` — the load-bearing C0 caveat. On the\n", + "> event-study path `report.yatchew` and `report.stute` are `None`;\n", + "> those single-horizon tests are overall-only.\n", ">\n", - "> Both paths share the QUG-under-survey deferral suffix. The\n", - "> design-based SE on the headline fit is ~10% larger than the naive\n", - "> SE - smaller than the inflation a CallawaySantAnna or\n", + "> The design-based SE on the headline fit is ~10% larger than the\n", + "> naive SE — smaller than the inflation a CallawaySantAnna or\n", "> LinearRegression coefficient would see at this PSU correlation,\n", - "> because HAD uses a local-linear boundary fit at d_lower to\n", - "> estimate the boundary-limit term in the `WAS-d_lower` formula\n", - "> (variance is dominated by the few states near the boundary, not\n", - "> by the full panel; see section 4).\n" + "> because HAD uses a local-linear boundary fit at `d_lower` to\n", + "> estimate the boundary-limit term in the `WAS_d_lower` formula\n", + "> (variance is dominated by the few states near the boundary; see\n", + "> §4).\n" ] }, { From 31d68e842d8dbca5c98d20f7f0788ea5e1eb37b0 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 15 May 2026 10:00:28 -0400 Subject: [PATCH 13/15] Fix post-consolidation P1: event-study plot uses survey-aware CIs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The §5 matplotlib event-study plot hard-coded `yerr=1.96 * es.se` (Normal-theory pointwise CI) but the table immediately above uses `es.conf_int_low/high`, which the estimator computes with `t` critical values and `df_survey` on the survey path (`diff_diff/had.py:4352-4445`, `diff_diff/utils.py:38-46,177-210`). The plot silently understated uncertainty and contradicted its own neighboring table — a real methodology bug, not just prose. The cband ribbon was already drawn from `es.cband_low/high` (unaffected); only the pointwise error bars were broken. Fix: - Plot now builds asymmetric `yerr` from the stored `es.conf_int_low/high`. matplotlib's `errorbar` accepts a (2, n) array of `[lower_distances, upper_distances]`, which is what the estimator's stored endpoints encode (no need to back out the implied t critical value manually). - Legend label changed from "point + pointwise CI" to "point + pointwise CI (survey-aware t)" to flag the inference family in the figure. Drift coverage: - New `test_event_study_plot_uses_stored_pointwise_ci_endpoints` inspects the notebook source and rejects both the `1.96 * np.asarray(es.se)` and `1.96 * es.se` patterns, AND requires that the plot cell references `conf_int_low` / `conf_int_high`. This is a source-level static check (the plot cell has no return value to introspect at runtime), but it catches exactly this class of regression. Brings the drift suite from 31 to 32 tests; CHANGELOG / REGISTRY counts updated. Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 2 +- docs/methodology/REGISTRY.md | 2 +- docs/tutorials/22_had_survey_design.ipynb | 9 +++++- tests/test_t22_had_survey_design_drift.py | 36 +++++++++++++++++++++++ 4 files changed, 46 insertions(+), 3 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 23f5e1f9..9a5283d6 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Added -- **Tutorial 22: Survey-Weighted HAD** (`docs/tutorials/22_had_survey_design.ipynb`) — end-to-end walkthrough of `HeterogeneousAdoptionDiD` + `did_had_pretest_workflow` on a BRFSS-shape stratified household-survey panel (5 strata × 6 PSUs/stratum × 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum; PSU × period interaction shocks injected so cluster correlation survives DiD first-differencing). Demonstrates the `SurveyDesign(strata=...)` path through the Stute pretest family that the previous `[Unreleased]` entry unblocked. Eight numbered sections: motivation; panel + in-notebook helper for attaching survey columns to a HAD panel; naive vs survey-aware headline fit with a side-by-side ATT / SE / CI table (~10% SE inflation, sign-only direction asserted); a dedicated section explaining why the SE inflation is modest for HAD specifically (WAS-d_lower IF concentration at the boundary vs full-panel regression coefficients); event-study fit with sup-t cband under the survey design (per-horizon table + matplotlib gated plot); pretest workflow on both `aggregate="overall"` and `aggregate="event_study"` paths walking the Phase 4.5 C0 QUG-deferred verdict suffix and the now-supported stratified-clustered Stute multiplier bootstrap; "Communicating to Leadership" two-paragraph template (executive + methodologist); Extensions + Summary Checklist surfacing the still-deferred `lonely_psu='adjust'` + singleton-strata, replicate-weight designs, and the permanent QUG-under-survey C0 deferral. Companion drift-test file `tests/test_t22_had_survey_design_drift.py` (31 tests across 7 groups: panel + survey composition with deterministic exact pins; naive-vs-survey headline with sign-only SE-inflation anchor; event-study cband-vs-pointwise ordering and post/pre coverage; pretest overall path with `_QUG_DEFERRED_SUFFIX` lock and Yatchew `sigma2_*` deterministic pins; pretest event-study path with the SAME `_QUG_DEFERRED_SUFFIX` lock plus a SEPARATE substring lock on `report.summary()` for the L736 QUG-skip note; workflow-surface separation locking that overall has Stute+Yatchew while event-study has joint pretrends/joint linearity with `yatchew=None` and `stute=None`; and weighted point-estimation contract anchoring `survey.att != naive.att` plus the algebraic identity `att = (dy_mean_w - tau_bc) / den_w` from `_fit_continuous`). Bootstrap p-value pins use anchored windows of total width 0.30 (± 0.15 around seeded centers) per `feedback_strata_bootstrap_path_divergence` (stratified Mammen multiplier paths reduce effective dofs vs non-strata; PR #432 commit `aef07020` already had to relax bit-equality bands on this code path). T20 and T21 "Extensions" bullets updated with forward-pointers to T22; `docs/practitioner_decision_tree.rst` HAD universal-rollout and survey sections each gain a `.. tip::` cross-link to T22 (adjacent to T20 / T17, NOT displacing); `docs/api/had.rst` gains a "Survey-aware fit" cross-reference; `docs/survey-roadmap.md` gains a "Phase 4.5 C ✅ Shipped" entry; bundled `diff_diff/guides/llms.txt` and `llms-practitioner.txt` carry T22 inventory entries (the `llms-full.txt` reference guide is left as a follow-up to keep T22 PR scope tight); `docs/doc-deps.yaml` wires T22 as a dependent of both `had.py` and `had_pretests.py`. Closes the Phase 5 (wave 2 second slice) tutorial gap; the realistic survey-weighted HAD workflow on BRFSS / CPS / NHANES / ACS-shaped designs is now end-to-end documented for practitioners. +- **Tutorial 22: Survey-Weighted HAD** (`docs/tutorials/22_had_survey_design.ipynb`) — end-to-end walkthrough of `HeterogeneousAdoptionDiD` + `did_had_pretest_workflow` on a BRFSS-shape stratified household-survey panel (5 strata × 6 PSUs/stratum × 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum; PSU × period interaction shocks injected so cluster correlation survives DiD first-differencing). Demonstrates the `SurveyDesign(strata=...)` path through the Stute pretest family that the previous `[Unreleased]` entry unblocked. Eight numbered sections: motivation; panel + in-notebook helper for attaching survey columns to a HAD panel; naive vs survey-aware headline fit with a side-by-side ATT / SE / CI table (~10% SE inflation, sign-only direction asserted); a dedicated section explaining why the SE inflation is modest for HAD specifically (WAS-d_lower IF concentration at the boundary vs full-panel regression coefficients); event-study fit with sup-t cband under the survey design (per-horizon table + matplotlib gated plot); pretest workflow on both `aggregate="overall"` and `aggregate="event_study"` paths walking the Phase 4.5 C0 QUG-deferred verdict suffix and the now-supported stratified-clustered Stute multiplier bootstrap; "Communicating to Leadership" two-paragraph template (executive + methodologist); Extensions + Summary Checklist surfacing the still-deferred `lonely_psu='adjust'` + singleton-strata, replicate-weight designs, and the permanent QUG-under-survey C0 deferral. Companion drift-test file `tests/test_t22_had_survey_design_drift.py` (32 tests across 7 groups: panel + survey composition with deterministic exact pins; naive-vs-survey headline with sign-only SE-inflation anchor; event-study cband-vs-pointwise ordering and post/pre coverage; pretest overall path with `_QUG_DEFERRED_SUFFIX` lock and Yatchew `sigma2_*` deterministic pins; pretest event-study path with the SAME `_QUG_DEFERRED_SUFFIX` lock plus a SEPARATE substring lock on `report.summary()` for the L736 QUG-skip note; workflow-surface separation locking that overall has Stute+Yatchew while event-study has joint pretrends/joint linearity with `yatchew=None` and `stute=None`; and weighted point-estimation contract anchoring `survey.att != naive.att` plus the algebraic identity `att = (dy_mean_w - tau_bc) / den_w` from `_fit_continuous`). Bootstrap p-value pins use anchored windows of total width 0.30 (± 0.15 around seeded centers) per `feedback_strata_bootstrap_path_divergence` (stratified Mammen multiplier paths reduce effective dofs vs non-strata; PR #432 commit `aef07020` already had to relax bit-equality bands on this code path). T20 and T21 "Extensions" bullets updated with forward-pointers to T22; `docs/practitioner_decision_tree.rst` HAD universal-rollout and survey sections each gain a `.. tip::` cross-link to T22 (adjacent to T20 / T17, NOT displacing); `docs/api/had.rst` gains a "Survey-aware fit" cross-reference; `docs/survey-roadmap.md` gains a "Phase 4.5 C ✅ Shipped" entry; bundled `diff_diff/guides/llms.txt` and `llms-practitioner.txt` carry T22 inventory entries (the `llms-full.txt` reference guide is left as a follow-up to keep T22 PR scope tight); `docs/doc-deps.yaml` wires T22 as a dependent of both `had.py` and `had_pretests.py`. Closes the Phase 5 (wave 2 second slice) tutorial gap; the realistic survey-weighted HAD workflow on BRFSS / CPS / NHANES / ACS-shaped designs is now end-to-end documented for practitioners. - **HAD pretest workflow: stratified survey-design support (Phase 4.5 C continuation).** Lifts the `NotImplementedError` gate on `SurveyDesign(strata=...)` in `stute_test` (`had_pretests.py:1927-1940` pre-PR) and `stute_joint_pretest` (`:3259-3271` pre-PR), and by inheritance in `joint_pretrends_test`, `joint_homogeneity_test`, and `did_had_pretest_workflow` (the wrappers delegate to the joint Stute helper). Implements a documented synthesis of clustered-wild-bootstrap ingredients (Cameron-Gelbach-Miller 2008 cluster-level multipliers; Davidson-Flachaire 2008 wild-bootstrap centering; Djogbenou-MacKinnon-Nielsen 2019 cluster-wild consistency for nonlinear functionals; Kreiss-Lahiri 2012 within-block centering analogy; Wu 1986 / Liu 1988 Bessel small-sample correction) — no single paper covers the exact composition for the stratified Stute CvM functional. The recipe: within-stratum demean + `sqrt(n_h/(n_h-1))` Bessel rescale applied to the PSU multipliers `psu_mults` BEFORE the per-obs broadcast `eta_obs = psu_mults[b, psu_col_idx]` in the wild-residual loop. Bootstrap CvM variance matches the analytical Binder-TSL stratified target `V_S = sum_h (1 - f_h) (n_h / (n_h - 1)) sum_j (psi_hj - psi_h_bar)²` exactly (the `(1 - f_h)` FPC factor was already baked in by `generate_survey_multiplier_weights_batch`; this PR bakes the remaining `(n_h / (n_h - 1))` factor and enforces within-stratum-mean-zero centering). New shared helper `bootstrap_utils.apply_stratum_centering(psu_mults, resolved_survey, psu_ids, psu_axis=...)` is called from both the new Stute path (psu_axis=1 on the multiplier matrix) AND the existing HAD sup-t event-study cband bootstrap (psu_axis=0 on the PSU-aggregated influence tensor; refactored bit-exactly from the inline block previously at `had.py:2172-2204`). Locks the algebraic identity architecturally instead of leaving parallel code blocks to drift. MC oracle consistency validated under a 4-stratum × 6-PSU/stratum stratified null DGP with weights+strata+PSU (200 seeded draws, empirical Type I at α=0.05 in `[0, 0.10]` — 3σ band; the FPC bake-in is covered separately by the helper-unit test `test_fpc_baked_in_helper_is_fpc_agnostic`); MC power validated under a known-alternative stratified DGP (rejection > 0.50). HAD sup-t event-study cband bit-parity preserved (`atol=1e-14, rtol=1e-14` on the refactored helper output + 29 existing cband tests passing post-refactor; that helper-level bit-parity test locks the axis-0 algebra). A separate wired-in regression at `tests/test_had_pretests.py::TestStuteStratifiedSurveyBootstrap::test_stute_call_sites_invoke_apply_stratum_centering` monkey-patches the helper and asserts both Stute call sites (`stute_test` at `had_pretests.py:1985` and `stute_joint_pretest` at `:3312`) invoke it with `psu_axis=1` — that test fails if either call site is disconnected (the axis-0 helper-parity test alone does not catch that case). See `docs/methodology/REGISTRY.md` § HeterogeneousAdoptionDiD — "Note (Stute stratified survey-bootstrap calibration)" for the full derivation. Remaining deferrals: `lonely_psu='adjust'` + singleton-strata (same pseudo-stratum centering gap as the HAD sup-t deviation at REGISTRY:2382) and replicate-weight designs (BRR/Fay/JK1/JKn/SDR — separate Rao-Wu / JKn bootstrap composition). Unblocks the realistic survey-weighted HAD workflow on BRFSS/CPS/NHANES/ACS-shaped designs. - **Conley (1999) Wave A mechanical extensions** on top of the Phase 1+2 sandwich (`diff_diff/conley.py`, `diff_diff/linalg.py`, `diff_diff/estimators.py`, `diff_diff/twfe.py`). **(1) DiD support (#118):** `DifferenceInDifferences(vcov_type="conley").fit(..., unit="")` is now supported. `unit` is a fit-time kwarg (NOT on `__init__`; unused unless Conley is set; not part of `get_params()` / `set_params()`) mirroring `MultiPeriodDiD.fit(unit=...)` / `TwoWayFixedEffects.fit(unit=...)`. DiD inherits the same panel block-decomposed sandwich as MPD/TWFE; on a 2-period panel it matches `MultiPeriodDiD(...).fit(..., post_periods=[1], reference_period=0)` bit-exactly. Missing `unit=`/`conley_lag_cutoff`/`conley_coords`/`conley_cutoff_km` raise `ValueError`; `survey_design=` + Conley raises `NotImplementedError` (Bertanha-Imbens 2014 follow-up); `inference="wild_bootstrap"` + Conley raises `NotImplementedError`. **(2) Combined spatial + cluster product kernel (#119):** `compute_robust_vcov(vcov_type="conley", cluster_ids=...)` / `LinearRegression(vcov_type="conley", cluster_ids=...)` / `TwoWayFixedEffects(vcov_type="conley", cluster="")` / `MultiPeriodDiD(vcov_type="conley", cluster="")` / `DifferenceInDifferences(vcov_type="conley", cluster="")` apply `K_total[i, j] = K_space(d_ij/h) · 1{c_i = c_j}`. On the panel block-decomposed path the cluster indicator multiplies BOTH the spatial sandwich AND the serial sandwich; the validator enforces that `cluster_ids` is constant within each unit across periods (the within-unit serial mask is then trivially all-ones; cross-sectional path has no such constraint). TWFE's default auto-cluster on the Conley path remains silently dropped (combining with unit-level clusters would zero out all between-unit pairs and defeat the spatial pooling); users must pass an explicit above-unit cluster (e.g. region) to opt in. DiD has no auto-cluster — the choice is fully explicit. Two limit fixtures anchor correctness (no R parity — R `conleyreg` does not support combined kernels): all-unique-clusters reduces to HC0; huge-cutoff reduces to pure within-cluster CR1. The huge-cutoff reduction is EXACT only for `conley_kernel="uniform"` (`K(u) = 1` for `|u| ≤ 1`); for `conley_kernel="bartlett"` the identity is asymptotic since `K_bartlett(u) = 1 - |u| < 1` for `u > 0`. The fixture anchor uses uniform for an exact identity check. Per-slice mask construction (NOT full n×n) preserves memory on panel paths. **(3) Sparse k-d-tree fast path (#120):** auto-activates for the spatial Bartlett meat when `n > 5_000` AND metric is `"haversine"` or `"euclidean"` AND kernel is `"bartlett"`. Builds a CSR sparse kernel matrix via `scipy.spatial.cKDTree.query_ball_tree` instead of materializing the full n×n distance matrix; haversine projects to a 3-D unit-sphere chord representation with the exact great-circle recomputed for in-range neighbors only. Bit-identity parity vs the dense path at `atol=1e-10`; R parity at `atol=1e-6` is preserved on the existing 3 panel R fixtures with the sparse path force-enabled. The bartlett-only gate is for boundary correctness — bartlett at `u=1` is exactly 0, so the sparse path safely drops at-cutoff pairs; uniform at `u=1` is 1 and would require a closed-interval query semantic that haversine chord projection cannot reliably preserve. Constants: `_CONLEY_SPARSE_N_THRESHOLD = 5_000` (auto-toggle); `_CONLEY_DENSE_WARN_N` renamed `_CONLEY_DENSE_OOM_WARN_N = 20_000` (memory exhaustion threshold for the dense fallback — independent of the sparse threshold). Private `_conley_sparse: Optional[bool]` kwarg on `_compute_conley_vcov` controls the toggle (`None` = auto, `True` = force, `False` = force dense; `True` with an unsupported kernel/metric raises). The serial component (within-unit Bartlett over time) remains dense regardless — per-unit slices are small. **(4) Callable `conley_metric` validation (#123):** result must satisfy shape `(n, n)`, finite, non-negative, symmetric to `atol=1e-10`, AND zero on the diagonal (`|d(i, i)| ≤ 1e-10`); each failure raises a targeted `ValueError` naming the violated invariant. The zero-diagonal contract is load-bearing for the Conley sandwich: the `i = j` term must reduce to the HC0 diagonal `X_i ε_i² X_i'` via `K(0) = 1`; positive self-distance would silently attenuate the HC0 contribution by `K(d_ii / h) < 1`. Built-in metrics (`"haversine"`, `"euclidean"`) satisfy this by construction. Previously, malformed callables produced opaque BLAS errors deep in the pipeline. **Tests:** `tests/test_conley_vcov.py::TestConleySparse` (12), `::TestConleySparseRParityForced` (3), `::TestConleyCluster` (10), `::TestConleyDistanceMetrics` extended (7 new); existing rejection tests flipped to behavioral; `test_did_conley_matches_mpd_post_periods_1` locks the DiD-vs-MPD bit-exact agreement. **Docs:** REGISTRY `## ConleySpatialHAC` updates: new "Combined spatial + cluster product kernel" + "Performance / scale" subsections, DiD-vs-TWFE cluster asymmetry paragraph, updated panel-API restrictions table. TODO rows 118 / 119 / 120 / 123 removed; rows 121 (Conley + survey_design / weights, Bertanha-Imbens 2014) and 122 (`SyntheticDiD(vcov_type="conley")`, spatial-block bootstrap per Politis-Romano 1994) retained for future waves. - **Conley (1999) spatial-HAC standard errors via `vcov_type="conley"`** on cross-sectional `LinearRegression` / `compute_robust_vcov` plus panel `MultiPeriodDiD` / `TwoWayFixedEffects` (Phases 1 and 2 of the spillover-conley initiative). **Cross-sectional contract:** `conley_coords` (n × 2 array of lat/lon or projected coords), `conley_cutoff_km=` (positive finite bandwidth in km for haversine, or coord units for euclidean — REQUIRED, no default per the no-silent-failures contract), `conley_metric="haversine"|"euclidean"|callable` (default `"haversine"`; great-circle uses Earth's mean radius 6371.01 km matching R `conleyreg`), `conley_kernel="bartlett"|"uniform"` (default `"bartlett"`; both kernels emit a `UserWarning` if the resulting meat has a materially negative eigenvalue — neither the radial 1-D Bartlett nor the uniform kernel is formally PSD-guaranteed; Conley 1999's explicit PSD formula is the 2-D separable lattice product window at Eq 3.14). Cross-sectional variance estimator `Var̂(β) = (X'X)^{-1} · ( Σ_{i,j} K(d_ij/h) · X_i ε_i ε_j X_j' ) · (X'X)^{-1}` (Conley 1999 Eq 4.2). **Panel contract (Phase 2, new):** Three new co-required kwargs `conley_time` (n-length array), `conley_unit` (n-length array), and `conley_lag_cutoff=` (non-negative; 0 means within-period spatial only, no serial component) switch into the **block-decomposed panel sandwich** that matches R `conleyreg` with `lag_cutoff > 0`: `XeeX_total = Σ_t (within-period spatial sandwich) + Σ_u (within-unit Bartlett temporal sandwich, lag ∈ {1..L}, same-time excluded)`. This is NOT a multiplicative product kernel — verified empirically against `conleyreg::time_dist` and `XeeXhC` at ~1e-14 on the panel parity fixtures. The temporal kernel is hardcoded Bartlett `(1 - |lag|/(L+1))` regardless of `conley_kernel`, mirroring `conleyreg::time_dist.cpp`; documented as a `Note (deviation from R-symmetric API)` in REGISTRY. **Panel estimator wire-up (Phase 2):** `MultiPeriodDiD(vcov_type="conley", conley_lag_cutoff=...).fit(..., unit=...)` and `TwoWayFixedEffects(vcov_type="conley", conley_lag_cutoff=...).fit(..., unit=...)` lift the Phase 1 fit-time rejection; the `conley_time` and `conley_unit` arrays are auto-derived from the existing `time` and `unit` column-name arguments at fit-time. `DifferenceInDifferences(vcov_type="conley")` is also supported (Wave A #118 in this release; see the Wave A entry above) — pass `unit=` as a fit-time kwarg to `DiD.fit(...)`. **Other constraints (Phase 1, unchanged):** `SyntheticDiD(vcov_type="conley")` raises `TypeError` (uses bootstrap variance, not analytical sandwich); `set_params` mirrors the constructor rejection. `vcov_type="conley"` + `weights=` / `survey_design=` raises `NotImplementedError` (Bertanha-Imbens 2014 weighted-Conley deferred to a follow-up PR). `vcov_type="conley"` + explicit `cluster_ids=` is supported via the combined spatial + cluster product kernel (Wave A #119; see the Wave A entry above). TWFE's default auto-cluster on the Conley path is silently dropped (combining with unit-level clusters would defeat the spatial pooling); users opt into the combined kernel by passing an explicit above-unit cluster. `inference="wild_bootstrap"` + Conley raises (incompatible inference modes). A sparse k-d-tree fast path auto-activates for the spatial Bartlett meat when `n > 5_000` with bartlett kernel and haversine/euclidean metric (Wave A #120); the dense fallback still emits an OOM `UserWarning` at `n > 20_000`. **Implementation:** Helpers live in `diff_diff/conley.py` (`_haversine_km`, `_pairwise_distance_matrix`, `_bartlett_kernel`, `_uniform_kernel`, `_validate_conley_kwargs`, `_compute_conley_vcov` — the validator and sandwich helper now accept keyword-only `time` / `unit` / `lag_cutoff` for the panel path); `compute_robust_vcov` in `diff_diff/linalg.py` threads the new kwargs through. **R `conleyreg` parity (Düsterhöft 2021, CRAN v0.1.9)** on **six** benchmark fixtures (`benchmarks/data/r_conleyreg_conley_golden.json`, regenerable via `benchmarks/R/generate_conley_golden.R`): 3 cross-sectional (Phase 1) + 3 new panel fixtures (`panel_haversine_lag1`, `panel_haversine_lag2`, `panel_lat_lon_realistic_lag1`; n_units × T = 60×3, 80×5, 100×4 at lag={1,2,1}); observed max abs diff ~5.7e-16. Earth radius 6371.01 km matches `conleyreg::haversine_dist`. Test file `tests/test_conley_vcov.py` skips parity cleanly when the JSON is absent. New REGISTRY section `## ConleySpatialHAC`. Subsequent phases of the spillover-conley initiative (ring-indicator spillover-aware DiD per Butts 2021; survey-design / replicate-weight support; `SyntheticDiD` Conley path) are tracked in `TODO.md` under "Tech Debt from Code Reviews" → spillover-conley rows. diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 098ff1a1..ba7d4374 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2574,7 +2574,7 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in - [x] Phase 5 (wave 1, PR #402): `llms-full.txt` HeterogeneousAdoptionDiD section + result-class blocks + `## HAD Pretests` index + Choosing-an-Estimator row landed; constructor / fit() parameter names are regression-locked against `inspect.signature(HeterogeneousAdoptionDiD.__init__)` and `HeterogeneousAdoptionDiD.fit` for parameter-name presence (parameter defaults and the non-return parameter type annotations remain unpinned; the `fit()` return-type union is locked BOTH at the source-code level AND at the test level by `TestFitReturnAnnotation`); result-class field tables enumerate every public dataclass field (regression-tested via `dataclasses.fields()`); `llms-practitioner.txt` Step 4 decision tree distinguishes ContinuousDiD (per-dose ATT(d), needs never-treated) from HeterogeneousAdoptionDiD (WAS, universal-rollout-compatible). - [x] Phase 5 (partial): README catalog one-liner, bundled `llms.txt` `## Estimators` entry, `docs/api/had.rst` (autoclass for the three classes), and `docs/references.rst` citation landed in PR #372 docs refresh. - [x] Phase 5 (wave 2 first slice, PR #409): T21 HAD pretest workflow tutorial (`docs/tutorials/21_had_pretest_workflow.ipynb`) — composite pre-test walkthrough for `did_had_pretest_workflow`. Uses a `Uniform[$0.01K, $50K]` dose-distribution variant of T20's brand-campaign panel (true support strictly positive but near-zero, chosen so QUG fails-to-reject `H0: d_lower = 0` in finite sample). Walks through `aggregate="overall"` (Steps 1 + 3 only, verdict explicitly flags Step 2 deferral) and upgrades to `aggregate="event_study"` (joint pre-trends Stute + joint homogeneity Stute close the gap). Side panel exercises both `yatchew_hr_test` null modes (`linearity` vs `mean_independence`). Companion drift-test file `tests/test_t21_had_pretest_workflow_drift.py` (16 tests pinning panel composition, both verdict pivots, structural anchors, deterministic stats, bootstrap p-value tolerance bands per backend, and `HAD(design="auto")` resolution to `continuous_at_zero` on this panel). -- [x] Phase 5 (wave 2 second slice): T22 weighted/survey HAD tutorial (`docs/tutorials/22_had_survey_design.ipynb`) - shipped as the follow-up to PR #432. End-to-end walkthrough of `HeterogeneousAdoptionDiD` + `did_had_pretest_workflow` under `SurveyDesign(weights, strata, psu, fpc)` on a BRFSS-shape state-rollout panel (5 strata x 6 PSUs/stratum x 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum). Companion drift-test file `tests/test_t22_had_survey_design_drift.py` (31 tests pinning panel composition, naive-vs-survey SE inflation direction, design auto-detection, event-study cband-vs-pointwise width ordering, `_QUG_DEFERRED_SUFFIX` substring on `report.verdict` for both overall and event-study paths, the distinct `report.summary()` QUG-skip note on the event-study path, deterministic Yatchew sigma2_*, bootstrap p-value anchored windows of total width 0.30 (± 0.15 around seeded centers) per `feedback_strata_bootstrap_path_divergence`, workflow-surface separation between overall and event-study paths, and the weighted point-estimation contract via the `_fit_continuous` algebraic identity). +- [x] Phase 5 (wave 2 second slice): T22 weighted/survey HAD tutorial (`docs/tutorials/22_had_survey_design.ipynb`) - shipped as the follow-up to PR #432. End-to-end walkthrough of `HeterogeneousAdoptionDiD` + `did_had_pretest_workflow` under `SurveyDesign(weights, strata, psu, fpc)` on a BRFSS-shape state-rollout panel (5 strata x 6 PSUs/stratum x 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum). Companion drift-test file `tests/test_t22_had_survey_design_drift.py` (32 tests pinning panel composition, naive-vs-survey SE inflation direction, design auto-detection, event-study cband-vs-pointwise width ordering, `_QUG_DEFERRED_SUFFIX` substring on `report.verdict` for both overall and event-study paths, the distinct `report.summary()` QUG-skip note on the event-study path, deterministic Yatchew sigma2_*, bootstrap p-value anchored windows of total width 0.30 (± 0.15 around seeded centers) per `feedback_strata_bootstrap_path_divergence`, workflow-surface separation between overall and event-study paths, and the weighted point-estimation contract via the `_fit_continuous` algebraic identity). - [ ] Documentation of non-testability of Assumptions 5 and 6. - [ ] Warnings for staggered treatment timing (redirect to `ChaisemartinDHaultfoeuille`). - [ ] `NotImplementedError` phase pointer when `covariates=` is passed (Theorem 6 future work). diff --git a/docs/tutorials/22_had_survey_design.ipynb b/docs/tutorials/22_had_survey_design.ipynb index 3a059fb9..f6ffca17 100644 --- a/docs/tutorials/22_had_survey_design.ipynb +++ b/docs/tutorials/22_had_survey_design.ipynb @@ -595,7 +595,14 @@ " ax.axhline(0.0, color=\"black\", linewidth=0.5)\n", " ax.axhline(TRUE_SLOPE, color=\"gray\", linewidth=0.6, linestyle=\":\", label=\"true slope = 100\")\n", " ax.fill_between(et, np.asarray(es.cband_low), np.asarray(es.cband_high), alpha=0.15, label=\"sup-t cband (survey)\")\n", - " ax.errorbar(et, np.asarray(es.att), yerr=1.96 * np.asarray(es.se), fmt=\"o\", color=\"#1f77b4\", capsize=3, label=\"point + pointwise CI\")\n", + " # Use the estimator's stored pointwise CI endpoints (built from t\n", + " # critical values with df_survey under SurveyDesign — NOT hard-coded\n", + " # 1.96 * se, which would silently understate uncertainty under\n", + " # survey-aware inference). matplotlib errorbar takes asymmetric\n", + " # yerr as a (2, n) array of [lower_distances, upper_distances].\n", + " att = np.asarray(es.att)\n", + " yerr = np.vstack([att - np.asarray(es.conf_int_low), np.asarray(es.conf_int_high) - att])\n", + " ax.errorbar(et, att, yerr=yerr, fmt=\"o\", color=\"#1f77b4\", capsize=3, label=\"point + pointwise CI (survey-aware t)\")\n", " ax.axvline(-0.5, color=\"red\", linewidth=0.6, linestyle=\"--\")\n", " ax.set_xlabel(\"event time (weeks since campaign launch)\")\n", " ax.set_ylabel(\"per-$1K WAS-d_lower\")\n", diff --git a/tests/test_t22_had_survey_design_drift.py b/tests/test_t22_had_survey_design_drift.py index e5757b33..b455d75b 100644 --- a/tests/test_t22_had_survey_design_drift.py +++ b/tests/test_t22_had_survey_design_drift.py @@ -393,6 +393,42 @@ def test_survey_se_strictly_inflated_vs_naive(naive_overall_result, survey_overa ) +def test_event_study_plot_uses_stored_pointwise_ci_endpoints(): + """The §5 matplotlib plot must build pointwise CI bars from the + estimator's stored ``conf_int_low`` / ``conf_int_high`` (which on + the survey path use ``t`` critical values with ``df_survey``), + NOT from hard-coded ``1.96 * es.se`` Normal-theory bands. The + earlier version of the plot used the latter and silently + understated uncertainty relative to the table printed in the + cell above it (CI AI review post-consolidation P1). + + This is a static check on the notebook source — the plot cell + runs but produces no return value we can introspect, so we lock + the construction at the source level.""" + from pathlib import Path + import nbformat + + nb_path = ( + Path(__file__).resolve().parents[1] / "docs" / "tutorials" / "22_had_survey_design.ipynb" + ) + nb = nbformat.read(nb_path, as_version=4) + plot_cell_src = None + for cell in nb.cells: + if cell.cell_type != "code": + continue + src = cell.source if isinstance(cell.source, str) else "".join(cell.source) + if "HAD event-study under SurveyDesign" in src and "errorbar" in src: + plot_cell_src = src + break + assert plot_cell_src is not None, "T22 event-study plot cell not found" + # Must use stored endpoints + assert "conf_int_low" in plot_cell_src, plot_cell_src + assert "conf_int_high" in plot_cell_src, plot_cell_src + # Must NOT hard-code Normal-theory bars + assert "1.96 * np.asarray(es.se)" not in plot_cell_src, plot_cell_src + assert "1.96 * es.se" not in plot_cell_src, plot_cell_src + + def test_survey_se_inflation_ratio_in_band(naive_overall_result, survey_overall_result): """Anchored band on the seeded SE-inflation ratio. T22 §3 narrative quotes "around 1.10x" inflation; sign-only assertion above is too From f7a716ab17285617aa75765b45c739009a3fd9f6 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 15 May 2026 10:12:17 -0400 Subject: [PATCH 14/15] CI fix: add T22 to docs/index.rst toctree Sphinx HTML build failed with `-W warnings as errors` because the new tutorial was included nowhere in the docs toctree: docs/tutorials/22_had_survey_design.ipynb: WARNING: document isn't included in any toctree [toc.not_included] Added the missing entry to the "Tutorials: Business Applications" toctree at `docs/index.rst`, alongside T20 and T21. Same convention as the existing HAD-series entries. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/index.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/index.rst b/docs/index.rst index ea7f8ad0..86755820 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -82,6 +82,7 @@ Quick Links tutorials/19_dcdh_marketing_pulse tutorials/20_had_brand_campaign tutorials/21_had_pretest_workflow + tutorials/22_had_survey_design .. toctree:: :maxdepth: 1 From abe874d380f4d4dc41713149ecbf7d8fe98534b4 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 15 May 2026 10:49:10 -0400 Subject: [PATCH 15/15] CI fix: skip plot-source test on isolated-install CI MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit `test_event_study_plot_uses_stored_pointwise_ci_endpoints` imported `nbformat` and read the notebook from the repo's `docs/tutorials/` directory. CI Python Tests run from `/tmp/tests/` (isolated install of the wheel, no repo-tree access) and don't include nbformat in the runtime deps, so the test errored: ModuleNotFoundError: No module named 'nbformat' tests/test_t22_had_survey_design_drift.py:409 Two guards added (per `feedback_golden_file_pytest_skip` — same pattern that benchmarks/data/*.json drift tests use): 1. `nbformat = pytest.importorskip("nbformat")` — skips when the optional dep is missing. 2. `if not nb_path.exists(): pytest.skip(...)` — skips on the isolated-install matrix where docs/ isn't copied alongside tests/. The test runs in any environment that has both nbformat and the repo tree (dev workspace + tutorial-exec CI workflows), which is where it actually adds value. The Python Tests matrix doesn't need to lock notebook source against the prose/code mismatch the test was added to prevent. Co-Authored-By: Claude Opus 4.7 (1M context) --- tests/test_t22_had_survey_design_drift.py | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/tests/test_t22_had_survey_design_drift.py b/tests/test_t22_had_survey_design_drift.py index b455d75b..4e2528ae 100644 --- a/tests/test_t22_had_survey_design_drift.py +++ b/tests/test_t22_had_survey_design_drift.py @@ -404,13 +404,18 @@ def test_event_study_plot_uses_stored_pointwise_ci_endpoints(): This is a static check on the notebook source — the plot cell runs but produces no return value we can introspect, so we lock - the construction at the source level.""" + the construction at the source level. Skipped on isolated-install + CI jobs where ``docs/`` is not copied alongside ``tests/`` and + ``nbformat`` is not in the runtime deps (per + ``feedback_golden_file_pytest_skip``).""" from pathlib import Path - import nbformat + nbformat = pytest.importorskip("nbformat") nb_path = ( Path(__file__).resolve().parents[1] / "docs" / "tutorials" / "22_had_survey_design.ipynb" ) + if not nb_path.exists(): + pytest.skip(f"Notebook not present at {nb_path} (isolated-install CI)") nb = nbformat.read(nb_path, as_version=4) plot_cell_src = None for cell in nb.cells: