Add Tutorial 22: Survey-Weighted HAD walkthrough #440
Conversation
|
Overall Assessment Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
Demonstrates the now-fully-supported HeterogeneousAdoptionDiD + did_had_pretest_workflow workflow under SurveyDesign(weights, strata, psu, fpc) end-to-end on a BRFSS-shape state-rollout panel (5 strata × 6 PSUs/stratum × 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum; PSU × period interaction shocks injected so cluster correlation survives DiD first-differencing). Closes the Phase 5 wave 2 second-slice tutorial gap that the survey-strata gate lift unblocked.

Eight sections:
1. motivation;
2. panel + in-notebook helper for attaching survey columns to a HAD panel;
3. naive vs survey-aware headline fit with side-by-side ATT/SE/CI table (~10% SE inflation, sign-only direction asserted);
4. a dedicated discussion of why the SE inflation is modest for HAD specifically (WAS-d_lower IF concentration at the boundary vs full-panel regression coefficients);
5. event-study with sup-t cband under the survey design;
6. pretest workflow on both overall and event-study paths, walking the Phase 4.5 C0 QUG-deferred verdict suffix and the now-supported stratified-clustered Stute multiplier bootstrap;
7. communicating-to-leadership two-paragraph template;
8. Extensions + Summary Checklist surfacing the still-deferred lonely_psu='adjust' + singleton-strata and replicate-weight designs, and the permanent QUG-under-survey C0 deferral.
Companion drift-test file (25 tests across 5 groups) locks: panel composition with deterministic exact pins; naive-vs-survey SE inflation direction (sign-only structural anchor; HAD's IF concentration caps inflation around 1.10x even at large PSU shock SD); design auto-detection to continuous_near_d_lower; event-study cband-vs-pointwise width ordering; _QUG_DEFERRED_SUFFIX substring on report.verdict for both overall and event-study paths; the distinct report.summary() QUG-skip note on the event-study path; deterministic Yatchew sigma2_* pins; and bootstrap p-value tolerance bands at >= 0.25 abs per feedback_strata_bootstrap_path_divergence.

Cross-surface updates:
- T20 and T21 Extensions bullets gain forward-pointers to T22 (T20 also drops the deprecated weights= kwarg phrasing in favor of survey_design=).
- Practitioner decision tree: the HAD universal-rollout and survey sections each gain a tip cross-link to T22 (adjacent to T20 / T17, not displacing).
- api/had.rst gains a Survey-aware fit cross-reference.
- survey-roadmap.md gains a Phase 4.5 C HAD Stute Survey Workflow section.
- Bundled llms.txt and llms-practitioner.txt carry T22 inventory entries.
- doc-deps.yaml wires T22 as a dependent of both had.py and had_pretests.py.
- REGISTRY closers L2529 + L2577 flipped to shipped; TODO row L115 marked shipped.
- CHANGELOG carries the new Unreleased Added entry plus closer flips at the T21 (PR #409) and HAD handlers (PR #402) queued-tutorial closer lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
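The panel geometry above can be sketched in plain numpy. This is an illustrative stand-in, not the tutorial's actual helper — the function name `make_panel` and every parameter here are hypothetical. The point it demonstrates is the shock placement: a shock drawn per (PSU, period) pair does not cancel under first-differencing, whereas a pure PSU fixed effect would.

```python
import numpy as np

def make_panel(seed=87, n_strata=5, psus_per_stratum=6, states_per_psu=2,
               n_periods=6, psu_shock_sd=0.5):
    """Sketch of the BRFSS-shape geometry: 5 strata x 6 PSUs x 2 states = 60 states.

    PSU x period interaction shocks are drawn per (psu, period), so the
    cluster correlation survives DiD first-differencing.
    """
    rng = np.random.default_rng(seed)
    rows = []
    for s in range(n_strata):
        for p in range(psus_per_stratum):
            psu_id = s * psus_per_stratum + p
            # one shock per (PSU, period): this does NOT difference away
            psu_period_shock = rng.normal(0.0, psu_shock_sd, size=n_periods)
            for u in range(states_per_psu):
                state_id = psu_id * states_per_psu + u
                noise = rng.normal(0.0, 1.0, size=n_periods)
                y = psu_period_shock + noise
                for t in range(n_periods):
                    rows.append((state_id, psu_id, s, t, y[t]))
    return np.array(rows)  # columns: state, psu, stratum, period, outcome

panel = make_panel()
```

60 states over 6 periods gives 360 rows across 30 PSUs, matching the 5 × 6 × 2 geometry in the description.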
ebb73de to 1eabce5
🔁 AI review rerun (requested by @igerber) Head SHA:
Static-review note: I could not execute
P1 #1 — T22 §3 prose contradicted the implementation. It said the analytical local-linear at d_lower "does not consume the survey weights in the slope". The weighted continuous path (diff_diff/had.py:3744-3810) consumes weights in (a) the local-linear `tau_bc` boundary fit, (b) the numerator `np.average(dy_arr, weights=weights_arr)`, AND (c) the denominator `np.average(d_reg, weights=weights_arr)`. Rewrote §3 to say the two ATTs are close on this DGP because the weight CV (~0.30) and the dose-distribution shape do not co-vary strongly enough to shift the boundary slope materially — NOT because weights are ignored. Added two drift tests for the weighted point-estimation contract: `test_survey_att_differs_from_naive_att` (sign-only — if weights were ignored the values would be bit-identical) and `test_survey_att_matches_weighted_denominator_contract` (verifies the algebraic identity `att = (dy_mean_w - tau_bc) / den_w` from `_fit_continuous`).

P1 #2 — T22 §7 leadership block conflated the overall and event-study pretest paths. It said "all three linearity diagnostics" and "Yatchew-HR fails-to-reject under both null modes" (T22 doesn't run the side panel — that's T21), and quoted the overall-path verdict string while describing event-study joint diagnostics. Split the methodologist write-up by path: overall = `Stute + Yatchew`; event-study = `joint pre-trends + joint linearity` with explicit `report.yatchew is None` and `report.stute is None` callouts. Added three drift tests to lock the per-path workflow surfaces: `test_overall_report_pretrends_joint_is_none` (overall has no joint diagnostics), `test_event_study_report_stute_and_yatchew_are_none` (event-study has no single-horizon Stute or Yatchew), and `test_overall_and_event_study_verdict_prefixes_distinct` (the two paths share `_QUG_DEFERRED_SUFFIX` but have distinct verdict prefixes; locks the §7 prose against re-conflating).
P3 — CHANGELOG L11 claimed `diff_diff/guides/llms-full.txt` got a T22 inventory entry; the file was intentionally scoped out per the plan-review feedback (would expand scope beyond T22 to T17-T21 backfill). Updated the closer to reflect the actual scope and flag llms-full.txt as a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
P3 #1 — T22 §3 prose said the two ATTs land "within bootstrap noise of each other", but the headline path uses analytical Binder/TSL Taylor-linearized SEs, not a bootstrap. Replaced with "numerically close on this DGP" to avoid misstating the inference mechanism.

P3 #2 — CHANGELOG L11 and REGISTRY L2577 said "25 tests across 5 groups" for the T22 drift suite, but the R1 fix added 5 more tests (2 in Group F: workflow-surface separation; 3 in Group G: weighted point-estimation contract). Updated both summaries to "30 tests across 7 groups" with one-line descriptions of the two new groups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
P1 — T22 §4 + §5 prose claimed pre-period event-study horizons have
"dose ... zero across the board" and used that premise to explain
the SE pattern. Per paper Appendix B.2 (`had.py:1827-1838`) the
event-study contract reuses `D_{g,F}` as the SINGLE dose regressor
for every horizon, including pre-period placebos — the regressor is
NOT zero on the pre-period. Rewrote both passages to base the
SE-pattern explanation on the outcome-side `ΔY_{g,t} = Y_{g,t} -
Y_{g,F-1}`: placebos have small ΔY (within-pre noise only) so the
local-linear fits low residual variance and reads small SEs;
post-period horizons have ΔY that scales with `slope * D_{g,F}`
plus noise, so residual variance is larger and SEs are larger.
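The outcome-side mechanism can be illustrated with a toy numpy simulation (every name and number here is illustrative, not the tutorial's DGP): a placebo horizon's first differences carry only within-pre noise, while a post-period horizon's first differences also carry the `slope * D` component, so their spread is visibly larger against the same common dose regressor.

```python
import numpy as np

rng = np.random.default_rng(22)
n_units = 60
d = rng.uniform(0.1, 1.0, n_units)  # common dose regressor D_{g,F}, reused at EVERY horizon
slope = 2.0

y_base = rng.normal(0.0, 0.3, n_units)                        # Y_{g,F-1}
dy_placebo = rng.normal(0.0, 0.3, n_units) - y_base           # pre-period: within-pre noise only
dy_post = slope * d + rng.normal(0.0, 0.3, n_units) - y_base  # post: scales with slope * D

# the post-period first differences have larger spread, which is what
# the per-horizon fit reads as higher residual variance and larger SEs
spread_placebo = dy_placebo.var()
spread_post = dy_post.var()
```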
P3 — `test_survey_att_matches_weighted_denominator_contract`
previously only constructed an implied `tau_bc` from the fitted
att/den_w and checked finiteness + scale, which is too weak to be
called an "identity lock" per the CHANGELOG/REGISTRY prose. Renamed
to `test_survey_att_matches_weighted_local_linear_identity` and
strengthened to three concrete identities:
1. `survey.effective_dose_mean == np.average(d - d_lower, w)`
bit-equal (1e-10).
2. Direct call to `bias_corrected_local_linear` with HAD defaults
(`kernel="epanechnikov"`, `alpha=0.05`, `boundary=0`) recovers
the SAME `tau_bc` boundary limit the estimator used.
3. `att = (mean_w(dy) - tau_bc) / den_w` matches the fitted
`survey.att` to ~1e-13 (verified locally; same float ops on
the same inputs).
The CHANGELOG/REGISTRY prose now genuinely matches the test
coverage.
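The weighted-moment identity at the heart of the strengthened test can be checked in isolation with numpy. In this sketch `tau_bc` is a stand-in scalar for the bias-corrected local-linear boundary limit, not a call into `diff_diff`; the dose, outcome, and weight arrays are synthetic.

```python
import numpy as np

rng = np.random.default_rng(87)
n, d_lower = 60, 0.1
d = rng.uniform(d_lower, 1.0, n)        # dose
dy = 2.0 * d + rng.normal(0.0, 0.3, n)  # first-differenced outcome
w = rng.uniform(0.5, 2.0, n)            # survey weights

tau_bc = 0.15  # stand-in for the bias-corrected local-linear boundary limit

dy_mean_w = np.average(dy, weights=w)        # weighted numerator moment
den_w = np.average(d - d_lower, weights=w)   # weighted effective-dose denominator
att = (dy_mean_w - tau_bc) / den_w

# the identity is exact float algebra: recomputing from the same inputs
# reproduces att bit-for-bit, which is why a ~1e-13 check is reasonable
assert att == (np.average(dy, weights=w) - tau_bc) / np.average(d - d_lower, weights=w)
```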
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
R2 fixed two §4/§5 sites that misframed event-study mechanics. R3
reviewer caught a third site at the §4 cell that introduces the
per-horizon SE-ratio table (`L422-426`). It said the post-period
horizons "aggregate across all post-period observations" and the
variance is "read off the full panel" — wrong. HAD event-study
fits each horizon as a SEPARATE local-linear on that horizon's
first differences (`ΔY_{g,t} = Y_{g,t} - Y_{g,F-1}`) against the
common `D_{g,F}` regressor (paper Appendix B.2;
`diff_diff/had.py:4298-4451`); the pointwise per-horizon SE reads
each horizon's own residual variance, not a panel aggregate.
Also softened the "per-horizon SE ratio should be larger than the
overall ratio" claim — the empirical per-horizon ratios on the
locked seed are mixed ([1.087, 1.0, 0.816, 1.203, 1.127, 1.085,
1.126]), some larger and some smaller than the overall ~1.10x. The
revised text frames this as an empirical observation of how PSU
correlation interacts with each horizon's `ΔY` distribution, NOT a
methodological guarantee.
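The separate-fit-per-horizon mechanics can be mimicked with plain OLS — a deliberate simplification, since the actual estimator uses a weighted local-linear boundary fit, not global OLS. What the sketch preserves is the structure: each horizon's SE is computed from that horizon's own residuals against the common dose regressor, never from a pooled panel aggregate.

```python
import numpy as np

rng = np.random.default_rng(22)
n_units, horizons = 60, 4
d = rng.uniform(0.1, 1.0, n_units)            # common D_{g,F}, reused at every horizon
X = np.column_stack([np.ones(n_units), d])

per_horizon_se = []
for h in range(horizons):
    # Delta-Y for this horizon only (illustrative slope grows with h)
    dy_h = (h + 1) * 0.5 * d + rng.normal(0.0, 0.3, n_units)
    beta, *_ = np.linalg.lstsq(X, dy_h, rcond=None)
    resid = dy_h - X @ beta
    sigma2_h = resid @ resid / (n_units - 2)   # this horizon's OWN residual variance
    xtx_inv = np.linalg.inv(X.T @ X)
    per_horizon_se.append(np.sqrt(sigma2_h * xtx_inv[1, 1]))
```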
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
R3 reviewer flagged that CHANGELOG/REGISTRY prose says the T22 drift suite "pins bootstrap p-values with >= 0.25 abs tolerance bands", but the actual assertions were `0.10 <= p <= 0.95` (width 0.85). The prose was stronger than the test coverage. Tightened to T21-style anchored bands centered on the seed=22 captured values:
- stute (overall): `0.27 <= p <= 0.57` (was 0.10-0.95; captured ~0.42)
- pretrends_joint: `0.24 <= p <= 0.54` (captured ~0.39)
- homogeneity_joint: `0.26 <= p <= 0.56` (captured ~0.41)

Each band has abs width ~0.30, satisfying the >= 0.25 abs band contract from `feedback_strata_bootstrap_path_divergence` while catching drift in either direction (toward rejection or toward an even cleaner pass) rather than only rejecting on cross-the-line moves. Aligns the CHANGELOG/REGISTRY claim with the test coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
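The anchored-band idiom amounts to a two-sided assertion around the seeded center. A minimal sketch — the helper name and centers below are illustrative, not the suite's actual code:

```python
def assert_anchored_band(p, center, half_width=0.15):
    """Two-sided drift guard: fails on drift toward rejection OR toward an
    implausibly cleaner pass, unlike a one-sided `p > alpha` check."""
    lo, hi = center - half_width, center + half_width
    assert lo <= p <= hi, f"p={p} drifted outside [{lo:.2f}, {hi:.2f}]"

# captured seed=22 centers from the comment above
assert_anchored_band(0.42, center=0.42)  # stute (overall)
assert_anchored_band(0.39, center=0.39)  # pretrends_joint
assert_anchored_band(0.41, center=0.41)  # homogeneity_joint
```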
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Tech Debt: The remaining survey-path deferrals are explicitly surfaced rather than silently deferred.
P2 — T22 §3 said "Both fits use the same weighted local-linear estimator at d_lower". `_fit_continuous` only switches to weighted moments when `weights_arr is not None` (`had.py:3747-3760`, `:3803-3808`); the naive fit uses unweighted moments. Rewrote §3 to say "same local-linear estimator family" with an explicit per-fit moment-form distinction (naive: unweighted; survey: weighted via `bias_corrected_local_linear(..., weights=weights_arr)` plus weighted `np.average` for `dy_mean` and the denominator).

P3 #1 — CHANGELOG/REGISTRY/test docstring/comments said the bootstrap p-value pins use ">= 0.25 abs tolerance bands", but the R3-tightened bands are width 0.30 total (± 0.15 around seeded centers). The ">= 0.25 abs tolerance" wording is ambiguous between half-width and total-width interpretations. Updated all three surfaces to "anchored windows of total width 0.30 (± 0.15 around seeded centers)" so the prose is unambiguous and matches the actual assertions.

P3 #2 — T22 §3 quotes "around 1.10x" SE inflation but the drift suite only checked direction (`survey.se > naive.se`); the seeded ratio could drift to 1.05 or 1.20 silently. Added `test_survey_se_inflation_ratio_in_band` asserting `1.00 <= ratio <= 1.20` — locks the seed=87 captured ratio (~1.0985) tightly enough to flag drift but loosely enough to not flake on RNG-path differences. Bumped CHANGELOG/REGISTRY test count from 30 → 31 to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Tech Debt: The remaining HAD survey-path limitations are still properly tracked rather than silently deferred.
P2 — T22 §4 opener said WAS_d_lower is "a weighted average of
unit-specific slopes in a local-linear neighborhood at d_lower",
which conflated the target parameter with its nuisance estimator.
The shipped contract (REGISTRY § HAD; `had.py:21-31`) defines
`WAS_{d̲} = (E[ΔY] - lim_{d↓d̲} E[ΔY | D_2 ≤ d]) / E[D_2 - d̲]`
— an average slope above d_lower, NOT a neighborhood estimand.
The local-linear boundary fit is one component (estimating the
limit term). The leading-order variance still concentrates at the
boundary because that's where the only nonparametric estimation
happens, but that is a property of the variance, not the
estimand. Rewrote the §4 opener to make this distinction
explicit.
P3 — T22 §5 attributed the survey event-study cband to "Phase
4.5 C composition", but per REGISTRY:2366-2380 the weighted
event-study sup-t cband is Phase 4.5 B; Phase 4.5 C is the
pretest/workflow extension demonstrated in §6. Updated to
"Phase 4.5 B composition" with a one-clause note that the §6
material is the Phase 4.5 C work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
The §4 estimand-vs-estimator fix from R5 had a residual echo at
the §7 leadership-block clause: "HAD's WAS-d_lower estimand is a
local-linear at d_lower". Same conflation class — collapses the
target parameter (`WAS_{d̲} = (E[ΔY] - lim_{d↓d̲} E[ΔY | D_2 ≤ d])
/ E[D_2 - d̲]`, the average slope above d_lower) with the nuisance
estimator (the local-linear boundary fit used to estimate the
limit term). Rewrote the clause along the lines the reviewer
suggested: HAD uses a local-linear boundary fit at d_lower to
estimate the boundary-limit term in the WAS-d_lower formula;
variance is dominated by the few states near the boundary.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
T22 walked the Design 1 (`continuous_near_d_lower`) survey-aware
workflow but never restated the non-testable Assumption 5/6
caveat that point identification of `WAS_d_lower` rests on (the
QUG-under-survey deferral was the only methodology caveat T22's
prose called out). This made the survey-path caveat discussion
read as if QUG deferral + linearity diagnostics exhaust the
methodology risk, which is not the full HAD contract.
Compounding factor: the five HAD/pretest fit cells used
`warnings.filterwarnings("ignore", category=UserWarning)` —
a blanket filter that silently swallowed the
`continuous_near_d_lower ... Assumption 5 or 6` UserWarning the
library fires on every Design 1 fit. So users running the
notebook saw neither the warning nor the prose explanation.
Two-part fix matching the reviewer's recommendation and T20's
established convention:
1. Narrow each `filterwarnings` call from the catch-all
`category=UserWarning` to two specific message patterns:
`r".*pweight.*"` (suppress the noisy normalization message)
and `r".*continuous_near_d_lower.*Assumption.*"` (suppress
the redundant Assumption 5/6 advisory on subsequent fits;
the §3 first fit lets it surface naturally so users see it
at least once). This mirrors T20's own pattern at
`20_had_brand_campaign.ipynb` where the headline fit lets the
warning fire and the second event-study cell narrowly filters
it out as redundant. Other UserWarnings — notably the
QUG-deferred-under-survey advisory from
`did_had_pretest_workflow` — are now no longer suppressed and
fire as the load-bearing user-facing methodology signal that
`feedback_no_silent_failures` requires.
2. Add an Assumption 5/6 caveat note to the §3 "Reading the
table" interpretation cell. Mirrors T20's L229 prose: explains
that point identification of `WAS_d_lower` requires Assumption
6 (or Assumption 5 for sign only); both are about local
linearity of the dose-response near `d_lower` and are not
testable from data; the §6 linearity diagnostics are
necessary but not sufficient; users should justify Assumption
6 from domain knowledge.
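The narrowing in part 1 is plain stdlib `warnings` usage — filters keyed on message regexes instead of the whole `UserWarning` category. A self-contained sketch with stand-in message strings (the exact library warning texts are not reproduced here):

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # narrow, message-targeted filters -- NOT a blanket category=UserWarning
    warnings.filterwarnings("ignore", message=r".*pweight.*")
    warnings.filterwarnings("ignore", message=r".*continuous_near_d_lower.*Assumption.*")

    warnings.warn("pweight normalization applied", UserWarning)                        # suppressed
    warnings.warn("continuous_near_d_lower relies on Assumption 5 or 6", UserWarning)  # suppressed
    warnings.warn("QUG pretest deferred under survey design", UserWarning)             # still fires

messages = [str(w.message) for w in caught]
# only the QUG advisory survives -- the load-bearing methodology signal is not swallowed
```

Note that `filterwarnings(message=...)` matches the regex against the start of the warning message, so the leading `.*` makes the patterns match anywhere in the string.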
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
R7's narrowing filter pass applied the same two-message filter (`pweight` + `Assumption 5/6`) uniformly to all five fit cells. The §3 interpretation prose then claimed the headline fit cell "lets [the Assumption 5/6 warning] surface" — which contradicted the actual cell behavior (the warning was suppressed there too). Removed the Assumption filter from the §3 headline fit cell only, so the warning fires once on the canonical first fit (matching the prose). All four subsequent fit cells (§4 event-study ratio, §5 event-study + cband, §6 overall workflow, §6 event-study workflow) keep both filters because the warning is redundant there. Added a NB comment in the headline fit cell explaining the deliberate omission. Pattern matches T20's L229 idiom: headline fit fires the warning once, prose explains it, subsequent fits narrowly filter as redundant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Eight rounds of CI-review iteration tightened methodology
precision but left the notebook prose denser than necessary —
implementation detail and version bookkeeping had crept into §3,
§4, §5, and §7 alongside the pedagogical arc. This pass prunes
those without regressing on any methodology contract:
- §3 setup paragraph: dropped the file:line dump
(`had.py:3747-3760`, `:3803-3808`) and the redundant
weighted-vs-unweighted point-by-point enumeration. The
three-point weight-consumption claim (`tau_bc`, weighted ΔY
mean, weighted denominator) is preserved in compact form.
- §3 Assumption 5/6 note: trimmed from 15 lines to 11. Kept all
load-bearing content (Assumption 6 / Assumption 5; not testable
from data; §6 diagnostics necessary but not sufficient; domain
knowledge justification; paired-with-QUG-deferral framing).
- §4 opener: restructured to lead with intuition (few states near
d_lower → small lever for PSU correlation), with the formal
`WAS_{d̲}` definition pushed into a `**Formal definition.**`
callout. Both halves are preserved — the formal definition is
unchanged in content, just demoted from the lead.
- §5: dropped the "(Phase 4.5 B composition; ...)" parenthetical
(internal version bookkeeping, not user-facing).
- §7 methodologist block: tightened from a numbered list with two
verbatim verdict quotes to a compact two-clause description of
the two paths plus the shared verdict suffix quoted once.
`report.yatchew` / `report.stute = None` callout on the
event-study path preserved. The SE-inflation-is-modest
explanation (with section 4 cross-link) preserved.
Methodology preservation verified against 14 load-bearing anchors:
estimand definition, Assumption 5/6 caveat, non-testability,
QUG-under-survey deferral, Phase 4.5 C0 label, Stute + Yatchew
surfaces, joint pretrends + homogeneity surfaces, ES-path
`yatchew/stute is None`, Binder TSL composition, local-linear
boundary fit description, PSU x period shock mechanism. All 14
still present in the rendered prose.
31/31 drift tests still pass (the drift suite anchors load-bearing
claims via the runtime API contract, not the notebook prose, so
prose tightening is structurally safe).
Diff: +57/-79 (net 22-line reduction in tutorial body).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
The §5 matplotlib event-study plot hard-coded `yerr=1.96 * es.se` (Normal-theory pointwise CI) but the table immediately above uses `es.conf_int_low/high`, which the estimator computes with `t` critical values and `df_survey` on the survey path (`diff_diff/had.py:4352-4445`, `diff_diff/utils.py:38-46,177-210`). The plot silently understated uncertainty and contradicted its own neighboring table — a real methodology bug, not just prose. The cband ribbon was already drawn from `es.cband_low/high` (unaffected); only the pointwise error bars were broken.

Fix:
- Plot now builds asymmetric `yerr` from the stored `es.conf_int_low/high`. matplotlib's `errorbar` accepts a (2, n) array of `[lower_distances, upper_distances]`, which is what the estimator's stored endpoints encode (no need to back out the implied t critical value manually).
- Legend label changed from "point + pointwise CI" to "point + pointwise CI (survey-aware t)" to flag the inference family in the figure.

Drift coverage:
- New `test_event_study_plot_uses_stored_pointwise_ci_endpoints` inspects the notebook source and rejects both the `1.96 * np.asarray(es.se)` and `1.96 * es.se` patterns, AND requires that the plot cell references `conf_int_low` / `conf_int_high`. This is a source-level static check (the plot cell has no return value to introspect at runtime), but it catches exactly this class of regression.

Brings the drift suite from 31 to 32 tests; CHANGELOG / REGISTRY counts updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
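The asymmetric-`yerr` construction is a general matplotlib idiom; the arrays below are illustrative stand-ins for the stored `es.att` / `es.conf_int_low` / `es.conf_int_high` fields, not values from the tutorial.

```python
import numpy as np

# stand-ins for the stored per-horizon estimates and t-based CI endpoints
att = np.array([0.10, 0.25, 0.40])
ci_low = np.array([0.02, 0.15, 0.28])
ci_high = np.array([0.19, 0.36, 0.53])

# errorbar accepts a (2, n) array of [lower_distances, upper_distances];
# building it from the stored endpoints keeps any asymmetry intact
yerr = np.vstack([att - ci_low, ci_high - att])

# usage with matplotlib:
# ax.errorbar(horizons, att, yerr=yerr, fmt="o",
#             label="point + pointwise CI (survey-aware t)")
```

Because the distances are derived from the stored endpoints, any t-based widening (or asymmetry) in the estimator's CI carries straight through to the figure.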
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Tech Debt: The remaining survey-path limitations referenced by T22 are already explicitly tracked/documented rather than silently introduced (TODO.md:L101-L115, docs/methodology/REGISTRY.md:L2452-L2454).
Documentation/Tests: The new drift file materially improves notebook regression coverage, including the exact CI-construction bug from the prior review and the weighted
Sphinx HTML build failed with `-W` (warnings as errors) because the new tutorial was included nowhere in the docs toctree:

docs/tutorials/22_had_survey_design.ipynb: WARNING: document isn't included in any toctree [toc.not_included]

Added the missing entry to the "Tutorials: Business Applications" toctree at `docs/index.rst`, alongside T20 and T21. Same convention as the existing HAD-series entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Methodology: The changed tutorial and supporting docs are consistent with the Methodology Registry and the implementation for weighted continuous HAD, survey-aware event-study inference, and the Phase 4.5 C0 QUG deferral.
Code Quality: The new drift suite materially locks the prior review regressions instead of only relying on notebook execution smoke tests.
Performance: This PR is docs/tests only and does not change estimator hot paths.
Maintainability: The tutorial is wired into the index, API docs, decision tree, guide inventory, and doc dependency map consistently.
Tech Debt: The remaining limitations are explicitly tracked rather than silently introduced: replicate-weight designs,
Security: No findings.
Documentation/Tests: The new drift file covers the important user-facing claims in T22, including survey CI construction, verdict wording, workflow-surface separation, and the weighted
`test_event_study_plot_uses_stored_pointwise_ci_endpoints` imported
`nbformat` and read the notebook from the repo's `docs/tutorials/`
directory. CI Python Tests run from `/tmp/tests/` (isolated install
of the wheel, no repo-tree access) and don't include nbformat in
the runtime deps, so the test errored:
ModuleNotFoundError: No module named 'nbformat'
tests/test_t22_had_survey_design_drift.py:409
Two guards added (per `feedback_golden_file_pytest_skip` — same
pattern that benchmarks/data/*.json drift tests use):
1. `nbformat = pytest.importorskip("nbformat")` — skips when
the optional dep is missing.
2. `if not nb_path.exists(): pytest.skip(...)` — skips on the
isolated-install matrix where docs/ isn't copied alongside
tests/.
The test runs in any environment that has both nbformat and the
repo tree (dev workspace + tutorial-exec CI workflows), which is
where it actually adds value. The Python Tests matrix doesn't
need to lock notebook source against the prose/code mismatch the
test was added to prevent.
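The two guards in isolation — the helper name `load_notebook_or_skip`, the path, and the skip messages are illustrative, not the test's actual structure:

```python
import pytest
from pathlib import Path

def load_notebook_or_skip(relpath):
    # Guard 1: optional dependency -- skip instead of ModuleNotFoundError
    nbformat = pytest.importorskip("nbformat")
    nb_path = Path(relpath)
    # Guard 2: the isolated-install matrix has no docs/ tree alongside tests/
    if not nb_path.exists():
        pytest.skip(f"{relpath} not present (isolated wheel install)")
    return nbformat.read(nb_path, as_version=4)
```

Either guard raises pytest's internal skip exception, so the test is reported as skipped (not errored) on matrices that lack the dependency or the repo tree.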
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings; one P2 test/CI coverage gap remains.
Methodology: The changed tutorial/docs are consistent with the shipped methodology for weighted continuous HAD, survey-aware event-study inference, and the Phase 4.5 C0 QUG deferral.
Code Quality: The new drift suite usefully locks the prior review regressions plus the workflow-surface separation, instead of relying only on notebook execution smoke tests.
Performance: This PR is docs/tests only and does not change estimator hot paths.
Maintainability: The T22 references are wired consistently across the index, API docs, doc-deps map, decision tree, roadmap, and guide inventories.
Tech Debt: The remaining HAD survey-path limitations are explicitly tracked/documented rather than silently introduced.
Security: No findings.
Documentation/Tests
Summary
Closes the Phase 5 wave 2 second slice tutorial gap unblocked by the survey-strata gate lift on the HAD Stute pretest family.
- docs/tutorials/22_had_survey_design.ipynb (26 cells, 8 sections) walks HeterogeneousAdoptionDiD + did_had_pretest_workflow end-to-end on a BRFSS-shape stratified household-survey panel: 60 states organized as 5 strata × 6 PSUs/stratum × 2 states/PSU, post-stratification raking weights with CV ~ 0.30, FPC = 30 PSUs/stratum, PSU × period interaction shocks injected so cluster correlation survives DiD first-differencing.
- tests/test_t22_had_survey_design_drift.py (25 tests / 5 groups) locks panel composition (deterministic exact pins), naive-vs-survey SE inflation direction (sign-only structural anchor — HAD's WAS-d_lower IF concentration caps inflation around 1.10x), design auto-detection, event-study cband-vs-pointwise width ordering, _QUG_DEFERRED_SUFFIX substring on report.verdict for both overall and event-study paths, the distinct report.summary() QUG-skip note on the event-study path, deterministic Yatchew sigma2_*, and bootstrap p-value tolerance bands at >= 0.25 abs per feedback_strata_bootstrap_path_divergence.
- Cross-surface updates: T20 and T21 Extensions bullets gain forward-pointers to T22 (T20 also drops the deprecated weights= phrasing in favor of survey_design=); the practitioner decision tree HAD universal-rollout and survey sections each gain a .. tip:: cross-link to T22 (adjacent to T20 / T17, NOT displacing); docs/api/had.rst gains a Survey-aware fit cross-reference; docs/survey-roadmap.md gains a Phase 4.5 C HAD Stute Survey Workflow section; diff_diff/guides/llms.txt and llms-practitioner.txt carry T22 inventory entries; docs/doc-deps.yaml wires T22 as a dependent of both had.py and had_pretests.py; REGISTRY closers L2529 + L2577 flipped; TODO row L115 marked shipped; CHANGELOG carries the new Unreleased Added entry plus closer flips on the T21 (PR #409, "Tutorial 21: HAD pre-test workflow (composite QUG + Stute + Yatchew)") and HAD-handlers (PR #402, "HAD Phase 5 wave 1: agent-facing surfaces (_handle_had + llms-full.txt)") "queued tutorial" lines.

Methodology references (required if estimator / math changes)
- No diff_diff/, rust/src/, or docs/methodology/REGISTRY.md source-side edits beyond the closer-flip prose.

Validation
- tests/test_t22_had_survey_design_drift.py (25 new tests)

Security / privacy