Add Tutorial 22: Survey-Weighted HAD walkthrough #440
Conversation
|
Overall Assessment Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
Demonstrates the now-fully-supported HeterogeneousAdoptionDiD + did_had_pretest_workflow workflow under SurveyDesign(weights, strata, psu, fpc) end-to-end on a BRFSS-shape state-rollout panel (5 strata × 6 PSUs/stratum × 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum; PSU × period interaction shocks injected so cluster correlation survives DiD first-differencing). Closes the Phase 5 wave 2 second-slice tutorial gap that the survey-strata gate lift unblocked.

Eight sections:
1. motivation;
2. panel + in-notebook helper for attaching survey columns to a HAD panel;
3. naive vs survey-aware headline fit with side-by-side ATT/SE/CI table (~10% SE inflation, sign-only direction asserted);
4. a dedicated discussion of why the SE inflation is modest for HAD specifically (WAS-d_lower IF concentration at the boundary vs full-panel regression coefficients);
5. event-study with sup-t cband under the survey design;
6. pretest workflow on both overall and event-study paths, walking the Phase 4.5 C0 QUG-deferred verdict suffix and the now-supported stratified-clustered Stute multiplier bootstrap;
7. communicating-to-leadership two-paragraph template;
8. Extensions + Summary Checklist surfacing the still-deferred lonely_psu='adjust' + singleton-strata and replicate-weight designs, and the permanent QUG-under-survey C0 deferral.
Companion drift-test file (25 tests across 5 groups) locks: panel composition with deterministic exact pins; naive-vs-survey SE inflation direction (sign-only structural anchor; HAD's IF concentration caps inflation around 1.10x even at large PSU shock SD); design auto-detection to continuous_near_d_lower; event-study cband-vs-pointwise width ordering; _QUG_DEFERRED_SUFFIX substring on report.verdict for both overall and event-study paths; the distinct report.summary() QUG-skip note on the event-study path; deterministic Yatchew sigma2_* pins; and bootstrap p-value tolerance bands at >= 0.25 abs per feedback_strata_bootstrap_path_divergence.

Cross-surface updates:
- T20 and T21 Extensions bullets gain forward-pointers to T22 (T20 also drops the deprecated weights= kwarg phrasing in favor of survey_design=).
- Practitioner decision tree: the HAD universal-rollout and survey sections each gain a tip cross-link to T22 (adjacent to T20 / T17, not displacing).
- api/had.rst gains a Survey-aware fit cross-reference.
- survey-roadmap.md gains a Phase 4.5 C HAD Stute Survey Workflow section.
- Bundled llms.txt and llms-practitioner.txt carry T22 inventory entries.
- doc-deps.yaml wires T22 as a dependent of both had.py and had_pretests.py.
- REGISTRY closers L2529 + L2577 flipped to shipped; TODO row L115 marked shipped.
- CHANGELOG carries the new Unreleased Added entry plus closer flips at the T21 (PR #409) and HAD handlers (PR #402) queued-tutorial closer lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
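The panel geometry above can be sketched in plain numpy. This is an illustrative stand-in, not the tutorial's actual helper — the function name `make_panel` and every parameter here are hypothetical. The point it demonstrates is the shock placement: a shock drawn per (PSU, period) pair does not cancel under first-differencing, whereas a pure PSU fixed effect would.

```python
import numpy as np

def make_panel(seed=87, n_strata=5, psus_per_stratum=6, states_per_psu=2,
               n_periods=6, psu_shock_sd=0.5):
    """Sketch of the BRFSS-shape geometry: 5 strata x 6 PSUs x 2 states = 60 states.

    PSU x period interaction shocks are drawn per (psu, period), so the
    cluster correlation survives DiD first-differencing.
    """
    rng = np.random.default_rng(seed)
    rows = []
    for s in range(n_strata):
        for p in range(psus_per_stratum):
            psu_id = s * psus_per_stratum + p
            # one shock per (PSU, period): this does NOT difference away
            psu_period_shock = rng.normal(0.0, psu_shock_sd, size=n_periods)
            for u in range(states_per_psu):
                state_id = psu_id * states_per_psu + u
                noise = rng.normal(0.0, 1.0, size=n_periods)
                y = psu_period_shock + noise
                for t in range(n_periods):
                    rows.append((state_id, psu_id, s, t, y[t]))
    return np.array(rows)  # columns: state, psu, stratum, period, outcome

panel = make_panel()
```

60 states over 6 periods gives 360 rows across 30 PSUs, matching the 5 × 6 × 2 geometry in the description.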
ebb73de to 1eabce5
🔁 AI review rerun (requested by @igerber) Head SHA:
Static-review note: I could not execute
P1 #1 — T22 §3 prose contradicted the implementation. It said the analytical local-linear at d_lower "does not consume the survey weights in the slope". The weighted continuous path (diff_diff/had.py:3744-3810) consumes weights in (a) the local-linear `tau_bc` boundary fit, (b) the numerator `np.average(dy_arr, weights=weights_arr)`, AND (c) the denominator `np.average(d_reg, weights=weights_arr)`. Rewrote §3 to say the two ATTs are close on this DGP because the weight CV (~0.30) and the dose-distribution shape do not co-vary strongly enough to shift the boundary slope materially — NOT because weights are ignored. Added two drift tests for the weighted point-estimation contract: `test_survey_att_differs_from_naive_att` (sign-only — if weights were ignored the values would be bit-identical) and `test_survey_att_matches_weighted_denominator_contract` (verifies the algebraic identity `att = (dy_mean_w - tau_bc) / den_w` from `_fit_continuous`).

P1 #2 — T22 §7 leadership block conflated the overall and event-study pretest paths. It said "all three linearity diagnostics" and "Yatchew-HR fails-to-reject under both null modes" (T22 doesn't run the side panel — that's T21), and quoted the overall-path verdict string while describing event-study joint diagnostics. Split the methodologist write-up by path: overall = `Stute + Yatchew`; event-study = `joint pre-trends + joint linearity` with explicit `report.yatchew is None` and `report.stute is None` callouts. Added three drift tests to lock the per-path workflow surfaces: `test_overall_report_pretrends_joint_is_none` (overall has no joint diagnostics), `test_event_study_report_stute_and_yatchew_are_none` (event-study has no single-horizon Stute or Yatchew), and `test_overall_and_event_study_verdict_prefixes_distinct` (the two paths share `_QUG_DEFERRED_SUFFIX` but have distinct verdict prefixes; locks the §7 prose against re-conflating).
P3 — CHANGELOG L11 claimed `diff_diff/guides/llms-full.txt` got a T22 inventory entry; the file was intentionally scoped out per the plan-review feedback (would expand scope beyond T22 to T17-T21 backfill). Updated the closer to reflect the actual scope and flag llms-full.txt as a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
P3 #1 — T22 §3 prose said the two ATTs land "within bootstrap noise of each other", but the headline path uses analytical Binder/TSL Taylor-linearized SEs, not a bootstrap. Replaced with "numerically close on this DGP" to avoid misstating the inference mechanism.

P3 #2 — CHANGELOG L11 and REGISTRY L2577 said "25 tests across 5 groups" for the T22 drift suite, but the R1 fix added 5 more tests (2 in Group F: workflow-surface separation; 3 in Group G: weighted point-estimation contract). Updated both summaries to "30 tests across 7 groups" with one-line descriptions of the two new groups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
P1 — T22 §4 + §5 prose claimed pre-period event-study horizons have
"dose ... zero across the board" and used that premise to explain
the SE pattern. Per paper Appendix B.2 (`had.py:1827-1838`) the
event-study contract reuses `D_{g,F}` as the SINGLE dose regressor
for every horizon, including pre-period placebos — the regressor is
NOT zero on the pre-period. Rewrote both passages to base the
SE-pattern explanation on the outcome-side `ΔY_{g,t} = Y_{g,t} -
Y_{g,F-1}`: placebos have small ΔY (within-pre noise only) so the
local-linear fits low residual variance and reads small SEs;
post-period horizons have ΔY that scales with `slope * D_{g,F}`
plus noise, so residual variance is larger and SEs are larger.
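The outcome-side mechanism can be illustrated with a toy numpy simulation (every name and number here is illustrative, not the tutorial's DGP): a placebo horizon's first differences carry only within-pre noise, while a post-period horizon's first differences also carry the `slope * D` component, so their spread is visibly larger against the same common dose regressor.

```python
import numpy as np

rng = np.random.default_rng(22)
n_units = 60
d = rng.uniform(0.1, 1.0, n_units)  # common dose regressor D_{g,F}, reused at EVERY horizon
slope = 2.0

y_base = rng.normal(0.0, 0.3, n_units)                        # Y_{g,F-1}
dy_placebo = rng.normal(0.0, 0.3, n_units) - y_base           # pre-period: within-pre noise only
dy_post = slope * d + rng.normal(0.0, 0.3, n_units) - y_base  # post: scales with slope * D

# the post-period first differences have larger spread, which is what
# the per-horizon fit reads as higher residual variance and larger SEs
spread_placebo = dy_placebo.var()
spread_post = dy_post.var()
```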
P3 — `test_survey_att_matches_weighted_denominator_contract`
previously only constructed an implied `tau_bc` from the fitted
att/den_w and checked finiteness + scale, which is too weak to be
called an "identity lock" per the CHANGELOG/REGISTRY prose. Renamed
to `test_survey_att_matches_weighted_local_linear_identity` and
strengthened to three concrete identities:
1. `survey.effective_dose_mean == np.average(d - d_lower, w)`
bit-equal (1e-10).
2. Direct call to `bias_corrected_local_linear` with HAD defaults
(`kernel="epanechnikov"`, `alpha=0.05`, `boundary=0`) recovers
the SAME `tau_bc` boundary limit the estimator used.
3. `att = (mean_w(dy) - tau_bc) / den_w` matches the fitted
`survey.att` to ~1e-13 (verified locally; same float ops on
the same inputs).
The CHANGELOG/REGISTRY prose now genuinely matches the test
coverage.
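The weighted-moment identity at the heart of the strengthened test can be checked in isolation with numpy. In this sketch `tau_bc` is a stand-in scalar for the bias-corrected local-linear boundary limit, not a call into `diff_diff`; the dose, outcome, and weight arrays are synthetic.

```python
import numpy as np

rng = np.random.default_rng(87)
n, d_lower = 60, 0.1
d = rng.uniform(d_lower, 1.0, n)        # dose
dy = 2.0 * d + rng.normal(0.0, 0.3, n)  # first-differenced outcome
w = rng.uniform(0.5, 2.0, n)            # survey weights

tau_bc = 0.15  # stand-in for the bias-corrected local-linear boundary limit

dy_mean_w = np.average(dy, weights=w)        # weighted numerator moment
den_w = np.average(d - d_lower, weights=w)   # weighted effective-dose denominator
att = (dy_mean_w - tau_bc) / den_w

# the identity is exact float algebra: recomputing from the same inputs
# reproduces att bit-for-bit, which is why a ~1e-13 check is reasonable
assert att == (np.average(dy, weights=w) - tau_bc) / np.average(d - d_lower, weights=w)
```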
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
R2 fixed two §4/§5 sites that misframed event-study mechanics. R3
reviewer caught a third site at the §4 cell that introduces the
per-horizon SE-ratio table (`L422-426`). It said the post-period
horizons "aggregate across all post-period observations" and the
variance is "read off the full panel" — wrong. HAD event-study
fits each horizon as a SEPARATE local-linear on that horizon's
first differences (`ΔY_{g,t} = Y_{g,t} - Y_{g,F-1}`) against the
common `D_{g,F}` regressor (paper Appendix B.2;
`diff_diff/had.py:4298-4451`); the pointwise per-horizon SE reads
each horizon's own residual variance, not a panel aggregate.
Also softened the "per-horizon SE ratio should be larger than the
overall ratio" claim — the empirical per-horizon ratios on the
locked seed are mixed ([1.087, 1.0, 0.816, 1.203, 1.127, 1.085,
1.126]), some larger and some smaller than the overall ~1.10x. The
revised text frames this as an empirical observation of how PSU
correlation interacts with each horizon's `ΔY` distribution, NOT a
methodological guarantee.
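The separate-fit-per-horizon mechanics can be mimicked with plain OLS — a deliberate simplification, since the actual estimator uses a weighted local-linear boundary fit, not global OLS. What the sketch preserves is the structure: each horizon's SE is computed from that horizon's own residuals against the common dose regressor, never from a pooled panel aggregate.

```python
import numpy as np

rng = np.random.default_rng(22)
n_units, horizons = 60, 4
d = rng.uniform(0.1, 1.0, n_units)            # common D_{g,F}, reused at every horizon
X = np.column_stack([np.ones(n_units), d])

per_horizon_se = []
for h in range(horizons):
    # Delta-Y for this horizon only (illustrative slope grows with h)
    dy_h = (h + 1) * 0.5 * d + rng.normal(0.0, 0.3, n_units)
    beta, *_ = np.linalg.lstsq(X, dy_h, rcond=None)
    resid = dy_h - X @ beta
    sigma2_h = resid @ resid / (n_units - 2)   # this horizon's OWN residual variance
    xtx_inv = np.linalg.inv(X.T @ X)
    per_horizon_se.append(np.sqrt(sigma2_h * xtx_inv[1, 1]))
```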
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
R3 reviewer flagged that CHANGELOG/REGISTRY prose says the T22 drift suite "pins bootstrap p-values with >= 0.25 abs tolerance bands", but the actual assertions were `0.10 <= p <= 0.95` (width 0.85). The prose was stronger than the test coverage. Tightened to T21-style anchored bands centered on the seed=22 captured values:
- stute (overall): `0.27 <= p <= 0.57` (was 0.10-0.95; captured ~0.42)
- pretrends_joint: `0.24 <= p <= 0.54` (captured ~0.39)
- homogeneity_joint: `0.26 <= p <= 0.56` (captured ~0.41)

Each band has abs width ~0.30, satisfying the >= 0.25 abs band contract from `feedback_strata_bootstrap_path_divergence` while catching drift in either direction (toward rejection or toward an even cleaner pass) rather than only rejecting on cross-the-line moves. Aligns the CHANGELOG/REGISTRY claim with the test coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
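The anchored-band idiom amounts to a two-sided assertion around the seeded center. A minimal sketch — the helper name and centers below are illustrative, not the suite's actual code:

```python
def assert_anchored_band(p, center, half_width=0.15):
    """Two-sided drift guard: fails on drift toward rejection OR toward an
    implausibly cleaner pass, unlike a one-sided `p > alpha` check."""
    lo, hi = center - half_width, center + half_width
    assert lo <= p <= hi, f"p={p} drifted outside [{lo:.2f}, {hi:.2f}]"

# captured seed=22 centers from the comment above
assert_anchored_band(0.42, center=0.42)  # stute (overall)
assert_anchored_band(0.39, center=0.39)  # pretrends_joint
assert_anchored_band(0.41, center=0.41)  # homogeneity_joint
```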
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Tech Debt: The remaining survey-path deferrals are explicitly surfaced rather than silently deferred.
P2 — T22 §3 said "Both fits use the same weighted local-linear estimator at d_lower". `_fit_continuous` only switches to weighted moments when `weights_arr is not None` (`had.py:3747-3760`, `:3803-3808`); the naive fit uses unweighted moments. Rewrote §3 to say "same local-linear estimator family" with an explicit per-fit moment-form distinction (naive: unweighted; survey: weighted via `bias_corrected_local_linear(..., weights=weights_arr)` plus weighted `np.average` for `dy_mean` and the denominator).

P3 #1 — CHANGELOG/REGISTRY/test docstring/comments said the bootstrap p-value pins use ">= 0.25 abs tolerance bands", but the R3-tightened bands are width 0.30 total (± 0.15 around seeded centers). The ">= 0.25 abs tolerance" wording is ambiguous between half-width and total-width interpretations. Updated all three surfaces to "anchored windows of total width 0.30 (± 0.15 around seeded centers)" so the prose is unambiguous and matches the actual assertions.

P3 #2 — T22 §3 quotes "around 1.10x" SE inflation but the drift suite only checked direction (`survey.se > naive.se`); the seeded ratio could drift to 1.05 or 1.20 silently. Added `test_survey_se_inflation_ratio_in_band` asserting `1.00 <= ratio <= 1.20` — locks the seed=87 captured ratio (~1.0985) tightly enough to flag drift but loosely enough to not flake on RNG-path differences. Bumped CHANGELOG/REGISTRY test count from 30 → 31 to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Tech Debt: The remaining HAD survey-path limitations are still properly tracked rather than silently deferred.
P2 — T22 §4 opener said WAS_d_lower is "a weighted average of
unit-specific slopes in a local-linear neighborhood at d_lower",
which conflated the target parameter with its nuisance estimator.
The shipped contract (REGISTRY § HAD; `had.py:21-31`) defines
`WAS_{d̲} = (E[ΔY] - lim_{d↓d̲} E[ΔY | D_2 ≤ d]) / E[D_2 - d̲]`
— an average slope above d_lower, NOT a neighborhood estimand.
The local-linear boundary fit is one component (estimating the
limit term). The leading-order variance still concentrates at the
boundary because that's where the only nonparametric estimation
happens, but that is a property of the variance, not the
estimand. Rewrote the §4 opener to make this distinction
explicit.
P3 — T22 §5 attributed the survey event-study cband to "Phase
4.5 C composition", but per REGISTRY:2366-2380 the weighted
event-study sup-t cband is Phase 4.5 B; Phase 4.5 C is the
pretest/workflow extension demonstrated in §6. Updated to
"Phase 4.5 B composition" with a one-clause note that the §6
material is the Phase 4.5 C work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
The §4 estimand-vs-estimator fix from R5 had a residual echo at
the §7 leadership-block clause: "HAD's WAS-d_lower estimand is a
local-linear at d_lower". Same conflation class — collapses the
target parameter (`WAS_{d̲} = (E[ΔY] - lim_{d↓d̲} E[ΔY | D_2 ≤ d])
/ E[D_2 - d̲]`, the average slope above d_lower) with the nuisance
estimator (the local-linear boundary fit used to estimate the
limit term). Rewrote the clause along the lines the reviewer
suggested: HAD uses a local-linear boundary fit at d_lower to
estimate the boundary-limit term in the WAS-d_lower formula;
variance is dominated by the few states near the boundary.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
T22 walked the Design 1 (`continuous_near_d_lower`) survey-aware
workflow but never restated the non-testable Assumption 5/6
caveat that point identification of `WAS_d_lower` rests on (the
QUG-under-survey deferral was the only methodology caveat T22's
prose called out). This made the survey-path caveat discussion
read as if QUG deferral + linearity diagnostics exhaust the
methodology risk, which is not the full HAD contract.
Compounding factor: the five HAD/pretest fit cells used
`warnings.filterwarnings("ignore", category=UserWarning)` —
a blanket filter that silently swallowed the
`continuous_near_d_lower ... Assumption 5 or 6` UserWarning the
library fires on every Design 1 fit. So users running the
notebook saw neither the warning nor the prose explanation.
Two-part fix matching the reviewer's recommendation and T20's
established convention:
1. Narrow each `filterwarnings` call from the catch-all
`category=UserWarning` to two specific message patterns:
`r".*pweight.*"` (suppress the noisy normalization message)
and `r".*continuous_near_d_lower.*Assumption.*"` (suppress
the redundant Assumption 5/6 advisory on subsequent fits;
the §3 first fit lets it surface naturally so users see it
at least once). This mirrors T20's own pattern at
`20_had_brand_campaign.ipynb` where the headline fit lets the
warning fire and the second event-study cell narrowly filters
it out as redundant. Other UserWarnings — notably the
QUG-deferred-under-survey advisory from
`did_had_pretest_workflow` — are now no longer suppressed and
fire as the load-bearing user-facing methodology signal that
`feedback_no_silent_failures` requires.
2. Add an Assumption 5/6 caveat note to the §3 "Reading the
table" interpretation cell. Mirrors T20's L229 prose: explains
that point identification of `WAS_d_lower` requires Assumption
6 (or Assumption 5 for sign only); both are about local
linearity of the dose-response near `d_lower` and are not
testable from data; the §6 linearity diagnostics are
necessary but not sufficient; users should justify Assumption
6 from domain knowledge.
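The narrowing in part 1 is plain stdlib `warnings` usage — filters keyed on message regexes instead of the whole `UserWarning` category. A self-contained sketch with stand-in message strings (the exact library warning texts are not reproduced here):

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # narrow, message-targeted filters -- NOT a blanket category=UserWarning
    warnings.filterwarnings("ignore", message=r".*pweight.*")
    warnings.filterwarnings("ignore", message=r".*continuous_near_d_lower.*Assumption.*")

    warnings.warn("pweight normalization applied", UserWarning)                        # suppressed
    warnings.warn("continuous_near_d_lower relies on Assumption 5 or 6", UserWarning)  # suppressed
    warnings.warn("QUG pretest deferred under survey design", UserWarning)             # still fires

messages = [str(w.message) for w in caught]
# only the QUG advisory survives -- the load-bearing methodology signal is not swallowed
```

Note that `filterwarnings(message=...)` matches the regex against the start of the warning message, so the leading `.*` makes the patterns match anywhere in the string.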
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
R7's narrowing filter pass applied the same two-message filter (`pweight` + `Assumption 5/6`) uniformly to all five fit cells. The §3 interpretation prose then claimed the headline fit cell "lets [the Assumption 5/6 warning] surface" — which contradicted the actual cell behavior (the warning was suppressed there too). Removed the Assumption filter from the §3 headline fit cell only, so the warning fires once on the canonical first fit (matching the prose). All four subsequent fit cells (§4 event-study ratio, §5 event-study + cband, §6 overall workflow, §6 event-study workflow) keep both filters because the warning is redundant there. Added a NB comment in the headline fit cell explaining the deliberate omission. Pattern matches T20's L229 idiom: headline fit fires the warning once, prose explains it, subsequent fits narrowly filter as redundant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Eight rounds of CI-review iteration tightened methodology
precision but left the notebook prose denser than necessary —
implementation detail and version bookkeeping had crept into §3,
§4, §5, and §7 alongside the pedagogical arc. This pass prunes
those without regressing on any methodology contract:
- §3 setup paragraph: dropped the file:line dump
(`had.py:3747-3760`, `:3803-3808`) and the redundant
weighted-vs-unweighted point-by-point enumeration. The
three-point weight-consumption claim (`tau_bc`, weighted ΔY
mean, weighted denominator) is preserved in compact form.
- §3 Assumption 5/6 note: trimmed from 15 lines to 11. Kept all
load-bearing content (Assumption 6 / Assumption 5; not testable
from data; §6 diagnostics necessary but not sufficient; domain
knowledge justification; paired-with-QUG-deferral framing).
- §4 opener: restructured to lead with intuition (few states near
d_lower → small lever for PSU correlation), with the formal
`WAS_{d̲}` definition pushed into a `**Formal definition.**`
callout. Both halves are preserved — the formal definition is
unchanged in content, just demoted from the lead.
- §5: dropped the "(Phase 4.5 B composition; ...)" parenthetical
(internal version bookkeeping, not user-facing).
- §7 methodologist block: tightened from a numbered list with two
verbatim verdict quotes to a compact two-clause description of
the two paths plus the shared verdict suffix quoted once.
`report.yatchew` / `report.stute = None` callout on the
event-study path preserved. The SE-inflation-is-modest
explanation (with section 4 cross-link) preserved.
Methodology preservation verified against 14 load-bearing anchors:
estimand definition, Assumption 5/6 caveat, non-testability,
QUG-under-survey deferral, Phase 4.5 C0 label, Stute + Yatchew
surfaces, joint pretrends + homogeneity surfaces, ES-path
`yatchew/stute is None`, Binder TSL composition, local-linear
boundary fit description, PSU x period shock mechanism. All 14
still present in the rendered prose.
31/31 drift tests still pass (the drift suite anchors load-bearing
claims via the runtime API contract, not the notebook prose, so
prose tightening is structurally safe).
Diff: +57/-79 (net 22-line reduction in tutorial body).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
The §5 matplotlib event-study plot hard-coded `yerr=1.96 * es.se` (Normal-theory pointwise CI) but the table immediately above uses `es.conf_int_low/high`, which the estimator computes with `t` critical values and `df_survey` on the survey path (`diff_diff/had.py:4352-4445`, `diff_diff/utils.py:38-46,177-210`). The plot silently understated uncertainty and contradicted its own neighboring table — a real methodology bug, not just prose. The cband ribbon was already drawn from `es.cband_low/high` (unaffected); only the pointwise error bars were broken.

Fix:
- Plot now builds asymmetric `yerr` from the stored `es.conf_int_low/high`. matplotlib's `errorbar` accepts a (2, n) array of `[lower_distances, upper_distances]`, which is what the estimator's stored endpoints encode (no need to back out the implied t critical value manually).
- Legend label changed from "point + pointwise CI" to "point + pointwise CI (survey-aware t)" to flag the inference family in the figure.

Drift coverage:
- New `test_event_study_plot_uses_stored_pointwise_ci_endpoints` inspects the notebook source and rejects both the `1.96 * np.asarray(es.se)` and `1.96 * es.se` patterns, AND requires that the plot cell references `conf_int_low` / `conf_int_high`. This is a source-level static check (the plot cell has no return value to introspect at runtime), but it catches exactly this class of regression.

Brings the drift suite from 31 to 32 tests; CHANGELOG / REGISTRY counts updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
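The asymmetric-`yerr` construction is a general matplotlib idiom; the arrays below are illustrative stand-ins for the stored `es.att` / `es.conf_int_low` / `es.conf_int_high` fields, not values from the tutorial.

```python
import numpy as np

# stand-ins for the stored per-horizon estimates and t-based CI endpoints
att = np.array([0.10, 0.25, 0.40])
ci_low = np.array([0.02, 0.15, 0.28])
ci_high = np.array([0.19, 0.36, 0.53])

# errorbar accepts a (2, n) array of [lower_distances, upper_distances];
# building it from the stored endpoints keeps any asymmetry intact
yerr = np.vstack([att - ci_low, ci_high - att])

# usage with matplotlib:
# ax.errorbar(horizons, att, yerr=yerr, fmt="o",
#             label="point + pointwise CI (survey-aware t)")
```

Because the distances are derived from the stored endpoints, any t-based widening (or asymmetry) in the estimator's CI carries straight through to the figure.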
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Tech Debt: The remaining survey-path limitations referenced by T22 are already explicitly tracked/documented rather than silently introduced (TODO.md:L101-L115, docs/methodology/REGISTRY.md:L2452-L2454).
Documentation/Tests: The new drift file materially improves notebook regression coverage, including the exact CI-construction bug from the prior review and the weighted
Sphinx HTML build failed with `-W` (warnings as errors) because the new tutorial was included nowhere in the docs toctree:

docs/tutorials/22_had_survey_design.ipynb: WARNING: document isn't included in any toctree [toc.not_included]

Added the missing entry to the "Tutorials: Business Applications" toctree at `docs/index.rst`, alongside T20 and T21. Same convention as the existing HAD-series entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Methodology: The changed tutorial and supporting docs are consistent with the Methodology Registry and the implementation for weighted continuous HAD, survey-aware event-study inference, and the Phase 4.5 C0 QUG deferral.
Code Quality: The new drift suite materially locks the prior review regressions instead of only relying on notebook execution smoke tests.
Performance: This PR is docs/tests only and does not change estimator hot paths.
Maintainability: The tutorial is wired into the index, API docs, decision tree, guide inventory, and doc dependency map consistently.
Tech Debt: The remaining limitations are explicitly tracked rather than silently introduced: replicate-weight designs,
Security: No findings.
Documentation/Tests: The new drift file covers the important user-facing claims in T22, including survey CI construction, verdict wording, workflow-surface separation, and the weighted
`test_event_study_plot_uses_stored_pointwise_ci_endpoints` imported
`nbformat` and read the notebook from the repo's `docs/tutorials/`
directory. CI Python Tests run from `/tmp/tests/` (isolated install
of the wheel, no repo-tree access) and don't include nbformat in
the runtime deps, so the test errored:
ModuleNotFoundError: No module named 'nbformat'
tests/test_t22_had_survey_design_drift.py:409
Two guards added (per `feedback_golden_file_pytest_skip` — same
pattern that benchmarks/data/*.json drift tests use):
1. `nbformat = pytest.importorskip("nbformat")` — skips when
the optional dep is missing.
2. `if not nb_path.exists(): pytest.skip(...)` — skips on the
isolated-install matrix where docs/ isn't copied alongside
tests/.
The test runs in any environment that has both nbformat and the
repo tree (dev workspace + tutorial-exec CI workflows), which is
where it actually adds value. The Python Tests matrix doesn't
need to lock notebook source against the prose/code mismatch the
test was added to prevent.
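The two guards in isolation — the helper name `load_notebook_or_skip`, the path, and the skip messages are illustrative, not the test's actual structure:

```python
import pytest
from pathlib import Path

def load_notebook_or_skip(relpath):
    # Guard 1: optional dependency -- skip instead of ModuleNotFoundError
    nbformat = pytest.importorskip("nbformat")
    nb_path = Path(relpath)
    # Guard 2: the isolated-install matrix has no docs/ tree alongside tests/
    if not nb_path.exists():
        pytest.skip(f"{relpath} not present (isolated wheel install)")
    return nbformat.read(nb_path, as_version=4)
```

Either guard raises pytest's internal skip exception, so the test is reported as skipped (not errored) on matrices that lack the dependency or the repo tree.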
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings; one P2 test/CI coverage gap remains.
Methodology: The changed tutorial/docs are consistent with the shipped methodology for weighted continuous HAD, survey-aware event-study inference, and the Phase 4.5 C0 QUG deferral.
Code Quality: The new drift suite usefully locks the prior review regressions plus the workflow-surface separation, instead of relying only on notebook execution smoke tests.
Performance: This PR is docs/tests only and does not change estimator hot paths.
Maintainability: The T22 references are wired consistently across the index, API docs, doc-deps map, decision tree, roadmap, and guide inventories.
Tech Debt: The remaining HAD survey-path limitations are explicitly tracked/documented rather than silently introduced.
Security: No findings.
Documentation/Tests
Summary
Closes the Phase 5 wave 2 second slice tutorial gap unblocked by the survey-strata gate lift on the HAD Stute pretest family.
- docs/tutorials/22_had_survey_design.ipynb (26 cells, 8 sections) walks HeterogeneousAdoptionDiD + did_had_pretest_workflow end-to-end on a BRFSS-shape stratified household-survey panel: 60 states organized as 5 strata × 6 PSUs/stratum × 2 states/PSU, post-stratification raking weights with CV ~ 0.30, FPC = 30 PSUs/stratum, PSU × period interaction shocks injected so cluster correlation survives DiD first-differencing.
- tests/test_t22_had_survey_design_drift.py (25 tests / 5 groups) locks panel composition (deterministic exact pins), naive-vs-survey SE inflation direction (sign-only structural anchor — HAD's WAS-d_lower IF concentration caps inflation around 1.10x), design auto-detection, event-study cband-vs-pointwise width ordering, _QUG_DEFERRED_SUFFIX substring on report.verdict for both overall and event-study paths, the distinct report.summary() QUG-skip note on the event-study path, deterministic Yatchew sigma2_*, and bootstrap p-value tolerance bands at >= 0.25 abs per feedback_strata_bootstrap_path_divergence.
- Cross-surface updates: T20 and T21 Extensions bullets gain forward-pointers to T22 (T20 also drops the deprecated weights= phrasing in favor of survey_design=); the practitioner decision tree HAD universal-rollout and survey sections each gain a .. tip:: cross-link to T22 (adjacent to T20 / T17, NOT displacing); docs/api/had.rst gains a Survey-aware fit cross-reference; docs/survey-roadmap.md gains a Phase 4.5 C HAD Stute Survey Workflow section; diff_diff/guides/llms.txt and llms-practitioner.txt carry T22 inventory entries; docs/doc-deps.yaml wires T22 as a dependent of both had.py and had_pretests.py; REGISTRY closers L2529 + L2577 flipped; TODO row L115 marked shipped; CHANGELOG carries the new Unreleased Added entry plus closer flips on the T21 (PR #409, "Tutorial 21: HAD pre-test workflow (composite QUG + Stute + Yatchew)") and HAD-handlers (PR #402, "HAD Phase 5 wave 1: agent-facing surfaces (_handle_had + llms-full.txt)") "queued tutorial" lines.

Methodology references (required if estimator / math changes)
- No diff_diff/, rust/src/, or docs/methodology/REGISTRY.md source-side edits beyond the closer-flip prose.

Validation
- tests/test_t22_had_survey_design_drift.py (25 new tests)

Security / privacy