diff --git a/docs/methodology/papers/athey-2025-review.md b/docs/methodology/papers/athey-2025-review.md new file mode 100644 index 00000000..6ca71516 --- /dev/null +++ b/docs/methodology/papers/athey-2025-review.md @@ -0,0 +1,360 @@ +# Paper Review: Triply Robust Panel Estimators + +**Authors:** Susan Athey, Guido Imbens, Zhaonan Qu, Davide Viviano +**Citation:** Athey, S., Imbens, G.W., Qu, Z., & Viviano, D. (2025). Triply Robust Panel Estimators. *arXiv preprint arXiv:2508.21536v2*. +**PDF reviewed:** https://arxiv.org/abs/2508.21536v2 (version-pinned arXiv abstract for v2) +**Review date:** 2026-02-08 + +--- + +## Methodology Registry Entry + +*Working draft formatted to match docs/methodology/REGISTRY.md structure. Heading levels and labels align with existing entries. One open source-ambiguity item remains (weight normalization — see Gap #5 under "Gaps and Uncertainties" below); resolve it against the source before promoting this section into REGISTRY.md.* + +## TROP + +**Primary source:** Athey, S., Imbens, G.W., Qu, Z., & Viviano, D. (2025). Triply Robust Panel Estimators. arXiv:2508.21536. https://arxiv.org/abs/2508.21536 + +**Key implementation requirements:** + +*Assumption checks / warnings:* +- **Assumption 1** (Section 5, pages 20-21): Factor model structure on potential outcomes: `Y_it(0) = L_it + epsilon_it` where `L_it = Gamma_i^T Lambda_t` and `E[epsilon_it | L] = 0`. K (number of factors) is arbitrary and may grow with sample size. +- **Assumption 2** (page 22): Regression adjustment estimator class satisfies `E[L_hat | L] = Gamma (I + B) Lambda^T` where B is a K x K bias matrix. Satisfied by PCA/truncated SVD, nuclear-norm penalized least squares, and general singular-value shrinkage estimators. B = 0 means correctly specified. +- **Assumption 3** (Appendix, page 36): Factor model with covariates: `Y_it(0) = R_it + X_it beta + epsilon_it` where `R_it = Gamma Lambda^T`. 
+- **Assumption 4** (Appendix, page 36): Estimator class with covariates admits bias decomposition into factor bias, coefficient bias, and vanishing approximation error. +- Block treatment assignment: `W_it = 1{i > N_0} * 1{t > T_0}` (Assumption 1(i), expositional; general assignment patterns supported via Section 6). +- No spillovers and no dynamic effects (Section 5.3). +- Weight estimation error negligible when `N_0 >> N_1` and `T_0 >> T_1` (Section 5.3). +- Weights sufficiently dispersed: `||theta||_2, ||omega||_2 = o(sigma_NT)` (Section 5.3). + +*Target parameter (Equation 1):* + +``` + sum_{i=1}^{N} sum_{t=1}^{T} W_it (Y_it(1) - Y_it(0)) +tau = --------------------------------------------------------------- + sum_{i=1}^{N} sum_{t=1}^{T} W_it +``` + +This is the Average Treatment Effect on the Treated (ATT), averaged over all treated unit-period pairs. + +*Working model (Section 2.2, page 6):* + +``` +Y_it(0) = alpha_i + beta_t + L_it + epsilon_it, E[epsilon_it | L] = 0 +``` + +where: +- `alpha_i` = unit fixed effects +- `beta_t` = time fixed effects +- `L_it = Gamma_i^T Lambda_t` = low-rank factor structure component +- `epsilon_it` = idiosyncratic shocks + +*Estimator equation -- single treated unit (Equation 2, page 6):* + +``` +(alpha_hat, beta_hat, L_hat) = arg min_{alpha, beta, L} + sum_{j=1}^{N} sum_{s=1}^{T} theta_s^{i,t} omega_j^{i,t} (1 - W_{js}) (Y_{js} - alpha_j - beta_s - L_{js})^2 + + lambda_nn ||L||_* +``` + +where `||L||_*` is the nuclear norm of L. The treatment effect estimate is then: + +``` +tau_hat_it = Y_it - alpha_hat_i - beta_hat_t - L_hat_it +``` + +*Per-observation weights (Equation 3, page 7):* + +Time weights (exponential decay in temporal distance): +``` +theta_s^{i,t}(lambda) = exp(-lambda_time * dist_time(s, t)) +``` +where `dist_time(s, t) = |t - s|`. 
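The exponential time weights above are a one-liner; this is a minimal numpy sketch, with the function name and 1-indexed period convention as illustrative assumptions (not from the paper):

```python
import numpy as np

def time_weights(T: int, t: int, lambda_time: float) -> np.ndarray:
    """Exponential-decay time weights theta_s^{i,t} from Equation 3.

    Periods closer to the treated period t receive larger weight;
    lambda_time = 0 recovers uniform weights (the DID/TWFE limit).
    """
    s = np.arange(1, T + 1)                 # periods s = 1, ..., T
    return np.exp(-lambda_time * np.abs(t - s))

theta = time_weights(T=5, t=5, lambda_time=0.5)  # weights peak at s = t
```

With `lambda_time = 0` every weight equals one, matching the uniform-weights special cases listed later in this entry.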
+ +Unit weights (exponential decay in unit distance): +``` +omega_j^{i,t}(lambda) = exp(-lambda_unit * dist_unit_{-t}(j, i)) +``` +where the unit distance is the RMSE of outcome differences over shared control periods: +``` +dist_unit_{-t}(j, i) = sqrt( + sum_{u=1}^{T} 1{u != t} (1 - W_iu)(1 - W_ju)(Y_iu - Y_ju)^2 + / sum_{u=1}^{T} 1{u != t} (1 - W_iu)(1 - W_ju) +) +``` + +*General case -- multiple treated units (Equation 13, Section 6.1, pages 26-27):* + +Each treated unit-period (i, t) is estimated individually, as if it were the only treated unit: + +``` +tau_hat_it(lambda) = arg min_tau min_{alpha, beta, L} + sum_{j=1}^{N} sum_{s=1}^{T} + [(1 - W_{js}) + W_{js} I_{js}^{i,t}] omega_j^{i,t}(lambda) theta_s^{i,t}(lambda) + (Y_{js} - alpha_j - beta_s - L_{js} - tau_it * I_{js}^{i,t})^2 + + lambda_nn ||L||_* +``` + +where `I_{js}^{i,t} = 1{(j,s) = (i,t)}`. The ATT is then: + +``` +tau_hat = (1 / sum_{i,t} W_it) * sum_{i,t} W_it tau_hat_it(lambda_hat) +``` + +*Balancing representation (Equation 10, page 22):* + +The estimated counterfactual can be decomposed as: +``` +tau_hat_TROP(0, theta, omega) = L_hat_NT + + sum_{t<=T_0} theta_t (Y_Nt - L_hat_Nt) + + sum_{i<=N_0} omega_i (Y_iT - L_hat_iT) + - sum_{t<=T_0} sum_{i<=N_0} theta_t omega_i (Y_it - L_hat_it) +``` + +*With covariates (Equation 14, Section 6.2, page 28):* + +Parametrize `L_{js} = X_{js} beta_coef + R_{js}` where R is low-rank. The objective becomes: + +``` +tau_hat_{i*,t*} = arg min_tau min_{alpha, beta, beta_coef, R} + sum_{j=1}^{N} sum_{s=1}^{T} theta_s^{i*,t*} omega_j^{i*,t*} + (Y_{js} - alpha_j - beta_s - X_{js} beta_coef - R_{js} - tau * W_{js})^2 + + lambda_nn ||R||_* +``` + +This relaxes the low-rank assumption on L, requiring only that the residual after covariates is low-rank. 
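The unit-distance RMSE underlying the unit weights in Equation 3 can be sketched in a few lines of numpy; the 0-indexed arrays and helper names here are illustrative assumptions, not the shipped implementation:

```python
import numpy as np

def unit_distance(Y: np.ndarray, W: np.ndarray, i: int, j: int, t: int) -> float:
    """dist_unit_{-t}(j, i) from Equation 3: RMSE of outcome differences
    over periods other than t in which BOTH units are under control."""
    shared = (W[i] == 0) & (W[j] == 0)   # jointly untreated periods
    shared[t] = False                    # exclude the target period t
    diff2 = (Y[i, shared] - Y[j, shared]) ** 2
    return float(np.sqrt(diff2.mean()))

def unit_weights(Y: np.ndarray, W: np.ndarray, i: int, t: int,
                 lambda_unit: float) -> np.ndarray:
    """Exponential-decay unit weights omega_j^{i,t} from Equation 3."""
    dists = np.array([unit_distance(Y, W, i, j, t) for j in range(Y.shape[0])])
    return np.exp(-lambda_unit * dists)
```

Computing all pairwise distances this way costs O(N^2 T), which is the operation the Rust `compute_unit_distance_matrix()` acceleration targets.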
+ +*Special cases (Section 2.2, page 6):* +- `lambda_nn = infinity`, `omega_j = theta_s = 1` (uniform weights, no factor model) --> DID/TWFE +- `omega_j = theta_s = 1`, `lambda_nn < infinity` (uniform weights, with factor model) --> Matrix Completion (MC) +- `lambda_nn = infinity` with specific choices of unit and time weights --> SC and SDID + +*Standard errors (Section 5.3, pages 25-26):* +- Default: Bootstrap variance estimation (Algorithm 3, page 27) +- **No analytical standard errors** for the general case +- For single treated unit: inference driven by variance of idiosyncratic shock `sigma_NT^2` (multiple procedures available in literature) +- Bootstrap: Non-parametric (pairs) stratified bootstrap, resampling entire unit-level rows +- Clustering: Implicitly at the unit level (resamples full unit trajectories) +- Weight distribution: Not applicable (pairs bootstrap, not multiplier bootstrap) +- Recommended iterations: Not explicitly stated; simulations use 1000 replications for Monte Carlo RMSE + +*Triple robustness property (Theorem 5.1, page 23):* + +Under Assumption 1, for fixed (non-data-dependent) weights theta, omega: +``` +|E[tau_hat - tau | L]| <= ||Delta_u(omega, Gamma)||_2 * ||Delta_t(theta, Lambda)||_2 * ||B||_* +``` +where: +- `Delta_u(omega, Gamma) = Gamma_bar_0(omega) - Gamma_N` (unit imbalance in loadings) +- `Delta_t(theta, Lambda) = Lambda_bar_0(theta) - Lambda_T` (time imbalance in factors) +- `||B||_*` = spectral norm of regression adjustment bias matrix + +The bias is bounded by the **product** of three components. Strictly tighter than bounds for DID, SC, and SDID. 
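The arithmetic of this product bound is easy to check on synthetic quantities; every array below (loadings, factors, weights, and the bias matrix B) is an illustrative assumption, and the norm on B follows the spectral-norm gloss above:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N0, T0 = 2, 30, 20

Gamma0 = rng.normal(size=(N0, K))     # control-unit factor loadings
Lambda0 = rng.normal(size=(T0, K))    # pre-period factors
Gamma_N = rng.normal(size=K)          # treated unit's loadings
Lambda_T = rng.normal(size=K)         # treated period's factors
omega = np.full(N0, 1.0 / N0)         # fixed (non-data-dependent) unit weights
theta = np.full(T0, 1.0 / T0)         # fixed time weights
B = 0.05 * rng.normal(size=(K, K))    # regression-adjustment bias matrix

delta_u = omega @ Gamma0 - Gamma_N    # unit imbalance in loadings
delta_t = theta @ Lambda0 - Lambda_T  # time imbalance in factors
bound = (np.linalg.norm(delta_u) * np.linalg.norm(delta_t)
         * np.linalg.norm(B, 2))      # Theorem 5.1 product bound
```

Because the bound is a product, driving any one factor to zero (weights that balance loadings, weights that balance factors, or B = 0) zeroes the whole bound, which is the content of Corollary 1 below.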
+ +*Corollary 1 (page 23):* + +Estimator is unbiased (`E[tau_hat - tau | L] = 0`) if **any one** of: +- (a) Unit balance: `sum_{i<=N_0} omega_i Gamma_i = Gamma_N` +- (b) Time balance: `sum_{s<=T_0} theta_s Lambda_s = Lambda_T` +- (c) Correct regression adjustment: `B = 0_K` + +*Bias comparison with existing estimators (Equation 11, Section 5.2, page 24):* +``` +B_DID = (Gamma_N - Gamma_bar_0)^T (Lambda_T - Lambda_bar_0) [no robustness] +B_SC = (Gamma_bar_0(omega) - Gamma_N)^T Lambda_T [singly robust] +B_SDID = (Gamma_bar_0(omega) - Gamma_N)^T (Lambda_bar_0(theta) - Lambda_T) [doubly robust] +B_TROP <= ||unit_imbalance||_2 * ||time_imbalance||_2 * ||regression_bias||_* [triply robust] +``` + +*Triple robustness with covariates (Theorem 8.1, Appendix, pages 36-37):* + +Under Assumptions 3 and 4: +``` +|E[tau_hat_TROP - Y_NT(0) | X, R]| <= ||Delta_u||_2 * ||B||_* * ||Delta_t||_2 + + ||g(theta, omega)||_2 * ||delta_beta||_2 + o(eta) +``` +where `g(theta, omega) = sum_{i,t} M(theta, omega)_it X_it` is the covariate contrast vector and `delta_beta = E[beta_hat | X, R] - beta` is the coefficient estimation bias. 
+ +*Convergence rates (Section 5.3, page 25):* +- Weight estimation error: `epsilon_bar_0(theta, omega) = O_p(min{||theta||_2, ||omega||_2})` +- When weights sufficiently dispersed: `tau_hat - E[tau | L] = epsilon_NT(1) + o_p(sigma_NT^2)` +- With multiple post-treatment periods and CLT: `epsilon_NT(1) ~ N(0, sigma_NT^2)` + +*Edge cases:* +- Single treated unit, single period: TROP outperforms competitors in about half of cases, within 25% in others (Table 2) +- Very few control units (N_co ~ 10): TROP, SDID, and DIFP perform similarly; advantage emerges with larger control pools (Figure 2) +- No interactive effects (only additive FE): DID performs competitively with TROP (Table 3, "No M" row); TROP adapts without underperforming +- No additive fixed effects (only interactive): SC performs slightly better in some specs (Table 3, "No F" row) +- Only noise (no factors): MC and DID can slightly outperform TROP due to regularization overhead (Table 3, "Only Noise" row) +- Non-random treatment assignment: TWFE shows noticeable bias; TROP and SDID show substantially smaller/no bias (Appendix Table A2) +- Bootstrap validity requires growing number of treated units (Algorithm 3) +- Increasing pre-treatment periods can increase bias for DID-type estimators when treatment is at end of panel (Figure 2 right) + +*LOOCV tuning parameter selection (Equations 4-5, pages 7-8):* + +For each control observation (i, t) with W_it = 0, compute pseudo-treatment effect: +``` +tau_hat_it^{loocv}(lambda) = arg min_tau min_{alpha, beta, L} + sum_{j,s: W_{js}=0} omega_j^{i,t}(lambda) theta_s^{i,t}(lambda) + (Y_{js} - alpha_j - beta_s - L_{js} - tau * I_{js}^{i,t})^2 + + lambda_nn ||L||_* (Equation 4) +``` + +Select tuning parameters by minimizing: +``` +Q(lambda) = sum_{i=1}^{N} sum_{t=1}^{T} (1 - W_it) (tau_hat_it(lambda))^2 (Equation 5) +``` + +Practical LOOCV procedure (footnote 2, page 8): +1. Fix `lambda_nn = infinity`, `lambda_unit = 0`; optimize `lambda_time` over grid +2. 
Fix `lambda_time = optimal`, `lambda_nn = infinity`; optimize `lambda_unit` over grid +3. Fix `lambda_time = optimal`, `lambda_unit = optimal`; optimize `lambda_nn` over grid +4. Use these as upper bounds for a finer joint grid +5. Cycle through parameters, updating each while holding others at most recent optima + +*Algorithm 1 -- Single treated unit (page 9):* +``` +Input: Grid G for (lambda_time, lambda_unit, lambda_nn), treatments W, outcomes Y +1. For each lambda in G: + a. For each (i,t) with W_it = 0, estimate tau_hat_it(lambda) via Equation (4) + b. Compute Q(lambda) via Equation (5) +2. Find lambda_hat = arg min_{lambda in G} Q(lambda) +3. Compute tau_hat(lambda_hat) via Equation (2) with selected lambda_hat +``` + +*Algorithm 2 -- Multiple treated units (page 27):* +``` +Input: Grid G for (lambda_time, lambda_unit, lambda_nn), treatments W, outcomes Y +1. For each lambda in G: + a. For each (i,t) with W_it = 0, estimate tau_hat_it(lambda) via Equation (13) + b. Compute Q(lambda) = sum_{i,t} (1 - W_it)(tau_hat_it(lambda))^2 +2. Find lambda_hat = arg min_{lambda in G} Q(lambda) +3. Compute tau_hat = (1/sum W_it) * sum_{i,t} W_it tau_hat_it(lambda_hat) +``` + +*Algorithm 3 -- Bootstrap variance estimation (page 27):* +``` +Input: Y, W, B (number of bootstrap iterations) +1. For b = 1 to B: + a. Construct bootstrap dataset (Y^(b), W^(b)) by: + - Sampling N_0 rows of (Y, W) WITH REPLACEMENT from control units + - Sampling N_1 rows of (Y, W) WITH REPLACEMENT from treated units + b. Compute TROP estimator tau_hat^(b) from (Y^(b), W^(b)) +2. Variance estimator: + V_hat_tau = (1/B) sum_{b=1}^{B} (tau_hat^(b))^2 - ((1/B) sum_{b=1}^{B} tau_hat^(b))^2 +``` + +Note: Stratified bootstrap -- control and treated units resampled separately. Preserves within-unit temporal correlation. Validity requires growing number of treated units. 
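Algorithm 3's stratified pairs bootstrap can be sketched as follows; the `estimator` callable, the default `B`, and the seed handling are assumptions for illustration, not recommendations from the paper:

```python
import numpy as np

def stratified_bootstrap_variance(Y, W, estimator, B=200, seed=0):
    """Algorithm 3: resample control-unit rows and treated-unit rows
    with replacement SEPARATELY, recompute the estimator each time,
    and return the variance of the bootstrap draws. Resampling whole
    rows preserves within-unit temporal correlation."""
    rng = np.random.default_rng(seed)
    ever_treated = W.sum(axis=1) > 0
    control = np.flatnonzero(~ever_treated)
    treated = np.flatnonzero(ever_treated)
    taus = np.empty(B)
    for b in range(B):
        rows = np.concatenate([rng.choice(control, size=control.size),
                               rng.choice(treated, size=treated.size)])
        taus[b] = estimator(Y[rows], W[rows])
    return taus.var()
```

Here `estimator` stands in for the full TROP fit; any callable `(Y, W) -> tau_hat` works for the sketch, and each replicate is independent, matching the parallelization note under Computational Considerations.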
+ +**Reference implementation(s):** +- Authors' replication code (forthcoming as of review date) +- No specific software package mentioned in the paper +- Simulation designs reference Arkhangelsky et al. (2019) SDID replication infrastructure +- diff-diff: `diff_diff/trop.py` (existing implementation) + +**Requirements checklist:** +- [ ] Factor model estimated via nuclear-norm penalized least squares with soft-threshold SVD (Equation 2) +- [ ] Unit weights: `exp(-lambda_unit * RMSE_distance)` per Equation 3 +- [ ] Time weights: `exp(-lambda_time * |t - s|)` per Equation 3 +- [ ] LOOCV implemented for tuning parameter selection via Equation 5 +- [ ] LOOCV uses SUM of squared pseudo-treatment effects on control observations +- [ ] Coordinate-wise grid search followed by cycling (footnote 2) +- [ ] Per-observation treatment effect estimation for multiple treated units (Equation 13, Algorithm 2) +- [ ] ATT averages over all treated unit-period pairs with equal weight +- [ ] Stratified bootstrap preserving unit-level structure (Algorithm 3) +- [ ] Covariate extension supported (Equation 14) +- [ ] Special cases recoverable: DID, MC, SC, SDID via tuning parameters +- Weight normalization (`1^T omega = 1^T theta = 1`, Section 5, page 20) — **open source-ambiguity item**; see Gap #5 below. Section 5 states sum-to-one, Equation 3 / Equation 2 use unnormalized exponential weights, and the shipped implementation matches Equation 2. Do not promote this to a requirement until the discrepancy is resolved against the source. +- [ ] Heterogeneous treatment effects supported via per-observation estimation (Remark 6.1) + +--- + +## Implementation Notes + +### Data Structure Requirements +- **Paper assumption — balanced panel:** N units x T time periods. 
+- **Shipped implementation:** `diff_diff/trop.py` accepts unbalanced panels with structural gaps (see the "Unbalanced panels" item under Gaps and Uncertainties below and the corresponding section of `docs/methodology/REGISTRY.md` under TROP). +- **Outcome matrix**: Y (N x T), observed outcomes +- **Treatment matrix**: W (N x T), binary treatment assignments where `W_it in {0, 1}` +- **Covariates** (optional): X_it, observed covariates for each unit-period pair +- Treatment must be an absorbing state for standard block assignment (W_it = 1{i > N_0} * 1{t > T_0}) +- **Paper scope (Equation 13):** the paper extends TROP to general assignment patterns including treatment switching on/off. +- **Shipped implementation:** the current `diff_diff/trop.py` requires an absorbing-state treatment indicator and rejects non-absorbing/event-style inputs (gate in `diff_diff/trop.py:505-525`, also documented in `docs/methodology/REGISTRY.md` under TROP). Generalization to non-absorbing patterns is not in scope for the current implementation. + +### Computational Considerations +- **Main bottleneck**: LOOCV grid search -- for each grid point, every control observation requires a separate nuclear-norm penalized weighted least squares solve +- **Per-observation estimation**: With multiple treated units, each `tau_hat_it` estimated separately (Algorithm 2). Computationally expensive with many treated observations. +- **Coordinate-wise search** (footnote 2) reduces grid search from cubic to approximately linear in grid points per parameter +- **Unit distance matrix**: Computing pairwise RMSE distances between all units requires O(N^2 T) operations. Can be parallelized (existing Rust acceleration: `compute_unit_distance_matrix()`, 4-8x speedup). +- **LOOCV parallelization**: Each control observation's pseudo-treatment effect is independent, enabling parallelization across observations and grid points (existing Rust acceleration: `loocv_grid_search()`, 10-50x speedup). 
- **Bootstrap parallelization**: Each bootstrap replicate is independent (existing Rust acceleration: `bootstrap_trop_variance()`, 5-15x speedup).

### Tuning Parameters

| Parameter | Type | Selection | Description |
|-----------|------|-----------|-------------|
| `lambda_time` | float >= 0 | Data-driven via LOOCV | Exponential decay rate for time weights. 0 = uniform time weights. |
| `lambda_unit` | float >= 0 | Data-driven via LOOCV | Exponential decay rate for unit weights. 0 = uniform unit weights. |
| `lambda_nn` | float > 0 or infinity | Data-driven via LOOCV | Nuclear norm penalty. infinity = no factor model (DID/TWFE). |

Empirical guidance from Table 2 (cross-validated values, T_post=10, N_tr=10):

| Dataset | lambda_unit | lambda_time | lambda_nn |
|---------|-------------|-------------|-----------|
| CPS logwage | 0 | 0.1 | 0.9 |
| CPS urate | 1.6 | 0.35 | 0.011 |
| PWT | 0.3 | 0.4 | 0.006 |
| Germany | 1.2 | 0.2 | 0.011 |
| Basque | 0 | 0.35 | 0.006 |
| Smoking | 0.25 | 0.4 | 0.011 |
| Boatlift | 0.2 | 0.2 | 0.151 |

Key finding (Section 4.3, pages 18-19): Substantial heterogeneity across applications in optimal tuning parameters. Removing regression adjustment (lambda_nn = infinity) or time weights (lambda_time = 0) increases RMSE significantly (up to 85% and 91% respectively). Removing unit weights (lambda_unit = 0) increases RMSE only modestly (max 10%).

### Relation to Existing diff-diff Estimators
- **Direct implementation**: `diff_diff/trop.py` implements the TROP estimator with two public methods (`diff_diff/trop.py:64-78`, validated at `diff_diff/trop.py:909`): `method="local"` is the per-treated-cell estimator (Algorithm 2), and `method="global"` fits a single weighted model on control observations and averages residual-based treated-cell effects into the ATT (`diff_diff/trop_global.py:554-585`).
+- **Existing Rust backend**: `rust/src/trop.rs` provides accelerated distance matrix computation, LOOCV grid search, and bootstrap variance estimation. +- **Encompasses SyntheticDiD**: TROP with `lambda_nn = infinity` and specific weight choices reduces to SDID (`diff_diff/synthetic_did.py`). +- **Encompasses TWFE**: TROP with `lambda_nn = infinity` and uniform weights reduces to DID/TWFE (`diff_diff/twfe.py`). +- **Factor estimation**: The nuclear-norm penalized SVD in TROP is related to factor model estimation used in `SyntheticDiD`. +- **Bootstrap structure**: The stratified pairs bootstrap (Algorithm 3) differs from the multiplier bootstrap used in `CallawaySantAnna` and `SunAbraham` -- it resamples entire unit rows rather than applying random weights. + +--- + +## Simulation Design Details + +The paper uses semi-synthetic simulations (Section 3.1, pages 9-11) based on 6 real datasets (7 simulation applications, since CPS is used for two outcomes: logwage and urate): +1. CPS (Current Population Survey): N=50, T=40 — used for both CPS logwage and CPS urate applications +2. PWT (Penn World Table): N=111, T=48 +3. Germany reunification: N=17, T=44 +4. Basque country: N=18, T=43 +5. Smoking: N=39, T=31 +6. Boatlift: N=44, T=19 + +**Outcome DGP (Equation 6):** Rank-4 factor model with AR(2) autocorrelated errors, normalized to mean zero and unit variance. True treatment effects are zero (placebo studies). + +**Treatment assignment (Equation 8):** Logistic regression on latent factors, inducing confoundedness. + +**Key results (Table 1):** TROP outperforms all competitors in 20 of 21 specifications. Competitor RMSE inflation vs. best: DID up to 900%, SC up to 580%, MC up to 300%, SDID up to 90%. + +--- + +## Gaps and Uncertainties + +1. **Analytical standard errors not provided**: The paper provides bootstrap inference (Algorithm 3) but no closed-form standard errors for the general case. 
For single treated units, the paper references "existing literature" for variance estimation without specifying which procedures (Section 5.3, page 25). + +2. **Bootstrap iteration count**: The paper does not recommend a specific number of bootstrap iterations. Simulations use 1000 replications for Monte Carlo RMSE, but this is for the simulation study, not a bootstrap recommendation (Section 3, page 10). + +3. **Efficient estimation under homogeneity**: Remark 6.1 (page 27) notes that under homogeneous treatment effects, alternative weighting of per-unit estimates could improve precision, but leaves this to future research. + +4. **Computational complexity**: Not explicitly discussed. The LOOCV grid search is described as the bottleneck, but no formal complexity analysis is provided. + +5. **Weight normalization**: Section 5 (page 20) states weights sum to one (`1^T omega = 1^T theta = 1`), but the weight specification in Equation 3 (page 7) uses unnormalized exponential weights. It is unclear whether normalization is applied before or after the optimization, or whether the theoretical results in Section 5 assume normalized weights while the practical algorithm uses unnormalized weights. The existing diff-diff implementation uses unnormalized weights in the optimization (matching Equation 2). + +6. **Nuclear norm penalty in Equation 13** (resolved): the source uses the same unsquared nuclear-norm penalty `lambda_nn ||L||_*` in Equation 13 as in Equation 2 (consistent with the rest of the draft and confirmed against the paper text). The shipped implementation matches. + +7. **General assignment patterns**: Section 6.1 generalizes beyond block assignment, but the inference results (Section 5.3) and theoretical guarantees (Theorem 5.1) are stated under block assignment. The extent to which theoretical results carry over to general patterns is not fully characterized. + +8. 
**Rank selection**: The paper uses a fixed rank-4 approximation in simulations (Equation 6) but the theoretical framework allows arbitrary K. No guidance is provided for choosing K in practice beyond the nuclear norm penalty (which implicitly selects rank via soft-thresholding). + +9. **Unbalanced panels**: The paper assumes balanced panel data (N units x T periods). Extension to unbalanced panels is not discussed but is supported in the existing diff-diff implementation. diff --git a/docs/methodology/papers/wooldridge-2023-review.md b/docs/methodology/papers/wooldridge-2023-review.md new file mode 100644 index 00000000..3a99bffe --- /dev/null +++ b/docs/methodology/papers/wooldridge-2023-review.md @@ -0,0 +1,258 @@ +# Paper Review: Simple approaches to nonlinear difference-in-differences with panel data + +**Authors:** Jeffrey M. Wooldridge +**Citation:** Wooldridge, J. M. (2023). Simple approaches to nonlinear difference-in-differences with panel data. *The Econometrics Journal*, 26(3), C31–C66. https://doi.org/10.1093/ectj/utad016 +**PDF reviewed:** https://doi.org/10.1093/ectj/utad016 (Econometrics Journal — official article URL) +**Review date:** 2026-03-21 + +--- + +## Methodology Registry Entry + +*Formatted to match docs/methodology/REGISTRY.md structure.* + +## WooldridgeDiD (ETWFE) + +**Primary source:** Wooldridge, J. M. (2023). Simple approaches to nonlinear difference-in-differences with panel data. *The Econometrics Journal*, 26(3), C31–C66. https://doi.org/10.1093/ectj/utad016 + +**Secondary source:** Wooldridge, J. M. (2021). Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators. SSRN Working Paper. https://doi.org/10.2139/ssrn.3906345 + +**Key implementation requirements:** + +*Assumption checks / warnings:* +- **No Anticipation (NA):** `E[Y_1(1) - Y_1(0) | D = 1] = 0` (Eq 2.2). Pre-treatment outcomes unaffected by future treatment. 
+- **Linear Parallel Trends (LPT):** `E[Y_2(0)|D] - E[Y_1(0)|D] = γ_2` (Eq 2.5). Trend in untreated PO is the same across groups. Only valid for continuous/unbounded outcomes. +- **Index Parallel Trends (IPT):** For a known strictly increasing `G(·)`, `E[Y_t(0)|D] = G(α + βD + γ_t)` (Eq 2.6-2.7). PT holds on the index inside `G(·)`, not on levels. This is the key nonlinear extension. +- **Conditional NA, Staggered (CNAS):** `E[Y_t(g)|D_g = 1, X] = E[Y_t(∞)|D_g = 1, X]` for `t < g` (Eq 3.3). +- **Conditional IPT, Staggered (CIPTS):** `E[Y_t(∞)|D_q,...,D_T, X] = G(α + Σ β_g D_g + Xκ + Σ (D_g · X) η_g + γ_t + Xπ_t)` (Eq 3.4). The change in the index does not depend on cohort assignment `D`, conditional on `X`. +- **Overlap:** `P(D = 1|X = x) < 1` for all `x ∈ Supp(X)`. Equivalently, `Supp(X|D=1) ⊂ Supp(X|D=0)`. + +*Target parameter — two-period case (Equation 2.1):* + + τ_2 = E[Y_2(1) - Y_2(0) | D = 1] + +*Target parameter — staggered case (Equation 3.2):* + + τ_{gr} = E[Y_r(g) - Y_r(∞) | D_g = 1], r = g,...,T; g = q,...,T + +where `g` is the cohort (first treatment period), `r` is calendar time, `D_g` are cohort indicators, `D_∞ = 1` indicates never-treated. + +*Main estimator — two-period, no covariates (Equation 2.17):* + + τ̂_2 = Ȳ_12 - G(α̂ + β̂ + γ̂_2) + = Ȳ_12 - G(G⁻¹(Ȳ_11) + (G⁻¹(Ȳ_02) - G⁻¹(Ȳ_01))) + +where `Ȳ_{dt}` is the sample average for group `d` in period `t`. Special cases: +- Linear (`G(z) = z`): reduces to standard DiD `(Ȳ_12 - Ȳ_11) - (Ȳ_02 - Ȳ_01)` (Eq 2.18) +- Exponential (`G(z) = exp(z)`): `τ̂_2 = Ȳ_12 - Ȳ_11 · (Ȳ_02/Ȳ_01)` (Eq 2.19) + +*Index-scale effect (Equation 2.20):* + + δ_2 ≡ G⁻¹(E[Y_2(1)|D=1]) - G⁻¹(E[Y_2(0)|D=1]) + = G⁻¹(E[Y_2(1)|D=1]) - (α + β + γ_2) + +Interpretation depends on the link function: +- Exponential mean (`G(z) = exp(z)`, Poisson): `δ_2` is a log difference, i.e., a proportional/log rate effect on `E[Y|...]`. +- Logistic mean (`G(z) = Λ(z) = 1/(1 + exp(-z))`, logit): `δ_2` is a change in log-odds of `E[Y|...]`. 
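The two-period estimator in Equation 2.17 is a one-liner once a link pair is chosen; `nonlinear_did` and the toy cell means below are illustrative assumptions:

```python
import numpy as np

def nonlinear_did(y11, y12, y01, y02, G, Ginv):
    """Equation 2.17: tau_hat_2 = Ybar_12
    - G( Ginv(Ybar_11) + Ginv(Ybar_02) - Ginv(Ybar_01) ),
    where Ybar_dt is the sample mean for group d in period t."""
    return y12 - G(Ginv(y11) + Ginv(y02) - Ginv(y01))

# Identity link recovers the standard linear DiD of Eq 2.18:
linear = nonlinear_did(2.0, 5.0, 1.0, 3.0, G=lambda z: z, Ginv=lambda z: z)
# (5 - 2) - (3 - 1) = 1

# Exponential link gives the ratio form of Eq 2.19:
ratio = nonlinear_did(2.0, 7.0, 1.0, 3.0, G=np.exp, Ginv=np.log)
# 7 - 2 * (3 / 1) = 1
```

Both toy calls recover an effect of 1 on their respective scales, illustrating how the same index-parallel-trends construction specializes across links.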
+ +*Imputation estimator — staggered with covariates (Procedure 1, Eq 3.10-3.11):* + + Step 1: Using W_{it} = 0 observations, estimate (α, β_q,...,β_T, κ, η_q,...,η_T, γ_2,...,γ_T, π_2,...,π_T) by pooled QMLE in the LEF. + + Step 2: For cohort g, impute counterfactual: + Ŷ_{igr}(∞) ≡ G(α̂ + β̂_g + X_i κ̂ + X_i η̂_g + γ̂_r + X_i π̂_r), r = g,...,T (3.10) + + Step 3: ATT estimator: + τ̂_{gr} = N_g⁻¹ Σ_i D_{ig} [Y_{ir} - Ŷ_{igr}(∞)] (3.11) + +*Pooled QMLE estimator — staggered with covariates (Procedure 2, Eq 3.12-3.15):* + + E(Y_{it}|D_{iq},...,D_{iT}, X_i, W_i) = G[α + Σ β_g D_{ig} + X_i κ + Σ (D_{ig} · X_i) η_g + + Σ γ_s f s_t + Σ (f s_t · X_i) π_s + + ΣΣ δ_{gs} (W_{it} · D_{ig} · f s_t) + + ΣΣ (W_{it} · D_{ig} · f s_t · X̂_{ig}) ξ_{gs}] (3.12) + +where `X̂_{ig} = X_i - X̄_g` are cohort-centred covariates, `W_{it} = D_{ig} · (f q_t + ... + f T_t)` is the time-varying treatment indicator, and `f s_t` are time period dummies. + + ATT estimator from pooled method: + τ̃_{gr} = N_g⁻¹ Σ_i D_{ig} [G(α̃ + β̃_g + X_i κ̃ + X_i η̃_g + γ̃_r + X_i π̃_r + δ̃_{gr} + X̂_{ig} ξ̃_{gr}) + - G(α̃ + β̃_g + X_i κ̃ + X_i η̃_g + γ̃_r + X_i π̃_r)] (3.15) + +This is the Average Structural Function (ASF) approach: `ATT(g,r) = mean_i[G(η_i + δ_{gr}) - G(η_i)]`. + +*Equivalence result (Proposition 3.1):* + +When `G⁻¹(·)` is the canonical link function for the chosen LEF density, and the pooled QMLE solution is unique: +- Parameter estimates from imputation (Procedure 1) and pooled (Procedure 2) are identical +- ATT estimates are identical: `τ̃_{gr} = τ̂_{gr}` + +This holds for: OLS (identity link), logit (logistic link with Bernoulli), Poisson QMLE (log link with Poisson). Proven in Appendix A. 
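Proposition 3.1 is easy to verify numerically in the identity-link (OLS) case; the two-period toy panel below is an assumption for illustration, not the paper's simulation design:

```python
import numpy as np

rng = np.random.default_rng(1)
N, tau_true = 200, 2.0
D = (np.arange(N) < N // 2).astype(float)               # treated-group dummy
y1 = 1.0 + 0.5 * D + rng.normal(size=N)                 # period-1 outcomes
y2 = 1.5 + 0.5 * D + tau_true * D + rng.normal(size=N)  # period-2 outcomes

# Long form with columns [1, D, post, D*post]
y = np.concatenate([y1, y2])
d = np.concatenate([D, D])
post = np.concatenate([np.zeros(N), np.ones(N)])
X = np.column_stack([np.ones(2 * N), d, post, d * post])

# Procedure 2 (pooled OLS): tau is the D*post coefficient.
tau_pooled = np.linalg.lstsq(X, y, rcond=None)[0][3]

# Procedure 1 (imputation): fit on untreated cells only,
# impute Y(0) for the treated-period cells, average the gap.
untreated = d * post == 0
b = np.linalg.lstsq(X[untreated][:, :3], y[untreated], rcond=None)[0]
tau_imput = (y[~untreated] - X[~untreated][:, :3] @ b).mean()

assert np.isclose(tau_pooled, tau_imput)  # Proposition 3.1, identity link
```

With the canonical link for logit or Poisson QMLE the same equality holds, but demonstrating it requires an actual QMLE fit rather than `lstsq`.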
+ +*Common timing simplification (Equation 3.17):* + +When all units are treated at the same time `g`, the model simplifies to: + + E(Y_{it}|D_i, X_i, W_i) = G[α + βD_i + X_i κ + (D_i · X_i) η + + Σ γ_s f s_t + Σ (f s_t · X_i) π_s + + Σ δ_s (W_{it} · D_i · f s_t) + Σ (W_{it} · D_i · f s_t · X̂_{ig}) ξ_{gs}] (3.17) + +*Standard errors (Section 3, p. C47):* +- Delta method for individual `τ̂_{gr}` and aggregated ATTs (Wooldridge 2010, problem 12.17) +- Panel bootstrap (resampling units) is also valid in the paper's framework — allows for serial dependence and model misspecification +- Cluster-robust SEs at unit level for pooled estimation +- For nonlinear models: standard sandwich `(X'WX)⁻¹ meat (X'WX)⁻¹` where `W = diag(μ_i(1-μ_i))` for logit or `W = diag(μ_i)` for Poisson +- **Note (shipped API restriction):** in `WooldridgeDiD`, `n_bootstrap > 0` is currently OLS-only (rejected for logit/Poisson at `diff_diff/wooldridge.py:432-437`) and rejected when `survey_design` is set (`diff_diff/wooldridge.py:441-444`). Use analytical SEs for nonlinear or survey paths. + +*Aggregation (Section 3.1, p. C50):* +- **Simple (static):** weighted average of immediate effects `τ̃_{sg}` across cohorts `g = q,...,T`, weights = cohort proportions among all eventually treated +- **Dynamic (event study):** weighted average of `τ̃_{g,g+h}` for horizon `h ≥ 0` across cohorts, weights = cohort proportions +- **Calendar time:** weighted average across cohorts for each calendar period +- **Group:** weighted average across periods for each cohort +- SEs for all aggregations via delta method or panel bootstrap +- **Note (current implementation deviation):** the shipped `WooldridgeDiD` aggregations use cell-level observation-count weights `n_{g,t}` (matching Stata `jwdid_estat`) rather than the cohort-share weights described conceptually in Section 3.1. 
The 2023 paper does not provide explicit aggregation-weight equations; the formal cohort-share equations referenced in `docs/methodology/REGISTRY.md` ("W2025 Eqs. 7.2-7.4") are from a later Wooldridge ETWFE source. See `docs/methodology/REGISTRY.md` "Aggregations" under WooldridgeDiD and the corresponding line in `TODO.md` ("Tech Debt from Code Reviews") for the tracked deviation. + +*Testing parallel trends (Section 4):* + +Two approaches: +1. **Event-study test (Section 4.1):** Include pre-treatment interactions `D_{ig} · f s_t` for `s = {2,...,g-1}` and test joint significance. With canonical link, test is the same whether using `W_{it} = 0` subsample or pooling all observations. +2. **Heterogeneous linear trends (Section 4.2):** Add cohort-specific linear trends `D_{ig} · t`. Conserves degrees of freedom vs event-study approach. Tests via cluster-robust Wald statistic. +- Adding cohort-specific linear trends `D_{ig} · t` to the expanded equation (Eq 4.1) provides a sensible correction when PT is violated + +*Edge cases:* +- **Binary outcomes:** Use logistic function `G(z) = Λ(z) = exp(z)/[1 + exp(z)]` with Bernoulli QLL. Fractional outcomes use same setup (Papke & Wooldridge 1996, 2008). +- **Count/nonnegative outcomes:** Use exponential `G(z) = exp(z)` with Poisson QLL. Does not require Poisson distribution — only exponential conditional mean. +- **Corner solutions (`Y ≥ 0` with mass at zero):** Poisson QMLE still valid. Exponential mean handles zeros naturally (p. C56-C57). +- **All units eventually treated (Section 7.1):** Methods work with minor modification. Last-treated cohort (`g = T`) serves as control. ATTs defined relative to `Y_t(T)` rather than `Y_t(∞)`. +- **Treatment exit (Section 7.2, extension):** Expand cohort definition to `D_{gh}` where `g` = start period, `h` = end period. ATTs `τ_{ghr}` defined for `r = g,...,T`. 
The paper notes this extension requires the additional restriction that future shocks to untreated potential outcomes do not drive exit; absent that condition, exit timing becomes endogenous and the standard ETWFE identification argument no longer carries over directly. +- **Time-varying covariates (Section 7.3):** Replace `X_i` with `X_{it}` in Eq 3.12. Only valid if covariates are not influenced by the intervention. +- **Multiple treatment levels (Section 7.4, extension):** Replace binary `W_{it}` with a set of treatment-level indicators; define cohorts by `D_{iga}` (first period `g`, initial level `a`). The paper describes this only as relatively straightforward, not fully general, and leaves the precise multi-level estimand to future work — so the exact ATT object under non-binary treatment is not pinned down in Wooldridge (2023). +- **Incidental parameters:** For logit, unit-specific dummies cause incidental parameters problem. Pooled QMLE with cohort dummies (not unit dummies) avoids this. For Poisson with exponential mean, FE Poisson estimator does NOT suffer from incidental parameters (Wooldridge 1999) — this is a unique exception. +- **No never-treated group:** When `g = T` is the last cohort, it serves as control. Cannot estimate ATT for this last cohort. + +*Algorithm (Procedure 2 — Pooled Estimation, recommended):* +1. Pool all observations (treated and untreated, all periods) +2. Construct design matrix: cohort dummies `D_{ig}`, time dummies `f s_t`, covariates `X_i`, cohort×covariate interactions `D_{ig} · X_i`, time×covariate interactions `f s_t · X_i`, treatment indicators `W_{it} · D_{ig} · f s_t`, treatment×centred-covariate interactions `W_{it} · D_{ig} · f s_t · X̂_{ig}` +3. Estimate by pooled QMLE in the LEF (OLS for linear, Bernoulli QLL for logit, Poisson QLL for exponential) +4. For each treated cell (g, r): compute ASF-based ATT via Eq 3.15 — average `G(η̂_i + δ̂_{gr}) - G(η̂_i)` over units in cohort `g` +5. 
For the linear case: the `δ̂_{gr}` coefficients on the treatment interactions are the ATTs directly
6. Aggregate as desired (simple, dynamic/event-study, calendar, group)
7. SEs via the delta method or a panel bootstrap (clustered at the unit level)

**Table 1. Canonical link and log likelihood pairings (p. C44):**

| Conditional Mean | LEF Density | Use Case |
|------------------|-------------|----------|
| Linear | Normal | Any response; leads to OLS |
| Logistic | Bernoulli | Binary or fractional response |
| Logistic | Binomial | Nonnegative response with known upper bound |
| Logistic | Multinomial | Multinomial or multiple fractional response |
| Exponential | Poisson | Nonnegative response (count, corner), no natural upper bound |

**Reference implementation(s):**
- Stata: `jwdid` package (Rios-Avila, 2021)
- R: `etwfe` package (McDermott, 2023)
- Simulations in the paper were performed in Stata 17

**Requirements checklist:**
- [ ] Pooled QMLE estimation with the full saturated design matrix (Eq 3.12)
- [ ] Three estimation methods: OLS (identity link), logit (Bernoulli QLL), Poisson (Poisson QLL)
- [ ] ASF-based ATT computation for nonlinear models (Eq 3.15)
- [ ] Imputation-based ATT for comparison/equivalence check (Eqs 3.10-3.11)
- [ ] Equivalence between imputation and pooled estimation under the canonical link (Proposition 3.1)
- [ ] Cohort×time interaction design matrix with centred covariates
- [ ] Unit-level cluster-robust SEs
- [ ] Delta-method SEs for aggregated ATTs
- [ ] Panel bootstrap (unit resampling) as alternative inference
- [ ] Four aggregation types: simple (static), dynamic (event study), calendar, group
- [ ] Pre-treatment testing: event-study and heterogeneous-trends approaches (Section 4)
- [ ] Support for both never-treated and all-eventually-treated designs (Section 7.1)
- [ ] Treatment exit support (Section 7.2)
- [ ] Time-varying covariates (Section 7.3)
- [ ] Index-scale effects `δ_{gr}` reported alongside level-scale ATTs `τ_{gr}` (Eq 3.16)
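As a sanity check on the linear path of Procedure 2, the following self-contained sketch builds the saturated cohort×period design on an invented six-unit panel and confirms that, with OLS, the coefficients on the `W·D·f` interactions recover the ATTs directly. All cohort labels, period effects, and effect sizes here are made up for illustration; this is not the shipped `diff_diff` code.

```python
import numpy as np

# Invented 6-unit, 4-period panel: units 0-2 form cohort g=3 (first treated at
# t=3), units 3-5 are never treated. Y(0) = beta_t + 0.4*D_{i3} (cohort-level
# effect), with true ATTs tau(3,3)=1.0 and tau(3,4)=2.0 added for treated cells.
beta = {1: 0.0, 2: 0.3, 3: 0.6, 4: 0.9}   # period effects (invented)
tau = {3: 1.0, 4: 2.0}                    # true ATT(g=3, r) (invented)

rows, y = [], []
for i in range(6):
    g = 3 if i <= 2 else float("inf")     # first treatment period; inf = never
    for t in (1, 2, 3, 4):
        d = 1.0 if g == 3 else 0.0        # cohort dummy D_{i3}
        w = 1.0 if t >= g else 0.0        # treatment indicator W_{it}
        # columns: intercept, D_{i3}, f2, f3, f4, W*D*f3, W*D*f4
        rows.append([1.0, d, float(t == 2), float(t == 3), float(t == 4),
                     w * d * float(t == 3), w * d * float(t == 4)])
        y.append(beta[t] + 0.4 * d + (tau[t] if w else 0.0))

X = np.array(rows)
coef, *_ = np.linalg.lstsq(X, np.array(y), rcond=None)
att_33, att_34 = coef[5], coef[6]  # linear case: delta_hat_{gr} ARE the ATTs
print(round(att_33, 6), round(att_34, 6))  # -> 1.0 2.0
```

Because the toy outcome is noiseless, OLS fits exactly and the interaction coefficients equal the true effects; in a nonlinear model the same coefficients would instead be index-scale effects needing the Eq 3.15 ASF step.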

---

## Implementation Notes

### Data Structure Requirements

*Paper notation:* `Y_{it}` (outcome), `D_g` (cohort indicator), `W_{it}` (time-varying treatment), `X_i` (time-invariant covariates).

*Shipped API (`diff_diff/wooldridge.py:394-411`):* users provide outcome, unit ID, time, and `cohort` (or `first_treat`). The model derives `W_{it}` internally from `cohort` and `time` via `_build_interaction_matrix` (`diff_diff/wooldridge.py:165-189`) — users do NOT pass `W_{it}` as a column.

*Covariates (richer than paper notation):* `exovar` (time-invariant, no interaction or demeaning), `xtvar` (time-varying, demeaned within cohort×period cells when `demean_covariates=True`), `xgvar` (covariates interacted with each cohort indicator). See `docs/methodology/REGISTRY.md` under WooldridgeDiD "Covariates".

*Other contracts:*
- Panel of N units over T fixed time periods; may be balanced or unbalanced (units need not be observed in every period).
- Treatment is absorbing: once treated, always treated (no exit unless using the Section 7.2 extension).
- Cohorts are defined by the first treatment period `g ∈ {q, q+1, ..., T, ∞}`.
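The `W_{it}` derivation contract above can be sketched in a few lines. This is an illustrative stand-in, not the library's `_build_interaction_matrix`; the helper name and the NaN-for-never-treated encoding are assumptions for the example.

```python
import numpy as np

# Hypothetical sketch of deriving the time-varying treatment indicator W_it
# from a `cohort` column (first treatment period; NaN = never treated) and a
# `time` column, under the absorbing-treatment contract.
def derive_treatment(cohort: np.ndarray, time: np.ndarray) -> np.ndarray:
    cohort = np.where(np.isnan(cohort), np.inf, cohort)  # never-treated -> inf
    return (time >= cohort).astype(int)                  # absorbing: on from t = g

cohort = np.array([3.0, 3.0, np.nan, 2.0])  # first-treatment period per row
time = np.array([2.0, 3.0, 4.0, 2.0])       # observation period per row
print(derive_treatment(cohort, time))       # -> [0 1 0 1]
```

Mapping never-treated to `inf` makes the single comparison `time >= cohort` handle both the staggered cohorts and the never-treated group without a special case.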

### Computational Considerations
- Pooled estimation is a single regression over all N×T observations — O(N·T·K²), where K is the number of parameters
- K grows as (number of cohorts) × (number of post-treatment periods) for the interaction terms
- For nonlinear models, IRLS convergence is typically fast (< 25 iterations for logit/Poisson)
- ASF computation requires averaging over all units in each cohort — O(N_g) per (g, r) cell
- The delta-method gradient for nonlinear ATTs requires `G'(η̂_i + δ̂_{gr})` and `G'(η̂_i)` per unit
- Parallelization: bootstrap iterations are embarrassingly parallel; ASF computation per cell is independent

### Tuning Parameters

| Parameter | Type | Default | Selection Method |
|-----------|------|---------|------------------|
| `G(·)` | function | identity (OLS) | Dictated by outcome type: logistic for binary/fractional, exponential for counts/nonnegative |
| `control_group` | str | `"not_yet_treated"` (matches `diff_diff/wooldridge.py:305`) | Use `"never_treated"` only when a pure never-treated group is available and required |
| `anticipation` | int | 0 | Domain knowledge; test with pre-treatment indicators |
| `exovar` / `xtvar` / `xgvar` | list / list / list | None / None / None | `exovar` for time-invariant, `xtvar` for time-varying (Section 7.3; demeaned via `demean_covariates=True`, default), `xgvar` for cohort-interacted. See `diff_diff/wooldridge.py:394-411`. |

### Relation to Existing diff-diff Estimators

- **CallawaySantAnna:** OLS ETWFE ATT(g,t) estimates are equivalent to CS ATT(g,t) under linear PT with a never-treated control (proven in Wooldridge 2021). However, CS uses only the period just prior to treatment as the comparison, while ETWFE uses all pre-treatment periods — a stronger assumption but more efficient. Key difference: ETWFE extends naturally to nonlinear outcomes; CS does not.
- **ImputationDiD (BJS):** In the linear case, the imputation procedure (Procedure 1) is numerically identical to BJS imputation (Section 3.2, Proposition 3.1). With the canonical link in the LEF, pooled and imputation estimators are equivalent.
- **TwoStageDiD (Gardner):** Similar two-step logic (estimate FE on controls, impute counterfactuals). The ETWFE pooled approach is a single-step alternative.
- **SunAbraham:** Both produce interaction-weighted estimates. SA uses TWFE with saturated interactions; ETWFE generalises to nonlinear outcomes.
- **EfficientDiD:** Chen, Sant'Anna & Xie (2025) achieve the semiparametric efficiency bound. ETWFE does not claim efficiency, but the pooled QMLE is typically more efficient than CS in the paper's simulations (Section 5: CS SEs at least 69% larger than logit PMLE; Poisson QMLE SDs 31-57% smaller than CS).
- **Solvers in `linalg.py`:** `solve_logit` is reused for the logit outcome path; `solve_poisson` (`diff_diff/linalg.py:3431`) is the IRLS solver used by the Poisson path (`diff_diff/wooldridge.py:1085-1124`).
- **`within_transform` in `utils.py`:** Can be used for the OLS path to absorb unit+time FE. NOT suitable for the nonlinear paths — logit/Poisson require explicit cohort and time dummy columns (no within-transformation for nonlinear models, due to incidental parameters).
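For context on the IRLS solver notes above, the generic shape of an IRLS loop for Poisson regression with a log link is shown below on simulated data. This is an illustrative sketch only; the shipped `solve_poisson` in `diff_diff/linalg.py` may differ in its update rule, convergence checks, and numerical safeguards, and the simulated coefficients are invented.

```python
import numpy as np

# Minimal IRLS for Poisson regression with log link (illustrative sketch).
# Each step solves a weighted least-squares problem with weights mu (the
# Poisson variance equals the mean) and working response z.
def poisson_irls(X, y, max_iter=25, tol=1e-10):
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = np.exp(X @ beta)            # conditional mean exp(x'beta)
        z = X @ beta + (y - mu) / mu     # working (linearized) response
        XtW = X.T * mu                   # X' diag(mu): IRLS weights
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.poisson(np.exp(0.5 + 0.8 * X[:, 1]))  # true coefficients: 0.5, 0.8
print(np.round(poisson_irls(X, y), 2))
```

The "< 25 iterations" remark under Computational Considerations reflects the Newton-type quadratic convergence of this loop; with 500 observations the estimates land close to the true (0.5, 0.8).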

---

## Simulation Findings (Section 5)

### Binary outcomes with common timing (Section 5.1)
- N = 500, T = 6, q = 4 (3 pre-treatment periods), 5000 MC replications
- **Logit PMLE essentially unbiased** for all three SATTs regardless of trend strength
- **POLS (linear) shows large downward bias** with strong aggregate trends (biases -0.15 to -0.29 vs true SATTs of 0.06-0.16)
- **CS (2021) also shows nontrivial negative bias** in the strong-trend scenarios
- POLS SEs at least 45% larger than logit PMLE SEs; CS SEs at least 69% larger
- PT tests (event-study and heterogeneous trends) fail to detect misspecification of the linear model — rejection rates around 5% even when LPT is violated

### Nonnegative outcomes with staggered intervention (Section 5.2)
- Poisson QMLE is the clear winner when the exponential mean is correct
- **Poisson QMLE essentially unbiased** for all six SATTs
- **Pooled OLS shows downward biases over 30%** in all cases, over 50% in some
- **CS has less bias than POLS** but its estimates still differ substantially from the true SATTs
- Poisson QMLE precision notably better than both alternatives (SDs 31-57% smaller than CS)
- The event-study PT test rejects the linear model 99.5% of the time; the heterogeneous-trends test rejects 99.9%
- Corner solution case (`P(Y = 0) ≈ 0.37`): Poisson QMLE still essentially unbiased despite the mass at zero

### Key takeaway
The simulations demonstrate that **when the outcome is binary or count-valued, using the correct nonlinear link function produces dramatically better estimates** than linear methods. The improvement is in both bias (nonlinear: ~0; linear: up to 50% bias) and precision (nonlinear SEs 30-70% smaller).

---

## Gaps and Uncertainties

1. **Fractional logit implementation details.** The paper mentions fractional responses (Papke & Wooldridge 1996, 2008) and the Bernoulli QLL with a known upper bound `B_{it}`, but does not provide explicit formulas for the bounded case. 
Implementation should follow standard fractional logit/probit references.

2. **Aggregation weight formulas not fully explicit.** Section 3.1 (p. C50) describes aggregation conceptually ("weighted averages... where the weights are the proportions of the treated cohorts") but does not provide explicit weight formulas with equation numbers. The `jwdid_estat` Stata command documentation should be consulted for exact weight definitions.

3. **Delta-method gradient for nonlinear ATTs.** The paper states "by applying the delta method with averaging" (p. C47, citing Wooldridge 2010 problem 12.17) but does not write out the explicit gradient. For implementation:
   - For `τ̃_{gr}` from Eq 3.15: `∂τ̃_{gr}/∂δ_{gr} = N_g⁻¹ Σ_i D_{ig} G'(η̂_i + δ̂_{gr})`
   - For other parameters `θ_k`: `∂τ̃_{gr}/∂θ_k = N_g⁻¹ Σ_i D_{ig} [G'(η̂_i + δ̂_{gr}) - G'(η̂_i)] · ∂η̂_i/∂θ_k`
   - Verify these against numerical gradients during implementation.

4. **Covariate centring.** The paper centres covariates at cohort means, `X̂_{ig} = X_i - X̄_g` where `X̄_g = E(X_i | D_{ig} = 1)` (p. C48). In practice, use sample cohort means. The centring affects the interpretation of `δ_{gr}` (it becomes the ATT on the log-odds/log scale) but does not affect the `τ_{gr}` estimates.

5. **Poisson FE vs pooled QMLE.** Section 3.3 (pp. C51-C52) notes that FE Poisson (with unit dummies) does NOT suffer from incidental parameters — a unique property of the exponential mean. Without covariates, FE Poisson and pooled QMLE give identical `γ_s` and `δ_{gs}` estimates; with covariates they differ. The pooled approach is recommended for simplicity.

6. **Online Appendix.** The paper references an Online Appendix with detailed simulation results (p. C56: "From the findings reported in the Online Appendix..."). This appendix is not included in the main PDF and may contain additional implementation-relevant details.
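The numerical check suggested in Gap 3 is cheap to script. For the `δ_{gr}` component with a logistic link it amounts to comparing the analytic mean of `G'` against a central finite difference of the ASF contrast; the `eta` and `delta` values below are invented for illustration.

```python
import numpy as np

# Gap 3 check: tau(delta) = mean_i [G(eta_i + delta) - G(eta_i)] over a cohort,
# so d tau / d delta = mean_i G'(eta_i + delta), with G' = G(1 - G) for the
# logistic link. Compare the analytic gradient to a central finite difference.
G = lambda z: 1.0 / (1.0 + np.exp(-z))
Gp = lambda z: G(z) * (1.0 - G(z))

eta = np.array([-0.3, 0.1, 0.8, 1.4])  # fitted indices for cohort-g units (invented)
delta = 0.6                            # index-scale effect delta_hat_{gr} (invented)

tau = lambda d: np.mean(G(eta + d) - G(eta))
analytic = np.mean(Gp(eta + delta))
h = 1e-6
numeric = (tau(delta + h) - tau(delta - h)) / (2 * h)
print(abs(analytic - numeric) < 1e-8)  # -> True
```

The same pattern extends to the `θ_k` gradients: perturb one parameter entering `η̂_i`, recompute the ASF contrast, and compare the finite difference against the factored analytic expression in Gap 3.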