
Add MultiPeriodDiD vs R (fixest) benchmark #135

Merged
igerber merged 2 commits into main from multi-period-r-comparison on Feb 8, 2026

Conversation

igerber (Owner) commented Feb 8, 2026

Summary

  • Add Python and R benchmark scripts for MultiPeriodDiD event-study estimation
  • Both run the identical OLS specification, outcome ~ treated * time_f, with cluster-robust SEs (see the sketch after this list)
  • Add generate_multiperiod_data() to benchmark utilities for simultaneous-treatment panel data
  • Integrate into run_benchmarks.py with --estimator multiperiod CLI option (5 scale configs)
  • Update documentation: METHODOLOGY_REVIEW.md, docs/benchmarks.rst, CLAUDE.md
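
A minimal sketch of that shared specification in R with fixest; the data frame and column names (panel, outcome, treated, time_f, unit) and clustering by unit are assumptions based on the description above, not the actual benchmark code:

```r
library(fixest)

# Treatment x period interaction regression with cluster-robust SEs,
# matching the shared spec described in the summary (names illustrative).
# time_f is assumed to be the period factor; unit is the cluster variable.
fit <- feols(
  outcome ~ treated * time_f,
  data    = panel,
  cluster = ~unit
)
summary(fit)
```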

Methodology references (required if estimator / math changes)

  • Method name(s): N/A — no estimator code changed; benchmark-only addition
  • Paper / source link(s): N/A
  • Any intentional deviations from the source (and why): None

Validation

  • Benchmark passes at small and 1k scales:
    • ATT diff: < 1e-11 (machine precision)
    • SE rel diff: 0.0% (exact match)
    • Period effects correlation: 1.000000
    • Period effects max diff: < 3e-11
  • All 141 existing tests pass (pytest tests/test_estimators.py)
  • No estimator code was modified

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Add benchmark infrastructure to validate MultiPeriodDiD event-study
estimation against R's fixest::feols with treatment × period interactions.
Both implementations run identical OLS regressions with cluster-robust SEs,
enabling exact coefficient-level matching.

Results: ATT diff < 1e-11, SE diff 0.0%, period effects correlation 1.0
at both small (200 units) and 1k scales. Python (pure) is ~7x faster than R.

New files:
- benchmarks/python/benchmark_multiperiod.py
- benchmarks/R/benchmark_multiperiod.R

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
github-actions bot commented Feb 8, 2026

Overall assessment: ⚠️ Needs changes

Executive summary

  • P1 methodology mismatch: the new benchmark/docs validate a non-FE spec that conflicts with the MultiPeriodDiD reference spec in the Methodology Registry.
  • P2 inference edge case: the R benchmark computes p-values/CIs even when the average SE is zero or non‑finite; registry expects NaN propagation.
  • No estimator code changes, but the benchmark + docs updates materially affect methodological claims.

Methodology

  • P1 Benchmark spec omits unit fixed effects and diverges from the Methodology Registry’s reference implementation (unit + time FE, i(time, treatment, ref=...) | unit + time). The R script uses feols(outcome ~ treated * time_f) and the Python benchmark uses the default MultiPeriodDiD spec (no unit FE), while the docs and METHODOLOGY_REVIEW.md now present this as the validated reference. Impact: the benchmark certifies a different estimand than the registry/papers, potentially masking bias and overstating methodological validation. Concrete fix: align the benchmarks and docs to the registry spec (e.g., feols(outcome ~ i(time, treated, ref=ref_period) | unit + time, cluster=~unit) in R, and fixed_effects=["unit"]/absorb=["unit"] in Python, with corresponding estimator changes to avoid collinearity), or explicitly update docs/methodology/REGISTRY.md and the estimator docs to reflect the non-FE spec if that is intended (see the sketch after this list).
    Refs: benchmarks/R/benchmark_multiperiod.R:L82-L86, benchmarks/python/benchmark_multiperiod.py:L92-L102, docs/benchmarks.rst:L161-L164, METHODOLOGY_REVIEW.md:L136-L139, docs/methodology/REGISTRY.md:L125-L196.
  • P2 Average ATT inference in the R benchmark does not guard against zero/NaN SE; it always computes t/p/CI when n_post_periods > 0. Impact: if the SE is non-finite or zero (e.g., rank deficiency, single cluster), the benchmark emits misleading inference and may disagree with diff-diff's NaN convention. Concrete fix: mirror registry behavior by setting avg_t, avg_pval, and the CIs to NA when avg_se is non-finite or <= 0 (see the sketch after this list).
    Refs: benchmarks/R/benchmark_multiperiod.R:L132-L139, docs/methodology/REGISTRY.md:L179-L183.
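
A combined sketch of both suggested fixes in R, built from the formulas quoted in the findings above; ref_period, avg_att, avg_se, dof, and the panel/column names are illustrative placeholders, not the actual script variables:

```r
library(fixest)

# P1: registry-aligned spec -- treated x period interactions with absorbed
# unit and time fixed effects, clustered by unit.
fit <- feols(
  outcome ~ i(time, treated, ref = ref_period) | unit + time,
  data    = panel,
  cluster = ~unit
)

# P2: propagate NA instead of reporting inference from a degenerate SE.
if (!is.finite(avg_se) || avg_se <= 0) {
  avg_t    <- NA_real_
  avg_pval <- NA_real_
  ci_lower <- NA_real_
  ci_upper <- NA_real_
} else {
  avg_t    <- avg_att / avg_se
  avg_pval <- 2 * pt(-abs(avg_t), df = dof)
  ci_lower <- avg_att - qt(0.975, df = dof) * avg_se
  ci_upper <- avg_att + qt(0.975, df = dof) * avg_se
}
```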

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • No findings.

P1: Add unit FE to match registry reference spec. R uses `| unit`
(absorption), Python uses `absorb=["unit"]` (within-transformation)
— both yield identical coefficients and SEs (0.0% diff).

P2: Guard R script against non-finite avg_se by setting avg_t,
avg_pval, and CI bounds to NA (matches REGISTRY.md convention).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
igerber (Owner, Author) commented Feb 8, 2026

/ai-review

github-actions bot commented Feb 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 3eee33e870d2538c4181ebb3e6630418c11f2be9


Overall Assessment

  • ✅ Looks good

Executive Summary

  • Benchmark-only PR; no estimator or core inference code changes.
  • Core MultiPeriodDiD regression spec aligns with the registry (unit FE + period dummies; reference period = last pre).
  • Two minor methodology alignment gaps in benchmark outputs: event_time definition and R-side df for avg ATT p/CI.
  • Consider a small guard for degenerate treatment shares and a smoke test for the new CLI path.

Methodology

  • P2 | Impact: event_time in the benchmark JSON is computed as period - reference_period, which shifts labels by +1 relative to the registry’s event-time definition (0 at treatment). This can mislabel timing in downstream plots/comparisons. | Fix: compute event_time relative to the treatment period (period - (n_pre + 1) for this DGP) or rename the field to clarify that it is reference-relative (see the sketch after this list). benchmarks/python/benchmark_multiperiod.py:110, benchmarks/R/benchmark_multiperiod.R:112
  • P2 | Impact: Average ATT p-values/CI in the R benchmark use df = nobs - length(coefs), ignoring the absorbed unit FE; this diverges from the registry’s “df adjusted for absorbed FE” and diff-diff’s inference. | Fix: use fixest’s residual df (model-provided) or explicitly subtract the absorbed FE count; alternatively omit the p-value/CI until aligned (see the sketch after this list). benchmarks/R/benchmark_multiperiod.R:141
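
A sketch of both adjustments in R; model, period, n_pre, and the avg_* names are placeholders taken from the findings above, and fixest's degrees_freedom() helper is assumed to return the residual df net of absorbed fixed effects:

```r
# Event time defined relative to the treatment period (0 at treatment),
# rather than relative to the reference period.
event_time <- period - (n_pre + 1)

# Residual degrees of freedom accounting for the absorbed unit FE, reused
# for the average-ATT p-value and confidence interval.
dof      <- degrees_freedom(model, type = "resid")
avg_pval <- 2 * pt(-abs(avg_att / avg_se), df = dof)
ci_lower <- avg_att - qt(0.975, df = dof) * avg_se
ci_upper <- avg_att + qt(0.975, df = dof) * avg_se
```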

Code Quality

  • P2 | Impact: generate_multiperiod_data can silently create zero-treated or all-treated datasets when treatment_fraction rounds to 0 or n_units, leading to rank-deficient benchmarks. | Fix: validate 0 < treatment_fraction < 1 and ensure n_treated is in [1, n_units-1] with a clear error. benchmarks/python/utils.py:369

Performance

  • P3 | Impact: No performance regressions observed in the new benchmark wiring. | Fix: None.

Maintainability

  • P3 | Impact: No maintainability issues identified in the new benchmark integration. | Fix: None.

Tech Debt

  • P3 | Impact: No new tech-debt issues identified. | Fix: None.

Security

  • P3 | Impact: No secrets/PII or unsafe IO patterns added. | Fix: None.

Documentation/Tests

  • P3 | Impact: No automated coverage for the new --estimator multiperiod CLI path; future wiring regressions could slip. | Fix: Add a lightweight smoke test or CI note for this path. benchmarks/run_benchmarks.py:846

igerber merged commit fcf9e28 into main on Feb 8, 2026
igerber deleted the multi-period-r-comparison branch on February 8, 2026 at 18:31