
Add MultiPeriodDiD vs R (fixest) benchmark #135

Merged
igerber merged 2 commits into main from multi-period-r-comparison on Feb 8, 2026

Conversation

igerber (Owner) commented Feb 8, 2026

Summary

  • Add Python and R benchmark scripts for MultiPeriodDiD event-study estimation
  • Both run the identical OLS specification, outcome ~ treated * time_f, with cluster-robust SEs (see the sketch after this list)
  • Add generate_multiperiod_data() to benchmark utilities for simultaneous-treatment panel data
  • Integrate into run_benchmarks.py with --estimator multiperiod CLI option (5 scale configs)
  • Update documentation: METHODOLOGY_REVIEW.md, docs/benchmarks.rst, CLAUDE.md
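
A minimal sketch of that shared specification in R with fixest; the data frame and column names (panel, outcome, treated, time_f, unit) and clustering by unit are assumptions based on the description above, not the actual benchmark code:

```r
library(fixest)

# Treatment x period interaction regression with cluster-robust SEs,
# matching the shared spec described in the summary (names illustrative).
# time_f is assumed to be the period factor; unit is the cluster variable.
fit <- feols(
  outcome ~ treated * time_f,
  data    = panel,
  cluster = ~unit
)
summary(fit)
```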

Methodology references (required if estimator / math changes)

  • Method name(s): N/A — no estimator code changed; benchmark-only addition
  • Paper / source link(s): N/A
  • Any intentional deviations from the source (and why): None

Validation

  • Benchmark passes at small and 1k scales:
    • ATT diff: < 1e-11 (machine precision)
    • SE rel diff: 0.0% (exact match)
    • Period effects correlation: 1.000000
    • Period effects max diff: < 3e-11
  • All 141 existing tests pass (pytest tests/test_estimators.py)
  • No estimator code was modified

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Add benchmark infrastructure to validate MultiPeriodDiD event-study
estimation against R's fixest::feols with treatment × period interactions.
Both implementations run identical OLS regressions with cluster-robust SEs,
enabling exact coefficient-level matching.

Results: ATT diff < 1e-11, SE diff 0.0%, period effects correlation 1.0
at both small (200 units) and 1k scales. Python (pure) is ~7x faster than R.

New files:
- benchmarks/python/benchmark_multiperiod.py
- benchmarks/R/benchmark_multiperiod.R

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
github-actions bot commented Feb 8, 2026

Overall assessment: ⚠️ Needs changes

Executive summary

  • P1 methodology mismatch: the new benchmark/docs validate a non-FE spec that conflicts with the MultiPeriodDiD reference spec in the Methodology Registry.
  • P2 inference edge case: the R benchmark computes p-values/CIs even when the average SE is zero or non‑finite; registry expects NaN propagation.
  • No estimator code changes, but the benchmark + docs updates materially affect methodological claims.

Methodology

  • P1 Benchmark spec omits unit fixed effects and diverges from the Methodology Registry’s reference implementation (unit + time FE, i(time, treatment, ref=...) | unit + time). The R script uses feols(outcome ~ treated * time_f) and the Python benchmark uses the default MultiPeriodDiD spec (no unit FE), while the docs and METHODOLOGY_REVIEW.md now present this as the validated reference. Impact: the benchmark certifies a different estimand than the registry/papers, potentially masking bias and overstating methodological validation. Concrete fix: align the benchmarks and docs to the registry spec (e.g., feols(outcome ~ i(time, treated, ref=ref_period) | unit + time, cluster=~unit) in R, and fixed_effects=["unit"]/absorb=["unit"] in Python, with corresponding estimator changes to avoid collinearity), or explicitly update docs/methodology/REGISTRY.md and the estimator docs to reflect the non-FE spec if that is intended (see the sketch after this list).
    Refs: benchmarks/R/benchmark_multiperiod.R:L82-L86, benchmarks/python/benchmark_multiperiod.py:L92-L102, docs/benchmarks.rst:L161-L164, METHODOLOGY_REVIEW.md:L136-L139, docs/methodology/REGISTRY.md:L125-L196.
  • P2 Average ATT inference in the R benchmark does not guard against zero/NaN SE; it always computes t/p/CI when n_post_periods > 0. Impact: if the SE is non-finite or zero (e.g., rank deficiency, single cluster), the benchmark emits misleading inference and may disagree with diff-diff's NaN convention. Concrete fix: mirror registry behavior by setting avg_t, avg_pval, and the CIs to NA when avg_se is non-finite or <= 0 (see the sketch after this list).
    Refs: benchmarks/R/benchmark_multiperiod.R:L132-L139, docs/methodology/REGISTRY.md:L179-L183.
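
A combined sketch of both suggested fixes in R, built from the formulas quoted in the findings above; ref_period, avg_att, avg_se, dof, and the panel/column names are illustrative placeholders, not the actual script variables:

```r
library(fixest)

# P1: registry-aligned spec -- treated x period interactions with absorbed
# unit and time fixed effects, clustered by unit.
fit <- feols(
  outcome ~ i(time, treated, ref = ref_period) | unit + time,
  data    = panel,
  cluster = ~unit
)

# P2: propagate NA instead of reporting inference from a degenerate SE.
if (!is.finite(avg_se) || avg_se <= 0) {
  avg_t    <- NA_real_
  avg_pval <- NA_real_
  ci_lower <- NA_real_
  ci_upper <- NA_real_
} else {
  avg_t    <- avg_att / avg_se
  avg_pval <- 2 * pt(-abs(avg_t), df = dof)
  ci_lower <- avg_att - qt(0.975, df = dof) * avg_se
  ci_upper <- avg_att + qt(0.975, df = dof) * avg_se
}
```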

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • No findings.

P1: Add unit FE to match registry reference spec. R uses `| unit`
(absorption), Python uses `absorb=["unit"]` (within-transformation)
— both yield identical coefficients and SEs (0.0% diff).

P2: Guard R script against non-finite avg_se by setting avg_t,
avg_pval, and CI bounds to NA (matches REGISTRY.md convention).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
igerber (Owner, Author) commented Feb 8, 2026

/ai-review

github-actions bot commented Feb 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 3eee33e870d2538c4181ebb3e6630418c11f2be9


Overall Assessment

  • ✅ Looks good

Executive Summary

  • Benchmark-only PR; no estimator or core inference code changes.
  • Core MultiPeriodDiD regression spec aligns with the registry (unit FE + period dummies; reference period = last pre).
  • Two minor methodology alignment gaps in benchmark outputs: event_time definition and R-side df for avg ATT p/CI.
  • Consider a small guard for degenerate treatment shares and a smoke test for the new CLI path.

Methodology

  • P2 | Impact: event_time in the benchmark JSON is computed as period - reference_period, which shifts labels by +1 relative to the registry’s event-time definition (0 at treatment). This can mislabel timing in downstream plots/comparisons. | Fix: compute event_time relative to the treatment period (period - (n_pre + 1) for this DGP) or rename the field to clarify that it is reference-relative (see the sketch after this list). benchmarks/python/benchmark_multiperiod.py:110, benchmarks/R/benchmark_multiperiod.R:112
  • P2 | Impact: Average ATT p-values/CI in the R benchmark use df = nobs - length(coefs), ignoring the absorbed unit FE; this diverges from the registry’s “df adjusted for absorbed FE” and diff-diff’s inference. | Fix: use fixest’s residual df (model-provided) or explicitly subtract the absorbed FE count; alternatively omit the p-value/CI until aligned (see the sketch after this list). benchmarks/R/benchmark_multiperiod.R:141
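
A sketch of both adjustments in R; model, period, n_pre, and the avg_* names are placeholders taken from the findings above, and fixest's degrees_freedom() helper is assumed to return the residual df net of absorbed fixed effects:

```r
# Event time defined relative to the treatment period (0 at treatment),
# rather than relative to the reference period.
event_time <- period - (n_pre + 1)

# Residual degrees of freedom accounting for the absorbed unit FE, reused
# for the average-ATT p-value and confidence interval.
dof      <- degrees_freedom(model, type = "resid")
avg_pval <- 2 * pt(-abs(avg_att / avg_se), df = dof)
ci_lower <- avg_att - qt(0.975, df = dof) * avg_se
ci_upper <- avg_att + qt(0.975, df = dof) * avg_se
```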

Code Quality

  • P2 | Impact: generate_multiperiod_data can silently create zero-treated or all-treated datasets when treatment_fraction rounds to 0 or n_units, leading to rank-deficient benchmarks. | Fix: validate 0 < treatment_fraction < 1 and ensure n_treated is in [1, n_units-1] with a clear error. benchmarks/python/utils.py:369

Performance

  • P3 | Impact: No performance regressions observed in the new benchmark wiring. | Fix: None.

Maintainability

  • P3 | Impact: No maintainability issues identified in the new benchmark integration. | Fix: None.

Tech Debt

  • P3 | Impact: No new tech-debt issues identified. | Fix: None.

Security

  • P3 | Impact: No secrets/PII or unsafe IO patterns added. | Fix: None.

Documentation/Tests

  • P3 | Impact: No automated coverage for the new --estimator multiperiod CLI path; future wiring regressions could slip. | Fix: Add a lightweight smoke test or CI note for this path. benchmarks/run_benchmarks.py:846

igerber merged commit fcf9e28 into main on Feb 8, 2026
igerber deleted the multi-period-r-comparison branch on February 8, 2026 at 18:31