Skip to content

Refresh METHODOLOGY_REVIEW.md to reflect current estimator catalog #448

Open
igerber wants to merge 12 commits into main from methodology-review

Conversation


@igerber igerber commented May 15, 2026

Summary

  • Audit the methodology-review tracker against __init__.py __all__, docs/methodology/REGISTRY.md, docs/methodology/papers/, and tests/test_methodology_*.py (a sketch of this cross-check follows the list). The tracker had fallen ~9 estimators behind the library.
  • Reorganize into seven categories: Core / Staggered / Continuous & Universal-Treatment / Triple-Difference / Counterfactual / Diagnostics / Cross-Cutting Inference Features.
  • Tighten the "Complete" bar with an explicit definition (Verified Components + Corrections Made + Deviations + methodology test file).
  • Add In Progress entries with "Documentation in place" / "Outstanding for promotion" blocks for: ImputationDiD, TwoStageDiD, WooldridgeDiD (ETWFE), EfficientDiD, ContinuousDiD, ChaisemartinDHaultfoeuille (DCDH), HeterogeneousAdoptionDiD (HAD), TROP, StaggeredTripleDifference, ConleySpatialHAC, Survey Data Support, PlaceboTests.
  • Refresh SyntheticDiD's last-review date to 2026-04-23, the landing date of PR #351 ("Add SyntheticDiD variance_method='bootstrap_refit' and coverage MC study").
  • Refresh methodology-test counts on existing Complete entries (CallawaySantAnna 61, HonestDiD 27 methodology + 72 unit, DifferenceInDifferences 51, TripleDifference 45).
  • Document the priority order for substantive review work: BaconDecomposition flagged as the #1 substantive target (chosen during this session); the In-Progress promotion ladder enumerated (HAD largest surface, DCDH closest to ready).
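
A minimal sketch of the cross-check behind the audit, assuming the package imports as `diff_diff` and that the tracker sits at `METHODOLOGY_REVIEW.md` in the repo root (both taken from this PR's references, not re-verified here):

```python
# List exported names that the tracker never mentions; names surfaced this way
# are candidates for the "missing entirely" bucket described above.
from pathlib import Path

import diff_diff

tracker_text = Path("METHODOLOGY_REVIEW.md").read_text()
missing = [name for name in diff_diff.__all__ if name not in tracker_text]
print(sorted(missing))
```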

Methodology references (required if estimator / math changes)

  • Method name(s): N/A - tracker / docs-only change; no estimator or math modifications
  • Paper / source link(s): N/A
  • Any intentional deviations from the source (and why): None

Validation

  • Tests added/updated: No test changes
  • Backtest / simulation / notebook evidence (if applicable): N/A
  • Existing Complete entries are unchanged in substance; only their test counts and SyntheticDiD's last-review date were refreshed to match the current state.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

The tracker had fallen ~9 estimators behind the library. An audit against
__init__.py __all__, docs/methodology/REGISTRY.md, docs/methodology/papers/,
and tests/test_methodology_*.py surfaced four stale "Not Started" entries
and roughly 10 entries missing entirely.

Changes:

- Reorganized Review Status Summary into seven categories (Core,
  Staggered, Continuous & Universal-Treatment, Triple-Difference,
  Counterfactual, Diagnostics, Cross-Cutting Inference Features).
- Added "What 'Complete' means" tier definition (Complete / In Progress
  / Not Started) so the bar is explicit.
- Added In-Progress entries for ImputationDiD, TwoStageDiD, WooldridgeDiD
  (ETWFE), EfficientDiD, ContinuousDiD, ChaisemartinDHaultfoeuille (DCDH),
  HeterogeneousAdoptionDiD (HAD), TROP, StaggeredTripleDifference,
  ConleySpatialHAC, Survey Data Support, PlaceboTests. Each "In Progress"
  block lists what's already in place (REGISTRY section, paper review,
  methodology test file, parity fixtures, test counts) and what's needed
  to promote to Complete.
- Updated SyntheticDiD last-review date to 2026-04-23 (PR #351
  bootstrap-refit landing date) to reflect the warm-start FW corrections.
- Refreshed methodology-test counts on the existing Complete entries to
  match current `grep` output (CallawaySantAnna 61, HonestDiD 27,
  TripleDifference 45, DifferenceInDifferences 51, HonestDiD unit 72).
- Updated Priority Order: BaconDecomposition flagged as next substantive
  review (chosen during this session); In-Progress promotion ladder
  documented (HAD largest, DCDH closest to ready, etc.).

No source code changes. No status flips on existing Complete entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings. This is a docs-only PR, and the estimator catalog refresh itself looks directionally correct. The issues I found are tracker-accuracy problems inside METHODOLOGY_REVIEW.md, not estimator-methodology regressions.

Executive Summary

  • No estimator, weighting, variance, or inference code changed in this PR.
  • The refreshed catalog appears to cover the live estimator surface; I did not find a missing exported estimator/tool in the tracker.
  • P2: the new definition of **Complete** is stricter than what several entries still marked **Complete** actually satisfy.
  • P3: some of the newly refreshed counts are already inaccurate, which weakens the tracker as a current-state inventory.
  • The PR’s references to TODO-tracked methodology gaps remain appropriately non-blocking.

Methodology

  • Severity P2. Impact: the new tracker contract is internally inconsistent. The document now says a Complete entry must include Verified Components, Corrections Made, Deviations, and R/reference parity, but SyntheticDiD is still marked **Complete** without a Verified Components or Deviations block, and HonestDiD is marked **Complete** while its detailed entry still says “pending R comparison” and contains unchecked items. Several other Complete entries also still lack a Deviations section. That makes the tracker overstate review completeness for methodology-sensitive surfaces. Concrete fix: either downgrade those entries to In Progress until the missing blocks/parity are added, or relax the new Complete definition/legend so it matches the current document structure. References: METHODOLOGY_REVIEW.md:L17-L24, METHODOLOGY_REVIEW.md:L76-L84, METHODOLOGY_REVIEW.md:L93-L96, METHODOLOGY_REVIEW.md:L104-L152, METHODOLOGY_REVIEW.md:L154-L203, METHODOLOGY_REVIEW.md:L205-L275, METHODOLOGY_REVIEW.md:L719-L774, METHODOLOGY_REVIEW.md:L802-L859, METHODOLOGY_REVIEW.md:L888-L918.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings beyond the documentation drift called out below.

Tech Debt

No findings. The PR’s non-blocking references to deferred methodology work line up with existing tracking in TODO.md, which is the right treatment for those items. References: TODO.md:L74-L77, TODO.md:L118-L120.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: some refreshed “current state” counts are already wrong, so the tracker will drift again immediately. Survey Data Support says “8 dedicated test files” while enumerating 13 filenames, and StackedDiD says “72 tests ... across 11 test classes” while tests/test_stacked_did.py contains 10 Test* class declarations. Concrete fix: recompute these counts before merge, or stop hard-coding exact counts and instead generate them from the repo or use less brittle wording. References: METHODOLOGY_REVIEW.md:L470-L470, METHODOLOGY_REVIEW.md:L1079-L1083, tests/test_stacked_did.py:L66-L854.
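
A rough sketch of the "generate them from the repo" option (the path below is the file named in the finding; counting is by static AST inspection, so it matches what a `grep`-style audit would see without running pytest):

```python
# Count Test* classes and test_* functions in a test module, so the tracker
# can quote numbers that are reproducible on demand instead of hand-maintained.
import ast
from pathlib import Path


def test_inventory(path: str) -> tuple[int, int]:
    tree = ast.parse(Path(path).read_text())
    n_classes = sum(
        isinstance(node, ast.ClassDef) and node.name.startswith("Test")
        for node in ast.walk(tree)
    )
    n_tests = sum(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name.startswith("test_")
        for node in ast.walk(tree)
    )
    return n_classes, n_tests


print(test_inventory("tests/test_stacked_did.py"))  # (Test* classes, test_* functions)
```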

P2 (tracker contract internal consistency): the new "What 'Complete'
means" definition was stricter than what several existing Complete
entries satisfied (notably SyntheticDiD lacked a Verified Components block;
HonestDiD had unchecked Verified Components items and an awkward
"(pending R comparison)" status caveat; DiD / MultiPeriodDiD / TWFE /
TripleDifference lacked explicit Deviations blocks). Resolved by:

- Relaxing the legend to acknowledge format variation across the
  existing Complete entries (catalog grew incrementally; the
  invariant is documented walk-through against the academic source,
  not a fixed structural template).
- Aspiring to the fuller structure for new reviews going forward.
- Backfilling SyntheticDiD's Verified Components block (Frank-Wolfe
  on collapsed form, two-pass sparsification, auto-zeta from data
  noise level, pairs-bootstrap refit with warm-start, placebo + jackknife
  variance methods) and Deviations from R block (default
  variance_method, parameter renaming with deprecation aliases).
- Adding brief Deviations blocks to DiD / MultiPeriodDiD / TWFE /
  TripleDifference: "(None — verified equivalent)" for DiD and TWFE;
  HC1-default + e=-1 reference for MultiPeriodDiD; panel=FALSE-only
  for TripleDifference.
- Dropping the "(pending R comparison)" parenthetical from HonestDiD's
  status field; the R-parity-pending note remains in Outstanding
  Concerns where it belongs.

P3 (count drift): refreshed the counts cited by the reviewer:

- StackedDiD: "72 tests across 11 test classes" -> "10 test classes"
  with an explicit category list. The 72 count predated this refresh
  and was stale; class count is now structurally pinned to the actual
  Test* declarations.
- Survey Data Support: "8 dedicated test files" -> "13 dedicated
  test files" (matches the existing enumeration).

Opportunistically replaced brittle unit-test counts in the existing
DiD / CallawaySantAnna / SunAbraham Complete entries with structural
wording ("Existing unit-test coverage in tests/test_X.py"), per the
reviewer's "less brittle wording" suggestion. Methodology-test counts
(which correspond to the Verified Components walk-through contract)
remain pinned.

No estimator, weighting, variance, or inference code changed. No
status flips on existing Complete entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 46661b8d762254da7095c66d6266ae9fd0dc5c31


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings. This remains a docs-only PR, and the prior re-review concerns about the Complete contract and the old Survey/StackedDiD tracker inaccuracies are addressed. The remaining issues are P3 tracker-accuracy items only.

Executive Summary

  • No estimator implementation, weighting, variance, or inference code changed.
  • The prior tracker-consistency issue is resolved: the new Complete definition now matches the mixed historical entry formats, and SyntheticDiD now has full Complete scaffolding in the tracker. METHODOLOGY_REVIEW.md:L17-L29, METHODOLOGY_REVIEW.md:L819-L898
  • The previous stale-count findings called out for Survey Data Support and StackedDiD were corrected. METHODOLOGY_REVIEW.md:L482-L484, METHODOLOGY_REVIEW.md:L1116-L1120
  • P3: the new WooldridgeDiD “paper review on file” bullet points to a review of the 2023 nonlinear paper, while the tracker and REGISTRY name Wooldridge (2025) as the primary ETWFE source.
  • P3: several newly hard-coded inventory facts are already stale or incomplete (HonestDiD class count, T21 drift-test count, Survey section’s “every estimator” list / survey-theory.md line count).
  • I could not use pytest --collect-only here because pytest is not installed, so count checks are best-effort by source inspection.

Methodology

  • Severity P3. Impact: the WooldridgeDiD entry slightly overstates source-review readiness. METHODOLOGY_REVIEW.md:L594-L603 says a paper review is on file, but the cited file is docs/methodology/papers/wooldridge-2023-review.md, whose reviewed primary source is Wooldridge (2023), while both the tracker and REGISTRY designate Wooldridge (2025) as the primary ETWFE source. Concrete fix: either relabel this as a secondary-source review, or add a dedicated wooldridge-2025-review.md and point the tracker at that file. References: METHODOLOGY_REVIEW.md:L584-L603, docs/methodology/REGISTRY.md:L1308-L1313, docs/methodology/papers/wooldridge-2023-review.md:L1-L18.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings beyond the documentation brittleness noted below.

Tech Debt

No findings. The PR’s deferred-work references remain properly tracked in TODO.md, so the StaggeredTripleDifference R-parity/WIF items and the SyntheticDiD / Conley follow-ups stay non-blocking. References: TODO.md:L74-L76, TODO.md:L118-L120.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the tracker is already drifting again as a current-state inventory. Examples: METHODOLOGY_REVIEW.md:L951-L955 says tests/test_honest_did.py has 14 classes, but the file has 15 Test* classes (tests/test_honest_did.py:L157, L205, L255, L297, L329, L429, L457, L488, L553, L581, L597, L700, L1129, L1326, L1343); METHODOLOGY_REVIEW.md:L695-L698 says the HAD tutorial drift suites are 16 + 32 tests, but tests/test_t21_had_pretest_workflow_drift.py:L153-L389 contains 17 test_* functions; and METHODOLOGY_REVIEW.md:L1118-L1120 says docs/methodology/survey-theory.md is 805 lines and that the listed estimators cover “every estimator” with survey hooks, but the file is 858 lines and survey support is also present for SunAbraham, TwoStageDiD, and StackedDiD. Concrete fix: recompute these inventory facts from the repo before merge, or soften them to non-exact wording / generate them automatically. References: METHODOLOGY_REVIEW.md:L695-L698, METHODOLOGY_REVIEW.md:L951-L955, METHODOLOGY_REVIEW.md:L1118-L1120, docs/methodology/REGISTRY.md:L1136-L1196, docs/methodology/REGISTRY.md:L1214-L1305, docs/methodology/REGISTRY.md:L2833, diff_diff/sun_abraham.py:L485-L926, diff_diff/two_stage.py:L183-L846, diff_diff/stacked_did.py:L171-L597.

P3.1: WooldridgeDiD source mismatch. The "Paper review on file" bullet
pointed at docs/methodology/papers/wooldridge-2023-review.md, which
covers Wooldridge (2023) "Simple approaches to nonlinear DiD" -- the
companion nonlinear-extension paper, NOT the primary ETWFE source
Wooldridge (2025) Empirical Economics 69(5), 2545-2587. Relabeled
the bullet to "Companion-paper review on file" and explicitly noted
that a dedicated review for the primary ETWFE source is not yet on
file. Added a new "Outstanding for promotion" row calling for
docs/methodology/papers/wooldridge-2025-review.md. Updated the
Priority Order section's WooldridgeDiD line with the same precise
framing.

P3.2: count drift in newly added entries:
- HonestDiD: "72 tests in tests/test_honest_did.py (14 classes)"
  -> "15 test classes" with structural enumeration (DeltaSD/DeltaRM/
  DeltaSDRM bounds, FLCI, ARP, CS integration, edge cases). Brittle
  total-count number dropped in favor of structural description.
- HAD tutorials: "(16 + 32 tests)" -> reference companion test
  filenames instead of the soon-to-drift counts.
- survey-theory.md: dropped "(805 lines)" parenthetical; line count
  is brittle and adds no informational value.
- Survey support estimator list: expanded from 11 estimators to
  the complete 15 (added SunAbraham, StackedDiD, TwoStageDiD, and
  TripleDifference, then reordered for category grouping).

No status flips. No estimator, weighting, variance, or inference
code changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 899caace471dec702f4f913d8627bde2cbdcbc58


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings. This is still a docs-only PR, and the prior Wooldridge primary-source mismatch from the last re-review is addressed. Remaining issues are P3 tracker-accuracy items only.

Executive Summary

  • No estimator implementation, weighting, variance, inference, or default-behavior code changed; the diff is confined to METHODOLOGY_REVIEW.md.
  • The prior Wooldridge source-review issue is fixed: the tracker now correctly treats wooldridge-2023-review.md as a companion-paper review and explicitly states that the primary Wooldridge (2025) review is still missing. METHODOLOGY_REVIEW.md:L594-L600, docs/methodology/REGISTRY.md:L1308-L1313
  • The earlier stale-count issues for StackedDiD and HonestDiD called out in the previous rerun are corrected. METHODOLOGY_REVIEW.md:L482-L484, METHODOLOGY_REVIEW.md:L952-L956
  • P3: the new Survey Data Support inventory still omits SyntheticDiD, even though the estimator is exported at the top level, accepts survey_design=, and has registry-documented survey support. METHODOLOGY_REVIEW.md:L1117-L1121, diff_diff/__init__.py:L83-L88, diff_diff/synthetic_did.py:L263-L323, docs/methodology/REGISTRY.md:L1688-L1690
  • P3: the new HAD test-inventory counts are already drifting again; the hard-coded 247, 12, and 11 figures do not match the current referenced files on source inspection. METHODOLOGY_REVIEW.md:L693-L699, tests/test_had_pretests.py:L76-L5473, tests/test_bandwidth_selector.py:L55-L540, tests/test_nprobust_port.py:L27-L653

Methodology

No findings. The PR does not modify any estimator or inference code, and the only methodology-adjacent concern from the prior re-review was resolved by clarifying Wooldridge 2023 vs. Wooldridge 2025 review status. METHODOLOGY_REVIEW.md:L594-L600, docs/methodology/REGISTRY.md:L1308-L1313

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings beyond the documentation brittleness noted below.

Tech Debt

No findings. The new tracked references for StaggeredTripleDifference and Conley/SyntheticDiD follow-ups point to real TODO entries and remain non-blocking. TODO.md:L74-L76, TODO.md:L119-L120

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the Survey Data Support section still understates the current survey-capable surface. METHODOLOGY_REVIEW.md:L1117-L1121 says the per-estimator survey hooks cover every estimator with survey support, but SyntheticDiD is omitted even though it is a top-level estimator and its fit() explicitly supports survey_design= for bootstrap, placebo, and jackknife, with registry-backed survey methodology notes. Concrete fix: add SyntheticDiD to the survey-support list; if the intent is to inventory every survey-capable surface rather than only estimators, decide explicitly whether survey-capable diagnostics such as BaconDecomposition belong there too. References: METHODOLOGY_REVIEW.md:L1117-L1121, diff_diff/__init__.py:L83-L88, diff_diff/synthetic_did.py:L263-L323, docs/methodology/REGISTRY.md:L1688-L1690

  • Severity P3. Impact: the HAD entry’s new hard-coded test counts are already stale or at least under-qualified. METHODOLOGY_REVIEW.md:L693-L699 claims tests/test_had_pretests.py has 247 tests and the bandwidth-port coverage is 12 tests in tests/test_bandwidth_selector.py plus 11 in tests/test_nprobust_port.py, but current source inspection finds 248 test_* definitions in tests/test_had_pretests.py, 45 in tests/test_bandwidth_selector.py, and 47 in tests/test_nprobust_port.py. Concrete fix: recompute these counts before merge, qualify them as subset counts if that is what was intended, or avoid hard-coded numbers and describe the files more generally. References: METHODOLOGY_REVIEW.md:L693-L699, tests/test_had_pretests.py:L76-L5473, tests/test_bandwidth_selector.py:L55-L540, tests/test_nprobust_port.py:L27-L653

P3.1: Survey support list omitted SyntheticDiD. SDiD is exported as a
top-level estimator, accepts survey_design=, and has REGISTRY-documented
survey support on its placebo + jackknife paths (refit bootstrap rejects
survey designs in this release, but the other variance methods work).
Added SyntheticDiD to the Survey Data Support estimator list and added
an explicit scope clarification: scope is *estimators*; survey-capable
diagnostics (BaconDecomposition Phase 3, HonestDiD survey-df handling)
are tracked in their own sections.
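
To keep this list from drifting again, a future audit could enumerate the
survey-capable surface directly. A rough sketch, assuming diff_diff is
importable and that the keyword appears explicitly in each fit() signature
rather than via **kwargs (both assumptions, not verified here):

```python
# Print exported classes whose fit() signature names a survey_design parameter.
import inspect

import diff_diff

survey_capable = []
for name in diff_diff.__all__:
    fit = getattr(getattr(diff_diff, name, None), "fit", None)
    if not callable(fit):
        continue
    try:
        params = inspect.signature(fit).parameters
    except (TypeError, ValueError):
        continue
    if "survey_design" in params:
        survey_capable.append(name)
print(sorted(survey_capable))
```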

P3.2: HAD test counts drifted further (247 -> 248 in test_had_pretests.py;
the 12 and 11 counts for test_bandwidth_selector.py and
test_nprobust_port.py were sub-section subset counts I mis-cited as file
totals -- actual file totals are 45 and 47 respectively). De-brittled by
replacing hard-coded counts with structural file descriptions:
test_had.py for the HAD estimator, test_had_pretests.py for the
qug/stute/yatchew_hr/workflow surface, test_had_mc.py for Monte-Carlo
coverage, test_had_dual_knob_deprecation.py for deprecation paths,
test_bandwidth_selector.py for the public-API wrapper, and
test_nprobust_port.py for the full lprobust/lpbwselect_mse_dpi port.

No status flips. No estimator, weighting, variance, or inference code
changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a02c61c44509ea9df7a890723441ed0e90267fea


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings. This is still a docs-only PR. The two prior P3 issues from the last re-review are addressed, and the remaining issues are minor tracker-accuracy mismatches.

Executive Summary

  • No estimator implementation, weighting, variance, inference, or default-behavior code changed; the diff is confined to METHODOLOGY_REVIEW.md.
  • The prior Survey Data Support omission is fixed: SyntheticDiD is now explicitly included in the survey-capable estimator list. METHODOLOGY_REVIEW.md:L1117-L1121, diff_diff/synthetic_did.py:L263-L323
  • The prior HAD stale-count issue is fixed: the tracker no longer hard-codes the incorrect HAD pretest / bandwidth counts and instead points to the relevant coverage files. METHODOLOGY_REVIEW.md:L693-L699
  • Severity P3: the new Conley section understates current R-parity coverage and references a non-existent generator pattern. METHODOLOGY_REVIEW.md:L1099-L1102, tests/test_conley_vcov.py:L2371-L2452, tests/test_conley_vcov.py:L3399-L3418, benchmarks/R/generate_conley_golden.R:L1-L10, docs/methodology/REGISTRY.md:L3063-L3069
  • Severity P3: the new StaggeredTripleDifference section incorrectly says there is no separate unit-test file. METHODOLOGY_REVIEW.md:L804-L808, tests/test_staggered_triple_diff.py:L1-L140

Methodology

  • Severity P3. Impact: METHODOLOGY_REVIEW.md:L1099-L1102 says Conley parity is still a self-reference baseline with unpinned R values, but the repo already ships committed R conleyreg goldens, dedicated parity tests, and a REGISTRY note stating parity to <= 1e-6 on six fixtures. That makes the new tracker entry materially understate the current methodology-validation state. Concrete fix: rewrite this bullet to reference the real generator script benchmarks/R/generate_conley_golden.R and describe the actual remaining gap, such as the missing dedicated methodology walk-through or missing summary table in METHODOLOGY_REVIEW.md. tests/test_conley_vcov.py:L2371-L2452, tests/test_conley_vcov.py:L3399-L3418, benchmarks/R/generate_conley_golden.R:L1-L10, docs/methodology/REGISTRY.md:L3063-L3069

Code Quality

No findings.

Performance

No findings.

Maintainability

No separate findings beyond the documentation drift noted above.

Tech Debt

No findings. The deferred items newly cited for StaggeredTripleDifference and SyntheticDiD are actually tracked in TODO.md, so they are correctly non-blocking. TODO.md:L74-L76, TODO.md:L118-L120

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: METHODOLOGY_REVIEW.md:L804-L808 says StaggeredTripleDifference has “no separate per-class unit test file,” but tests/test_staggered_triple_diff.py already exists as a dedicated unit-test suite. That understates the current test surface and makes the promotion checklist less trustworthy. Concrete fix: replace that sentence with an explicit reference to tests/test_staggered_triple_diff.py, ideally alongside tests/test_methodology_staggered_triple_diff.py so the tracker distinguishes unit coverage from methodology coverage. tests/test_staggered_triple_diff.py:L1-L140

P3.1: Conley R-parity status was understated. The tracker said parity
was a self-reference baseline with unpinned R values, but the repo
already ships:
- benchmarks/data/r_conleyreg_conley_golden.json (committed goldens)
- benchmarks/R/generate_conley_golden.R (generator)
- TestConleyParityR / TestConleySparseRParityForced parity-test classes
  asserting atol=1e-6 on the cross-sectional path
- TestConleyParitySpacetime asserting ~1e-14 on the panel path
  (matches conleyreg::time_dist.cpp form)

Reframed the Conley entry's Outstanding for promotion: the real gap
is a dedicated methodology test file and a summary R-parity table in
this tracker, not the R parity itself. Added a new "Documentation in
place (R parity)" sub-block.

P3.2: StaggeredTripleDifference entry incorrectly said no separate
unit test file exists. tests/test_staggered_triple_diff.py is a
dedicated unit-test suite (~680 lines covering DR/RA/IPW paths, both
control-group modes, GMM weighting, event-study aggregation, edge
cases). Fixed the bullet to point at the actual file.

No status flips. No estimator, weighting, variance, or inference code
changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 683ac1a637e0925bf3a7f56dd8130af7c04dbde1


Overall Assessment

✅ Looks good. There are no unmitigated P0/P1 findings. This remains a docs-only PR, and the prior re-review issues are largely fixed, but two non-blocking tracker-accuracy issues remain.

Executive Summary

  • The diff is confined to METHODOLOGY_REVIEW.md; no estimator, weighting, variance, inference, or default-behavior code changed.
  • The prior StaggeredTripleDifference tracker issue is fixed: the document now points to the dedicated unit-test suite. METHODOLOGY_REVIEW.md:L804-L808, tests/test_staggered_triple_diff.py:L1-L160
  • The prior Conley docs gap is mostly fixed: the tracker now references the committed conleyreg goldens and the real generator script. METHODOLOGY_REVIEW.md:L1099-L1100, tests/test_conley_vcov.py:L2371-L2452
  • Severity P3: the Conley R-parity bullets still misstate which tests cover which surface and overstate the panel parity tolerance. METHODOLOGY_REVIEW.md:L1099-L1102, tests/test_conley_vcov.py:L2429-L2479, tests/test_conley_vcov.py:L3399-L3450, tests/test_conley_vcov.py:L2632-L2635
  • Severity P3: the Survey Data Support section says the strata-vs-no-strata RNG-divergence limitation is tracked in TODO.md, but that documentation currently lives in the REGISTRY/HAD survey note instead. METHODOLOGY_REVIEW.md:L1131-L1132, TODO.md:L51-L121, docs/methodology/REGISTRY.md:L2472-L2474

Methodology

  • Severity P3. Impact: METHODOLOGY_REVIEW.md says cross-sectional Conley parity is covered by TestConleyParityR and TestConleySparseRParityForced, and says panel parity is asserted at ~1e-14 in TestConleyParitySpacetime. In the repo, TestConleyParityR is the cross-sectional 1e-6 R-parity harness, TestConleyParitySpacetime is the panel 1e-6 R-parity harness, and the machine-precision time_dist.cpp statement belongs to a separate hand-coded block-decomposition check. This overstates the current panel R-parity contract and mislabels the forced-sparse panel class as cross-sectional coverage. Concrete fix: split this into three bullets: cross-sectional R parity at 1e-6, panel R parity at 1e-6, and a separate internal block-decomposition check at machine precision. METHODOLOGY_REVIEW.md:L1099-L1102, tests/test_conley_vcov.py:L2429-L2479, tests/test_conley_vcov.py:L3399-L3450, tests/test_conley_vcov.py:L2632-L2635

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: the new Survey Data Support promotion checklist tells readers that the strata-vs-no-strata non-bit-exactness / RNG-divergence limitation is documented in TODO.md, but I could not find a corresponding TODO entry in TODO.md's Tech Debt from Code Reviews section. The actual documented note is in the HAD survey-pretest REGISTRY section. That sends reviewers to the wrong follow-up artifact. Concrete fix: either add the promised TODO row, or change the sentence to cite the existing REGISTRY note instead of TODO. METHODOLOGY_REVIEW.md:L1131-L1132, TODO.md:L51-L121, docs/methodology/REGISTRY.md:L2472-L2474, tests/test_had_pretests.py:L3948-L3959

Security

  • No findings.

Documentation/Tests

  • No additional findings. The prior StaggeredTripleDifference unit-test-file mismatch is resolved in this revision. METHODOLOGY_REVIEW.md:L804-L808, tests/test_staggered_triple_diff.py:L1-L160

P3.1: Conley R-parity bullets misstated which class covers which surface
and overstated panel parity tolerance. The previous wording bundled
TestConleyParityR with TestConleySparseRParityForced under "cross-sectional"
and claimed panel parity at ~1e-14, but:
- TestConleyParityR is cross-sectional R parity at atol=1e-6
- TestConleyParitySpacetime is panel (space-time) R parity at atol=1e-6
- TestConleySparseRParityForced is the panel R parity with sparse path
  forced, also at atol=1e-6
- The machine-precision claim belongs to a method INSIDE
  TestConleyParitySpacetime (test_panel_matches_block_decomposed_reference)
  that cross-checks the dense path against the hand-coded block
  decomposition from conleyreg::time_dist.cpp at inner tolerance atol=1e-12

Split the bullet into three accurate entries.

P3.2: the Survey strata-vs-no-strata RNG-divergence pointer was wrong. I
said the documented limitation lives in TODO.md, but the actual note is
in the docs/methodology/REGISTRY.md HAD Stute
survey-bootstrap section ("Distributional parity, NOT bit-exact"
note). Updated the pointer and added the explanation that
distributional parity holds at large B but exact agreement at
atol=1e-10 does not.
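
For intuition, an illustrative sketch only (not the library's survey-bootstrap
code; two independent RNG streams stand in for the strata vs. no-strata
resampling paths):

```python
# Two bootstrap runs whose RNG streams diverge agree in distribution as B
# grows, but their point values never match to atol=1e-10.
import numpy as np


def boot_se(y: np.ndarray, B: int, seed: int) -> float:
    rng = np.random.default_rng(seed)
    n = y.size
    means = [y[rng.integers(0, n, size=n)].mean() for _ in range(B)]
    return float(np.std(means, ddof=1))


y = np.random.default_rng(0).normal(size=500)
for B in (200, 2_000, 20_000):
    gap = abs(boot_se(y, B, seed=1) - boot_se(y, B, seed=2))
    print(B, gap)  # shrinks toward zero as B grows, but stays far above 1e-10
```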

No status flips. No estimator, weighting, variance, or inference code
changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b6697b6eba77c3672f99dfa23a0fd2b171bf45b4


Overall Assessment

✅ Looks good

Executive Summary

  • The diff is confined to METHODOLOGY_REVIEW.md; no estimator, weighting, variance, or inference code changed.
  • The two prior re-review tracker-accuracy issues are fixed: the Conley parity bullets now match the actual test surfaces, and the Survey Data Support section now points to the documented REGISTRY note rather than a nonexistent TODO entry.
  • Severity P3: the new status taxonomy is internally inconsistent. The legend says entries with REGISTRY/tests on file are In Progress, but several rows with exactly that scaffolding are still labeled Not Started.
  • Severity P3: the PowerAnalysis entry is stale. It says docs/methodology/REGISTRY.md lacks primary sources and uses that as a prioritization reason, but the registry already lists them.

Methodology

  • No findings. This remains a docs-only change, and the prior Conley/survey tracker issues are resolved. METHODOLOGY_REVIEW.md:L1099-L1103, tests/test_conley_vcov.py:L2371-L2479, tests/test_conley_vcov.py:L2632-L2635, tests/test_conley_vcov.py:L3399-L3450, METHODOLOGY_REVIEW.md:L1132-L1132, docs/methodology/REGISTRY.md:L2474-L2474

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity P3. Impact: the new legend makes Not Started and In Progress non-exclusive. It says any entry with a REGISTRY entry, tests, or a paper review is In Progress, but BaconDecomposition, PreTrendsPower, PowerAnalysis, and PlaceboTests are still marked Not Started while their sections immediately list REGISTRY coverage and tests. METHODOLOGY_REVIEW.md:L27-L29, METHODOLOGY_REVIEW.md:L95-L96, METHODOLOGY_REVIEW.md:L81-L85, METHODOLOGY_REVIEW.md:L913-L919, METHODOLOGY_REVIEW.md:L1025-L1031, METHODOLOGY_REVIEW.md:L1048-L1053, METHODOLOGY_REVIEW.md:L1070-L1075 Concrete fix: either tighten the In Progress definition so REGISTRY/tests alone do not qualify, or relabel those rows and convert them to the Documentation in place / Outstanding for promotion format.

Tech Debt

  • No findings. The new tracker references to deferred StaggeredTripleDifference and SyntheticDiD work do correspond to tracked TODO rows. TODO.md:L74-L76, TODO.md:L118-L120

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the PowerAnalysis tracker entry understates readiness and misprioritizes the review queue by claiming the REGISTRY Primary source: line is blank and that source confirmation is still needed, even though the registry already lists Bloom (1995) and Burlig et al. (2020). METHODOLOGY_REVIEW.md:L1043-L1050, METHODOLOGY_REVIEW.md:L1182-L1182, docs/methodology/REGISTRY.md:L2778-L2782 Concrete fix: remove the “primary source confirmation” language and replace it with the actual remaining gaps: no paper review, no methodology test file, and no documented reference-validation harness.

P3.1: status taxonomy inconsistency. The legend defined In Progress as
"REGISTRY entry + tests + ..." but BaconDecomposition, PreTrendsPower,
PowerAnalysis, and PlaceboTests were marked Not Started despite having
REGISTRY entries and tests on file. The reviewer's two options:
tighten the In Progress definition, or relabel the four rows. Chose
the relabel path:

- Moved BaconDecomposition / PreTrendsPower / PowerAnalysis /
  PlaceboTests from Not Started -> In Progress in the Status Summary.
- Converted their per-entry bodies from the old "Current state" /
  "Next step for promotion" two-section format to the standard
  In Progress format ("Documentation in place" / "Outstanding for
  promotion") used by the other 11 In Progress entries.
- Relaxed the legend's In Progress definition to acknowledge that the
  band is wide: some In Progress entries have only REGISTRY + tests
  (e.g., BaconDecomposition); others have paper review + methodology
  test file + R parity (e.g., DCDH). The Outstanding for promotion
  sub-section is what distinguishes them.
- Restructured the Priority Order section to a single ladder split
  into "Substantive-review-blocked" (needs paper review or methodology
  test file or R parity first) and "Consolidation-pass-blocked"
  (mostly Verified Components walk-through). BaconDecomposition stays
  #1 substantive target.

P3.2: PowerAnalysis tracker overclaimed "needs primary source
confirmation". REGISTRY's PowerAnalysis section already lists Bloom
(1995) and Burlig, Preonas & Woerman (2020) as primary sources.
Removed the source-confirmation claim from the entry and the
priority-order rationale; the actual remaining gap is no paper
review, no methodology test file, no documented reference-validation
harness against pwr/DeclareDesign.

No status flips on existing Complete entries. No estimator,
weighting, variance, or inference code changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 04b5cfa5c6dcde451a09ec072360c1d891fd59ab


Overall Assessment

✅ Looks good

No P0/P1 issues. This remains a docs-only re-review; the only unmitigated item is a P3 tracker-governance inconsistency.

Executive Summary

  • The diff is confined to METHODOLOGY_REVIEW.md; no estimator implementation, weighting, variance, inference, or identification code changed.
  • The previous PowerAnalysis tracker-staleness issue is fixed: the section now correctly says Bloom (1995) and Burlig et al. (2020) are already listed in docs/methodology/REGISTRY.md. METHODOLOGY_REVIEW.md:L1036-L1054, docs/methodology/REGISTRY.md:L2778-L2782
  • The earlier row-level status problem is partly fixed: the stale Not Started rows are gone, but the status policy itself is still internally inconsistent across the overview, legend, and process rules. METHODOLOGY_REVIEW.md:L27-L29, METHODOLOGY_REVIEW.md:L94-L97, METHODOLOGY_REVIEW.md:L1152-L1157
  • The Conley and survey cross-cutting sections now point to real committed review/test assets, and the deferred StaggeredTripleDifference / SyntheticDiD items still map to tracked TODO.md entries. METHODOLOGY_REVIEW.md:L1092-L1132, TODO.md:L74-L76, TODO.md:L118-L120

Methodology

  • No findings. This PR does not change any estimator, weighting, SE, or assumption-checking logic.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity P3. Impact: the tracker still defines statuses three different ways. The overview says In Progress means REGISTRY.md + unit-test coverage and that new estimators should enter In Progress once their registry entry lands; the legend says In Progress can mean any REGISTRY entry, tests, or paper review; and the process guide still instructs authors to add a Not Started stub for new estimators. That leaves future PRs without a single rule for classifying new surfaces and makes the tracker easy to drift back into inconsistent states. Concrete fix: pick one status contract and make the overview, legend, and “When adding a new estimator” rule match verbatim; if the intended policy is “new estimators enter In Progress once the registry section exists,” change METHODOLOGY_REVIEW.md:L1157-L1157 from Not Started to In Progress and resolve the and/or mismatch between METHODOLOGY_REVIEW.md:L27-L29 and METHODOLOGY_REVIEW.md:L94-L96.

Tech Debt

  • No findings. The deferred StaggeredTripleDifference and SyntheticDiD items cited by the tracker are already tracked in TODO.md, so they remain mitigated P3 debt rather than blockers. TODO.md:L74-L76, TODO.md:L118-L120

Security

  • No findings.

Documentation/Tests

  • No findings. The previous stale PowerAnalysis note is resolved, and the Conley/survey sections reference assets that exist in-repo. METHODOLOGY_REVIEW.md:L1036-L1054, docs/methodology/REGISTRY.md:L2778-L2782, METHODOLOGY_REVIEW.md:L1092-L1131

The status taxonomy was internally inconsistent across three locations:
- L27-29 overview: In Progress = REGISTRY + unit tests (wide band)
- L94-97 status legend: In Progress = REGISTRY entry, tests, OR paper review (looser definition)
- L1157 process rule: said to add new estimators as "Not Started" stub

Aligned all three to the single contract:
- Not Started = no REGISTRY entry (reserved for future use; no rows currently)
- In Progress = REGISTRY entry + unit tests; promotion path to Complete (wide band)
- Complete = documented walk-through in this tracker

Rewrote the status legend after the Status Summary table to match the
overview's "What Complete means" contract verbatim, with an explicit
cross-reference link.

Updated the "When adding a new estimator" process rule to direct
authors to add new surfaces as In Progress (with stub Documentation in
place / Outstanding for promotion sub-sections), not as Not Started.

No status flips on existing entries. No estimator, weighting,
variance, or inference code changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 58c09f2922a2473701a28451a0c0778e2345d924


Overall Assessment

✅ Looks good

Executive Summary

  • This remains a docs-only re-review. The diff is confined to METHODOLOGY_REVIEW.md, so there are no estimator, weighting, variance/SE, inference, or identification changes that could create new P0/P1 methodology defects.
  • The prior tracker-governance inconsistency is fixed: the overview, status legend, and process rules now all agree that new surfaces enter In Progress once the REGISTRY.md entry exists, and Complete requires a documented walk-through in this tracker. METHODOLOGY_REVIEW.md:L17-L29, METHODOLOGY_REVIEW.md:L94-L97, METHODOLOGY_REVIEW.md:L1152-L1157
  • The expanded catalog mostly matches repo state: the new In Progress entries have corresponding REGISTRY.md sections, and the cited tests/docs/assets are present.
  • One remaining P3 tracker-accuracy issue: the priority list says StaggeredTripleDifference “shares paper review with TripleDifference,” but there is no Ortiz-Villavicencio/Sant’Anna review file under docs/methodology/papers/, so the tracker overstates existing review coverage for that surface. METHODOLOGY_REVIEW.md:L794-L814, METHODOLOGY_REVIEW.md:L1142-L1148, METHODOLOGY_REVIEW.md:L1195-L1195, docs/methodology/REGISTRY.md:L1698-L1798
  • One minor P3 wording issue: the Conley entry says there are “three adjacent paper reviews” while naming four files. METHODOLOGY_REVIEW.md:L1092-L1095

Methodology

No findings. The PR does not change estimator code, formulas, weighting, SE computation, inference plumbing, or assumption checks.

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity P3. Impact: StaggeredTripleDifference is prioritized as if an existing paper review already covers it (“shares paper review with TripleDifference”), but the tracker’s new process contract says missing paper reviews should be called out explicitly, and this one is not actually present. That makes the promotion ladder understate remaining methodology-review work for this estimator. Concrete fix: either add the missing Ortiz-Villavicencio/Sant’Anna paper review under docs/methodology/papers/ and reference it here, or change the text to “shares primary paper with TripleDifference” and add the missing paper-review task to the Outstanding for promotion / priority text. METHODOLOGY_REVIEW.md:L794-L814, METHODOLOGY_REVIEW.md:L1142-L1148, METHODOLOGY_REVIEW.md:L1195-L1195, docs/methodology/REGISTRY.md:L1698-L1798

Tech Debt

No findings. The deferred StaggeredTripleDifference items cited in the new tracker entry are genuinely tracked in TODO.md, so they remain mitigated P3 debt rather than blockers. TODO.md:L74-L76

Security

No findings.

Documentation/Tests

Severity P3. Impact: the Conley section says there are “three adjacent paper reviews” but then enumerates four files (butts-2021, butts-2023, clarke-2017, colella-et-al-2019). Harmless, but it adds noise to a tracker whose job is precise inventorying. Concrete fix: change “three” to “four” or remove the count entirely. METHODOLOGY_REVIEW.md:L1092-L1095

P3.1: StaggeredTripleDifference "shares paper review with TripleDifference"
was inaccurate. The two surfaces share the primary paper
(Ortiz-Villavicencio & Sant'Anna 2025), but no dedicated review file
exists under docs/methodology/papers/. Reworded the priority-order line
to "shares the primary paper" and added a new "Outstanding for
promotion" bullet for the missing review file.

P3.2: Conley entry said "three adjacent paper reviews" then enumerated
four (butts-2021, butts-2023, clarke-2017, colella-et-al-2019). Fixed
the count word.

No status flips. No estimator, weighting, variance, or inference code
changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ad41fc181a5f4ca67fb432e346c0a0773631cb0f


Overall Assessment

✅ Looks good

Executive Summary

  • This remains a docs-only re-review: the diff is confined to METHODOLOGY_REVIEW.md, so there are no estimator, weighting, variance/SE, inference, identification, or default-behavior changes that could create new P0/P1 methodology defects.
  • The two prior P3 tracker issues are fixed. The StaggeredTripleDifference priority text now explicitly says no dedicated paper-review file exists, and the Conley entry now correctly says there are four adjacent paper reviews.
  • The expanded catalog broadly matches repo state: the new In Progress entries have corresponding REGISTRY.md sections, and the concrete methodology-test counts I spot-checked match the files on disk.
  • One remaining P3 tracker-accuracy issue: the StaggeredTripleDifference entry points readers to triplediff::staggered_ddd(), but the repo’s actual R parity harness uses triplediff::ddd(panel=TRUE) plus agg_ddd() for aggregation. METHODOLOGY_REVIEW.md:L69-L69, METHODOLOGY_REVIEW.md:L800-L812, benchmarks/R/benchmark_staggered_triplediff.R:L56-L105, tests/test_methodology_staggered_triple_diff.py:L1-L90
  • Verification note: pytest is not available in this environment, so test-count checks were done by static inspection of the test files rather than runtime collection.

Methodology

No findings. The PR does not change estimator code, formulas, weighting, variance/SE computation, inference plumbing, or assumption checks.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The deferred StaggeredTripleDifference parity items referenced in the tracker are present in TODO.md, so they remain properly tracked non-blocking debt rather than review blockers. TODO.md:L74-L76

Security

No findings.

Documentation/Tests

  • Severity: P3
    Impact: The refreshed tracker names triplediff::staggered_ddd() as the R reference for StaggeredTripleDifference, but the checked-in cross-validation assets use the existing triplediff::ddd(panel=TRUE) entrypoint and agg_ddd() for aggregation. That can misdirect a future methodology reviewer to a nonexistent or wrong R surface when reproducing parity work. METHODOLOGY_REVIEW.md:L69-L69, METHODOLOGY_REVIEW.md:L800-L812, benchmarks/R/benchmark_staggered_triplediff.R:L56-L105, tests/test_methodology_staggered_triple_diff.py:L1-L90, docs/methodology/REGISTRY.md:L1965-L1985
    Concrete fix: Change the StaggeredTripleDifference R reference in the summary row and detail block to triplediff::ddd(panel=TRUE) (or just triplediff if you want to stay package-level), and update the parity bullet to reference ddd(..., panel=TRUE) plus agg_ddd() rather than staggered_ddd().

StaggeredTripleDifference R reference was wrong. The tracker cited
triplediff::staggered_ddd() (planned/non-existent), but the actual R
parity assets in benchmarks/R/benchmark_staggered_triplediff.R use
the existing triplediff::ddd(panel=TRUE) entrypoint plus agg_ddd()
for aggregation. Updated three locations:

- Status Summary row: changed R Reference to "triplediff::ddd(panel=TRUE) + agg_ddd()"
- Detailed section field table: same with attribution to the benchmark script
- "Outstanding for promotion" R-parity bullet: same with attribution

No status flips. No estimator, weighting, variance, or inference code
changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 49ddfa467be744465bda99d349488d2ed6cf719c


Overall Assessment

✅ Looks good

Executive Summary

  • This remains a docs-only re-review. The diff does not change estimator implementations, weighting, variance/SE formulas, inference plumbing, identification checks, or defaults, so I found no new P0/P1 methodology defects.
  • The prior StaggeredTripleDifference tracker issue is resolved: the summary row and detailed block now point to triplediff::ddd(panel=TRUE) plus agg_ddd(), which matches the checked-in benchmark harness. METHODOLOGY_REVIEW.md:69 METHODOLOGY_REVIEW.md:800 benchmark_staggered_triplediff.R:58
  • One remaining P3 tracker-accuracy issue: the new overview says examples like DCDH and HAD already have a paper review, a dedicated methodology test file, and R parity fixtures, but the HAD section itself still lists tests/test_methodology_had.py as outstanding. METHODOLOGY_REVIEW.md:27 METHODOLOGY_REVIEW.md:702
  • The tracker’s deferred StaggeredTripleDifference items are correctly tied to TODO.md, so those remain non-blocking tracked debt rather than review blockers. METHODOLOGY_REVIEW.md:812 TODO.md:74
  • Verification note: pytest is not installed in this environment, so count checks were done by static inspection of files/tests rather than runtime collection.

Methodology

  • No findings.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity: P3
    Impact: The new tracker overview overstates HAD’s review maturity. It says entries such as DCDH and HAD already have a paper review, a dedicated methodology test file, and R parity fixtures, but the HAD detail block later says the dedicated methodology file is still missing. That can mislead future reviewers about what remains before HAD can be promoted to Complete. METHODOLOGY_REVIEW.md:27 METHODOLOGY_REVIEW.md:693 METHODOLOGY_REVIEW.md:702
    Concrete fix: Remove HAD from that example, or rephrase the sentence so it says those entries have some combination of paper review, methodology tests, and R-parity artifacts rather than implying each named example has all three.

The In Progress band example overstated HAD's review maturity. The
text said "DCDH and HAD" both have "a paper review, a dedicated
methodology test file, and R parity fixtures", but HAD's detail block
lists a dedicated tests/test_methodology_had.py as still missing under
Outstanding for promotion.

Rephrased the example to make the variation explicit:
- DCDH: all three (paper review + methodology file + R parity)
- HAD: paper review + R parity, no dedicated methodology file
- ContinuousDiD: methodology file, no paper review

This communicates the same point (some In Progress entries have rich
scaffolding) without implying each named example has all three.

No status flips. No estimator, weighting, variance, or inference code
changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 0544f3085a3bd3a28bfe493be6eafdf61c09c0f1


Overall Assessment

✅ Looks good

Executive Summary

  • This remains a docs-only re-review. The diff does not change estimator implementations, weighting, variance/SE formulas, inference plumbing, identification checks, or defaults, so I found no new P0/P1 methodology defects.
  • The prior HAD tracker-accuracy issue is resolved: the overview now correctly says HAD has paper review + R parity but no dedicated methodology file, matching the detailed HAD block.
  • One new P3 remains: the refresh overstates DCDH’s paper-review coverage. The tracker now reads as if DCDH already has a paper review on file, but the cited review file is the 2026 HAD / “no unit remains untreated” paper and still uses a template ## {EstimatorName} heading.
  • The refreshed counts and existence claims I spot-checked match the repo state, including the updated methodology-test counts and the new Conley/survey inventory.
  • The StaggeredTripleDifference reference and deferred-fixture wording now align with the benchmark harness and TODO.md.

Methodology

  • Severity: P3. Impact: DCDH’s review maturity is overstated. The overview and DCDH section now imply that DCDH already has a paper review on file, but the only cited review is the 2026 universal-rollout/HAD paper, not the 2020 AER or 2022/2024 intertemporal dCDH sources that define the core DID_M / DID_+ / DID_- and dynamic estimators. That can mislead future reviewers into treating DCDH’s primary-source audit as more complete than it is. Concrete fix: rephrase DCDH to say that only the 2026 universal-rollout extension has a paper review on file, and keep the 2020 and 2022/2024 dCDH papers explicitly outstanding for promotion. METHODOLOGY_REVIEW.md:L27 METHODOLOGY_REVIEW.md:L667-L669 METHODOLOGY_REVIEW.md:L1191-L1192 docs/methodology/papers/dechaisemartin-2026-review.md:L1-L18

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The refreshed StaggeredTripleDifference deferred items still map cleanly to TODO.md and remain correctly treated as tracked, non-blocking debt.

Security

  • No findings.

Documentation/Tests

  • No additional findings. I verified the referenced files and sampled test-count claims by static inspection of the repo; no runtime test execution was needed for this docs-only diff.

DCDH paper-review claim was overstated. The cited review file
docs/methodology/papers/dechaisemartin-2026-review.md covers de
Chaisemartin / Ciccia / D'Haultfoeuille / Knau (2026)
"Difference-in-Differences Estimators When No Unit Remains Untreated"
-- the 2026 universal-rollout extension that is HAD's primary source,
NOT a primary-source review for DCDH itself. DCDH's core DID_M /
DID_+ / DID_- and dynamic estimators are defined in:
- de Chaisemartin & D'Haultfoeuille (2020) AER 110(9), 2964-2996
- de Chaisemartin & D'Haultfoeuille (2022, revised 2024) NBER WP 29873

Neither of these has a dedicated review on disk. Same pattern as the
WooldridgeDiD R2 finding: a companion-paper review existed for an
adjacent paper but no primary-source review for the estimator itself.

Updates (four locations):

1. In Progress band example in the legend: changed "DCDH has all
   three (paper review + methodology file + R parity)" to specify
   "companion-paper review for the 2026 universal-rollout extension"
   so the example honestly reflects what's on disk.
2. DCDH "Documentation in place" bullet: relabeled as
   "Companion-paper review on file" with an explicit note that the
   2026 paper is HAD's primary source, not DCDH's, and that DCDH's
   2020 AER + 2022/2024 NBER WP 29873 primary sources have no
   reviews on file.
3. DCDH "Outstanding for promotion" gained an explicit row for the
   two missing primary-source reviews.
4. Priority Order line 8 (DCDH): qualified the "paper review"
   shorthand to specify it's a companion-paper review and that
   primary-source reviews are still outstanding.

No status flips. No estimator, weighting, variance, or inference code
changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 20f377897eadbfa124d7dfea7c1467c1260bfe90


Overall Assessment

✅ Looks good

Executive Summary

  • This re-review remains docs-only. The diff refreshes the methodology tracker/catalog and does not change estimator implementations, weighting, variance/SE code, identification checks, or defaults.
  • The prior DCDH tracker-accuracy issue is resolved. The DCDH section now correctly treats the 2026 universal-rollout review as a companion paper review and keeps the 2020/2022/2024 DCDH primary-source reviews explicitly outstanding in METHODOLOGY_REVIEW.md and docs/methodology/REGISTRY.md.
  • The new “In Progress” catalog entries are anchored in the registry. I verified corresponding REGISTRY.md sections exist for the added surfaces, including DCDH, HAD, TROP, PlaceboTests, ConleySpatialHAC, and Survey Data Support.
  • The refreshed methodology-test counts I spot-checked match the repo by static inspection: DiD 51, Callaway 61, TripleDifference 45, HonestDiD 27, SyntheticDiD 157, ContinuousDiD 15, and StaggeredTripleDifference 6.
  • Deferred items called out as tracked remain tracked in TODO.md, including the StaggeredTripleDifference R-fixture gap, the SyntheticDiD bootstrap parity anchor, and Conley+survey follow-up work.

Methodology

  • No findings.

Code Quality

  • No findings. The reorganized status tables and detailed sections are internally consistent with the tracker contract introduced at the top of METHODOLOGY_REVIEW.md.

Performance

  • No findings.

Maintainability

  • No findings. The category split makes the tracker easier to reconcile against REGISTRY.md and the current public estimator/tool surface.

Tech Debt

  • No findings. Where the refresh cites deferred work, those items are already tracked in TODO.md and are correctly presented as non-blocking follow-ups.

Security

  • No findings.

Documentation/Tests

  • No findings. The referenced paper-review files, theory notes, tutorials, benchmark scripts, and test files all exist.
  • I did not execute tests in this environment because pytest is not installed; validation here was static repo inspection, which is sufficient for this docs-only diff.
