OC concept reintegration: overlay Eric's material/object-type mappings onto the wide (#272, fixes #260) #275

Merged
rdhyee merged 6 commits into isamplesorg:main from rdhyee:feat/oc-concept-reintegration-272
Jun 11, 2026

@rdhyee rdhyee commented Jun 10, 2026

Contributor

Implements #272 per the decision thread there: OC wins unconditionally for OC pids. Stacked on #274 (the rigorous pipeline) — merge that first and this diff reduces to the enrichment commits.

What's in here

| Piece | File |
| --- | --- |
| Enrichment script (deterministic, manifest'd, hard-fails on grain violations) | scripts/enrich_wide_with_oc_concepts.py |
| Independent trust gate (re-derives expectations from inputs; keyed row-hash comparison scales to 20.7M rows) | scripts/validate_oc_concept_enrichment.py |
| 22 fixture tests incl. 9 adversarial regression tests, each verified to fool an earlier validator before the fix | tests/test_oc_concept_enrichment.py |
| make all-272 chain (wide + oc-wide → enrich → gate → derived → gate), -j-safe | Makefile |
| #260 sentinel parameterized by data vintage | scripts/validate_frontend_derived.py |
| Explorer reads isamples_202606_* (incl. pinned versioned wide for popups) | explorer.qmd |
| Stage 3 split into 3a thumbnails / 3b OC concepts | DATA_PROVENANCE.md |

Real-data run (make all-272, both gates PASS)

Review provenance

Dual AI sign-off: Claude (author) + Codex (4 adversarial review rounds). The review produced 11 findings (2 BLOCKER, 6 MAJOR, 2 MINOR, 1 NIT); every one was verified by execution and closed with a regression test. Codex verdict, round 4: LGTM.

Data is staged on R2 (isamples_202606_* via data.isamples.org); no production cutover: current/ aliases still point at 202604. Staging inspection: rdhyee#8.

Fixes #260. Implements the overlay phase of #272. Part of #273.

🤖 Generated with Claude Code

rdhyee and others added 6 commits June 10, 2026 18:11
…esorg#272, fixes isamplesorg#260)

Overlays Eric Kansa's OC PQG concept mappings onto the unified wide:
p__has_material_category / p__has_sample_object_type are REPLACED for OC
pids — OC wins unconditionally (RY decision 2026-06-10, isamplesorg#272). Mints
IdentifiedConcept rows for URIs the frozen export never had (e.g.
otheranthropogenicmaterial — the correct isamplesorg#260 value, absent entirely).

- scripts/enrich_wide_with_oc_concepts.py: deterministic single-pass DuckDB
  overlay; ordered-list preservation; hard-fails on dup pids/row_ids and
  unresolved OC concept refs; emits .manifest.json (input shas, counts).
- scripts/validate_oc_concept_enrichment.py: INDEPENDENT trust gate — re-derives
  expected URI lists from (src, oc) with its own SQL; non-overlay rows must be
  byte-identical; minted set must be exactly the missing URIs; isamplesorg#260 sentinel.
- tests/test_oc_concept_enrichment.py: 13 fixture tests incl. unconditional-win,
  order preservation, determinism (bit-identical), validator tamper-detection,
  hard-failure modes.
- validate_frontend_derived.py: isamplesorg#260 sentinel parameterized by data vintage
  (--sentinel-material); default now the post-isamplesorg#272 corrected value.
- Makefile: oc-wide / enrich / validate-enrich / all-272 chain; CI runs both
  fixture suites.
- DATA_PROVENANCE.md: Stage 3 split into 3a (thumbnails) + 3b (OC concepts).

Scope (documented): overlay only — ~75K new OC records not ingested;
p__has_context_category untouched. Both follow-ups tracked in isamplesorg#272.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
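
The overlay rule in the commit above (OC wins unconditionally, list order preserved, duplicate pids rejected, [] normalized to NULL) can be sketched in plain Python. This is only an illustration under simplifying assumptions: the real script is a single-pass DuckDB job, and the function name `overlay`, the dict-based rows, and the use of URI strings inside the two columns are all hypothetical.

```python
from collections import Counter

# The two columns the commit message says are REPLACED for OC pids.
REPLACED = ("p__has_material_category", "p__has_sample_object_type")


def overlay(wide_rows, oc_rows):
    """Return new wide rows with OC values replacing the two concept columns.

    Hypothetical sketch: rows are plain dicts keyed by column name.
    """
    # Hard-fail on grain violations: one OC row per pid, never more.
    dups = sorted(pid for pid, n in Counter(r["pid"] for r in oc_rows).items() if n > 1)
    if dups:
        raise ValueError(f"duplicate OC pids: {dups}")

    oc_by_pid = {r["pid"]: r for r in oc_rows}
    out = []
    for row in wide_rows:
        merged = dict(row)
        oc = oc_by_pid.get(row["pid"])
        if oc is not None:
            for col in REPLACED:
                value = oc.get(col)
                # OC wins unconditionally; order is kept; [] and None both become None.
                merged[col] = list(value) if value else None
        out.append(merged)  # rows without an OC match pass through unchanged
    return out


wide = [
    {"pid": "a", "label": "A", "p__has_material_category": ["x"], "p__has_sample_object_type": ["y"]},
    {"pid": "b", "label": "B", "p__has_material_category": ["z"], "p__has_sample_object_type": None},
]
oc = [{"pid": "a", "p__has_material_category": ["m2", "m1"], "p__has_sample_object_type": []}]
result = overlay(wide, oc)
```

Here the row for pid "a" takes the OC material list in OC's order and its empty object-type list becomes None, while the row for pid "b" is untouched.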
…acy sentinel

- validator: ALL src rows now compared on ALL non-replaced columns (keyed
  row_id hash join — scales to 20.7M, no full-table EXCEPT) [BLOCKER 1 + perf 4]
- validator: overlay pid SET equality + distinctness; sentinel absence is a
  FAILURE when present in inputs (was a silent N/A) [BLOCKER 2]
- validator: minted rows must carry OC label/scheme metadata
- Makefile: legacy chain passes --sentinel-material (pre-isamplesorg#272 value); all-272
  clears it to use the enriched default [MAJOR 3]
- enrich: document []->NULL normalization (pqg #8 convention) [MINOR 5]
- tests: +3 — both Codex attacks reproduced (verified to fool the OLD
  validator, caught by the fixed one) + empty-array normalization pin

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
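
The keyed row-hash comparison described in this commit can be illustrated as follows. It is a minimal sketch, assuming rows are dicts with a unique `row_id`; the helper names `row_hash` and `changed_row_ids` are hypothetical, and the actual validator does this as a hash join in SQL rather than in Python.

```python
import hashlib
import json

# Columns the overlay is allowed to change; everything else must match.
REPLACED = ("p__has_material_category", "p__has_sample_object_type")


def row_hash(row, skip=REPLACED):
    """Hash every column except the replaced ones, in a stable key order."""
    kept = {k: v for k, v in row.items() if k not in skip}
    payload = json.dumps(kept, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_row_ids(src_rows, out_rows):
    """Compare ALL src rows to the output by row_id; return (missing, altered)."""
    src = {r["row_id"]: row_hash(r) for r in src_rows}
    out = {r["row_id"]: row_hash(r) for r in out_rows}
    missing = sorted(set(src) - set(out))
    altered = sorted(rid for rid in src.keys() & out.keys() if src[rid] != out[rid])
    return missing, altered


src_rows = [
    {"row_id": 1, "pid": "a", "label": "A", "p__has_material_category": ["x"]},
    {"row_id": 2, "pid": "b", "label": "B", "p__has_material_category": ["y"]},
]
good = [
    {"row_id": 1, "pid": "a", "label": "A", "p__has_material_category": ["oc"]},
    {"row_id": 2, "pid": "b", "label": "B", "p__has_material_category": ["y"]},
]
tampered = [
    {"row_id": 1, "pid": "a", "label": "A (edited)", "p__has_material_category": ["oc"]},
]
```

Joining on the key and comparing one digest per row avoids a full-table set difference: a change in a replaced column is ignored, while a change in any other column or a dropped row is reported.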
…n, make -j safety

- validator: minted rows now compared FULL-ROW against a re-derived
  expectation (deterministic ids max(src)+rank(uri); NULLs everywhere except
  pid/otype/label/scheme) — shifted-id and smuggled-column outputs now fail
- enricher + validator: hard-reject duplicate OC IdentifiedConcept row_ids
  (one reference must never fan out into several URIs)
- Makefile: $(ENRICHED) is a real file target; validate-enrich depends on it;
  all/all-272 use ordered sub-makes — safe under make -j
- tests: +3 regression (both round-2 attacks verified to fool the previous
  gate; dup-concept-row_id input rejected by both scripts)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
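
The re-derived expectation for minted rows (ids of max(src)+rank(uri), NULLs everywhere except pid/otype/label/scheme) might look like the sketch below. The function name `expected_minted`, the dict layout, the example column list, and the choice to start the rank at 1 are assumptions made for illustration, not details confirmed by the commit.

```python
def expected_minted(src_rows, oc_concepts, columns):
    """Build the full expected row for every OC concept URI missing from src.

    src_rows: dicts with at least row_id, pid, otype.
    oc_concepts: dict mapping URI -> (label, scheme).
    columns: every column of the wide table (hypothetical list).
    """
    max_id = max(r["row_id"] for r in src_rows)
    known = {
        r["pid"]
        for r in src_rows
        if r["otype"] == "IdentifiedConcept" and r["pid"] is not None
    }
    # Sorting the URIs makes the id assignment deterministic across runs.
    missing = sorted(uri for uri in oc_concepts if uri not in known)

    minted = []
    for rank, uri in enumerate(missing, start=1):  # rank starting at 1 is an assumption
        row = dict.fromkeys(columns)  # every column starts as None (NULL)
        label, scheme = oc_concepts[uri]
        row.update(row_id=max_id + rank, pid=uri, otype="IdentifiedConcept",
                   label=label, scheme=scheme)
        minted.append(row)
    return minted


src_rows = [
    {"row_id": 1, "pid": "s1", "otype": "MaterialSampleRecord"},
    {"row_id": 5, "pid": "uri:b", "otype": "IdentifiedConcept"},
]
oc_concepts = {"uri:c": ("C", "mat"), "uri:b": ("B", "mat"), "uri:a": ("A", "mat")}
columns = ["row_id", "pid", "otype", "label", "scheme", "geometry"]
minted = expected_minted(src_rows, oc_concepts, columns)
```

Comparing the output's minted rows against this list with plain equality is what makes a shifted id or a value smuggled into an otherwise-NULL column fail the gate.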
- validator: reject unresolved OC concept refs (inner joins were silently
  dropping them from the expectation — standalone runs could pass a wrong
  output) and duplicate OC MSR pids (grain parity with the enricher)
- validator: minted expectation uses NOT EXISTS (NULL-pid src concept made
  NOT IN evaluate UNKNOWN -> false failure)
- tests: +3 regression for all three findings

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
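
The NOT IN versus NOT EXISTS finding in this commit follows from standard SQL three-valued logic, and it can be reproduced in a few lines. The sketch below uses Python's built-in sqlite3 instead of DuckDB purely so that it is self-contained; the table names and sample URIs are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE src_concept (pid TEXT);
    CREATE TABLE oc_concept (pid TEXT);
    -- One src concept has a NULL pid, as in the finding.
    INSERT INTO src_concept VALUES ('mat:rock'), (NULL);
    INSERT INTO oc_concept VALUES ('mat:rock'), ('mat:otheranthropogenicmaterial');
    """
)

# x NOT IN (..., NULL) is never TRUE: it is FALSE on a match and UNKNOWN otherwise,
# so the WHERE clause filters out every row.
not_in = con.execute(
    "SELECT pid FROM oc_concept "
    "WHERE pid NOT IN (SELECT pid FROM src_concept) ORDER BY pid"
).fetchall()

# NOT EXISTS only asks whether a matching row exists, so the NULL pid is harmless.
not_exists = con.execute(
    "SELECT o.pid FROM oc_concept o "
    "WHERE NOT EXISTS (SELECT 1 FROM src_concept s WHERE s.pid = o.pid) "
    "ORDER BY o.pid"
).fetchall()

print(not_in)      # []
print(not_exists)  # [('mat:otheranthropogenicmaterial',)]
```

With the NOT IN form the expected minted set comes out empty, so a correct output that does mint the missing URI would be flagged as a failure; the NOT EXISTS form returns the one URI that is actually absent from src.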
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…#272)

All tagged data URLs 202601->202606; wide_url pinned to the explicit
versioned file (popups read corrected OC material/object-type from it).
current/wide.parquet alias stays on the previous wide until the production
cutover decision.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@rdhyee rdhyee force-pushed the feat/oc-concept-reintegration-272 branch from f21e1d1 to 8125afc June 11, 2026 01:12
@rdhyee rdhyee merged commit c7eb727 into isamplesorg:main Jun 11, 2026
1 of 2 checks passed
Development

Successfully merging this pull request may close these issues.

interactive explorer bug: material category incorrect on sample
