Skip to content

feat(sinks): partition iceberg exports by day with conversation sort#91

Merged
philcunliffe merged 3 commits into
masterfrom
worktree-iceberg-sink
Jun 11, 2026
Merged

feat(sinks): partition iceberg exports by day with conversation sort#91
philcunliffe merged 3 commits into
masterfrom
worktree-iceberg-sink

Conversation

@philcunliffe

@philcunliffe philcunliffe commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

Lay out @hypaware/format-iceberg exports for an archive's job, not the cache's:

  • Partition by day(primaryTimestampColumn) — a writer-owned default, derived independently of the cache's cachePartitioning. The cache partitions by conversation_id:identity, which sets an unbounded ~1-file-per-conversation floor that compaction can't beat. Day grain bounds file count by time (~dozens of multi-MB files/day) and prunes on time-range predicates.
  • Sort each day partition by the dataset's lookup columns (conversation_id-led, from its declared identity columns). A conversation lookup then prunes data files / row groups by conversation_id min/max — preserving lookup speed without the file-count cost of partitioning on it.

Rationale, the measurement that drove it, and the abandoned "inherit cachePartitioning" decision it replaces are in LLP 0022.

✅ Resolved — icebird pin (was merge blocker)

The clustering (#22) and read-pruning (#20/#21) benefits required a published icebird containing commit 3edb15b ("Scan pruning and sort-on-write").

  • Publish icebird 3edb15b — released as 0.8.9
  • Bump the package.json pin 0.8.50.8.9 (shared-engine bump — the cache rides the same icebird and gains read-pruning for free)

Re-verified on 0.8.9: npm test (809 pass / 0 fail) and npm run smoke -- iceberg_export_partitioned_local_fs both pass. The sort order recorded in metadata is now active on write.

What's in here

Docs

  • llp/0022 rewritten to the day-grain + sort decision; xrefs added to llp/0014 (sinks) and llp/0003 (the helper promotion is core surface).

Core

  • Promote partitionSpecForDeclaration + validatePartitionSpecStability + the declaration type out of src/core/cache/iceberg to a shared src/core/iceberg, re-exported from src/core/index.js. Move + re-export; cache rewired; behavior identical.

format-iceberg

  • partitioning.js derives the day grain (from primaryTimestampColumn) + sort order (from the dataset's identity columns) per dataset at commit time.
  • commit.js creates the table with partitionSpec + sortOrder and rejects partition-spec drift on append (iceberg_partition_spec_drift).
  • table-format.js emits hyp_partition_spec + hyp_sort_order on the commit span.
  • maintenance.js compaction reframed: available via icebergRewrite, but not run in-daemon and not needed for a day grain (was "blocked by icebird").

Test plan

  • npm test809 pass / 0 fail (10 new: derivePartitioning derivation + drift, exercised through the real icebird write path via commitBatch over a real local-fs BlobStore — day partition, conversation sort, 2-files-for-2-days, drift rejected, no false drift).
  • npm run smoke -- iceberg_export_partitioned_local_fspasses: drives the production sink, asserts day(message_created_at) spec + conversation_id-led sort, 4 rows → 2 day-partition files, and hyp_partition_spec / hyp_sort_order on the span.
  • /ref-check — 74 @ref annotations, 0 broken.

Note: pre-existing spool bug (not introduced here)

The existing iceberg_export_local_fs smoke fails in this environment because storage.appendRowsflushTable({force:true})readRows yields 0 rows. Confirmed to reproduce on pristine master (changes stashed) and on icebird 0.8.5 — i.e. independent of this work and of the icebird version. The new partitioned smoke sidesteps it by populating the cache via the direct write path. Worth a separate investigation.

🤖 Generated with Claude Code

philcunliffe and others added 2 commits June 9, 2026 15:35
Lay out @hypaware/format-iceberg exports for an archive's job, not the
cache's: partition by day(primaryTimestampColumn) — a writer-owned
default, not the cache's conversation_id-identity cachePartitioning,
which sets an unbounded ~1-file-per-conversation floor compaction can't
beat — and sort each day partition by the dataset's lookup columns
(conversation_id-led) so a conversation lookup prunes row groups by
min/max instead of needing a partition per conversation.

- Promote partitionSpecForDeclaration + validatePartitionSpecStability
  (and the declaration type) from src/core/cache/iceberg to a shared
  src/core/iceberg home, re-exported from src/core/index.js: they are
  core surface consumed by the registry, cache, plugin types, and now
  the export (LLP 0003).
- format-iceberg derives the day grain + sort order per dataset at
  commit time, creates the table with both, and rejects partition-spec
  drift on append (iceberg_partition_spec_drift). Emits hyp_partition_spec
  and hyp_sort_order on commit spans.
- Reframe maintenance compaction: available via icebergRewrite but not
  run in-daemon and not needed for a day grain (was "blocked by icebird").

Spec: LLP 0022 (rewritten from the abandoned cache-parity decision);
xrefs in LLP 0014 and 0003. Tests: 10 (derivation + drift through the
real icebird write path) plus a passing iceberg_export_partitioned_local_fs
smoke asserting the layout and hyp_partition_spec.

Clustering (icebird #22) and read pruning (#20/#21) require a published
icebird containing commit 3edb15b; the package.json pin must move off
0.8.5 before those benefits land. The code degrades gracefully on 0.8.5
— partitioning and drift work; the sort order is recorded but inert.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Activates the conversation sort within day partitions and read-side
scan pruning that the partitioned export records in metadata. Clears
the merge blocker: 0.8.9 contains icebird 3edb15b.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@philcunliffe

Copy link
Copy Markdown
Contributor Author

Dual-agent review — request_changes

  • Verdict: request_changes
  • Risk class: medium
  • Auto-merge advisory: 👎 thumbs down — verdict is request_changes; needs human-gated follow-up

Advisory only: no merge was attempted.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Source Finding (severity, evidence) Intersects
Codex Reverse partition-drift skipped when partitioning is null on a partitioned table (major; commit.js:99, partitioning.js:29-34) Yes — Risks bullet 2; Direct callers (commitBatch guard at commit.js:99); missing test beside iceberg-partitioning.test.js:181
Codex Lockfile not updated for icebird 0.8.9 → npm ci breakage (major; package.json:41, package-lock.json) False positivepackage-lock.json is gitignored and untracked in this repo; the stale 0.8.5 lockfile is local environment state, not a PR defect. Residual real risk (validation ran against 0.8.5) is captured in Risks bullet 3, not as a lockfile defect
Claude Conversation sort asserted only via metadata, never via row order or sort_order_id; suite passes on installed 0.8.5 which writes unsorted (major; iceberg-partitioning.test.js, partitioned smoke :533-538) Yes — Risks bullet 3; Cross-package usage (shared icebird engine); confirmed locally: node_modules/icebird resolves 0.8.5 at PR head
Claude hyp sink maintain CLI text still says "compaction not supported by icebird", contradicting the PR's reframing (minor; core_commands.js:2336-2339, 2251-2253, 2326) Yes — Direct callers (compactionSupported consumer at core_commands.js:2326,2336-2339 confirmed unchanged); consistency gap only, not counted as a risk bullet
Claude Stray </content> artifact at end of LLP 0022 (minor; llp/0022-iceberg-export-partitioning.spec.md:270) No — doc-only; touches no code surface mapped above (confirmed present in the diff at the file's final line)
Codex review

Fix Validations

No bug-fix validations. The claimed icebird dependency update is reviewed below as a release-safety finding because the lockfile contract is incomplete.

Findings

2) Contract & Interface Fidelity

  • Severity: major
  • Confidence: high
  • Evidence: hypaware-core/plugins-workspace/format-iceberg/src/partitioning.js:29, hypaware-core/plugins-workspace/format-iceberg/src/partitioning.js:34, hypaware-core/plugins-workspace/format-iceberg/src/commit.js:99
  • Why it matters: If a dataset stops deriving partitioning for an already day-partitioned export table, the drift guard is skipped and the commit span reports unpartitioned even though the existing table still has a partition spec.
  • Suggested fix: On append, always inspect the existing partition spec; if input.partitioning is null and currentPartitionSpec(metadata).fields.length > 0, throw iceberg_partition_spec_drift or explicitly preserve/report the existing layout. Add the reverse-drift test beside the existing unpartitioned-to-partitioned drift test at test/plugins/iceberg-partitioning.test.js:181.

8) Release Safety

  • Severity: major
  • Confidence: high
  • Evidence: package.json:41, package-lock.json:15, package-lock.json:774
  • Why it matters: The manifest requests icebird 0.8.9, but lockfile installs still resolve 0.8.5, so npm ci will either fail or run without the sort-on-write/read-pruning dependency this PR requires.
  • Suggested fix: Update and commit package-lock.json for icebird@0.8.9, then rerun npm test and the partitioned smoke from a clean install path.

No Finding

  1. Behavioral Correctness
  2. Change Impact / Blast Radius
  3. Concurrency, Ordering & State Safety
  4. Error Handling & Resilience
  5. Security Surface
  6. Resource Lifecycle & Cleanup
  7. Test Evidence Quality
  8. Architectural Consistency
  9. Debuggability & Operability

Evidence Bundle

  • Changed hot paths: exportDataset layout derivation and span attributes; commitRowStream partitioning propagation; commitBatch create/drift behavior; promoted core partition-spec helper; icebird dependency pin.
  • Impacted callers: hypaware-core/plugins-workspace/format-iceberg/src/table-format.js:202, hypaware-core/plugins-workspace/format-iceberg/src/table-format.js:302, hypaware-core/plugins-workspace/format-iceberg/src/commit.js:191, src/core/cache/iceberg/store.js:105, src/core/cache/iceberg/store.js:126.
  • Impacted tests: test/plugins/iceberg-partitioning.test.js:65, test/plugins/iceberg-partitioning.test.js:146, test/plugins/iceberg-partitioning.test.js:181, test/plugins/iceberg-partitioning.test.js:211.
  • Unresolved uncertainty: I did not run tests; review was limited to the supplied diff plus targeted tracing.
Claude review

Claude review

Conversation sort is asserted only as metadata — no test verifies rows are actually sorted

  • Severity: major
  • Confidence: 90
  • Evidence: test/plugins/iceberg-partitioning.test.js:1421-1423; hypaware-core/smoke/flows/iceberg_export_partitioned_local_fs.js:533-538
  • Why it matters: Both the unit test and the smoke assert only that sort-orders/default-sort-order-id metadata was recorded — neither reads a data file back to assert conversation rows are actually ordered, nor checks the per-data-file sort_order_id; the gap is demonstrably real because the full suite passes against the currently installed icebird, which hardcodes sort_order_id: 0 and writes rows unsorted (no src/write/sort.js), so a missing or regressed sort-on-write in the dependency is invisible to tests and smokes.
  • Suggested fix: In the commitBatch integration test or the partitioned smoke, read one day-partition parquet file back with hyparquet and assert the conversation_id column is non-decreasing, and/or assert the manifest entry's sort_order_id === 1; re-run tests and the iceberg smokes against the published icebird 0.8.9 (a fresh npm install), per LLP 0022's own landing requirement.

hyp sink maintain CLI text contradicts the PR's own compaction reframing

  • Severity: minor
  • Confidence: 85
  • Evidence: src/core/cli/core_commands.js:2336-2339 (also 2251-2253, 2326)
  • Why it matters: The PR redefines compactionSupported: false to mean "not run by this sink (out-of-band only), not impossible" (maintenance.js docstring, smoke message, and LLP 0022#compaction all updated), but the user-facing CLI still prints "compaction not supported by icebird — data-file rewriting will be available in a future release" and labels the action compaction_unsupported, which is factually false under the icebird 0.8.9 pin this PR lands.
  • Suggested fix: Update the runSinkMaintain summary line, the compaction_unsupported action label, and the docstring at core_commands.js:2251-2253 to the "not run by this sink — out-of-band only (LLP 0022)" framing, matching the already-updated smoke assertion.

Stray </content> artifact committed at the end of LLP 0022

  • Severity: minor
  • Confidence: 90
  • Evidence: llp/0022-iceberg-export-partitioning.spec.md:270
  • Why it matters: A literal </content> paste artifact is committed as the final line of the new spec document, which the repo's LLP tooling (/ref-check, /ref-story) parses and CLAUDE.md's living-docs discipline exists to keep clean.
  • Suggested fix: Delete the trailing </content> line from llp/0022-iceberg-export-partitioning.spec.md.

Cross-checked and dropped: three reviewers flagged "package-lock.json not updated for the icebird 0.8.9 bump" (rated up to blocker) — verified false positive: package-lock.json is gitignored in this repo (.gitignore:9) and untracked, so the stale 0.8.5 lockfile is local environment state, not a PR defect; a fresh install resolves the exact 0.8.9 pin from package.json. The residual real risk (local validation ran against a non-0.8.9 engine) is folded into the sort-coverage finding above.


Reports: .git/dual-review/pr-91

… assertions, CLI text

- commitBatch now rejects reverse partition-spec drift: appending with no
  derived partitioning onto an already-partitioned table throws
  iceberg_partition_spec_drift instead of silently skipping the guard
  (and mislabeling spans as unpartitioned). LLP 0022#drift-rejection
  updated to record the guard as bidirectional; new test alongside the
  forward-drift case.
- The conversation sort is now asserted on disk, not just in metadata:
  both the commitBatch integration test and the partitioned smoke read a
  day-partition parquet file back with hyparquet and assert conversation_id
  row order — fails on an icebird that records the sort order but writes
  unsorted. Verified against icebird 0.8.9.
- hyp sink maintain CLI text reframed to match maintenance.js / LLP 0022:
  compaction is not run by this sink (out-of-band via icebergRewrite), not
  'unsupported by icebird'; action label is now compaction_out_of_band.
- Removed stray </content> artifact from the end of LLP 0022.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@philcunliffe

Copy link
Copy Markdown
Contributor Author

Review fixes pushed — 4290b09

Addressed all four actionable findings from the dual-agent review (the lockfile finding was already marked a false positive — package-lock.json is gitignored here):

Reverse partition-drift (Codex, major)commitBatch now always inspects the existing table when appending: if the dataset derives no partitioning but the table on disk is partitioned, it throws iceberg_partition_spec_drift instead of silently skipping the guard and reporting unpartitioned on the span. New test added beside the forward-drift case (commitBatch rejects reverse drift: null partitioning onto a partitioned table). LLP 0022#drift-rejection updated to record the guard as bidirectional.

Sort asserted only as metadata (Claude, major) — both the commitBatch integration test and the partitioned smoke now read a day-partition parquet file back with hyparquet and assert the conversation_id rows come back in sorted order (cA,cB from cB,cA input). This fails on an icebird that records the sort order but writes unsorted (e.g. 0.8.5). Verified locally against installed icebird 0.8.9: npm test 810 pass / 0 fail, npm run smoke -- iceberg_export_partitioned_local_fs passes.

hyp sink maintain CLI text (Claude, minor) — docstring, per-dataset action label (compaction_unsupportedcompaction_out_of_band), and summary line reframed to "compaction not run by this sink — out-of-band only, see LLP 0022", matching maintenance.js and the smoke.

Stray </content> in LLP 0022 (Claude, minor) — removed.

🤖 Generated with Claude Code

@philcunliffe

Copy link
Copy Markdown
Contributor Author

🧭 Decision map — where to spend your attention

Companion to the dual-review verdict. This casts no verdict — it points at the 6 forks where the author made a real choice, so you can skim the rest.

Scanned: 33 hunks across 21 files. Most is mechanical: the verbatim promotion of partitionSpecForDeclaration/validatePartitionSpecStability from src/core/cache/iceberg/ to src/core/iceberg/ with re-export shims, ~300 lines of LLP 0022/0003/0014 documentation, new test/smoke fixtures, and signature propagation of the partitioning input. The decisions worth your eyes, in order:

1. Existing tables hard-fail on upgrade — drift is rejected, not migrated · unhappy-path policy

hypaware-core/plugins-workspace/format-iceberg/src/commit.js:108

throw newError(
  'iceberg_partition_spec_drift',
  `iceberg-format: partition spec drift at '${input.tableUrl}': ${message}`
)
  • Decision: an append whose derived day-grain spec doesn't match the table's current spec throws a permanent iceberg_partition_spec_drift; partition evolution is unsupported, and the documented operator path is "export to a new prefix" (LLP 0022#drift-rejection).
  • Alternative not taken: detect "existing unpartitioned table + newly derived partitioning" and continue unpartitioned with a logged warning (keeping a working pre-upgrade export alive), or auto-roll to a new table prefix.
  • Check: the first post-upgrade append to every pre-existing V1 export table whose dataset declares primaryTimestampColumn (ai-gateway, gascity, otel all do) lands exactly here — the PR's own test (iceberg-partitioning.test.js:181) proves this shape fires — and the daemon then re-fails every tick with no in-product migration. Confirm that is the intended upgrade story.

2. icebird 0.8.5 → 0.8.9 — one line that swaps the shared write engine · new dependency

package.json:41

"icebird": "0.8.9",
  • Decision: bump the engine that both the export plugin and the intrinsic cache ride, to pick up sort-on-append, scan pruning, and icebergRewrite (icebird [codex] Record AI gateway message context #20/[codex] Add Codex metadata projection #21/[codex] Support positional attach clients #22).
  • Alternative not taken: hold the pin until the published 0.8.9 is validated end-to-end here (LLP 0022's own landing requirement). There is no export-only bump — the cache shares the pin, so its write behavior changes with zero cache code in this diff.
  • Check: the worktree still resolves 0.8.5 locally, so the suite and smokes validated the old engine. After a fresh install, re-run npm test and the three iceberg smokes — and note sort-on-append, which entry 5 depends on, is asserted nowhere at the row level.

3. Partition axis = writer-owned day(primaryTimestampColumn) · contract / format

hypaware-core/plugins-workspace/format-iceberg/src/partitioning.js:40

iceberg: { fields: [{ column: tsColumn, transform: 'day', required: true }] },
  • Decision: the export synthesizes its own day-grain declaration from primaryTimestampColumn, ignoring cachePartitioning for the partition axis.
  • Alternatives not taken: inherit cachePartitioning (conversation-identity ⇒ a hard floor of ~1 file per conversation, ~4.5K today and unbounded, that compaction cannot beat), or identity(date) (only works where a day string is precomputed; couples the export to a cache-ism).
  • Check: this layout is permanent once written (entry 1 makes drift fatal). Partition values are UTC day ordinals (…_day=20608), not date strings — confirm downstream consumers are fine with that, and that ~25–39K rows/day is the right grain.

4. The drift guard only runs in one direction · unhappy-path policy

hypaware-core/plugins-workspace/format-iceberg/src/commit.js:99

} else if (input.partitioning && priorState.metadata) {
  • Decision: spec-stability validation is gated on input.partitioning being truthy — an append with null partitioning is never checked.
  • Alternative not taken: a symmetric guard that also rejects (or explicitly preserves-and-reports) when partitioning derives to null but the existing table's spec is non-empty.
  • Check: a dataset that loses its primaryTimestampColumn, or whose timestamp column drops from the schema (partitioning.js:34 returns null), silently appends to a partitioned table while the commit span reports hyp_partition_spec: 'unpartitioned'. Reverse drift is undetected and untested.

5. Sort key borrowed from the cache's identity columns, in declared order · contract

hypaware-core/plugins-workspace/format-iceberg/src/partitioning.js:71

const declared = reg.cachePartitioning?.iceberg?.fields ?? []
// ...
if (f.transform !== 'identity') continue
  • Decision: the within-partition sort reuses cachePartitioning's identity columns in declared order (asc, nulls-last) — the one place the export reads the cache declaration. Non-identity and schema-missing columns are silently skipped; none ⇒ unsorted.
  • Alternative not taken: a writer-owned sort key (matching the partition-axis philosophy) — e.g. conversation_id alone, or with the timestamp as a tiebreaker so rows within a conversation are time-ordered.
  • Check: clustering quality hangs on declaration order — conversation_id leads for ai-gateway only because it is declared first — and the sort only materializes if 0.8.9's sort-on-append works (entry 2); tests pin it as metadata, never as row order.

6. Compaction is now possible but deliberately never run by the sink · concurrency / lifecycle

hypaware-core/plugins-workspace/format-iceberg/src/maintenance.js:120

 * @ref LLP 0022#compaction  icebird now exposes `icebergRewrite`
 * (read-rewrite compaction), but the export deliberately does not run it:
  • Decision: keep compactionSupported: false, redefining it from "impossible (icebird limitation)" to "not run by this sink — out-of-band only," because day partitions already yield large files and an in-daemon read-rewrite risks the parquet-sink OOM failure mode.
  • Alternative not taken: wire icebergRewrite into maintainExportTables now that the engine exposes it, tightening the local-vs-global sort.
  • Check: the field's meaning changed while its name and value did not — hyp sink maintain (src/core/cli/core_commands.js:2336-2339) still prints "compaction not supported by icebird," now factually false; and no out-of-band compaction affordance exists yet.

Honorable mentions (real but lower-stakes): partitioning.js:34 — missing/undeclared timestamp ⇒ silently unpartitioned rather than an error; commit.js:262currentPartitionSpec quietly falls back to the last spec, then an empty one, before the drift comparison; iceberg-partitioning.test.js:168 — the sort contract is pinned via default-sort-order-id metadata only, so an unsorted write passes; src/core/iceberg/types.d.ts:13 — the promoted core type keeps its Cache… name and cache-iceberg: error prefixes (rename explicitly deferred).

Generated by /decision-map. Advisory — directs attention, casts no verdict.

@philcunliffe philcunliffe merged commit 6b4b269 into master Jun 11, 2026
6 checks passed
@philcunliffe philcunliffe deleted the worktree-iceberg-sink branch June 11, 2026 02:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant