fix: floor track-write window at the input region (#233 follow-up) #236

Merged
d-laub merged 8 commits into main from fix/track-write-window-input-floor
Jun 20, 2026

d-laub commented Jun 20, 2026

Collaborator

Problem

The variant writers set each region's stored chromEnd to the end of the furthest retained variant. When a region's tail is variant-free, that end falls short of the user's input window end, so regions.npy (which the track writers read) records a truncated window. Functional-genomic tracks (annotation and sample BigWigs) were therefore written over the truncated region and read back with a zeroed tail past each region's rightmost variant.

This is a #233 follow-up: the input-region tail was silently dropped for any variant + track dataset.
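
To make the failure mode concrete, here is a minimal toy sketch. It is not the project's code: `write_track` and `read_track` are hypothetical stand-ins, a plain list stands in for the stored track, and half-open windows are assumed. The numbers mirror the snapshot fixture mentioned in this PR (a 20 bp region whose stored chromEnd was 111 against an input end of 121).

```python
def write_track(signal, chrom_start, stored_end):
    """Hypothetical writer: keep only the part covered by [chrom_start, stored_end)."""
    return signal[: stored_end - chrom_start]


def read_track(stored, chrom_start, input_end):
    """Hypothetical reader: return the full input width, zero-filling unstored bases."""
    width = input_end - chrom_start
    return stored[:width] + [0.0] * max(0, width - len(stored))


chrom_start, input_end = 101, 121            # 20 bp input window
signal = [0.5] * (input_end - chrom_start)   # constant-0.5 track

# Pre-fix: stored chromEnd stops at the rightmost variant (111).
truncated = read_track(write_track(signal, chrom_start, 111), chrom_start, input_end)

# Post-fix: stored chromEnd is floored at the input window end (121).
floored = read_track(write_track(signal, chrom_start, 121), chrom_start, input_end)
```

In this sketch `truncated` carries ten real values followed by ten zeros, while `floored` is covered over the whole input width.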

Fix

Floor the stored chromEnd at the input window end at all three writer sites:

  • _region_end / _region_ends_from_list (VCF + PGEN, both extend_to_length branches): max(fallback_end, v_ends[max idx])
  • _write_from_svar: pl.max_horizontal(pl.Series(max_ends), pl.col("chromEnd"))
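
The flooring rule in the list above can be sketched in plain Python. This is an illustration only: `floored_region_end` is a hypothetical helper, and it assumes that `fallback_end` in the real writers corresponds to the input window end.

```python
def floored_region_end(input_end, variant_ends):
    """Return the stored chromEnd: never less than the input window end.

    Hypothetical helper; `input_end` plays the role assumed for `fallback_end`.
    """
    if not variant_ends:
        # No retained variants: fall back to the input window end.
        return input_end
    # Variants may still push the end past the input window, but not below it.
    return max(input_end, max(variant_ends))


print(floored_region_end(121, [105, 111]))  # variants end early, floor applies
print(floored_region_end(121, [118, 130]))  # a variant extends past the window
```

The first call returns the input end (121) because every variant ends inside the window; the second returns 130 because a variant still extends the stored window beyond the input.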

The read path already clips to the input window, so no read change is needed and the on-disk format is unchanged (no DATASET_FORMAT_VERSION bump). Existing corrupt datasets must be rewritten — an open-time warning (_warn_truncated_tracks) flags them with a clear remediation message.
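
One way such an open-time check could work is sketched below. It is not the actual `_warn_truncated_tracks` implementation: the function names, the list-based inputs, and the message text are all assumptions made for illustration.

```python
import warnings


def truncated_region_indices(stored_ends, input_ends):
    """Indices of regions whose stored chromEnd is short of the input end."""
    return [i for i, (s, e) in enumerate(zip(stored_ends, input_ends)) if s < e]


def warn_if_truncated(stored_ends, input_ends):
    """Hypothetical open-time check: warn once if any region looks truncated."""
    bad = truncated_region_indices(stored_ends, input_ends)
    if bad:
        warnings.warn(
            f"{len(bad)} region(s) have a stored chromEnd short of the input "
            "window end; track tails may read back as zeros. "
            "Rewrite the dataset to repair it.",
            UserWarning,
            stacklevel=2,
        )
    return bad
```

A dataset written after the fix yields an empty list and no warning, which matches the stated goal that clean datasets stay silent.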

What's in this PR

  • The fix + a parametrized regions.npy floor test across VCF/PGEN/SVAR
  • End-to-end annot-track tail regression (constant-0.5 bigWig reads back fully covered over the input width)
  • Open-time warning for legacy variant-truncated datasets (clean datasets stay silent)
  • Dataset.regions == input invariant guard
  • Regenerated 4 test_flat_getitem_snapshot track snapshots that had silently encoded the truncation bug (the fixture's pre-fix windows truncated inside its own 20bp regions, e.g. region 0 stored chromEnd 111 vs input end 121); post-fix ragged track output is the correct 20bp = full input window. The 6 non-track snapshots are untouched.

Testing

Full tree green: 774 passed, 25 skipped, 4 xfailed. Lint + format + pyrefly clean.

🤖 Generated with Claude Code

d-laub and others added 8 commits June 19, 2026 19:47
Variant writers truncate stored chromEnd to the furthest retained variant,
so tracks are written against variant-truncated windows but read against the
true input windows -> annot/sample track tails zero out under realign=False.
Spec: floor stored chromEnd at the input window in the three variant writers,
plus a soft open-time warning for silently-corrupt existing variant+track
datasets.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…llow-up)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Variant writers set stored chromEnd to the furthest retained variant, which
can fall short of the input region end, so tracks were written over a
truncated window. Floor chromEnd at the input window in _region_end,
_region_ends_from_list, and _write_from_svar.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…233)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…#233)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Appends test_dataset_regions_match_input_with_variants to
tests/integration/tracks/test_track_window_floor.py. The test writes a
two-region dataset with a VCF variant source, opens it, and asserts that
ds.regions reflects the original input BED exactly. The invariant was
already true (input_regions.arrow always stores the user-supplied BED);
this test guards against any future regression that would feed the
variant-extended storage window into input_regions.arrow instead.

Skill re-check: no public-API surface changed, and the Common gotchas
section in skills/genvarloader/SKILL.md has no entry that cleanly
accommodates a one-sentence note about track storage window vs. input
region. Skill left unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…#233)

The flat-getitem characterization snapshots for the four track-output cases
(tracks_ragged, tracks_fixed, haps_tracks_fixed, haps_tracks_ragged) were
recorded against the pre-fix writer, which truncated each region's stored
track window at the rightmost retained variant. In the snap_dataset fixture
that truncation fell *inside* the 20 bp input regions (e.g. region 0 stored
chromEnd 111 vs input window end 121), so the snapshots silently encoded the
bug: ragged track output was shorter than the input region.

With the chromEnd floor (2750040), stored track windows now cover the full
input window and ragged track output is the expected 20 bp per region.
Regenerate the four affected snapshots; the other six (no tracks) are
unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The non-empty branch now returns max(fallback_end, v_ends[max idx]); the
docstring still described the pre-floor return value.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
d-laub merged commit 2539b72 into main Jun 20, 2026
8 checks passed
d-laub deleted the fix/track-write-window-input-floor branch June 20, 2026 04:05