fix: floor track-write window at the input region (#233 follow-up) #236
Merged
Variant writers truncate stored chromEnd to the furthest retained variant, so tracks are written against variant-truncated windows but read against the true input windows; as a result, annotation and sample track tails are zeroed under realign=False. Spec: floor stored chromEnd at the input window in the three variant writers, plus a soft open-time warning for existing variant+track datasets that are silently corrupt. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…llow-up) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Variant writers set stored chromEnd to the furthest retained variant, which can fall short of the input region end, so tracks were written over a truncated window. Floor chromEnd at the input window in _region_end, _region_ends_from_list, and _write_from_svar. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…233) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…#233) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Appends test_dataset_regions_match_input_with_variants to tests/integration/tracks/test_track_window_floor.py. The test writes a two-region dataset with a VCF variant source, opens it, and asserts that ds.regions reflects the original input BED exactly. The invariant was already true (input_regions.arrow always stores the user-supplied BED); this test guards against any future regression that would feed the variant-extended storage window into input_regions.arrow instead. Skill re-check: no public-API surface changed, and the Common gotchas section in skills/genvarloader/SKILL.md has no entry that cleanly accommodates a one-sentence note about track storage window vs. input region. Skill left unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…#233) The flat-getitem characterization snapshots for the four track-output cases (tracks_ragged, tracks_fixed, haps_tracks_fixed, haps_tracks_ragged) were recorded against the pre-fix writer, which truncated each region's stored track window at the rightmost retained variant. In the snap_dataset fixture that truncation fell *inside* the 20 bp input regions (e.g. region 0 stored chromEnd 111 vs input window end 121), so the snapshots silently encoded the bug: ragged track output was shorter than the input region. With the chromEnd floor (2750040), stored track windows now cover the full input window and ragged track output is the expected 20 bp per region. Regenerate the four affected snapshots; the other six (no tracks) are unchanged and untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The non-empty branch now returns max(fallback_end, v_ends[max idx]); the docstring still described the pre-floor return value. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
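The floored return value described in this commit can be sketched as follows. This is a hedged illustration only: the function name `floored_region_end`, its signature, and the reading of `v_ends[max idx]` as the largest retained variant end are assumptions, and the real `_region_end` may differ.

```python
from typing import Sequence


def floored_region_end(fallback_end: int, v_ends: Sequence[int]) -> int:
    """Return the stored chromEnd for one region (illustrative sketch).

    fallback_end: end of the user's input window (assumed meaning).
    v_ends: end coordinates of the variants retained in the region.
    """
    if len(v_ends) == 0:
        # No retained variants: fall back to the input window end.
        return fallback_end
    # Non-empty branch: never let the stored end drop below the input window.
    return max(fallback_end, max(v_ends))
```

Under these assumptions, a region whose last retained variant ends at 111 inside a window ending at 121 stores 121, while a variant that ends past the window still extends the stored end.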
Problem
The variant writers set each region's stored `chromEnd` to the furthest retained variant's end. When a region's tail is variant-free, that end falls below the user's input window, so `regions.npy`, which the track writers read, truncates the window. Functional-genomic tracks (annotation and sample BigWigs) were therefore written over a truncated region and read back with a zeroed tail past each region's rightmost variant.

This is a #233 follow-up: the input-region tail was silently dropped for any variant + track dataset.
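To make the failure mode concrete, here is a minimal pure-Python sketch of the write/read mismatch. The helpers `write_track` and `read_track` and the start coordinate 101 are illustrative assumptions, not genvarloader code; the ends 111 and 121 are borrowed from the snapshot fixture mentioned in the commit history.

```python
def write_track(signal, start, stored_end):
    """Store only the part of the signal inside [start, stored_end)."""
    return signal[: stored_end - start]


def read_track(stored, start, input_end):
    """Read back over the input window [start, input_end),
    zero-filling positions that were never written."""
    length = input_end - start
    return stored[:length] + [0.0] * (length - len(stored))


start, input_end = 101, 121          # 20 bp input window (start is assumed)
signal = [1.0] * (input_end - start)  # a track that is 1.0 everywhere

# Pre-fix: the stored end stops at the rightmost retained variant (111).
truncated = read_track(write_track(signal, start, 111), start, input_end)

# Post-fix: the stored end is floored at the input window end.
floored = read_track(write_track(signal, start, max(111, input_end)), start, input_end)
```

In this toy model `truncated` keeps the first 10 values and reads zeros for the last 10, whereas `floored` returns the full 20 bp signal.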
Fix
Floor the stored `chromEnd` at the input window end at all three writer sites:

- `_region_end` / `_region_ends_from_list` (VCF + PGEN, both `extend_to_length` branches): `max(fallback_end, v_ends[max idx])`
- `_write_from_svar`: `pl.max_horizontal(pl.Series(max_ends), pl.col("chromEnd"))`

The read path already clips to the input window, so no read change is needed and the on-disk format is unchanged (no `DATASET_FORMAT_VERSION` bump). Existing corrupt datasets must be rewritten; an open-time warning (`_warn_truncated_tracks`) flags them with a clear remediation message.

What's in this PR
- `regions.npy` floor test across VCF/PGEN/SVAR
- `Dataset.regions == input` invariant guard
- Regenerated the four `test_flat_getitem_snapshot` track snapshots that had silently encoded the truncation bug (the fixture's pre-fix windows truncated inside its own 20 bp regions, e.g. region 0 stored `chromEnd` 111 vs input end 121); post-fix ragged track output is the correct 20 bp, i.e. the full input window. The 6 non-track snapshots are untouched.

Testing
Full tree green: 774 passed, 25 skipped, 4 xfailed. Lint + format + pyrefly clean.
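
For datasets written before this fix, the kind of check an open-time warning could perform is sketched below. It assumes a dataset is suspect when any stored `chromEnd` is smaller than the matching input `chromEnd`; the function name, arguments, and message are hypothetical and are not the actual `_warn_truncated_tracks` implementation.

```python
import warnings
from typing import Sequence


def warn_if_truncated(stored_ends: Sequence[int], input_ends: Sequence[int]) -> bool:
    """Warn when a stored region end falls short of its input window end.

    Returns True if a warning was emitted. Hypothetical helper for illustration.
    """
    # Count regions whose stored window stops before the input window does.
    n_short = sum(s < i for s, i in zip(stored_ends, input_ends))
    if n_short == 0:
        return False
    warnings.warn(
        f"{n_short} region(s) have a stored chromEnd below the input window end; "
        "track tails may be zeroed. Rewrite the dataset to repair it.",
        UserWarning,
        stacklevel=2,
    )
    return True
```

A soft warning rather than an exception keeps old datasets readable while still pointing the user at the remediation, which matches the "soft open-time warning" wording in the commit history.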
🤖 Generated with Claude Code