feat!: Rust COITrees backend for gvl.Table; promote to public API #237
Merged
Full port of gvl.Table and annot_overlap from polars-bio onto a COITrees-backed Rust module. Fixes max_mem not being respected during write/update, removes the non-deterministic polars-bio segfault (#395), drops the [table] extra, and promotes Table out of experimental into the public API (CI-covered). Phase 4 continuation of the Rust-migration roadmap.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add /// Panics section to count() describing that sample codes >= n_samples and chrom_code >= n_contigs cause panics; callers must pass factor-encoded codes from the Table's own lists. Add debug_assert! before tree indexing to make the trust-the-Python-boundary contract explicit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
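The contract in this commit (codes must come from the Table's own category lists, so they are always in range) can be illustrated with a small Python sketch. `factor_encode` and `check_codes` are hypothetical helper names, not functions from genvarloader; the actual bounds check sits on the Rust side as a `debug_assert!`.

```python
from typing import List, Sequence


def factor_encode(values: Sequence[str], categories: Sequence[str]) -> List[int]:
    """Map each value to its position in `categories` (hypothetical helper)."""
    index = {cat: code for code, cat in enumerate(categories)}
    # An unknown value raises KeyError here, so an out-of-range code
    # can never be produced by this encoder.
    return [index[v] for v in values]


def check_codes(codes: Sequence[int], n_categories: int) -> None:
    """Python analogue of the debug assertion: every code must be in range."""
    for code in codes:
        assert 0 <= code < n_categories, (
            f"code {code} out of range for {n_categories} categories"
        )


contigs = ["chr1", "chr2"]
samples = ["s0", "s1", "s2"]
chrom_codes = factor_encode(["chr2", "chr1"], contigs)
sample_codes = factor_encode(["s2", "s0"], samples)
check_codes(chrom_codes, len(contigs))
check_codes(sample_codes, len(samples))
```

Codes built any other way (for example, from a different Table's sample list) are what the documented panic guards against.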
Adds a Hypothesis property test (100 examples) asserting byte-identical parity between the Rust COITrees Table backend and an independent brute-force numpy oracle across count_intervals and _intervals_from_offsets. The oracle uses t._df (the frame as stored by Table after its stable sort on chrom/sample_id/start) to match the Rust backend's stored-index ordering for equal-start interval ties. Also removes unused pytest imports left by earlier tasks (ruff check fix). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Extends tests/unit/test_table_parity.py with a Hypothesis property test (test_annot_overlap_matches_oracle) that verifies annot_overlap returns correct starts/ends/values per region vs a brute-force numpy oracle, matching tie-breaking via the same internal Table._df sorted ordering. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
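A minimal sketch of what a brute-force oracle with matching tie-breaking might look like. It is not the test code from this PR: it uses plain Python lists instead of numpy, the names `stored_order`, `oracle_overlaps`, and `oracle_count` are hypothetical, and half-open `[start, end)` intervals are an assumption.

```python
from typing import List, Tuple

# (chrom, sample_id, start, end, value); half-open [start, end) is assumed.
Row = Tuple[str, str, int, int, float]


def stored_order(rows: List[Row]) -> List[Row]:
    """Stable sort on (chrom, sample_id, start).

    Python's sorted() is stable, so rows with equal keys keep their input
    order, which is what makes equal-start ties deterministic.
    """
    return sorted(rows, key=lambda r: (r[0], r[1], r[2]))


def oracle_overlaps(
    rows: List[Row], chrom: str, sample_id: str, q_start: int, q_end: int
) -> List[Row]:
    """Check every row against the query; results come back in stored order."""
    return [
        r
        for r in stored_order(rows)
        if r[0] == chrom and r[1] == sample_id and r[2] < q_end and r[3] > q_start
    ]


def oracle_count(
    rows: List[Row], chrom: str, sample_id: str, q_start: int, q_end: int
) -> int:
    return len(oracle_overlaps(rows, chrom, sample_id, q_start, q_end))


rows = [
    ("chr1", "s1", 5, 9, 2.0),
    ("chr1", "s1", 5, 7, 1.0),  # same start as the row above: a tie
    ("chr1", "s1", 0, 3, 0.5),
    ("chr2", "s1", 5, 9, 3.0),
]
hits = oracle_overlaps(rows, "chr1", "s1", 6, 8)
```

Because the oracle sorts with the same key and the same stability guarantee as the stored frame, a property test can compare results element by element rather than as unordered sets.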
- Remove @pytest.mark.skipif from test_annot_tracks (polars-bio gone; annot DataFrame path now uses Rust COITrees backend); also drop the no-longer-needed `os` and `pytest` imports.
- Add early-return guard in annot_overlap for empty annot DataFrames (0 rows), returning a well-formed all-empty RaggedIntervals of shape (n_regions, None) instead of crashing when the internal __annot__ sample is missing from the Table.
- Add regression test test_annot_overlap_empty_annot in tests/unit/test_write_annot.py.
- Remove dead n_samples struct field from RustTable in src/tables.rs (was set but never read; eliminates the dead_code warning).
- Move `from ._table import Table` to alphabetically correct position in python/genvarloader/__init__.py (after ._ragged, before ._torch).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Replaces `gvl.Table`'s polars-bio overlap backend with a self-contained Rust COITrees module (`src/tables.rs`, `RustTable` PyO3 class), and promotes `Table` from `genvarloader.experimental` to the public API.

This:
- Fixes the `max_mem` blow-up during `gvl.write()`/`update()` for Table/annot tracks: the Rust streaming writer bounds the working set to one region's overlaps plus one contig's lazily-built trees, and raises if a single region exceeds the budget.
- Removes the non-deterministic polars-bio segfault (#395). ([…] `var_ranges`, so it is not fully removed from the tree.)
- Promotes `Table` to the public API (`from genvarloader import Table`); deletes the `experimental` subpackage and the `[table]` extra.

Architecture
- `src/tables.rs`: immutable interval store grouped by `(chrom_code, sample_code)`, a COITrees overlap engine (count + materialize), and a streaming writer. Trees are built lazily one contig at a time and dropped on contig change.
- `python/genvarloader/_table.py`: thin polars constructor/validator that factor-encodes the frame (sorted by `(chrom, sample_id, start)`) and delegates all overlap to `RustTable`.
- […] (`INTERVAL_DTYPE`, 12 bytes/row LE; i64 LE offsets), so datasets read back unchanged via `np.memmap`.

Breaking change
`genvarloader.experimental.Table` is removed; use `genvarloader.Table`. The `[table]` extra is gone.

Testing
- Hypothesis property tests against a brute-force numpy oracle for `count_intervals`, `_intervals_from_offsets`, and `annot_overlap`; a `max_mem` regression test; the end-to-end annot-DataFrame integration test is now un-skipped and runs in CI.
- `cargo test --release`: 12 passed; ruff + pyrefly clean.

Notes / follow-up
- `annot_overlap` now returns an all-empty `RaggedIntervals` for an empty annotation BED (previously crashed).

🤖 Generated with Claude Code
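The streaming-writer behavior and on-disk layout described in this PR can be sketched in Python as follows. This is an illustration under stated assumptions, not the Rust implementation: the description only gives 12 bytes/row little-endian and i64 little-endian offsets, so the `int32 start, int32 end, float32 value` record layout, row-count (rather than byte) offsets, half-open intervals, and the name `write_regions` are all assumptions.

```python
import io
import struct
from typing import BinaryIO, List, Tuple

# Assumed record layout: 3 x 4 bytes = 12 bytes/row, little-endian.
RECORD = struct.Struct("<iif")
# i64 little-endian offsets, as stated in the PR description.
OFFSET = struct.Struct("<q")


def write_regions(
    regions: List[Tuple[int, int]],
    intervals: List[Tuple[int, int, float]],
    max_mem: int,
    data_out: BinaryIO,
    offsets_out: BinaryIO,
) -> int:
    """Stream overlaps one region at a time; return the total rows written."""
    total = 0
    offsets_out.write(OFFSET.pack(0))
    for q_start, q_end in regions:
        # Only this region's overlaps are held in memory at once.
        hits = [iv for iv in intervals if iv[0] < q_end and iv[1] > q_start]
        needed = len(hits) * RECORD.size
        if needed > max_mem:
            # A single region over budget cannot be streamed in smaller pieces.
            raise MemoryError(
                f"region ({q_start}, {q_end}) needs {needed} bytes, "
                f"budget is {max_mem}"
            )
        for start, end, value in hits:
            data_out.write(RECORD.pack(start, end, value))
        total += len(hits)
        offsets_out.write(OFFSET.pack(total))
    return total


data, offs = io.BytesIO(), io.BytesIO()
n_rows = write_regions(
    regions=[(0, 10), (20, 30)],
    intervals=[(5, 8, 1.0), (9, 25, 2.0)],
    max_mem=1024,
    data_out=data,
    offsets_out=offs,
)
offsets = [o[0] for o in OFFSET.iter_unpack(offs.getvalue())]
```

Region `i` owns rows `offsets[i]:offsets[i + 1]`, so a reader can memory-map the flat record file and slice it per region without parsing anything else.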