RFC - Beta Filter For Disk Search #1101

Open
dyhyfu wants to merge 8 commits into microsoft:main from dyhyfu:rfc_disk_beta_filter

@dyhyfu dyhyfu commented May 25, 2026

Summary

Adds beta-biased filtering on the disk search path by introducing a SearchPlan enum that replaces the existing (vector_filter, is_flat_search) parameter pair on searcher.search(). The new enum is hierarchical (SearchPlan { FlatScan, Graph(GraphMode) }), makes invalid combinations unrepresentable, and gives future filter algorithms a single extension point without further growing the public search() signature.

Motivation

  • Beta-biased graph search is unavailable on the disk path. The in-memory side has a BetaFilter strategy; the disk side has no equivalent.
  • A raw closure can't carry beta. Bolting it on as a third Option<f32> parameter creates meaningless combinations (is_flat_search=true + beta=Some(_)) the type system can't reject.
  • No clean integration point for future filter algorithms (MultihopSearch exists in diskann but the disk API has no way to select it).

What's in the RFC

  • SearchPlan { FlatScan { filter: Option<Predicate> }, Graph(GraphMode) } + GraphMode { Unfiltered, PostFilter(p), BetaFilter { predicate, beta } }.
  • Predicate = Box<dyn Fn(u32) -> bool + Send + Sync> — closure-based.
  • Single GraphMode match site in search_strategy() that projects (predicate, beta); downstream DiskAccessor and RerankAndFilter read those fields directly — adding a new GraphMode variant touches one place.
  • Invalid combinations (beta on flat scan, beta without a predicate) unrepresentable by construction.
  • Zero allocation and zero per-iteration overhead preserved on the no-filter graph path.
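
A minimal sketch of the types listed above, assuming only the shapes given in this summary. The constructor names are taken from the caller-migration table; `describe` is a hypothetical helper added purely for illustration, and the RFC's actual definitions may differ in detail.

```rust
/// Caller-supplied predicate over vector IDs (shape taken from the summary).
pub type Predicate = Box<dyn Fn(u32) -> bool + Send + Sync>;

/// How a graph search treats the predicate.
pub enum GraphMode {
    Unfiltered,
    PostFilter(Predicate),
    BetaFilter { predicate: Predicate, beta: f32 },
}

/// Top-level plan. `beta` lives only under `Graph`, so "beta on a flat scan"
/// and "beta without a predicate" cannot be expressed.
pub enum SearchPlan {
    FlatScan { filter: Option<Predicate> },
    Graph(GraphMode),
}

impl SearchPlan {
    pub fn graph() -> Self {
        SearchPlan::Graph(GraphMode::Unfiltered)
    }
    pub fn flat() -> Self {
        SearchPlan::FlatScan { filter: None }
    }
    pub fn graph_with(mode: GraphMode) -> Self {
        SearchPlan::Graph(mode)
    }
    pub fn flat_filtered(p: Predicate) -> Self {
        SearchPlan::FlatScan { filter: Some(p) }
    }
}

/// Hypothetical helper (not part of the RFC): names the selected configuration.
pub fn describe(plan: &SearchPlan) -> String {
    match plan {
        SearchPlan::FlatScan { filter: None } => "flat".to_string(),
        SearchPlan::FlatScan { filter: Some(_) } => "flat+filter".to_string(),
        SearchPlan::Graph(GraphMode::Unfiltered) => "graph".to_string(),
        SearchPlan::Graph(GraphMode::PostFilter(_)) => "graph+post-filter".to_string(),
        SearchPlan::Graph(GraphMode::BetaFilter { beta, .. }) => format!("graph+beta({beta})"),
    }
}
```

Because every valid configuration is one enum value, a single `match` like the one in `describe` is exhaustive, and the compiler flags it when a new variant is added.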

Caller migration

| Today (vector_filter, is_flat_search) | New plan |
| --- | --- |
| (None, false) | SearchPlan::graph() |
| (None, true) | SearchPlan::flat() |
| (Some(p), false) | SearchPlan::graph_with(GraphMode::post_filter(p)) |
| (Some(p), true) | SearchPlan::flat_filtered(p) |
| (new capability) | SearchPlan::graph_with(GraphMode::beta_filter(p, β)) |
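
The first four rows of the migration table can be read as a total function from the legacy pair to the new plan. The sketch below assumes simplified stand-ins for the RFC's types; `from_legacy` is a hypothetical shim and is not proposed by the RFC. Beta filtering has no legacy spelling, so it does not appear here.

```rust
pub type Predicate = Box<dyn Fn(u32) -> bool + Send + Sync>;

/// Simplified stand-in: only the modes reachable from the legacy API.
pub enum GraphMode {
    Unfiltered,
    PostFilter(Predicate),
}

pub enum SearchPlan {
    FlatScan { filter: Option<Predicate> },
    Graph(GraphMode),
}

/// Hypothetical migration shim: one match arm per legacy row of the table.
pub fn from_legacy(vector_filter: Option<Predicate>, is_flat_search: bool) -> SearchPlan {
    match (vector_filter, is_flat_search) {
        (None, false) => SearchPlan::Graph(GraphMode::Unfiltered),
        (None, true) => SearchPlan::FlatScan { filter: None },
        (Some(p), false) => SearchPlan::Graph(GraphMode::PostFilter(p)),
        (Some(p), true) => SearchPlan::FlatScan { filter: Some(p) },
    }
}
```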

Known follow-ups (deferred)

  • Disk's GraphMode::BetaFilter applies bias + hard post-filter; in-memory BetaFilter only biases. Future work to align.

See rfcs/01101-disk-beta-filter.md for the full design.

yaohongdeng and others added 3 commits May 22, 2026 14:48
Adds an RFC proposing a `SearchPlan` enum to replace the disk search API's
`(vector_filter, is_flat_search)` parameter pair, and introducing beta-biased
graph search as a new capability. The hierarchical `SearchPlan { FlatScan,
Graph(GraphMode) }` shape makes invalid combinations unrepresentable and
provides a single extension point for future graph algorithms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Reframe Motivation around supporting BetaFilter on the disk path
  and designing an extension point for future filter algorithms.
- Move the full design doc content under the Proposal section.
- Set Benchmark Results, Future Work, References to N/A initially,
  then add one Future Work item: align disk and in-memory BetaFilter
  semantics (disk applies bias + post-filter; in-memory only biases).
- Drop the Multihop worked example and its back-references.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dyhyfu dyhyfu requested review from a team and Copilot May 25, 2026 06:00
@dyhyfu dyhyfu changed the title from "Rfc disk beta filter" to "Rfc- Beta Filter For Disk Search" May 25, 2026
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds an RFC proposing a new SearchPlan/GraphMode API for disk search to support beta-biased filtering, replace (vector_filter, is_flat_search) with a single typed parameter, and create an extensible hook for future graph search variants.

Changes:

  • Introduces a hierarchical SearchPlan { FlatScan, Graph(GraphMode) } model for disk search configuration.
  • Defines Predicate, GraphMode variants (including BetaFilter), and migration examples from the current API.
  • Documents intended internal plumbing changes (strategy projection, pq_distances beta application, and post-filtering behavior).

Comment thread rfcs/01101-disk-beta-filter.md Outdated
Comment thread rfcs/01101-disk-beta-filter.md Outdated
Comment thread rfcs/01101-disk-beta-filter.md Outdated
Comment thread rfcs/01101-disk-beta-filter.md Outdated

codecov-commenter commented May 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.53%. Comparing base (e2dc9a0) to head (9379b39).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1101      +/-   ##
==========================================
+ Coverage   89.46%   90.53%   +1.06%     
==========================================
  Files         482      482              
  Lines       91082    91082              
==========================================
+ Hits        81491    82461     +970     
+ Misses       9591     8621     -970     
| Flag | Coverage Δ |
| --- | --- |
| miri | 90.53% <ø> (+1.06%) ⬆️ |
| unittests | 90.49% <ø> (+1.37%) ⬆️ |

see 41 files with indirect coverage changes


@suri-kumkaran suri-kumkaran added the RFC Request For Comments label May 25, 2026
Contributor

@suri-kumkaran suri-kumkaran left a comment


Thanks for working through this — I’m aligned with the design. I do feel we can trim some of the redundant or overly explanatory parts of the RFC and leave a few of the implementation details for the implementation PRs.

Comment thread rfcs/01101-disk-beta-filter.md Outdated
Comment thread rfcs/01101-disk-beta-filter.md Outdated
Comment thread rfcs/01101-disk-beta-filter.md
- Add lifetime parameter to Predicate/SearchPlan/GraphMode so callers
  can pass closures borrowing non-'static data, matching the existing
  VectorFilter<'a, Data> flexibility.
- Make GraphMode::beta_filter fallible (returns Result<Self, BetaError>)
  so invalid beta from JSON/CLI surfaces as a validation error, not
  a process crash.
- Migrate the benchmark JSON schemas in the same PR (no external
  consumers; no grace period needed).
- Restructure for outside-in narrative: search() signature → projection
  function → DiskSearchStrategy struct change.
- Trim implementation-detail sections (pq_distances code block,
  data-flow diagram, plumbing table, testing list) that belong in
  the implementation PR.
- Rename "The five cases" to "Supported configurations".
- Add a Future Work item: align disk vs. in-memory BetaFilter
  semantics (disk applies bias + post-filter; in-memory only biases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
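
A sketch of the first two items in the commit message above: a lifetime parameter on `Predicate`/`GraphMode` and a fallible `beta_filter` constructor. The accepted range for beta is an assumption made for illustration (finite and in (0, 1]), and the `BetaError` variants are hypothetical; the RFC's actual validation rules may differ.

```rust
use std::fmt;

/// Predicate that may borrow non-'static caller data.
pub type Predicate<'a> = Box<dyn Fn(u32) -> bool + Send + Sync + 'a>;

/// Hypothetical error variants; the RFC only names the type `BetaError`.
#[derive(Debug, PartialEq)]
pub enum BetaError {
    NotFinite,
    OutOfRange(f32),
}

impl fmt::Display for BetaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            BetaError::NotFinite => write!(f, "beta must be a finite number"),
            BetaError::OutOfRange(b) => write!(f, "beta {b} is outside (0, 1]"),
        }
    }
}

impl std::error::Error for BetaError {}

pub enum GraphMode<'a> {
    Unfiltered,
    PostFilter(Predicate<'a>),
    BetaFilter { predicate: Predicate<'a>, beta: f32 },
}

impl<'a> GraphMode<'a> {
    /// ASSUMPTION: valid beta is finite and in (0, 1]. Invalid input is
    /// returned as an error instead of panicking.
    pub fn beta_filter(predicate: Predicate<'a>, beta: f32) -> Result<Self, BetaError> {
        if !beta.is_finite() {
            return Err(BetaError::NotFinite);
        }
        if beta <= 0.0 || beta > 1.0 {
            return Err(BetaError::OutOfRange(beta));
        }
        Ok(GraphMode::BetaFilter { predicate, beta })
    }
}
```

With the lifetime parameter, a caller can pass a closure that borrows a local allow-list rather than moving or cloning it into a `'static` box.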
@suri-kumkaran suri-kumkaran changed the title from "Rfc- Beta Filter For Disk Search" to "RFC - Beta Filter For Disk Search" May 26, 2026
@microsoft-github-policy-service

@dyhyfu please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"
Contributor License Agreement

Contribution License Agreement

This Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
and conveys certain license rights to Microsoft Corporation and its affiliates (“Microsoft”) for Your
contributions to Microsoft open source projects. This Agreement is effective as of the latest signature
date below.

  1. Definitions.
    “Code” means the computer software code, whether in human-readable or machine-executable form,
    that is delivered by You to Microsoft under this Agreement.
    “Project” means any of the projects owned or managed by Microsoft and offered under a license
    approved by the Open Source Initiative (www.opensource.org).
    “Submit” is the act of uploading, submitting, transmitting, or distributing code or other content to any
    Project, including but not limited to communication on electronic mailing lists, source code control
    systems, and issue tracking systems that are managed by, or on behalf of, the Project for the purpose of
    discussing and improving that Project, but excluding communication that is conspicuously marked or
    otherwise designated in writing by You as “Not a Submission.”
    “Submission” means the Code and any other copyrightable material Submitted by You, including any
    associated comments and documentation.
  2. Your Submission. You must agree to the terms of this Agreement before making a Submission to any
    Project. This Agreement covers any and all Submissions that You, now or in the future (except as
    described in Section 4 below), Submit to any Project.
  3. Originality of Work. You represent that each of Your Submissions is entirely Your original work.
    Should You wish to Submit materials that are not Your original work, You may Submit them separately
    to the Project if You (a) retain all copyright and license information that was in the materials as You
    received them, (b) in the description accompanying Your Submission, include the phrase “Submission
    containing materials of a third party:” followed by the names of the third party and any licenses or other
    restrictions of which You are aware, and (c) follow any other instructions in the Project’s written
    guidelines concerning Submissions.
  4. Your Employer. References to “employer” in this Agreement include Your employer or anyone else
    for whom You are acting in making Your Submission, e.g. as a contractor, vendor, or agent. If Your
    Submission is made in the course of Your work for an employer or Your employer has intellectual
    property rights in Your Submission by contract or applicable law, You must secure permission from Your
    employer to make the Submission before signing this Agreement. In that case, the term “You” in this
    Agreement will refer to You and the employer collectively. If You change employers in the future and
    desire to Submit additional Submissions for the new employer, then You agree to sign a new Agreement
    and secure permission from the new employer before Submitting those Submissions.
  5. Licenses.
  • Copyright License. You grant Microsoft, and those who receive the Submission directly or
    indirectly from Microsoft, a perpetual, worldwide, non-exclusive, royalty-free, irrevocable license in the
    Submission to reproduce, prepare derivative works of, publicly display, publicly perform, and distribute
    the Submission and such derivative works, and to sublicense any or all of the foregoing rights to third
    parties.
  • Patent License. You grant Microsoft, and those who receive the Submission directly or
    indirectly from Microsoft, a perpetual, worldwide, non-exclusive, royalty-free, irrevocable license under
    Your patent claims that are necessarily infringed by the Submission or the combination of the
    Submission with the Project to which it was Submitted to make, have made, use, offer to sell, sell and
    import or otherwise dispose of the Submission alone or with the Project.
  • Other Rights Reserved. Each party reserves all rights not expressly granted in this Agreement.
    No additional licenses or rights whatsoever (including, without limitation, any implied licenses) are
    granted by implication, exhaustion, estoppel or otherwise.
  6. Representations and Warranties. You represent that You are legally entitled to grant the above
    licenses. You represent that each of Your Submissions is entirely Your original work (except as You may
    have disclosed under Section 3). You represent that You have secured permission from Your employer to
    make the Submission in cases where Your Submission is made in the course of Your work for Your
    employer or Your employer has intellectual property rights in Your Submission by contract or applicable
    law. If You are signing this Agreement on behalf of Your employer, You represent and warrant that You
    have the necessary authority to bind the listed employer to the obligations contained in this Agreement.
    You are not expected to provide support for Your Submission, unless You choose to do so. UNLESS
    REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING, AND EXCEPT FOR THE WARRANTIES
    EXPRESSLY STATED IN SECTIONS 3, 4, AND 6, THE SUBMISSION PROVIDED UNDER THIS AGREEMENT IS
    PROVIDED WITHOUT WARRANTY OF ANY KIND, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTY OF
    NONINFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
  7. Notice to Microsoft. You agree to notify Microsoft in writing of any facts or circumstances of which
    You later become aware that would make Your representations in this Agreement inaccurate in any
    respect.
  8. Information about Submissions. You agree that contributions to Projects and information about
    contributions may be maintained indefinitely and disclosed publicly, including Your name and other
    information that You submit with Your Submission.
  9. Governing Law/Jurisdiction. This Agreement is governed by the laws of the State of Washington, and
    the parties consent to exclusive jurisdiction and venue in the federal courts sitting in King County,
    Washington, unless no federal subject matter jurisdiction exists, in which case the parties consent to
    exclusive jurisdiction and venue in the Superior Court of King County, Washington. The parties waive all
    defenses of lack of personal jurisdiction and forum non-conveniens.
  10. Entire Agreement/Assignment. This Agreement is the entire agreement between the parties, and
    supersedes any and all prior agreements, understandings or communications, written or oral, between
    the parties relating to the subject matter hereof. This Agreement may be assigned by Microsoft.

Contributor

@suri-kumkaran suri-kumkaran left a comment


This is getting close!

Comment thread rfcs/01101-disk-beta-filter.md
Comment thread rfcs/01101-disk-beta-filter.md
Comment thread rfcs/01101-disk-beta-filter.md Outdated
- Introduce FilterMode<'a> sum type at the strategy layer: replaces
  the (Option<&Predicate>, Option<f32>) tuple with one enum
  (None | Filter | BetaFilter), making (None, Some(β)) unrepresentable
  and extending better for future variants (Multihop, LabelBeta, ...).
- Fix the internal/external ID mismatch in RerankAndFilter::post_process:
  predicate is keyed on ExternalId per DataProvider contract; flat_search
  is correct; RerankAndFilter and the new pq_distances both add
  to_external_id conversion. Today's identity mapping hides the bug.
- Add Non-Goal: newtyping DiskProvider::ExternalId is deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
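
A sketch of the `FilterMode<'a>` sum type described in the commit message above, with a projection from `GraphMode`. The function names `project`, `biased_distance`, and `keep` are hypothetical, and multiplying the distance by beta for matching IDs is an assumption about how the bias is applied. The "bias plus hard post-filter" behaviour of the disk path is taken from the PR summary.

```rust
pub type Predicate<'a> = Box<dyn Fn(u32) -> bool + Send + Sync + 'a>;
type PredicateRef<'a> = &'a (dyn Fn(u32) -> bool + Send + Sync + 'a);

pub enum GraphMode<'a> {
    Unfiltered,
    PostFilter(Predicate<'a>),
    BetaFilter { predicate: Predicate<'a>, beta: f32 },
}

/// Strategy-layer view. Unlike `(Option<&Predicate>, Option<f32>)`,
/// there is no way to write "beta without a predicate".
pub enum FilterMode<'a> {
    None,
    Filter(PredicateRef<'a>),
    BetaFilter { predicate: PredicateRef<'a>, beta: f32 },
}

/// Hypothetical projection: the one place that matches on `GraphMode`.
pub fn project<'a>(mode: &'a GraphMode<'a>) -> FilterMode<'a> {
    match mode {
        GraphMode::Unfiltered => FilterMode::None,
        GraphMode::PostFilter(p) => FilterMode::Filter(&**p),
        GraphMode::BetaFilter { predicate, beta } => FilterMode::BetaFilter {
            predicate: &**predicate,
            beta: *beta,
        },
    }
}

/// ASSUMPTION: the bias scales the distance of matching IDs by `beta`.
pub fn biased_distance(mode: &FilterMode<'_>, id: u32, dist: f32) -> f32 {
    match mode {
        FilterMode::BetaFilter { predicate, beta } if predicate(id) => dist * *beta,
        _ => dist,
    }
}

/// Hard post-filter: both filtering modes drop non-matching IDs.
pub fn keep(mode: &FilterMode<'_>, id: u32) -> bool {
    match mode {
        FilterMode::None => true,
        FilterMode::Filter(p) => p(id),
        FilterMode::BetaFilter { predicate, .. } => predicate(id),
    }
}
```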
Contributor

@suri-kumkaran suri-kumkaran left a comment


Aligned on the design. ID convention is a clean fix and the FilterMode<'a> sum type makes the invariants compiler-enforced.

On Predicate<'a> vs Predicate ('static-only): fine either way — lifetime preserves borrow flexibility, 'static-only is simpler. Author's call.

Approving once the inline comments are resolved.

Comment thread rfcs/01101-disk-beta-filter.md Outdated
Current state:

- **Input predicate is keyed on `ExternalId`** by `DataProvider` contract.
- **`Index::flat_search`** is correct — calls `to_external_id` before invoking the predicate ([diskann/src/graph/index.rs](../diskann/src/graph/index.rs)).
Contributor


There's no flat_search in diskann/src/graph/index.rs (grep confirms — only disk_provider.rs has one). I think this might be a conflation with diskann::flat::FlatIndex::knn_search in diskann/src/flat/index.rs (the module from RFC 00983-flat-search) — that's a separate, parallel abstraction that the disk path doesn't delegate to.

The disk's own flat_search (disk_provider.rs:912) is a standalone implementation, and it invokes the predicate with positional/internal IDs directly:

let mut iter = (0..provider.num_points as u32).filter(vector_filter);

No to_external_id call anywhere in the disk flat_search path (confirmed: grep for to_external_id in disk_provider.rs returns only the trait definition and one unit test). The 0..num_points idiom looks generic but is the positional/internal space by construction — there's no abstraction between integer counting and positional storage. Identity translation masks this today; under any future non-identity provider, flat-scan and graph paths would diverge silently.

So all three sites (flat_search, pq_distances, RerankAndFilter::post_process) are currently incorrect — this RFC needs to fix all three.

Suggested edits:

  • Remove the "Sites already correct" bullet (line 269).
  • Add flat_search to "Sites updated in the implementation PR" (line 262):
    • flat_search (the disk-local one in disk_provider.rs) — call to_external_id on each id in (0..num_points) before invoking the predicate.
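
To illustrate the divergence described in this comment, here is a small sketch comparing a flat scan that passes positional IDs straight to the predicate with one that translates them first. `IdMap` is a hypothetical, simplified stand-in for the provider's translation (per the impl quoted later in this thread, the real `to_external_id` takes a context and returns a `Result`), and `Offset` is an invented non-identity mapping.

```rust
/// Hypothetical, simplified stand-in for the provider's ID translation.
pub trait IdMap {
    fn to_external_id(&self, internal: u32) -> u32;
}

/// Today's behaviour: internal and external IDs coincide.
pub struct Identity;
impl IdMap for Identity {
    fn to_external_id(&self, internal: u32) -> u32 {
        internal
    }
}

/// Invented non-identity mapping: external IDs start at an offset.
pub struct Offset(pub u32);
impl IdMap for Offset {
    fn to_external_id(&self, internal: u32) -> u32 {
        internal + self.0
    }
}

/// Mirrors the quoted line: the predicate sees positional IDs directly.
pub fn scan_internal(num_points: u32, pred: &dyn Fn(u32) -> bool) -> Vec<u32> {
    (0..num_points).filter(|&id| pred(id)).collect()
}

/// Suggested shape: translate each ID before invoking an external-keyed predicate.
pub fn scan_external(map: &dyn IdMap, num_points: u32, pred: &dyn Fn(u32) -> bool) -> Vec<u32> {
    (0..num_points)
        .filter(|&id| pred(map.to_external_id(id)))
        .collect()
}
```

Under `Identity` both scans return the same internal IDs, which is why the mismatch is invisible today. Under `Offset(100)` a predicate written against external IDs selects different points in the two scans.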

Contributor


Both BetaFilter and MultihopSearch take QueryLabelProvider<DP::InternalId> — their predicates run in the internal-id space. The only to_internal_id / to_external_id translations in index.rs happen at the insert/delete API boundary, not on the search path.
We can keep Predicate keyed on internal-id here too.

Author


Checking the latest code, flat_search has been moved to disk_provider.rs and takes internal IDs as its input. The "ID convention" section will be removed, and Predicate will be keyed on internal IDs.

Contributor


The motivation for external-keyed predicates was broader than the flat_search audit — that was a symptom, not the core argument. External-keyed is still the right contract because:

  • Matches the DataProvider abstraction; doesn't leak the positional internal representation.
  • Matches in-memory BetaFilter (keyed on DP::ExternalId via QueryLabelProvider) — closes one corner of the divergence the Future Work item flags.
  • Future-proof: non-identity translation (sparse IDs, deletion compaction, u64→u32) stays disk-internal; callers don't break silently.

Practically: how do external clients construct a predicate over internal IDs? They only see external IDs — from search outputs, ingest, the DataProvider contract. Asking them to maintain a bitmap over a space they can't observe pushes implementation details out as caller responsibility, and silently couples every consumer to today's identity-translation invariant.

The fix is small and inlines to a no-op under identity translation. Worth keeping in the RFC.

Contributor

@wuw92 wuw92 May 28, 2026


DiskProvider doesn't really expose an Internal/External contract today. Both types are pinned to u32, and VectorIdType = u32 is a hard where bound on the impl (disk_provider.rs:95-103). I'd suggest dropping the ID-convention change from this RFC.

  impl<Data> DataProvider for DiskProvider<Data>
  where Data: GraphDataType<VectorIdType = u32>,
  {
      type InternalId = u32;
      type ExternalId = u32;

      fn to_internal_id(&self, _: &DefaultContext, gid: &u32) -> Result<u32, _> { Ok(*gid) }
      fn to_external_id(&self, _: &DefaultContext, id: u32)   -> Result<u32, _> { Ok(id) }
  }

On the practical question: external clients implement to_internal_id / to_external_id on their provider, so they define the translation and own both sides. The question isn't whether they can observe internal IDs; it's which side of the boundary is the natural place to do the translation.

For the 1:N domain model (which is what our team has: external = doc_id, internal = vector_id, one doc-id → many vector-ids), "caller side" is a feasible option.

  • DataProvider's ExternalId ↔ InternalId is 1:1 by trait signature; a 1:N domain doesn't fit through it.
  • So we keep the doc_id ↔ {vector_id} map outside the library, for example in a filtered scenario:
    1. Filtered doc-ids → vector-id bitmap (outside DiskANN).
    2. DiskANN filter-search returns vector-ids.
    3. Vector-ids → doc-ids (outside DiskANN).
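
The three-step flow above can be sketched as follows. Everything here is hypothetical: `DocIndex` stands for the caller-owned doc-id to vector-id map, and `mock_filtered_search` is a placeholder for the library call, not the DiskANN API. A `HashSet` stands in for the bitmap.

```rust
use std::collections::{HashMap, HashSet};

/// Caller-owned 1:N map: one doc-id owns many vector-ids.
pub struct DocIndex {
    doc_to_vectors: HashMap<u64, Vec<u32>>,
    vector_to_doc: HashMap<u32, u64>,
}

impl DocIndex {
    pub fn new(pairs: &[(u64, Vec<u32>)]) -> Self {
        let mut doc_to_vectors = HashMap::new();
        let mut vector_to_doc = HashMap::new();
        for (doc, vectors) in pairs {
            for &v in vectors {
                vector_to_doc.insert(v, *doc);
            }
            doc_to_vectors.insert(*doc, vectors.clone());
        }
        DocIndex { doc_to_vectors, vector_to_doc }
    }

    /// Step 1: filtered doc-ids -> set of allowed vector-ids.
    pub fn allowed_vectors(&self, docs: &[u64]) -> HashSet<u32> {
        docs.iter()
            .filter_map(|d| self.doc_to_vectors.get(d))
            .flatten()
            .copied()
            .collect()
    }

    /// Step 3: vector-ids -> doc-ids, deduplicated in first-hit order.
    pub fn to_docs(&self, hits: &[u32]) -> Vec<u64> {
        let mut seen = HashSet::new();
        hits.iter()
            .filter_map(|v| self.vector_to_doc.get(v).copied())
            .filter(|d| seen.insert(*d))
            .collect()
    }
}

/// Step 2 placeholder: keeps the first `k` ranked candidates that pass the predicate.
pub fn mock_filtered_search(ranked: &[u32], pred: &dyn Fn(u32) -> bool, k: usize) -> Vec<u32> {
    ranked.iter().copied().filter(|&id| pred(id)).take(k).collect()
}
```

In this arrangement the predicate handed to the search is already expressed in vector-id space, so the library never sees doc-ids.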

Contributor


Two separate things getting mixed up — the outer mapping (domain ↔ vector IDs) and the inner consistency (which ID space DiskANN uses internally).

Outer mapping (1:N): agreed, this lives outside the library. DataProvider::ExternalId ↔ InternalId is 1:1 by signature, so doc-id ↔ {vector-id} has to be maintained caller-side. The flow you describe (filtered doc-ids → vector-id bitmap → DiskANN search → vector-ids → doc-ids) is the only way it can work. No disagreement there.

Inner consistency (the RFC's change): by the time DiskANN receives the predicate, it's already in vector-id space (= external ID). The question is whether DiskANN invokes that predicate with ExternalId or InternalId. Today three sites — RerankAndFilter::post_process, pq_distances, flat_search — pass internal directly. This is a latent bug: it produces wrong results the moment DiskProvider ever uses non-identity translation (e.g., for deletion compaction or sparse IDs added in place to the existing provider — no fork required).

Re: "DiskProvider doesn't expose the distinction today" — true, and that's exactly why the bug is invisible. The trait exists precisely so future in-place changes don't break consumers. Fixing now is essentially free (translation inlines to a no-op under identity). Leaving it means the next person extending DiskProvider has to remember to also audit every predicate site, with no compile-time help.

I'd like to keep the ID-convention section in this RFC: it is a pre-existing bug that we surface and fix here. The fix lands in the implementation PR.

Comment thread rfcs/01101-disk-beta-filter.md Outdated
Comment thread rfcs/01101-disk-beta-filter.md
Comment thread rfcs/01101-disk-beta-filter.md Outdated
Comment thread rfcs/01101-disk-beta-filter.md Outdated
yaohongdeng and others added 2 commits May 27, 2026 17:35
- Switch the documented ID convention back to internal IDs, matching
  the implementation: all three invocation sites (flat_search,
  pq_distances, RerankAndFilter) pass internal IDs uniformly; no
  to_external_id conversion at the predicate boundary. Identity
  mapping (InternalId == ExternalId == u32) makes this transparent
  to callers either way.
- Remove the §ID convention section, the Non-Goal about newtyping
  ExternalId, and the related note in the Future Work item -- all
  obsolete after the contract flip.
- Trim the "Project at the boundary" bullet in §Key design decisions:
  it dove into FilterMode internals before the reader had context,
  and the rationale already lives in §search_strategy projection.
  Replace with a shorter bullet phrased as "variant matching is
  delegated to search_strategy()" -- avoids the "accessor"
  terminology clash with the repo's existing Accessor trait.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

RFC Request For Comments

5 participants