
feat: chunk-aware hybrid search in all text adapters (#854) #1173

Open
pyramation wants to merge 3 commits into main from feat/hybrid-search-chunks



@pyramation

Summary

Implements chunk-aware search across all 4 search adapters (tsvector, BM25, trgm, pgvector) via the @hasChunks smart tag (#854). This enables hybrid search: applications can query both vector and text indexes on chunks tables simultaneously, then combine the results with RRF (reciprocal rank fusion) or another fusion strategy.

New shared module — adapters/chunks.ts:

  • ChunksInfo interface with all chunk metadata fields: chunksTable, parentFk, parentPk, embeddingField, contentField, searchField, searchIndexes
  • getChunksInfo(codec) — extracts and validates chunk metadata from @hasChunks smart tag
  • Handles both object and JSON-string tag formats
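
A minimal sketch of what the shared module could look like, for reviewers who want the shape in front of them. The field names come from the list above; the field types, the `CodecLike` shape, the `hasChunks` tag key, and the choice to return `null` on any invalid input are assumptions for illustration, not the code in `adapters/chunks.ts`.

```typescript
// Field names follow the PR summary; the types are assumed.
interface ChunksInfo {
  chunksTable: string;
  parentFk: string;
  parentPk: string;
  embeddingField: string;
  contentField: string;
  searchField: string;
  searchIndexes: string[];
}

// Hypothetical, stripped-down codec: only the smart-tag bag is modelled.
interface CodecLike {
  extensions?: { tags?: Record<string, unknown> };
}

const REQUIRED_STRING_FIELDS = [
  "chunksTable",
  "parentFk",
  "parentPk",
  "embeddingField",
  "contentField",
  "searchField",
] as const;

function getChunksInfo(codec: CodecLike): ChunksInfo | null {
  const raw = codec.extensions?.tags?.hasChunks;
  if (raw == null) return null;

  // The tag may arrive as an object or as a JSON string.
  let parsed: unknown = raw;
  if (typeof raw === "string") {
    try {
      parsed = JSON.parse(raw);
    } catch (_err) {
      return null; // not valid JSON
    }
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return null;
  }

  // Assumed validation rule: every field must be present and well-typed.
  const tag = parsed as Record<string, unknown>;
  for (const key of REQUIRED_STRING_FIELDS) {
    if (typeof tag[key] !== "string") return null;
  }
  const indexes = tag.searchIndexes;
  if (!Array.isArray(indexes) || !indexes.every((i) => typeof i === "string")) {
    return null;
  }

  return {
    chunksTable: tag.chunksTable as string,
    parentFk: tag.parentFk as string,
    parentPk: tag.parentPk as string,
    embeddingField: tag.embeddingField as string,
    contentField: tag.contentField as string,
    searchField: tag.searchField as string,
    searchIndexes: indexes as string[],
  };
}
```

The real module may apply defaults for missing fields rather than rejecting the tag; that behaviour is worth confirming in review.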

Adapter changes:

| Adapter | Chunk query pattern | Score combinator | includeChunks option |
|---|---|---|---|
| tsvector | `(SELECT MAX(ts_rank(search, tsquery)) FROM chunks WHERE fk = parent.id)` | `GREATEST(parent, chunk)` (higher is better) | ✅ default true |
| BM25 | `(SELECT MIN(score <@> bm25query) FROM chunks WHERE fk = parent.id)` | `LEAST(parent, chunk)` (lower is better) | ✅ default true |
| trgm | `(SELECT MAX(similarity(content, value)) FROM chunks WHERE fk = parent.id)` | `GREATEST(parent, chunk)` (higher is better) | ✅ default true |
| pgvector | Already had chunk support; refactored to use shared ChunksInfo | `LEAST(parent, chunk)` (lower is better) | ✅ (existing) |

Each adapter checks searchIndexes from @hasChunks to verify the chunks table has the relevant index type before enabling chunk-aware querying.
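
The gating and combinator rules for the three text adapters can be sketched as below. This is a simplified illustration that builds plain strings; the adapters themselves presumably build SQL fragments rather than concatenating text. `combineScore`, `REQUIRED_INDEX`, and `COMBINATOR` are hypothetical names, and the mapping of tsvector to `fulltext` and trgm to `trigram` is inferred from the `search_indexes` values named in the #856 commit message.

```typescript
type TextAdapter = "tsvector" | "bm25" | "trgm";

// Assumed mapping from adapter to the searchIndexes entry it needs.
const REQUIRED_INDEX: Record<TextAdapter, string> = {
  tsvector: "fulltext",
  bm25: "bm25",
  trgm: "trigram",
};

// From the adapter table: rank and similarity are higher-is-better,
// BM25 scores are lower-is-better.
const COMBINATOR: Record<TextAdapter, "GREATEST" | "LEAST"> = {
  tsvector: "GREATEST",
  bm25: "LEAST",
  trgm: "GREATEST",
};

function combineScore(
  adapter: TextAdapter,
  parentExpr: string,
  chunkSubquery: string,
  searchIndexes: string[],
  includeChunks: boolean = true
): string {
  // Chunk querying applies only when it is enabled and the chunks table
  // declares the index type this adapter relies on.
  const chunkAware =
    includeChunks && searchIndexes.includes(REQUIRED_INDEX[adapter]);
  if (!chunkAware) return parentExpr;
  return `${COMBINATOR[adapter]}(${parentExpr}, ${chunkSubquery})`;
}
```

PostgreSQL's `GREATEST` and `LEAST` ignore NULL arguments and return NULL only when every argument is NULL, so a parent with no chunks keeps its own score and a parent matched only through chunks takes the chunk score.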

Integration tests:

  • finds parent via chunk tsvector match (term only in chunks) — verifies parent is found when search term only exists in chunk content
  • finds parent via chunk trgm similarity (term only in chunks) — verifies fuzzy matching through chunks
  • Setup SQL updated with content, search (tsvector), BM25, and trgm indexes on posts_chunks table

Also includes the prerequisite #856 work (search_indexes parameter on ProcessChunks/ProcessFileEmbedding).

Review & Testing Checklist for Human

Medium-high risk — core search infrastructure change affecting all 4 adapters.

  • Verify the lateral subquery pattern generates correct SQL for each adapter by inspecting the generated queries (enable PostGraphile query logging)
  • Test with a real provisioned database: create a table with ProcessChunks + search_indexes, insert parent + chunk rows, and query via each adapter
  • Verify includeChunks: false correctly disables chunk querying for all text adapters (tsvector, BM25, trgm)
  • Check performance: lateral subqueries on large chunks tables should use the appropriate indexes (GIN for tsvector/trgm, BM25 for pg_textsearch, HNSW for pgvector)
  • Verify the BM25 chunks index name convention ({chunks_table}_{content_field}_bm25_idx) matches what the DB generator creates
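
For the last checklist item, a small helper like the following could be used in a test to compare the expected name with what the DB generator emits. `bm25ChunksIndexName` and `exceedsPgIdentifierLimit` are hypothetical helpers; only the `{chunks_table}_{content_field}_bm25_idx` pattern comes from this PR.

```typescript
// Builds the index name from the convention stated in the checklist.
function bm25ChunksIndexName(chunksTable: string, contentField: string): string {
  return `${chunksTable}_${contentField}_bm25_idx`;
}

// PostgreSQL truncates identifiers longer than 63 bytes by default, so a
// long table or column name could make the generated and expected names
// diverge. This check assumes ASCII names (one byte per character).
function exceedsPgIdentifierLimit(name: string): boolean {
  return name.length > 63;
}
```

With the test fixtures in this PR, `bm25ChunksIndexName("posts_chunks", "content")` gives `posts_chunks_content_bm25_idx`.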

Notes

  • This PR includes the (WIP) Feat/v5 relations cont #856 commit (search_indexes parameter) since that work hasn't been merged to main yet
  • The companion constructive-db PR (#1165) expands the @hasChunks smart tag with the search metadata these adapters consume
  • The actual search orchestration (dual queries, RRF fusion) lives in application code — this PR enables that capability by making both text and vector search paths available on chunks
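
Since the fusion step is left to application code, here is a minimal sketch of reciprocal rank fusion over the two result lists. It is not part of this PR; `rrfFuse` is a hypothetical helper, `k = 60` is the commonly used default rather than a value from this repository, and each input list is assumed to hold unique parent ids ordered best-first.

```typescript
interface FusedResult {
  id: string;
  score: number;
}

// Reciprocal rank fusion: each list contributes 1 / (k + rank) for every
// id it contains, where rank is the 1-based position in that list.
function rrfFuse(rankings: string[][], k: number = 60): FusedResult[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Higher fused score means a better combined position.
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Usage: ids returned by a text adapter query and a pgvector query.
const textHits = ["a", "b", "c"];
const vectorHits = ["b", "d", "a"];
const fused = rrfFuse([textHits, vectorHits]);
// fused order: b, a, d, c
```

RRF uses only rank positions, so it sidesteps the fact that the adapters' raw scores point in different directions (higher is better for tsvector and trgm, lower is better for BM25 and pgvector).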

Link to Devin session: https://app.devin.ai/sessions/2b5a29d83d3f478e8d3d972653b4879c
Requested by: @pyramation

…bedding (#856)

Adds search_indexes to ProcessChunks parameter_schema and ProcessFileEmbedding's
chunks sub-config with default ['fulltext']. Enables hybrid RAG by opting into
fulltext (tsvector), bm25, or trigram search on the chunks content column.

Companion to constructive-io/constructive-db#1164
)

- Extract shared getChunksInfo/ChunksInfo into adapters/chunks.ts
- tsvector adapter: lateral subquery for MAX(ts_rank) across chunks
- BM25 adapter: lateral subquery for MIN(bm25_score) across chunks
- trgm adapter: lateral subquery for MAX(similarity) across chunks
- All adapters respect includeChunks option (default: true when @hasChunks present)
- pgvector adapter refactored to use shared ChunksInfo
- Integration tests: chunk-aware tsvector and trgm queries
- Setup SQL: add content, search (tsvector), BM25, trgm indexes on posts_chunks
