feat: chunk-aware hybrid search in all text adapters (#854) by pyramation · Pull Request #1173 · constructive-io/constructive

pyramation · 2026-05-15T05:52:56Z

Summary

Implements chunk-aware text search across all 4 search adapters (tsvector, BM25, trgm, pgvector) via the @hasChunks smart tag (#854). This enables hybrid search — applications can query both vector and text indexes on chunks tables simultaneously, then combine results with RRF or other fusion strategies.

New shared module — adapters/chunks.ts:

ChunksInfo interface with all chunk metadata fields: chunksTable, parentFk, parentPk, embeddingField, contentField, searchField, searchIndexes
getChunksInfo(codec) — extracts and validates chunk metadata from @hasChunks smart tag
Handles both object and JSON-string tag formats
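As a rough illustration of the module described above, the sketch below shows one way the interface and the tag parsing could look. Only the field names and the `getChunksInfo(codec)` name come from this PR; the field types, the `CodecLike` shape, the camelCase tag keys, and every default value are assumptions made for the example, not the actual contents of adapters/chunks.ts.

```typescript
// Hypothetical sketch; the real adapters/chunks.ts may differ.
interface ChunksInfo {
  chunksTable: string;
  parentFk: string;
  parentPk: string;
  embeddingField: string;
  contentField: string;
  searchField: string;
  searchIndexes: string[];
}

// Assumed minimal codec shape: only the smart-tag lookup matters here.
interface CodecLike {
  extensions?: { tags?: Record<string, unknown> };
}

function getChunksInfo(codec: CodecLike): ChunksInfo | null {
  const raw = codec.extensions?.tags?.['hasChunks'];
  if (raw == null) return null;

  // The tag may arrive as an object or as a JSON string.
  let tag: unknown = raw;
  if (typeof raw === 'string') {
    try {
      tag = JSON.parse(raw);
    } catch {
      return null; // malformed JSON: treat the table as having no chunks
    }
  }
  if (typeof tag !== 'object' || tag === null || Array.isArray(tag)) return null;

  const t = tag as Record<string, unknown>;
  const chunksTable = t['chunksTable'];
  if (typeof chunksTable !== 'string' || chunksTable === '') return null;

  // Fallback values below are illustrative assumptions, not documented defaults.
  const str = (v: unknown, fallback: string): string =>
    typeof v === 'string' && v !== '' ? v : fallback;
  const indexes = t['searchIndexes'];

  return {
    chunksTable,
    parentFk: str(t['parentFk'], 'parent_id'),
    parentPk: str(t['parentPk'], 'id'),
    embeddingField: str(t['embeddingField'], 'embedding'),
    contentField: str(t['contentField'], 'content'),
    searchField: str(t['searchField'], 'search'),
    searchIndexes: Array.isArray(indexes)
      ? indexes.filter((x: unknown): x is string => typeof x === 'string')
      : [],
  };
}
```

Returning `null` for a missing or invalid tag lets each adapter fall back to parent-only search without special-casing errors.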

Adapter changes:

| Adapter | Chunk query pattern | Score combinator | `includeChunks` option |
| --- | --- | --- | --- |
| tsvector | `(SELECT MAX(ts_rank(search, tsquery)) FROM chunks WHERE fk = parent.id)` | `GREATEST(parent, chunk)` (higher is better) | ✅ default true |
| BM25 | `(SELECT MIN(score <@> bm25query) FROM chunks WHERE fk = parent.id)` | `LEAST(parent, chunk)` (lower is better) | ✅ default true |
| trgm | `(SELECT MAX(similarity(content, value)) FROM chunks WHERE fk = parent.id)` | `GREATEST(parent, chunk)` (higher is better) | ✅ default true |
| pgvector | Already had chunk support; refactored to use shared `ChunksInfo` | `LEAST(parent, chunk)` (lower is better) | ✅ (existing) |
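To make the patterns in the table above concrete, here is a minimal sketch that assembles the combined score expression for the three text adapters. It uses plain string building purely for illustration; the actual adapters presumably construct SQL fragments through the PostGraphile tooling. The names `chunkAwareScore`, `ChunkScoreInput`, and `quoteIdent` are hypothetical, and the query value is passed in as an already-prepared placeholder such as `$1`.

```typescript
type ChunkTextAdapter = 'tsvector' | 'bm25' | 'trgm';

interface ChunkScoreInput {
  adapter: ChunkTextAdapter;
  chunksTable: string; // e.g. posts_chunks
  parentFk: string;    // FK column on the chunks table
  parentAlias: string; // alias of the parent table in the outer query
  parentPk: string;    // PK column on the parent table
  column: string;      // chunk column the adapter scores against
  queryExpr: string;   // placeholder or expression for the search value
  parentScore: string; // score expression already computed for the parent row
}

// Double-quote an identifier, escaping embedded quotes.
function quoteIdent(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

function chunkAwareScore(i: ChunkScoreInput): string {
  const col = quoteIdent(i.column);
  let perChunk: string;
  let aggregate: 'MAX' | 'MIN';
  let combinator: 'GREATEST' | 'LEAST';

  switch (i.adapter) {
    case 'tsvector': // rank: higher is better
      perChunk = `ts_rank(${col}, ${i.queryExpr})`;
      aggregate = 'MAX';
      combinator = 'GREATEST';
      break;
    case 'bm25': // distance-style score: lower is better
      perChunk = `${col} <@> ${i.queryExpr}`;
      aggregate = 'MIN';
      combinator = 'LEAST';
      break;
    case 'trgm': // similarity: higher is better
      perChunk = `similarity(${col}, ${i.queryExpr})`;
      aggregate = 'MAX';
      combinator = 'GREATEST';
      break;
  }

  // Correlated subquery: best chunk score for the current parent row.
  const chunkScore =
    `(SELECT ${aggregate}(${perChunk}) FROM ${quoteIdent(i.chunksTable)} ` +
    `WHERE ${quoteIdent(i.parentFk)} = ${quoteIdent(i.parentAlias)}.${quoteIdent(i.parentPk)})`;

  return `${combinator}(${i.parentScore}, ${chunkScore})`;
}
```

One property worth noting: in PostgreSQL, `GREATEST` and `LEAST` ignore NULL arguments, so a parent with no chunk rows (where the subquery yields NULL) keeps its own score.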

Each adapter checks searchIndexes from @hasChunks to verify the chunks table has the relevant index type before enabling chunk-aware querying.
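The gating described above could be sketched roughly as follows. The index names `fulltext`, `bm25`, and `trigram` are taken from the #856 commit message in this PR; the function name `shouldQueryChunks`, the mapping table, and the option shape are assumptions for illustration only.

```typescript
type TextSearchKind = 'tsvector' | 'bm25' | 'trgm';

// Assumed mapping from adapter to the searchIndexes entry it needs.
const REQUIRED_INDEX: Record<TextSearchKind, string> = {
  tsvector: 'fulltext',
  bm25: 'bm25',
  trgm: 'trigram',
};

interface ChunkIndexInfo {
  searchIndexes: string[];
}

interface ChunkSearchOptions {
  includeChunks?: boolean; // treated as true when omitted
}

function shouldQueryChunks(
  adapter: TextSearchKind,
  info: ChunkIndexInfo | null,
  options: ChunkSearchOptions = {},
): boolean {
  if (info === null) return false;                   // no @hasChunks tag
  if (options.includeChunks === false) return false; // caller opted out
  return info.searchIndexes.includes(REQUIRED_INDEX[adapter]);
}
```

With this shape, a chunks table provisioned with only `['fulltext']` would enable chunk-aware querying for tsvector while BM25 and trgm stay parent-only.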

Integration tests:

finds parent via chunk tsvector match (term only in chunks) — verifies parent is found when search term only exists in chunk content
finds parent via chunk trgm similarity (term only in chunks) — verifies fuzzy matching through chunks
Setup SQL updated with content, search (tsvector), BM25, and trgm indexes on posts_chunks table

Also includes the prerequisite #856 work (search_indexes parameter on ProcessChunks/ProcessFileEmbedding).

Review & Testing Checklist for Human

Medium-high risk — core search infrastructure change affecting all 4 adapters.

Verify the lateral subquery pattern generates correct SQL for each adapter by inspecting the generated queries (enable PostGraphile query logging)
Test with a real provisioned database: create a table with ProcessChunks + search_indexes, insert parent + chunk rows, and query via each adapter
Verify includeChunks: false correctly disables chunk querying for all text adapters (tsvector, BM25, trgm)
Check performance: lateral subqueries on large chunks tables should use the appropriate indexes (GIN for tsvector/trgm, BM25 for pg_textsearch, HNSW for pgvector)
Verify the BM25 chunks index name convention ({chunks_table}_{content_field}_bm25_idx) matches what the DB generator creates
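For the last checklist item, a small helper like the one below could reproduce the naming convention and look the index up in the database. The function and constant names are hypothetical; the convention itself is the one quoted above, `pg_indexes` is the standard PostgreSQL catalog view, and the 63-byte figure is PostgreSQL's default identifier limit, beyond which names are truncated.

```typescript
// Default PostgreSQL identifier limit (NAMEDATALEN - 1).
const MAX_IDENTIFIER_BYTES = 63;

// Builds the expected name: {chunks_table}_{content_field}_bm25_idx
function bm25ChunksIndexName(chunksTable: string, contentField: string): string {
  return `${chunksTable}_${contentField}_bm25_idx`;
}

// Assumes ASCII identifiers, so string length equals byte length.
function exceedsIdentifierLimit(name: string): boolean {
  return name.length > MAX_IDENTIFIER_BYTES;
}

// Parameterized lookup: $1 = chunks table name, $2 = expected index name.
const FIND_INDEX_SQL =
  'SELECT indexname FROM pg_indexes WHERE tablename = $1 AND indexname = $2';
```

If `exceedsIdentifierLimit` returns true for a long table name, the name stored by the database may differ from the one the adapter computes, which is worth checking alongside the convention itself.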

Notes

This PR includes the (WIP) Feat/v5 relations cont #856 commit (search_indexes parameter) since that work hasn't been merged to main yet
The companion constructive-db PR (#1165) expands the @hasChunks smart tag with the search metadata these adapters consume
The actual search orchestration (dual queries, RRF fusion) lives in application code — this PR enables that capability by making both text and vector search paths available on chunks
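Since the fusion step is left to application code, the following is a minimal sketch of reciprocal rank fusion (RRF) over ranked ID lists, for example one list from the pgvector adapter and one from a text adapter. The function name, the input shape, and the constant `k = 60` (a commonly used default) are assumptions; nothing here is part of this PR.

```typescript
interface FusedResult {
  id: string;
  score: number;
}

// Each ranking is a list of parent IDs ordered best-first.
function reciprocalRankFusion(rankings: string[][], k = 60): FusedResult[] {
  const scores = new Map<string, number>();

  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      // Rank is 1-based, so the top hit contributes 1 / (k + 1).
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }

  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    // Highest fused score first; ties broken by ID for stable output.
    (a, b) => b.score - a.score || a.id.localeCompare(b.id),
  );
}

// Example: fuse a vector ranking with a text ranking.
const vectorHits = ['a', 'b', 'c'];
const textHits = ['b', 'd'];
const fused = reciprocalRankFusion([vectorHits, textHits]);
```

Because RRF works on ranks rather than raw scores, it sidesteps the fact that the adapters use different scales and directions (higher-is-better rank versus lower-is-better distance).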

Link to Devin session: https://app.devin.ai/sessions/2b5a29d83d3f478e8d3d972653b4879c
Requested by: @pyramation

…bedding (#856)

Adds search_indexes to ProcessChunks parameter_schema and ProcessFileEmbedding's chunks sub-config with default ['fulltext']. Enables hybrid RAG by opting into fulltext (tsvector), bm25, or trigram search on the chunks content column.

Companion to constructive-io/constructive-db#1164

)

- Extract shared getChunksInfo/ChunksInfo into adapters/chunks.ts
- tsvector adapter: lateral subquery for MAX(ts_rank) across chunks
- BM25 adapter: lateral subquery for MIN(bm25_score) across chunks
- trgm adapter: lateral subquery for MAX(similarity) across chunks
- All adapters respect includeChunks option (default: true when @hasChunks present)
- pgvector adapter refactored to use shared ChunksInfo
- Integration tests: chunk-aware tsvector and trgm queries
- Setup SQL: add content, search (tsvector), BM25, trgm indexes on posts_chunks

devin-ai-integration · 2026-05-15T05:52:59Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

pyramation added 3 commits May 15, 2026 03:35

fix: update unit tests for expanded ChunksInfo type

a38b3ed

devin-ai-integration Bot assigned pyramation May 15, 2026

pyramation wants to merge 3 commits into main from feat/hybrid-search-chunks
