feat(inference): allow local embeddings route by shiju-nv · Pull Request #1774 · NVIDIA/OpenShell

shiju-nv · 2026-06-05T11:37:41Z

Summary

Adds openai_embeddings as a first-class OpenAI-compatible protocol so sandboxed workloads can reach POST /v1/embeddings through the local inference proxy, the same way chat, completion, responses, and model discovery already route. The response is served buffered with an accurate Content-Length rather than through the SSE streaming path, which would corrupt a single-JSON-object body if the stream were truncated.

Related Issue

Closes #1771.

Changes

Add openai_embeddings to the OpenAI-compatible protocol set so providers (openai, nvidia) advertise embeddings routing.
Classify POST /v1/embeddings as the openai_embeddings protocol in the sandbox L7 inference patterns.
Serve embeddings through a buffered response path with an accurate Content-Length. An embeddings response is one JSON object; the streaming path appends an SSE error frame on a size-cap or idle-timeout truncation, which would corrupt a body the client parses whole.
Add a validation probe for an embeddings backend against /v1/embeddings, ordered after the chat and completion probes so a multi-protocol route still prefers those.
Extract shared helpers: http_status_text() (adding 401/422/429/503 for embeddings passthrough and router error mapping) and write_inference_router_error() across the streaming and buffered paths.
Return an OpenAI-shaped embeddings body from the mock route.
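
To make the classification step above concrete, here is a minimal sketch of mapping a method and path to a protocol. The `Protocol` enum and `classify` function are hypothetical names for illustration, not the sandbox's actual types; only the three routes named in this PR are covered.

```rust
/// Hypothetical stand-in for the sandbox's protocol identifiers.
/// The variants mirror protocol names mentioned in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Protocol {
    OpenAiChatCompletions,
    OpenAiEmbeddings,
    ModelDiscovery,
}

/// Classify a request line into a protocol, or `None` when the request
/// is not a recognized inference call. Illustrative only.
pub fn classify(method: &str, path: &str) -> Option<Protocol> {
    // Drop any query string before matching on the path.
    let path = path.split('?').next().unwrap_or(path);
    match (method, path) {
        ("POST", "/v1/chat/completions") => Some(Protocol::OpenAiChatCompletions),
        ("POST", "/v1/embeddings") => Some(Protocol::OpenAiEmbeddings),
        ("GET", "/v1/models") => Some(Protocol::ModelDiscovery),
        _ => None,
    }
}
```

Matching on the method as well as the path means a `GET /v1/embeddings` is not treated as an embeddings call in this sketch.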

Testing

mise run pre-commit passes
Unit tests added/updated
E2E tests added/updated (if applicable)

Checklist

Follows Conventional Commits
Commits are signed off (DCO)
Architecture docs updated (if applicable)

--

Follow-up PRs (stacked on this branch)

This is the first commit in a stacked series; each later PR builds on the one before it, so they are easiest to review and merge in order:

Cap the buffered response body. proxy_to_backend reads the upstream body with no size limit, unlike the streaming path; a size cap closes a DoS/OOM path.
Validate embeddings-only models by trying every advertised protocol probe, so a managed route whose profile also lists chat does not fail verification.
Serve model discovery (GET /v1/models) buffered as well, since it is also a single JSON object exposed to the same truncation corruption (Model discovery JSON can be corrupted by the inference proxy's SSE streaming path #1772).

Route OpenAI-compatible embeddings through the local inference proxy so sandboxed vector workloads reach a configured provider via the same route classification and auth path that chat, completion, and model discovery already use.

- Add openai_embeddings to the OpenAI-compatible protocol set so providers (openai, nvidia) advertise embeddings routing.
- Classify POST /v1/embeddings as the openai_embeddings protocol in the sandbox L7 patterns.
- Serve embeddings buffered with an accurate Content-Length, since the response is a single JSON object rather than an SSE token stream. The streaming path appends an SSE error frame on a size-cap or idle-timeout truncation, which would corrupt a one-object body the client parses whole. protocol_returns_buffered_body() selects the path.
- Probe an embeddings-only backend against /v1/embeddings during validation, after the chat and completion protocols so a multi-protocol route still prefers those.
- Extract two shared helpers. http_status_text() backs both response formatters and adds 401/422/429/503 for embeddings passthrough and router error mapping; write_inference_router_error() backs the streaming and buffered routing paths.
- Return an OpenAI-shaped embeddings body from the mock route.

Tests cover profile lookup, L7 pattern detection, the mock body, and buffered Content-Length framing with no chunked transfer-encoding and no SSE error frame.

Signed-off-by: Shiju <shiju@nvidia.com>
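
The buffered framing described in this commit can be sketched roughly as follows. The name `http_status_text` comes from the commit message, but its body here, the `format_buffered_response` helper, and the reason phrases (taken from RFC 9110) are assumptions rather than the PR's code.

```rust
/// Map a status code to a reason phrase. The name matches the helper in
/// this PR; the table is an assumption, using RFC 9110 reason phrases.
pub fn http_status_text(status: u16) -> &'static str {
    match status {
        200 => "OK",
        400 => "Bad Request",
        401 => "Unauthorized",
        422 => "Unprocessable Content",
        429 => "Too Many Requests",
        502 => "Bad Gateway",
        503 => "Service Unavailable",
        _ => "Unknown",
    }
}

/// Hypothetical formatter: frame a fully buffered body with an exact
/// Content-Length, so the client reads one complete JSON object and
/// nothing (such as an SSE error frame) can be appended afterwards.
pub fn format_buffered_response(status: u16, content_type: &str, body: &[u8]) -> Vec<u8> {
    let head = format!(
        "HTTP/1.1 {} {}\r\nContent-Type: {}\r\nContent-Length: {}\r\n\r\n",
        status,
        http_status_text(status),
        content_type,
        body.len()
    );
    let mut out = head.into_bytes();
    out.extend_from_slice(body);
    out
}
```

Because the length is known before the first byte is written, there is no chunked transfer-encoding and no point at which a truncation marker could be spliced into the body.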

The buffered proxy path read the whole upstream response into memory with no size bound. The route timeout bounds elapsed time but not memory, so a misbehaving or oversized upstream could force unbounded allocation in the sandbox proxy. The streaming path already caps each response at 32 MiB; the buffered path did not.

Cap the buffered read at the same 32 MiB. An advertised over-cap body is rejected from its Content-Length before any bytes are read, and chunks accumulate under the same bound so a chunked or mislabeled body cannot slip past. An over-cap response fails as an upstream protocol error, surfaced as HTTP 502 at the proxy boundary, and is never partially returned.

Tests
- cargo test -p openshell-router proxy_to_backend_rejects_over_cap_response_body

Signed-off-by: Shiju <shiju@nvidia.com>
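
A minimal sketch of the two-stage cap described above, written against the synchronous `std::io::Read` trait for brevity; the real proxy path is presumably async. `read_capped`, `BodyError`, and `MAX_BUFFERED_BODY_BYTES` are hypothetical names, and only the 32 MiB figure comes from the commit message.

```rust
use std::io::{self, Read};

/// 32 MiB, the cap stated in the commit message. The constant name is assumed.
pub const MAX_BUFFERED_BODY_BYTES: usize = 32 * 1024 * 1024;

#[derive(Debug)]
pub enum BodyError {
    /// The body was advertised as, or grew to, more than `limit` bytes.
    TooLarge { limit: usize },
    Io(io::Error),
}

/// Read a whole body into memory, never holding more than `limit` bytes.
pub fn read_capped<R: Read>(
    mut body: R,
    content_length: Option<u64>,
    limit: usize,
) -> Result<Vec<u8>, BodyError> {
    // Stage 1: reject an advertised over-cap body before reading any bytes.
    if let Some(len) = content_length {
        if len > limit as u64 {
            return Err(BodyError::TooLarge { limit });
        }
    }
    // Stage 2: enforce the same bound while accumulating, so a chunked or
    // mislabeled body cannot slip past the header check.
    let mut out = Vec::new();
    let mut chunk = [0u8; 8192];
    loop {
        let n = body.read(&mut chunk).map_err(BodyError::Io)?;
        if n == 0 {
            return Ok(out);
        }
        if out.len() + n > limit {
            // Fail the whole response; nothing is returned partially.
            return Err(BodyError::TooLarge { limit });
        }
        out.extend_from_slice(&chunk[..n]);
    }
}
```

A caller would map `BodyError::TooLarge` to the upstream protocol error that the proxy surfaces as HTTP 502.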

…tocols

A managed route resolves to its provider profile's full protocol set, so an embeddings model such as text-embedding-3-small lists openai_chat_completions alongside openai_embeddings. Route verification probed only the first writable protocol and stopped on its failure. It sent a chat probe with the embedding model, the provider rejected it as wrong-shape, and the route failed validation before the embeddings probe ran. Embeddings-only configs could not be verified.

Try the advertised protocols in preference order. A request-shape rejection (HTTP 400, 404, 405, 422) falls through to the next protocol, so an embeddings model validates against /v1/embeddings even when the chat probe rejects it. Credential, rate-limit, connectivity, and upstream-health failures stay terminal and stop validation at the first probe, so a bad key or a down backend is reported as itself rather than masked by a later probe.

validation_probe becomes validation_probes, which returns the ordered list, and the per-probe fallback retry (max_completion_tokens versus max_tokens) moves into a shared helper.

Tests
- cargo test -p openshell-router verify_embeddings_model_falls_through_chat_probe
- cargo test -p openshell-router verify_stops_on_credentials_failure

Signed-off-by: Shiju <shiju@nvidia.com>
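
The fall-through rule can be sketched as a loop over the ordered probes. All type and function names below are hypothetical; the set of request-shape statuses (400, 404, 405, 422) is taken from the commit message. What happens when every probe is rejected for shape is not stated in the PR, so reporting the last rejection is an assumption of this sketch.

```rust
/// Result of one probe. Hypothetical type for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProbeOutcome {
    Ok,
    Status(u16),
    ConnectFailed,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Verified { protocol: &'static str },
    Failed { protocol: &'static str, outcome: ProbeOutcome },
    NoProbes,
}

/// Statuses the commit message treats as "wrong request shape".
fn is_request_shape_rejection(status: u16) -> bool {
    matches!(status, 400 | 404 | 405 | 422)
}

/// Try each advertised protocol in preference order.
pub fn verify_route<F>(protocols: &[&'static str], mut probe: F) -> Verdict
where
    F: FnMut(&'static str) -> ProbeOutcome,
{
    let mut last_shape_failure = None;
    for &protocol in protocols {
        match probe(protocol) {
            ProbeOutcome::Ok => return Verdict::Verified { protocol },
            // Wrong shape for this protocol: remember it and try the next one.
            ProbeOutcome::Status(s) if is_request_shape_rejection(s) => {
                last_shape_failure = Some((protocol, ProbeOutcome::Status(s)));
            }
            // Credential, rate-limit, connectivity, or health failure:
            // terminal, so a later probe cannot mask it.
            outcome => return Verdict::Failed { protocol, outcome },
        }
    }
    match last_shape_failure {
        // Assumption: when every probe is shape-rejected, report the last one.
        Some((protocol, outcome)) => Verdict::Failed { protocol, outcome },
        None => Verdict::NoProbes,
    }
}
```

With the order `["openai_chat_completions", "openai_embeddings"]`, a 400 from the chat probe followed by a success from the embeddings probe verifies the route, while a 401 from the chat probe stops immediately.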

GET /v1/models returns a single JSON model list, the same response shape as embeddings. The sandbox inference proxy was routing it through the SSE streaming path. A streaming size-cap or idle-timeout truncation appends an SSE error frame to the body, which corrupts a payload the client parses as one JSON object.

Make response framing a property of the protocol. A new ResponseFraming field on InferenceApiPattern is set once per pattern in default_patterns. model_discovery and openai_embeddings are now Buffered, while chat completions, completions, responses, and Anthropic messages stay Streaming. The proxy dispatch gates on pattern.is_buffered(), which replaces the stringly-typed protocol_returns_buffered_body predicate so the streaming-versus-buffered decision lives in one place and cannot drift across the sites that read it.

Model discovery now flows through the same buffered path as embeddings, framed with an accurate Content-Length and bounded by the buffered-read size cap that path already enforces.

Tests
- cargo test -p openshell-sandbox protocol_framing_classification
- cargo test -p openshell-sandbox inference_model_discovery_served_buffered_with_content_length

Signed-off-by: Shiju <shiju@nvidia.com>
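
As a rough illustration of framing as a per-pattern property: the names `ResponseFraming`, `InferenceApiPattern`, `default_patterns`, and `is_buffered` appear in the commit message, but the struct fields and the abbreviated pattern table below are assumptions, not the crate's actual definitions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResponseFraming {
    /// SSE-style token stream, forwarded incrementally.
    Streaming,
    /// Single JSON object, read whole and sent with Content-Length.
    Buffered,
}

/// Field names here are hypothetical; only the type name is from the PR.
#[derive(Debug, Clone)]
pub struct InferenceApiPattern {
    pub method: &'static str,
    pub path: &'static str,
    pub protocol: &'static str,
    pub framing: ResponseFraming,
}

impl InferenceApiPattern {
    /// The single predicate the proxy dispatch consults.
    pub fn is_buffered(&self) -> bool {
        self.framing == ResponseFraming::Buffered
    }
}

/// Abbreviated table: framing is declared once, next to the route itself.
pub fn default_patterns() -> Vec<InferenceApiPattern> {
    use ResponseFraming::{Buffered, Streaming};
    vec![
        InferenceApiPattern {
            method: "POST",
            path: "/v1/chat/completions",
            protocol: "openai_chat_completions",
            framing: Streaming,
        },
        InferenceApiPattern {
            method: "POST",
            path: "/v1/embeddings",
            protocol: "openai_embeddings",
            framing: Buffered,
        },
        InferenceApiPattern {
            method: "GET",
            path: "/v1/models",
            protocol: "model_discovery",
            framing: Buffered,
        },
    ]
}
```

Keeping the decision in the pattern table means adding a new single-object endpoint only requires choosing its framing at the point of declaration, rather than updating a separate string comparison.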

johntmyers · 2026-06-06T00:15:25Z

gator-agent

Gator validation: this PR is project-valid and linked issue #1771 is valid for the same work. The change is concentrated in the inference routing/proxy path and addresses a clear sandbox UX gap for /v1/embeddings.

Independent code review found blocking feedback, so this remains in gator:in-review:

Critical: crates/openshell-router/src/backend.rs buffers the full upstream response in proxy_to_backend() without a size cap, and the new embeddings path calls that buffered route from crates/openshell-sandbox/src/proxy.rs. This is a CWE-400 resource-consumption risk and regresses from the streaming path's capped behavior. Please cap buffered responses before returning them, including checking Content-Length when present and enforcing the same limit while reading chunks.
Warning: validation still picks only one probe. Managed OpenAI/NVIDIA routes now advertise chat before embeddings, so an embeddings-only model such as text-embedding-3-small can fail default verification on /v1/chat/completions even though /v1/embeddings would work. Please try compatible probes in order and continue after request-shape failures, while stopping on credential/connectivity/rate-limit failures. Add coverage for a multi-protocol route where chat fails with a request-shape status and embeddings succeeds.
Warning: this is a direct user-facing API pattern change. Please update docs/sandboxes/inference-routing.mdx to list POST /v1/embeddings in the OpenAI-compatible supported patterns and adjust the nearby “generation” wording where it now also covers embeddings. docs/index.yml does not need navigation changes because the existing page is already linked.

Current CI is green for branch checks, DCO, Markdown, Rust, Python, Helm Lint, and published gate statuses. E2E is currently skipped because no test:e2e label is applied; because this touches inference proxy/network routing behavior, gator will require test:e2e after the review feedback is addressed.

No local tests were run by gator.

johntmyers · 2026-06-06T01:08:07Z

gator-agent

PR Review Status

Validation: this PR is project-valid and linked issue #1771 is valid for the same work. The change is concentrated in the inference routing/proxy path and addresses a clear sandbox UX gap for POST /v1/embeddings.
Head SHA: 5f66c3cf83316e107c3f9b9c066d1d739747eeb1

Review findings:

crates/openshell-router/src/backend.rs: embedding-only OpenAI/NVIDIA models can still fail verified setup because the validation probe chooses chat/completions/responses before embeddings and does not try the embeddings probe after request-shape failures. Please make validation try all advertised writable probes in preference order, continuing on ValidationFailureKind::RequestShape, and add coverage where chat fails with a request-shape status while embeddings succeeds.
crates/openshell-router/src/backend.rs / crates/openshell-sandbox/src/proxy.rs: the new buffered embeddings path exposes proxy_to_backend()'s unbounded response.bytes().await read to sandbox workloads. This is a CWE-400 uncontrolled resource consumption risk. Please cap buffered inference response bodies before or with this PR, rejecting responses whose Content-Length or accumulated body exceeds the limit.

Docs: missing for a direct user-facing capability change. Please update the existing docs/sandboxes/inference-routing.mdx supported-patterns table to include POST /v1/embeddings; no navigation change appears necessary.

Next state: gator:in-review

Signed-off-by: Shiju <shiju@nvidia.com>

johntmyers · 2026-06-07T16:01:48Z

gator-agent

Re-check After Author Update

I re-evaluated latest head 33d0924950e144be1008edd7e7b41ab72d79d43c after the author pushed commits addressing the prior gator feedback.

Disposition: resolved for gator review.

Remaining items:

No blocking review findings remain. The buffered response path now enforces a response-size cap, validation now falls through request-shape probe failures to embeddings-compatible probes, and docs/sandboxes/inference-routing.mdx now lists POST /v1/embeddings in the supported OpenAI-compatible patterns.
Independent review noted follow-up polish for a dedicated embeddings E2E scenario and two remaining docs references to “generation requests”; these are not blocking this gator transition. Because this PR touches inference proxy/network routing behavior, I applied test:e2e and will watch the E2E gate.

Next state: gator:watch-pipeline

github-actions · 2026-06-07T16:01:54Z

Label test:e2e applied for 33d0924. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

johntmyers · 2026-06-07T16:38:23Z

gator-agent

Maintainer Approval Needed

Gator validation and PR monitoring are complete.

Validation: PR #1774 is project-valid and linked issue #1771 is already gator:validated; the change is concentrated in inference routing/proxy support for POST /v1/embeddings.
Review: No blocking gator review findings remain at head 33d0924950e144be1008edd7e7b41ab72d79d43c. Prior findings on bounded buffered responses, embeddings validation fallback, and docs coverage were resolved.
Docs: docs/sandboxes/inference-routing.mdx was updated for the new supported embeddings pattern; no navigation update was needed.
Checks: Required gates are green for Branch Checks, Helm Lint, and E2E.
E2E: test:e2e is applied and the Core E2E result passed. GPU E2E is not required for this PR.

Human maintainer approval or merge decision is now required.

shiju-nv requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners June 5, 2026 11:37

shiju-nv added 4 commits June 5, 2026 18:16

shiju-nv force-pushed the fix/inference-embeddings-route branch from 6f2d8ab to 5f66c3c Compare June 5, 2026 14:34

johntmyers added the gator:in-review label (Gator is reviewing or awaiting PR review feedback) Jun 6, 2026

docs(inference): document embeddings route in supported patterns

33d0924

Signed-off-by: Shiju <shiju@nvidia.com>

johntmyers added the gator:watch-pipeline (Gator is monitoring PR CI/CD status) and test:e2e (Requires end-to-end coverage) labels and removed the gator:in-review label Jun 7, 2026

johntmyers added the gator:approval-needed label (Gator completed review; maintainer approval needed) and removed the gator:watch-pipeline label Jun 7, 2026

johntmyers approved these changes Jun 7, 2026

johntmyers merged commit 25abc9e into NVIDIA:main Jun 7, 2026
56 of 58 checks passed

johntmyers merged 5 commits into NVIDIA:main from shiju-nv:fix/inference-embeddings-route
