feat(inference): allow local embeddings route #1774

Merged
johntmyers merged 5 commits into NVIDIA:main from shiju-nv:fix/inference-embeddings-route
Jun 7, 2026

@shiju-nv
Contributor

@shiju-nv shiju-nv commented Jun 5, 2026

Summary

Adds openai_embeddings as a first-class OpenAI-compatible protocol so sandboxed workloads can reach POST /v1/embeddings through the local inference proxy, the same way chat, completion, responses, and model discovery already route. The response is served buffered with an accurate Content-Length rather than through the SSE streaming path, which would corrupt a single-JSON-object body if the stream were truncated.

Related Issue

Closes #1771.

Changes

  • Add openai_embeddings to the OpenAI-compatible protocol set so providers (openai, nvidia) advertise embeddings routing.
  • Classify POST /v1/embeddings as the openai_embeddings protocol in the sandbox L7 inference patterns.
  • Serve embeddings through a buffered response path with an accurate Content-Length. An embeddings response is one JSON object; the streaming path appends an SSE error frame on a size-cap or idle-timeout truncation, which would corrupt a body the client parses whole.
  • Add a validation probe for an embeddings backend against /v1/embeddings, ordered after the chat and completion probes so a multi-protocol route still prefers those.
  • Extract shared helpers: http_status_text() (adding 401/422/429/503 for embeddings passthrough and router error mapping) and write_inference_router_error() across the streaming and buffered paths.
  • Return an OpenAI-shaped embeddings body from the mock route.
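
To illustrate the buffered framing described in the list above, here is a minimal sketch, not the PR's actual code. `http_status_text` is named in this PR, but its signature and the exact set of mapped codes are assumptions here, and `format_buffered_response` is a hypothetical helper introduced only for illustration. The point is that a fully buffered body can be framed with an exact `Content-Length`, so the client either receives the whole JSON object or an error, never a truncated body with a trailing SSE frame.

```rust
/// Map a status code to a reason phrase.
/// Assumption: the real helper's signature and code coverage may differ.
fn http_status_text(status: u16) -> &'static str {
    match status {
        200 => "OK",
        400 => "Bad Request",
        401 => "Unauthorized",
        404 => "Not Found",
        422 => "Unprocessable Entity",
        429 => "Too Many Requests",
        502 => "Bad Gateway",
        503 => "Service Unavailable",
        _ => "Unknown",
    }
}

/// Hypothetical helper: frame an already-buffered body as an HTTP/1.1
/// response whose Content-Length matches the body exactly.
fn format_buffered_response(status: u16, content_type: &str, body: &[u8]) -> Vec<u8> {
    let head = format!(
        "HTTP/1.1 {} {}\r\nContent-Type: {}\r\nContent-Length: {}\r\n\r\n",
        status,
        http_status_text(status),
        content_type,
        body.len()
    );
    let mut out = head.into_bytes();
    // The body is appended whole; no chunked encoding and no trailing frames.
    out.extend_from_slice(body);
    out
}
```

Because the length is computed from the complete body, this path cannot emit a partial payload, which is the property the streaming path lacks for single-object responses.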

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

--

Follow-up PRs (stacked on this branch)

This is the first commit in a stacked series; each later PR builds on the one before it, so they are easiest to review and merge in order:

  1. Cap the buffered response body. proxy_to_backend reads the upstream body with no size limit, unlike the streaming path; a size cap closes a DoS/OOM path.
  2. Validate embeddings-only models by trying every advertised protocol probe, so a managed route whose profile also lists chat does not fail verification.
  3. Serve model discovery (GET /v1/models) buffered as well, since it is also a single JSON object exposed to the same truncation corruption (Model discovery JSON can be corrupted by the inference proxy's SSE streaming path #1772).

shiju-nv added 4 commits June 5, 2026 18:16
Route OpenAI-compatible embeddings through the local inference proxy so
sandboxed vector workloads reach a configured provider via the same
route classification and auth path that chat, completion, and model
discovery already use.

- Add openai_embeddings to the OpenAI-compatible protocol set so
  providers (openai, nvidia) advertise embeddings routing.
- Classify POST /v1/embeddings as the openai_embeddings protocol in the
  sandbox L7 patterns.
- Serve embeddings buffered with an accurate Content-Length, since the
  response is a single JSON object rather than an SSE token stream. The
  streaming path appends an SSE error frame on a size-cap or idle-timeout
  truncation, which would corrupt a one-object body the client parses
  whole. protocol_returns_buffered_body() selects the path.
- Probe an embeddings-only backend against /v1/embeddings during
  validation, after the chat and completion protocols so a multi-protocol
  route still prefers those.
- Extract two shared helpers. http_status_text() backs both response
  formatters and adds 401/422/429/503 for embeddings passthrough and
  router error mapping; write_inference_router_error() backs the streaming
  and buffered routing paths.
- Return an OpenAI-shaped embeddings body from the mock route.

Tests cover profile lookup, L7 pattern detection, the mock body, and
buffered Content-Length framing with no chunked transfer-encoding and no
SSE error frame.

Signed-off-by: Shiju <shiju@nvidia.com>

The buffered proxy path read the whole upstream response into memory with
no size bound. The route timeout bounds elapsed time but not memory, so a
misbehaving or oversized upstream could force unbounded allocation in the
sandbox proxy. The streaming path already caps each response at 32 MiB;
the buffered path did not.

Cap the buffered read at the same 32 MiB. An advertised over-cap body is
rejected from its Content-Length before any bytes are read, and chunks
accumulate under the same bound so a chunked or mislabeled body cannot
slip past. An over-cap response fails as an upstream protocol error,
surfaced as HTTP 502 at the proxy boundary, and is never partially
returned.
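
The two-stage check described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `proxy_to_backend`: `read_capped`, `BodyError`, and `MAX_BUFFERED_BODY_BYTES` are hypothetical names, and the chunk source is modelled as a plain iterator instead of an async HTTP body.

```rust
/// Assumed to mirror the 32 MiB cap described above (hypothetical name).
const MAX_BUFFERED_BODY_BYTES: usize = 32 * 1024 * 1024;

/// Hypothetical error type for an over-cap upstream body.
#[derive(Debug, PartialEq)]
enum BodyError {
    TooLarge { limit: usize },
}

/// Read a body into memory, refusing anything larger than `limit`.
fn read_capped<I>(
    content_length: Option<u64>,
    chunks: I,
    limit: usize,
) -> Result<Vec<u8>, BodyError>
where
    I: IntoIterator<Item = Vec<u8>>,
{
    // Stage 1: reject an advertised over-cap body before reading any bytes.
    if let Some(len) = content_length {
        if len > limit as u64 {
            return Err(BodyError::TooLarge { limit });
        }
    }
    // Stage 2: enforce the same bound while accumulating, so a chunked or
    // mislabeled body cannot slip past the header check.
    let mut body = Vec::new();
    for chunk in chunks {
        // body.len() <= limit always holds here, so the subtraction is safe.
        if chunk.len() > limit - body.len() {
            return Err(BodyError::TooLarge { limit });
        }
        body.extend_from_slice(&chunk);
    }
    Ok(body)
}
```

On either failure the partial buffer is dropped and nothing is returned to the caller, which matches the "never partially returned" behaviour in the commit message.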

Tests

- cargo test -p openshell-router \
    proxy_to_backend_rejects_over_cap_response_body

Signed-off-by: Shiju <shiju@nvidia.com>

…tocols

A managed route resolves to its provider profile's full protocol set, so
an embeddings model such as text-embedding-3-small lists
openai_chat_completions alongside openai_embeddings. Route verification
probed only the first writable protocol and stopped on its failure. It
sent a chat probe with the embedding model, the provider rejected it as
wrong-shape, and the route failed validation before the embeddings probe
ran. Embeddings-only configs could not be verified.

Try the advertised protocols in preference order. A request-shape
rejection (HTTP 400, 404, 405, 422) falls through to the next protocol,
so an embeddings model validates against /v1/embeddings even when the
chat probe rejects it. Credential, rate-limit, connectivity, and
upstream-health failures stay terminal and stop validation at the first
probe, so a bad key or a down backend is reported as itself rather than
masked by a later probe.

validation_probe becomes validation_probes, which returns the ordered
list, and the per-probe fallback retry (max_completion_tokens versus
max_tokens) moves into a shared helper.
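
The fall-through rule above can be sketched as a small loop. This is a simplified illustration, not the router's implementation: `verify_route`, `ProbeOutcome`, and `VerifyError` are hypothetical names, and the probe is passed in as a closure so the example runs without a network. Only the status list (400, 404, 405, 422) and the protocol names come from the commit message.

```rust
/// Hypothetical result of sending one validation probe.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ProbeOutcome {
    Success,
    HttpError(u16),
    ConnectFailed,
}

/// Hypothetical verification failure.
#[derive(Debug, PartialEq)]
enum VerifyError {
    NoProbes,
    /// Credential, rate-limit, connectivity, or upstream-health failure.
    Terminal { protocol: &'static str, outcome: ProbeOutcome },
    /// Every probe was rejected as the wrong request shape.
    AllRejected { last_protocol: &'static str, status: u16 },
}

/// Statuses treated as "wrong request shape for this protocol".
fn is_request_shape(status: u16) -> bool {
    matches!(status, 400 | 404 | 405 | 422)
}

/// Try each advertised protocol in preference order.
fn verify_route<F>(protocols: &[&'static str], mut probe: F) -> Result<&'static str, VerifyError>
where
    F: FnMut(&str) -> ProbeOutcome,
{
    let mut last_rejection = None;
    for &protocol in protocols {
        match probe(protocol) {
            ProbeOutcome::Success => return Ok(protocol),
            // Request-shape rejection: remember it and try the next protocol.
            ProbeOutcome::HttpError(status) if is_request_shape(status) => {
                last_rejection = Some((protocol, status));
            }
            // Anything else is reported as itself and stops validation.
            outcome => return Err(VerifyError::Terminal { protocol, outcome }),
        }
    }
    match last_rejection {
        Some((last_protocol, status)) => Err(VerifyError::AllRejected { last_protocol, status }),
        None => Err(VerifyError::NoProbes),
    }
}
```

With this shape, a chat probe that returns 400 for an embeddings model does not end verification, while a 401 on the first probe does.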

Tests

- cargo test -p openshell-router \
    verify_embeddings_model_falls_through_chat_probe
- cargo test -p openshell-router verify_stops_on_credentials_failure

Signed-off-by: Shiju <shiju@nvidia.com>

GET /v1/models returns a single JSON model list, the same response shape
as embeddings. The sandbox inference proxy was routing it through the SSE
streaming path. A streaming size-cap or idle-timeout truncation appends
an SSE error frame to the body, which corrupts a payload the client
parses as one JSON object.

Make response framing a property of the protocol. A new ResponseFraming
field on InferenceApiPattern is set once per pattern in default_patterns.
model_discovery and openai_embeddings are now Buffered, while chat
completions, completions, responses, and Anthropic messages stay
Streaming. The proxy dispatch gates on pattern.is_buffered(), which
replaces the stringly-typed protocol_returns_buffered_body predicate so
the streaming-versus-buffered decision lives in one place and cannot
drift across the sites that read it.

Model discovery now flows through the same buffered path as embeddings,
framed with an accurate Content-Length and bounded by the buffered-read
size cap that path already enforces.
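
A minimal sketch of framing as a property of the pattern follows. The names `ResponseFraming`, `InferenceApiPattern`, `default_patterns`, and `is_buffered` come from this commit message; the struct fields, the `classify` function, and the reduced three-entry pattern table are assumptions made for illustration only.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ResponseFraming {
    /// Token stream forwarded as it arrives.
    Streaming,
    /// Single JSON object, read whole and sent with Content-Length.
    Buffered,
}

/// Field layout is assumed; the real struct may carry more data.
struct InferenceApiPattern {
    method: &'static str,
    path: &'static str,
    protocol: &'static str,
    framing: ResponseFraming,
}

impl InferenceApiPattern {
    fn is_buffered(&self) -> bool {
        self.framing == ResponseFraming::Buffered
    }
}

/// Framing is set once per pattern, so dispatch never inspects protocol strings.
/// Only a subset of the real patterns is shown here.
fn default_patterns() -> Vec<InferenceApiPattern> {
    vec![
        InferenceApiPattern {
            method: "POST",
            path: "/v1/chat/completions",
            protocol: "openai_chat_completions",
            framing: ResponseFraming::Streaming,
        },
        InferenceApiPattern {
            method: "POST",
            path: "/v1/embeddings",
            protocol: "openai_embeddings",
            framing: ResponseFraming::Buffered,
        },
        InferenceApiPattern {
            method: "GET",
            path: "/v1/models",
            protocol: "model_discovery",
            framing: ResponseFraming::Buffered,
        },
    ]
}

/// Hypothetical lookup: find the pattern matching a request line.
fn classify<'a>(
    patterns: &'a [InferenceApiPattern],
    method: &str,
    path: &str,
) -> Option<&'a InferenceApiPattern> {
    patterns.iter().find(|p| p.method == method && p.path == path)
}
```

Keeping the decision on the pattern means a new single-object protocol only needs one table entry to get buffered framing.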

Tests

- cargo test -p openshell-sandbox protocol_framing_classification
- cargo test -p openshell-sandbox \
    inference_model_discovery_served_buffered_with_content_length

Signed-off-by: Shiju <shiju@nvidia.com>
@shiju-nv shiju-nv force-pushed the fix/inference-embeddings-route branch from 6f2d8ab to 5f66c3c June 5, 2026 14:34
@johntmyers
Collaborator

gator-agent

Gator validation: this PR is project-valid and linked issue #1771 is valid for the same work. The change is concentrated in the inference routing/proxy path and addresses a clear sandbox UX gap for /v1/embeddings.

Independent code review found blocking feedback, so this remains in gator:in-review:

  1. Critical: crates/openshell-router/src/backend.rs buffers the full upstream response in proxy_to_backend() without a size cap, and the new embeddings path calls that buffered route from crates/openshell-sandbox/src/proxy.rs. This is a CWE-400 resource-consumption risk and regresses from the streaming path's capped behavior. Please cap buffered responses before returning them, including checking Content-Length when present and enforcing the same limit while reading chunks.
  2. Warning: validation still picks only one probe. Managed OpenAI/NVIDIA routes now advertise chat before embeddings, so an embeddings-only model such as text-embedding-3-small can fail default verification on /v1/chat/completions even though /v1/embeddings would work. Please try compatible probes in order and continue after request-shape failures, while stopping on credential/connectivity/rate-limit failures. Add coverage for a multi-protocol route where chat fails with a request-shape status and embeddings succeeds.
  3. Warning: this is a direct user-facing API pattern change. Please update docs/sandboxes/inference-routing.mdx to list POST /v1/embeddings in the OpenAI-compatible supported patterns and adjust the nearby “generation” wording where it now also covers embeddings. docs/index.yml does not need navigation changes because the existing page is already linked.

Current CI is green for branch checks, DCO, Markdown, Rust, Python, Helm Lint, and published gate statuses. E2E is currently skipped because no test:e2e label is applied; because this touches inference proxy/network routing behavior, gator will require test:e2e after the review feedback is addressed.

No local tests were run by gator.

@johntmyers johntmyers added the gator:in-review label (Gator is reviewing or awaiting PR review feedback) Jun 6, 2026
@johntmyers
Collaborator

gator-agent

PR Review Status

Validation: this PR is project-valid and linked issue #1771 is valid for the same work. The change is concentrated in the inference routing/proxy path and addresses a clear sandbox UX gap for POST /v1/embeddings.
Head SHA: 5f66c3cf83316e107c3f9b9c066d1d739747eeb1

Review findings:

  • crates/openshell-router/src/backend.rs: embedding-only OpenAI/NVIDIA models can still fail verified setup because the validation probe chooses chat/completions/responses before embeddings and does not try the embeddings probe after request-shape failures. Please make validation try all advertised writable probes in preference order, continuing on ValidationFailureKind::RequestShape, and add coverage where chat fails with a request-shape status while embeddings succeeds.
  • crates/openshell-router/src/backend.rs / crates/openshell-sandbox/src/proxy.rs: the new buffered embeddings path exposes proxy_to_backend()'s unbounded response.bytes().await read to sandbox workloads. This is a CWE-400 uncontrolled resource consumption risk. Please cap buffered inference response bodies before or with this PR, rejecting responses whose Content-Length or accumulated body exceeds the limit.

Docs: missing for a direct user-facing capability change. Please update the existing docs/sandboxes/inference-routing.mdx supported-patterns table to include POST /v1/embeddings; no navigation change appears necessary.

Next state: gator:in-review

@johntmyers johntmyers added the gator:watch-pipeline (Gator is monitoring PR CI/CD status) and test:e2e (Requires end-to-end coverage) labels and removed the gator:in-review (Gator is reviewing or awaiting PR review feedback) label Jun 7, 2026
@johntmyers
Collaborator

gator-agent

Re-check After Author Update

I re-evaluated latest head 33d0924950e144be1008edd7e7b41ab72d79d43c after the author pushed commits addressing the prior gator feedback.

Disposition: resolved for gator review.

Remaining items:

  • No blocking review findings remain. The buffered response path now enforces a response-size cap, validation now falls through request-shape probe failures to embeddings-compatible probes, and docs/sandboxes/inference-routing.mdx now lists POST /v1/embeddings in the supported OpenAI-compatible patterns.
  • Independent review noted follow-up polish for a dedicated embeddings E2E scenario and two remaining docs references to “generation requests”; these are not blocking this gator transition. Because this PR touches inference proxy/network routing behavior, I applied test:e2e and will watch the E2E gate.

Next state: gator:watch-pipeline

@github-actions

github-actions Bot commented Jun 7, 2026

Label test:e2e applied for 33d0924. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

@johntmyers johntmyers added the gator:approval-needed label (Gator completed review; maintainer approval needed) and removed the gator:watch-pipeline label (Gator is monitoring PR CI/CD status) Jun 7, 2026
@johntmyers
Collaborator

gator-agent

Maintainer Approval Needed

Gator validation and PR monitoring are complete.

Validation: PR #1774 is project-valid and linked issue #1771 is already gator:validated; the change is concentrated in inference routing/proxy support for POST /v1/embeddings.
Review: No blocking gator review findings remain at head 33d0924950e144be1008edd7e7b41ab72d79d43c. Prior findings on bounded buffered responses, embeddings validation fallback, and docs coverage were resolved.
Docs: docs/sandboxes/inference-routing.mdx was updated for the new supported embeddings pattern; no navigation update was needed.
Checks: Required gates are green for Branch Checks, Helm Lint, and E2E.
E2E: test:e2e is applied and the Core E2E result passed. GPU E2E is not required for this PR.

Human maintainer approval or merge decision is now required.

@johntmyers johntmyers merged commit 25abc9e into NVIDIA:main Jun 7, 2026
56 of 58 checks passed

Labels

gator:approval-needed (Gator completed review; maintainer approval needed)
test:e2e (Requires end-to-end coverage)

Development

Successfully merging this pull request may close these issues.

Enable embeddings for sandboxed AI workloads

2 participants