Problem Statement
Workloads running inside the sandbox can't generate embeddings, the numeric representations of text behind semantic search, retrieval-augmented generation (RAG), and similarity matching. Chat, completion, and model-listing calls already work, but embedding requests are rejected before they ever reach a provider. As a result, any feature that "searches by meaning" rather than exact keywords silently fails for anything running in the sandbox.
Proposed Design
Treat embeddings as a first-class request type in the local inference proxy, the same way chat and completions already are: recognize an embeddings request, route it to the configured AI provider with the right credentials, and return the result. Because an embeddings result is one complete response (not a streamed feed of tokens), serve it in a single piece so it can't be corrupted by being cut short mid-response.
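A minimal sketch of that flow, under stated assumptions: the OpenAI-style paths in `ROUTES`, the `ProxyResponse` type, and the `send_upstream` callable are all hypothetical stand-ins, not the proxy's actual code. It shows the three steps (recognize, route with credentials, return in one piece) and refuses to forward a body that does not parse as complete JSON.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical route table; the real proxy's paths may differ.
ROUTES: Dict[str, str] = {
    "/v1/chat/completions": "chat",
    "/v1/completions": "completion",
    "/v1/embeddings": "embeddings",
}


@dataclass
class ProxyResponse:
    status: int
    headers: Dict[str, str]
    body: bytes


def classify(method: str, path: str) -> Optional[str]:
    """Return the request kind, or None if the proxy should reject it."""
    if method == "GET" and path == "/v1/models":
        return "models"
    if method == "POST":
        return ROUTES.get(path)
    return None


def handle_embeddings(
    body: bytes,
    api_key: str,
    send_upstream: Callable[[bytes, Dict[str, str]], bytes],
) -> ProxyResponse:
    """Forward an embeddings request and return the result as one response."""
    upstream_headers = {
        "Authorization": f"Bearer {api_key}",  # credential injected by the proxy
        "Content-Type": "application/json",
    }
    raw = send_upstream(body, upstream_headers)  # assumed to return the full body

    # Only forward a body that is complete, parseable JSON.
    try:
        json.loads(raw)
    except ValueError:
        error = b'{"error": "incomplete upstream response"}'
        return ProxyResponse(502, {"Content-Type": "application/json"}, error)

    headers = {
        "Content-Type": "application/json",
        "Content-Length": str(len(raw)),  # fixed length, not chunked
    }
    return ProxyResponse(200, headers, raw)
```

Setting an explicit `Content-Length` lets the client detect a short read, which a chunked or streamed response would not guarantee in the same way.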
Alternatives Considered
- Reuse the existing streaming path for embeddings: rejected because that path is built for incremental token streams and can corrupt a single all-at-once response if it's truncated.
- Leave embeddings unsupported in the sandbox and require an external endpoint: rejected because it defeats the purpose of sandboxed AI workloads and splits configuration across two places.
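To illustrate why the streaming path was rejected, here is a small simulation. The helper names (`flaky_upstream`, `forward_streaming`, `forward_buffered`) are hypothetical and only model the behavior; they are not taken from the proxy. When the upstream drops mid-response, the streaming forwarder has already handed partial bytes to the client, while the buffered forwarder fails before writing anything.

```python
from typing import Iterable, Iterator, List


def flaky_upstream(payload: bytes, fail_after: int, chunk_size: int = 8) -> Iterator[bytes]:
    """Yield payload in chunks, raising after `fail_after` chunks have been sent."""
    for offset in range(0, len(payload), chunk_size):
        if offset // chunk_size >= fail_after:
            raise ConnectionError("upstream closed early")
        yield payload[offset:offset + chunk_size]


def forward_streaming(chunks: Iterable[bytes], sink: List[bytes]) -> None:
    """Write each chunk to the client as soon as it arrives."""
    for chunk in chunks:
        sink.append(chunk)  # these bytes are already on the wire


def forward_buffered(chunks: Iterable[bytes], sink: List[bytes]) -> None:
    """Collect the whole body first; write only if the upstream finished."""
    body = b"".join(chunks)  # raises here if the upstream fails
    sink.append(body)


if __name__ == "__main__":
    payload = b'{"data":[{"embedding":[0.1,0.2]}]}'
    for forward in (forward_streaming, forward_buffered):
        client: List[bytes] = []
        try:
            forward(flaky_upstream(payload, fail_after=2), client)
        except ConnectionError:
            pass
        print(forward.__name__, b"".join(client))
```

In the streaming case the client is left holding a JSON fragment it cannot parse; in the buffered case the proxy can return a clean error instead.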
Agent Investigation
No response
Checklist