
Support Kubernetes-native Service/Endpoint exposure for sandbox agents #1791

@prakashmirji


Problem Statement

Problem

The openshell CLI can expose sandbox agent endpoints, but its options are designed for interactive, short-lived sessions. For long-running agents deployed as always-on services (e.g., LangChain agents serving OpenAI-compatible APIs), CLI-based port exposure is unreliable: connections drop, ports are not discoverable via DNS, and there is no integration with Kubernetes Service objects.

Current Behavior

  • openshell sandbox expose creates a temporary port forward
  • No Kubernetes Service or Endpoints object is created
  • Clients cannot discover the agent via the standard <service>.<namespace>.svc.cluster.local DNS name
  • If the gateway pod restarts, the exposure is lost
  • There is no integration with Istio VirtualService for external ingress
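
To illustrate why the missing Service blocks external ingress: an Istio VirtualService routes to a Service host, which a temporary port forward does not provide. The following is a minimal sketch of the route that would become possible once a Service exists. The names my-agent, my-gateway, agent.example.com, and the default namespace are assumptions for illustration, not anything OpenShell creates today.

```yaml
# Hypothetical VirtualService; requires a Service named "my-agent" to exist.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-agent
  namespace: default            # assumed namespace
spec:
  hosts:
  - agent.example.com           # hypothetical external hostname
  gateways:
  - my-gateway                  # hypothetical Istio Gateway
  http:
  - route:
    - destination:
        # Cluster DNS name of the Service; unavailable with a port forward
        host: my-agent.default.svc.cluster.local
        port:
          number: 8000
```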

Proposed Design

Desired Behavior
When a sandbox is created with a declared port (e.g., port: 8000), OpenShell should:

  1. Create a Kubernetes Service and Endpoints (or EndpointSlice) pointing to the sandbox pod's IP and declared port
  2. Keep the Service stable across pod restarts (the gateway recreates the pod; the Service stays)
  3. Optionally support type: ClusterIP for internal access, and integrate with Ingress/Istio for external access
  4. Accept a service stanza in the Sandbox CRD, for example:
apiVersion: agents.x-k8s.io/v1alpha1
kind: Sandbox
metadata:
  name: my-agent
spec:
  podTemplate:
    spec:
      containers:
      - name: agent
        ports:
        - containerPort: 8000
  service:
    enabled: true
    port: 8000
    type: ClusterIP
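
For the Sandbox above, the objects described in step 1 might look roughly like the sketch below: a selectorless Service plus an EndpointSlice that the controller keeps pointed at the current pod IP. This is one possible shape, not a settled design. The namespace, the EndpointSlice name, the managed-by value, and the pod IP are placeholders.

```yaml
# Sketch of objects a controller could create for Sandbox "my-agent".
apiVersion: v1
kind: Service
metadata:
  name: my-agent
  namespace: default            # assumed namespace
spec:
  type: ClusterIP
  # No selector: endpoints are managed explicitly, so the Service
  # survives pod recreation unchanged.
  ports:
  - name: http
    port: 8000
    targetPort: 8000
    protocol: TCP
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: my-agent-1              # hypothetical name
  namespace: default
  labels:
    # Associates this slice with the Service above
    kubernetes.io/service-name: my-agent
    endpointslice.kubernetes.io/managed-by: openshell-controller  # hypothetical value
addressType: IPv4
ports:
- name: http                    # must match the Service port name
  port: 8000
  protocol: TCP
endpoints:
- addresses:
  - 10.0.0.12                   # placeholder; the sandbox pod's IP
  conditions:
    ready: true
```

With this shape, only the EndpointSlice address changes when the gateway recreates the pod; the Service's ClusterIP and DNS name stay fixed.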

Alternatives Considered

Use Case
Enterprise deployment of AI agents as long-running microservices that must be reachable by other services in the cluster, load balancers, and API gateways, rather than used as interactive CLI sessions.

Workaround
Our controller patches labels onto the sandbox pod (agentplatform.hpe.com/agent: ) and creates a selector-based ClusterIP Service in the same namespace. This works for in-namespace routing, but the Service is not managed by OpenShell itself: our external controller must discover the sandbox pod, patch the labels, and reconcile the Service independently. If the gateway recreates the pod (e.g., after eviction), the new pod has no labels until our controller re-reconciles, so the Service has no endpoints for a short time and requests fail.
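
A minimal sketch of the Service used in this workaround, for reference. The label value my-agent is a placeholder because the value is not given above, and the Service name and port are assumed to match the earlier example.

```yaml
# Sketch of the workaround's selector-based Service.
apiVersion: v1
kind: Service
metadata:
  name: my-agent                # assumed name
spec:
  type: ClusterIP
  selector:
    # Matches the label our controller patches onto the sandbox pod.
    # Placeholder value; a recreated pod lacks this label until re-patched.
    agentplatform.hpe.com/agent: my-agent
  ports:
  - port: 8000
    targetPort: 8000
    protocol: TCP
```

Because the selector depends on a label applied after the pod is created, the Service has no endpoints between pod recreation and the next reconcile.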

Agent Investigation

No response

Checklist

  • I've reviewed existing issues and the architecture docs
  • This is a design proposal, not a "please build this" request

Metadata

    Labels

    state:triage-needed: Opened without agent diagnostics and needs triage
