Antifragility pass: Retry decision, session drop accounting, requireSandbox, flake forensics (#4495-#4498)#199

Open
Skobeltsyn wants to merge 5 commits into main from feat/antifragility-pass

@Skobeltsyn

Contributor

Antifragility pass — four resilience hardenings

One theme: failures should be survivable, observable, or refusable — never silent.

| Ticket | Change |
|---|---|
| #4495 | LlmErrorDecision.Retry(maxAttempts, initialBackoffMillis): third onLLMError decision. Retries a failed model call with exponential backoff (500ms → 1s → 2s …). The handler is consulted per failed attempt (it can switch to RespondWith/Rethrow mid-schedule); the attempt budget is per model turn; exhaustion rethrows the ORIGINAL error, identity preserved. recoverAgenticLlmError is inlined into the per-turn retry loop. |
| #4496 | AgentSession.droppedEvents: trySend losses on the non-suspending emitter path are now counted (live, public) with ONE summary log at session close, replacing per-event WARNING spam. Both producers are covered (agent.session + composition agentSessionScope). Terminal Completed/Failed still always deliver. Internals adjunct corrected (it claimed the producer suspends when full; it drops). |
| #4497 | ProcessSandbox.run(requireSandbox = true): fail-closed strict mode on the low-level API. No OS sandbox backend → IllegalStateException before the subprocess starts, instead of the UNCONFINED ProcessBuilder fallback (default unchanged; processTool stays the high-level fail-closed path). |
| #4498 | Flake forensics for #4370: the mac live network probe now embeds exit code, probe stdout/stderr, python3 path, and the exact generated Seatbelt profile in its assertion message, so an unreproducible runner failure is diagnosable from the CI log alone. |

Default behavior unchanged everywhere — Retry and requireSandbox are opt-in, drop accounting is additive, #4498 is test-only.

Gates: full ./gradlew test (all modules, TEST-*.xml scanned clean) + ./gradlew build -x test (detekt) — both green. 9 new tests. Docs: CHANGELOG, streaming.md, tool-policy-enforcement.md, production-hardening.md, applications.md, error-recovery adjuncts.

Note: CodeQL java-kotlin is expected-red on Kotlin 2.4 (upstream codeql#21938, Redmine #4383) — build is the real gate.

🤖 Generated with Claude Code

Skobeltsyn and others added 5 commits June 13, 2026 01:42
…retries

Third decision next to Rethrow/RespondWith: re-run the failed model call
with exponential backoff (initialBackoffMillis * 2^(n-1), overflow-safe).
Handler consulted per failed attempt (can switch decisions mid-schedule);
attempt budget is per model turn; exhaustion rethrows the ORIGINAL error,
identity preserved. recoverAgenticLlmError inlined into the per-turn retry
loop at the chatOrStream catch site. Default (no handler) unchanged.
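
The retry semantics described in this commit can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: `Decision`, `backoffMillis`, and `callWithRetry` are hypothetical names, the blocking `sleep` parameter stands in for whatever delay mechanism the real loop uses, and two readings are assumptions: `maxAttempts` is taken to count total attempts including the first, and "ORIGINAL error" is taken to mean the caught exception object rethrown unwrapped.

```kotlin
// Hypothetical mirror of the three decisions named in the PR.
sealed interface Decision {
    object Rethrow : Decision
    data class RespondWith(val text: String) : Decision
    data class Retry(val maxAttempts: Int, val initialBackoffMillis: Long) : Decision
}

// initialBackoffMillis * 2^(attempt - 1), saturating at Long.MAX_VALUE instead of overflowing.
fun backoffMillis(initialBackoffMillis: Long, attempt: Int): Long {
    require(initialBackoffMillis >= 0) { "backoff must be non-negative" }
    require(attempt >= 1) { "attempt is 1-based" }
    val shift = attempt - 1
    if (initialBackoffMillis == 0L) return 0L
    // Guard both the shift width and the product before shifting.
    if (shift >= 63 || initialBackoffMillis > (Long.MAX_VALUE shr shift)) return Long.MAX_VALUE
    return initialBackoffMillis shl shift
}

// One model turn: the handler is consulted after every failed attempt,
// so it may change its decision mid-schedule.
fun callWithRetry(
    call: () -> String,
    onError: (Throwable) -> Decision,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
): String {
    var failedAttempts = 0
    while (true) {
        try {
            return call()
        } catch (e: Exception) {
            failedAttempts++
            when (val decision = onError(e)) {
                is Decision.Rethrow -> throw e
                is Decision.RespondWith -> return decision.text
                is Decision.Retry -> {
                    // Budget exhausted: rethrow the caught exception itself, not a wrapper.
                    if (failedAttempts >= decision.maxAttempts) throw e
                    sleep(backoffMillis(decision.initialBackoffMillis, failedAttempts))
                }
            }
        }
    }
}
```

With `Retry(maxAttempts = 3, initialBackoffMillis = 500)` and a call that always fails, this sketch waits 500 ms and then 1000 ms before the third attempt, and the caller receives the same exception instance that the call threw.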

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ounting

The non-suspending emitter path (trySend into the 64-slot buffer) now
counts losses on a SessionDropCounter instead of logging one WARNING per
dropped event: one summary line at session close (count + first dropped
type), and the live count is public as AgentSession.droppedEvents so
callers can assert on event loss instead of scraping logs. Both producers
covered (agent.session and the shared composition agentSessionScope).
Terminal Completed/Failed still always deliver via suspending send.
Internals adjunct corrected (it claimed the producer suspends when full).
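
A minimal sketch of the counting idea, under stated assumptions: `DropCounterSketch` and `EmitterSketch` are hypothetical names, and a `java.util.concurrent.ArrayBlockingQueue` stands in for the 64-slot buffer, with `offer` playing the role of trySend and `put` playing the role of the suspending send. The summary wording, and returning no summary when nothing was dropped, are illustrative choices rather than the behavior of the real SessionDropCounter.

```kotlin
import java.util.concurrent.ArrayBlockingQueue
import java.util.concurrent.atomic.AtomicLong
import java.util.concurrent.atomic.AtomicReference

// Counts lost events and remembers the type of the first one.
class DropCounterSketch {
    private val count = AtomicLong(0)
    private val firstType = AtomicReference<String?>(null)

    // Live count, readable at any time (analogous to a public droppedEvents value).
    val dropped: Long get() = count.get()

    fun record(eventType: String) {
        count.incrementAndGet()
        firstType.compareAndSet(null, eventType) // only the first drop sets the type
    }

    // One summary line for session close; null here means nothing was dropped.
    fun summaryOrNull(): String? {
        val n = count.get()
        return if (n == 0L) null else "dropped $n event(s); first dropped type: ${firstType.get()}"
    }
}

class EmitterSketch(capacity: Int = 64) {
    private val buffer = ArrayBlockingQueue<String>(capacity)
    val drops = DropCounterSketch()

    // Non-blocking path: a full buffer drops the event and counts it, without logging per event.
    fun tryEmit(eventType: String): Boolean {
        val accepted = buffer.offer(eventType)
        if (!accepted) drops.record(eventType)
        return accepted
    }

    // Terminal events wait for space instead of dropping.
    fun emitTerminal(eventType: String) {
        buffer.put(eventType)
    }
}
```

A caller can then assert on `drops.dropped` directly, which is the point of exposing the count instead of relying on log output.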

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…strict mode

When no OS sandbox backend exists (no Seatbelt/bwrap/firejail), the flag
throws IllegalStateException before the subprocess starts instead of the
UNCONFINED plain-ProcessBuilder fallback. Default false preserves the
historical behavior; processTool remains the fail-closed high-level path.
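
The fail-closed check can be pictured as a guard that runs before any process is created. The sketch below is an assumption-laden illustration: `SandboxBackend`, `LaunchPlan`, and `planLaunch` are hypothetical names, backend detection is passed in as a parameter rather than probed, and no subprocess is actually started.

```kotlin
// Backends named in the commit message; detection itself is out of scope here.
enum class SandboxBackend { SEATBELT, BWRAP, FIREJAIL }

data class LaunchPlan(val command: List<String>, val backend: SandboxBackend?) {
    // A null backend means the command would run unconfined.
    val confined: Boolean get() = backend != null
}

fun planLaunch(
    command: List<String>,
    detected: SandboxBackend?,
    requireSandbox: Boolean = false, // default keeps the historical fallback
): LaunchPlan {
    if (detected == null && requireSandbox) {
        // Strict mode: refuse before anything is started.
        throw IllegalStateException(
            "requireSandbox = true but no OS sandbox backend was found; refusing to start ${command.firstOrNull()}"
        )
    }
    return LaunchPlan(command, detected)
}
```

With the default `requireSandbox = false`, a missing backend yields an unconfined plan, matching the fallback described above; with `requireSandbox = true`, the same input throws before a plan exists.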

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…robe

The ProcessSandboxMacTest live network test (flake #4370) now embeds exit
code, probe stdout/stderr, the python3 path, and the exact generated
Seatbelt profile in its assertion message — an unreproducible runner
failure becomes diagnosable from the CI log alone.
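
As an illustration of what such an assertion message might look like, the sketch below assembles the listed diagnostics into one string. `ProbeResult` and `forensicMessage` are hypothetical names, the layout is an assumption, and the path and Seatbelt profile in the usage note are placeholders, not values from the real test.

```kotlin
data class ProbeResult(val exitCode: Int, val stdout: String, val stderr: String)

// Builds a single failure message that carries everything needed to read the failure from a CI log.
fun forensicMessage(result: ProbeResult, python3Path: String, seatbeltProfile: String): String =
    buildString {
        appendLine("live network probe assertion failed")
        appendLine("exit code: ${result.exitCode}")
        appendLine("python3: $python3Path")
        appendLine("--- stdout ---")
        appendLine(result.stdout.ifEmpty { "<empty>" })
        appendLine("--- stderr ---")
        appendLine(result.stderr.ifEmpty { "<empty>" })
        appendLine("--- generated Seatbelt profile ---")
        append(seatbeltProfile)
    }
```

A test could pass `forensicMessage(ProbeResult(1, "", "Operation not permitted"), "/usr/bin/python3", "(version 1)\n(deny default)")` as the message of its assertion, so that the message is what appears in the CI log when the assertion fails.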

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The doc covered only tool errors; onLLMError (incl. the new Retry
decision) now has its own section with the per-turn/per-attempt semantics
and the v1 routing-scope caveat.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>