Antifragility pass: Retry decision, session drop accounting, requireSandbox, flake forensics (#4495-#4498) #199
Open
Skobeltsyn wants to merge 5 commits into
…retries

Third decision next to Rethrow/RespondWith: re-run the failed model call with exponential backoff (initialBackoffMillis * 2^(n-1), overflow-safe). Handler consulted per failed attempt (can switch decisions mid-schedule); attempt budget is per model turn; exhaustion rethrows the ORIGINAL error, identity preserved. recoverAgenticLlmError inlined into the per-turn retry loop at the chatOrStream catch site. Default (no handler) unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
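The backoff schedule and per-attempt handler consultation described in this commit can be sketched as follows. This is a minimal illustration, not the PR's code: `Decision`, `backoffMillis`, and `runTurn` are hypothetical stand-ins, the sketch is blocking rather than suspending, and two semantics are assumed rather than confirmed by the commit message: that `maxAttempts` bounds the number of re-runs per turn, and that "the ORIGINAL error" means the first failure of the turn.

```kotlin
// Hypothetical stand-ins: names echo the commit message, shapes are assumptions.
sealed interface Decision {
    object Rethrow : Decision
    data class RespondWith(val text: String) : Decision
    data class Retry(val maxAttempts: Int, val initialBackoffMillis: Long) : Decision
}

// Delay before retry n (1-based): initialBackoffMillis * 2^(n-1),
// saturating at Long.MAX_VALUE instead of wrapping on overflow.
fun backoffMillis(initialBackoffMillis: Long, attempt: Int): Long {
    require(initialBackoffMillis >= 0) { "initialBackoffMillis must be non-negative" }
    require(attempt >= 1) { "attempt is 1-based" }
    if (initialBackoffMillis == 0L) return 0L
    val shift = attempt - 1
    // initial * 2^shift fits in a Long only if initial <= MAX_VALUE / 2^shift.
    if (shift >= 63 || initialBackoffMillis > (Long.MAX_VALUE shr shift)) return Long.MAX_VALUE
    return initialBackoffMillis shl shift
}

// One model turn: the handler is asked again after every failed attempt,
// so it may switch from Retry to RespondWith or Rethrow mid-schedule.
fun runTurn(
    call: () -> String,
    onError: (Exception) -> Decision,
    sleep: (Long) -> Unit = { Thread.sleep(it) }, // injectable so tests need not wait
): String {
    var firstFailure: Exception? = null
    var retries = 0 // budget is tracked per turn, not per handler decision
    while (true) {
        try {
            return call()
        } catch (e: Exception) {
            val original = firstFailure ?: e
            firstFailure = original
            when (val decision = onError(e)) {
                Decision.Rethrow -> throw e
                is Decision.RespondWith -> return decision.text
                is Decision.Retry -> {
                    // Assumption: on exhaustion, rethrow the first failure unwrapped.
                    if (retries >= decision.maxAttempts) throw original
                    retries++
                    sleep(backoffMillis(decision.initialBackoffMillis, retries))
                }
            }
        }
    }
}
```

With `initialBackoffMillis = 500`, the first three delays are 500, 1000, and 2000 ms. Passing a no-op `sleep` keeps tests of the schedule fast.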
…ounting

The non-suspending emitter path (trySend into the 64-slot buffer) now counts losses on a SessionDropCounter instead of logging one WARNING per dropped event: one summary line at session close (count + first dropped type), and the live count is public as AgentSession.droppedEvents so callers can assert on event loss instead of scraping logs. Both producers covered (agent.session and the shared composition agentSessionScope). Terminal Completed/Failed still always deliver via suspending send. Internals adjunct corrected (it claimed the producer suspends when full).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
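The counting idea can be sketched with the standard library alone. This is an illustration under assumptions: `DropCounter` and `BoundedEmitter` are hypothetical names, not the PR's `SessionDropCounter` API, and `ArrayBlockingQueue.offer` stands in for the non-suspending `trySend`, since both return immediately and report failure when the buffer is full.

```kotlin
import java.util.concurrent.ArrayBlockingQueue
import java.util.concurrent.atomic.AtomicLong
import java.util.concurrent.atomic.AtomicReference

// Hypothetical counter: tracks how many events were lost and the type of the first one.
class DropCounter {
    private val count = AtomicLong()
    private val firstType = AtomicReference<String?>(null)

    // Live count, readable at any time so callers can assert on event loss.
    val dropped: Long get() = count.get()

    fun record(eventType: String) {
        firstType.compareAndSet(null, eventType) // only the first drop sets the type
        count.incrementAndGet()
    }

    // One summary line for session close; null when nothing was dropped.
    fun summary(): String? =
        if (dropped == 0L) null
        else "dropped $dropped event(s); first dropped type: ${firstType.get()}"
}

// Stand-in for the non-suspending producer: a bounded buffer that never blocks.
class BoundedEmitter(capacity: Int = 64) {
    private val buffer = ArrayBlockingQueue<String>(capacity)
    val drops = DropCounter()

    // Returns false and counts the loss when the buffer is full, with no per-event log line.
    fun tryEmit(eventType: String): Boolean {
        val accepted = buffer.offer(eventType)
        if (!accepted) drops.record(eventType)
        return accepted
    }
}
```

A test can then check `emitter.drops.dropped` directly instead of parsing log output, which is the usage the commit message describes for `AgentSession.droppedEvents`.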
…strict mode

When no OS sandbox backend exists (no Seatbelt/bwrap/firejail), the flag throws IllegalStateException before the subprocess starts instead of the UNCONFINED plain-ProcessBuilder fallback. Default false preserves the historical behavior; processTool remains the fail-closed high-level path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
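The fail-closed decision can be sketched as a pure planning step that runs before any process is started. `SandboxBackend`, `LaunchPlan`, and `planLaunch` are hypothetical names for illustration only. The real `ProcessSandbox.run` signature is not shown in this PR text beyond the `requireSandbox` flag, and backend detection is left out here.

```kotlin
// Hypothetical model of the available OS sandbox backends named in the commit message.
enum class SandboxBackend { SEATBELT, BWRAP, FIREJAIL }

// What would be launched; confinedBy == null means the unconfined fallback.
data class LaunchPlan(val command: List<String>, val confinedBy: SandboxBackend?)

// Decide how to launch before touching ProcessBuilder.
// check() throws IllegalStateException, matching the exception type named above.
fun planLaunch(
    command: List<String>,
    detected: SandboxBackend?,        // assumed result of a separate detection step
    requireSandbox: Boolean = false,  // default keeps the historical fallback
): LaunchPlan {
    if (detected == null) {
        check(!requireSandbox) {
            "No OS sandbox backend available and requireSandbox = true; refusing to run unconfined"
        }
    }
    return LaunchPlan(command, detected)
}
```

Because the check happens in the planning step, a strict caller fails before any subprocess exists, while the default call still yields an unconfined plan.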
…robe

The ProcessSandboxMacTest live network test (flake #4370) now embeds exit code, probe stdout/stderr, the python3 path, and the exact generated Seatbelt profile in its assertion message, so an unreproducible runner failure becomes diagnosable from the CI log alone.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
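One way to build such a self-describing assertion message is sketched below. `ProbeOutcome` and `forensicMessage` are hypothetical names, and the field layout is an assumption; the actual test code is not shown in this PR text.

```kotlin
// Hypothetical record of everything the probe run produced.
data class ProbeOutcome(
    val exitCode: Int,
    val stdout: String,
    val stderr: String,
    val interpreterPath: String, // e.g. the resolved python3 path
    val sandboxProfile: String,  // the exact generated profile text
)

// Build a failure message that carries the full context into the CI log.
fun forensicMessage(expectation: String, outcome: ProbeOutcome): String = buildString {
    appendLine(expectation)
    appendLine("exit code: ${outcome.exitCode}")
    appendLine("interpreter: ${outcome.interpreterPath}")
    appendLine("--- stdout ---")
    appendLine(outcome.stdout)
    appendLine("--- stderr ---")
    appendLine(outcome.stderr)
    appendLine("--- sandbox profile ---")
    append(outcome.sandboxProfile)
}
```

A test would pass the result as the message of its assertion, so the message is only shown when the assertion fails.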
The doc covered only tool errors; onLLMError (incl. the new Retry decision) now has its own section with the per-turn/per-attempt semantics and the v1 routing-scope caveat.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Antifragility pass — four resilience hardenings
One theme: failures should be survivable, observable, or refusable — never silent.
- `LlmErrorDecision.Retry(maxAttempts, initialBackoffMillis)`: third `onLLMError` decision. Exponential-backoff retries of a failed model call (500ms → 1s → 2s …). The handler is consulted per failed attempt (it can switch to `RespondWith`/`Rethrow` mid-schedule); the attempt budget is per model turn; exhaustion rethrows the ORIGINAL error, identity preserved. `recoverAgenticLlmError` is inlined into the per-turn retry loop.
- `AgentSession.droppedEvents`: the `trySend` losses on the non-suspending emitter path are now counted (live, public) with ONE summary log at session close, replacing per-event WARNING spam. Both producers are covered (`agent.session` + composition `agentSessionScope`). Terminal `Completed`/`Failed` still always deliver. Internals adjunct corrected (it claimed the producer suspends when full; it drops).
- `ProcessSandbox.run(requireSandbox = true)`: fail-closed strict mode on the low-level API. No OS sandbox backend → `IllegalStateException` before the subprocess starts, instead of the UNCONFINED `ProcessBuilder` fallback (default unchanged; `processTool` stays the high-level fail-closed path).

Default behavior unchanged everywhere: `Retry` and `requireSandbox` are opt-in, drop accounting is additive, #4498 is test-only.

Gates: full `./gradlew test` (all modules, TEST-*.xml scanned clean) + `./gradlew build -x test` (detekt), both green. 9 new tests.

Docs: CHANGELOG, streaming.md, tool-policy-enforcement.md, production-hardening.md, applications.md, error-recovery adjuncts.

Note: CodeQL java-kotlin is expected-red on Kotlin 2.4 (upstream codeql#21938, Redmine #4383); build is the real gate.
🤖 Generated with Claude Code