Antifragility pass: Retry decision, session drop accounting, requireSandbox, flake forensics (#4495-#4498) by Skobeltsyn · Pull Request #199 · Deep-CodeAI/Agents.KT

Skobeltsyn · 2026-06-12T22:54:23Z

Antifragility pass — four resilience hardenings

One theme: failures should be survivable, observable, or refusable — never silent.

| Ticket | Change |
| --- | --- |
| #4495 | `LlmErrorDecision.Retry(maxAttempts, initialBackoffMillis)`: third `onLLMError` decision, exponential-backoff retries of a failed model call (500ms → 1s → 2s …). The handler is consulted per failed attempt (it can switch to `RespondWith`/`Rethrow` mid-schedule); the attempt budget is per model turn; exhaustion rethrows the ORIGINAL error, identity preserved. `recoverAgenticLlmError` is inlined into the per-turn retry loop. |
| #4496 | `AgentSession.droppedEvents`: the `trySend` losses on the non-suspending emitter path are now counted (live, public) with ONE summary log at session close, replacing per-event WARNING spam. Both producers are covered (`agent.session` + composition `agentSessionScope`). Terminal `Completed`/`Failed` still always deliver. Internals adjunct corrected (it claimed the producer suspends when full; it actually drops). |
| #4497 | `ProcessSandbox.run(requireSandbox = true)`: fail-closed strict mode on the low-level API. No OS sandbox backend → `IllegalStateException` before the subprocess starts, instead of the UNCONFINED `ProcessBuilder` fallback (default unchanged; `processTool` stays the high-level fail-closed path). |
| #4498 | Flake forensics for #4370: the mac live network probe now embeds exit code, probe stdout/stderr, python3 path, and the exact generated Seatbelt profile in its assertion message, so an unreproducible runner failure is diagnosable from the CI log alone. |

Default behavior unchanged everywhere — Retry and requireSandbox are opt-in, drop accounting is additive, #4498 is test-only.

Gates: full ./gradlew test (all modules, TEST-*.xml scanned clean) + ./gradlew build -x test (detekt) — both green. 9 new tests. Docs: CHANGELOG, streaming.md, tool-policy-enforcement.md, production-hardening.md, applications.md, error-recovery adjuncts.

Note: CodeQL java-kotlin is expected-red on Kotlin 2.4 (upstream codeql#21938, Redmine #4383) — build is the real gate.

🤖 Generated with Claude Code

…retries Third decision next to Rethrow/RespondWith: re-run the failed model call with exponential backoff (initialBackoffMillis * 2^(n-1), overflow-safe). Handler consulted per failed attempt (can switch decisions mid-schedule); attempt budget is per model turn; exhaustion rethrows the ORIGINAL error, identity preserved. recoverAgenticLlmError inlined into the per-turn retry loop at the chatOrStream catch site. Default (no handler) unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
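The retry semantics described above can be sketched in a few lines. This is a minimal illustration, not the library's code: `Decision`, `backoffMillis`, and `callWithRecovery` are hypothetical stand-ins, the model call is reduced to a `() -> String`, and sleeping is injected so the schedule can be observed. Two points the commit message leaves open are resolved here by assumption: `maxAttempts` is treated as the total number of calls including the first, and "the ORIGINAL error" is taken to mean the first failure of the turn, rethrown as the same instance.

```kotlin
// Hypothetical stand-in for the library's decision type; names are illustrative only.
sealed interface Decision {
    object Rethrow : Decision
    data class RespondWith(val text: String) : Decision
    data class Retry(val maxAttempts: Int, val initialBackoffMillis: Long) : Decision
}

// Delay before the retry that follows failed attempt n (1-based):
// initialBackoffMillis * 2^(n-1), saturating at Long.MAX_VALUE instead of overflowing.
fun backoffMillis(initialBackoffMillis: Long, attempt: Int): Long {
    require(initialBackoffMillis >= 0) { "initialBackoffMillis must be non-negative" }
    require(attempt >= 1) { "attempt is 1-based" }
    var delay = initialBackoffMillis
    repeat(attempt - 1) {
        if (delay > Long.MAX_VALUE / 2) return Long.MAX_VALUE
        delay *= 2
    }
    return delay
}

// One "model turn": the handler is consulted after every failed attempt and may
// change its decision between attempts.
fun callWithRecovery(
    call: () -> String,
    onError: (Exception) -> Decision,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
): String {
    var attempts = 0
    var firstFailure: Exception? = null
    while (true) {
        attempts++
        try {
            return call()
        } catch (e: Exception) {
            val first = firstFailure ?: e
            firstFailure = first
            when (val decision = onError(e)) {
                Decision.Rethrow -> throw e
                is Decision.RespondWith -> return decision.text
                is Decision.Retry -> {
                    // Assumption: budget exhausted -> rethrow the first failure, unwrapped.
                    if (attempts >= decision.maxAttempts) throw first
                    sleep(backoffMillis(decision.initialBackoffMillis, attempts))
                }
            }
        }
    }
}
```

With `initialBackoffMillis = 500`, the delays passed to `sleep` follow the 500ms → 1s → 2s schedule from the PR description. Because the handler runs on each failure, it can return `Retry` for the first few errors and then fall back to `RespondWith` without waiting for the budget to run out.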

…ounting The non-suspending emitter path (trySend into the 64-slot buffer) now counts losses on a SessionDropCounter instead of logging one WARNING per dropped event: one summary line at session close (count + first dropped type), and the live count is public as AgentSession.droppedEvents so callers can assert on event loss instead of scraping logs. Both producers covered (agent.session and the shared composition agentSessionScope). Terminal Completed/Failed still always deliver via suspending send. Internals adjunct corrected (it claimed the producer suspends when full).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
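The counting scheme can be illustrated without the coroutine machinery. In the sketch below only the name `SessionDropCounter` and the 64-slot default come from the commit message; its members, the `BoundedEmitter` class, and the summary wording are assumptions. A `java.util.concurrent.ArrayBlockingQueue` stands in for the bounded channel, since its `offer()` returns `false` on a full buffer much as a failed `trySend` reports failure.

```kotlin
import java.util.concurrent.ArrayBlockingQueue
import java.util.concurrent.atomic.AtomicLong
import java.util.concurrent.atomic.AtomicReference

// Hypothetical shape: counts drops and remembers the type of the first one.
class SessionDropCounter {
    private val dropped = AtomicLong()
    private val firstType = AtomicReference<String?>(null)

    // Live count, readable at any time (the role AgentSession.droppedEvents plays).
    val count: Long get() = dropped.get()

    fun record(eventType: String) {
        firstType.compareAndSet(null, eventType) // only the first dropped type is kept
        dropped.incrementAndGet()
    }

    // One summary line for session close, or null when nothing was lost.
    fun summary(): String? {
        val n = dropped.get()
        return if (n == 0L) null else "dropped $n event(s); first dropped type: ${firstType.get()}"
    }
}

// Stand-in for the non-suspending emitter path; events are reduced to their type names.
class BoundedEmitter(capacity: Int = 64) {
    private val buffer = ArrayBlockingQueue<String>(capacity)
    val drops = SessionDropCounter()

    // Never blocks: a full buffer means the event is dropped and counted, not logged.
    fun tryEmit(eventType: String): Boolean {
        val accepted = buffer.offer(eventType)
        if (!accepted) drops.record(eventType)
        return accepted
    }
}
```

A test can then assert on `drops.count` directly rather than scraping WARNING lines, which is the usage the commit message describes for `AgentSession.droppedEvents`.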

…strict mode When no OS sandbox backend exists (no Seatbelt/bwrap/firejail), the flag throws IllegalStateException before the subprocess starts instead of the UNCONFINED plain-ProcessBuilder fallback. Default false preserves the historical behavior; processTool remains the fail-closed high-level path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
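The fail-closed check amounts to a guard that runs before any process is created. The sketch below is illustrative only: `LaunchPlan` and `planLaunch` are hypothetical names, backend detection is reduced to a nullable string supplied by the caller, and prefixing the command with the backend name is a placeholder for the flags or profile a real backend would need.

```kotlin
// Hypothetical result type: what would be executed and whether it is confined.
data class LaunchPlan(val argv: List<String>, val confined: Boolean)

// backend == null models "no Seatbelt/bwrap/firejail found on this machine".
fun planLaunch(
    command: List<String>,
    backend: String?,
    requireSandbox: Boolean = false,
): LaunchPlan {
    if (backend == null) {
        // check() throws IllegalStateException, so strict mode fails before anything starts.
        check(!requireSandbox) { "requireSandbox = true but no OS sandbox backend is available" }
        // Historical default: run unconfined.
        return LaunchPlan(command, confined = false)
    }
    // Placeholder wrapping; a real backend invocation carries its own arguments.
    return LaunchPlan(listOf(backend) + command, confined = true)
}
```

Keeping the default at `false` matches the PR's statement that existing callers see no change, while callers that cannot tolerate an unconfined run opt in explicitly.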

…robe The ProcessSandboxMacTest live network test (flake #4370) now embeds exit code, probe stdout/stderr, the python3 path, and the exact generated Seatbelt profile in its assertion message, so an unreproducible runner failure becomes diagnosable from the CI log alone.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
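A forensic assertion message of this kind can be assembled with `buildString`. The sketch below is an assumption about shape only: `ProbeResult`, `probeFailureMessage`, and the section labels are hypothetical and are not taken from `ProcessSandboxMacTest`.

```kotlin
// Hypothetical container for what the probe subprocess produced.
data class ProbeResult(val exitCode: Int, val stdout: String, val stderr: String)

// Builds one multi-line message carrying everything needed to diagnose a failure
// from the CI log, without access to the runner.
fun probeFailureMessage(
    result: ProbeResult,
    python3Path: String,
    seatbeltProfile: String,
): String = buildString {
    appendLine("sandboxed network probe produced an unexpected result")
    appendLine("exit code: ${result.exitCode}")
    appendLine("python3: $python3Path")
    appendLine("--- probe stdout ---")
    appendLine(result.stdout.trimEnd())
    appendLine("--- probe stderr ---")
    appendLine(result.stderr.trimEnd())
    appendLine("--- generated Seatbelt profile ---")
    append(seatbeltProfile.trimEnd())
}
```

The string would be passed as the message argument of the test's assertion, so it is only rendered into the log when the assertion fails.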

The doc covered only tool errors; onLLMError (incl. the new Retry decision) now has its own section with the per-turn/per-attempt semantics and the v1 routing-scope caveat.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Skobeltsyn and others added 5 commits June 13, 2026 01:42

Skobeltsyn mentioned this pull request Jun 13, 2026

Streaming hardening: cancellation leak + concurrent-composition name collision (#4499, #4500) #200

Open

Skobeltsyn wants to merge 5 commits into main from feat/antifragility-pass
