Wait out auto-repair chain before failing on invoke --wait #5
jackcbrown89 wants to merge 2 commits
When a workflow execution fails, `workflows invoke --wait` no longer counts
it as a failure immediately. If the account has auto-repair enabled and the
workflow is deterministic, the CLI now follows the post-failure event
timeline (summarization -> repair -> re-execution) before deciding the exit
code:
- summarization classifies it as an app issue -> fail
- summarization is not successful (inconclusive) -> fail
- repair is not successful -> fail
- repair succeeds and the re-run passes -> pass (self-healed)
- repair succeeds but the re-run fails -> fail
AI-driven workflows and accounts without auto-repair keep the previous
fail-fast behavior. The `--timeout` now covers the whole wait, including the
repair chain.
Adds GET /settings and GET /workflows/{id}/summarizations/{id} to the client,
a pollWorkflowRepairChain() poller, the WorkflowSummarization/Settings types,
and the last_summarization_* fields on WorkflowResource. Also registers the
previously-unwired `skills` command.
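
The five outcome rules listed above can be read as a small decision function. The following is a minimal sketch, not the code in this PR: the `RepairChain` shape, `decideExit`, and every reason string except `reexecution_failed` are hypothetical names chosen for illustration.

```typescript
// Sketch only: field names and most reason strings are assumptions,
// not taken from the PR's implementation.
type RepairChain = {
  summarizationSucceeded: boolean; // false = inconclusive summarization
  isAppIssue: boolean;             // summarization blamed the app under test
  repairSucceeded: boolean;
  rerunPassed: boolean;
};

type Decision = { exitCode: 0 | 1; reason: string };

function decideExit(chain: RepairChain): Decision {
  // An inconclusive summarization cannot justify waiting any longer.
  if (!chain.summarizationSucceeded) {
    return { exitCode: 1, reason: "summarization_inconclusive" };
  }
  // A real app defect is not something a workflow repair can fix.
  if (chain.isAppIssue) {
    return { exitCode: 1, reason: "app_issue" };
  }
  if (!chain.repairSucceeded) {
    return { exitCode: 1, reason: "repair_failed" };
  }
  if (!chain.rerunPassed) {
    return { exitCode: 1, reason: "reexecution_failed" };
  }
  // Repair succeeded and the re-run passed: report success.
  return { exitCode: 0, reason: "self_healed" };
}
```

Only the last branch yields exit code 0, which matches the description: every earlier stage has to succeed before a failed run can be counted as a pass.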
Proposed Lark tests
1. getlark skills install subcommand is recognized and delegates to npx
2. invoke --wait follows auto-repair chain and exits 0 on self-heal
3. invoke --wait labels app-issue failures distinctly in stderr. The diff changes the failure output for runs where the auto-repair summarization classifies the root cause as an app defect.
jackcbrown89 left a comment
Adversarial review — auto-repair chain on invoke --wait
Intent is sound and the README documents it well. Typecheck passes clean. A few findings, posted inline. Summary by severity:
High — contract/reliability
- #1 Timeout silently exits `0` unless `--timeout` is explicitly passed. `pollWorkflowRepairChain`/`pollWorkflowExecution` signal timeout via `throw TimeoutError`, but that throw is consumed by `Promise.allSettled` (which never rejects), lands in the rejected branch as a `console.error`, and is not pushed to `failedWorkflowIds`, so the run falls through to `process.exit(0)`. The only path that yields the documented exit-2 is the outer `timeoutPromise` race, which is only created `if (cmdOpts.timeout)`. So `--wait` alone (default 600s) → a timed-out/hung run passes CI. The `catch (TimeoutError) → exit(2)` block is effectively dead for this command. (Partly pre-existing, but the new feature leans entirely on `TimeoutError` for the repair chain.)
Medium — correctness & simplification
- #2 The "raced-ahead summarization" guard can spin until timeout instead of recovering; see inline on `client.ts`.
- #3 Two overlapping timeout mechanisms. The branch threads a single `deadline` through both polls (good), making the inner timeout authoritative, so the outer `Promise.race(..., timeoutPromise)` is now redundant and is the only thing honoring exit-2 (#1). Routing `TimeoutError` → exit 2 in the result loop fixes #1 and lets the outer race be deleted: one change, fewer layers.
- #4 The verdict is reconstructed client-side from the event timeline with fragile assumptions; see inline on `client.ts`. If the backend can return the repair verdict directly, most of this 110-line state machine could be deleted. Worth confirming before it hardens into a contract.
Low
- #5 `skills install` spawn behavior on Windows; see inline.
- #6 `reexecution_failed` returns the summarization summary rather than the re-execution's own failure summary; see inline.
Verified: typecheck clean; new types consistent and used; skills follows the documented (program) registration pattern; success/repaired/failure/cancelled exit paths and messaging match the README.
    }

    - let workflowExecutionResults: PromiseSettledResult<WorkflowExecutionResource>[] =
    + let workflowExecutionResults: PromiseSettledResult<WorkflowOutcome>[] =
#1 + #3 (High): a repair-chain timeout exits 0 unless --timeout is explicitly passed.
pollWorkflowExecution/pollWorkflowRepairChain throw TimeoutError expecting exit code 2 (the documented contract). But that throw happens inside the promises consumed by Promise.allSettled, which never rejects — it becomes a {status: "rejected"} result handled at the else branch below (console.error("Error: " + result.reason)) and is not added to failedWorkflowIds. With nothing failed/cancelled, the run hits process.exit(0).
The only path that produces exit 2 is the outer timeoutPromise race — created only if (cmdOpts.timeout). So --wait alone (default 600s) → a timed-out/hung run silently passes CI, and the catch (TimeoutError) → exit(2) block is dead for this command.
Suggested fix (also resolves #3): detect TimeoutError in the rejected branch and process.exit(2), then delete the now-redundant outer Promise.race/timeoutPromise (the inner deadline is already authoritative). One change, fewer layers, contract honored regardless of --timeout.
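
A minimal sketch of the suggested fix, under stated assumptions: the result loop inspects each settled result and maps a `TimeoutError` rejection to exit code 2. The `Settled` type mirrors the built-in `PromiseSettledResult`; `Outcome`, `isTimeout`, and `exitCodeFor` are hypothetical names, and mapping other rejections to exit code 3 is an assumption based on the documented "unexpected" code rather than on the PR's code.

```typescript
// Local stand-ins so the sketch is self-contained (hypothetical names).
class TimeoutError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TimeoutError";
  }
}

type Outcome = { workflowId: string; passed: boolean };

// Same shape as the built-in PromiseSettledResult<T>.
type Settled<T> =
  | { status: "fulfilled"; value: T }
  | { status: "rejected"; reason: unknown };

// Checking the name keeps the test robust even when subclassing Error
// is down-leveled and `instanceof TimeoutError` would not hold.
function isTimeout(reason: unknown): boolean {
  return reason instanceof Error && reason.name === "TimeoutError";
}

function exitCodeFor(results: Settled<Outcome>[]): number {
  let sawUnexpected = false;
  let sawFailure = false;
  for (const result of results) {
    if (result.status === "rejected") {
      // A timeout takes priority over every other outcome.
      if (isTimeout(result.reason)) return 2;
      sawUnexpected = true; // assumption: other rejections are "unexpected"
    } else if (!result.value.passed) {
      sawFailure = true;
    }
  }
  if (sawUnexpected) return 3;
  return sawFailure ? 1 : 0;
}
```

Because the timeout is detected inside the loop, the result no longer depends on whether an outer race was created, so the same exit code applies with or without `--timeout`.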
    const elapsedMs = Date.now() - startTime;
    await onPoll?.(stage, elapsedMs);

    const summ = chain.find((e) => e.event_type === "summarization");
#2 (Medium): the raced-ahead guard can spin until timeout instead of recovering.
chain.find(e => e.event_type === "summarization") returns the oldest summarization after failedAt. If that one belongs to a different execution (the exact race the comment below at L617-619 guards against), the code continues and retries — but every retry re-finds the same oldest event, so it never advances to our summarization and just times out. To actually "keep waiting for ours" you'd need to consider all summarization candidates and match workflow_execution_id, not only the first.
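
One way to consider all candidates is sketched below. It assumes the event itself carries a `workflow_execution_id` field; if only the summarization detail does, each candidate's detail would have to be fetched before matching. `TimelineEvent` and `pickOwnSummarization` are hypothetical names.

```typescript
// Hypothetical event shape; the real event type may differ.
type TimelineEvent = {
  id: string;
  event_type: string;
  workflow_execution_id: string | null;
};

// Returns the summarization that belongs to the failed execution, or
// undefined when it has not appeared yet (the caller keeps polling).
function pickOwnSummarization(
  chain: TimelineEvent[],
  failedExecutionId: string,
): TimelineEvent | undefined {
  return chain
    .filter((e) => e.event_type === "summarization")
    .find((e) => e.workflow_execution_id === failedExecutionId);
}
```

With this shape an unrelated older summarization no longer blocks the search: it is skipped on every poll, and the matching one is returned as soon as it shows up in the timeline.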
        summary: null,
      };
    }
    const detail = await this.getWorkflowSummarization(workflowId, summ.id);
#4 (Medium): verdict reconstructed client-side from the event timeline — fragile assumptions worth confirming against the API.
- This assumes a workflow event's `id` equals the underlying summarization resource id. If events carry their own ids, `getWorkflowSummarization(workflowId, summ.id)` 404s every poll.
- Ordering relies on millisecond timestamp comparisons with mixed strict/non-strict operators (`> failedAt`, `>= summ.created_at`, `>= repair.stopped_at`). The original-execution exclusion is double-guarded by the `repair.stopped_at` check today, but it's brittle.
- `limit: 50`, newest-first, no pagination: fine for a fresh single chain, silently lossy if the window fills (e.g. concurrent invocations of the same workflow).
If the platform can expose the repair verdict directly (on the execution or a single endpoint), most of this ~110-line state machine could be deleted. Worth a backend conversation before it hardens into a client-side contract.
    result: "failure",
    reason: "reexecution_failed",
    executionId: reExecution.id,
    summary: detail.summary,
#6 (Low): reexecution_failed returns summary: detail.summary (the summarization's text) rather than the re-execution's own failure summary, which would be more actionable for the user reading the CI output.
    .action(() => {
      const cmdArgs = ["-y", "skills", "add", SKILLS_PACKAGE];

      const child = spawn("npx", cmdArgs, {
#5 (Low): spawn("npx", …, { shell: false }) fails with ENOENT on Windows (npx resolves to npx.cmd). The published @getlark/cli will hit this on Windows. The child.on("error") handler degrades it to a clear message rather than a crash, and shell: false is the right call for safety — so this is acceptable, but consider shell: process.platform === "win32" or documenting the limitation.
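
The platform-conditional option could look roughly like the sketch below. `npxSpawnOptions` is a hypothetical helper, not something in the PR; it only builds the options object so the choice can be inspected without spawning a process.

```typescript
// Hypothetical helper: enable the shell only where npx is a .cmd shim.
type NpxSpawnOptions = { shell: boolean };

function npxSpawnOptions(platform: string): NpxSpawnOptions {
  // On Windows, npx resolves to npx.cmd, which needs a shell to run.
  // Elsewhere shell stays off so arguments are never shell-interpreted.
  return { shell: platform === "win32" };
}

// Intended use (not executed here):
//   spawn("npx", cmdArgs, { ...npxSpawnOptions(process.platform) });
```

With `shell: true`, Node passes the arguments to the shell without escaping them, so this is only reasonable because `cmdArgs` is built from fixed values rather than user input.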
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit c6b1fc1.
      limit: 50,
    });
    const chain = workflow_events
      .filter((e) => at(e.created_at) > failedAt)
Strict > filter misses same-timestamp summarization events
Medium Severity
The event filter at at(e.created_at) > failedAt uses strict greater-than, while the subsequent searches for repair and re-execution events (lines 660, 678) use non-strict >=. If the backend creates the summarization event at the exact same timestamp as the execution's stopped_at (e.g., triggered synchronously or within the same clock tick), the filter permanently excludes it. Every poll re-fetches and re-filters with the same strict >, so the summarization is never found, causing the chain to spin until timeout (exit code 2) instead of following the repair path.
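
The difference between the two comparisons can be shown with a small sketch. The `at` helper name comes from the diff, but its body here is an assumption (ISO timestamp to epoch milliseconds), and `TimedEvent`, `chainStrict`, and `chainInclusive` are hypothetical names.

```typescript
// Hypothetical event shape with an ISO-8601 timestamp.
type TimedEvent = { event_type: string; created_at: string };

// Assumed behavior of the `at` helper: ISO string to epoch milliseconds.
const at = (iso: string): number => new Date(iso).getTime();

// Strict comparison, as in the diff: drops events at exactly failedAt.
function chainStrict(events: TimedEvent[], failedAt: number): TimedEvent[] {
  return events.filter((e) => at(e.created_at) > failedAt);
}

// Inclusive comparison: keeps events created in the same millisecond.
function chainInclusive(events: TimedEvent[], failedAt: number): TimedEvent[] {
  return events.filter((e) => at(e.created_at) >= failedAt);
}
```

An inclusive filter may also admit events that belong to the failed execution itself, so it would likely need to be paired with an explicit exclusion by id rather than relying on timestamps alone.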


What
`getlark workflows invoke --wait` no longer counts a failed execution as a failure right away. When the account has auto-repair enabled (`GET /settings` → `auto_repair_deterministic_workflows_enabled`) and the workflow is deterministic, the CLI follows the post-failure repair chain before deciding the exit code.

Exit codes are unchanged (`0` success incl. self-healed, `1` failure/cancelled, `2` timeout, `3` unexpected). A single `--timeout` now covers the whole wait, repair chain included.

How
- Follows the chain via `GET /workflows/{id}/events` (the ordered timeline) rather than juggling `last_*` pointers; only the summarization is fetched in detail (for its `category`), and it's verified to belong to the failed execution.
- Gated on `mode === "deterministic"` so AI-driven failures fail fast instead of hanging.
- Falls back to fail-fast behavior if `GET /settings` can't be read.

Changes
- `src/api/types.ts`: `WorkflowSummarizationResource`, `SettingsResource`, `last_summarization_*` on `WorkflowResource`, `"summarization"` event type, `"other"` artifact type.
- `src/api/client.ts`: `getSettings()`, `getWorkflowSummarization()`, `pollWorkflowRepairChain()`; extracted a shared `sleep()` helper.
- `src/commands/invoke.ts`: new `WorkflowOutcome`; failure path now follows the repair chain and reports auto-repaired / app-issue outcomes.
- `README.md`: documents the new `--wait` repair behavior.
- Registers the previously-unwired `skills` command.

Testing
`npm run typecheck` and `npm run build` pass. Not yet smoke-tested live against a failing deterministic workflow with auto-repair on; recommend a manual run (`getlark workflows invoke --workflow-ids <det-wf> --wait --verbose`) before relying on it in CI.

Note
Medium Risk
CI exit semantics change for failed deterministic runs with auto-repair enabled; repair-chain polling depends on event ordering and summarization ownership matching, with conservative failure when settings cannot be loaded.

Overview
`workflows invoke --wait` now treats a failed run as non-final when the account has auto-repair for deterministic workflows: it polls the post-failure summarization → repair → re-execution timeline (via workflow events + summarization detail), exits 0 if the re-run passes (logged as auto-repaired), and fails on app-issue classification, broken repair, or a failed re-run. AI-driven workflows and accounts without auto-repair (or unreadable `GET /settings`) keep fail-fast behavior; one `--timeout` covers execution and the repair chain. Exit handling drops the outer timeout race in favor of per-workflow deadlines and prioritizes exit 2 on timeout.

The API layer adds `getSettings`, `getWorkflowSummarization`, `pollWorkflowRepairChain`, and types for summarizations/settings/events. `getlark skills install` is registered (wraps `npx skills add getlark/skills`); README documents skills install and CI repair behavior.