Skip to content

Wait out auto-repair chain before failing on invoke --wait#5

Open
jackcbrown89 wants to merge 2 commits into
mainfrom
jb-wait-for-repairs
Open

Wait out auto-repair chain before failing on invoke --wait#5
jackcbrown89 wants to merge 2 commits into
mainfrom
jb-wait-for-repairs

Conversation

@jackcbrown89
Copy link
Copy Markdown
Contributor

@jackcbrown89 jackcbrown89 commented May 29, 2026

What

getlark workflows invoke --wait no longer counts a failed execution as a failure right away. When the account has auto-repair enabled (GET /settingsauto_repair_deterministic_workflows_enabled) and the workflow is deterministic, the CLI follows the post-failure repair chain before deciding the exit code.

execution fails
  ├─ auto-repair disabled OR ai_driven workflow → FAIL now (unchanged behavior)
  └─ else follow the event timeline:
       summarization category "app_issue"   → FAIL (genuine app defect)
       summarization not successful          → FAIL (inconclusive)
       repair not successful                 → FAIL
       repair OK + re-run passes             → PASS, reported as auto-repaired
       repair OK + re-run fails              → FAIL

Exit codes are unchanged (0 success incl. self-healed, 1 failure/cancelled, 2 timeout, 3 unexpected). A single --timeout now covers the whole wait, repair chain included.

How

  • Drives the chain off GET /workflows/{id}/events (the ordered timeline) rather than juggling last_* pointers; only the summarization is fetched in detail (for its category), and it's verified to belong to the failed execution.
  • Gates on mode === "deterministic" so AI-driven failures fail fast instead of hanging.
  • Inconclusive verdicts (summarization/repair errored) → FAIL (conservative).
  • Falls back to the original fail-fast behavior if GET /settings can't be read.

Changes

  • src/api/types.tsWorkflowSummarizationResource, SettingsResource, last_summarization_* on WorkflowResource, "summarization" event type, "other" artifact type.
  • src/api/client.tsgetSettings(), getWorkflowSummarization(), pollWorkflowRepairChain(); extracted a shared sleep() helper.
  • src/commands/invoke.ts — new WorkflowOutcome; failure path now follows the repair chain and reports auto-repaired / app-issue outcomes.
  • README.md — documents the new --wait repair behavior.
  • Also wires up the previously-unregistered skills command.

Testing

npm run typecheck and npm run build pass. Not yet smoke-tested live against a failing deterministic workflow with auto-repair on — recommend a manual run (getlark workflows invoke --workflow-ids <det-wf> --wait --verbose) before relying on it in CI.


Note

Medium Risk
CI exit semantics change for failed deterministic runs with auto-repair enabled; repair-chain polling depends on event ordering and summarization ownership matching, with conservative failure when settings cannot be loaded.

Overview
workflows invoke --wait now treats a failed run as non-final when the account has auto-repair for deterministic workflows: it polls the post-failure summarization → repair → re-execution timeline (via workflow events + summarization detail), exits 0 if the re-run passes (logged as auto-repaired), and fails on app-issue classification, broken repair, or a failed re-run. AI-driven workflows and accounts without auto-repair (or unreadable GET /settings) keep fail-fast behavior; one --timeout covers execution and the repair chain. Exit handling drops the outer timeout race in favor of per-workflow deadlines and prioritizes exit 2 on timeout.

The API layer adds getSettings, getWorkflowSummarization, pollWorkflowRepairChain, and types for summarizations/settings/events. getlark skills install is registered (wraps npx skills add getlark/skills); README documents skills install and CI repair behavior.

Reviewed by Cursor Bugbot for commit c6b1fc1. Bugbot is set up for automated code reviews on this repo. Configure here.

When a workflow execution fails, `workflows invoke --wait` no longer counts
it as a failure immediately. If the account has auto-repair enabled and the
workflow is deterministic, the CLI now follows the post-failure event
timeline (summarization -> repair -> re-execution) before deciding the exit
code:

- summarization classifies it as an app issue  -> fail
- summarization is not successful (inconclusive) -> fail
- repair is not successful                       -> fail
- repair succeeds and the re-run passes          -> pass (self-healed)
- repair succeeds but the re-run fails           -> fail

AI-driven workflows and accounts without auto-repair keep the previous
fail-fast behavior. The `--timeout` now covers the whole wait, including the
repair chain.

Adds GET /settings and GET /workflows/{id}/summarizations/{id} to the client,
a pollWorkflowRepairChain() poller, the WorkflowSummarization/Settings types,
and the last_summarization_* fields on WorkflowResource. Also registers the
previously-unwired `skills` command.
@github-actions github-actions Bot added the lark:proposing-tests Lark is drafting test proposals for this PR label May 29, 2026
@github-actions
Copy link
Copy Markdown

Proposed Lark tests

1. getlark skills install subcommand is recognized and delegates to npx

The diff adds a brand-new skills install subcommand (src/commands/skills.ts) that delegates to npx -y skills add getlark/skills. No existing workflow covers this command. This test verifies the subcommand is wired up in the CLI and that the delegation to npx is observable from outside the process.


2. invoke --wait follows auto-repair chain and exits 0 on self-heal

The diff changes invoke --wait so that when a deterministic workflow fails on an account with auto-repair enabled, the CLI waits out the summarization → repair → re-execution chain before deciding the exit code. If the re-execution passes, the CLI now exits 0 and prints an 'auto-repaired' message. Before this diff, it would exit 1 immediately after the first failed execution. This test covers that new happy path.


3. invoke --wait labels app-issue failures distinctly in stderr

The diff changes the failure output so that when the auto-repair summarization classifies the root cause as an app defect (category: 'app_issue'), the CLI prints 'executed with failure (app issue)' instead of the generic 'executed with failure'. This test verifies that label appears and that the CLI still exits non-zero in this case.

Reply with /create-workflows to materialize all, or /create-workflows 1 3 to pick specific proposals.

@github-actions github-actions Bot added lark:tests-proposed Lark has proposed end-to-end tests for this PR and removed lark:proposing-tests Lark is drafting test proposals for this PR labels May 29, 2026
Comment thread src/api/client.ts
Comment thread src/api/client.ts Outdated
Copy link
Copy Markdown
Contributor Author

@jackcbrown89 jackcbrown89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial review — auto-repair chain on invoke --wait

Intent is sound and the README documents it well. Typecheck passes clean. A few findings, posted inline. Summary by severity:

High — contract/reliability

  • #1 Timeout silently exits 0 unless --timeout is explicitly passed. pollWorkflowRepairChain/pollWorkflowExecution signal timeout via throw TimeoutError, but that throw is consumed by Promise.allSettled (which never rejects), lands in the rejected branch as a console.error, and is not pushed to failedWorkflowIds — so the run falls through to process.exit(0). The only path that yields the documented exit-2 is the outer timeoutPromise race, which is only created if (cmdOpts.timeout). So --wait alone (default 600s) → a timed-out/hung run passes CI. The catch (TimeoutError) → exit(2) block is effectively dead for this command. (Partly pre-existing, but the new feature leans entirely on TimeoutError for the repair chain.)

Medium — correctness & simplification

  • #2 The "raced-ahead summarization" guard can spin until timeout instead of recovering — see inline on client.ts.
  • #3 Two overlapping timeout mechanisms. The branch threads a single deadline through both polls (good), making the inner timeout authoritative — the outer Promise.race(..., timeoutPromise) is now redundant and is the only thing honoring exit-2 (#1). Routing TimeoutError → exit 2 in the result loop fixes #1 and lets the outer race be deleted: one change, fewer layers.
  • #4 The verdict is reconstructed client-side from the event timeline with fragile assumptions — see inline on client.ts. If the backend can return the repair verdict directly, most of this 110-line state machine could be deleted. Worth confirming before it hardens into a contract.

Low

  • #5 skills install spawn behavior on Windows — see inline.
  • #6 reexecution_failed returns the summarization summary rather than the re-execution's own failure summary — see inline.

Verified: typecheck clean; new types consistent and used; skills follows the documented (program) registration pattern; success/repaired/failure/cancelled exit paths and messaging match the README.

Comment thread src/commands/invoke.ts Outdated
}

let workflowExecutionResults: PromiseSettledResult<WorkflowExecutionResource>[] =
let workflowExecutionResults: PromiseSettledResult<WorkflowOutcome>[] =
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1 + #3 (High): a repair-chain timeout exits 0 unless --timeout is explicitly passed.

pollWorkflowExecution/pollWorkflowRepairChain throw TimeoutError expecting exit code 2 (the documented contract). But that throw happens inside the promises consumed by Promise.allSettled, which never rejects — it becomes a {status: "rejected"} result handled at the else branch below (console.error("Error: " + result.reason)) and is not added to failedWorkflowIds. With nothing failed/cancelled, the run hits process.exit(0).

The only path that produces exit 2 is the outer timeoutPromise race — created only if (cmdOpts.timeout). So --wait alone (default 600s) → a timed-out/hung run silently passes CI, and the catch (TimeoutError) → exit(2) block is dead for this command.

Suggested fix (also resolves #3): detect TimeoutError in the rejected branch and process.exit(2), then delete the now-redundant outer Promise.race/timeoutPromise (the inner deadline is already authoritative). One change, fewer layers, contract honored regardless of --timeout.

Comment thread src/api/client.ts Outdated
const elapsedMs = Date.now() - startTime;
await onPoll?.(stage, elapsedMs);

const summ = chain.find((e) => e.event_type === "summarization");
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#2 (Medium): the raced-ahead guard can spin until timeout instead of recovering.

chain.find(e => e.event_type === "summarization") returns the oldest summarization after failedAt. If that one belongs to a different execution (the exact race the comment below at L617-619 guards against), the code continues and retries — but every retry re-finds the same oldest event, so it never advances to our summarization and just times out. To actually "keep waiting for ours" you'd need to consider all summarization candidates and match workflow_execution_id, not only the first.

Comment thread src/api/client.ts Outdated
summary: null,
};
}
const detail = await this.getWorkflowSummarization(workflowId, summ.id);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#4 (Medium): verdict reconstructed client-side from the event timeline — fragile assumptions worth confirming against the API.

  • This assumes a workflow event's id equals the underlying summarization resource id. If events carry their own ids, getWorkflowSummarization(workflowId, summ.id) 404s every poll.
  • Ordering relies on millisecond timestamp comparisons with mixed strict/non-strict operators (> failedAt, >= summ.created_at, >= repair.stopped_at). The original-execution exclusion is double-guarded by the repair.stopped_at check today, but it's brittle.
  • limit: 50, newest-first, no pagination — fine for a fresh single chain, silently lossy if the window fills (e.g. concurrent invocations of the same workflow).

If the platform can expose the repair verdict directly (on the execution or a single endpoint), most of this ~110-line state machine could be deleted. Worth a backend conversation before it hardens into a client-side contract.

Comment thread src/api/client.ts Outdated
result: "failure",
reason: "reexecution_failed",
executionId: reExecution.id,
summary: detail.summary,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#6 (Low): reexecution_failed returns summary: detail.summary (the summarization's text) rather than the re-execution's own failure summary, which would be more actionable for the user reading the CI output.

Comment thread src/commands/skills.ts
.action(() => {
const cmdArgs = ["-y", "skills", "add", SKILLS_PACKAGE];

const child = spawn("npx", cmdArgs, {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#5 (Low): spawn("npx", …, { shell: false }) fails with ENOENT on Windows (npx resolves to npx.cmd). The published @getlark/cli will hit this on Windows. The child.on("error") handler degrades it to a clear message rather than a crash, and shell: false is the right call for safety — so this is acceptable, but consider shell: process.platform === "win32" or documenting the limitation.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit c6b1fc1. Configure here.

Comment thread src/api/client.ts
limit: 50,
});
const chain = workflow_events
.filter((e) => at(e.created_at) > failedAt)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strict > filter misses same-timestamp summarization events

Medium Severity

The event filter at at(e.created_at) > failedAt uses strict greater-than, while the subsequent searches for repair and re-execution events (lines 660, 678) use non-strict >=. If the backend creates the summarization event at the exact same timestamp as the execution's stopped_at (e.g., triggered synchronously or within the same clock tick), the filter permanently excludes it. Every poll re-fetches and re-filters with the same strict >, so the summarization is never found, causing the chain to spin until timeout (exit code 2) instead of following the repair path.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit c6b1fc1. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lark:tests-proposed Lark has proposed end-to-end tests for this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant