Wait out auto-repair chain before failing on invoke --wait by jackcbrown89 · Pull Request #5 · getlark/cli

jackcbrown89 · 2026-05-29T18:45:33Z

What

getlark workflows invoke --wait no longer counts a failed execution as a failure right away. When the account has auto-repair enabled (GET /settings → auto_repair_deterministic_workflows_enabled) and the workflow is deterministic, the CLI follows the post-failure repair chain before deciding the exit code.

execution fails
  ├─ auto-repair disabled OR ai_driven workflow → FAIL now (unchanged behavior)
  └─ else follow the event timeline:
       summarization category "app_issue"   → FAIL (genuine app defect)
       summarization not successful          → FAIL (inconclusive)
       repair not successful                 → FAIL
       repair OK + re-run passes             → PASS, reported as auto-repaired
       repair OK + re-run fails              → FAIL

Exit codes are unchanged (0 success incl. self-healed, 1 failure/cancelled, 2 timeout, 3 unexpected). A single --timeout now covers the whole wait, repair chain included.

How

Drives the chain off GET /workflows/{id}/events (the ordered timeline) rather than juggling last_* pointers; only the summarization is fetched in detail (for its category), and it's verified to belong to the failed execution.
Gates on mode === "deterministic" so AI-driven failures fail fast instead of hanging.
Inconclusive verdicts (summarization/repair errored) → FAIL (conservative).
Falls back to the original fail-fast behavior if GET /settings can't be read.

Changes

src/api/types.ts — WorkflowSummarizationResource, SettingsResource, last_summarization_* on WorkflowResource, "summarization" event type, "other" artifact type.
src/api/client.ts — getSettings(), getWorkflowSummarization(), pollWorkflowRepairChain(); extracted a shared sleep() helper.
src/commands/invoke.ts — new WorkflowOutcome; failure path now follows the repair chain and reports auto-repaired / app-issue outcomes.
README.md — documents the new --wait repair behavior.
Also wires up the previously-unregistered skills command.

Testing

npm run typecheck and npm run build pass. Not yet smoke-tested live against a failing deterministic workflow with auto-repair on — recommend a manual run (getlark workflows invoke --workflow-ids <det-wf> --wait --verbose) before relying on it in CI.

Note

Medium Risk
CI exit semantics change for failed deterministic runs with auto-repair enabled; repair-chain polling depends on event ordering and summarization ownership matching, with conservative failure when settings cannot be loaded.

Overview
workflows invoke --wait now treats a failed run as non-final when the account has auto-repair for deterministic workflows: it polls the post-failure summarization → repair → re-execution timeline (via workflow events + summarization detail), exits 0 if the re-run passes (logged as auto-repaired), and fails on app-issue classification, broken repair, or a failed re-run. AI-driven workflows and accounts without auto-repair (or unreadable GET /settings) keep fail-fast behavior; one --timeout covers execution and the repair chain. Exit handling drops the outer timeout race in favor of per-workflow deadlines and prioritizes exit 2 on timeout.

The API layer adds getSettings, getWorkflowSummarization, pollWorkflowRepairChain, and types for summarizations/settings/events. getlark skills install is registered (wraps npx skills add getlark/skills); README documents skills install and CI repair behavior.

^{Reviewed by Cursor Bugbot for commit c6b1fc1. Bugbot is set up for automated code reviews on this repo. Configure here.}

When a workflow execution fails, `workflows invoke --wait` no longer counts it as a failure immediately. If the account has auto-repair enabled and the workflow is deterministic, the CLI now follows the post-failure event timeline (summarization -> repair -> re-execution) before deciding the exit code: - summarization classifies it as an app issue -> fail - summarization is not successful (inconclusive) -> fail - repair is not successful -> fail - repair succeeds and the re-run passes -> pass (self-healed) - repair succeeds but the re-run fails -> fail AI-driven workflows and accounts without auto-repair keep the previous fail-fast behavior. The `--timeout` now covers the whole wait, including the repair chain. Adds GET /settings and GET /workflows/{id}/summarizations/{id} to the client, a pollWorkflowRepairChain() poller, the WorkflowSummarization/Settings types, and the last_summarization_* fields on WorkflowResource. Also registers the previously-unwired `skills` command.

github-actions · 2026-05-29T18:47:51Z

Proposed Lark tests

1. getlark skills install subcommand is recognized and delegates to npx

The diff adds a brand-new skills install subcommand (src/commands/skills.ts) that delegates to npx -y skills add getlark/skills. No existing workflow covers this command. This test verifies the subcommand is wired up in the CLI and that the delegation to npx is observable from outside the process.

2. invoke --wait follows auto-repair chain and exits 0 on self-heal

The diff changes invoke --wait so that when a deterministic workflow fails on an account with auto-repair enabled, the CLI waits out the summarization → repair → re-execution chain before deciding the exit code. If the re-execution passes, the CLI now exits 0 and prints an 'auto-repaired' message. Before this diff, it would exit 1 immediately after the first failed execution. This test covers that new happy path.

3. invoke --wait labels app-issue failures distinctly in stderr

The diff changes the failure output so that when the auto-repair summarization classifies the root cause as an app defect (category: 'app_issue'), the CLI prints 'executed with failure (app issue)' instead of the generic 'executed with failure'. This test verifies that label appears and that the CLI still exits non-zero in this case.

Reply with /create-workflows to materialize all, or /create-workflows 1 3 to pick specific proposals.

jackcbrown89

Adversarial review — auto-repair chain on `invoke --wait`

Intent is sound and the README documents it well. Typecheck passes clean. A few findings, posted inline. Summary by severity:

High — contract/reliability

#1 Timeout silently exits 0 unless --timeout is explicitly passed. pollWorkflowRepairChain/pollWorkflowExecution signal timeout via throw TimeoutError, but that throw is consumed by Promise.allSettled (which never rejects), lands in the rejected branch as a console.error, and is not pushed to failedWorkflowIds — so the run falls through to process.exit(0). The only path that yields the documented exit-2 is the outer timeoutPromise race, which is only created if (cmdOpts.timeout). So --wait alone (default 600s) → a timed-out/hung run passes CI. The catch (TimeoutError) → exit(2) block is effectively dead for this command. (Partly pre-existing, but the new feature leans entirely on TimeoutError for the repair chain.)

Medium — correctness & simplification

#2 The "raced-ahead summarization" guard can spin until timeout instead of recovering — see inline on client.ts.
#3 Two overlapping timeout mechanisms. The branch threads a single deadline through both polls (good), making the inner timeout authoritative — the outer Promise.race(..., timeoutPromise) is now redundant and is the only thing honoring exit-2 (#1). Routing TimeoutError → exit 2 in the result loop fixes #1 and lets the outer race be deleted: one change, fewer layers.
#4 The verdict is reconstructed client-side from the event timeline with fragile assumptions — see inline on client.ts. If the backend can return the repair verdict directly, most of this 110-line state machine could be deleted. Worth confirming before it hardens into a contract.

Low

#5 skills install spawn behavior on Windows — see inline.
#6 reexecution_failed returns the summarization summary rather than the re-execution's own failure summary — see inline.

Verified: typecheck clean; new types consistent and used; skills follows the documented (program) registration pattern; success/repaired/failure/cancelled exit paths and messaging match the README.

jackcbrown89 · 2026-05-29T19:46:40Z

          }

-          let workflowExecutionResults: PromiseSettledResult<WorkflowExecutionResource>[] =
+          let workflowExecutionResults: PromiseSettledResult<WorkflowOutcome>[] =


#1 + #3 (High): a repair-chain timeout exits 0 unless --timeout is explicitly passed.

pollWorkflowExecution/pollWorkflowRepairChain throw TimeoutError expecting exit code 2 (the documented contract). But that throw happens inside the promises consumed by Promise.allSettled, which never rejects — it becomes a {status: "rejected"} result handled at the else branch below (console.error("Error: " + result.reason)) and is not added to failedWorkflowIds. With nothing failed/cancelled, the run hits process.exit(0).

The only path that produces exit 2 is the outer timeoutPromise race — created only if (cmdOpts.timeout). So --wait alone (default 600s) → a timed-out/hung run silently passes CI, and the catch (TimeoutError) → exit(2) block is dead for this command.

Suggested fix (also resolves #3): detect TimeoutError in the rejected branch and process.exit(2), then delete the now-redundant outer Promise.race/timeoutPromise (the inner deadline is already authoritative). One change, fewer layers, contract honored regardless of --timeout.

jackcbrown89 · 2026-05-29T19:46:40Z

+      const elapsedMs = Date.now() - startTime;
+      await onPoll?.(stage, elapsedMs);
+
+      const summ = chain.find((e) => e.event_type === "summarization");


#2 (Medium): the raced-ahead guard can spin until timeout instead of recovering.

chain.find(e => e.event_type === "summarization") returns the oldest summarization after failedAt. If that one belongs to a different execution (the exact race the comment below at L617-619 guards against), the code continues and retries — but every retry re-finds the same oldest event, so it never advances to our summarization and just times out. To actually "keep waiting for ours" you'd need to consider all summarization candidates and match workflow_execution_id, not only the first.

jackcbrown89 · 2026-05-29T19:46:40Z

+            summary: null,
+          };
+        }
+        const detail = await this.getWorkflowSummarization(workflowId, summ.id);


#4 (Medium): verdict reconstructed client-side from the event timeline — fragile assumptions worth confirming against the API.

This assumes a workflow event's id equals the underlying summarization resource id. If events carry their own ids, getWorkflowSummarization(workflowId, summ.id) 404s every poll.

Ordering relies on millisecond timestamp comparisons with mixed strict/non-strict operators (> failedAt, >= summ.created_at, >= repair.stopped_at). The original-execution exclusion is double-guarded by the repair.stopped_at check today, but it's brittle.

limit: 50, newest-first, no pagination — fine for a fresh single chain, silently lossy if the window fills (e.g. concurrent invocations of the same workflow).

If the platform can expose the repair verdict directly (on the execution or a single endpoint), most of this ~110-line state machine could be deleted. Worth a backend conversation before it hardens into a client-side contract.

jackcbrown89 · 2026-05-29T19:46:40Z

+              result: "failure",
+              reason: "reexecution_failed",
+              executionId: reExecution.id,
+              summary: detail.summary,


#6 (Low): reexecution_failed returns summary: detail.summary (the summarization's text) rather than the re-execution's own failure summary, which would be more actionable for the user reading the CI output.

jackcbrown89 · 2026-05-29T19:46:40Z

+    .action(() => {
+      const cmdArgs = ["-y", "skills", "add", SKILLS_PACKAGE];
+
+      const child = spawn("npx", cmdArgs, {


#5 (Low): spawn("npx", …, { shell: false }) fails with ENOENT on Windows (npx resolves to npx.cmd). The published @getlark/cli will hit this on Windows. The child.on("error") handler degrades it to a clear message rather than a crash, and shell: false is the right call for safety — so this is acceptable, but consider shell: process.platform === "win32" or documenting the limitation.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit c6b1fc1. Configure here.}

cursor · 2026-05-29T20:44:00Z

+        limit: 50,
+      });
+      const chain = workflow_events
+        .filter((e) => at(e.created_at) > failedAt)


Strict > filter misses same-timestamp summarization events

Medium Severity

The event filter at at(e.created_at) > failedAt uses strict greater-than, while the subsequent searches for repair and re-execution events (lines 660, 678) use non-strict >=. If the backend creates the summarization event at the exact same timestamp as the execution's stopped_at (e.g., triggered synchronously or within the same clock tick), the filter permanently excludes it. Every poll re-fetches and re-filters with the same strict >, so the summarization is never found, causing the chain to spin until timeout (exit code 2) instead of following the repair path.

Additional Locations (1)

src/api/client.ts#L659-L660

^{Reviewed by Cursor Bugbot for commit c6b1fc1. Configure here.}

github-actions Bot added the lark:proposing-tests Lark is drafting test proposals for this PR label May 29, 2026

github-actions Bot added lark:tests-proposed Lark has proposed end-to-end tests for this PR and removed lark:proposing-tests Lark is drafting test proposals for this PR labels May 29, 2026

cursor Bot reviewed May 29, 2026

View reviewed changes

Comment thread src/api/client.ts

Comment thread src/api/client.ts Outdated

jackcbrown89 commented May 29, 2026

View reviewed changes

address feedback

c6b1fc1

cursor Bot reviewed May 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wait out auto-repair chain before failing on invoke --wait#5

Wait out auto-repair chain before failing on invoke --wait#5
jackcbrown89 wants to merge 2 commits into
mainfrom
jb-wait-for-repairs

jackcbrown89 commented May 29, 2026 •

edited by cursor Bot

Loading

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

Uh oh!

Uh oh!

jackcbrown89 left a comment

Uh oh!

jackcbrown89 May 29, 2026

Uh oh!

jackcbrown89 May 29, 2026

Uh oh!

jackcbrown89 May 29, 2026

Uh oh!

jackcbrown89 May 29, 2026

Uh oh!

jackcbrown89 May 29, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jackcbrown89 commented May 29, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

How

Changes

Testing

Uh oh!

github-actions Bot commented May 29, 2026

Proposed Lark tests

Uh oh!

Uh oh!

Uh oh!

jackcbrown89 left a comment

Choose a reason for hiding this comment

Adversarial review — auto-repair chain on invoke --wait

Uh oh!

jackcbrown89 May 29, 2026

Choose a reason for hiding this comment

Uh oh!

jackcbrown89 May 29, 2026

Choose a reason for hiding this comment

Uh oh!

jackcbrown89 May 29, 2026

Choose a reason for hiding this comment

Uh oh!

jackcbrown89 May 29, 2026

Choose a reason for hiding this comment

Uh oh!

jackcbrown89 May 29, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 29, 2026

Choose a reason for hiding this comment

Strict > filter misses same-timestamp summarization events

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jackcbrown89 commented May 29, 2026 •

edited by cursor Bot

Loading

Adversarial review — auto-repair chain on `invoke --wait`

Strict `>` filter misses same-timestamp summarization events