feat(workflow-executor): use stored OAuth credentials for MCP steps [PRD-367]#1665
Conversation
| }); | ||
|
|
||
| if (result.refreshToken) { | ||
| await this.persistRotatedRefreshToken(credential, result.refreshToken); |
There was a problem hiding this comment.
Claude Opus 4.8 — [Should fix] On the invalid_grant retry path, runGrantWithRotationRetry re-reads latest and grants with it, but this write-back passes the pre-retry credential for the non-token fields (clientId / clientSecret / tokenEndpoint / scopes). refreshTokenEnc and encKeyVersion are correct (from the fresh encrypt), so this is not token/key corruption — but if a peer concurrently re-deposited new config, those fields get reverted: a narrow lost-update on the exact multi-instance path this logic guards. Thread the credential actually used for the successful grant through to the write-back.
There was a problem hiding this comment.
Claude Opus 4.8 — resolved (2b5e5e5). Agreed. runGrantWithRotationRetry now returns the credential whose token produced the grant, and refreshAndCache writes the rotated token back onto that (current) row — so a concurrent re-deposit is no longer partially reverted. Added a regression test asserting the write-back uses the re-read credential fields.
| // Classifies errors surfaced while connecting to or calling an MCP server. The MCP SDK / HTTP | ||
| // transport reports failures in several shapes (a numeric status field, or only a message string), | ||
| // so the checks walk the cause chain and inspect both structured status and the message text. | ||
| const AUTH_STATUSES = new Set([401, 403]); |
There was a problem hiding this comment.
Claude Opus 4.8 — [Should fix] 403 is treated like 401, so a 403 on the tool call forces a refresh and (on a second 403) pauses with needs-oauth-reauth. The spec specifies retry on 401; a 403 is usually insufficient-scope, which re-consent often cannot fix, so the user re-auths back into the same 403. Confirm 403 should drive the reauth flow, or narrow the refresh/reauth trigger to 401 and surface 403 as a normal failure. (Classifying 403 as kind: auth is fine — this is only about the reauth trigger.)
There was a problem hiding this comment.
Claude Opus 4.8 — resolved (8d441d2). Narrowed to 401 per the spec: 403 is no longer treated as a refreshable auth error, so it surfaces as an ordinary failure instead of forcing a refresh + reauth loop. Treating a scope-403 as a re-consent opportunity is a separate, out-of-scope enhancement. isMcpAuthError / classifyMcpLoadError now match 401/unauthorized only; tests updated.
| // transport reports failures in several shapes (a numeric status field, or only a message string), | ||
| // so the checks walk the cause chain and inspect both structured status and the message text. | ||
| const AUTH_STATUSES = new Set([401, 403]); | ||
| const AUTH_PATTERN = /\b40[13]\b|unauthorized|forbidden/i; |
There was a problem hiding this comment.
Claude Opus 4.8 — [Preferential] This message-text fallback can match a bare 401/403 appearing for non-auth reasons in an error body (IDs, counts, timestamps). It only runs when no structured status field is present (those are checked first), so the blast radius is small — flagging as a known fuzzy edge, not a blocker.
There was a problem hiding this comment.
Claude Opus 4.8 — acknowledged, keeping as-is. Deliberate last-resort fallback, used only when no structured status field is present (code/status/statusCode are checked first), so the misclassification window is narrow and low-probability. Tightening the message regex risks false negatives on real 401s that carry the status only in text. Leaving it as the documented fuzzy edge.
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
8d441d2 to
0e05efe
Compare
10 new issues
|
|
Coverage Impact ⬆️ Merging this pull request will increase total coverage on Modified Files with Diff Coverage (20) 🤖 Increase coverage with AI coding...🚦 See full report on Qlty Cloud » 🛟 Help
|
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
d4555ee to
cf71d69
Compare
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cf71d69 to
55309c7
Compare
d42828e to
30b063c
Compare
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
55309c7 to
842d468
Compare
|
Agent code review — Claude Opus 4.8 (2026-06-18) Reviewed the folded-in commits Verified:
|
e121a1a to
270dbcc
Compare
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
11aba95 to
1f58924
Compare
| ): Promise<void> { | ||
| const encrypted = this.encryption.encrypt(refreshToken); | ||
|
|
||
| try { | ||
| await this.store.upsert({ |
There was a problem hiding this comment.
🟡 Medium oauth/token-service.ts:179
The this.encryption.encrypt call in persistRotatedRefreshToken is outside the try block, so an encryption failure causes getAccessToken to throw instead of returning the already-obtained valid access token. Move the encrypt call inside the try so it is treated as a non-fatal write-back failure.
): Promise<void> {
- const encrypted = this.encryption.encrypt(refreshToken);
-
try {
+ const encrypted = this.encryption.encrypt(refreshToken);
await this.store.upsert({🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @packages/workflow-executor/src/oauth/token-service.ts around lines 179-183:
The `this.encryption.encrypt` call in `persistRotatedRefreshToken` is outside the `try` block, so an encryption failure causes `getAccessToken` to throw instead of returning the already-obtained valid access token. Move the encrypt call inside the `try` so it is treated as a non-fatal write-back failure.
Evidence trail:
packages/workflow-executor/src/oauth/token-service.ts lines 176-202 (REVIEWED_COMMIT): `encrypt` on line 180 is before `try` on line 182; comment on lines 194-196 states write-back failures are non-fatal. packages/workflow-executor/src/oauth/token-service.ts lines 89-110 (REVIEWED_COMMIT): `persistRotatedRefreshToken` is called on line 106, and if it throws, `return result.accessToken` on line 109 is never reached. packages/workflow-executor/src/crypto/credential-encryption.ts lines 48-60 and 79-88 (REVIEWED_COMMIT): `encrypt` calls `deriveKey()` which can throw `ExecutorEncryptionKeyMissingError`.
270dbcc to
2974d29
Compare
At an oauth2 MCP step the executor looks up the stored credential by (user, server), runs the refresh-token grant against the stored token endpoint behind an expiry-skew cache, injects the bearer token before connecting, retries once on a 401 across list-tools and the tool call, and pauses for re-authentication when no usable credential exists or the refresh is rejected. Bearer and none steps are unchanged. Adds additive auth-error classification to the shared ai-proxy McpClient consumed by this path. Behaviour stays dormant until the orchestrator serves authType and the frontend ships (deploy orchestrator first), so it is safe to deploy alone. Depends on the PR1 credential store. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cutor [PRD-367] PR1 wired an in-memory credential store + deposit endpoint into buildInMemoryExecutor, so the previous "in-memory raises ConfigurationError for oauth2 steps" behavior was inconsistent: a credential could be deposited but never used. Wire an OAuthTokenService into the in-memory runner (sharing the same store instance the deposit endpoint writes to) so oauth2 steps work end-to-end in dev, matching the database executor. The token service is now a required RunnerConfig/RemoteToolFetcher collaborator (both executors provide it), so the unreachable ConfigurationError guard and its fetcher test are removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tials port [PRD-367] The PR1 rebase moved the credentials store interface + types from stores/mcp-oauth-credentials-store to ports/mcp-oauth-credentials-store (the store file now holds only the Database/InMemory implementations). Import McpOAuthCredentialsStore and StoredMcpOAuthCredential from the new port path so the package compiles against the rebased PR1 base. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion [PRD-367] PR1 dropped enc_key_version from the credential store (no version-aware decrypt path), so the rotation write-back no longer carries encKeyVersion. Per PRD-367 key-rotation handling, a decrypt failure with the encryption key PRESENT (auth-tag mismatch from a since-rotated/hard-swapped key) is recoverable: toGrantParams now classifies it as OAuthReauthRequiredError (needs-oauth-reauth) so re-consent re-deposits under the new key. A missing key (ExecutorEncryptionKeyMissingError) stays terminal — re-consent cannot help and a re-deposit would 503. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ting [PRD-367] New GET /list-mcp-tools?mcpServerId= route on the executor HTTP server for the orchestrator-engine MCP-server details page: resolves the caller's vault credential (user_id from the validated JWT, never the request), refreshes, injects the Bearer, and returns the server's tool definitions — reusing RemoteToolFetcher, no new fetch/refresh logic. A missing/unrefreshable credential returns a typed needs-oauth-reauth (409), not a generic error or empty list. Wired into both the database and in-memory executors so oauth2 tool listing works in dev too. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nses [PRD-367]
A literal JSON null (or other non-object) body from the token endpoint overwrote the {} parse default, so the subsequent payload.error / payload.access_token reads threw a TypeError instead of the typed OAuthRefreshError. Keep the {} default for non-object bodies so the status checks still surface a typed OAuthRefreshError.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1f58924 to
63123b6
Compare
| tokenEndpointAuthMethod: credential.tokenEndpointAuthMethod, | ||
| scopes: credential.scopes, | ||
| }); | ||
| } catch (error) { |
There was a problem hiding this comment.
🟡 Medium oauth/token-service.ts:194
persistRotatedRefreshToken() logs but swallows store.upsert() failures, leaving the old refresh token in storage. The authorization server has already rotated the token, so the stored row is now invalid. On the next refresh, runGrantWithRotationRetry() re-reads the stale row, sees wasRotated === false, and incorrectly raises OAuthReauthRequiredError instead of retrying. A transient database failure therefore forces unnecessary user re-authorization.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @packages/workflow-executor/src/oauth/token-service.ts around line 194:
`persistRotatedRefreshToken()` logs but swallows `store.upsert()` failures, leaving the old refresh token in storage. The authorization server has already rotated the token, so the stored row is now invalid. On the next refresh, `runGrantWithRotationRetry()` re-reads the stale row, sees `wasRotated === false`, and incorrectly raises `OAuthReauthRequiredError` instead of retrying. A transient database failure therefore forces unnecessary user re-authorization.
| ctx.body = { | ||
| tools: tools.map(tool => ({ name: tool.base.name, description: tool.base.description })), | ||
| }; | ||
| } catch (err) { |
There was a problem hiding this comment.
🟡 Medium http/executor-http-server.ts:322
handleListMcpTools only intercepts OAuthReauthRequiredError in its catch block; any other error (including ExecutorEncryptionKeyMissingError, which OAuthTokenService deliberately throws for operator-misconfiguration) falls through to the generic error middleware and returns 500. When the encryption key is missing, callers receive an opaque "Internal server error" instead of the intended service-unavailable response that would prompt an admin to provision the key.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @packages/workflow-executor/src/http/executor-http-server.ts around line 322:
`handleListMcpTools` only intercepts `OAuthReauthRequiredError` in its catch block; any other error (including `ExecutorEncryptionKeyMissingError`, which `OAuthTokenService` deliberately throws for operator-misconfiguration) falls through to the generic error middleware and returns 500. When the encryption key is missing, callers receive an opaque "Internal server error" instead of the intended service-unavailable response that would prompt an admin to provision the key.
| const key = `${userId}:${mcpServerId}`; | ||
| const forceRefresh = options?.forceRefresh ?? false; | ||
|
|
||
| if (!forceRefresh) { |
There was a problem hiding this comment.
🟡 Medium oauth/token-service.ts:65
getAccessToken() returns a cached token from this.cache without invalidating when the underlying credential is deleted or updated via the HTTP endpoints (/mcp-oauth-credentials). Since OAuthTokenService is a long-lived singleton and the endpoints only modify the store, a user who disconnects or reconnects an MCP server continues to receive the stale cached token until expiry — allowing API calls with the old authorization after disconnect and potentially the wrong account after reconnect. Consider invalidating the cache entry on credential updates/deletes, or checking store freshness before serving from cache.
Also found in 1 other location(s)
packages/workflow-executor/src/build-workflow-executor.ts:240
Passing the shared
mcpOAuthTokenServiceintoRunnerat line240makes workflow steps keep using any access token already cached inOAuthTokenServiceeven after/mcp-oauth-credentials/:mcpServerIddeletes the stored credential.getAccessToken()serves the cache before reading the store, and there is no cache invalidation on delete, so a user who disconnects an MCP server can still have subsequent steps execute against that server until the cached token expires.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @packages/workflow-executor/src/oauth/token-service.ts around line 65:
`getAccessToken()` returns a cached token from `this.cache` without invalidating when the underlying credential is deleted or updated via the HTTP endpoints (`/mcp-oauth-credentials`). Since `OAuthTokenService` is a long-lived singleton and the endpoints only modify the store, a user who disconnects or reconnects an MCP server continues to receive the stale cached token until expiry — allowing API calls with the old authorization after disconnect and potentially the wrong account after reconnect. Consider invalidating the cache entry on credential updates/deletes, or checking store freshness before serving from cache.
Also found in 1 other location(s):
- packages/workflow-executor/src/build-workflow-executor.ts:240 -- Passing the shared `mcpOAuthTokenService` into `Runner` at line `240` makes workflow steps keep using any access token already cached in `OAuthTokenService` even after `/mcp-oauth-credentials/:mcpServerId` deletes the stored credential. `getAccessToken()` serves the cache before reading the store, and there is no cache invalidation on delete, so a user who disconnects an MCP server can still have subsequent steps execute against that server until the cached token expires.
| @@ -74,6 +85,22 @@ export default class McpStepExecutor extends BaseStepExecutor<McpStepDefinition> | |||
| } | |||
|
|
|||
| protected async doExecute(): Promise<StepExecutionResult> { | |||
There was a problem hiding this comment.
🟠 High executors/mcp-step-executor.ts:87
When executeToolAndPersist() throws OAuthReauthRequiredError, the beforeCall hook has already persisted idempotencyPhase: 'executing' to the run store. doExecute() catches the error and converts it to an awaiting-input outcome, but never clears that marker. On retry after the user re-authorizes, checkIdempotency() finds idempotencyPhase: 'executing' and throws StepStateError, blocking resumption of the step. Consider updating the state to idempotencyPhase: 'done' (or removing the executing marker) before returning the awaiting-input outcome.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @packages/workflow-executor/src/executors/mcp-step-executor.ts around line 87:
When `executeToolAndPersist()` throws `OAuthReauthRequiredError`, the `beforeCall` hook has already persisted `idempotencyPhase: 'executing'` to the run store. `doExecute()` catches the error and converts it to an `awaiting-input` outcome, but never clears that marker. On retry after the user re-authorizes, `checkIdempotency()` finds `idempotencyPhase: 'executing'` and throws `StepStateError`, blocking resumption of the step. Consider updating the state to `idempotencyPhase: 'done'` (or removing the executing marker) before returning the `awaiting-input` outcome.

What & why
Executor-side runtime use of stored OAuth credentials for OAuth2-protected MCP servers (PRD-367, PR2). At an
oauth2MCP step the executor looks up the stored credential by(user, server), runs a refresh-token grant against the storedtoken_endpointbehind an expiry-skew cache, injects the Bearer token before connecting, retries once on an upstream 401 (covering both list-tools and the tool call), and pauses the stepawaiting-inputwithawaitingInputReason: needs-oauth-reauthwhen there is no usable credential or the refresh is rejected. Bearer/none steps are unchanged.Stacked PR / deploy ordering
feat/prd-367-pr1-executor-oauth-credentials(PR feat(workflow-executor): add OAuth credential store + deposit endpoint (PRD-367 PR1) #1619), notmain: PR2 depends on PR1's credential store / encryption / deposit endpoint, which is not yet merged. Review the diff against that branch.authType+ accepts the typedawaitingInputReason(PR1.5) and the frontend ships (PR3) — deploy the orchestrator first. Safe to deploy alone: the oauth2 path only activates onceauthType=oauth2is served; bearer/none is unchanged.Notable choices (large-PR annotation)
invalid_grant, not a Postgres row lock — the executor and token endpoints can sit behind a client VPN, so no DB lock is held across the refresh HTTP call. Row lock is the documented hardening path if a strict reuse-detection provider plus real cross-process contention appears.reloadWithFreshAuth) plus a typedOAuthReauthRequiredError -> awaiting-inputmapping; all token/credential/HTTP logic stays behindRemoteToolFetcherand the new OAuth token service.@forestadmin/ai-proxychange (consumed by the agent too) is additive-only (newloadToolsWithFailures/loadRemoteToolsWithFailures+mcp-auth-errorexport) plus behaviour-preserving delegation;authTypeis stripped in theMcpClientconstructor likeid.ConfigurationErrorfor oauth2 steps (no credential store wired).Tests
294 tests across the touched suites (ai-proxy 59 + workflow-executor 235); build, lint, tsc clean; >=90% line/stmt coverage on changed files.
Known limitation
A re-auth at tool-call time leaves the step at
idempotencyPhase=executing, so it needs a manual retry after re-auth; the common list-tools-time re-auth (pre-executor) resumes cleanly.Part of PRD-367.
🤖 Generated with Claude Code
Note
Add stored OAuth credential support for MCP steps in workflow executor
OAuthTokenServiceto retrieve, cache, and refresh OAuth access tokens per (user, MCP server), with mutex-serialized refresh, rotation persistence, andOAuthReauthRequiredErroron unrecoverable invalid_grant.RemoteToolFetcherto inject Bearer tokens for oauth2-typed MCP servers and retry tool loading with a forced token refresh on auth failure.McpStepExecutorto catchOAuthReauthRequiredErrorduring both tool fetching and tool invocation, pausing the step withawaiting-input/needs-oauth-reauthrather than failing it; a single reauth retry is attempted before pausing.loadRemoteToolsWithFailurestoMcpClient,AiClient, and all adapter layers so per-server auth/connection failures are classified and surfaced without aborting the entire tool load.awaitingInputReasonthroughMcpStepOutcomeSchemaandstep-outcome-to-update-step-mapperso the reason reaches the server update request.📊 Macroscope summarized 0e05efe. 21 files reviewed, 0 issues evaluated, 0 issues filtered, 0 comments posted
🗂️ Filtered Issues
No issues evaluated.