Adds retry support to the Amazon.Lambda.DurableExecution #2363
GarrettBeatty wants to merge 1 commit into
Pull request overview
Builds on PR #2360 to add retry support to the Amazon.Lambda.DurableExecution SDK. Failed steps can now be retried with configurable backoff and jitter via service-mediated retries (the SDK checkpoints a RETRY operation and suspends the Lambda so the user is not billed during backoff). Adds at-most-once semantics for non-idempotent steps via a synchronously-flushed START checkpoint that allows crash detection on replay.
Changes:
- New public retry API: `IRetryStrategy`, `RetryDecision`, `RetryStrategy` factories (`Default`/`Transient`/`None`/`Exponential`/`FromDelegate`), `JitterStrategy`, `StepSemantics`, and `StepConfig.RetryStrategy`/`StepConfig.Semantics`.
- `StepOperation` adds `PENDING` (retry-timer) and `STARTED` (AtMostOnce crash-recovery) replay arms, a `HandleStepFailureAsync` decision tree, and START-checkpoint emission (sync for AtMostOnce, fire-and-forget for AtLeastOnce).
- 21 new unit tests plus integration-test updates asserting `StepStarted` events and richer history logging.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| Config/IRetryStrategy.cs | New strategy interface + RetryDecision struct |
| Config/RetryStrategy.cs | ExponentialRetryStrategy, DelegateRetryStrategy, JitterStrategy, StepSemantics, factories |
| Config/StepConfig.cs | Adds RetryStrategy and Semantics properties |
| Internal/StepOperation.cs | PENDING/STARTED replay arms, retry decision tree, START-checkpoint emission |
| Internal/TerminationManager.cs | Adds RetryScheduled termination reason |
| Internal/CheckpointBatcher.cs | Doc-only update describing fire-and-forget semantics |
| Tests/RetryStrategyTests.cs | 14 unit tests for exponential math/jitter/filters/delegate |
| Tests/DurableContextTests.cs | 6 retry/AtMostOnce/Pending replay tests |
| Tests/DurableFunctionTests.cs | Updated to assert START + SUCCEED + WAIT-START flat sequence |
| IntegrationTests/*.cs | Add StepStarted-event assertions; richer history dump in DurableFunctionDeployment |
      var history = await deployment.WaitForHistoryAsync(
          arn!,
    -     h => (h.Events?.Count(e => e.StepSucceededDetails != null) ?? 0) >= 2
    +     h => (h.Events?.Count(e => e.EventType == EventType.StepStarted) ?? 0) >= 2
Now that we emit START steps (which retries need), the integration tests assert on them.
    COPY bin/publish/ ${LAMBDA_TASK_ROOT}

    ENTRYPOINT ["/var/task/bootstrap"]
    - /// Replay semantics — example: <c>await ctx.StepAsync(ChargeCard, "charge")</c>
    + /// Replay branches — example: <c>await ctx.StepAsync(ChargeCard, "charge")</c>
      /// <list type="bullet">
      /// <item>Fresh: no prior state → run func → emit SUCCEED → return result.</item>
In the previous PR only SUCCEEDED or FAILED mattered. For replays we now also need to track how many times the function has executed, which we derive from the number of STARTED steps.
    - // in this commit, so fall through and execute fresh. (Future work
    - // on retries will replace this default with explicit arms.)
    - return ExecuteFunc(cancellationToken);
    + // Unknown status — treat as fresh.
For an unknown status I think we should throw an error rather than treat it as fresh. Need to check.
    private Task<T> ReplayStarted(Operation started, CancellationToken cancellationToken)
    {
        var attemptNumber = (started.StepDetails?.Attempt ?? 0) + 1;

        if (_config?.Semantics == StepSemantics.AtMostOncePerRetry)
        {
            // Re-running func would risk a duplicate side effect (e.g. double
            // charge). Treat the lost result as a failure; let the retry
            // strategy decide whether to try again or give up.
            var error = started.StepDetails?.Error;
            var ex = error != null
                ? new StepException(error.ErrorMessage ?? "Step failed on previous attempt") { ErrorType = error.ErrorType }
                : new StepException("Step result lost during AtMostOncePerRetry replay");
            return HandleStepFailureAsync(ex, attemptNumber, cancellationToken);
        }

        return ExecuteFunc(attemptNumber, cancellationToken);
    }
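The `ReplayStarted` arm above has a `PENDING` counterpart that the PR description summarizes as "either re-suspends (timer not yet up) or re-executes the step with an incremented attempt counter". The following is a minimal sketch of that decision only, not the SDK's code: `ReplayAction`, `PendingReplaySketch`, and the `nextAttemptAt`/`previousAttempt` parameters are hypothetical names, and it assumes the due time of the retry timer is available on replay.

```csharp
using System;

// Hypothetical model, not the SDK's types: the real Operation/StepDetails
// shapes are not shown in this PR excerpt.
public enum ReplayAction { Resuspend, ExecuteNextAttempt }

public static class PendingReplaySketch
{
    // nextAttemptAt:   when the retry timer is due (assumed to be known on replay).
    // previousAttempt: the attempt count recorded so far.
    // now:             passed in so the decision is deterministic and testable.
    public static (ReplayAction Action, int AttemptNumber) Decide(
        DateTimeOffset nextAttemptAt, int previousAttempt, DateTimeOffset now)
    {
        if (now < nextAttemptAt)
        {
            // Timer not yet up: suspend again rather than run user code early.
            return (ReplayAction.Resuspend, previousAttempt);
        }

        // Timer has fired: run the step again with an incremented attempt counter.
        return (ReplayAction.ExecuteNextAttempt, previousAttempt + 1);
    }
}
```

Taking `now` as a parameter instead of reading the clock inside the method is a choice made for this sketch so that both branches can be exercised without waiting.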
#2216
What
Adds retry support to `Amazon.Lambda.DurableExecution` on top of #2360. A step that throws can now be retried with configurable backoff and jitter. The Lambda suspends between attempts and is re-invoked by the service when the retry timer fires, so compute is not billed during the wait.

Public API:
- `IRetryStrategy`
- `RetryDecision`: result of `IRetryStrategy.ShouldRetry`; a `ShouldRetry` flag plus `Delay`.
- `RetryStrategy`: `Default`, `Transient`, `None`, `Exponential(...)`, `FromDelegate(...)`.
- `JitterStrategy`: `None`/`Half`/`Full` for exponential backoff.
- `StepSemantics`: `AtLeastOncePerRetry` (default) / `AtMostOncePerRetry`.
- `StepConfig.RetryStrategy`, `StepConfig.Semantics`

How
When a step throws, `StepOperation.HandleStepFailureAsync` calls `IRetryStrategy.ShouldRetry(ex, attemptNumber)`. If the strategy says retry, the SDK writes a `RETRY` checkpoint with `NextAttemptDelaySeconds` and suspends: `RunAsync` returns `Pending`. The service holds the execution until the delay elapses, then re-invokes us. On replay, `StepOperation.ReplayAsync` sees the `PENDING` status and either re-suspends (timer not yet up) or re-executes the step with an incremented attempt counter.

`AtMostOncePerRetry` semantics handle non-idempotent steps (charging a card, sending an email). The SDK writes a `START` checkpoint and blocks until the batcher flushes it before user code runs. If Lambda crashes between user code and the `SUCCEED` flush, replay sees `STARTED` with no terminal record and routes the attempt through `HandleStepFailureAsync` instead of re-executing, so the side effect runs at most once per attempt.

`ExponentialRetryStrategy` supports max attempts, initial/max delay, backoff rate, jitter, and exception filtering by type or message regex. Built-in factories: `Default` (6 attempts, 5s/60s, 2× backoff, full jitter), `Transient` (3 attempts, 1s/5s, half jitter), `None`. `RetryStrategy.FromDelegate(...)` covers arbitrary policies.

Testing
21 new unit tests in `Amazon.Lambda.DurableExecution.Tests` (130 total, up from 109 in #2360):
- `RetryStrategyTests` (14): exponential backoff math, jitter, max-attempt exhaustion, exception-type and message-pattern filtering, delegate strategies.
- `DurableContextTests` retry block (6): checkpoint-and-suspend on retry, fail-without-strategy, retry exhaustion, future/past `PENDING` replay, `AtMostOnce` start-flush ordering, `STARTED` replay routing through the retry handler.

Integration tests in `Amazon.Lambda.DurableExecution.IntegrationTests`: `RetrySucceeds` and `RetryExhausts` run end-to-end against the durable-execution service.

Out of scope (follow-up PRs)
- `MapAsync` / `ParallelAsync` / `RunInChildContextAsync` / `WaitForConditionAsync`
- `CallbackAsync`, `InvokeAsync`
- `DefaultJsonCheckpointSerializer`
- `DurableLogger` replay-suppression (currently `NullLogger`)
- `[DurableExecution]` attribute
- `DurableTestRunner` / `Amazon.Lambda.DurableExecution.Testing` package
- `dotnet new lambda.DurableFunction` blueprint
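For reviewers who want a concrete picture of the backoff knobs listed under How (initial/max delay, backoff rate, jitter), here is a rough, self-contained sketch of how such a delay could be computed. It is not taken from `ExponentialRetryStrategy`: the `Jitter` enum, `BackoffSketch.ComputeDelay`, the 1-based attempt indexing, and the exact ranges used for half and full jitter are all assumptions made for illustration.

```csharp
using System;

// Hypothetical names throughout; this is not the SDK's implementation.
public enum Jitter { None, Half, Full }

public static class BackoffSketch
{
    // attempt is 1-based: attempt 1 yields the delay after the first failure.
    // unitRandom must be in [0, 1); it is passed in so results are reproducible.
    public static TimeSpan ComputeDelay(
        int attempt, TimeSpan initialDelay, TimeSpan maxDelay,
        double backoffRate, Jitter jitter, double unitRandom)
    {
        if (attempt < 1) throw new ArgumentOutOfRangeException(nameof(attempt));
        if (unitRandom < 0.0 || unitRandom >= 1.0) throw new ArgumentOutOfRangeException(nameof(unitRandom));

        // Exponential growth, capped at maxDelay.
        double raw = initialDelay.TotalSeconds * Math.Pow(backoffRate, attempt - 1);
        double capped = Math.Min(raw, maxDelay.TotalSeconds);

        double seconds = jitter switch
        {
            // Assumed meaning of full jitter: uniform in [0, capped).
            Jitter.Full => capped * unitRandom,
            // Assumed meaning of half jitter: uniform in [capped / 2, capped).
            Jitter.Half => capped / 2 + capped / 2 * unitRandom,
            _ => capped,
        };
        return TimeSpan.FromSeconds(seconds);
    }
}
```

Under these assumptions, with a 5s initial delay, a 60s cap, and a backoff rate of 2, the un-jittered delays for attempts 1 through 5 would be 5s, 10s, 20s, 40s, and 60s. Whether the SDK indexes attempts or applies jitter in the same way is not stated in this PR.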