Fail-open cache when persist backend is unreachable (#50) #51
Merged
A persist-backend exception is no longer propagated to the caller: reads are treated as a miss and fall through to the source loader, and writes are logged and swallowed. A cache outage therefore never faults the application as long as the source of truth is healthy.

Gated by the new CacheOptions.FailOpenOnBackendError (default true); set false to restore the previous throwing behavior. Lives in CacheBase, so it covers every IPersist backend (Redis/MongoDB/File).

Adds a Polly circuit breaker (outer) around the Redis retry (inner) so a sustained outage short-circuits immediately instead of paying retry latency on every call -- which is what caused the thread-pool starvation in the reported incident.

New RedisCacheOptions: RetryCount, CircuitBreakerFailureThreshold, CircuitBreakerDuration, CommandTimeout.

Tests: FailOpenTests (4) + RedisResiliencePolicyTests (3).

Closes #50
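The fail-open behavior described above can be sketched roughly as follows. This is a minimal illustration, not the library's `CacheBase`: every type name ending in `Sketch` is hypothetical, the `IPersist` shape is assumed, and `Console.Error` stands in for the `ILogger`. Only the option name `FailOpenOnBackendError` and the use of an exception filter come from the PR.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the library's IPersist; the real interface differs.
public interface IPersistSketch
{
    Task<(bool Found, T Value)> GetAsync<T>(string key);
    Task SetAsync<T>(string key, T value);
}

// Only the option name and its default (true) are taken from the PR.
public sealed class CacheOptionsSketch
{
    public bool FailOpenOnBackendError { get; set; } = true;
}

// Simulates an unreachable backend: every call throws.
public sealed class ThrowingPersistSketch : IPersistSketch
{
    public Task<(bool Found, T Value)> GetAsync<T>(string key) =>
        throw new TimeoutException("backend unreachable");

    public Task SetAsync<T>(string key, T value) =>
        throw new TimeoutException("backend unreachable");
}

public sealed class FailOpenCacheSketch
{
    private readonly IPersistSketch _persist;
    private readonly CacheOptionsSketch _options;

    public FailOpenCacheSketch(IPersistSketch persist, CacheOptionsSketch options)
    {
        _persist = persist;
        _options = options;
    }

    public async Task<T> GetAsync<T>(string key, Func<Task<T>> fetch)
    {
        try
        {
            var (found, value) = await _persist.GetAsync<T>(key);
            if (found) return value;
        }
        catch (Exception e) when (_options.FailOpenOnBackendError)
        {
            // Read failure: treat it as a miss and fall through to the loader.
            Console.Error.WriteLine($"Cache read failed for '{key}': {e.Message}");
        }

        var loaded = await fetch();

        try
        {
            await _persist.SetAsync(key, loaded);
        }
        catch (Exception e) when (_options.FailOpenOnBackendError)
        {
            // Write failure: log and swallow so the caller still gets the value.
            Console.Error.WriteLine($"Cache write failed for '{key}': {e.Message}");
        }

        return loaded;
    }
}
```

With `ThrowingPersistSketch` and the default options, `GetAsync("k", () => Task.FromResult(42))` returns the loader's value. With `FailOpenOnBackendError = false` the filter evaluates to false, the `catch` blocks are skipped, and the backend exception reaches the caller unchanged.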
Released as v0.4.8: https://github.com/Tharga/Cache/releases/tag/0.4.8
Closes #50.
Problem
When the persist backend (Redis) times out commands rather than cleanly failing to connect, `ICache.GetAsync<T>(key, fetch)` threw instead of falling back to the `fetch` loader. A production Redis outage therefore took the whole service down (every read retried ~15s then threw, blocked threads exhausted the pool at ~60k queued items) even though the backing store was healthy.

Changes
1. Fail-open in `CacheBase` (provider-agnostic). A backend exception is caught and:
   - on reads, treated as a miss that falls through to the `fetch` source loader (`GetCoreAsync`, `PeekAsync`, `BuyMoreTime`);
   - on writes, logged and swallowed (`FetchCallback`, `SetCoreAsync`).

   Gated by an exception filter `when (_options.FailOpenOnBackendError)`, so setting the new `CacheOptions.FailOpenOnBackendError = false` restores the previous throwing behavior exactly. Because it lives in the base class it protects all `IPersist` backends (Redis/MongoDB/File). `CacheBase` gained a nullable `ILogger`, threaded through the 5 cache subclasses and the DI factory lambdas.

2. Circuit breaker in the Redis provider. New internal
`RedisResiliencePolicy` factory builds a Polly circuit breaker (outer) wrapping the existing retry (inner), so once the circuit is open, calls fail fast (`BrokenCircuitException`, caught by the core fail-open) instead of paying retry latency per call; this is the fix for the thread-pool starvation. The circuit recovers automatically through the half-open state. `CanConnectAsync` returns `(false, "circuit open")` instead of throwing.

3. Options.
`CacheOptions.FailOpenOnBackendError` (default `true`); `RedisCacheOptions.RetryCount` / `CircuitBreakerFailureThreshold` / `CircuitBreakerDuration` / `CommandTimeout`.

Acceptance criteria
`GetAsync(key, fetch)` returns `fetch()`'s result and does not throw.

Tests
- `FailOpenTests` (4), using a throwing `IPersist`: Get returns the loader result; read/write failures don't fault; `FailOpenOnBackendError=false` re-throws.
- `RedisResiliencePolicyTests` (3): the breaker opens after the threshold and fast-fails without invoking the backend; success passes through; transient failures still retry.

Docs: README updated (core feature bullet + Redis "Resilience (fail-open)" section).
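For readers unfamiliar with the outer-breaker, inner-retry layering, a rough sketch follows. It assumes the Polly v7-style `Policy` API (the PR does not state which Polly version is used) and is not the actual `RedisResiliencePolicy`: `ResiliencePolicySketch`, the back-off delay, and the demo method are hypothetical. The parameter names simply mirror the new `RedisCacheOptions` properties.

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

public static class ResiliencePolicySketch
{
    // Builds breaker (outer) around retry (inner). Illustrative only.
    public static IAsyncPolicy Create(
        int retryCount,
        int circuitBreakerFailureThreshold,
        TimeSpan circuitBreakerDuration)
    {
        // Inner: retry transient failures with a short, growing delay (assumed values).
        var retry = Policy
            .Handle<Exception>()
            .WaitAndRetryAsync(retryCount, attempt => TimeSpan.FromMilliseconds(10 * attempt));

        // Outer: counts a failure only after the inner retries are exhausted,
        // then opens for circuitBreakerDuration once the threshold is reached.
        var breaker = Policy
            .Handle<Exception>()
            .CircuitBreakerAsync(circuitBreakerFailureThreshold, circuitBreakerDuration);

        // Policy.WrapAsync lists the outermost policy first.
        return Policy.WrapAsync(breaker, retry);
    }

    // Demo: a permanently failing backend call. Returns how many times
    // the backend delegate was actually invoked across five attempts.
    public static async Task<int> CountBackendCallsAsync()
    {
        var calls = 0;
        var policy = Create(1, 2, TimeSpan.FromSeconds(30));

        async Task FailingCall()
        {
            calls++;
            await Task.Yield();
            throw new TimeoutException("simulated outage");
        }

        for (var i = 0; i < 5; i++)
        {
            try
            {
                await policy.ExecuteAsync(() => FailingCall());
            }
            catch (BrokenCircuitException)
            {
                // Circuit open: rejected without touching the backend or retrying.
            }
            catch (TimeoutException)
            {
                // Circuit still closed: the failure surfaced after the retry.
            }
        }

        return calls;
    }
}
```

Putting the breaker outside the retry means an open circuit rejects the call before any retry delay is incurred, which is the property the PR relies on to avoid blocking threads during a sustained outage. In the PR, the resulting `BrokenCircuitException` is then absorbed by the fail-open handling in the cache core.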