Fail-open cache when persist backend is unreachable (#50) #51

Merged: poxet merged 1 commit into master from feature/issue-50-fail-open-cache on Jun 15, 2026

poxet (Contributor) commented Jun 15, 2026

Closes #50.

Problem

When the persist backend (Redis) timed out on commands rather than failing cleanly at connect time, ICache.GetAsync<T>(key, fetch) threw instead of falling back to the fetch loader. A production Redis outage therefore took the whole service down even though the backing store was healthy: every read retried for ~15s and then threw, and the blocked threads exhausted the thread pool at ~60k queued items.

Changes

1. Fail-open in CacheBase (provider-agnostic). A backend exception is caught and:

  • reads → logged, treated as a miss → control flows to the fetch source loader (GetCoreAsync, PeekAsync, BuyMoreTime);
  • writes → logged and swallowed, never fault the caller (FetchCallback, SetCoreAsync).

Gated by an exception filter when (_options.FailOpenOnBackendError), so setting the new CacheOptions.FailOpenOnBackendError = false restores the previous throwing behavior exactly. Because it lives in the base class it protects all IPersist backends (Redis/MongoDB/File). CacheBase gained a nullable ILogger, threaded through the 5 cache subclasses and the DI factory lambdas.
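The read/write handling described above can be sketched roughly as follows. This is a minimal illustration, not the CacheBase code from this PR: IPersistSketch and FailOpenCacheSketch are hypothetical stand-ins, the Action<Exception> callback substitutes for the nullable ILogger, and only the GetAsync path is shown.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a persist backend; not the library's IPersist.
public interface IPersistSketch
{
    Task<(bool Found, T Value)> TryGetAsync<T>(string key);
    Task SetAsync<T>(string key, T value);
}

// Hypothetical cache wrapper showing the fail-open pattern only.
public sealed class FailOpenCacheSketch
{
    private readonly IPersistSketch _persist;
    private readonly bool _failOpen;          // plays the role of FailOpenOnBackendError
    private readonly Action<Exception> _log;  // stand-in for the logger

    public FailOpenCacheSketch(IPersistSketch persist, bool failOpen, Action<Exception> log)
    {
        _persist = persist;
        _failOpen = failOpen;
        _log = log;
    }

    public async Task<T> GetAsync<T>(string key, Func<Task<T>> fetch)
    {
        try
        {
            var (found, value) = await _persist.TryGetAsync<T>(key);
            if (found) return value;
        }
        catch (Exception e) when (_failOpen)
        {
            // Read failure: log and treat as a miss. When _failOpen is false
            // the filter does not match and the exception propagates unchanged.
            _log(e);
        }

        // Miss (or failed read): go to the source of truth.
        var fresh = await fetch();

        try
        {
            await _persist.SetAsync(key, fresh);
        }
        catch (Exception e) when (_failOpen)
        {
            // Write failure: log and swallow so the caller still gets the value.
            _log(e);
        }

        return fresh;
    }
}
```

Because the gate is an exception filter rather than a rethrow inside the catch block, disabling the option leaves the original exception and its stack trace untouched.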

2. Circuit breaker in the Redis provider. A new internal RedisResiliencePolicy factory builds a Polly circuit breaker (outer) wrapping the existing retry (inner). Once the circuit is open, calls fail fast with BrokenCircuitException (caught by the core fail-open) instead of paying retry latency on every call, which is the fix for the thread-pool starvation. The breaker recovers automatically through its half-open state. CanConnectAsync returns (false, "circuit open") instead of throwing.

3. Options. CacheOptions.FailOpenOnBackendError (default true); RedisCacheOptions.RetryCount / CircuitBreakerFailureThreshold / CircuitBreakerDuration / CommandTimeout.
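For items 2 and 3, the breaker-around-retry composition might look something like the sketch below, using Polly's v7-style Policy API (requires the Polly NuGet package). It is not the RedisResiliencePolicy from this PR. ResilienceOptionsSketch and ResiliencePolicySketch are hypothetical names, the property types, default values, and backoff are assumptions, and CommandTimeout is left out.

```csharp
using System;
using Polly;

// Hypothetical options class; property names follow the PR text,
// types and defaults are assumptions for illustration.
public sealed class ResilienceOptionsSketch
{
    public int RetryCount { get; set; } = 3;
    public int CircuitBreakerFailureThreshold { get; set; } = 5;
    public TimeSpan CircuitBreakerDuration { get; set; } = TimeSpan.FromSeconds(30);
}

public static class ResiliencePolicySketch
{
    public static IAsyncPolicy Build(ResilienceOptionsSketch options)
    {
        // Inner policy: retry transient failures with a short linear backoff
        // (the backoff values are an assumption).
        var retry = Policy
            .Handle<Exception>()
            .WaitAndRetryAsync(
                options.RetryCount,
                attempt => TimeSpan.FromMilliseconds(10 * attempt));

        // Outer policy: open the circuit after N consecutive failed executions,
        // where one execution is a call that failed even after its retries.
        var breaker = Policy
            .Handle<Exception>()
            .CircuitBreakerAsync(
                options.CircuitBreakerFailureThreshold,
                options.CircuitBreakerDuration);

        // The first argument is the outermost policy: breaker wraps retry.
        return Policy.WrapAsync(breaker, retry);
    }
}
```

With the breaker outermost, a call made while the circuit is open throws BrokenCircuitException before the retry policy or the backend delegate runs, so no retry latency is paid. Putting the retry outermost instead would retry the BrokenCircuitException itself.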

Acceptance criteria

  • With the backend down/timing out, GetAsync(key, fetch) returns fetch()'s result and does not throw.
  • A cache write failure does not fault the caller.
  • Under a sustained outage, calls fail fast once the breaker is open and recover automatically.

Tests

  • FailOpenTests (4) — throwing IPersist: Get returns the loader result; read/write failures don't fault; FailOpenOnBackendError=false re-throws.
  • RedisResiliencePolicyTests (3) — breaker opens after the threshold and fast-fails without invoking the backend; success passes through; transient failures still retry.
  • Full solution builds clean. Core: 478 pass (4 new). Redis: 5 pass.

Docs: README updated (core feature bullet + Redis "Resilience (fail-open)" section).

A persist-backend exception is no longer propagated to the caller: reads
are treated as a miss and fall through to the source loader, and writes
are logged and swallowed. A cache outage therefore never faults the
application as long as the source of truth is healthy. Gated by the new
CacheOptions.FailOpenOnBackendError (default true); set false to restore
the previous throwing behavior. Lives in CacheBase, so it covers every
IPersist backend (Redis/MongoDB/File).

Adds a Polly circuit breaker (outer) around the Redis retry (inner) so a
sustained outage short-circuits immediately instead of paying retry
latency on every call -- which is what caused the thread-pool starvation
in the reported incident. New RedisCacheOptions: RetryCount,
CircuitBreakerFailureThreshold, CircuitBreakerDuration, CommandTimeout.

Tests: FailOpenTests (4) + RedisResiliencePolicyTests (3).

Closes #50
poxet merged commit d20c9bb into master Jun 15, 2026
7 of 8 checks passed
github-actions

Released as v0.4.8: https://github.com/Tharga/Cache/releases/tag/0.4.8

Development

Successfully merging this pull request may close these issues.

Cache should fail open to the source loader when the persist backend is unreachable
