fix(ingester): prevent false series limit throttling during scale-up by yeya24 · Pull Request #7491 · cortexproject/cortex

yeya24 · 2026-05-08T06:06:54Z

What this PR does

During ingester scale-up, the local series limit (derived from global_limit / num_ingesters * replication_factor) shrinks immediately when new ingesters join the ring. However, existing ingesters cannot shed their series until head compaction (~2h). This causes false throttling for tenants who are within their global limit.

Root Cause

When the ring scales from N to M ingesters (e.g., 75 → 249) with a 2.7M global limit and a replication factor of 3, the local limit drops from 2,700,000 / 75 * 3 = 108,000 to 2,700,000 / 249 * 3 ≈ 32,530. An ingester holding 50,000 of a tenant's series is throttled immediately because 50,000 > 32,530, even though the tenant is well under its 2.7M global limit.

The series cannot redistribute until head compaction flushes idle series, so the throttling persists for up to 2 hours.
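
The arithmetic above can be reproduced with a small sketch. The function name `localLimit` and its signature are hypothetical and chosen for illustration; only the formula (global_limit / num_ingesters * replication_factor) and the numbers come from the description.

```go
package main

import "fmt"

// localLimit applies the formula quoted in the description:
// global_limit / num_ingesters * replication_factor.
// Hypothetical helper for illustration, not the Cortex API.
func localLimit(globalLimit, numIngesters, replicationFactor int) int {
	if numIngesters <= 0 {
		return 0
	}
	return int(float64(globalLimit) / float64(numIngesters) * float64(replicationFactor))
}

func main() {
	const globalLimit = 2_700_000 // tenant's global series limit
	const held = 50_000           // series currently held by one existing ingester

	for _, n := range []int{75, 249} {
		limit := localLimit(globalLimit, n, 3)
		fmt.Printf("ingesters=%d local_limit=%d throttled=%t\n", n, limit, held > limit)
	}
	// Output:
	// ingesters=75 local_limit=108000 throttled=false
	// ingesters=249 local_limit=32530 throttled=true
}
```

The ingester's series count is unchanged between the two lines of output; only the ring size differs, which is why the throttling is described as false.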

Fix

Cache the previous local limit per tenant in convertGlobalToLocalLimit(). If the global limit has not changed and the new calculated limit is lower than the cached value, return the cached (higher) limit.

The cache is only used when globalLimit == prev.globalLimit, ensuring:

- Global limit decrease: immediately enforced (cache bypassed)
- Global limit increase: recalculated fresh (cache bypassed)
- Scale-up with same limit: cached (prevents false throttle)
- Scale-down: new limit is higher, cache updates naturally

A ResetLocalLimitCache() method is provided for callers to clear the cache after head compaction, when the ingester's series count reflects its true post-resharding ownership.
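
A minimal sketch of the caching rule described above, under stated assumptions: the types `cachedLimit` and `limitCache` and the methods `convert` and `reset` are hypothetical names, not the PR's actual code, and the per-tenant reset follows the `ResetLocalLimitCache(userID)` form given in the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// cachedLimit records the last limit returned for a tenant (hypothetical type).
type cachedLimit struct {
	globalLimit int
	localLimit  int
}

// limitCache is an illustrative stand-in for the per-tenant cache.
type limitCache struct {
	mu   sync.Mutex
	prev map[string]cachedLimit
}

func newLimitCache() *limitCache {
	return &limitCache{prev: make(map[string]cachedLimit)}
}

// convert returns the local limit for userID. If the global limit is unchanged
// and the freshly calculated limit is lower than the cached one, the cached
// (higher) limit is returned; otherwise the cache is updated.
func (c *limitCache) convert(userID string, globalLimit, numIngesters, replicationFactor int) int {
	if globalLimit == 0 || numIngesters <= 0 {
		return 0
	}
	calculated := int(float64(globalLimit) / float64(numIngesters) * float64(replicationFactor))

	c.mu.Lock()
	defer c.mu.Unlock()
	if p, ok := c.prev[userID]; ok && p.globalLimit == globalLimit && calculated < p.localLimit {
		return p.localLimit // scale-up with same global limit: keep the old limit
	}
	c.prev[userID] = cachedLimit{globalLimit: globalLimit, localLimit: calculated}
	return calculated
}

// reset drops the cached entry for one tenant, e.g. after head compaction.
func (c *limitCache) reset(userID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.prev, userID)
}

func main() {
	c := newLimitCache()
	fmt.Println(c.convert("tenant-a", 2_700_000, 75, 3))  // 108000
	fmt.Println(c.convert("tenant-a", 2_700_000, 249, 3)) // 108000 (cached during scale-up)
	c.reset("tenant-a")
	fmt.Println(c.convert("tenant-a", 2_700_000, 249, 3)) // 32530 (recalculated after reset)
	fmt.Println(c.convert("tenant-a", 1_350_000, 249, 3)) // 16265 (global limit decrease enforced)
}
```

Comparing against the cached global limit is what keeps a deliberate limit decrease from being masked: the higher cached value is only reused when the ring size is the sole input that changed.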

How was this tested?

- Unit tests covering scale-up, scale-down, global limit changes, and cache reset scenarios
- All existing limiter tests pass without modification

During ingester scale-up, the local series limit (derived from global_limit / num_ingesters * replication_factor) shrinks immediately when new ingesters join the ring. However, existing ingesters cannot shed their series until head compaction (~2h). This causes false throttling for tenants who are within their global limit.

Add experimental -ingester.local-limit-cache-enabled flag (default false). When enabled, the limiter caches the previous local limit per tenant in convertGlobalToLocalLimit(). If the global limit has not changed and the new calculated limit is lower than the cached value, the cached (higher) limit is returned.

The cache is only used when globalLimit == prev.globalLimit, ensuring:

- Global limit decrease: immediately enforced (cache bypassed)
- Global limit increase: recalculated fresh (cache bypassed)
- Scale-up with same limit: cached (prevents false throttle)
- Scale-down: new limit is higher, cache updates naturally

ResetLocalLimitCache(userID) resets the cache for a specific tenant after their TSDB head compaction, when the ingester's series count reflects its true post-resharding ownership.

Signed-off-by: Ben Ye <benye@amazon.com>

pull-request-size Bot added the size/L label May 8, 2026

dosubot Bot added component/ingester type/bug labels May 8, 2026

yeya24 force-pushed the fix/prevent-false-throttle-during-scaleup branch 8 times, most recently from 4c9f220 to 1e065f4 Compare May 8, 2026 06:37

yeya24 force-pushed the fix/prevent-false-throttle-during-scaleup branch from 1e065f4 to aea36ca Compare May 8, 2026 07:02

yeya24 wants to merge 1 commit into cortexproject:master from yeya24:fix/prevent-false-throttle-during-scaleup
