fix: classify permanent ingest failures; cut a per-flush allocation#21
Merged
Conversation
Three hardening changes from the perf/reliability review.
1. HTTP failure taxonomy (the important one). flush() treated EVERY non-200
as retryable, so retryFlush(false) retried forever. A single misconfig — a
wrong index (404), an expired token (401), or a schema-rejected batch
(400/422) — parked a worker in the infinite-retry loop, and with the default
2 workers a couple of poison batches froze all ingest and blocked every
Ingest caller. flush() now treats 400/401/403/404/409/422 as permanent:
it settles tracked items with a new *IngestError{Reason: ReasonServer},
invokes OnDiscard for fire-and-forget items, and drops the batch so the
worker keeps draining. 5xx, 408, 425, 429, 413 and transport errors stay
retryable (413 keeps its auto-reduce path). Tracked callers now get a
definitive verdict instead of an ambiguous context timeout.
2. Drop a per-flush allocation. flush() built a fresh `encoded` slice
(~48KB + a full copy at the default batch size) every round-trip just to
replay the settle list. Removed: on HTTP 200 we settle over `batch`
directly, which is identical because encode-failure and fire-and-forget
items have ack==nil so settle(nil) is a no-op for them.
3. Recover a 413-reduced batch size on the ticker, not only on a later
successful flush. flush() never runs on an empty buffer, so after a
transient 413 spike followed by quiet traffic the batch could stay shrunk
(down to 10%) indefinitely. The reset now also runs in the worker's ticker
branch, so it fires within ~one maxDelay of the window expiring.
Fire-and-forget batching, ordering, the 413 split, and the IngestSync
at-least-once / exactly-once-settle invariants are unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 13, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The DO-NOW tier from the performance + send-failure review.
1. HTTP failure taxonomy — stop retrying permanent 4xx forever (the important one)
flush()treated every non-200 as retryable, soretryFlush(false)retried forever. A single misconfiguration — wrong index (404), expired token (401), or a schema-rejected batch (400/422) — parked a worker in the infinite-retry loop. With the default 2 workers, a couple of poison batches froze all ingest and blocked everyIngestcaller. There was no permanent-failure path.Now
flush()classifies the status:400/401/403/404/409/422): settle tracked items with a new*IngestError{Reason: ReasonServer}, invokeOnDiscardfor fire-and-forget items, drop the batch, and keep draining.413keeps its own auto-reduce/re-chunk path.Tracked (
IngestSync) callers now get a definitive Nack/dead-letter verdict instead of an ambiguous context timeout.2. Drop a per-flush allocation
flush()built a freshencoded := make([]ingestItem, 0, len(batch))(~48 KB + a full 1000-element copy at the default batch size) every round-trip, only to replay the settle list. Removed — on HTTP 200 we settle overbatchdirectly. Identical behavior, because encode-failure and fire-and-forget items haveack==nil, sosettle(nil)is a no-op for them.3. Recover a 413-reduced batch size on the ticker
The post-413 batch-size reset only ran on a successful flush, but
flush()never runs on an empty buffer — so after a transient 413 spike followed by quiet traffic, the batch could stay shrunk (down to 10%) indefinitely. The reset now also runs in the worker's ticker branch, firing within ~onemaxDelayof the 10-minute window expiring. Gated onSetAutoReduceBatchSize, off by default.Invariants preserved
Fire-and-forget batching, ordering, the 413 split, and the
IngestSyncat-least-once / exactly-once-settle guarantees are unchanged. The permanent path settles each item exactly once and drops the batch, so nothing is re-flushed or double-settled.Tests
5 new (
ingest_reliability_test.go): permanent 400 → promptReasonServer; permanent 404 doesn't wedge a single worker (second call still gets a verdict); 5xx still retried to success (guards against over-classifying); fire-and-forget 422 →OnDiscard+ Close returns promptly; worker-side encode failure still discarded with theencodedslice gone (good docs delivered, poison discarded). Full suite green,go vetclean, race-clean across-count=2.Reviewed but deferred (not in this PR)
429/Retry-After, backoff jitter,
ResponseHeaderTimeout, ingest-response-body inspection (silent reject loss),SetCloseTimeout, andcommit=wait_for— all from the "with care" tier, to be decided separately.🤖 Generated with Claude Code