Skip to content

fix(logmq): add per-entry fallback on batch InsertMany failure#954

Closed
mvanhorn wants to merge 1 commit into
hookdeck:mainfrom
mvanhorn:fix/948-logmq-batch-fallback
Closed

fix(logmq): add per-entry fallback on batch InsertMany failure#954
mvanhorn wants to merge 1 commit into
hookdeck:mainfrom
mvanhorn:fix/948-logmq-batch-fallback

Conversation

@mvanhorn

Copy link
Copy Markdown
Contributor

Summary

  • Add processEntry helper that inserts a single log entry and evaluates alerts for it
  • On InsertMany batch failure, fall back to calling processEntry per-entry instead of nacking the whole batch
  • Refactor post-batch alert evaluation loop into shared evalAlert helper (no behaviour change on happy path)
  • Add two new unit tests: TestBatchProcessor_BatchFailure_SingleBadEntry and TestBatchProcessor_BatchFailure_AllEntriesFail

Why this matters

Fixes #948. With LOG_BATCH_SIZE=1000, a single duplicate-key entry (SQLSTATE 21000: ON CONFLICT DO UPDATE command cannot affect row a second time) causes InsertMany to fail and every message in the batch to be nacked. All 1000 messages redeliver, the bad entry lands in a new batch, and the failure cycle repeats. The mass-nacking also exhausts the SQS consecutive-receive-error budget, permanently killing the consumer worker. The per-entry fallback limits blast radius to the one bad message; the remaining entries are acked normally. Transient failures (DB down) degrade gracefully - per-entry inserts all fail and everything nacks, same as before.

Changes

  • internal/logmq/batchprocessor.go - add processEntry and evalAlert helpers; call processEntry loop on batch insert error
  • internal/logmq/batchprocessor_test.go - extend mockLogStore with failBatchGt field; add two batch-failure unit tests

Fixes #948

When InsertMany fails for a batch, nacking every message allows one bad
entry to poison the entire batch and trigger a redelivery loop that can
exhaust the SQS consecutive-receive-error budget and kill the consumer.

Add a processEntry helper that inserts and evaluates alerts for a single
entry. On batch failure, fall back to calling processEntry per-entry so
only the bad message is nacked while valid entries are acked normally.
Refactor the post-batch alert loop into a shared evalAlert helper (no
behaviour change on the happy path). Add unit tests for both fallback
scenarios (single bad entry and full transient DB failure).

Fixes hookdeck#948
@alexbouchardd

Copy link
Copy Markdown
Contributor

Seems to overlap with #953, @alexluong let's find what the right common ground between those 2 PRs is!

@alexluong

alexluong commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Thanks for the PR @mvanhorn, I'm review it now!

@alexbouchardd not quite overlap I think. The 2 PRs implemented 2 different GH issues. Only overlap is that it's related to the logmq flow but each PR solves different problem.

@alexluong

Copy link
Copy Markdown
Collaborator

Hi @mvanhorn, thanks for digging into this. After some internal discussions, we're going to close this PR as well as the underlying issue #948. After some further consideration, we realized that the per-entry insertion is not the right approach for scale and we'll need to tackle this problem in a different way.

That said, the PR itself did a good job of the implementation and I appreciate the contribution! Thanks

@alexluong alexluong closed this Jun 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

logmq: batch insert failure nacks the entire batch — one bad entry poisons all messages

3 participants