fix(storage): verify ClientPut receiver against over-query window #142
Open
jacderida wants to merge 1 commit into
Mirror the paid-quote issuer fix (WithAutonomi#141) on the receiver storage-responsibility check.

After WithAutonomi#140/WithAutonomi#141 removed the reachability re-rank and widened the issuer check, uploads recovered substantially, but a residual few percent still fail on:

    ClientPut receiver <peer> is not among this node's local 9 closest peers (close group plus storage margin)

most visibly as "Failed to store public DataMap". The DataMap is a single critical chunk, so one receiver-check rejection fails the whole upload regardless of file size.

Cause is the same divergence the issuer check had: the uploader queries 2 * CLOSE_GROUP_SIZE peers and PUTs each chunk to the CLOSE_GROUP_SIZE closest *successful responders* (ant-client get_store_quotes), so when closer peers are slow or NAT-stuck the storer it legitimately PUT to sits at XOR positions up to 2 * close_group_size. The receiver check verified only the bare close_group_size + storage margin (9) of the node's *local* routing table with exact self-membership, so it rejected honest PUTs.

Bring it in line with the issuer check:

- Widen to 2 * close_group_size, matching the uploader's over-query window. This does not amplify replica count (the uploader still PUTs to only its selected storers); it just lets a legitimately-selected storer at position 10..14 accept.
- Keep the XOR-only lookup (find_closest_nodes_local_with_self reranks by reachability and would demote XOR-close relay-only / NAT'd storers).
- Hybrid source: try the local routing-table view first, and only on a local miss fall back to an authoritative find_closest_nodes_network lookup (the same view the uploader used to choose the storers), wrapped in the shared CLOSENESS_LOOKUP_TIMEOUT. Reject only if absent from both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
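The divergence described in the commit message can be illustrated with a toy model of the uploader's selection. This is a minimal sketch, not the ant-client code: `Peer`, `rank_by_xor`, `select_storers`, and `toy_network` are hypothetical names, 64-bit IDs stand in for real network addresses, and `CLOSE_GROUP_SIZE = 7` is an assumption inferred from the figures in the text (a width of 9 including the storage margin, and an upper position of 14).

```rust
/// Assumed value: "9 = close group plus storage margin" and "position 10..14"
/// in the text suggest a close group of 7. Not taken from the real code.
const CLOSE_GROUP_SIZE: usize = 7;

/// Toy peer: a 64-bit ID stands in for the real network address.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Peer {
    id: u64,
    /// Whether the peer answers the quote request in time.
    responded: bool,
}

/// Sort peers by XOR distance to `target`, closest first.
fn rank_by_xor(target: u64, peers: &[Peer]) -> Vec<Peer> {
    let mut ranked = peers.to_vec();
    ranked.sort_by_key(|p| p.id ^ target);
    ranked
}

/// Uploader side (hypothetical): look at the 2 * CLOSE_GROUP_SIZE closest
/// peers and keep the CLOSE_GROUP_SIZE closest that actually responded.
/// Returns (peer, 1-based XOR rank) pairs, closest first.
fn select_storers(target: u64, peers: &[Peer]) -> Vec<(Peer, usize)> {
    rank_by_xor(target, peers)
        .into_iter()
        .take(2 * CLOSE_GROUP_SIZE)
        .enumerate()
        .filter(|(_, p)| p.responded)
        .take(CLOSE_GROUP_SIZE)
        .map(|(i, p)| (p, i + 1))
        .collect()
}

/// Build 20 toy peers with ids 1..=20; ids listed in `silent` never respond.
fn toy_network(silent: &[u64]) -> Vec<Peer> {
    (1..=20u64)
        .map(|id| Peer { id, responded: !silent.contains(&id) })
        .collect()
}
```

With target `0` the XOR distance of a peer equals its id, so rank and id coincide. If the five closest peers are silent, `select_storers(0, &toy_network(&[1, 2, 3, 4, 5]))` picks the peers at ranks 6 through 12. A receiver that only accepts when it sits within the 9 closest would therefore reject the PUTs addressed to ranks 10, 11, and 12, even though the uploader chose them legitimately.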
Problem
After #140 (XOR-only verification) and #141 (issuer-closeness over-query window + network fallback), upload success improved dramatically: large files 100%, small files ~92–98%. The residual few-percent failures are now dominated by the receiver storage-responsibility check (AntProtocol::validate_store_membership):

    ClientPut receiver <peer> is not among this node's local 9 closest peers (close group plus storage margin)

They surface most visibly as "Failed to store public DataMap". The public DataMap is a single critical chunk per upload, so one receiver-check rejection of it fails the whole upload regardless of file size.

Cause

This is the same divergence #141 fixed for the issuer check, on the receiver side. The uploader queries 2 * CLOSE_GROUP_SIZE peers and PUTs each chunk to the CLOSE_GROUP_SIZE closest successful responders (ant-client get_store_quotes). When closer peers are slow or NAT-stuck, the storer it legitimately PUT to sits at XOR positions up to 2 * close_group_size. But the receiver check verified only the bare close_group_size + storage margin (9) of the node's local routing table, with exact self-membership, so it rejected honest PUTs. #141 only touched the issuer check (PaymentVerifier); the receiver check still had the strict local-only / width-9 / exact-membership logic.

Fix (symmetric with #141)

- Widen to 2 * close_group_size, matching the uploader's over-query window. This does not amplify replica count (the uploader still PUTs to only its selected storers); it just lets a legitimately-selected storer at position 10..14 accept.
- Keep the XOR-only lookup (find_closest_nodes_local_with_self reranks by reachability and would demote XOR-close relay-only / NAT'd storers).
- Hybrid source: try the local routing-table view first, and only on a local miss fall back to an authoritative find_closest_nodes_network lookup (the same view the uploader used to choose the storers), wrapped in the shared CLOSENESS_LOOKUP_TIMEOUT. Reject only if absent from both.

CLOSENESS_LOOKUP_TIMEOUT is promoted to pub(crate) so the receiver check reuses the merkle/issuer timeout.

Note
Same cost caveat as #141: the fallback can issue a per-chunk network lookup when the local view misses; the local fast-path keeps that off the hot path for storers we already know. One storage-specific consideration: a node may now accept a chunk it is the ~10th–14th-closest to, which the replication/pruning close-group logic could later treat as borderline; replica count is unaffected since the uploader controls the PUT targets.
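The local-first, network-fallback decision and its cost profile can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `Verdict`, `within_window`, and `check_receiver` are hypothetical names, IDs are 64-bit toys, `CLOSE_GROUP_SIZE = 7` is inferred from the figures in the description, and the network lookup is modelled as a synchronous closure that returns `None` when it times out. Treating a timeout as a miss is also an assumption; the description does not say how the real check handles it.

```rust
/// Assumed value, inferred from the "9" and "10..14" figures in the text.
const CLOSE_GROUP_SIZE: usize = 7;
/// Width of the uploader's over-query window.
const WINDOW: usize = 2 * CLOSE_GROUP_SIZE;

#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    /// Accepted from the local routing-table view; no network traffic.
    AcceptedLocal,
    /// Local view missed, but the network view placed us inside the window.
    AcceptedNetwork,
    /// Absent from both views, or the fallback did not return a view.
    Rejected,
}

/// XOR-only test: this node is inside the window if fewer than WINDOW
/// known peers are strictly closer to `target` than it is.
fn within_window(target: u64, self_id: u64, known: &[u64]) -> bool {
    let self_dist = self_id ^ target;
    let closer = known
        .iter()
        .filter(|&&id| id != self_id && (id ^ target) < self_dist)
        .count();
    closer < WINDOW
}

/// Hybrid check (hypothetical): consult the local view first and call
/// `network_lookup` only on a local miss. `None` models a timed-out lookup.
fn check_receiver<F>(target: u64, self_id: u64, local: &[u64], network_lookup: F) -> Verdict
where
    F: FnOnce() -> Option<Vec<u64>>,
{
    if within_window(target, self_id, local) {
        return Verdict::AcceptedLocal;
    }
    match network_lookup() {
        Some(peers) if within_window(target, self_id, &peers) => Verdict::AcceptedNetwork,
        _ => Verdict::Rejected,
    }
}
```

Because the closure is only invoked after a local miss, a storer that already sees itself within the window never pays for the lookup, which is the fast-path behaviour the note describes. A node that accepts through either path may still sit beyond position 9, which is the borderline case the replication/pruning logic would need to tolerate.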
Builds on #140 and #141.
🤖 Generated with Claude Code