feat: add attestation aggregate coverage metrics #386

Open
pablodeymo wants to merge 2 commits into main from add-attestation-aggregate-coverage-metrics

@pablodeymo
Collaborator

🗒️ Description / Motivation

Ports leanSpec PR #735 to ethlambda: registers three Prometheus metrics that describe attestation aggregate coverage, with default zero-valued series so dashboards render from a fresh node startup.

leanSpec PR #735 itself mirrors blockblaz/zeam#898. Per upstream, this PR is registration only — the producer side (per-slot coverage computation, de-duplication across payloads, and the chain-status log line) is the equivalent of blockblaz/zeam#876 and lands in a follow-up.

What Changed

crates/blockchain/src/metrics.rs (+76 / -0):

  • Two pub const &[&str] label-set constants — single source of truth for sections and directions, mirroring ATTESTATION_AGGREGATE_COVERAGE_SECTIONS and ATTESTATION_AGGREGATE_COVERAGE_DIFF_DIRECTIONS in leanSpec.
  • Three new IntGaugeVec statics:
    • lean_attestation_aggregate_coverage_validators — labels: section, subnet. subnet="combined" is the section total; subnet="subnet_N" is per-subnet coverage.
    • lean_attestation_aggregate_coverage_subnets — label: section. Count of covered subnets per section.
    • lean_attestation_aggregate_coverage_diff_validators — label: direction. Counts of validators in the symmetric difference between block-included aggregates and locally-aggregated pre-merge (timely) aggregates for the same slot.
  • init() forces the new statics and seeds 18 default zero-valued series: 8 sections × subnet="combined", 8 sections, and 2 directions. Per-subnet (subnet="subnet_N") series appear lazily when instrumentation writes them.

Sections

timely, late, block, combined, agg_start_new, proposal_payloads, proposal_gossip, proposal_combined.

Directions

block_only, timely_only.
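
The 18 default series can be enumerated from the two label sets. The following is a minimal std-only sketch, not the code in metrics.rs: it does not touch the prometheus crate, and `default_series` is a hypothetical helper introduced only for illustration. The constant names, metric names, and label values are taken from the description above.

```rust
/// Label values as listed in the PR description.
pub const ATTESTATION_AGGREGATE_COVERAGE_SECTIONS: &[&str] = &[
    "timely",
    "late",
    "block",
    "combined",
    "agg_start_new",
    "proposal_payloads",
    "proposal_gossip",
    "proposal_combined",
];

pub const ATTESTATION_AGGREGATE_COVERAGE_DIFF_DIRECTIONS: &[&str] =
    &["block_only", "timely_only"];

/// Hypothetical helper: lists every (metric name, label pairs) combination
/// that the description says init() seeds with 0.
pub fn default_series() -> Vec<(&'static str, Vec<(&'static str, &'static str)>)> {
    let mut out = Vec::new();
    for &section in ATTESTATION_AGGREGATE_COVERAGE_SECTIONS {
        // One section total per section (subnet="combined").
        out.push((
            "lean_attestation_aggregate_coverage_validators",
            vec![("section", section), ("subnet", "combined")],
        ));
        // One covered-subnet count per section.
        out.push((
            "lean_attestation_aggregate_coverage_subnets",
            vec![("section", section)],
        ));
    }
    for &direction in ATTESTATION_AGGREGATE_COVERAGE_DIFF_DIRECTIONS {
        out.push((
            "lean_attestation_aggregate_coverage_diff_validators",
            vec![("direction", direction)],
        ));
    }
    out
}
```

Calling `default_series().len()` gives 8 + 8 + 2 = 18, matching the count stated in the description.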

Notes

  • IntGaugeVec (not GaugeVec): all coverage values are integer counts, and every other labeled gauge in metrics.rs uses IntGaugeVec (LEAN_NODE_INFO, LEAN_TABLE_BYTES, LEAN_NODE_SYNC_STATUS).
  • No setter functions yet — they would be dead code until instrumentation lands and will be designed against real call sites in the follow-up PR.
  • The diff_validators help text intentionally diverges from upstream's terse phrasing ("Validator coverage delta between block payloads and timely pre-merge payloads") to spell out the symmetric-difference semantics: block_only = in block but not in local timely pool; timely_only = the reverse. Metric name, labels, and values are unchanged, so dashboards built against any client's schema are unaffected.

Operator interpretation of diff_validators

The aggregation pipeline produces two pools for the same slot: timely (locally aggregated pre-merge) and block (what the proposer included). The diff metric counts validators in the symmetric difference:

  • block_only persistently high → this node was slow to receive/aggregate via gossip; proposer had a better view.
  • timely_only persistently high → proposer omitted attestations the network had time to gossip.
  • Both near zero → local aggregation tracks proposers; the network is converging.
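
As an illustration of the symmetric-difference semantics, here is a small sketch that assumes each pool is modelled as a set of validator indices. The function name `diff_counts` echoes the helper mentioned later in this PR, but the signature and the `HashSet<u64>` representation are assumptions, not the actual implementation.

```rust
use std::collections::HashSet;

/// Returns (block_only, timely_only): validators present in exactly one pool.
/// Hypothetical signature; the real code works on aggregation bits.
pub fn diff_counts(block: &HashSet<u64>, timely: &HashSet<u64>) -> (usize, usize) {
    // In the block but not in the local timely pool.
    let block_only = block.difference(timely).count();
    // In the local timely pool but not in the block.
    let timely_only = timely.difference(block).count();
    (block_only, timely_only)
}
```

For example, with block = {1, 2, 3} and timely = {2, 3, 4, 5}, validator 1 is block-only and validators 4 and 5 are timely-only, so the two gauges would read 1 and 2.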

Correctness / Behavior Guarantees

  • No behavior change. This PR is metric registration only — no code path produces or consumes the new gauges yet. State transition, fork choice, attestation processing, and aggregation are untouched.
  • The new metrics are forced in init() exactly like the existing gauges, so they appear at /metrics from node startup.
  • Cardinality at registration: 18 series. Even with full instrumentation at 64 subnets, the validators gauge tops out at 8 × 65 = 520 series — well within Prometheus comfort.
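
The upper bound above is plain arithmetic: each section has one `combined` series plus one series per subnet. A tiny sketch, with a hypothetical function name:

```rust
/// Upper bound on series for the validators gauge:
/// each section has one "combined" series plus one per subnet.
pub const fn max_validator_series(sections: usize, subnets: usize) -> usize {
    sections * (subnets + 1)
}
```

With 8 sections and 64 subnets this evaluates to 8 × 65 = 520.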

Tests Added / Run

No new tests in this PR. The upstream Python test (test_attestation_aggregate_coverage_metrics_registered) is tautological for code that calls .set(0), and the blockchain crate doesn't currently host a metric-registry test harness; introducing one for this is out of proportion. Verified via the existing suite + manual /metrics smoke check.

Commands run:

  • cargo fmt --all -- --check — clean
  • make lint (clippy -D warnings) — clean
  • cargo test -p ethlambda-blockchain --release --lib --bins — 20 passed
  • cargo test -p ethlambda-blockchain --release --test signature_spectests — 7 passed
  • cargo test --workspace --release --exclude ethlambda-blockchain — all passing
  • cargo test -p ethlambda-blockchain --release --test forkchoice_spectests — 62 passed, 8 failed, all AttestationTooFarInFuture from pre-existing fixture flakes on main (no logic changes in this PR could affect attestation timing).

Related Issues / PRs

✅ Verification Checklist

  • Ran make fmt — clean
  • Ran make lint (clippy with -D warnings) — clean
  • Ran cargo test --workspace --release — passing modulo 8 pre-existing forkchoice fixture flakes documented above (unrelated to this PR)

  Port leanSpec PR #735: register three IntGaugeVec metrics describing
  attestation aggregate coverage, with default zero-valued series so
  dashboards render from a fresh node startup.

    - lean_attestation_aggregate_coverage_validators (section, subnet)
    - lean_attestation_aggregate_coverage_subnets    (section)
    - lean_attestation_aggregate_coverage_diff_validators (direction)

  ATTESTATION_AGGREGATE_COVERAGE_SECTIONS and
  ATTESTATION_AGGREGATE_COVERAGE_DIFF_DIRECTIONS are exported as the
  single source of truth for label sets. init() forces the new statics
  and seeds combined-subnet, section, and direction series to 0 (18
  default series total). Per-subnet (subnet="subnet_N") series appear
  lazily when instrumentation writes them.

  Registration only. The producer side (per-slot coverage computation,
  de-duplication across payloads, and the chain-status log line) ports
  zeam #876 and lands in a follow-up PR.

  The diff_validators help text diverges from upstream's terse phrasing
  to spell out the symmetric-difference semantics (block_only: in block
  but not in local timely pool; timely_only: the reverse). Metric name,
  labels, and values are unchanged.
@github-actions

🤖 Kimi Code Review

This PR adds attestation aggregate coverage metrics following the existing patterns in the codebase. The implementation is correct and idiomatic.

Assessment:

  1. Code correctness: The use of std::sync::LazyLock for static initialization and the force() pattern in init() is consistent with the existing metrics in the file. Label value construction (&[section, "combined"]) is correct.

  2. Cardinality management: Item 1 (lines 469-488) correctly seeds only the combined subnet series at startup, deferring per-subnet series to lazy creation as noted in the comment. This prevents unbounded cardinality explosions in Prometheus.

  3. Error handling: The .unwrap() calls on register_int_gauge_vec! (lines 131, 141, 152) match the existing codebase's fail-fast approach for metrics registration failures. This is acceptable for initialization-time panics.

  4. Documentation: Metric help strings (lines 129-130, 143-144, 165-168) clearly describe the label semantics and update frequency, including the important distinction between subnet=combined and subnet=subnet_N.

Minor observation:

  • The ATTESTATION_AGGREGATE_COVERAGE_SECTIONS slice (lines 10-20) contains 8 sections. Ensure the instrumentation code that actually writes these metrics uses the same string literals to avoid label mismatch errors at runtime.

No security vulnerabilities or consensus risks identified—this is pure instrumentation with no state transition logic.

Verdict: LGTM. Clean, well-documented addition that follows established patterns.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

🤖 Codex Code Review

No findings.

This PR is limited to new Prometheus label constants, gauge registration, and startup seeding, all in metrics.rs. I don’t see any impact on fork choice, attestation validation, justification/finalization, state transition, XMSS handling, or SSZ behavior. The added code is straightforward, uses the existing LazyLock/Prometheus patterns consistently, and doesn’t introduce new memory-safety or error-handling concerns beyond the existing .unwrap() registration style already used in this module.

One residual note: this patch only registers/seeds the new metrics; there are no writer APIs or call sites in this PR, so the metrics will stay at their seeded values until follow-up instrumentation lands. I also couldn’t complete cargo check in this environment because Cargo/Rustup cannot write to /home/runner/.cargo and /home/runner/.rustup, so the review is based on static inspection rather than a successful local build.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions

🤖 Claude Code Review

Here is the full code review:


PR #386: feat: add attestation aggregate coverage metrics

Overview

Ports leanSpec PR #735: registers three IntGaugeVec Prometheus metrics for attestation aggregate coverage, pre-seeding 18 zero-valued series in init() so dashboards render from node startup. Registration-only — no instrumentation/setter functions yet (coming in a follow-up). Single file changed: crates/blockchain/src/metrics.rs.


Code Quality

Correct and idiomatic. The three statics follow the established LazyLock + register_*_vec! + .unwrap() pattern exactly. IntGaugeVec is the right choice — integer counts, consistent with all other labeled gauges (LEAN_NODE_INFO, LEAN_TABLE_BYTES, LEAN_NODE_SYNC_STATUS). Labels are compact and match the upstream schema, so cross-client Grafana dashboards remain compatible.


Specific Observations

1. Redundant force() calls before seeding loops (lines 472–474)

std::sync::LazyLock::force(&LEAN_ATTESTATION_AGGREGATE_COVERAGE_VALIDATORS);
std::sync::LazyLock::force(&LEAN_ATTESTATION_AGGREGATE_COVERAGE_SUBNETS);
std::sync::LazyLock::force(&LEAN_ATTESTATION_AGGREGATE_COVERAGE_DIFF_VALIDATORS);
for &section in ATTESTATION_AGGREGATE_COVERAGE_SECTIONS {
    LEAN_ATTESTATION_AGGREGATE_COVERAGE_VALIDATORS
        .with_label_values(&[section, "combined"])
        .set(0);
    ...
}

Calling .with_label_values(...).set(0) forces the LazyLock internally, so the preceding explicit force() calls are redundant. They're harmless and consistent with the existing style (every static gets an explicit force() in init()), so this is fine to keep — just worth knowing it doesn't add behavior.
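
The LazyLock behavior behind this observation can be checked with the standard library alone. The sketch below uses a stand-in static (`DEMO_GAUGE`, an `AtomicI64` rather than a Prometheus gauge) and a counter that records how often the initializer runs; both names are hypothetical.

```rust
use std::sync::atomic::{AtomicI64, AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts how many times the initializer below has run.
static INIT_RUNS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for a registered gauge; not a real Prometheus metric.
static DEMO_GAUGE: LazyLock<AtomicI64> = LazyLock::new(|| {
    INIT_RUNS.fetch_add(1, Ordering::SeqCst);
    AtomicI64::new(-1)
});

/// Uses the static through Deref, then calls force() explicitly,
/// and reports how many times the initializer ran in total.
pub fn seed_then_force() -> usize {
    // The first access through Deref runs the initializer.
    DEMO_GAUGE.store(0, Ordering::SeqCst);
    // The value is already initialized, so this does not run it again.
    LazyLock::force(&DEMO_GAUGE);
    INIT_RUNS.load(Ordering::SeqCst)
}
```

The initializer runs exactly once no matter which of the two accesses comes first, which is why an explicit force() next to a seeding call adds no behavior.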

2. pub const label sets vs. SyncStatus::ALL inconsistency

SyncStatus::ALL is a private const on the enum and is only used internally in set_node_sync_status(). The new ATTESTATION_AGGREGATE_COVERAGE_SECTIONS and ATTESTATION_AGGREGATE_COVERAGE_DIFF_DIRECTIONS are pub const at module scope. The pub is justified because setter functions in the follow-up PR may need these from call sites outside metrics.rs (e.g., store.rs) — so this is forward-looking but reasonable. Consider keeping them pub(crate) if they won't be exported from the crate's public API, to limit exposure.

3. LEAN_NODE_SYNC_STATUS is not seeded in init() — pre-existing inconsistency

The existing LEAN_NODE_SYNC_STATUS is force()d in init() but its label values are not pre-seeded with zero; they appear only after the first set_node_sync_status() call. The new coverage metrics are seeded up front, which is the correct behavior per the PR's stated goal ("appear at /metrics from node startup"). No action needed in this PR, but the existing LEAN_NODE_SYNC_STATUS inconsistency could be addressed separately.

4. The three force() calls for all three statics happen before either seeding loop

LEAN_ATTESTATION_AGGREGATE_COVERAGE_DIFF_VALIDATORS is forced on line 474, but its seeding loop doesn't start until line 483. No bug, but co-locating each force() with its respective seeding block would make the grouping clearer. Again, a trivial style point given the existing init() structure.


Documentation

Excellent. The doc comments are precise and operationally useful:

  • The distinction between subnet=combined (section total) and subnet=subnet_N (per-subnet) is clearly stated.
  • The diff_validators help text spelling out symmetric-difference semantics is better than the upstream terse version and doesn't break dashboard compatibility.
  • The init() comment correctly explains the lazy-seeding strategy.

Correctness / Safety

  • No behavior change: confirmed. State transition, fork choice, attestation processing — all untouched.
  • Cardinality: 18 seeded series now; 520 max at 64 subnets — well within Prometheus limits.
  • No panics introduced: the register_*!(...).unwrap() calls are consistent with all existing statics and fail only at program startup (acceptable for misconfigured registries).
  • No security concerns.

Verdict

Approved. The implementation is correct, follows project conventions faithfully, and is appropriately scoped. The two actionable points worth considering before merge:

  • Consider pub(crate) instead of pub for the label-set constants if they won't be part of the crate's public API.
  • Track the LEAN_NODE_SYNC_STATUS zero-seeding gap as a follow-up (not a blocker for this PR).

Automated review by Claude (Anthropic) · sonnet · custom prompt

@greptile-apps
Contributor

greptile-apps Bot commented May 20, 2026

Greptile Summary

Registers three new IntGaugeVec Prometheus metrics for attestation aggregate coverage, porting leanSpec#735. The change is registration-only — no code path yet produces or consumes the new gauges.

  • Three new statics follow the exact same LazyLock<IntGaugeVec> pattern used by every other labeled gauge in the file.
  • init() forces all three and seeds 18 zero-valued series, keeping per-subnet series lazy for the follow-up instrumentation PR.
  • Two pub const &[&str] label-set constants are exported so the follow-up setter functions can reference them without hard-coding strings.

Confidence Score: 5/5

This PR is safe to merge — it is purely additive metric registration with no changes to any logic, state, or existing code paths.

The change adds three new IntGaugeVec statics and seeds 18 default zero series in init(). No existing code is modified, no consumers of the new metrics exist yet, and the implementation closely mirrors the patterns already established in the file.

No files require special attention. The single changed file is straightforward metric registration.

Important Files Changed

Filename: crates/blockchain/src/metrics.rs
Overview: Adds three IntGaugeVec statics and two pub label-set constants; seeds 18 default zero series in init(). Follows all existing patterns; no logic, no consumers yet.

Reviews (1): Last reviewed commit: "feat: add attestation aggregate coverage..."

Collaborator

@MegaRedHand MegaRedHand left a comment

We still need to emit the metrics

  Port the producer side of zeam #876 on top of the metrics registered in
  the previous commit. After this commit, all 18 coverage series receive
  real per-slot updates from chain activity.

  Five emission sites:

    - accept_new_attestations (store.rs): captures `new_payloads`
      participant bits BEFORE promote and stashes them as a
      CoverageSnapshot on the Store. Read at the next slot boundary to
      populate the `timely` section ("prev_new" in zeam).
    - on_block_core (store.rs): mirrors the imported block's per-AttData
      aggregation bits into Store::last_block_coverage. Observability-only;
      fork choice is unchanged.
    - on_tick interval 0 (lib.rs): emits the post-block-merge report for
      `slot - 1`. Computes `timely`/`late`/`block`/`combined` from the
      stashed snapshots and the current `new_payloads`, then emits the
      diff_validators direction counts as the symmetric difference between
      `block` and `timely`.
    - start_aggregation_session (lib.rs): emits `agg_start_new` from the
      current `new_payloads` right before fork-choice aggregation runs at
      interval 2.
    - propose_block (lib.rs): emits `proposal_payloads`,
      `proposal_gossip`, and `proposal_combined` after the block is built.
      Each validator set in the block is classified by whether the
      AttestationData has a matching known-payload proof.
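
The propose_block classification can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in lib.rs: `BlockAggregate`, `ProposalCoverage`, and `classify_proposal` are hypothetical names, a `u64` id stands in for an AttestationData key, and validators are plain indices rather than aggregation bits.

```rust
use std::collections::{BTreeSet, HashSet};

/// Hypothetical stand-in for one aggregate included in a proposed block.
pub struct BlockAggregate {
    /// Stand-in key for the aggregate's AttestationData.
    pub att_data_id: u64,
    /// Validator indices covered by this aggregate.
    pub validators: Vec<u64>,
}

/// Distinct validator counts for the three proposal sections.
pub struct ProposalCoverage {
    pub payloads: usize,
    pub gossip: usize,
    pub combined: usize,
}

/// Splits the block's validators by whether their AttestationData
/// has a matching known-payload proof (assumed to be a simple id lookup here).
pub fn classify_proposal(
    aggregates: &[BlockAggregate],
    known_payload_ids: &HashSet<u64>,
) -> ProposalCoverage {
    let mut payloads = BTreeSet::new();
    let mut gossip = BTreeSet::new();
    for agg in aggregates {
        let target = if known_payload_ids.contains(&agg.att_data_id) {
            &mut payloads
        } else {
            &mut gossip
        };
        target.extend(agg.validators.iter().copied());
    }
    // A validator can appear in both sets; the union de-duplicates it.
    let combined = payloads.union(&gossip).count();
    ProposalCoverage {
        payloads: payloads.len(),
        gossip: gossip.len(),
        combined,
    }
}
```

Because a validator may appear under more than one AttestationData, the combined count is the size of the union, which can be smaller than the sum of the other two.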

  New module crates/blockchain/src/coverage.rs holds the Coverage type
  (seen + has_subnet bitsets, derived subnet via vid % committee_count to
  match the gossip subnet assignment) plus the 3 emission helpers and 6
  unit tests covering add_bits, merge_from, diff_counts, empty/zero/out-
  of-range edge cases.
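
To make the shape of the Coverage type concrete, here is a hedged sketch built only from what the commit message states: a `seen` bitset, a `has_subnet` bitset, and subnet derivation via `vid % committee_count`. The field types (`Vec<bool>`), the constructor, the accessor names, and the out-of-range handling are assumptions; the real coverage.rs may differ.

```rust
/// Hypothetical coverage accumulator; not the actual coverage.rs type.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct Coverage {
    /// seen[vid] is true once validator `vid` is covered.
    seen: Vec<bool>,
    /// has_subnet[s] is true once any validator on subnet `s` is covered.
    has_subnet: Vec<bool>,
}

impl Coverage {
    pub fn new(validator_count: usize, committee_count: usize) -> Self {
        Self {
            seen: vec![false; validator_count],
            has_subnet: vec![false; committee_count],
        }
    }

    /// Marks every validator whose bit is set. Bits at or beyond the
    /// validator count are ignored (an assumed edge-case policy).
    pub fn add_bits(&mut self, bits: &[bool]) {
        let committee_count = self.has_subnet.len();
        for (vid, &bit) in bits.iter().enumerate() {
            if !bit || vid >= self.seen.len() {
                continue;
            }
            self.seen[vid] = true;
            if committee_count > 0 {
                // Same rule as the gossip subnet assignment: vid % committee_count.
                self.has_subnet[vid % committee_count] = true;
            }
        }
    }

    /// Folds another coverage into this one.
    pub fn merge_from(&mut self, other: &Coverage) {
        self.add_bits(&other.seen);
    }

    /// Number of covered validators.
    pub fn validators(&self) -> usize {
        self.seen.iter().filter(|&&b| b).count()
    }

    /// Number of subnets with at least one covered validator.
    pub fn subnets(&self) -> usize {
        self.has_subnet.iter().filter(|&&b| b).count()
    }

    /// Returns (only in self, only in other).
    pub fn diff_counts(&self, other: &Coverage) -> (usize, usize) {
        let n = self.seen.len().max(other.seen.len());
        let get = |c: &Coverage, i: usize| c.seen.get(i).copied().unwrap_or(false);
        let (mut only_self, mut only_other) = (0, 0);
        for i in 0..n {
            match (get(self, i), get(other, i)) {
                (true, false) => only_self += 1,
                (false, true) => only_other += 1,
                _ => {}
            }
        }
        (only_self, only_other)
    }
}
```

In this sketch, `validators()` would feed the `subnet="combined"` series and `subnets()` the covered-subnet gauge for a section.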

  Storage gets a CoverageSnapshot type and two Arc<Mutex<Option<…>>>
  fields on Store. No proofs are duplicated — only AggregationBits are
  captured, keeping the per-slot allocation in the tens of bytes per
  entry. The pre-merge capture happens inside accept_new_attestations
  just before promote_new_aggregated_payloads, so consumer-side timing
  concerns stay in the existing tick path.

  BlockChain::spawn now takes attestation_committee_count as a
  parameter; bin/ethlambda/src/main.rs already resolves the value
  (CLI > validator-config.yaml > 1) and passes it through. The number
  of attestation committees was previously only known to P2P (for
  subnet subscriptions); the coverage emitters need it to derive
  subnet ids.
@pablodeymo
Collaborator Author

Instrumentation added in 855f56d. Five emission sites now write to the 18 series registered in e9a04f7:

  • timely / late / block / combined / diff_validators — emitted at the slot boundary (on_tick interval 0) for slot - 1, sourcing from a pre-merge snapshot of new_payloads, current late arrivals, and the last-imported block's aggregation bits.
  • agg_start_new — emitted right before fork-choice aggregation runs (interval 2).
  • proposal_payloads / proposal_gossip / proposal_combined — emitted from propose_block after produce_block_with_signatures, classifying validators in the proposed block as known-payload-covered vs. gossip-only.

Coverage lives in a new crates/blockchain/src/coverage.rs (~280 lines, 6 unit tests covering add_bits, merge_from, diff_counts, plus empty/zero/out-of-range edge cases). Subnet derivation uses vid % committee_count to match the gossip subnet assignment in crates/net/p2p/src/lib.rs:241.

Store gets a small CoverageSnapshot type and two Arc<Mutex<Option<…>>> fields — no proofs duplicated, only AggregationBits. The pre-merge capture happens inside accept_new_attestations just before promote_new_aggregated_payloads, so consumer-side timing concerns stay in the existing tick path.

BlockChain::spawn gains an attestation_committee_count: u64 parameter — only bin/ethlambda/src/main.rs calls it, and the value was already resolved there (CLI > validator-config.yaml > 1).
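
The stated precedence (CLI > validator-config.yaml > 1) amounts to a two-step fallback. A minimal sketch, assuming both sources are already parsed into `Option<u64>`; the function name is hypothetical and this is not the code in main.rs:

```rust
/// Resolves the committee count: the CLI value wins, then the config file,
/// then the default of 1.
pub fn resolve_committee_count(cli: Option<u64>, config_file: Option<u64>) -> u64 {
    cli.or(config_file).unwrap_or(1)
}
```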

Ready for another look.

2 participants