Improve TopologyService and HeartbeatService scalability for large clusters #17595

Draft
CRZbulabula wants to merge 6 commits into master from upgrade-heartbeat-service

Conversation

CRZbulabula (Contributor) commented May 5, 2026

Summary

This PR improves the scalability of TopologyService, HeartbeatService, and related components for clusters with a large number of DataNodes.

TopologyService — probing scalability

  • √N sampling with batch rotation: Each probing cycle selects only ceil(√N) DataNodes as probers instead of all N, rotating across cycles for full coverage (see the sketch after this list). This reduces per-cycle RPC fan-out from O(N) to O(√N) and total connection tests from O(N²) to O(N√N).
  • Batch-scoped failure detection: The failure detector runs only on heartbeat pairs from the current prober batch. Non-prober pairs carry forward their previous reachable state from heartbeat history.
  • Only probe Running DataNodes: Uses LoadManager.getNodeStatus() to filter, replacing the manually maintained startingDataNodes list.
  • Configurable interval and timeout: topology_probing_base_interval_in_ms (default 5000) and topology_probing_timeout_ratio (default 0.5) replace hardcoded constants.
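
A minimal sketch of the √N sampling and rotation idea, assuming the probers are picked from a plain list of running DataNode ids; the class and method names (ProberSampler, selectProbers) are illustrative, not the PR's actual TopologyService code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of ceil(sqrt(N)) prober sampling with batch rotation.
public class ProberSampler {

  private int rotationOffset = 0;

  /** Selects the next batch of prober DataNode ids, rotating across cycles for full coverage. */
  public synchronized List<Integer> selectProbers(List<Integer> runningDataNodeIds) {
    int n = runningDataNodeIds.size();
    if (n == 0) {
      return new ArrayList<>();
    }
    int batchSize = (int) Math.ceil(Math.sqrt(n));
    List<Integer> batch = new ArrayList<>(batchSize);
    for (int i = 0; i < batchSize; i++) {
      batch.add(runningDataNodeIds.get((rotationOffset + i) % n));
    }
    // Advance the offset so successive cycles eventually visit every node.
    rotationOffset = (rotationOffset + batchSize) % n;
    return batch;
  }
}
```

With N = 9, for example, each cycle probes 3 nodes and the rotation covers all 9 after 3 cycles.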

TopologyService — topology distribution

  • New updateClusterTopology Thrift RPC: Added TUpdateClusterTopologyReq struct and updateClusterTopology service method to datanode.thrift, fully decoupled from the heartbeat interface.
  • Dedicated PUSH_TOPOLOGY request type: Topology updates are pushed via CnToDnInternalServiceAsyncRequestManager with PUSH_TOPOLOGY async request type, calling updateClusterTopology on DataNodes.
  • Per-DataNode topology: Each DataNode receives only its own reachable set plus the full dataNodes location map, reducing push payload from O(N²) to O(N) per node.
  • Incremental push: lastPushedTopology tracks what was last sent; only DataNodes whose reachable set has changed receive a push, and lastPushedTopology is updated only on successful delivery (in the response callback). A sketch of this check follows the list.
  • Removed topology from heartbeat: HeartbeatService.genHeartbeatReq() no longer sets topology or dataNodes fields. Topology handling removed from getDataNodeHeartBeat handler on DataNode side.
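
A hedged sketch of the per-DataNode incremental push decision: compare each node's current reachable set against what was last successfully pushed, and record the new set only in the success callback. Field and method names here are assumptions, not the PR's actual code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of incremental, per-DataNode topology pushes.
public class IncrementalTopologyPusher {

  // DataNode id -> reachable DataNode ids that were last successfully pushed to it.
  private final Map<Integer, Set<Integer>> lastPushedTopology = new HashMap<>();

  /** Returns true if this DataNode needs a push because its reachable set changed. */
  public boolean needsPush(int dataNodeId, Set<Integer> currentReachable) {
    return !currentReachable.equals(lastPushedTopology.get(dataNodeId));
  }

  /** Called from the async response callback: record the pushed set only on success. */
  public void onPushSuccess(int dataNodeId, Set<Integer> pushedReachable) {
    lastPushedTopology.put(dataNodeId, pushedReachable);
  }
}
```

Recording only on success means a failed or timed-out push is naturally retried on the next cycle, since the diff against lastPushedTopology stays non-empty.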

ClusterTopology (DataNode side)

  • Simplified to per-node view: Stores only myReachableNodes (this node's reachable set) instead of the full O(N²) topology map. isPartitioned is computed from myReachableNodes.size() != dataNodes.size().
  • Graceful handling of unknown topology: When TopologyService hasn't probed yet (myReachableNodes is empty), getValidatedReplicaSet, getReachableCandidates, and filterReachableCandidates return full replica sets instead of empty results.
  • Simplified getReachableCandidates: Filters by this node's own reachable set instead of brute-force searching across all nodes' views, reducing complexity from O(N×R) to O(R). A sketch of the per-node view follows.
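
A minimal sketch of the per-node view, assuming nodes are tracked by DataNode id; names and signatures are illustrative rather than the actual ClusterTopology API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical per-node topology view: filter by this node's own reachable set in O(R).
public class PerNodeTopologyView {

  private volatile Set<Integer> myReachableNodes; // refreshed by updateClusterTopology pushes

  /** Filters replica candidates by this node's reachable set. */
  public List<Integer> filterReachableCandidates(List<Integer> replicaDataNodeIds) {
    Set<Integer> reachable = myReachableNodes;
    if (reachable == null || reachable.isEmpty()) {
      // Topology not probed yet: fall back to the full replica set instead of an empty result.
      return replicaDataNodeIds;
    }
    List<Integer> result = new ArrayList<>();
    for (int id : replicaDataNodeIds) {
      if (reachable.contains(id)) {
        result.add(id);
      }
    }
    return result;
  }
}
```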

Timeout protection and thread isolation

  • Bounded CountDownLatch.await: testAllDataNodeConnectionInHeartbeatChannel and all 4 service-type test methods (submitTestConnectionTask) now use sendAsyncRequestWithTimeoutInMs(timeout) instead of unbounded await().
  • Dedicated probing thread pool: submitInternalTestConnectionTask offloads blocking work to TOPOLOGY_PROBING_EXECUTOR (sized max(1, cores/4)) with Future.get(timeout), keeping the DataNodeInternalRPCService thread pool free for sendFragmentInstance and other critical RPCs.
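
A sketch of the bounded, thread-isolated probing pattern, assuming the blocking connection test is handed over as a Runnable; apart from the executor name and its sizing, the surrounding class and method are illustrative:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: run blocking connection tests on a dedicated pool and bound the wait,
// so RPC service threads are never stuck longer than the configured timeout.
public class ProbingExecutorSketch {

  private static final ExecutorService TOPOLOGY_PROBING_EXECUTOR =
      Executors.newFixedThreadPool(Math.max(1, Runtime.getRuntime().availableProcessors() / 4));

  /** Offloads the test and waits at most timeoutMs; returns false on timeout or failure. */
  public static boolean testConnectionWithTimeout(Runnable connectionTest, long timeoutMs) {
    Future<?> future = TOPOLOGY_PROBING_EXECUTOR.submit(connectionTest);
    try {
      future.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      future.cancel(true); // don't leave a probing thread stuck on an unreachable peer
      return false;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } catch (Exception e) {
      return false;
    }
  }
}
```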

HeartbeatService scalability

  • Increased selector threads: Heartbeat client pools use a dedicated heartbeat_selector_num_of_client_manager config (default 0 = auto: max(1, cores/4)), up from the general default of 1.
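
A small sketch of how the 0 = auto default could resolve to a concrete selector-thread count (class and method names are assumptions for illustration only):

```java
// Hypothetical resolver for the heartbeat_selector_num_of_client_manager setting.
public class SelectorNumResolver {

  /** 0 (the default) means auto: max(1, available cores / 4); any positive value is used as-is. */
  static int resolveHeartbeatSelectorNum(int configuredValue) {
    return configuredValue > 0
        ? configuredValue
        : Math.max(1, Runtime.getRuntime().availableProcessors() / 4);
  }
}
```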

Bug fixes

  • Topology diff logging: LoadCache.updateTopology() and ClusterTopology.updateTopology() had copy-paste bugs where originReachable read from latestTopology instead of the old topology, making the diff log dead code.
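
A hedged sketch of the ordering the fix restores: snapshot the old reachable set before overwriting it, so the "now unreachable" diff can actually be computed. Field names and the log statement are illustrative only:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the diff-logging fix.
public class TopologyDiffSketch {

  private Set<Integer> latestReachable = new HashSet<>();

  public void updateTopology(Set<Integer> newReachable) {
    // Fix: read the previous set BEFORE replacing it (the bug read the freshly assigned one,
    // so the diff below was always empty).
    Set<Integer> originReachable = new HashSet<>(latestReachable);
    latestReachable = new HashSet<>(newReachable);
    for (int nodeId : originReachable) {
      if (!newReachable.contains(nodeId)) {
        System.out.printf("[Topology] DataNode %d is now unreachable%n", nodeId);
      }
    }
  }
}
```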

New configuration parameters

Parameter                                  Default    Description
topology_probing_base_interval_in_ms       5000       Base interval (ms) between topology probing cycles
topology_probing_timeout_ratio             0.5        Probing timeout as a ratio of the probing interval
heartbeat_selector_num_of_client_manager   0 (auto)   Selector threads for heartbeat client pools; 0 = max(1, cores/4)
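
For reference, a hedged example of setting these in the ConfigNode properties file (the exact file name and section depend on the IoTDB version):

```properties
# Probe topology every 5 seconds; each probe times out at 50% of the interval.
topology_probing_base_interval_in_ms=5000
topology_probing_timeout_ratio=0.5
# 0 = auto: max(1, cores / 4) selector threads for heartbeat client pools.
heartbeat_selector_num_of_client_manager=0
```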

Test plan

  • Deploy a 3-node cluster and confirm topology probing selects ceil(√3) = 2 probers per cycle via [Topology] debug logs
  • Scale to 9+ DataNodes, confirm prober rotation covers all nodes over √N cycles
  • Stop one DataNode, confirm other DataNodes' internal RPC threads are not blocked beyond dnConnectionTimeoutInMS
  • Trigger a network partition, confirm [Topology] DataNode X is now unreachable to myself(Y) log entries appear on the affected DataNode
  • Verify queries work correctly before TopologyService has completed its first probing cycle (empty topology → full replicas returned)
  • Run existing TopologyService and HeartbeatService integration tests

🤖 Generated with Claude Code

codecov bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 17.73050% with 116 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.05%. Comparing base (f72b3ee) to head (b0e2df7).

Files with missing lines Patch % Lines
...nfignode/manager/load/service/TopologyService.java 7.27% 51 Missing ⚠️
...ol/thrift/impl/DataNodeInternalRPCServiceImpl.java 17.24% 24 Missing ⚠️
...che/iotdb/db/queryengine/plan/ClusterTopology.java 4.16% 23 Missing ⚠️
...he/iotdb/confignode/conf/ConfigNodeDescriptor.java 0.00% 8 Missing ⚠️
...apache/iotdb/confignode/conf/ConfigNodeConfig.java 25.00% 6 Missing ⚠️
...apache/iotdb/commons/client/ClientPoolFactory.java 50.00% 2 Missing ⚠️
...iotdb/confignode/manager/load/cache/LoadCache.java 0.00% 1 Missing ⚠️
...otdb/db/pipe/agent/task/PipeDataNodeTaskAgent.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #17595   +/-   ##
=========================================
  Coverage     40.05%   40.05%           
  Complexity     2554     2554           
=========================================
  Files          5176     5176           
  Lines        348528   348594   +66     
  Branches      44558    44557    -1     
=========================================
+ Hits         139595   139626   +31     
- Misses       208933   208968   +35     

☔ View full report in Codecov by Sentry.

sonarqubecloud bot commented May 5, 2026

Quality Gate failed

Failed conditions
Reliability Rating on New Code: C (required ≥ A)

See analysis details on SonarQube Cloud

- Only run failure detector on current prober batch instead of all pairs
- Push topology via dedicated PUSH_TOPOLOGY request type (not heartbeat)
- Update lastPushedTopology only on successful push
- Simplify onNodeStatisticsChanged to only clean up on removal/Removing
- ClusterTopology returns full replicas when topology not yet probed
- Add Javadoc to ClusterTopology
CRZbulabula force-pushed the upgrade-heartbeat-service branch from bfbda1a to 431e2b3 on May 6, 2026 at 06:35
- Add TUpdateClusterTopologyReq struct and updateClusterTopology RPC to datanode.thrift
- Implement updateClusterTopology handler in DataNodeInternalRPCServiceImpl
- Change PUSH_TOPOLOGY action to call updateClusterTopology instead of getDataNodeHeartBeat
- Use DataNodeTSStatusRPCHandler (default) instead of custom TopologyPushRPCHandler
- Remove topology handling from heartbeat handler (getDataNodeHeartBeat)
- Delete TopologyPushRPCHandler (no longer needed)