[server] Add disk-usage write protection to TabletServer #3340

Open
swuferhong wants to merge 1 commit into apache:main from swuferhong:disk-usage-protect

Conversation

@swuferhong
Contributor

Purpose

Linked issue: close #3338

Introduce a periodic disk-usage monitoring mechanism that proactively rejects client writes when the TabletServer's data disk usage exceeds a configurable high-water-mark ratio, preventing ENOSPC errors and potential data corruption.

Key design decisions:

  • Hysteresis state machine with a fixed 10% recovery gap to avoid rapid lock/unlock oscillation (lock at limit, unlock at limit-0.10)
  • Max-per-disk strategy: report the highest usage across all distinct FileStores so a single full disk is never masked by other low-usage disks in multi-disk deployments
  • Only client-driven writes (appendLog/putKv) are rejected with a retriable DiskWriteLockedException; follower replication is not blocked to preserve replica consistency
  • write-limit-ratio supports runtime dynamic reconfiguration via ServerReconfigurable, with an immediate re-check on change
  • Setting ratio to 1.0 completely disables the protection
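
The hysteresis behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name `DiskWriteGate` and its methods are hypothetical, and it assumes the lock engages when usage is at or above the limit and releases when usage falls to the limit minus 0.10 or below.

```java
/** Hypothetical sketch; names are illustrative and not taken from this PR. */
final class DiskWriteGate {
    // Fixed recovery gap from the design decisions (10%).
    private static final double RECOVERY_GAP = 0.10;

    private double writeLimitRatio;
    private boolean locked;

    DiskWriteGate(double writeLimitRatio) {
        this.writeLimitRatio = writeLimitRatio;
    }

    /** Feeds one usage sample into the state machine; returns true while client writes are locked. */
    synchronized boolean update(double usageRatio) {
        if (writeLimitRatio >= 1.0) {
            // A ratio of 1.0 disables the protection entirely.
            locked = false;
        } else if (!locked && usageRatio >= writeLimitRatio) {
            // Assumption: lock at or above the limit.
            locked = true;
        } else if (locked && usageRatio <= writeLimitRatio - RECOVERY_GAP) {
            // Unlock only after usage drops below the recovery threshold.
            locked = false;
        }
        return locked;
    }

    /** Dynamic reconfiguration: apply the new limit and immediately re-check the latest sample. */
    synchronized boolean reconfigure(double newLimit, double latestUsageRatio) {
        this.writeLimitRatio = newLimit;
        return update(latestUsageRatio);
    }

    synchronized boolean isLocked() {
        return locked;
    }
}
```

With a limit of 0.85, a sample of 0.90 locks writes, a later sample of 0.80 keeps them locked because it is still inside the gap, and a sample of 0.70 unlocks them. Only the client write path would consult `isLocked()`; follower replication would bypass it.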

New configuration:

  • server.data-disk.write-limit-ratio (default 0.85, dynamic)
  • server.data-disk.check-interval (default 30s)

New metrics:

  • diskUsageRatio: current disk usage ratio [0.0, 1.0]
  • diskWriteLocked: 1 when writes are being rejected, 0 otherwise
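
As a rough sketch of how the two metric values could be derived under the max-per-disk strategy: the class `DiskUsageSampler` and its methods below are hypothetical, not the classes added by this PR. It relies on `java.nio.file.FileStore` and assumes the platform's `FileStore` implementation provides a meaningful `equals`/`hashCode`, so that directories on the same disk are counted once.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; names are illustrative and not taken from this PR. */
final class DiskUsageSampler {

    /** Usage ratio of one store, clamped to [0.0, 1.0]; returns 0.0 when the total size is unknown. */
    static double usageRatio(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            return 0.0;
        }
        double used = 1.0 - (double) usableBytes / totalBytes;
        return Math.min(1.0, Math.max(0.0, used));
    }

    /** Highest usage ratio across the distinct FileStores backing the given data directories. */
    static double maxUsageRatio(List<Path> dataDirs) throws IOException {
        Set<FileStore> seen = new HashSet<>();
        double max = 0.0;
        for (Path dir : dataDirs) {
            FileStore store = Files.getFileStore(dir);
            if (!seen.add(store)) {
                continue; // this disk was already counted via another directory
            }
            max = Math.max(max, usageRatio(store.getTotalSpace(), store.getUsableSpace()));
        }
        return max;
    }

    /** Value for a diskWriteLocked-style gauge: 1 while writes are rejected, 0 otherwise. */
    static int lockedGauge(boolean locked) {
        return locked ? 1 : 0;
    }
}
```

Taking the maximum rather than an average means one disk at 95% is reported as 0.95 even if the remaining disks are nearly empty.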

Brief change log

Tests

API and Format

Documentation

@swuferhong force-pushed the disk-usage-protect branch 2 times, most recently from 5949356 to ac209de (May 18, 2026 03:16)
Member

@zuston left a comment

If the disk usage ratio exceeds the threshold (or the disk is corrupted), should we mark this tablet server as offline or unhealthy? I don't think writer-side fencing is enough; in many cases, excessive disk usage will not recover automatically.

@swuferhong force-pushed the disk-usage-protect branch from ac209de to feb7dc6 (May 18, 2026 03:36)
@swuferhong
Contributor Author

If the disk usage ratio exceeds the threshold (or the disk is corrupted), should we mark this tablet server as offline or unhealthy? I don't think writer-side fencing is enough; in many cases, excessive disk usage will not recover automatically.

Hi, @zuston. Writer-side fencing is the minimum-sufficient response for a capacity event; promoting it to node-level offline turns a localized capacity problem into a cluster-wide availability incident and triggers cascading failover. Disk corruption is a separate fault domain (IOException-driven Log Directory Failure) and should be addressed in a dedicated PR.

Happy to add a follow-up issue tracking the Log Directory Failure work if that helps.
