cloudhypervisor: route serial through a hypeman-owned unix socket #210

Merged
sjmiller609 merged 6 commits into main from hypeship/ch-serial-socket
May 8, 2026

@sjmiller609
Collaborator

@sjmiller609 sjmiller609 commented May 7, 2026

Summary

Switch CH's serial console from mode=File to mode=Socket. CH binds a unix socket in the instance directory and hypeman dials it as a client; a goroutine in the cloudhypervisor package copies bytes from the socket into the log file opened with O_APPEND.

This is the CH-side counterpart to the QEMU O_APPEND fix. CH's ConsoleConfig has no append flag, so we can't ask it to open the file correctly — the only way to get an O_APPEND writer is to make hypeman the writer.

Why

CH's serial.mode=File opens with plain O_WRONLY|O_CREAT, no O_APPEND, and never reopens on signal. When rotateLogIfNeeded truncates the log file out from under it, CH's next write lands at its stale fd offset and creates a sparse hole of NUL bytes from byte 0 up to that offset. Downstream log readers chunk those NULs, JSON-encode them at ~6× expansion, and choke when batches exceed receiver body limits.

With hypeman owning the writer fd, O_APPEND atomically seeks to EOF on every write, so post-truncate writes correctly resume at byte 0.

Changes

  • cloudhypervisor/config.go: serial.mode is now Socket with a derived socket path (one level above logs/, kept short for sun_path).
  • cloudhypervisor/serial.go — new serialReader dials CH's bound socket with retry (CH is the server in mode=Socket) and copies bytes into app.log opened with O_APPEND. The reader is owned by the CloudHypervisor client so Shutdown can stop it.
  • cloudhypervisor/process.go — wire the reader into StartVM and RestoreVM. Reader is started before CH so it's ready to dial as soon as CH binds during vm.create. Cleanup paths close it on failure.
  • cloudhypervisor/fork_snapshot.go — fork rewrites and RestoreVM both migrate snapshot configs from File-mode to Socket-mode so legacy snapshots get the fix on the next restore.

Tests

  • TestSerialReader_CopiesBytesToLog — basic socket → file copy.
  • TestSerialReader_NoSparseHoleAfterCopytruncate — regression: write bytes, copytruncate, write more, assert the post-truncate file is non-sparse and content is exactly the post-truncate bytes.
  • TestRewriteSerialConfigForRestore — covers File→Socket migration, idempotence on already-Socket configs, and no-op when no serial block is present.
  • TestRewriteSnapshotConfigForFork — updated for new shape, asserts legacy serial.file is dropped on fork rewrite.
  • Integration: TestSystemdMode/SerialLogSurvivesCopytruncate — boots a real CH VM with the existing systemd image, waits for serial output to accumulate, performs copytruncate against app.log mid-run, drives more serial output via /dev/kmsg, then asserts the post-rotation file is non-sparse (allocated blocks ≈ apparent size) and does not start with mostly NUL bytes.
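
The two assertions in the integration test (allocated blocks ≈ apparent size, not mostly NUL at the start) could be expressed roughly as follows. This is a sketch under assumptions, not the test's actual code: the helper names and the 0.5 threshold are made up here, and the `syscall.Stat_t` lookup only works on Unix-like platforms.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"syscall"
)

// nulFraction returns the share of NUL bytes in b (0 for empty input).
func nulFraction(b []byte) float64 {
	if len(b) == 0 {
		return 0
	}
	return float64(bytes.Count(b, []byte{0})) / float64(len(b))
}

// looksSparse reports whether far fewer bytes are allocated than the apparent
// size. The 0.5 threshold is an arbitrary choice for this sketch.
func looksSparse(size, allocated int64) bool {
	return size > 0 && float64(allocated) < 0.5*float64(size)
}

// sizes returns the apparent size and the allocated bytes of path.
// st_blocks is counted in 512-byte units.
func sizes(path string) (size, allocated int64, err error) {
	fi, err := os.Stat(path)
	if err != nil {
		return 0, 0, err
	}
	st, ok := fi.Sys().(*syscall.Stat_t)
	if !ok {
		return 0, 0, fmt.Errorf("no Stat_t available for %s", path)
	}
	return fi.Size(), int64(st.Blocks) * 512, nil
}

func main() {
	f, err := os.CreateTemp("", "app-*.log")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	data := bytes.Repeat([]byte("boot ok\n"), 1024)
	if _, err := f.Write(data); err != nil {
		panic(err)
	}
	f.Close()

	size, allocated, err := sizes(f.Name())
	if err != nil {
		panic(err)
	}
	fmt.Println("apparent size:", size, "allocated:", allocated)
	fmt.Println("looks sparse:", looksSparse(size, allocated))
	fmt.Println("NUL fraction of contents:", nulFraction(data))
}
```

Some filesystems store very small files inline and report zero blocks for them, so a block-based check is only meaningful once the log has grown past a few KiB.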

Backward compatibility

Snapshots taken with the old File-mode config are migrated in place on restore (and on fork), so the next time a legacy snapshot is restored or forked it switches to Socket mode and gets the fix. The rewrite is idempotent on already-Socket configs.
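
A rough sketch of what an idempotent File→Socket rewrite can look like on a decoded snapshot config. This is not the PR's rewriteSerialConfigForRestore: `rewriteSerialForSocket` is a hypothetical name, and the `serial.mode` / `serial.file` / `serial.socket` keys are assumed from the description above rather than copied from the code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rewriteSerialForSocket (hypothetical) switches a decoded config's serial
// block to Socket mode and reports whether anything changed. Configs without
// a serial block are left alone.
func rewriteSerialForSocket(cfg map[string]any, socketPath string) bool {
	serial, ok := cfg["serial"].(map[string]any)
	if !ok {
		return false // no serial block: nothing to migrate
	}
	_, hadFile := serial["file"]
	changed := hadFile || serial["mode"] != "Socket" || serial["socket"] != socketPath
	serial["mode"] = "Socket"
	serial["socket"] = socketPath
	delete(serial, "file") // drop the legacy File-mode target
	return changed
}

func main() {
	legacy := []byte(`{"serial":{"mode":"File","file":"/inst/logs/app.log"}}`)
	var cfg map[string]any
	if err := json.Unmarshal(legacy, &cfg); err != nil {
		panic(err)
	}
	fmt.Println("first pass changed: ", rewriteSerialForSocket(cfg, "/inst/serial.sock"))
	fmt.Println("second pass changed:", rewriteSerialForSocket(cfg, "/inst/serial.sock"))
	out, err := json.Marshal(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```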

Test plan

  • go test ./lib/hypervisor/cloudhypervisor/... passes locally
  • go vet ./lib/hypervisor/cloudhypervisor/... ./integration/... clean
  • gofmt clean
  • Linux integration job (self-hosted KVM runner) green — TestSystemdMode/SerialLogSurvivesCopytruncate runs end-to-end on CH

Note

Medium Risk
Changes Cloud Hypervisor VM bring-up/restore/shutdown and snapshot config rewriting to route serial through a new socket reader goroutine; failures could impact VM startup, restore, or log capture. Risk is mitigated by unit tests plus an end-to-end integration regression test for the sparse-log issue.

Overview
Cloud Hypervisor serial logging is switched from mode=File to socket-based serial with a hypeman-owned writer opened using O_APPEND, preventing copytruncate rotation from creating sparse NUL holes in app.log.

This wires a new serialReader into CH StartVM/RestoreVM (started before CH and closed on cleanup/Shutdown), and adds snapshot migration logic so both forked and restored legacy snapshots are rewritten from File-mode serial to Socket-mode.

Adds focused unit tests for socket→file copying and the copytruncate regression, updates fork snapshot tests for the new serial config shape, and adds an integration regression test that performs a real copytruncate against a running CH VM and asserts the post-rotation log is non-sparse and not NUL-prefixed.

Reviewed by Cursor Bugbot for commit 02bb2de. Bugbot is set up for automated code reviews on this repo.

Cloud Hypervisor's serial.mode=File opens the log file with plain
O_WRONLY|O_CREAT and never reopens on signal. When the file is
externally truncated (e.g. by rotateLogIfNeeded's copytruncate), CH's
next write lands at its stale fd offset, leaving a sparse hole of NUL
bytes from byte 0 to that offset. Downstream log readers chunk those
NULs and choke (they JSON-encode at ~6x expansion, so a 64KiB chunk
becomes a ~384KiB record and small batches blow past the receiver's
1MiB body limit).
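
The ~6x figure follows from how JSON escapes control characters. As a quick check, assuming the reader puts raw chunk bytes into a JSON string (how the downstream readers actually encode is not shown in this PR), Go's encoding/json turns every NUL into the six-byte escape `\u0000`:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encodedLen returns the size of chunk once JSON-encoded as a string value,
// including the two surrounding quotes.
func encodedLen(chunk []byte) (int, error) {
	out, err := json.Marshal(string(chunk))
	return len(out), err
}

func main() {
	chunk := make([]byte, 64*1024) // a chunk read from the sparse hole: all NUL
	n, err := encodedLen(chunk)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d raw bytes -> %d encoded bytes (%.1fx)\n",
		len(chunk), n, float64(n)/float64(len(chunk)))
}
```

Three such records already exceed a 1MiB body limit.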

CH's ConsoleConfig has no append flag, so we can't ask CH to use
O_APPEND directly. Switch to mode=Socket: hypeman binds a unix socket
in the instance directory, CH connects to it as a client, and a small
goroutine in the cloudhypervisor package copies bytes from the socket
into the log file opened with O_APPEND. Because hypeman now owns the
writer fd, copytruncate is safe — O_APPEND atomically seeks to EOF
on every write, so post-truncate writes correctly resume at byte 0.

Snapshot fork rewrites are updated to migrate legacy File-mode
serial config to Socket on fork.

Adds:
- TestSerialReader_CopiesBytesToLog — basic byte path
- TestSerialReader_NoSparseHoleAfterCopytruncate — regression
- integration TestSystemdMode/SerialLogSurvivesCopytruncate — boots a
  real CH VM, copies+truncates app.log mid-run, asserts the post-rotation
  file is non-sparse and starts with non-NUL bytes

Cloud Hypervisor's mode=Socket calls UnixListener::bind() inside
vm.create. The previous implementation made hypeman bind the socket
first, which caused EADDRINUSE on CH side. Flip the design: hypeman
opens the log with O_APPEND up front, then dials the socket with
retry once CH binds it during boot.

Also use a short /tmp temp dir in unit tests so socket paths stay
under the platform sun_path limit (104 bytes on macOS).

Long test temp paths plus /logs/serial.sock pushed the unix socket
path over Linux's 108-byte sun_path limit, so CH's UnixListener::bind
returned EINVAL ("path must be shorter than SUN_LEN"). Place the
socket next to ch.sock at the instance dir instead of under logs/.

Pre-fix snapshots embed serial.mode=File, so a plain restore on an
upgraded hypeman would still write directly to app.log without
O_APPEND and reproduce the copytruncate sparse-hole bug. Mirror the
fork-time rewrite in RestoreVM so legacy snapshots are migrated to
mode=Socket pointing at the per-instance reader. Idempotent for
post-fix snapshots already on Socket.

@sjmiller609 sjmiller609 requested a review from rgarcia May 7, 2026 23:23
@sjmiller609 sjmiller609 marked this pull request as ready for review May 7, 2026 23:23
@firetiger-agent

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

Any PR that changes the kernel API. Monitor changes to API endpoints (packages/api/cmd/api/) and Temporal workflows (packages/api/lib/temporal) in the kernel repo

Reason: PR modifies the cloudhypervisor package, not the kernel API endpoints (packages/api/cmd/api/) or Temporal workflows (packages/api/lib/temporal) specified in the filter.

To monitor this PR anyway, reply with @firetiger monitor this.

Comment thread lib/hypervisor/cloudhypervisor/process.go
Contributor

@rgarcia rgarcia left a comment


  1. PR body contradicts the code in two places. body says "serialReader listens on the socket, accepts CH's connection. Listener is closed after first accept." — code does the opposite (net.Dial, hypeman is client). body also says "Snapshots taken with the old File-mode config are restored as-is by CH" — but rewriteSerialConfigForRestore actively migrates File→Socket. earlier paragraphs in the body have it right; the "Changes" + "Backward compatibility" sections are stale. update before merge so future readers aren't misled.

  2. small race in Close() between dialUnixWithRetry returning a conn and the goroutine assigning it to s.conn under mu. if Close() runs in that window, it sees s.conn == nil, doesn't close anything, hits the 2s timeout, returns. the goroutine then assigns s.conn and io.Copy blocks forever on a leaked conn. window is tiny (between return conn, nil and s.mu.Lock()), but cleanly fixable: store s.conn before releasing it from dialUnixWithRetry, or check ctx after the lock and bail.

  3. net.Dial("unix", path) doesn't honor ctx. loop checks ctx between attempts, but a single Dial syscall is uninterruptible. on a missing path Dial returns instantly so it's not really an issue in practice — flagging as it'll bite if someone changes the path semantics later. (&net.Dialer{}).DialContext(ctx, "unix", path) is the ctx-aware form.

  4. serialSocketPath derives the instance dir via filepath.Dir(filepath.Dir(logPath)). fragile if InstanceAppLog layout ever changes. minor — a paths.InstanceSerialSocket(id) helper would be cleaner and consistent with the rest of lib/paths/paths.go. low priority.

Bugbot flagged that on the success path of StartVM/RestoreVM the
serialReader was constructed but never retained anywhere — so its
Close() could not be called from Shutdown, and the unix socket file
persisted on disk after the VM went away (CH unlinks on Drop, but
nothing guarantees CH actually exits cleanly).

Hand the reader to the CloudHypervisor client and call Close() in
Shutdown. Also defer os.Remove(socketPath) inside the run goroutine
so the socket is unlinked on goroutine exit even if Close() is never
called (e.g. CH crash).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two related fixes in serialReader:

1. Race in run/Close. Between dialUnixWithRetry returning a conn and
   the goroutine assigning it under s.mu, Close() could acquire the
   lock, see s.conn == nil, and time out without closing the dialed
   conn. The goroutine would then publish the conn and io.Copy would
   block forever on a connection nobody can reach. Fix: re-check
   ctx.Err() under the lock and close the conn ourselves if Close
   already fired.

2. Use net.Dialer.DialContext instead of net.Dial so a single dial
   attempt is interruptible. The retry loop already checks ctx
   between attempts, but a single Dial syscall is not. In practice
   ENOENT returns instantly, but DialContext is the correct form.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@sjmiller609
Collaborator Author

thanks @rgarcia, all four addressed:

  1. updated PR body — Changes + Backward compat sections now match the code (CH binds, hypeman dials; restore migrates File→Socket).
  2. real bug, fixed in 02bb2de. re-check ctx.Err() under the lock before publishing s.conn; close the dialed conn ourselves if Close already fired.
  3. switched net.Dial → dialer.DialContext(ctx, ...) in 02bb2de.
  4. agreed on the layering, but adding paths.InstanceSerialSocket(id) would mean threading a CH-only socket path through hypervisor.VMConfig (and the qemu/vz/firecracker plumbing that consumes it), since serialSocketPath is called from ToVMConfig where we only have cfg.SerialLogPath. left as-is for now — happy to do the refactor in a follow-up if you'd rather not have the Dir(Dir()) derivation in tree.


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 02bb2de.

Comment thread lib/hypervisor/cloudhypervisor/serial.go
@sjmiller609 sjmiller609 merged commit a1b7f3b into main May 8, 2026
11 checks passed
@sjmiller609 sjmiller609 deleted the hypeship/ch-serial-socket branch May 8, 2026 14:16