fix(dataset rm): delete staging files from a uid-65532 pod, not jobs-manager (#259) by LukasWodka · Pull Request #78 · tracebloc/cli

LukasWodka · 2026-06-15T12:10:14Z

Summary

Fixes the tracebloc dataset rm teardown failure in tracebloc/client#259: the table dropped but the staging files leaked with Permission denied, and the "re-run" advice never worked.

Root cause (refined from the issue)

The leftover files are written by the CLI's ephemeral stage pod (uid 65532 + FSGroup 65532), not the ingestor. The teardown then exec'd rm -rf inside the long-lived jobs-manager pod (a different non-root uid, no shared fsGroup). A non-65532 uid can't delete 65532-owned files in a non-group-writable directory → EACCES, orphans left behind. Re-running hit the same wall every time.

(Full three-UID analysis + why fsGroup-on-jobs-manager is a no-op on hostPath is in the issue #259 refinement comment.)

Fix

Run the teardown rm from a short-lived pod that mirrors the stage pod's identity (uid 65532 + FSGroup 65532, shared PVC mounted), reusing the existing BuildStagePodSpec / CreateStagePod / WaitForStagePodReady / DeleteStagePod machinery (same pattern as push.Stage).

The teardown pod owns the staging files it deletes, so it works by ownership on hostPath (where fsGroup is a no-op, kubernetes/kubernetes#138411) and CSI alike. This fully fixes tabular datasets (no sidecar files) on every volume type — the common case and the current blocker.

Secondary:

Teardown now takes an injectable Executor (matching push.Stage), so the exec path is unit-testable.
dataset_rm drops the misleading "re-run completes the cleanup" claim — the table DROP is idempotent, and if file removal keeps failing the message points to node-side cleanup.

Scope / what's left

Refs #259 (not auto-closing). The image/sidecar case — the ingestor's /data/shared/<table> files written as uid 65534 — is deletable by this fix on CSI (via fsGroup) but not on hostPath. Closing that needs the documented complement (ingestor fsGroup 65532 + group-writable DEST_PATH in client-runtime / data-ingestors). Tabular datasets are fully covered now.

Test plan

go build ./..., go vet, gofmt — clean.
go test ./... — all packages pass.
New regression test TestTeardown_RemovesViaStageIdentityPod pins the fix: the rm runs in a tracebloc-stage-* pod with RunAsUser/FSGroup = 65532 (container stage), not the jobs-manager pod, with the correct rm -rf <paths>, and the teardown pod is cleaned up afterward.

🤖 Generated with Claude Code

…manager (#259) `tracebloc dataset rm` dropped the table but failed to delete the dataset's staging files on the shared PVC: rm: cannot remove '/data/shared/.tracebloc-staging/<t>/labels.csv': Permission denied Root cause: the staging files are written by the CLI's ephemeral stage pod as uid 65532 (+ fsGroup 65532), but the teardown exec'd `rm` inside the long-lived jobs-manager pod, which runs as a different non-root uid with no shared fsGroup. A non-65532 uid cannot delete 65532-owned files in a non-group-writable dir, so the rm hit EACCES and left orphans. The "re-run to clean up" advice was a dead end — the same permission error every time. Fix: run the teardown `rm` from a short-lived pod that mirrors the stage pod's identity (uid 65532 + fsGroup 65532, shared PVC mounted), reusing the existing BuildStagePodSpec / CreateStagePod / WaitForStagePodReady / DeleteStagePod machinery. That pod OWNS the staging files it deletes, so it works by ownership on hostPath (where fsGroup is a no-op, kubernetes/kubernetes#138411) and CSI alike. Fully fixes tabular datasets (no sidecar files) on every volume type. Also: - Teardown now takes an injectable Executor (matching push.Stage), enabling a regression test that pins "rm runs in a uid-65532 stage pod, not jobs-manager". - dataset_rm: drop the misleading "re-run completes the cleanup" claim; the table DROP is idempotent, and if file removal keeps failing, point to node-side cleanup. Refs #259. The image/sidecar case (ingestor's /data/shared/<table> written as uid 65534) on hostPath still needs the documented complement (ingestor fsGroup + group-writable DEST_PATH in client-runtime/data-ingestors). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

LukasWodka · 2026-06-15T12:11:17Z

👋 Heads-up — Code review queue is at 16 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

averaging-service#115 — Release: staging → main (averaging correctness sweep → production) · author: @aptracebloc · no reviewer assigned
client-runtime#105 — chore(ci): drop unused mysqlclient dep; report test coverage (Phase 0) · author: @LukasWodka · no reviewer assigned
client-runtime#106 — fix(schema): sync vendored ingest.v1.json — restore token_classification · author: @LukasWodka · no reviewer assigned
data-ingestors#253 — refactor(P3a): move per-category behavior flags into a ModalityRegistry · author: @LukasWodka · no reviewer assigned
data-ingestors#255 — refactor(P3b): move validator factories into the ModalityRegistry · author: @LukasWodka · no reviewer assigned
data-ingestors#256 — refactor(P3c): move sidecar-transfer factories into the ModalityRegistry · author: @LukasWodka · no reviewer assigned
data-ingestors#257 — refactor(P3d): source data_format from the ModalityRegistry · author: @LukasWodka · no reviewer assigned
design-system#47 — feat(table): Table compound organism (Figma 959:6061 + 986:9822) · author: @waqaskhanroghani · no reviewer assigned
docs#57 — release: promote develop → main (machine-sizing + Mintlify rebuild + page-coverage CI) · author: @LukasWodka · no reviewer assigned
frontend-app#523 — test(cypress): session + protected-route + error-state + endurance E2E coverage (and 2 bug fixes) · author: @waqaskhanroghani · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

LukasWodka assigned saadqbal Jun 15, 2026

LukasWodka mentioned this pull request Jun 15, 2026

bug: dataset rm cannot delete staging files — ingestor (uid 65534) vs jobs-manager uid mismatch, no shared fsGroup tracebloc/client#259

Open

2 tasks

aptracebloc approved these changes Jun 15, 2026

View reviewed changes

LukasWodka mentioned this pull request Jun 15, 2026

chore(schema): re-sync vendored ingest.v1.json from data-ingestors master #79

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(dataset rm): delete staging files from a uid-65532 pod, not jobs-manager (#259)#78

fix(dataset rm): delete staging files from a uid-65532 pod, not jobs-manager (#259)#78
LukasWodka wants to merge 1 commit into
developfrom
fix/259-dataset-rm-teardown-uid

LukasWodka commented Jun 15, 2026

Uh oh!

LukasWodka commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

LukasWodka commented Jun 15, 2026

Summary

Root cause (refined from the issue)

Fix

Scope / what's left

Test plan

Uh oh!

LukasWodka commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants