Extract preview/sync GitHub Actions #4897
037a389 to
79657eb
Compare
Observability diff (vs staging)
diff --git a/tmp/remote-canon.Nq1dRP/dashboards/boxel-status/indexing.json b/tmp/committed-canon.XjD11i/dashboards/boxel-status/indexing.json
index a39cf75..25280b9 100644
--- a/tmp/remote-canon.Nq1dRP/dashboards/boxel-status/indexing.json
+++ b/tmp/committed-canon.XjD11i/dashboards/boxel-status/indexing.json
@@ -69,6 +69,10 @@
"uid": "cef5v5sl9k7i8f"
},
"description": "System-wide operator action: queue a full reindex across every realm. The button disables itself while a `full-reindex` orchestration job is already pending or running. Per-realm reindex moved to the Realms dashboard. Click POSTs with `Authorization: Bearer ${grafana_secret}` (substituted from SSM at apply time, CS-10929).",
+ "fieldConfig": {
+ "defaults": {},
+ "overrides": []
+ },
"gridPos": {
"h": 8,
"w": 24,
(Run: https://github.com/cardstack/boxel/actions/runs/26161560752)
Grafana preview
Preview deployed for 1 dashboard in the staging Grafana.
Dashboards:
Preview is torn down automatically when this PR is closed or merged. (Run: https://github.com/cardstack/boxel/actions/runs/26161560825)
Preview deployments
Host Test Results: 1 files, 1 suites, 1h 33m 6s ⏱️. Results for commit 880fdff.
Realm Server Test Results: 1 files ±0, 1 suites ±0, 10m 23s ⏱️ -18s. Results for commit 880fdff. ± Comparison against earlier commit 57d3fe8.
d7095f0 to
434ac24
Compare
Node's fetch always reports `TypeError: fetch failed` as `error.message`; the actual transport reason (ECONNRESET, TLS handshake error, undici socket error, ENOTFOUND, GOAWAY, etc.) is stashed on `error.cause` and was being silently dropped by the publish/unpublish error paths. That left the action-demo workflow showing a bare "Error: fetch failed" with no way to distinguish a real network issue from, say, a self-signed cert problem against the published-realm subdomain.

Wrap the three swallowed sites:
- `publish.ts` `.action()` catch: log `err.cause` separately if present.
- `publish.ts` `waitForPublishedRealmReady`: capture cause into the `lastError` string so the readiness-timeout error reports the same thing the polling loop kept hitting.
- `unpublish.ts` `unpublishRealm`: embed cause into the `result.error` string the CLI surfaces.

This is the diagnostic the action-demo on #4897 needs to figure out why publish hangs at the initial POST despite the server-side mount completing successfully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
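To illustrate the pattern this commit describes, here is a minimal sketch of folding `error.cause` into the message that gets surfaced. `describeErrorWithCause` is a hypothetical name chosen for this example; it is not the code in `publish.ts` or `unpublish.ts`, and the output format is an assumption.

```typescript
// Hypothetical helper for illustration only; not the code in publish.ts
// or unpublish.ts. It keeps the top-level message and appends the cause.
function describeErrorWithCause(err: unknown): string {
  if (!(err instanceof Error)) {
    return String(err);
  }
  const summary = `${err.name}: ${err.message}`;
  // Read `cause` through a cast so this compiles even when the TypeScript
  // `lib` setting predates the ES2022 typing of Error.cause.
  const cause = (err as { cause?: unknown }).cause;
  if (cause === undefined || cause === null) {
    return summary;
  }
  if (cause instanceof Error) {
    // Node system errors usually carry a string `code` such as ECONNRESET.
    const code = (cause as { code?: unknown }).code;
    const codeSuffix = typeof code === "string" ? ` [${code}]` : "";
    return `${summary} (cause: ${cause.name}: ${cause.message}${codeSuffix})`;
  }
  return `${summary} (cause: ${String(cause)})`;
}
```

A catch block around a `fetch` call could pass the caught value to this helper before logging it, so the log line carries the transport reason rather than only "fetch failed".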
The worker's `fatalExit` handler already exists (uncaughtException / unhandledRejection backstop with a finalize-reservation race), but it reports the error via `log.error(...)` immediately before `process.exit(1)`. `worker-manager.ts` spawns the child with `stdio: ['pipe', 'pipe', 'pipe', 'ipc']`, so the child's stderr is a libuv-async pipe; the final stream chunk gets discarded when the process disappears, and the captured server log shows the child as having silently exited `code=1, signal=null` with no clue why.

worker.ts already uses `writeSync(2, ...)` for exactly this reason on the STARTUP / SIGINT / SIGTERM / disconnect stamps (see the comment above the STARTUP block at the top of the file). Apply the same pattern to the three fatal-exit paths: the uncaughtException / unhandledRejection handler, its inner finalize-failed fallback, and the outer startup-error `.catch`. Route each through a new helper that serializes the error with its full stack and walks `error.cause` (where Node fetch / undici / TLS errors stash the real reason).

Discovered while debugging the action-demo on #4897 (CS-11180): every `_publish-realm` of a fresh source realm enqueues a copy-index job that throws *something* inside the worker; the worker exited silently; pg-queue retried, hit the 2-reservation cap, abandoned the job; the realm-server returned HTTP 500 `Job abandoned after 2 failed attempts (max=2)` to the publish endpoint caller. Without this fix the underlying job-processing error is unobservable.

The bundled `serialize-fatal-reason` helper is in its own module because the FD-level write behavior can't be unit-tested in-process (it requires a real child_process.spawn + libuv-piped stderr to reproduce the bug being fixed), but the serialization can. Tests cover: stack preservation, cause-chain walking, non-Error values, self-referential cause cycles (depth-capped), and Node fetch's typical `TypeError: fetch failed` + ECONNRESET-on-cause shape.

Closes CS-11200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
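The approach in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `serialize-fatal-reason` module: the function names, the `Caused by:` separator, the `[fatal]` prefix, and the depth cap of 5 are all invented for the example.

```typescript
import { writeSync } from "node:fs";

// Assumed cap for illustration; the real helper's limit is not shown in the PR.
const DEFAULT_MAX_CAUSE_DEPTH = 5;

function stringifyNonError(value: unknown): string {
  if (typeof value === "string") return value;
  try {
    // JSON.stringify yields undefined for values like `undefined` or functions.
    return JSON.stringify(value) ?? String(value);
  } catch {
    // Circular structures and BigInt values make JSON.stringify throw.
    return String(value);
  }
}

// Serializes an arbitrary thrown value, following `cause` links.
// The depth cap keeps a self-referential cause chain from looping forever.
function serializeFatalReason(
  reason: unknown,
  maxDepth: number = DEFAULT_MAX_CAUSE_DEPTH,
): string {
  const parts: string[] = [];
  let current: unknown = reason;
  for (let depth = 0; depth <= maxDepth; depth++) {
    if (!(current instanceof Error)) {
      parts.push(stringifyNonError(current));
      break;
    }
    parts.push(current.stack ?? `${current.name}: ${current.message}`);
    const cause = (current as { cause?: unknown }).cause;
    if (cause === undefined) break;
    current = cause;
  }
  return parts.join("\nCaused by: ");
}

// Writes synchronously to file descriptor 2 (stderr) so the text is handed
// to the pipe before the process exits, then exits with code 1.
function reportFatalAndExit(label: string, reason: unknown): never {
  writeSync(2, `[fatal] ${label}: ${serializeFatalReason(reason)}\n`);
  process.exit(1);
}

// Possible wiring (commented out so importing this sketch has no side effects):
// process.on("uncaughtException", (err) => reportFatalAndExit("uncaughtException", err));
```

Keeping the serialization pure and separate from the write mirrors the testing split the commit describes: the string-building part can be asserted on in-process, while the synchronous write needs a real child process to observe.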
f8a1399 to
608717a
Compare
Extract the publish-preview-realm / unpublish-preview-realm / workspace-sync composite actions so `boxel-catalog`, `boxel-home`, `boxel-skills` (and any future consumer) can stop maintaining duplicated bespoke preview-realm workflows. This branch is layered on top of cs-11161 (#4851) so the bundled demo workflow can exercise `boxel realm publish` / `unpublish` / `push` end-to-end against the CLI commits in this branch's ancestry. Once #4851 lands, GitHub will auto-rebase this PR's base onto main and the diff will stay clean against main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Used while iterating on the three composite actions; not part of the shipped product. External consumers (boxel-catalog, boxel-home, boxel-skills) exercise the actions in their own preview workflows.
608717a to
cb1d9db
Compare
Adds preview-realm-actions-integration.yml, which runs the three composite actions (publish, workspace-sync, unpublish) end-to-end against the same local matrix + realm-server stack `boxel-cli-test` boots, so contract drift between the actions, the boxel-cli commands they wrap, and the realm-server handlers they POST to is caught the moment any side changes.

Path-gated triggers (on `pull_request` and `push` to main, plus `workflow_dispatch` for manual) so only PRs touching the integration surface pay the runtime cost. The set covers each action.yml, this workflow, the publish/unpublish/push CLI commands, the handle-publish-realm / handle-unpublish-realm server handlers, and the copy-index task that the publish handler enqueues.

Uses path-relative `uses: ./.github/actions/...` so the actions run at the PR's own commit. External consumers (boxel-catalog, -home, -skills) pin a SHA instead.

Also re-applies the in-repo `mise` short-circuit in each action: when `github.action_repository == github.repository` (i.e., invoked from inside cardstack/boxel itself), set BOXEL_SRC to $GITHUB_WORKSPACE and skip the clone + mise/pnpm install steps because the calling workflow's ./.github/actions/init already did them. Without this the inner `jdx/mise-action` re-hashes a separate cache key whose lookup sits ~30 minutes before transfer. External consumers continue to go through the full clone + install path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The in-repo short-circuit compared `github.action_repository` against `github.repository`, but `github.action_repository` is only populated for *external* `uses: org/repo/...@ref` references. For path-relative `uses: ./.github/...` (which is exactly how preview-realm-actions-integration.yml invokes these actions), the value is empty, so the predicate `"" = "cardstack/boxel"` was false and the action fell into the external-consumer branch and tried to `git clone https://github.com/.git/`, failing with `remote: Not Found`. Treat empty BOXEL_REPO as in-repo too. External consumers still hit the populated-and-different branch and run the full clone + install.
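The corrected predicate can be restated as a small function. This is only a TypeScript restatement for clarity: the action itself evaluates the check in its own step, and `isInRepoInvocation` and its parameter names are hypothetical.

```typescript
// Hypothetical restatement of the fixed check. `actionRepository` stands in
// for `github.action_repository` (BOXEL_REPO) and `repository` for
// `github.repository`; both names are assumptions made for this sketch.
function isInRepoInvocation(
  actionRepository: string | undefined,
  repository: string,
): boolean {
  const actionRepo = (actionRepository ?? "").trim();
  // Empty means a path-relative `uses: ./.github/...` reference, which the
  // commit treats as in-repo. A populated but different value means an
  // external consumer, which takes the full clone + install path.
  return actionRepo === "" || actionRepo === repository;
}
```

With this shape, `isInRepoInvocation("", "cardstack/boxel")` is true, while a consumer repository calling the action by `org/repo/...@ref` gets false.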
Pull request overview
This PR moves the PR preview realm publish/sync/unpublish automation into reusable composite GitHub Actions inside the boxel monorepo, and adds an end-to-end integration workflow that exercises those actions against the local test stack to detect contract drift (e.g., the realm publish endpoint now returning HTTP 202 as expected).
Changes:
- Adds three composite actions: `publish-preview-realm`, `workspace-sync`, and `unpublish-preview-realm`, implemented on top of in-tree `@cardstack/boxel-cli`.
- Adds a path-gated integration workflow that boots the local stack and runs publish → sync → unpublish to validate the full surface area.
- Updates the publish action behavior to accept/poll the newer “202 + pending” publish contract via `boxel realm publish --timeout`.
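A "202 + pending" contract generally means the caller has to poll for readiness under a timeout after the initial request is accepted. The sketch below shows one generic way to do that while keeping the last failure for the timeout message. `pollUntilReady`, `PollOptions`, and the message wording are hypothetical; the CLI's real `waitForPublishedRealmReady` is not shown in this PR page and may differ.

```typescript
// Hypothetical options type for this sketch.
interface PollOptions {
  timeoutMs: number;
  intervalMs: number;
}

// Calls `probe` until it resolves to true or the timeout elapses.
// The most recent failure is kept so the timeout error can report
// what the polling loop kept hitting.
async function pollUntilReady(
  probe: () => Promise<boolean>,
  { timeoutMs, intervalMs }: PollOptions,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  let lastError = "probe reported not ready";
  for (;;) {
    try {
      if (await probe()) {
        return;
      }
      lastError = "probe reported not ready";
    } catch (err) {
      lastError = err instanceof Error ? `${err.name}: ${err.message}` : String(err);
    }
    if (Date.now() >= deadline) {
      throw new Error(
        `Timed out after ${timeoutMs}ms waiting for readiness; last error: ${lastError}`,
      );
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

The probe is injected rather than hard-coded so the loop can be exercised without a server; a real caller would supply a function that checks whatever readiness signal the publish endpoint exposes.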
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| `.github/workflows/preview-realm-actions-integration.yml` | Adds an E2E workflow that validates the three composite actions against the local realm-server + Matrix stack. |
| `.github/actions/publish-preview-realm/action.yml` | Introduces a composite action to create/push/publish a preview realm and wait for readiness via boxel-cli. |
| `.github/actions/workspace-sync/action.yml` | Introduces a composite action to push a local directory into an existing Boxel workspace via `boxel realm push`. |
| `.github/actions/unpublish-preview-realm/action.yml` | Introduces a composite action to unpublish a previously published preview realm (tolerating missing). |
    realm-server-url: https://localhost:4201/

    - name: Print server logs
      if: ${{ !cancelled() }}
that’s not true! failure in an earlier step won’t cause this to not run, because it’s looking explicitly for cancellation
in the end these are mostly wrapping Boxel CLI, I don’t think they have much value on their own, so I’m closing this, but I’ll still update the other repositories to use minimal CLI patterns
Consumer repos (boxel-home, boxel-skills, boxel-catalog) duplicate ~250 lines of nearly-identical sync logic across their own workflow files. Composite actions could share that logic but force each consumer to pin a SHA - so a change to the realm-server's _publish-realm contract doesn't surface in consumer CI until they manually bump.

Reusable workflows give the same de-duplication AND let consumers track @main, so any contract change at the CLI<->server boundary auto-propagates on their next run. They also remove the clone-the-monorepo bootstrap the composite actions needed: each workflow `npm install`s @cardstack/boxel-cli@latest, which the caller can pin via the `boxel-cli-version` input.

- .github/workflows/sync-workspace.yml - reusable workflow for the push-to-staging-on-main / dry-run-on-PR / push-to-production-on-release pattern. Supports both sticky-PR-comment and artifact reporting (the latter for boxel-catalog).
- .github/workflows/preview-realm.yml - reusable workflow for the create+push+publish lifecycle on PR open/sync and the unpublish cleanup on PR close (currently only boxel-home; generic enough for any per-PR preview consumer).
- .github/workflows/preview-realm-actions-integration.yml - rewritten to exercise the underlying `boxel realm create/push/publish/unpublish` commands directly against a local matrix + realm-server stack. The reusable workflows are thin shells around these CLI invocations, so contract drift between the CLI and the server's handlers surfaces here at PR time. Path-gated to fire on changes to the reusable workflows, the relevant CLI commands, or the server-side handlers.

Drops the three composite actions (.github/actions/{publish,unpublish}-preview-realm, workspace-sync) along with their ~500 lines of monorepo-bootstrap scaffolding.
`with:` blocks disallow the `secrets` context, so consumers whose matrix username lives in `secrets.*` (boxel-skills) couldn't pass it to the reusable workflow as an input. Declaring it as a secret lets the caller source the value from either `vars.*` or `secrets.*` — both contexts are allowed inside `secrets:` blocks.
Addresses Copilot review feedback on PR #4897. boxel-cli reads BOXEL_PASSWORD from the environment when --password is not supplied (packages/boxel-cli/src/commands/profile.ts:124), and explicitly warns in build-program.ts that env-var auth is preferred over the flag because flags leak via /proc/*/cmdline and ps output. Drop --password from the five `boxel profile add` invocations and rename the env var from MATRIX_PASSWORD to BOXEL_PASSWORD so the CLI picks it up implicitly.
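The same preference for environment variables over flags applies when a script launches the CLI programmatically. The sketch below builds an invocation that carries the password only in `BOXEL_PASSWORD`. `buildProfileAddInvocation` and `CliInvocation` are hypothetical names, and the remaining `boxel profile add` arguments are left to the caller because this PR page does not list them.

```typescript
// Hypothetical shape for this sketch: everything needed to spawn the CLI.
interface CliInvocation {
  command: string;
  args: string[];
  env: Record<string, string | undefined>;
}

// Builds a `boxel profile add` invocation that passes the password through
// the BOXEL_PASSWORD environment variable, so it never appears in the
// argument list (and therefore not in `ps` output or /proc/*/cmdline).
function buildProfileAddInvocation(
  profileArgs: string[],
  password: string,
  baseEnv: Record<string, string | undefined> = {},
): CliInvocation {
  const usesPasswordFlag = profileArgs.some(
    (arg) => arg === "--password" || arg.startsWith("--password="),
  );
  if (usesPasswordFlag) {
    throw new Error("Pass the password via BOXEL_PASSWORD, not --password");
  }
  return {
    command: "boxel",
    args: ["profile", "add", ...profileArgs],
    env: { ...baseEnv, BOXEL_PASSWORD: password },
  };
}

// Possible usage with Node's child_process (commented out; requires the CLI):
// import { spawnSync } from "node:child_process";
// const inv = buildProfileAddInvocation(otherArgs, secret, process.env);
// spawnSync(inv.command, inv.args, { env: inv.env, stdio: "inherit" });
```

Rejecting `--password` in the argument list is a design choice of this sketch, intended to make an accidental regression to the flag fail loudly.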
I noticed that `boxel-home` PR previews are broken:

This is because the interface to `_publish-realm` changed:

HTTP 202 is actually expected now!

I also noticed that `boxel-catalog`, `boxel-home`, and `boxel-skills` were all using duplicative bespoke workflows to accomplish similar tasks, with use of `cardstack/boxel-cli`, npm Boxel CLI, and the old workspace sync CLI.

This extracts the preview/sync workflows into the monorepo so they can be used from external repositories and tested in-monorepo in case of interface changes like the above. You can see them tested and passing in this job.