ci: fix flaky integration tests by distributing images via GHCR #3582
amir-deris wants to merge 6 commits into
PR Summary
Medium Risk
Reviewed by Cursor Bugbot for commit bc4f860.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #3582 +/- ##
==========================================
- Coverage 59.22% 58.35% -0.87%
==========================================
Files 2214 2140 -74
Lines 183389 174842 -8547
==========================================
- Hits 108604 102031 -6573
+ Misses 64994 63720 -1274
+ Partials 9791 9091 -700
Flags with carried forward coverage won't be shown.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
      packages: write
    steps:
      - name: Delete stale run-id tags
        uses: dataaxiom/ghcr-cleanup-action@d52806a0dc70b430571a37da1fde39733ffd640f # v1.2.2
I don't trust this action, could we use the gh official one please: https://github.com/actions/delete-package-versions
@masih Thanks for the feedback. I have removed the third-party action and now use a custom script instead. The official GitHub action isn't a good fit here because:
It's count-based, not time-based. Its two main modes are:
- num-old-versions-to-delete: N — delete the N oldest versions
- min-versions-to-keep: N — delete everything except the N newest versions
ignore-versions takes a regex of version names/tags to skip deletion, so you could protect :cache with ignore-versions: ^cache$.
It could have been used here with a count-based policy — e.g. "keep the last 20 run images" — but that's a weaker fit for this use case:
- CI frequency varies week to week, so a fixed count doesn't map cleanly to a time window
- It would require tuning a magic number rather than "14 days"
The official action is better suited for things like "keep the last 5 releases" on a package with predictable, low-frequency publishing. For a high-frequency CI artifact store where time-based retention is the natural policy, it falls short.
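For illustration only, the time-based policy described above could be sketched as follows. This is not the custom script from the PR; the record shape (`id`, `tags`, `created_at`) and the function name `select_stale_versions` are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical record shape: each package version has an id, a list of tags,
# and a timezone-aware creation time. The real script may differ.
def select_stale_versions(versions, now, max_age_days=14, protected_tags=("cache",)):
    """Return versions older than max_age_days that carry no protected tag."""
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for version in versions:
        if any(tag in protected_tags for tag in version["tags"]):
            continue  # never prune the shared :cache tag
        if version["created_at"] < cutoff:
            stale.append(version)
    return stale

# Made-up sample data: run-id tags 1001 and 1002 are placeholders.
now = datetime(2025, 1, 15, 6, 0, tzinfo=timezone.utc)
versions = [
    {"id": 1, "tags": ["1001"], "created_at": now - timedelta(days=30)},
    {"id": 2, "tags": ["1002"], "created_at": now - timedelta(days=3)},
    {"id": 3, "tags": ["cache"], "created_at": now - timedelta(days=90)},
]
stale_ids = [v["id"] for v in select_stale_versions(versions, now)]
print(stale_ids)  # [1]
```

The cutoff depends only on the age of each version, so the result is the same whether CI ran five times or five hundred times in the window.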
Problem
The Docker Integration Test workflow packaged the localnode/rpcnode Docker images into a ~1 GB artifact (integration-docker-images.tar.zst) that ~40 matrix jobs each downloaded concurrently via actions/download-artifact@v4. The action streams and extracts the zip without an end-to-end integrity check, so a prematurely closed connection can leave a truncated file without failing the step. The first detector was zstd -d | docker load, failing with Read error (39): premature end / unexpected EOF and requiring a manual rerun. With 40 concurrent 1 GB downloads per run, this flaked regularly.
Fix
Distribute the images via GHCR instead of an artifact. Registry pulls are content-addressed — every layer is sha256-verified and retried automatically by the docker client — so truncation cannot slip through silently.
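As a rough illustration of why content addressing catches truncation, the sketch below compares a blob against its expected sha256 digest. It is a simplified model, not the docker client's actual code; `sha256_digest` and `verify_blob` are hypothetical helper names.

```python
import hashlib

def sha256_digest(data: bytes) -> str:
    """Digest string in the 'sha256:<hex>' form used for registry blobs."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

def verify_blob(data: bytes, expected_digest: str) -> bool:
    """True only if the received bytes hash to the digest named in the manifest."""
    return sha256_digest(data) == expected_digest

# Made-up payload standing in for a layer blob.
layer = b"example layer bytes " * 1024
expected = sha256_digest(layer)

# Simulate a connection that closed halfway through the download.
truncated = layer[: len(layer) // 2]

print(verify_blob(layer, expected))      # True
print(verify_blob(truncated, expected))  # False
```

A truncated download fails the comparison immediately, whereas the artifact path only surfaced the problem later, at decompression time.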
- prepare-cluster pushes both images to ghcr.io/sei-protocol/sei-chain-integration-test-{localnode,rpcnode}:<run_id> using GITHUB_TOKEN (no OIDC or external secrets required). The CI artifact now carries only the small seid tarball.
- Matrix jobs docker pull the run-tagged images and retag them to sei-chain/{localnode,rpcnode}; everything downstream (docker-cluster-start-ci etc.) is unchanged.
- Each image carries a sei-chain.ci-run-id label so every run pushes a unique image digest. Labels are config-only: the layer cache is unaffected, and a cache-hit run uploads just a new config blob + manifest. This avoids the pitfall of re-tagging a stable digest, where in-flight runs could be affected by tag moves.
- Images are tagged with run_id and persist in GHCR across attempts.
- ghcr-integration-test-cleanup.yml: a weekly scheduled workflow (Sundays 06:00 UTC) that prunes run-id tags older than 14 days from both GHCR repos while preserving the :cache tag. Supports workflow_dispatch with a dry-run option.
Advantage over ECR
It avoids roughly $3,000 per month in egress charges from AWS to GitHub runners. Also, GITHUB_TOKEN is automatically available to all workflows, including fork PRs, removing the need for OIDC role assumptions or AWS credentials for image distribution. No IAM setup is required.
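Returning to the sei-chain.ci-run-id label described under Fix, the toy model below shows why a label changes the image digest without touching the layers. It is a deliberately simplified sketch: real OCI manifests and configs carry more fields and distinguish compressed layer digests from uncompressed ones. The helper names and the run IDs 1001 and 1002 are made up for the example.

```python
import hashlib
import json

def digest(obj) -> str:
    """sha256 over a canonical JSON encoding (a stand-in for the real blob bytes)."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    return "sha256:" + hashlib.sha256(blob).hexdigest()

def build_manifest(layer_digests, labels):
    """Toy manifest: references a config blob (holding the labels) and the layers."""
    config = {"config": {"Labels": labels}, "rootfs": {"layers": layer_digests}}
    return {
        "config": {"digest": digest(config)},
        "layers": [{"digest": d} for d in layer_digests],
    }

# Three placeholder layers shared by both runs (a full cache hit).
layers = ["sha256:" + hashlib.sha256(b"layer-%d" % i).hexdigest() for i in range(3)]

run_a = build_manifest(layers, {"sei-chain.ci-run-id": "1001"})
run_b = build_manifest(layers, {"sei-chain.ci-run-id": "1002"})

same_layers = run_a["layers"] == run_b["layers"]
distinct_images = digest(run_a) != digest(run_b)
print(same_layers, distinct_images)  # True True
```

In this model only the config blob and the manifest differ between runs, which matches the claim that a cache-hit run uploads just those two small objects.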