ci: fix flaky integration tests by distributing images via GHCR #3582

Open
amir-deris wants to merge 6 commits into main from amir/plt-476-CI-integration-test-image-fix

amir-deris commented Jun 12, 2026


Problem

The Docker Integration Test workflow packaged the localnode/rpcnode Docker images into a ~1 GB artifact (integration-docker-images.tar.zst) that ~40 matrix jobs each downloaded concurrently via actions/download-artifact@v4. The action streams and extracts the zip without an end-to-end integrity check, so a prematurely closed connection can leave a truncated file without failing the step. The truncation first surfaced at zstd -d | docker load, which failed with Read error (39): premature end / unexpected EOF and required a manual rerun. With ~40 concurrent 1 GB downloads per run, this flaked regularly.

Fix

Distribute the images via GHCR instead of an artifact. Registry pulls are content-addressed — every layer is sha256-verified and retried automatically by the docker client — so truncation cannot slip through silently.
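
As a rough illustration of why content addressing catches truncation, here is a minimal Python sketch. It is not the docker client's actual code; the `verify_blob` helper and the sample layer bytes are hypothetical. It only shows the general idea that a blob is accepted when its sha256 matches the digest it was requested by.

```python
import hashlib


def verify_blob(data: bytes, expected_digest: str) -> bool:
    """Return True only if data hashes to the expected 'sha256:<hex>' digest."""
    algo, _, expected_hex = expected_digest.partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algo}")
    return hashlib.sha256(data).hexdigest() == expected_hex


# Hypothetical layer contents; a real layer is a compressed tarball.
layer = b"layer-bytes" * 1000
digest = "sha256:" + hashlib.sha256(layer).hexdigest()

complete_ok = verify_blob(layer, digest)         # full download
truncated_ok = verify_blob(layer[:-1], digest)   # download cut one byte short
```

In this sketch the complete blob verifies and the truncated one does not, so a short read becomes a hard failure that can be retried instead of a silently corrupt file.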

  • prepare-cluster pushes both images to ghcr.io/sei-protocol/sei-chain-integration-test-{localnode,rpcnode}:<run_id> using GITHUB_TOKEN (no OIDC or external secrets required). The CI artifact now carries only the small seid tarball.
  • Test jobs log in to GHCR, docker pull the run-tagged images, and retag them to sei-chain/{localnode,rpcnode} — everything downstream (docker-cluster-start-ci etc.) is unchanged.
  • Both builds stamp a sei-chain.ci-run-id label so every run pushes a unique image digest. Labels are config-only: the layer cache is unaffected and a cache-hit run uploads just a new config blob + manifest. This avoids the pitfall of re-tagging a stable digest where in-flight runs could be affected by tag moves.
  • Reruns of failed test jobs keep working: tags are keyed by run_id and persist in GHCR across attempts.
  • Adds ghcr-integration-test-cleanup.yml: a weekly scheduled workflow (Sundays 06:00 UTC) that prunes run-id tags older than 14 days from both GHCR repos, while preserving the :cache tag. Supports workflow_dispatch with a dry-run option.
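
The run-id label point in the list above can be sketched with a simplified model. This is an assumption-laden illustration, not the real OCI image format: real configs and manifests carry many more fields, and the `digest` and `build_manifest` helpers and the run IDs 1001/1002 are hypothetical. It shows that changing only a label changes the config and manifest digests while the layer digests stay shared.

```python
import hashlib
import json


def digest(obj) -> str:
    """sha256 of a canonical JSON encoding (stand-in for a registry blob digest)."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def build_manifest(layer_digests, run_id):
    """Simplified image: a config blob holding the label, referenced by a manifest."""
    config = {"Labels": {"sei-chain.ci-run-id": str(run_id)}}
    return {"config": digest(config), "layers": list(layer_digests)}


# Two hypothetical cached layers shared by both runs.
layers = [
    "sha256:" + hashlib.sha256(b"base").hexdigest(),
    "sha256:" + hashlib.sha256(b"seid").hexdigest(),
]

run_a = build_manifest(layers, 1001)
run_b = build_manifest(layers, 1002)

same_layers = run_a["layers"] == run_b["layers"]
distinct_images = digest(run_a) != digest(run_b)
```

Under this model both runs reuse the same layers, yet each run's manifest digest differs, so deleting one run's tag never touches a digest another run depends on.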

Advantage over ECR

It avoids roughly $3,000 per month in egress charges from AWS to GitHub runners. In addition, GITHUB_TOKEN is automatically available to all workflows, including fork PRs, which removes the need for OIDC role assumption or AWS credentials for image distribution. No IAM setup is required.

cursor Bot commented Jun 12, 2026


PR Summary

Medium Risk
Changes how CI distributes test images and requires GHCR package permissions; fork PR integration runs may fail until maintainers use upstream branches, but runtime test logic is unchanged.

Overview
Fixes flaky integration CI by no longer shipping the ~1 GB Docker image artifact that many matrix jobs downloaded in parallel (truncated downloads could pass the artifact download step and only fail at zstd/docker load).

integration-test.yml now builds localnode/rpcnode once, pushes them to ghcr.io/sei-protocol/sei-chain-integration-test-{localnode,rpcnode}:<run_id>, and uploads only the seid tarball as the artifact. Matrix jobs log in to GHCR, docker pull by run id, and retag to sei-chain/localnode / sei-chain/rpcnode so cluster startup is unchanged. AWS OIDC/ECR login is removed, and the registry build cache now uses GHCR :cache tags (cache export is still limited to push events). Each build gets a sei-chain.ci-run-id label so each run's image is unique and old tags can be pruned without moving shared digests. The prepare job is granted packages: write and the test jobs packages: read; workflow comments note that fork PRs cannot push org packages.

Adds ghcr-integration-test-cleanup.yml: weekly (and manual) pruning of numeric run-id package versions older than 14 days, preserving :cache and other non-run-id tags, with optional dry-run.

Reviewed by Cursor Bugbot for commit bc4f860.

amir-deris changed the title from "modified integration-test yaml to push pull from ecr" to "ci: distribute integration test images via ECR instead of 1GB artifact" on Jun 12, 2026
github-actions Bot commented Jun 12, 2026


The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Jun 15, 2026, 5:28 PM |

@amir-deris amir-deris requested review from bdchatham and masih June 12, 2026 19:11
codecov Bot commented Jun 12, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.35%. Comparing base (0a2c388) to head (bc4f860).
⚠️ Report is 9 commits behind head on main.

@@            Coverage Diff             @@
##             main    #3582      +/-   ##
==========================================
- Coverage   59.22%   58.35%   -0.87%     
==========================================
  Files        2214     2140      -74     
  Lines      183389   174842    -8547     
==========================================
- Hits       108604   102031    -6573     
+ Misses      64994    63720    -1274     
+ Partials     9791     9091     -700     
| Flag | Coverage Δ |
| --- | --- |
| sei-db | 70.41% <ø> (ø) |
| sei-db-state-db | ? |

Flags with carried forward coverage won't be shown.
see 74 files with indirect coverage changes


amir-deris changed the title from "ci: distribute integration test images via ECR instead of 1GB artifact" to "ci: distribute integration test images via GHCR instead of 1GB artifact" on Jun 12, 2026
amir-deris changed the title from "ci: distribute integration test images via GHCR instead of 1GB artifact" to "ci: fix flaky integration tests by distributing images via GHCR" on Jun 12, 2026
      packages: write
    steps:
      - name: Delete stale run-id tags
        uses: dataaxiom/ghcr-cleanup-action@d52806a0dc70b430571a37da1fde39733ffd640f # v1.2.2

Collaborator

I don't trust this action, could we use the gh official one please: https://github.com/actions/delete-package-versions

Contributor Author

@masih Thanks for the feedback. I have removed the third-party action and now use a custom script instead. The official GitHub action wouldn't work well here because it is count-based, not time-based. Its two main modes are:

  - num-old-versions-to-delete: N, which deletes the N oldest versions
  - min-versions-to-keep: N, which deletes everything except the N newest versions

ignore-versions takes a regex of version names/tags to exclude from deletion, so :cache could be protected with ignore-versions: ^cache$.

It could be used here with a count-based policy (e.g. "keep the last 20 run images"), but that is a weaker fit for this use case:

  - CI frequency varies week to week, so a fixed count doesn't map cleanly to a time window
  - It would require tuning a magic number rather than stating "14 days"

The official action is better suited to policies like "keep the last 5 releases" on a package with predictable, low-frequency publishing. For a high-frequency CI artifact store where time-based retention is the natural policy, it falls short.
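
To make the difference concrete, here is a small Python sketch contrasting the two policies. It is not the custom script from this PR: the `stale_run_tags` and `beyond_newest` helpers, the flat `tag`/`created` fields, and the sample versions are all hypothetical simplifications of what a package-versions listing would contain.

```python
from datetime import datetime, timedelta, timezone


def stale_run_tags(versions, now, max_age_days=14):
    """Time-based: numeric run-id tags older than the cutoff; 'cache' is never numeric."""
    cutoff = now - timedelta(days=max_age_days)
    return [v["tag"] for v in versions
            if v["tag"].isdigit() and v["created"] < cutoff]


def beyond_newest(versions, keep):
    """Count-based: every numeric tag except the `keep` newest, regardless of age."""
    runs = sorted((v for v in versions if v["tag"].isdigit()),
                  key=lambda v: v["created"], reverse=True)
    return [v["tag"] for v in runs[keep:]]


# Hypothetical package versions during a busy week.
now = datetime(2026, 6, 15, tzinfo=timezone.utc)
versions = [
    {"tag": "cache", "created": now - timedelta(days=90)},
    {"tag": "101", "created": now - timedelta(days=30)},
    {"tag": "102", "created": now - timedelta(days=3)},
    {"tag": "103", "created": now - timedelta(days=2)},
    {"tag": "104", "created": now - timedelta(days=1)},
]

by_age = stale_run_tags(versions, now)
by_count = beyond_newest(versions, keep=2)
```

With this sample data the age-based policy selects only tag 101, while the count-based policy with keep=2 also selects 102, which is just three days old and could still be needed by a rerun.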
