feat(exporter): skip GCS upload when object CRC32C is unchanged#5338
feat(exporter): skip GCS upload when object CRC32C is unchanged#5338SanskaarUndale21 wants to merge 4 commits into
Conversation
|
/gcbrun |
|
Hi, I've pushed a follow-up commit fixing the lint failures from the previous CI run. Could someone please run /gcbrun and re-review when you get a chance? Thanks! |
|
/gcbrun |
michaelkedar
left a comment
There was a problem hiding this comment.
This LGTM - just need to fix the lints again
| - path: _test\.go | ||
| linters: | ||
| - dupl | ||
| - unparam |
There was a problem hiding this comment.
I'd prefer we just fix the linter complaint than add more exceptions - could we just have runWriter always use "out" instead of taking in pathPrefix? (or even just inline //nolint:unparam)
|
Hi, addressed the feedback -- removed the i synced my fork to latest changes and rebased it aswell Thanks! |
Before this change, the exporter uploaded every output file on every run regardless of whether the content had changed. Since all.zip and other outputs are now reproducible (google#3491), unchanged files would accumulate redundant object generations in the bucket, making it harder for downstream consumers to detect real updates. The writer now calls ReadObjectAttrs before each GCS write and computes the CRC32C of the outgoing data using the Castagnoli polynomial (the same algorithm GCS uses for its stored checksums). If the checksums match, the upload is skipped and an info log is emitted. New objects (ErrNotFound) and any transient attr-read errors fall through to the normal upload path so the exporter remains correct under all conditions. Tests verify the three cases: same content is skipped, changed content is uploaded, and brand-new objects are always created. Fixes google#3513
Addresses reviewer feedback to fix the linter complaint directly rather than adding a golangci.yaml exception.
7581747 to
81ed69c
Compare
|
/gcbrun |
michaelkedar
left a comment
There was a problem hiding this comment.
Thanks for this.
I'll merge this tomorrow after we do our weekly deployment just to give it some time to test on the testing environment
Summary
Fixes #3513.
Before this change, the exporter uploaded every output file on every run regardless of whether the content had changed. Since all.zip and other generated files became reproducible in #3491, unchanged runs would still create new object generations in the bucket -- making it harder for downstream consumers to detect real updates and generating unnecessary storage churn.
Approach: before each GCS write the writer now fetches the existing object's attributes via
ReadObjectAttrsand computes the CRC32C checksum of the outgoing data using the Castagnoli polynomial (the same algorithm GCS itself uses). If the checksums match, the upload is skipped and an info log is emitted. Three edge cases fall through to the normal upload path:ErrNotFound) -- upload always proceedsThe local-write path (used for dev/test via
-upload-to-gcs=false) is unchanged, since writes there carry no network cost.Changes
go/cmd/exporter/writer.go: addgcsContentUnchangedhelper and call it beforeWriteObjectgo/cmd/exporter/writer_test.go: new test file covering skip-on-unchanged, upload-on-changed, upload-of-new-object, and multi-file skipTest plan
TestWriter_GCS_SkipsUnchangedContent-- same data, generation must not incrementTestWriter_GCS_UploadsChangedContent-- different data, generation must incrementTestWriter_GCS_UploadsNewObject-- no pre-existing object, must be created at generation 1TestWriter_GCS_SkipsMultipleUnchanged-- three unchanged files, all generations stable