Allow sharding data files by row when data file count is too small #3807
Merged
igorts-git (Collaborator) reviewed on Jun 26, 2026:
Please add a unit test for compute_file_sharding. Otherwise LGTM.
SurbhiJainUSC approved these changes on Jun 26, 2026
Author (Collaborator):
Good suggestion, unit tests added.
igorts-git approved these changes on Jun 26, 2026
Description
Currently, when the data file count is smaller than the data loading host count, the following paths fail to handle it:
Training stopped: load_next_batch() failed with <class 'IndexError'> exception: (list index out of range). (example log)
After this PR, the new behavior is:
Test
Unit test
Added ComputeFileShardingNormalCaseTest
End to end test
Grain
Tests (dataset contains 1 file, on 2x v5e-32, 16 data loading hosts). Tested 2 runs to make sure checkpointing works.
TFDS
(checkpoint not supported, 1 run)
Dataset contains 8 shards, test on 4x v5e-32, 32 data loading hosts.
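To make the two test configurations above concrete, here is a minimal sketch of how file-level sharding could fall back to row-level sharding when there are fewer files than hosts. It is not the code from this PR: `sketch_file_sharding`, `ShardPlan`, `rows_for_host`, the strided file assignment, and the requirement that the host count be divisible by the file count are all assumptions made for illustration, and the real `compute_file_sharding` may have a different signature and policy.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class ShardPlan:
    """Hypothetical description of what one data loading host reads."""
    file_indices: Tuple[int, ...]  # files assigned to this host
    row_shard_index: int           # which row slice of those files this host takes
    row_shard_count: int           # how many hosts split each file by row


def sketch_file_sharding(num_files: int, num_hosts: int, host_index: int) -> ShardPlan:
    """Illustrative only; not the PR's compute_file_sharding."""
    if num_files <= 0 or num_hosts <= 0:
        raise ValueError("num_files and num_hosts must be positive")
    if not 0 <= host_index < num_hosts:
        raise ValueError("host_index out of range")

    if num_files >= num_hosts:
        # Enough files: each host takes whole files, strided by host count.
        files = tuple(range(host_index, num_files, num_hosts))
        return ShardPlan(files, row_shard_index=0, row_shard_count=1)

    # Too few files: several hosts share one file and split it by row.
    # Assumption of this sketch: hosts divide evenly across files.
    if num_hosts % num_files != 0:
        raise ValueError("sketch assumes num_hosts is divisible by num_files")
    hosts_per_file = num_hosts // num_files
    return ShardPlan(
        file_indices=(host_index // hosts_per_file,),
        row_shard_index=host_index % hosts_per_file,
        row_shard_count=hosts_per_file,
    )


def rows_for_host(rows: Sequence[int], plan: ShardPlan) -> List[int]:
    """Select this host's rows from one file by striding over row indices."""
    return list(rows[plan.row_shard_index::plan.row_shard_count])


# Grain test shape: 1 file, 16 hosts. Every host reads file 0, 1/16 of the rows.
print(sketch_file_sharding(num_files=1, num_hosts=16, host_index=5))
# TFDS test shape: 8 shards, 32 hosts. Four hosts share each shard.
print(sketch_file_sharding(num_files=8, num_hosts=32, host_index=13))
```

Under these assumptions no host indexes past the end of the file list, which is the kind of access that would otherwise raise an `IndexError` when the file count is below the host count.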
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.