Allow sharding data files by row when data file count is too small#3807

Merged
copybara-service[bot] merged 1 commit into main from aireen/shard_by_row2
Jun 27, 2026
@aireenmei aireenmei commented May 4, 2026

Collaborator

Description

Currently, when the data file count is smaller than the data loading host count, the following input pipelines handle it poorly:

  • For tfds, every host reads every file, which hurts performance. At large scale, when a file has hundreds of readers, the workload hangs or crashes. (b/505191017)
  • For grain with parquet or tfrecord formats, each data loading host requires at least 1 data file. If this requirement is not met, the run fails with the error: Training stopped: load_next_batch() failed with <class 'IndexError'> exception: (list index out of range). (example log)

After this PR, the new behavior is:

  • Allow a data file to be split between hosts (sharding the data file by row). When data_file_count < data_loading_host_count, each host gets data_file_count / data_loading_host_count of a file, that is, a fraction of one file's rows instead of a whole file.
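
The split described above can be illustrated with a minimal sketch. This is not the code from this PR: the function name sketch_file_sharding, the ShardAssignment type, the round-robin file assignment, and the requirement that the host count be a multiple of the file count are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardAssignment:
    """Hypothetical result type: which files a host reads and which row slice."""
    file_indices: tuple      # indices of the files this host reads
    row_shard_index: int     # which row slice of those files this host takes
    row_shard_count: int     # number of row slices each file is split into


def sketch_file_sharding(file_count, host_count, host_index):
    """Illustrative only; assumes host_count % file_count == 0 when files are scarce."""
    if file_count <= 0 or host_count <= 0:
        raise ValueError("file_count and host_count must be positive")
    if not 0 <= host_index < host_count:
        raise ValueError("host_index out of range")

    if file_count >= host_count:
        # Enough files: each host reads whole files, assigned round-robin.
        files = tuple(range(host_index, file_count, host_count))
        return ShardAssignment(files, row_shard_index=0, row_shard_count=1)

    if host_count % file_count != 0:
        # Simplifying assumption of this sketch, not a claim about the PR.
        raise ValueError("sketch requires host_count to be a multiple of file_count")

    # Too few files: several hosts share one file and split it by row.
    hosts_per_file = host_count // file_count
    return ShardAssignment(
        file_indices=(host_index // hosts_per_file,),
        row_shard_index=host_index % hosts_per_file,
        row_shard_count=hosts_per_file,
    )
```

Under these assumptions, the TFDS test setup below (8 shards, 32 hosts) would have 4 hosts sharing each shard, and the Grain setup (1 file, 16 hosts) would have all 16 hosts each reading one sixteenth of the rows of the single file.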

Test

Unit test

Added ComputeFileShardingNormalCaseTest

End to end test

Grain

Tests (dataset contains 1 file, on 2x v5e-32, 16 data loading hosts). Tested with 2 runs to make sure checkpointing works.

TFDS

(checkpoint not supported, 1 run)
Dataset contains 8 shards, tested on 4x v5e-32, 32 data loading hosts.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@aireenmei aireenmei force-pushed the aireen/shard_by_row2 branch from dec58b5 to c895bb7 Compare June 26, 2026 05:16
codecov Bot commented Jun 26, 2026

Codecov Report

❌ Patch coverage is 50.81967% with 30 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/maxtext/input_pipeline/input_pipeline_utils.py | 48.48% | 17 Missing ⚠️ |
| ...rc/maxtext/input_pipeline/grain_data_processing.py | 50.00% | 3 Missing and 4 partials ⚠️ |
| src/maxtext/input_pipeline/tfds_data_processing.py | 57.14% | 4 Missing and 2 partials ⚠️ |

@aireenmei aireenmei force-pushed the aireen/shard_by_row2 branch 5 times, most recently from e38eeb9 to 14ceab6 Compare June 26, 2026 06:36
@aireenmei aireenmei marked this pull request as ready for review June 26, 2026 19:43

@igorts-git igorts-git left a comment

Collaborator

Please add a unit test for compute_file_sharding. Otherwise LGTM.

Comment thread src/maxtext/input_pipeline/grain_data_processing.py Outdated
@aireenmei aireenmei force-pushed the aireen/shard_by_row2 branch from b0b03ba to 9a6910f Compare June 26, 2026 22:56
@aireenmei aireenmei force-pushed the aireen/shard_by_row2 branch from 9a6910f to 7a898f8 Compare June 26, 2026 22:59
@aireenmei

Collaborator Author

> Please add a unit test for compute_file_sharding. Otherwise LGTM.

Good suggestion, unit tests added

@copybara-service copybara-service Bot merged commit c666fb8 into main Jun 27, 2026
41 checks passed
@copybara-service copybara-service Bot deleted the aireen/shard_by_row2 branch June 27, 2026 00:02
3 participants