[DO NOT REVIEW]Enable LoRA training in the NNX path of MaxText (pre-training and native SFT) by SurbhiJainUSC · Pull Request #4284 · AI-Hypercomputer/maxtext

SurbhiJainUSC · 2026-06-26T21:32:55Z

Description

This PR enables LoRA training in the NNX path of MaxText, extending support to both pre-training and native SFT workflows.

Problem solved and implementation details:

NNX Path Integration: Updates the pre-training loop and native SFT loops to support applying LoRA adapter overlays directly on the model variables before optimizer initialization.
Gradient Accumulation & Optimizers: Updates gradient_accumulation.py and train_utils.py so that LoRA-specific parameters are tracked, scaled, and updated during training.
Checkpointing Compatibility: Supports saving and resuming checkpoints containing both base model weights and LoRA adapter parameters, as well as importing/warm-starting from adapter checkpoints via lora.lora_restore_path.
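
For readers unfamiliar with LoRA, the sketch below illustrates the general technique the bullets above refer to. It is not the code in this PR: it is a dependency-free illustration, and it assumes the common LoRA convention of a frozen weight W plus a low-rank update scaled by alpha / rank, with the B matrix zero-initialized. The helper names (matmul, lora_forward) and the 64x64 layer size are hypothetical; only rank=8 and alpha=16.0 mirror the lora.lora_rank and lora.lora_alpha values used in the tests below.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w, lora_a, lora_b, rank, alpha):
    """Return x @ W + (alpha / rank) * (x @ A) @ B."""
    scale = alpha / rank
    base = matmul(x, w)
    delta = matmul(matmul(x, lora_a), lora_b)
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]


rng = random.Random(0)
d_in, d_out = 64, 64          # hypothetical layer size, for illustration only
rank, alpha = 8, 16.0         # mirrors lora.lora_rank=8 lora.lora_alpha=16.0

# Frozen base weight (would come from the base checkpoint).
w = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
# Trainable adapter factors: A is small random, B starts at zero (assumed convention).
lora_a = [[rng.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]
lora_b = [[0.0] * d_out for _ in range(rank)]

x = [[rng.gauss(0, 1) for _ in range(d_in)]]
base_out = matmul(x, w)
lora_out = lora_forward(x, w, lora_a, lora_b, rank, alpha)

# With B zero-initialized, the adapter leaves the base output unchanged at step 0.
assert lora_out == base_out

full_params = d_in * d_out               # parameters in the frozen weight
adapter_params = rank * (d_in + d_out)   # parameters the optimizer would update
print(f"scale={alpha / rank}, full={full_params}, adapter={adapter_params}")
```

Under these assumptions only the A and B factors would be handed to the optimizer, which is why the adapters have to exist on the model before optimizer state is created.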

Tests

Pre-Training

Scenario 1: Start a new training with steps=5

Initialize the LoRA adapters from scratch, run 5 steps of training, and save a checkpoint containing both the base weights and the trained adapters in $BASE_OUTPUT_DIRECTORY/pre-train-$RUN_NAME.

python3 -m maxtext.trainers.pre_train.train \
    src/maxtext/configs/base.yml \
    base_output_directory=$BASE_OUTPUT_DIRECTORY run_name=pre-train-$RUN_NAME \
    model_name=gemma3-4b dataset_type=synthetic steps=5 enable_checkpointing=True \
    per_device_batch_size=2 max_target_length=128 \
    pure_nnx=True pure_nnx_decoder=True \
    lora.enable_lora=True lora.lora_rank=8 lora.lora_alpha=16.0 \
    weight_dtype=bfloat16

Scenario 2: Resume training with steps=10 without restoring LoRA adapters

MaxText detects the step 5 checkpoint saved in Scenario 1, loads the model weights and optimizer states from it, and continues training from step 5 to step 10.

python3 -m maxtext.trainers.pre_train.train \
    src/maxtext/configs/base.yml \
    base_output_directory=$BASE_OUTPUT_DIRECTORY run_name=pre-train-$RUN_NAME \
    model_name=gemma3-4b dataset_type=synthetic steps=10 enable_checkpointing=True \
    per_device_batch_size=2 max_target_length=128 \
    pure_nnx=True pure_nnx_decoder=True \
    lora.enable_lora=True lora.lora_rank=8 lora.lora_alpha=16.0 \
    weight_dtype=bfloat16

Scenario 3: Warm-starting new training by restoring LoRA adapters from Scenario 2

Warm-start a new training run at step 0 by restoring trained LoRA adapter weights from a previous checkpoint into a freshly initialized base model.

python3 -m maxtext.trainers.pre_train.train \
    src/maxtext/configs/base.yml \
    base_output_directory=$BASE_OUTPUT_DIRECTORY run_name=pre-train-new-$RUN_NAME \
    model_name=gemma3-4b dataset_type=synthetic steps=5 enable_checkpointing=True \
    per_device_batch_size=2 max_target_length=128 \
    pure_nnx=True pure_nnx_decoder=True \
    lora.enable_lora=True lora.lora_rank=8 lora.lora_alpha=16.0 \
    lora.lora_restore_path=$BASE_OUTPUT_DIRECTORY/pre-train-$RUN_NAME/checkpoints/9/items \
    weight_dtype=bfloat16
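
The lora.lora_restore_path values in this description follow a visible pattern: a run trained with steps=10 is restored from checkpoints/9/items, and the SFT run with steps=5 from checkpoints/4/items. The helper below is a hypothetical convenience for building such a path. It assumes the final checkpoint is stored under the zero-based index of the last step, which is inferred from those two examples rather than confirmed elsewhere; the function name and the example directory are illustrative only.

```python
def lora_restore_path(base_output_directory, run_name, steps):
    """Build the adapter restore path for the last checkpoint of a finished run.

    Assumes the last checkpoint directory is named after the zero-based index
    of the final step (steps=10 -> checkpoints/9), as the paths in this PR suggest.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    last_step = steps - 1
    base = base_output_directory.rstrip("/")
    return f"{base}/{run_name}/checkpoints/{last_step}/items"


# Hypothetical output directory and run name, for illustration only.
path = lora_restore_path("/tmp/maxtext-out", "pre-train-demo", steps=10)
print(path)
```

If checkpoint_period or the save policy differs from the defaults used here, the last saved step may not be steps - 1, so it is worth listing the checkpoints directory before relying on the computed path.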

SFT

Scenario 0: Run SFT Training & save NNX Checkpoint

    src/maxtext/configs/post_train/sft-vision-chartqa.yml \
    base_output_directory=$BASE_OUTPUT_DIRECTORY run_name=sft-$RUN_NAME \
    model_name=gemma3-4b tokenizer_path="google/gemma-3-4b-it" \
    steps=5 enable_checkpointing=True \
    per_device_batch_size=2 max_target_length=128 \
    pure_nnx=True pure_nnx_decoder=True \
    weight_dtype=bfloat16 scan_layers=False

Scenario 1: Warm-starting LoRA SFT Training from Base (NNX) Checkpoint

Warm-start a multimodal SFT run at step 0 by loading pre-trained base weights from a checkpoint and training freshly initialized LoRA adapters on a vision dataset.

MAXTEXT_CKPT_PATH=<multimodal gemma3-4b checkpoint>
python3 -m maxtext.trainers.post_train.sft.train_sft_native \
    src/maxtext/configs/post_train/sft-vision-chartqa.yml \
    base_output_directory=$BASE_OUTPUT_DIRECTORY run_name=sft-$RUN_NAME \
    model_name=gemma3-4b tokenizer_path="google/gemma-3-4b-it" \
    load_parameters_path=$MAXTEXT_CKPT_PATH \
    steps=5 enable_checkpointing=True \
    per_device_batch_size=2 max_target_length=128 \
    pure_nnx=True pure_nnx_decoder=True \
    lora.enable_lora=True lora.lora_rank=8 lora.lora_alpha=16.0 \
    weight_dtype=bfloat16 scan_layers=False

Scenario 2: Warm-starting LoRA SFT Training by restoring LoRA adapters from Scenario 1

Warm-start a multimodal SFT run at step 0 by loading base weights from a base checkpoint and restoring trained LoRA adapter weights from a previous SFT checkpoint.

python3 -m maxtext.trainers.post_train.sft.train_sft_native \
    src/maxtext/configs/post_train/sft-vision-chartqa.yml \
    base_output_directory=$BASE_OUTPUT_DIRECTORY run_name=sft-$RUN_NAME \
    model_name=gemma3-4b tokenizer_path="google/gemma-3-4b-it" \
    load_parameters_path=$MAXTEXT_CKPT_PATH \
    steps=5 enable_checkpointing=True \
    per_device_batch_size=2 max_target_length=128 \
    pure_nnx=True pure_nnx_decoder=True \
    lora.enable_lora=True lora.lora_rank=8 lora.lora_alpha=16.0 \
    lora.lora_restore_path=$BASE_OUTPUT_DIRECTORY/sft-$RUN_NAME/checkpoints/4/items \
    weight_dtype=bfloat16 scan_layers=False
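
Conceptually, this scenario combines two sources: base weights from load_parameters_path and adapter weights from lora.lora_restore_path. The sketch below shows one way to picture that overlay using plain nested dicts. It is an illustration, not the restore logic in this PR: the tree layout, the key names (decoder, layer_0, query, kernel, lora_a, lora_b), and the "lora_" prefix used to pick trainable leaves are all assumptions.

```python
def overlay(base, adapters):
    """Return a new nested dict with adapter leaves merged into base (inputs unchanged)."""
    merged = dict(base)
    for key, value in adapters.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = value
    return merged


def trainable_paths(tree, prefix=()):
    """Yield key paths of leaves whose path contains a component starting with 'lora_'."""
    for key, value in tree.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from trainable_paths(value, path)
        elif any(part.startswith("lora_") for part in path):
            yield path


# Hypothetical parameter trees; real checkpoints hold arrays, not short lists.
base_params = {"decoder": {"layer_0": {"query": {"kernel": [0.1, 0.2]}}}}
adapter_params = {"decoder": {"layer_0": {"query": {"lora_a": [0.0], "lora_b": [0.0]}}}}

merged = overlay(base_params, adapter_params)
trainable = sorted(trainable_paths(merged))

print(sorted(merged["decoder"]["layer_0"]["query"]))
print(trainable)
```

Keeping the overlay non-mutating means the base tree can still be inspected or reused after the merge, and the path filter gives a simple way to check that only adapter leaves are marked trainable.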

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-06-26T21:37:53Z

Codecov Report

❌ Patch coverage is 35.22727% with 57 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/maxtext/trainers/pre_train/train.py	23.33%	20 Missing and 3 partials ⚠️
src/maxtext/utils/gradient_accumulation.py	0.00%	13 Missing ⚠️
src/maxtext/utils/lora_utils.py	52.38%	9 Missing and 1 partial ⚠️
src/maxtext/utils/train_utils.py	35.71%	7 Missing and 2 partials ⚠️
src/maxtext/common/checkpointing.py	0.00%	2 Missing ⚠️
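
As a consistency check on the table above, the per-file counts can be summed and compared with the headline figure. The short script below does that arithmetic. The per-file numbers are copied from the report; the total number of changed lines (88) and the number of covered lines (31) are not stated in the report and are inferred here from the 35.22727% figure, counting partial lines as missing.

```python
# Missing + partial lines per file, as listed in the report.
missing_by_file = {
    "src/maxtext/trainers/pre_train/train.py": 20 + 3,
    "src/maxtext/utils/gradient_accumulation.py": 13,
    "src/maxtext/utils/lora_utils.py": 9 + 1,
    "src/maxtext/utils/train_utils.py": 7 + 2,
    "src/maxtext/common/checkpointing.py": 2,
}

total_missing = sum(missing_by_file.values())

# Headline patch coverage from the report, as a fraction.
patch_coverage = 35.22727 / 100

# Inferred: if covered / total = patch_coverage, then total = missing / (1 - coverage).
implied_total = round(total_missing / (1 - patch_coverage))
covered = implied_total - total_missing

print(total_missing, implied_total, covered)
print(round(100 * covered / implied_total, 5))
```

The sum of the listed files matches the 57 uncovered lines in the headline, and the inferred totals reproduce the reported percentage.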

SurbhiJainUSC force-pushed the lora_train branch from e1faad5 to 1647e8e Compare June 26, 2026 21:33

SurbhiJainUSC force-pushed the lora_train branch 5 times, most recently from 7bd18eb to 2b1a987 Compare June 27, 2026 00:18

Enable LoRA training in the NNX path of MaxText (pre-training and native SFT) · df7a663

SurbhiJainUSC force-pushed the lora_train branch from 2b1a987 to df7a663 Compare June 27, 2026 00:24

SurbhiJainUSC wants to merge 1 commit into main from lora_train
