fix: correct LoRA initialization and forward pass under tensor parallelism by chen2021673 · Pull Request #150 · InfiniTensor/InfiniTrain

chen2021673 · 2026-04-30T01:40:57Z

Summary

Fix LoRA loss divergence under tensor/data parallel training by making LoRA tensor-parallel initialization deterministic and aligning the forward collective order between base linear and LoRA linear paths.

This PR changes TP LoRA parallel linear modules to compute the base shard and LoRA shard locally first, add them before communication, and then run a single TP collective on the combined output. It also adds rank-aware Broadcast/Scatter ProcessGroup APIs and updates LoRA weight loading to support loading full saved LoRA tensors into TP-sharded model parameters.

Motivation

In tensor-parallel LoRA training, the base linear output and the LoRA update together form one logical linear output. Previously, the base path and LoRA path could run separate TP collectives and add their outputs afterward:

base = TPCollective(base_i);
lora = TPCollective(lora_i);
out = base + lora;

This changes the floating-point communication/reduction order compared with computing the logical local contribution first:

out_i = base_i + lora_i;
out = TPCollective(out_i);

The separate-collective path can introduce numerical divergence across TP/DDP configurations.
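The effect of the two orderings can be reproduced on the host with plain floats. The following is a minimal sketch, not the InfiniTrain implementation: `SimulatedAllReduce`, `SeparateCollectives`, and `SingleCollective` are hypothetical names, and a rank-ordered sum stands in for the real TP collective.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a TP all-reduce: sums the per-rank values in rank order.
float SimulatedAllReduce(const std::vector<float> &per_rank) {
    float sum = 0.0f;
    for (float v : per_rank) { sum += v; }
    return sum;
}

// Old order: reduce the base and LoRA contributions separately, then add the results.
float SeparateCollectives(const std::vector<float> &base, const std::vector<float> &lora) {
    return SimulatedAllReduce(base) + SimulatedAllReduce(lora);
}

// New order: add base_i + lora_i on each rank first, then reduce once.
float SingleCollective(const std::vector<float> &base, const std::vector<float> &lora) {
    std::vector<float> local(base.size());
    for (std::size_t i = 0; i < base.size(); ++i) { local[i] = base[i] + lora[i]; }
    return SimulatedAllReduce(local);
}
```

With two simulated ranks, `base = {1e8f, -1e8f}` and `lora = {1.0f, 1.0f}`, the first function returns 2 and the second returns 0 under IEEE-754 single precision, because `1e8f + 1.0f` rounds back to `1e8f`. Neither result is meant to be "the accurate one" here; the point is only that the two orderings are not bitwise interchangeable.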

Separately, replicated or sharded LoRA parameters need rank-consistent initialization so that every TP rank starts from the same logical LoRA weights.

Key Changes

LoRA parallel linear forward

Update LoRAColumnParallelLinear to:
- compute the base shard locally;
- compute the LoRA shard locally;
- add base_shard + scaled_lora_shard before TP gather;
- run a single gather when gather_output_ is enabled.
Update LoRARowParallelLinear to:
- compute the base shard locally;
- compute the LoRA shard locally;
- add base_shard + scaled_lora_shard before TP reduce/reduce-scatter;
- apply bias after the collective, matching the base RowParallelLinear behavior.

This makes the LoRA path follow the same logical communication boundary as the base parallel linear layer and avoids running separate collectives for base and LoRA outputs.
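The row-parallel steps above can be sketched as a single-process simulation. This is an illustration under assumptions, not the actual module code: `MatVecSlice` and `RowParallelLoRAForward` are hypothetical helpers, the loop over `rank` stands in for the TP ranks, the accumulation into `out` stands in for the reduce, and the input dimension is assumed to be divisible by `tp_size`.

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // row-major: m[row][col]

// y = m[:, col_begin:col_end] * x[col_begin:col_end]
Vec MatVecSlice(const Mat &m, const Vec &x, std::size_t col_begin, std::size_t col_end) {
    Vec y(m.size(), 0.0);
    for (std::size_t r = 0; r < m.size(); ++r) {
        for (std::size_t c = col_begin; c < col_end; ++c) { y[r] += m[r][c] * x[c]; }
    }
    return y;
}

// Simulated row-parallel LoRA forward: w is [out, in], lora_a is [rank_r, in],
// lora_b is [out, rank_r]. w and lora_a are sharded along the input dimension.
Vec RowParallelLoRAForward(const Mat &w, const Mat &lora_a, const Mat &lora_b, const Vec &bias,
                           const Vec &x, double scale, std::size_t tp_size) {
    const std::size_t shard = x.size() / tp_size;  // assumes in_features % tp_size == 0
    Vec out(w.size(), 0.0);
    for (std::size_t rank = 0; rank < tp_size; ++rank) {
        const std::size_t begin = rank * shard;
        const std::size_t end = begin + shard;
        const Vec base_i = MatVecSlice(w, x, begin, end);            // base shard, local
        const Vec a_x = MatVecSlice(lora_a, x, begin, end);          // lora_A shard, local
        const Vec lora_i = MatVecSlice(lora_b, a_x, 0, a_x.size());  // lora_B, local
        // Add base and scaled LoRA locally; the += across ranks models one reduce.
        for (std::size_t o = 0; o < out.size(); ++o) { out[o] += base_i[o] + scale * lora_i[o]; }
    }
    // Bias is applied once, after the collective.
    for (std::size_t o = 0; o < out.size(); ++o) { out[o] += bias[o]; }
    return out;
}
```

Applying the bias after the simulated reduce keeps it from being added `tp_size` times, which is the same reason the base row-parallel layer defers it.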

Deterministic TP LoRA initialization

Make replicated ColumnParallel lora_A consistent across TP ranks.
Initialize the logical RowParallel lora_A from TP rank 0 and distribute the correct local shards to each TP rank.
Ensure TP ranks observe the same logical LoRA initialization instead of independently sampling incompatible parameter values.
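As a rough illustration of "sample once on the root, then distribute", the sketch below draws the full logical tensor from one seeded generator and cuts it into per-rank shards. It is a simplification under assumptions: `SampleFullOnRoot` and `ScatterShards` are hypothetical names, the tensor is flattened to 1-D (the real shard dimension is not modeled), a uniform distribution stands in for the actual initializer, and the element count is assumed to be divisible by `tp_size`.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical: what TP rank 0 would do, sampling the full logical tensor once.
std::vector<float> SampleFullOnRoot(std::size_t numel, float bound, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(-bound, bound);
    std::vector<float> full(numel);
    for (float &v : full) { v = dist(gen); }
    return full;
}

// Hypothetical stand-in for a scatter from the root: rank i receives the i-th
// contiguous slice of the full tensor.
std::vector<std::vector<float>> ScatterShards(const std::vector<float> &full, std::size_t tp_size) {
    const std::size_t shard = full.size() / tp_size;  // assumes full.size() % tp_size == 0
    std::vector<std::vector<float>> shards;
    for (std::size_t rank = 0; rank < tp_size; ++rank) {
        const auto first = full.begin() + static_cast<std::ptrdiff_t>(rank * shard);
        shards.emplace_back(first, first + static_cast<std::ptrdiff_t>(shard));
    }
    return shards;
}
```

Because every shard is a slice of the same sampled tensor, concatenating the shards gives back the same logical weights whether `tp_size` is 2 or 4, which is the property independent per-rank sampling does not guarantee.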

ProcessGroup communication APIs

Add rank-aware communication APIs:
- ProcessGroup::Broadcast(tensors, root_rank_in_group, async_op)
- ProcessGroup::Scatter(output_tensors, input_tensors, root_rank_in_group, async_op)
Rename the old APIs to BroadCast_ / Scatter_ to avoid semantic ambiguity.
Update existing autograd communication wrappers to call the legacy renamed APIs where appropriate.

LoRA weight loading

Extend LoadLoRAWeights to handle shape mismatch between saved full LoRA tensors and TP-sharded destination tensors.
When the destination tensor is sharded, slice the loaded full tensor according to tp_rank before copying into the local model parameter.
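The slice bounds used when loading can be sketched as a small helper. This is an assumption-laden sketch rather than the code in `LoadLoRAWeights`: `TpShardRange` is a hypothetical name, and the divisibility check is an added guard that the PR description does not mention.

```cpp
#include <cstdint>
#include <stdexcept>
#include <utility>

// Returns the half-open [start, end) range along the sharded dimension that
// belongs to tp_rank, given the size of that dimension in the full saved tensor.
std::pair<int64_t, int64_t> TpShardRange(int64_t full_dim, int64_t tp_size, int64_t tp_rank) {
    if (tp_size <= 0 || tp_rank < 0 || tp_rank >= tp_size) {
        throw std::invalid_argument("tp_rank must be in [0, tp_size)");
    }
    if (full_dim % tp_size != 0) {
        // Integer division would otherwise silently drop the trailing elements.
        throw std::invalid_argument("full_dim must be divisible by tp_size");
    }
    const int64_t shard_size = full_dim / tp_size;
    const int64_t start = tp_rank * shard_size;
    return {start, start + shard_size};
}
```

For example, a saved dimension of 8 with `tp_size = 4` gives rank 2 the range `[4, 6)`; the returned bounds would then be passed to whatever slice operation the tensor type provides.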

Test

The loss values in the LoRA tests fluctuate slightly, which is expected because the tests now use saved fixed initialization values.

The performance variance is concentrated in the lora/bfloat16/distopt tests and is treated as normal fluctuation.

…ence

Inline base and LoRA matmuls, add locally, then issue a single AllGather/AllReduce instead of two separate collective ops. The prior two-collective approach caused floating-point divergence in DDP loss. Also fix LoadLoRAWeights to slice sharded tensors by tp_rank when the checkpoint shape differs from the partitioned model shape.

Chamberlain0w0 · 2026-06-11T05:36:03Z

+                int64_t shard_size = dims[shard_dim] / tp_size;
+                int64_t start = parallel::tp_rank * shard_size;
+                auto sliced = cpu_tensor->Slice(shard_dim, start, start + shard_size);
+                dst->CopyFrom(sliced);


Chamberlain0w0 · 2026-06-11T06:09:47Z

+
+        if (tp_rank == 0) {
+            if (config_.use_kaiming_a) {
+                init::KaimingUniform(parameters_[kParamLoraAName], config_.kaiming_a_param);


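Having each TP rank run its own init here may leave the weights inconsistent across DP groups. Right now things may happen to line up because every rank uses the default seed, but judging from the code, each rank is still semantically initializing on its own, so weights across DP groups can end up misaligned. In principle, the DDP model should do one final parameter broadcast/copy so that corresponding DP ranks hold identical model parameters before forward actually runs, but I'm not sure whether changing that here would affect other LoRA infrastructure. The simplest option is to add a dp_group broadcast after tp_group->Broadcast({parameters_[kParamLoraAName]}, 0);.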

Chamberlain0w0 · 2026-06-11T11:33:43Z

    }
 }

 void SaveLoRAWeights(const std::shared_ptr<Module> &model, const std::string &filepath) {


save appears to write the TP-local shard tensor, which no longer matches the load logic? load should read the full tensor by default, right?

chen2021673 force-pushed the lora_ddp_loss branch from 8459156 to f918ae3 Compare May 14, 2026 09:08

chen2021673 force-pushed the lora_ddp_loss branch from 7bfc67d to d2c8f7a Compare June 4, 2026 05:38

chen2021673 added 3 commits June 4, 2026 06:12

fix: broadcast lora_A init from TP rank 0 to ensure consistent replic…

9467166

…ated weights

feat: add rank-aware Broadcast/Scatter to ProcessGroup

f4f2220

Introduce new multi-stream Broadcast and Scatter APIs that take a root_rank_in_group argument, and rename the legacy single-stream variants to BroadCast_/Scatter_ to disambiguate.

chen2021673 force-pushed the lora_ddp_loss branch from d2c8f7a to f4f2220 Compare June 5, 2026 05:22

Use TP collectives for LoRA A init

adbc82d

Chamberlain0w0 requested changes Jun 11, 2026

chen2021673 wants to merge 4 commits into master from lora_ddp_loss
