FP8 Megablox for batch split by BirdsOfAFthr · Pull Request #3770 · AI-Hypercomputer/maxtext

BirdsOfAFthr · 2026-04-29T05:00:19Z

Description

(1) This update enables FP8 Megablox quantization support for DeepSeek batch split configurations.

When quantization is active, the following changes apply:

Kernel Quantization: gmm kernels accept FP8 recipes (defined via the MaxText command line) in both the forward and backward passes.
gmm forward: the weight is manually quantized to bypass the explicit sharding error; the activation is quantized using qwix.
gmm backward: gradients are quantized using qwix.
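
To make the "manually quantized" weight path concrete, here is a minimal sketch of per-tensor absmax scaling to FP8 in JAX. It is an illustration only, not the code in this PR: the helper names `quantize_fp8` and `dequantize_fp8`, the choice of `float8_e4m3fn`, and the per-tensor scaling granularity are all assumptions, and the actual FP8 recipe is whatever is selected on the MaxText command line.

```python
import jax
import jax.numpy as jnp

# Assumption: e4m3 is used for illustration; the real recipe may differ.
FP8_DTYPE = jnp.float8_e4m3fn
FP8_MAX = float(jnp.finfo(FP8_DTYPE).max)


def quantize_fp8(x):
  """Hypothetical helper: per-tensor absmax scaling followed by an FP8 cast."""
  amax = jnp.max(jnp.abs(x)).astype(jnp.float32)
  # Guard against an all-zero tensor so the division below stays finite.
  scale = jnp.maximum(amax, 1e-12) / FP8_MAX
  q = (x.astype(jnp.float32) / scale).astype(FP8_DTYPE)
  return q, scale


def dequantize_fp8(q, scale):
  """Hypothetical helper: map FP8 values back to float32."""
  return q.astype(jnp.float32) * scale


key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (64, 128), dtype=jnp.float32)

q, scale = quantize_fp8(w)
w_hat = dequantize_fp8(q, scale)

# Worst-case absolute error, relative to the largest weight magnitude.
max_rel_err = float(jnp.max(jnp.abs(w - w_hat)) / jnp.max(jnp.abs(w)))
```

Because the scale is computed with plain array ops rather than a library wrapper, a sketch like this can be applied to an already-sharded weight without going through a separate quantization layer, which is one plausible reading of why the weight is handled manually here.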

(2) This change also enables merging gating gmm kernels.

In the previous SwiGLU/GLU implementation, the gate projection and up projection were processed using two sequential gmm_fn calls. By concatenating these weights and processing them together, we effectively double the contiguous hidden dimension of the kernel. This is especially critical for FP8 with Expert Parallelism (EP) configurations that shard along the contracting dimension. Because this sharding strategy inherently shrinks the local MLP hidden dimension on each device, the matrix multiplications can become small and bottlenecked by memory bandwidth. Merging $W_0$ and $W_1$ gives a 2x increase in that local dimension, restoring arithmetic intensity and hardware utilization.
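
The equivalence between the two-call and merged forms can be sketched as follows. This is a hedged illustration, not the PR's implementation: `jax.lax.ragged_dot` stands in for `gmm_fn`, the function names `glu_two_calls` and `glu_merged` are hypothetical, and the shapes are arbitrary small values. In practice the concatenated weight would be built once rather than on every call.

```python
import jax
import jax.numpy as jnp


def glu_two_calls(x, w0, w1, group_sizes):
  """Baseline: gate and up projections as two sequential grouped matmuls."""
  gate = jax.lax.ragged_dot(x, w0, group_sizes)
  up = jax.lax.ragged_dot(x, w1, group_sizes)
  return jax.nn.silu(gate) * up


def glu_merged(x, w0, w1, group_sizes):
  """Merged: one grouped matmul over weights concatenated on the output axis."""
  w01 = jnp.concatenate([w0, w1], axis=-1)  # [experts, emb, 2 * mlp]
  fused = jax.lax.ragged_dot(x, w01, group_sizes)  # [tokens, 2 * mlp]
  gate, up = jnp.split(fused, 2, axis=-1)
  return jax.nn.silu(gate) * up


num_experts, emb, mlp = 4, 32, 64
group_sizes = jnp.array([3, 5, 2, 6], dtype=jnp.int32)  # tokens per expert
tokens = int(group_sizes.sum())

kx, k0, k1 = jax.random.split(jax.random.PRNGKey(0), 3)
x = jax.random.normal(kx, (tokens, emb), dtype=jnp.float32)
w0 = jax.random.normal(k0, (num_experts, emb, mlp), dtype=jnp.float32)
w1 = jax.random.normal(k1, (num_experts, emb, mlp), dtype=jnp.float32)

out_two = glu_two_calls(x, w0, w1, group_sizes)
out_merged = glu_merged(x, w0, w1, group_sizes)
```

Both functions compute the same values; the merged form simply presents the kernel with an output dimension of `2 * mlp` instead of `mlp`, which is the doubling described above.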

Tests

Verification: Validated via end-to-end (e2e) perf and convergence benchmarks.
Coverage: Unit tests will be added in a subsequent update.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

gobbleturk · 2026-04-29T18:10:40Z

Note that there is more general support in PR #3736.

shuningjin

Thanks for GMM FP8 support with careful manual quantization, while bringing back merging gating from PR#3199! Had a comment wrt sharding, and other minor changes.

suexu1025

Thanks @BirdsOfAFthr

# Description

1. Refactor `manual_axis_type` conditional specification.
   - It was previously introduced by [PR#3770](#3770) and [PR#3869](#3869).
   - It requires tokamax > 0.12.0 (not yet released) with [commit](openxla/tokamax@cc374e3).
   - It is used for gmm shard_map with check_vma=True. This is currently only activated for experimental runs (deepseek_batchsplit and use_manual_quantization). Tests won't be blocked by the dependency.
2. Add unit test for `use_tokamax_gmm=True` for smoke train. Both bf16 and fp8.
   - Smoke train instead of AOT, which is blocked by b/489205940.
   - Run in g3 only to save GitHub CI/CD time.
3. Reset `group_offset` to None: [PR#4082](#4082) added `group_offset` to `tokamax.ragged_dot` and `tokamax.ragged_dot_general`. However, `group_offset` is not yet supported by tokamax:
   - see [tokamax code](https://github.com/openxla/tokamax/blob/cdde910bf925d834c0c9e6cee5a488095f0381d4/tokamax/_src/ops/ragged_dot/api.py#L177-L178)
   - the added unit test meets an [error](http://shortn/_oy55Dkh22N)

# Tests

unit test `tokamax_gmm_test`

# Checklist

Before submitting this PR, please make sure (put X in square brackets):

- [x] I have performed a self-review of my code. For an optional AI review, add the `gemini-review` label.
- [x] I have necessary comments in my code, particularly in hard-to-understand areas.
- [x] I have run end-to-end tests and provided workload links above if applicable.
- [x] I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in [our documentation](https://maxtext.readthedocs.io/en/latest/development.html#adding-new-documentation-files).

PiperOrigin-RevId: 930601487

BirdsOfAFthr added the pull ready label Apr 29, 2026

BirdsOfAFthr marked this pull request as ready for review April 29, 2026 05:01

BirdsOfAFthr requested review from NicoGrande, NuojCheng, RissyRan, bvandermoon, gagika, gobbleturk, jesselu-google, jiangjy1982, parambole, richjames0, shralex, shuningjin and suexu1025 as code owners April 29, 2026 05:01

BirdsOfAFthr force-pushed the amandaliang branch from 47d4976 to 3da5049 Compare April 29, 2026 19:07

BirdsOfAFthr requested review from A9isha, SurbhiJainUSC, abhinavclemson, aireenmei, dipannita08, hengtaoguo, igorts-git, jshin1394, khatwanimohit, liudangyi, michelle-yooh and vipannalla as code owners April 29, 2026 19:07

BirdsOfAFthr force-pushed the amandaliang branch from 3da5049 to 5992420 Compare April 29, 2026 19:13

BirdsOfAFthr changed the title ~~Support merging gating gmm kernels~~ FP8 Megablox for batch split Apr 29, 2026

BirdsOfAFthr force-pushed the amandaliang branch from 5992420 to 61a6832 Compare April 30, 2026 21:45

shuningjin reviewed May 1, 2026


Comment thread src/maxtext/kernels/megablox/ops.py Outdated

Comment thread src/maxtext/kernels/megablox/ops.py Outdated

Comment thread src/maxtext/layers/quantizations.py Outdated

Comment thread src/maxtext/models/deepseek_batchsplit.py Outdated

BirdsOfAFthr force-pushed the amandaliang branch 2 times, most recently from 49d07fb to 5ef365f Compare May 1, 2026 05:24

shuningjin reviewed May 1, 2026


Comment thread src/maxtext/models/deepseek_batchsplit.py Outdated

BirdsOfAFthr force-pushed the amandaliang branch from 5ef365f to 556797b Compare May 1, 2026 16:30

shuningjin approved these changes May 1, 2026


suexu1025 approved these changes May 1, 2026


FP8 Megablox for batch split

0ec1454

BirdsOfAFthr force-pushed the amandaliang branch from 556797b to 0ec1454 Compare May 1, 2026 17:30

copybara-service Bot closed this in 233794e May 1, 2026

copybara-service Bot mentioned this pull request Jun 11, 2026

[deprecated] clean tokamax gmm and add test #4147

Closed

4 tasks

shuningjin mentioned this pull request Jun 11, 2026

clean tokamax gmm and add test #4152

Merged

4 tasks

BirdsOfAFthr wants to merge 1 commit into main from amandaliang
