Reuse quantized embedding table for tied LM head in TieWordEmbeddings by justinchuby · Pull Request #2549 · microsoft/Olive

justinchuby · 2026-07-01T17:33:44Z

Describe your changes

Add a reuse mode to the TieWordEmbeddings graph surgery.

Previously TieWordEmbeddings handled two cases: both weights unquantized
(Gather + MatMul) or both quantized (GatherBlockQuantized + MatMulNBits).
There was no path for when only the embedding is quantized while the LM head
is still a float MatMul whose weight is the tied embedding (reached through a
Transpose). This happens naturally when the embedding Gather is quantized with
OnnxBlockWiseRtnQuantization while the transformer body is left for a separate
pass such as OnnxKQuantQuantization. In that state the tied word-embedding matrix
is stored twice — once as INT4 (embedding) and once as float16 (LM head) —
which is larger than an all-float16 model.

This PR adds:

handle_reuse — rebuilds the float LM head as a MatMulNBits that shares the
embedding's INT4 qweight / scales / zero_point (the byte-identical table,
reshaped from the 2D GatherBlockQuantized layout to the 3D MatMulNBits
layout), then prunes the now-dead Transpose and the float embedding weight.
reuse_weights_match — a correctness gate that only ties when the float LM
head weight actually equals the dequantized embedding table (comparing a slice),
so an untied projection is never incorrectly tied.

Pipeline it enables (smallest size at the highest-quality body quantization):

MobiusBuilder(fp16)
  -> OnnxKQuantQuantization            # body -> Q4_K_M
  -> OnnxBlockWiseRtnQuantization      # embedding Gather -> GatherBlockQuantized
  -> GraphSurgeries[TieWordEmbeddings] # LM head reuses the embedding INT4 table

Each pass touches only its intended nodes (body MatMuls have initializer weights;
the embedding Gather has an initializer weight; the tied LM head's weight is
behind a Transpose, so it is skipped by both quantizers and handled here).

Measured on a 1.8B tied-embedding translation model (CUDA): on-disk size drops from
~1.39 GB (K-Quant body, float16 embedding/LM head) to ~1.03 GB, with equal or
better output fidelity vs float16 compared to the two-table INT4 variant.

Checklist before requesting a review

Add unit tests for this change.
Make sure all tests can pass.
Update documents if necessary.
Lint and apply fixes to your code by running lintrunner -a
Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
TieWordEmbeddings can now tie a float LM head onto an already-quantized
(GatherBlockQuantized) embedding, storing the shared word-embedding matrix
once as INT4 instead of INT4 + float16.

(Optional) Issue link

Replace the GPTQ+RTN CUDA INT4 pipeline with: MobiusBuilder(fp16) -> OnnxKQuantQuantization(body, Q4_K_M) -> OnnxBlockWiseRtnQuantization(embedding Gather -> GatherBlockQuantized) -> GraphSurgeries[TieWordEmbeddings] (LM head reuses the embedding INT4 table) Each pass touches only its intended nodes: K-Quant quantizes the body MatMuls, ONNX RTN quantizes the embedding Gather, and TieWordEmbeddings rebuilds the tied LM head as a MatMulNBits sharing the embedding's INT4 table (pruning the float weight). This gives the smallest on-disk model (~1.03 GB) at K-Quant body quality, with translation quality on par with the K-Quant baseline and better than GPTQ. Requires the TieWordEmbeddings reuse mode from microsoft/Olive#2549. Drop gptqmodel from requirements (no longer used); update info.yml and README. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Justin Chu <11205048+justinchuby@users.noreply.github.com>

Copilot

Pull request overview

This PR extends the TieWordEmbeddings ONNX graph surgery to support a new reuse mode where the embedding is already quantized (GatherBlockQuantized) but the tied LM head is still a float MatMul. In this case, the surgery rebuilds the LM head as MatMulNBits that reuses the embedding’s quantized tensors to avoid storing the same embedding table twice.

Changes:

Add a reuse path in TieWordEmbeddings to convert a float LM-head MatMul into MatMulNBits sharing the embedding’s quantized qweight/scales/zero_point.
Add a reuse_weights_match gate to verify the float LM-head weight matches the dequantized embedding table before tying.
Add unit tests covering both the successful reuse case and the “skip when not actually tied” case.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
`olive/passes/onnx/graph_surgeries.py`	Adds the reuse-mode surgery + correctness gate and performs pruning/rewiring of the LM head to reuse embedding quantized tensors.
`test/passes/onnx/test_graph_surgeries.py`	Adds unit tests that validate reuse-mode tying occurs only when the float LM head truly matches the embedding table.

+        graph_idx = dag.get_graph_idx(lm_head_name)
+        n_blocks = hidden // block_size
+        blob_size = block_size * bits // 8
+


+        n = min(n_check, vocab)
+        n_blocks = hidden // block_size
+
+        # Unpack two 4-bit codes per byte (low nibble first), matching MatMulNBits packing.
+        q = qweight[:n]
+        codes = np.empty((n, hidden), np.float32)
+        codes[:, 0::2] = (q & 0x0F).astype(np.float32)
+        codes[:, 1::2] = (q >> 4).astype(np.float32)
+        codes = codes.reshape(n, n_blocks, block_size)
+
+        if zero_point is not None:
+            zp = zero_point[:n]
+            zcodes = np.empty((n, n_blocks), np.float32)
+            zcodes[:, 0::2] = (zp & 0x0F).astype(np.float32)
+            zcodes[:, 1::2] = (zp >> 4).astype(np.float32)
+            zcodes = zcodes.reshape(n, n_blocks, 1)
+        else:
+            # Symmetric quantization centers codes at the midpoint of the 4-bit range.
+            zcodes = np.float32(2 ** (bits - 1))


+        if transpose_name is not None and not dag.get_consumers(transpose_name):
+            dag.remove_node(transpose_name)


Add a reuse mode to the TieWordEmbeddings graph surgery for the case where the embedding has been quantized to GatherBlockQuantized but the LM head is still a float MatMul (its weight is the tied embedding, reached through a Transpose). Previously TieWordEmbeddings only handled both-unquantized (Gather + MatMul) or both-quantized (GatherBlockQuantized + MatMulNBits). When only the embedding Gather is quantized (e.g. OnnxBlockWiseRtnQuantization while the body is left for OnnxKQuantQuantization), the tied word-embedding matrix ends up stored twice: once as INT4 (embedding) and once as float16 (LM head), which is larger than a fully float16 model. handle_reuse rebuilds the LM head as a MatMulNBits that shares the embedding's INT4 qweight / scales / zero-point (the byte-identical table, reshaped to the MatMulNBits layout), and prunes the now-dead Transpose and float embedding weight. reuse_weights_match gates this on the float LM head weight actually matching the dequantized embedding table, so an untied projection is never tied. This lets a K-Quant body + shared-INT4 tied embedding/LM head model reach the smallest on-disk size at the highest-quality body quantization. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Justin Chu <11205048+justinchuby@users.noreply.github.com>

Copilot AI review requested due to automatic review settings July 1, 2026 17:33

Copilot started reviewing on behalf of justinchuby July 1, 2026 17:34 View session

Copilot AI reviewed Jul 1, 2026

View reviewed changes

justinchuby force-pushed the justinchu/tie-embeddings-reuse-quantized branch from 8d2f2fa to 5c8867b Compare July 1, 2026 18:02

justinchuby changed the base branch from main to justinchu/graph-surgeries-ir-rewriter July 1, 2026 18:02

justinchuby force-pushed the justinchu/tie-embeddings-reuse-quantized branch 2 times, most recently from 2f8d752 to b148803 Compare July 1, 2026 21:22

justinchuby force-pushed the justinchu/tie-embeddings-reuse-quantized branch from b148803 to 3dcb3fa Compare July 1, 2026 21:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Reuse quantized embedding table for tied LM head in TieWordEmbeddings#2549

Reuse quantized embedding table for tied LM head in TieWordEmbeddings#2549
justinchuby wants to merge 1 commit into
justinchu/graph-surgeries-ir-rewriterfrom
justinchu/tie-embeddings-reuse-quantized

justinchuby commented Jul 1, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		if transpose_name is not None and not dag.get_consumers(transpose_name):
		dag.remove_node(transpose_name)

Uh oh!

Conversation

justinchuby commented Jul 1, 2026

Describe your changes

Checklist before requesting a review

(Optional) Issue link

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants