Merge 3 existing PR related to OnnxDiscrepancyCheck + llama.cpp integration by xadupre · Pull Request #2546 · microsoft/Olive

xadupre · 2026-06-29T11:38:07Z

Describe your changes

Merges #2536, #2535, #2534.

Additionally adds llama.cpp integration and other improvements to OnnxDiscrepancyCheck:

New llama_cpp flag (bool, default False) on OnnxDiscrepancyCheck — when enabled, converts the reference HuggingFace model to GGUF format using convert_hf_to_gguf.py from llama.cpp and compares inference with llama.cpp.
New llama_cpp_env_path parameter (Optional[str]) — path to the llama_env virtual environment where llama-cpp-python and convert_hf_to_gguf.py are installed (defaults to "llama_env" relative to cwd). The virtual environment also isolates potentially conflicting versions of torch, transformers, etc.
compare_llama_cpp() method — saves the reference model and tokenizer to output_dir/hf_model (alongside the saved test model/report output) using save_pretrained (standard HuggingFace format), then:
1. Calls convert_hf_to_gguf.py (the official llama.cpp conversion CLI) inside llama_env via subprocess to produce a GGUF F32 file at output_dir/model.gguf.
2. Runs a self-contained inference helper script inside llama_env via subprocess.run to measure first-token latency with llama_cpp.Llama.
  Results include first-token match vs PyTorch, llama_cpp_ttft_s, llama_cpp_ttfn_s, llama_cpp_total_time_s, llama_cpp_speedup_vs_pytorch, and llama_cpp_speedup_vs_onnx. All llama_cpp/gguf imports are strictly isolated to the subprocess — the main Olive process never imports them.
New --test_llama_path CLI option — specifies the path to the llama_env virtual environment when running olive optimize --test. When provided, the injected OnnxDiscrepancyCheck pass automatically enables llama_cpp=True and forwards the path as llama_cpp_env_path. Using --test_llama_path without --test emits a warning.
Improved --test_metrics parsing — now accepts both space-separated (--test_metrics mae speedup) and comma-separated (--test_metrics mae,speedup) forms. A _parse_test_metrics type function handles splitting and validation per token, and a _flatten_test_metrics helper normalises the result before it is forwarded to the pass.
Fixed add_discrepancy_check_pass update-in-place — when a config was previously generated by olive optimize --dry_run --test, the OnnxDiscrepancyCheck pass was already present and was silently skipped on subsequent olive run --test --test_metrics … calls. The function now updates the existing pass in-place, refreshing reference_model_path (resolved to absolute), report_output_dir, metric settings (max_mae / timing_iterations), and llama.cpp settings — so --test_metrics, --output_path, and --test_llama_path from olive run always take effect.
Fixed test model persistence across engine cache hits — ModelBuilder now copies the test model directory to reference_hf_model/ alongside the generated ONNX output in the engine cache. OnnxDiscrepancyCheck falls back to this cached copy when reference_model_path (e.g. out/tiny-test) no longer exists on disk, fixing the OSError: Repo id must be in the form 'repo_name' or 'namespace/repo_name' error that occurred on subsequent olive run calls that hit the model-builder cache.
New SaveTestModelConfig pass (olive/passes/pytorch/save_test_model_config.py) — a new Olive pass injected at the beginning of the passes list when --test is active. It takes an HfModelHandler, writes config.json (with the reduced hidden-layer count) and the Olive test-model marker file to test_model_path, and returns the model unchanged. This ensures test_model_path always exists as a config-only directory before ModelBuilder or any downstream pass needs it, replacing the previous _save_test_model_config_for_dry_run standalone function.
CI workflow (test-model-fast.yml) — new step that creates a llama_env virtual environment, installs gguf, safetensors, llama-cpp-python (from pre-built CPU wheels at https://abetlen.github.io/llama-cpp-python/whl/cpu), transformers, sentencepiece, and protobuf, and downloads convert_hf_to_gguf.py from the llama.cpp GitHub repository.
Updated documentation (cli-fast-test.md) — clarifies WHERE the 2-layer reduction happens (_apply_test_model_config in olive/common/hf/utils.py, called during the model-builder pass of olive run) and WHEN out/tiny-test is created (by the SaveTestModelConfig pass on first olive run, completed with weights by ModelBuilder); documents the cache fallback behaviour; explains the --test_llama_path option and llama_env setup; and clarifies that --test_metrics is always respected even when the config was generated by olive optimize --dry_run --test.
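
The subprocess isolation described above for compare_llama_cpp can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `venv_python`, `build_convert_command`, and `convert_to_gguf` are hypothetical, the `bin/python` versus `Scripts/python.exe` layout is the usual one for Python virtual environments, and the `--outfile` / `--outtype f32` flags are assumed to match the convert_hf_to_gguf.py script in the llama.cpp checkout being used.

```python
import os
import subprocess
from pathlib import Path


def venv_python(env_path: str, windows: bool = os.name == "nt") -> Path:
    """Return the interpreter inside a virtual environment (hypothetical helper)."""
    env = Path(env_path)
    return env / "Scripts" / "python.exe" if windows else env / "bin" / "python"


def build_convert_command(env_path: str, script: str, hf_dir: str, gguf_path: str) -> list[str]:
    """Build the argv that runs the conversion script with the venv interpreter."""
    return [
        str(venv_python(env_path)),
        script,
        hf_dir,
        "--outfile", gguf_path,
        "--outtype", "f32",  # assumed flags; check them against your llama.cpp checkout
    ]


def convert_to_gguf(env_path: str, script: str, hf_dir: str, gguf_path: str) -> None:
    """Run the conversion in a child process so the caller never imports gguf."""
    cmd = build_convert_command(env_path, script, hf_dir, gguf_path)
    # check=True raises CalledProcessError if the script exits with a non-zero code.
    subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Because only the child interpreter imports gguf or llama_cpp, the parent process can keep its own torch and transformers versions, which is the isolation property the description calls out.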

Checklist before requesting a review

Add unit tests for this change.
Make sure all tests can pass.
Update documents if necessary.
Lint and apply fixes to your code by running lintrunner -a
Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

(Optional) Issue link

Copilot

Pull request overview

This PR consolidates three earlier changes around OnnxDiscrepancyCheck to improve test-mode metrics: it exposes latency values alongside speedup, adds time-to-first-token style generation metrics, and extends olive run --test so users can opt into additional metrics via --test_metrics.

Changes:

Extend OnnxDiscrepancyCheck.compare_generation to return a metrics dict including TTFT / time-to-first-N for both transformers and ORT GenAI.
Update speedup measurement to return and persist average PyTorch/ONNX latencies plus the computed speedup.
Add CLI support for selecting --test metrics (--test_metrics mae|speedup) and update/extend tests + docs accordingly.
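
The PR description states that --test_metrics accepts both space-separated and comma-separated values. A minimal sketch of how an argparse type function plus a flattening helper could provide that is shown below. The names `parse_metric_tokens` and `flatten_metrics` and the `ALLOWED_METRICS` tuple are illustrative assumptions; the bodies of the PR's own `_parse_test_metrics` and `_flatten_test_metrics` are not shown on this page.

```python
import argparse

ALLOWED_METRICS = ("mae", "speedup")  # assumed set, for illustration only


def parse_metric_tokens(value: str) -> list[str]:
    """argparse `type` function: split one argv token on commas and validate."""
    tokens = [t.strip() for t in value.split(",") if t.strip()]
    if not tokens:
        raise argparse.ArgumentTypeError("expected at least one metric name")
    for token in tokens:
        if token not in ALLOWED_METRICS:
            raise argparse.ArgumentTypeError(f"unknown metric {token!r}")
    return tokens


def flatten_metrics(groups: list[list[str]]) -> list[str]:
    """Flatten the per-token lists and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(name for group in groups for name in group))


parser = argparse.ArgumentParser()
# nargs="+" applies the type function to each argv token, so the parsed
# value is a list of lists that still needs flattening.
parser.add_argument("--test_metrics", nargs="+", type=parse_metric_tokens, default=[["mae"]])

spaced = parser.parse_args(["--test_metrics", "mae", "speedup"])
commas = parser.parse_args(["--test_metrics", "mae,speedup"])
```

Both invocations flatten to the same list, so downstream code does not need to know which form the user typed.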

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `olive/passes/onnx/discrepancy_check.py` | Adds/returns latency metrics for generation and speedup runs; logs formatted latency values. |
| `olive/cli/base.py` | Introduces `--test_metrics`, injects discrepancy pass based on selected metrics, and warns when `--test_metrics` is unused. |
| `olive/cli/run.py` | Wires `--test_metrics` through to the injected discrepancy pass and warning helper. |
| `test/passes/onnx/test_discrepancy_check.py` | Updates unit tests for new generation result shape and new speedup return tuple. |
| `test/cli/test_cli.py` | Adjusts CLI workflow test config to reflect timing being disabled in certain cases. |
| `test/cli/test_base.py` | Adds focused tests for default metric selection behavior and warning behavior. |
| `docs/source/how-to/cli/cli-fast-test.md` | Documents metric selection for `--test` runs. |

xadupre · 2026-06-29T13:06:41Z

@copilot in the job Ubuntu test model fast, add a step that creates a virtual environment called llama_env and installs llama-cpp-python. Then, in OnnxDiscrepancyCheck, add a flag called llama-cpp that converts the dumped model to GGUF format and measures the first-token difference, the latencies, and the speedup. Every use of llama-cpp must go through the virtual environment.

xadupre · 2026-07-01T11:23:06Z

@copilot do not use a temporary directory as you described in "The optimized ONNX model, its transformers config.json, and the reference_hf_model copy are routed to a temp working dir and discarded, so they no longer clutter output_path." Keep the produced model in the output folders. I still don't see any ONNX model.

xadupre · 2026-07-01T11:28:05Z

@copilot can you save the olive configuration into olive_config.json if dry_run is not enabled

xadupre · 2026-07-01T11:36:20Z

@copilot add a metric first_token_20 that computes the first-token difference between onnxruntime-genai and transformers, and add metrics tft (time to first token) and tf5t (time to first 5 tokens) to OnnxDiscrepancyCheck

Copilot · 2026-07-01T11:37:53Z

@copilot do not use temporary directory like you said in *The optimized ONNX model, its transformers config.json, and the reference_hf_model copy are routed to a temp working dir and d...

Fixed in 5b8bab5. Removed the temporary-directory routing — the optimized ONNX model is now saved to <output_path>/model/ and kept on disk. The Olive config.json stays at the <output_path> root (so the model builder's transformers config.json in the subfolder no longer overwrites it), and discrepancy_check_results.json remains at the root too. So optimize ... --test <path> produces the ONNX model in <output_path>/model/ and the small test model at the --test path.

Copilot · 2026-07-01T11:57:02Z

@copilot can you save the olive configuration into olive_config.json if dry_run is not enabled

Done in 99c5e02. When --dry_run is not enabled, the Olive config is now saved as olive_config.json (so it is never confused with the model's own config.json). --dry_run still writes config.json so olive run --config <output_path>/config.json continues to work.

@copilot add metrics first_token_20, it computes the first token difference with onnxruntime-genai compare to transformers, add metrics tft, time to first token, tf5t, time to first 5 tokens in OnnxDi...

Added in 99c5e02. OnnxDiscrepancyCheck (and --test_metrics) now support:

first_token_20: compares the first generated token between ONNX Runtime GenAI and transformers over a 20-token generation (reports both first tokens, a match flag, and matching leading tokens).
tft: time to first token (transformers + GenAI).
tf5t: time to first 5 tokens (transformers + GenAI).

The generation metrics use the optimized ONNX model directory as the GenAI model when it contains a genai_config.json. Results are logged and written to discrepancy_check_results.json.
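
To make the first_token_20 report concrete, here is a minimal sketch of how two generated token sequences could be compared. It is an illustration under assumptions, not the PR's implementation: the function name `compare_leading_tokens` and the result keys are hypothetical, and the inputs are taken to be plain lists of token ids from the transformers and GenAI generations.

```python
from itertools import takewhile


def compare_leading_tokens(reference: list[int], candidate: list[int]) -> dict:
    """Summarize how two generated token sequences agree from the start."""
    # Count the positions that match before the first divergence.
    matching = sum(
        1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(reference, candidate))
    )
    return {
        "reference_first_token": reference[0] if reference else None,
        "candidate_first_token": candidate[0] if candidate else None,
        "first_token_match": bool(reference) and bool(candidate) and reference[0] == candidate[0],
        "matching_leading_tokens": matching,
    }
```

Counting only the leading run of matches matters for autoregressive generation: once one token differs, later tokens are conditioned on different prefixes, so position-by-position agreement after that point says little.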

xadupre · 2026-07-01T15:51:42Z

@copilot fix fast fast

Copilot · 2026-07-01T15:58:59Z

@copilot fix fast fast

Fixed the test-model-fast.yml failure (in_features 64 must be divisible by group_size 128). The tiny test model has hidden_size=64, but --precision int4 generated a GPTQ pass with the default group_size=128. _set_offline_gptq_data_config now sets group_size=32, and both test_model_discrepancy and test_documented_test_model_smoke_flow pass locally.
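
The divisibility constraint behind this failure can be illustrated with a short sketch. The helper names `check_group_size` and `pick_group_size` and the candidate list are hypothetical; the fix in this PR simply hard-codes group_size=32 rather than searching for a value.

```python
def check_group_size(in_features: int, group_size: int) -> None:
    """Raise if the weight's input dimension cannot be split into whole groups."""
    if in_features % group_size != 0:
        raise ValueError(
            f"in_features {in_features} must be divisible by group_size {group_size}"
        )


def pick_group_size(in_features: int, candidates: tuple = (128, 64, 32, 16)) -> int:
    """Return the first candidate group size that evenly divides in_features."""
    for size in candidates:
        if in_features % size == 0:
            return size
    raise ValueError(f"no candidate group size divides in_features {in_features}")
```

With hidden_size=64, the default of 128 fails the check while 32 passes, which is consistent with the fix described above.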

xadupre · 2026-07-01T16:02:08Z

@copilot attn_impl for OnnxDiscrepancyCheck should be moved to the pass that creates the transformers test model; OnnxDiscrepancyCheck should use that model

Copilot · 2026-07-01T16:14:14Z

@copilot attn_impl for OnnxDiscrepancyCheck should be moved to the pass that creates the transformers test model; OnnxDiscrepancyCheck should use that model

Done in 37a4e48. Moved attn_impl from OnnxDiscrepancyCheck to SaveTestModelConfig (the pass that creates the transformers test model, default "sdpa"). It bakes _attn_implementation into the saved test model's config.json, and OnnxDiscrepancyCheck now loads the reference model from that directory without an explicit attn_implementation, so it uses the value stored on the saved model. Verified end-to-end that _attn_implementation=sdpa is written to the config and the smoke tests pass.

xadupre · 2026-07-01T16:25:58Z

@copilot
  File "/home/xadupre/github/Olive/olive/passes/onnx/discrepancy_check.py", line 710, in _run_for_config
    report_path.write_text(json.dumps(results, indent=2))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
          ^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 202, in encode
    chunks = list(chunks)
             ^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 432, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.12/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 439, in _iterencode
    o = _default(o)
        ^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type int32 is not JSON serializable

Copilot · 2026-07-01T16:31:12Z

@copilot File "/home/xadupre/github/Olive/olive/passes/onnx/discrepancy_check.py", line 710, in _run_for_config
report_path.write_text(json.dumps(results, indent=2))
...

Fixed. The results dict contained numpy scalars (e.g. int32 token ids), which aren't JSON serializable. Added a recursive _json_sanitize that converts numpy scalars/arrays to native Python types, applied before writing discrepancy_check_results.json (and before storing in model_attributes, which the CLI also serializes).
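
A recursive sanitizer of the kind described here might look like the sketch below. It is not the PR's `_json_sanitize`, whose body is not shown on this page; the name `json_sanitize` is illustrative. To stay runnable without numpy installed, the sketch relies on duck typing: numpy scalars and arrays both expose `tolist()`, which returns native Python values.

```python
import json


def json_sanitize(obj):
    """Recursively convert values into JSON-serializable Python types."""
    if isinstance(obj, dict):
        return {str(key): json_sanitize(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [json_sanitize(item) for item in obj]
    if obj is None or isinstance(obj, (str, bool, int, float)):
        return obj
    if hasattr(obj, "tolist"):  # numpy arrays and numpy scalars provide tolist()
        return json_sanitize(obj.tolist())
    if hasattr(obj, "item"):  # other scalar wrappers that only provide item()
        return json_sanitize(obj.item())
    return str(obj)  # last resort: keep a readable representation


class FakeInt32:
    """Stand-in for a numpy int32 scalar, used only to exercise the sketch."""

    def __init__(self, value: int):
        self.value = value

    def tolist(self) -> int:
        return self.value


results = {"first_token": FakeInt32(42), "tokens": (FakeInt32(1), 2), "nested": {"ok": True}}
report_text = json.dumps(json_sanitize(results), indent=2)
```

Sanitizing once, before both the file write and the model_attributes update, keeps the two serialization paths consistent.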

xadupre and others added 20 commits June 22, 2026 13:45

expose latencies with the speedup in OnnxDiscrepancyCheck

b21b4f6

Extend discrepancy check unit test for latency tuple

72c5d4e

add time to first token in OnnxDiscrepancyCheck

bf0a978

Add return type annotation to _measure_speedup

804bb92

Add latency key assertions to fully matching discrepancy test

1bdee25

Handle zero max_new_tokens in generation metrics

142ddea

Use single measured transformers generation for latency metrics

39cac1c

extend command line --test to trigger speedup measure

6b5b652

Document --test_metrics speedup usage

15287d8

Fix default test metrics to be mae-only, make speedup opt-in

89d98c4

Fix test to match new default mae-only behavior

7490e55

Fix test_cli.py expected pass config to include timing_iterations=0

bf96e3f

Merge branch 'main' into xadupre/lat

d34ecd5

Merge branch 'main' into xadupre/tts

141f35a

Merge branch 'main' into xadupre/cmd

1e8c020

Merge branch 'main' into xadupre/cmd

3370626

Merge branch 'main' into xadupre/tts

9c7365d

Merge branch 'main' into xadupre/lat

922089a

Merge branch 'xadupre/tts' of https://github.com/microsoft/Olive into xadupre/merged

0741621

Merge branch 'xadupre/lat' of https://github.com/microsoft/Olive into xadupre/merged

4c7938b

xadupre marked this pull request as ready for review June 29, 2026 13:02

Copilot AI review requested due to automatic review settings June 29, 2026 13:02

Copilot started reviewing on behalf of xadupre June 29, 2026 13:03

Copilot AI reviewed Jun 29, 2026

Comment thread olive/cli/base.py

Comment thread docs/source/how-to/cli/cli-fast-test.md Outdated

Comment thread olive/passes/onnx/discrepancy_check.py Outdated

Copilot started work on behalf of xadupre June 29, 2026 13:07

xadupre and others added 4 commits June 29, 2026 15:29

Potential fix for pull request finding

fc5c372

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

f1077b7

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

c1e75e6

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

feat: add llama-cpp integration to OnnxDiscrepancyCheck and llama_env CI step

4c73d0a

Copilot finished work on behalf of xadupre July 1, 2026 11:07

Copilot started work on behalf of xadupre July 1, 2026 11:23

Keep optimized ONNX model in output_path/model instead of temp dir for --test

811ad65

Copilot finished work on behalf of xadupre July 1, 2026 11:38

Copilot started work on behalf of xadupre July 1, 2026 11:39

Add olive_config.json save + first_token_20/tft/tf5t generation metrics

99c5e02

Copilot finished work on behalf of xadupre July 1, 2026 11:57

Copilot started work on behalf of xadupre July 1, 2026 15:52

Fix fast test: set GPTQ group_size=32 for tiny test model (hidden_size 64)

50f7a4f

Copilot finished work on behalf of xadupre July 1, 2026 15:59

Copilot started work on behalf of xadupre July 1, 2026 16:02

xadupre and others added 4 commits July 1, 2026 18:10

documentation

31ecf23

Merge branch 'xadupre/merged' of https://github.com/microsoft/Olive into xadupre/merged

5068b0f

Move attn_impl to SaveTestModelConfig; OnnxDiscrepancyCheck uses saved model config

37a4e48

Merge remote-tracking branch 'origin/xadupre/merged' into xadupre/merged

38eea95

Copilot finished work on behalf of xadupre July 1, 2026 16:14

Copilot started work on behalf of xadupre July 1, 2026 16:26

Fix int32 JSON serialization error in OnnxDiscrepancyCheck results

d44373f

Copilot finished work on behalf of xadupre July 1, 2026 16:31
