Merge 3 existing PR related to OnnxDiscrepancyCheck + llama.cpp integration#2546
Merge 3 existing PR related to OnnxDiscrepancyCheck + llama.cpp integration#2546xadupre wants to merge 68 commits into
Conversation
… xadupre/merged
… xadupre/merged
There was a problem hiding this comment.
Pull request overview
This PR consolidates three earlier changes around OnnxDiscrepancyCheck to improve test-mode metrics: it exposes latency values alongside speedup, adds time-to-first-token style generation metrics, and extends olive run --test so users can opt into additional metrics via --test_metrics.
Changes:
- Extend
OnnxDiscrepancyCheck.compare_generationto return a metrics dict including TTFT / time-to-first-N for both transformers and ORT GenAI. - Update speedup measurement to return and persist average PyTorch/ONNX latencies plus the computed speedup.
- Add CLI support for selecting
--testmetrics (--test_metrics mae|speedup) and update/extend tests + docs accordingly.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
olive/passes/onnx/discrepancy_check.py |
Adds/returns latency metrics for generation and speedup runs; logs formatted latency values. |
olive/cli/base.py |
Introduces --test_metrics, injects discrepancy pass based on selected metrics, and warns when --test_metrics is unused. |
olive/cli/run.py |
Wires --test_metrics through to the injected discrepancy pass and warning helper. |
test/passes/onnx/test_discrepancy_check.py |
Updates unit tests for new generation result shape and new speedup return tuple. |
test/cli/test_cli.py |
Adjusts CLI workflow test config to reflect timing being disabled in certain cases. |
test/cli/test_base.py |
Adds focused tests for default metric selection behavior and warning behavior. |
docs/source/how-to/cli/cli-fast-test.md |
Documents metric selection for --test runs. |
|
@copilot in job Ubuntu test model fast, add a step which create a virtual environment called llama_env and install llama-cpp-python, then in OnnxDiscrepancy check, add a flag called llama-cpp which converts the dumped model into gguf format, measures the first token difference and the latencies and speed up, every use of llama-cpp is done through the virtual environment |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
@copilot do not use temporary directory like you said in The optimized ONNX model, its transformers config.json, and the reference_hf_model copy are routed to a temp working dir and discarded, so they no longer clutter output_path., keep the produced model in the folders, and i still don't see any onnx model |
|
@copilot can you save the olive configuration into olive_config.json if dry_run is not enabled |
|
@copilot add metrics first_token_20, it computes the first token difference with onnxruntime-genai compare to transformers, add metrics tft, time to first token, tf5t, time to first 5 tokens in OnnxDiscrepancyCheck |
Fixed in 5b8bab5. Removed the temporary-directory routing — the optimized ONNX model is now saved to |
Done in 99c5e02. When
Added in 99c5e02.
The generation metrics use the optimized ONNX model directory as the GenAI model when it contains a |
|
@copilot fix fast fast |
Fixed the |
|
@copilot attn_impl for OnnxDiscrepancyCheck should be moved to the pass created the transformers test model, OnnxDiscrepancyCheck should use that model |
Done in 37a4e48. Moved |
|
@copilot File "/home/xadupre/github/Olive/olive/passes/onnx/discrepancy_check.py", line 710, in _run_for_config |
Fixed. The |
Describe your changes
Merges #2536, #2535, #2534.
Additionally adds llama.cpp integration and other improvements to
OnnxDiscrepancyCheck:llama_cppflag (bool, defaultFalse) onOnnxDiscrepancyCheck— when enabled, converts the reference HuggingFace model to GGUF format usingconvert_hf_to_gguf.pyfrom llama.cpp and compares inference with llama.cpp.llama_cpp_env_pathparameter (Optional[str]) — path to thellama_envvirtual environment wherellama-cpp-pythonandconvert_hf_to_gguf.pyare installed (defaults to"llama_env"relative to cwd). The virtual environment also isolates potentially conflicting versions oftorch,transformers, etc.compare_llama_cpp()method — saves the reference model and tokenizer tooutput_dir/hf_model(alongside the saved test model/report output) usingsave_pretrained(standard HuggingFace format), then:convert_hf_to_gguf.py(the official llama.cpp conversion CLI) insidellama_envvia subprocess to produce a GGUF F32 file atoutput_dir/model.gguf.llama_envviasubprocess.runto measure first-token latency withllama_cpp.Llama.Results include first-token match vs PyTorch,
llama_cpp_ttft_s,llama_cpp_ttfn_s,llama_cpp_total_time_s,llama_cpp_speedup_vs_pytorch, andllama_cpp_speedup_vs_onnx. Allllama_cpp/ggufimports are strictly isolated to the subprocess — the main Olive process never imports them.--test_llama_pathCLI option — specifies the path to thellama_envvirtual environment when runningolive optimize --test. When provided, the injectedOnnxDiscrepancyCheckpass automatically enablesllama_cpp=Trueand forwards the path asllama_cpp_env_path. Using--test_llama_pathwithout--testemits a warning.--test_metricsparsing — now accepts both space-separated (--test_metrics mae speedup) and comma-separated (--test_metrics mae,speedup) forms. A_parse_test_metricstype function handles splitting and validation per token, and a_flatten_test_metricshelper normalises the result before it is forwarded to the pass.add_discrepancy_check_passupdate-in-place — when a config was previously generated byolive optimize --dry_run --test, theOnnxDiscrepancyCheckpass was already present and was silently skipped on subsequentolive run --test --test_metrics …calls. The function now updates the existing pass in-place, refreshingreference_model_path(resolved to absolute),report_output_dir, metric settings (max_mae/timing_iterations), and llama.cpp settings — so--test_metrics,--output_path, and--test_llama_pathfromolive runalways take effect.ModelBuildernow copies the test model directory toreference_hf_model/alongside the generated ONNX output in the engine cache.OnnxDiscrepancyCheckfalls back to this cached copy whenreference_model_path(e.g.out/tiny-test) no longer exists on disk, fixing theOSError: Repo id must be in the form 'repo_name' or 'namespace/repo_name'error that occurred on subsequentolive runcalls that hit the model-builder cache.SaveTestModelConfigpass (olive/passes/pytorch/save_test_model_config.py) — a new Olive pass injected at the beginning of the passes list when--testis active. It takes anHfModelHandler, writesconfig.json(with the reduced hidden-layer count) and the Olive test-model marker file totest_model_path, and returns the model unchanged. This ensurestest_model_pathalways exists as a config-only directory beforeModelBuilderor any downstream pass needs it, replacing the previous_save_test_model_config_for_dry_runstandalone function.test-model-fast.yml) — new step that creates allama_envvirtual environment, installsgguf,safetensors,llama-cpp-python(from pre-built CPU wheels athttps://abetlen.github.io/llama-cpp-python/whl/cpu),transformers,sentencepiece, andprotobuf, and downloadsconvert_hf_to_gguf.pyfrom the llama.cpp GitHub repository.cli-fast-test.md) — clarifies WHERE the 2-layer reduction happens (_apply_test_model_configinolive/common/hf/utils.py, called during the model-builder pass ofolive run) and WHENout/tiny-testis created (by theSaveTestModelConfigpass on firstolive run, completed with weights byModelBuilder); documents the cache fallback behaviour; explains the--test_llama_pathoption andllama_envsetup; and clarifies that--test_metricsis always respected even when the config was generated byolive optimize --dry_run --test.Checklist before requesting a review
lintrunner -a(Optional) Issue link