🚀 [v0.3.40] Milestone Release: Reasoning Budget Control, Gemma 4 12B Support, Enhanced Jinja2ChatFormatter, NGram k/k4v Speculative Decoding, Faster Native Sampling and Multimodal Improvements #136
Replies: 2 comments
Note: If `ggml.dll` cannot be opened, your system may be missing some required DLLs. You can replace the `libomp140.x86_64.dll` in your Python installation directory `\Lib\site-packages\llama_cpp\lib` with the one from the `libomp140.x86_64.zip` package below.
Update 1: I reproduced this problem on some older computers. The main cause was outdated VC++ runtime libraries, which created a conflict. Uninstalling the outdated C++ runtimes and installing a current C++ redistributable package resolved the problem, with no need to replace `libomp140.x86_64.dll`. Some of the required DLLs are part of the VC++ 2015-22 redist. Try installing both the x64 and x86 versions of the redist: https://aka.ms/vs/17/release/vc_redist.x86.exe
I finally found the problem: the windows-2022 runner has two build environments (🤣), VC142 and VC143. The old CMake file didn't differentiate between toolset versions, so it matched the VC142 DLL first, while the actual build environment is VC143, which led to library conflicts. The build actually needs the ~600 KB VC143 DLL, not the ~1.6 MB VC142 DLL. The new CMake file fixes the versioning issue. This is the log of the test workflow:
Run Write-Output "ProgramFiles=$env:ProgramFiles"
ProgramFiles=C:\Program Files
ProgramFiles(x86)=C:\Program Files (x86)
Checking root: C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC
Exists: yes
MSVC version directories:
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC\14.29.30133
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC\14.44.35112
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC\v143
OpenMP runtime candidates:
Path: C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC\14.29.30133\debug_nonredist\x64\Microsoft.VC142.OpenMP.LLVM\libomp140.x86_64.dll
Size: 1666088 bytes / 1627.04 KB / 1.5889 MB
Path: C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC\14.44.35112\debug_nonredist\x64\Microsoft.VC143.OpenMP.LLVM\libomp140.x86_64.dll
Size: 634936 bytes / 620.05 KB / 0.6055 MB
Checking root: C:\Program Files\Microsoft Visual Studio\2022\BuildTools\VC\Redist\MSVC
Exists: no
Checking root: C:\Program Files (x86)\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC
Exists: no
Checking root: C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Redist\MSVC
Exists: no
Checking System32 fallback:
Path: C:\Windows\System32\libomp140.x86_64.dll
Size: 634936 bytes / 620.05 KB / 0.6055 MB
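The version mix-up in this comment comes down to picking the OpenMP runtime whose directory matches the active toolset. As a rough Python illustration only (the real fix lives in the project's CMake file, which is not shown here, and `pick_openmp_runtime` is a hypothetical helper), the selection could look like this:

```python
from typing import List, Optional


def pick_openmp_runtime(candidates: List[str], toolset: str = "VC143") -> Optional[str]:
    """Return the first candidate path that belongs to the requested toolset.

    Hypothetical helper: it matches on the 'Microsoft.<toolset>.OpenMP.LLVM'
    directory name seen in the workflow log, not on file size or version info.
    """
    marker = f"Microsoft.{toolset}.OpenMP.LLVM".lower()
    for path in candidates:
        if marker in path.lower():
            return path
    return None


# The two candidates reported by the test workflow log.
CANDIDATES = [
    r"C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC"
    r"\14.29.30133\debug_nonredist\x64\Microsoft.VC142.OpenMP.LLVM\libomp140.x86_64.dll",
    r"C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC"
    r"\14.44.35112\debug_nonredist\x64\Microsoft.VC143.OpenMP.LLVM\libomp140.x86_64.dll",
]

chosen = pick_openmp_runtime(CANDIDATES, toolset="VC143")
```

Matching on the toolset directory name, rather than taking the first file found, is what avoids picking up the larger VC142 build by accident.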
[0.3.40-Milestone] Reasoning Budget Control, Gemma 4 12B Support, Enhanced Jinja2ChatFormatter, NGram k/k4v Speculative Decoding, Faster Native Sampling and Multimodal Improvements
Hi everyone,
I’m happy to share the `0.3.40` milestone release of `llama-cpp-python`.
This is a fairly large release. The biggest highlight is the new Reasoning Budget Control support, but this version also brings Gemma 4 12B chat template updates, stronger HuggingFace-style Jinja chat template compatibility, faster native sampling, upgraded NGram k/k4v speculative decoding, better multimodal fallback behavior, updated OCR and VL handlers, new OCR and embedding model documentation, and a fresh sync with upstream `llama.cpp`.
For me, this release is not only about adding one feature. It is about making `llama-cpp-python` more practical for today’s local inference workloads: reasoning models, multimodal models, OCR models, embedding models, long-context generation, speculative decoding, and model-specific chat templates.
Responding to Community Demand: Reasoning Budget Control
The headline feature of `0.3.40` is the new Python-backed `ReasoningBudgetSampler`.
This feature is also a direct response to the community’s growing need for controllable reasoning behavior. Thanks to @teux91, @asagi4, and @abdullah-cod9 for raising and discussing the need for reasoning budget control. Their feedback helped shape this implementation direction.
Many modern reasoning models expose their thinking content through visible reasoning tags, such as:
- `<think> ... </think>`
- `[THINK] ... [/THINK]`
In real usage, we often need different levels of control over this reasoning block. Sometimes we want full reasoning. Sometimes we want to disable it. Sometimes we only want to allow a limited number of reasoning tokens before forcing the model to close the reasoning block and continue with the final answer.
`0.3.40` introduces a generic reasoning-budget sampler for this workflow.
The new sampler supports:
- `reasoning_budget=-1` for unrestricted reasoning
- `reasoning_budget=0` to close the first reasoning block immediately
- `reasoning_budget=N` to allow up to `N` reasoning tokens before forcing an end sequence
- `reasoning_start` and `reasoning_end` tags
- `reasoning_budget_message`
- `reasoning_start_in_prompt` for prefilled reasoning starts
- `force_reasoning_budget()`
The sampler is intentionally model-agnostic. It does not hard-code one model family’s reasoning format. Instead, it works with user-provided start and end tags and controls only the first visible reasoning block. Once that block naturally ends or is forcibly closed, the sampler switches to passthrough mode and ignores later reasoning tags.
This keeps the behavior predictable while still supporting different model families and chat template styles.
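To make the budget semantics concrete, here is a minimal toy state machine that follows the behavior described in this section: count tokens inside the first reasoning block, force the end tag once the budget is spent, then switch to passthrough. It is a sketch under simplifying assumptions (tags are single string tokens, and `ToyReasoningBudget` and `run` are hypothetical names); it is not the actual `ReasoningBudgetSampler` implementation.

```python
from typing import List, Optional


class ToyReasoningBudget:
    """Toy controller: waiting -> counting -> passthrough (first block only)."""

    def __init__(self, budget: int, start: str = "<think>", end: str = "</think>"):
        self.budget = budget  # -1 = unrestricted, 0 = close immediately, N = allow N tokens
        self.start = start
        self.end = end
        self.state = "waiting"
        self.used = 0

    def step(self, token: str) -> Optional[str]:
        """Observe one emitted token; return a token to force next, or None."""
        if self.budget < 0 or self.state == "passthrough":
            return None
        if self.state == "waiting":
            if token == self.start:
                if self.budget == 0:
                    self.state = "passthrough"
                    return self.end  # close the block right away
                self.state = "counting"
            return None
        # state == "counting"
        if token == self.end:
            self.state = "passthrough"  # block ended naturally
            return None
        self.used += 1
        if self.used >= self.budget:
            self.state = "passthrough"
            return self.end  # budget spent: force the end tag
        return None


def run(scripted: List[str], budget: int) -> List[str]:
    """Replay a scripted token stream through the controller (simulation only)."""
    ctl = ToyReasoningBudget(budget)
    out: List[str] = []
    skipping = False
    for tok in scripted:
        if skipping:
            # After a forced close, drop the rest of the scripted reasoning,
            # including the script's own end tag.
            if tok == ctl.end:
                skipping = False
            continue
        out.append(tok)
        forced = ctl.step(tok)
        if forced is not None:
            out.append(forced)
            skipping = True
    return out
```

With the stream `["<think>", "a", "b", "c", "</think>", "answer"]`, a budget of `2` yields `["<think>", "a", "b", "</think>", "answer"]`, a budget of `0` yields `["<think>", "</think>", "answer"]`, and `-1` leaves the stream untouched. A real sampler works on token IDs and logits rather than strings, so treat this purely as a picture of the state transitions.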
Reasoning Budget Support Across Public APIs
Reasoning budget control is wired through the public `Llama` APIs.
Users can now pass reasoning-budget parameters from completion and chat entry points, and those options are propagated down into `generate()` and the sampling parameters.
The new public controls include:
- `reasoning_budget`
- `reasoning_start`
- `reasoning_end`
- `reasoning_budget_message`
- `reasoning_start_in_prompt`
- `reasoning_start_max_tokens`
The MTMD chat handler is also wired to the same reasoning-budget controls, so multimodal workflows can use the same mechanism.
This means reasoning budget control is not limited to one narrow path. It is available across the main text, chat, and multimodal generation flows.
For more information, see: https://github.com/JamePeng/llama-cpp-python#reasoning-budget-first-reasoning-block
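As a hedged usage sketch, the controls listed above could be bundled and forwarded to a chat call as shown below. The keyword names come from this post; that `create_chat_completion` accepts them exactly in this form is an assumption, and `reasoning_kwargs` and `chat_with_budget` are hypothetical helpers, so check the linked README section for the authoritative signature.

```python
from typing import Any, Dict, List


def reasoning_kwargs(budget: int, start: str = "<think>", end: str = "</think>") -> Dict[str, Any]:
    """Bundle the reasoning-budget options named in the release notes."""
    if budget < -1:
        raise ValueError("budget must be -1 (unrestricted), 0 (close immediately), or N > 0")
    return {
        "reasoning_budget": budget,
        "reasoning_start": start,
        "reasoning_end": end,
    }


def chat_with_budget(llm: Any, messages: List[Dict[str, str]], budget: int) -> Any:
    """Forward the options to a Llama-like object.

    Assumption: `llm.create_chat_completion` accepts these keyword arguments.
    """
    return llm.create_chat_completion(messages=messages, **reasoning_kwargs(budget))
```

Keeping the options in one small helper makes it easy to switch between full reasoning (`-1`), no reasoning (`0`), and a capped budget without touching the call site.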
Gemma 4 12B and Better HuggingFace Chat Template Compatibility
This release updates the `google/gemma-4` chat template Jinja support, including Gemma 4 12B related usage.
At the same time, `Jinja2ChatFormatter` has been enhanced to better support HuggingFace-style chat templates. Recent model releases increasingly rely on Jinja features that simple renderers do not fully support, so this release improves that path significantly.
The formatter now supports:
- `{% generation %}` blocks through `IgnoreGenerationTags`
- `{% break %}` and `{% continue %}`
- `tojson` behavior
- `raise_exception` and `strftime_now` as Jinja globals
- `documents` support for document-aware templates
- `special_tokens_map` support for additional model-specific tokens
The `{% generation %}` support is intentionally lightweight. Transformers uses this tag to compute assistant-token masks, but `llama-cpp-python` mainly needs the rendered prompt. So the formatter treats the tag as a transparent wrapper: it removes the tag pair and renders the inner body normally.
This improves compatibility with modern GGUF metadata templates without introducing unnecessary assistant-span tracking overhead.
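The "transparent wrapper" idea can be pictured with a small stand-in. The post names `IgnoreGenerationTags` but does not show its implementation, so the sketch below swaps in a plain regex preprocessing step instead: it deletes the `{% generation %}` / `{% endgeneration %}` tag pair from the template source and leaves the body untouched. `strip_generation_tags` is a hypothetical helper.

```python
import re

# Matches {% generation %} and {% endgeneration %}, with optional
# whitespace-control dashes ({%- ... -%}). Note: dropping a dashed tag also
# drops its whitespace trimming, which a real Jinja extension would preserve.
_GENERATION_TAG = re.compile(r"\{%-?\s*(?:end)?generation\s*-?%\}")


def strip_generation_tags(template: str) -> str:
    """Remove the generation tag pair so the inner body renders normally."""
    return _GENERATION_TAG.sub("", template)


source = "{% generation %}{{ message.content }}{% endgeneration %}"
cleaned = strip_generation_tags(source)
```

Here `cleaned` is `"{{ message.content }}"`, which any standard Jinja renderer can handle. Other tags, including loops over a variable that happens to be called `generation`, are left alone because the pattern only matches when the tag name follows `{%` directly.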
The minimum Jinja2 dependency has also been updated to `jinja2>=3.1.0` to align more closely with Transformers’ chat-template runtime behavior.
Note: In my personal testing, gemma4-12B performs worse than e4b and the higher-end 26BA4B/31B models, especially for CJK audio and text, where its support is poor. It looks more like an attempt by Google to tweak the encoder.
More Complete Special Token Handling
`Llama.__init__` now exposes more tokenizer special tokens to chat templates.
In addition to BOS and EOS, this release registers extra special tokens such as:
These are passed through `special_tokens_map` to `Jinja2ChatFormatter`.
Stop token handling is also improved. Instead of always passing only EOS, the code now builds `stop_token_ids` from valid EOS and EOT token IDs while skipping invalid `-1` values.
This is especially helpful for chat models where end-of-turn, rather than only EOS, is the real dialogue boundary.
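The stop-token rule described above is simple enough to sketch directly. This is an illustration of the stated behavior (collect EOS and EOT, skip invalid `-1` values), not the code from `Llama.__init__`; `build_stop_token_ids` is a hypothetical name, and de-duplication is an extra assumption for the case where both IDs are the same.

```python
from typing import List


def build_stop_token_ids(eos_id: int, eot_id: int) -> List[int]:
    """Collect valid stop token IDs, skipping the -1 'not defined' marker."""
    stop_ids: List[int] = []
    for token_id in (eos_id, eot_id):
        # Negative IDs mean the tokenizer does not define this token.
        if token_id >= 0 and token_id not in stop_ids:
            stop_ids.append(token_id)
    return stop_ids
```

For a model with EOS `2` and no EOT token, `build_stop_token_ids(2, -1)` gives `[2]`; when both exist, generation can stop on either boundary.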
Faster Native Sampling
`0.3.40` includes an important performance optimization in the generation hot path.
Previously, Python-side logits copying could happen more often than necessary. In normal native sampling, the sampler can read logits directly from the underlying C context, so copying the full `n_vocab` logits array into Python on every token is unnecessary unless Python-side hooks need it.
This release introduces the `copy_logits` parameter to `Llama.eval()` and disables unnecessary logit copies during generation unless required by:
- `logits_all`
- `logits_processor`
- `stopping_criteria`
In a PDF-reading summarization workload, this reduced the end-to-end completion time from 41.32s to 25.93s, a ~37.2% improvement. The main generation hot path also improved noticeably:
- `_create_completion`
- `generate`
- `eval`
- `get_logits()` -> 18.68s
- `get_logits_ith()`
- `decode`
- `detokenize`
- `sample`
This significantly reduces CPU overhead and memory bandwidth pressure during generation, especially for long-context and document-heavy workloads.
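The copy rule and the reported improvement can both be written down in a few lines. The predicate below mirrors the three conditions listed in this section; how `generate()` actually derives the flag it passes to `Llama.eval()` is not shown in the post, so `needs_logits_copy` is a hypothetical helper. The second function just reproduces the percentage arithmetic.

```python
from typing import Any, Optional


def needs_logits_copy(
    logits_all: bool,
    logits_processor: Optional[Any] = None,
    stopping_criteria: Optional[Any] = None,
) -> bool:
    """Copy logits into Python only when a Python-side consumer needs them."""
    return bool(logits_all) or logits_processor is not None or stopping_criteria is not None


def relative_improvement(before_s: float, after_s: float) -> float:
    """Fraction of wall time saved: (before - after) / before."""
    return (before_s - after_s) / before_s


# Numbers quoted in the release notes: 41.32s -> 25.93s.
improvement_pct = round(relative_improvement(41.32, 25.93) * 100, 1)
```

`improvement_pct` comes out at `37.2`, matching the figure quoted above: 15.39s saved out of 41.32s.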
NGram k/k4v Speculative Decoding
The NGram speculative decoding path has been upgraded.
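As a rough picture of the two NGram map modes this section describes (`k`, which stores historical positions, and `k4v`, which caches continuation values directly), here is a toy sketch. It only mirrors that one-line description of each mode; the function names are hypothetical and this is not how `LlamaNGramMapDecoding` or the upstream ngram-map is implemented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Key = Tuple[int, ...]


def build_position_map(history: List[int], n: int) -> Dict[Key, List[int]]:
    """'k'-style: each n-gram maps to the positions right after its occurrences."""
    table: Dict[Key, List[int]] = defaultdict(list)
    for i in range(len(history) - n):
        table[tuple(history[i:i + n])].append(i + n)
    return dict(table)


def build_value_map(history: List[int], n: int, draft_len: int) -> Dict[Key, List[int]]:
    """'k4v'-style: each n-gram maps directly to its most recent continuation."""
    table: Dict[Key, List[int]] = {}
    for i in range(len(history) - n):
        table[tuple(history[i:i + n])] = history[i + n:i + n + draft_len]
    return table


def draft_from_positions(history: List[int], table: Dict[Key, List[int]], n: int, draft_len: int) -> List[int]:
    """Look up the latest position, then read the draft back out of the history."""
    positions = table.get(tuple(history[-n:]))
    if not positions:
        return []
    start = positions[-1]
    return history[start:start + draft_len]


def draft_from_values(history: List[int], table: Dict[Key, List[int]], n: int) -> List[int]:
    """Return the cached continuation without touching the history again."""
    return list(table.get(tuple(history[-n:]), []))
```

The trade-off shows up in what each table holds: the position map stores one integer per occurrence and re-reads the history at lookup time, while the value map stores the draft tokens themselves so a lookup is a single dictionary access.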
`LlamaNGramMapDecoding` now supports two modes:
- `k`: stores historical positions for better memory efficiency
- `k4v`: caches continuation values directly for faster lookup
This update also adds:
- `min_hits`
- `max_entries_per_key`
- `sync_check_tokens`
- `clear()`, `close()`, and `accept()`
This brings the Python implementation closer to the upstream `llama.cpp` ngram-map design and gives users better control over memory growth, draft quality, and long-context behavior.
A new benchmark script was also added under `examples/benchmark` to compare prompt lookup decoding and NGram map decoding across different workloads.
Multimodal and OCR Improvements
This release also improves multimodal and OCR workflows.
The MTMD fallback chat template now has clearer BOS/EOS handling, role-based formatting, and more readable prompt serialization. This improves behavior for multimodal GGUF models that do not provide a complete chat template.
OCR-oriented model support and documentation have also been updated, including:
- DeepSeek-OCR-2-GGUF
- MinerU2.5-Pro-2605-1.2B (SOTA OCR model)
- PaddleOCR 1.6
The fallback template improvements are especially useful for OCR-style multimodal models that may not yet ship a complete GGUF chat template or custom chat handler.
Several multimodal chat handlers were also updated:
These updates include token configuration improvements, stop sequence updates, PaddleOCR 1.6 support, and standardized `input_ids` initialization behavior.
There is also a server-side fix to wire LFM VL chat handlers into the server loader.
Thanks to @JayAnderson360 for the LFM VL server loader fix.
Embedding Model Documentation Updates
The supported embeddings model table has also been updated.
This release adds documentation for:
- `jina-embeddings-v2-base-zh`
- `jina-embeddings-v3`
This should make it easier for users working with embedding and retrieval workflows to find tested or documented model options directly from the README/wiki.
Documentation and Build Updates
This release includes a large documentation refresh.
Highlights include:
- `<think>` tags, Mistral `[THINK]` tags, and Gemma-style channel tags
The build system also disables the upstream unified `llama` binary target by setting `LLAMA_BUILD_APP=OFF`, reducing unnecessary build artifacts for the Python package.
Thanks to @0xDELUXA for contributing the Windows ROCm build instructions.
Upstream llama.cpp Sync
As usual, this release continues tracking upstream `llama.cpp`.
`0.3.40` syncs to:
- `ggml-org/llama.cpp` commit `f71af352a52b8efe824c7a698d0632afa4794c01`
It also updates the llama, mtmd, and ggml API bindings as of `2026-06-06`.
Keeping the Python bindings close to upstream is important because many of the latest model, backend, sampler, and multimodal improvements land in `llama.cpp` first.
Thanks
This release touches many layers:
The Reasoning Budget Sampler is the biggest new feature, but I see the broader goal of `0.3.40` as improving the whole experience around modern local inference: reasoning control, model-template compatibility, faster generation, speculative decoding, OCR/multimodal support, and better documentation.
Special thanks to:
- `llama-cpp-python` useful for real-world local inference
I hope `0.3.40` makes it easier to experiment with reasoning models, Gemma 4 12B, HuggingFace-style chat templates, OCR and multimodal GGUF models, embedding workflows, speculative decoding, and faster local generation.
— JamePeng