🚀 [v0.3.40] Milestone Release: Reasoning Budget Control, Gemma 4 12B Support, Enhanced Jinja2ChatFormatter, NGram k/k4v Speculative Decoding, Faster Native Sampling and Multimodal Improvements #136

JamePeng · 2026-06-07T10:36:58Z

JamePeng
Jun 7, 2026
Maintainer

[0.3.40-Milestone] Reasoning Budget Control, Gemma 4 12B Support, Enhanced Jinja2ChatFormatter, NGram k/k4v Speculative Decoding, Faster Native Sampling and Multimodal Improvements

Hi everyone,

I’m happy to share the 0.3.40 milestone release of llama-cpp-python.

This is a fairly large release. The biggest highlight is the new Reasoning Budget Control support, but this version also brings Gemma 4 12B chat template updates, stronger HuggingFace-style Jinja chat template compatibility, faster native sampling, upgraded NGram k/k4v speculative decoding, better multimodal fallback behavior, updated OCR and VL handlers, new OCR and embedding model documentation, and a fresh sync with upstream llama.cpp.

For me, this release is not only about adding one feature. It is about making llama-cpp-python more practical for today’s local inference workloads: reasoning models, multimodal models, OCR models, embedding models, long-context generation, speculative decoding, and model-specific chat templates.

Responding to Community Demand: Reasoning Budget Control

The headline feature of 0.3.40 is the new Python-backed ReasoningBudgetSampler.

This feature is also a direct response to the community’s growing need for controllable reasoning behavior. Thanks to @teux91, @asagi4, and @abdullah-cod9 for raising and discussing the need for reasoning budget control. Their feedback helped shape this implementation direction.

Many modern reasoning models expose their thinking content through visible reasoning tags, such as:

<think> ... </think>
[THINK] ... [/THINK]
Gemma-style channel tags

In real usage, we often need different levels of control over this reasoning block. Sometimes we want full reasoning. Sometimes we want to disable it. Sometimes we only want to allow a limited number of reasoning tokens before forcing the model to close the reasoning block and continue with the final answer.

0.3.40 introduces a generic reasoning-budget sampler for this workflow.

The new sampler supports:

reasoning_budget=-1 for unrestricted reasoning
reasoning_budget=0 to close the first reasoning block immediately
reasoning_budget=N to allow up to N reasoning tokens before forcing an end sequence
custom reasoning_start and reasoning_end tags
custom reasoning_budget_message
reasoning_start_in_prompt for prefilled reasoning starts
manual forcing through force_reasoning_budget()
verbose state transition logs for debugging

The sampler is intentionally model-agnostic. It does not hard-code one model family’s reasoning format. Instead, it works with user-provided start and end tags and controls only the first visible reasoning block. Once that block naturally ends or is forcibly closed, the sampler switches to passthrough mode and ignores later reasoning tags.

This keeps the behavior predictable while still supporting different model families and chat template styles.
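
To make the three budget modes concrete, here is a minimal toy sketch of the state machine described above. It is not the actual ReasoningBudgetSampler: it replays a scripted list of string tokens instead of steering logits, the function name `apply_reasoning_budget` is hypothetical, and it assumes that a forced close simply cuts off whatever reasoning the script still contained. The `reasoning_budget_message` option is not modeled.

```python
def apply_reasoning_budget(stream, budget, start="<think>", end="</think>"):
    """Replay a scripted token stream under a reasoning budget (toy model only)."""
    out = []
    state = "waiting"   # waiting -> reasoning -> passthrough
    used = 0            # reasoning tokens emitted so far
    skipping = False    # True while dropping scripted reasoning after a forced close

    for tok in stream:
        if state == "passthrough":
            if skipping:
                # Assumption: the scripted reasoning that was cut off is discarded
                # up to and including the script's own end tag.
                if tok == end:
                    skipping = False
                continue
            out.append(tok)  # later reasoning tags are ignored and passed through
        elif state == "waiting":
            out.append(tok)
            if tok == start:
                if budget == 0:
                    out.append(end)  # close the first block immediately
                    state, skipping = "passthrough", True
                else:
                    state = "reasoning"
        else:  # state == "reasoning"
            if tok == end:
                out.append(tok)  # the block ended naturally
                state = "passthrough"
            elif budget >= 0 and used >= budget:
                out.append(end)  # budget exhausted: force the end sequence
                state, skipping = "passthrough", True
            else:
                out.append(tok)
                used += 1
    return out


script = ["<think>", "a", "b", "c", "</think>", "Answer", "<think>", "x", "</think>"]
for budget in (-1, 0, 2):
    print(budget, apply_reasoning_budget(script, budget))
```

Only the first reasoning block is controlled. The second `<think>` block in the script passes through untouched in every mode, which mirrors the passthrough behavior described above.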

Reasoning Budget Support Across Public APIs

Reasoning budget control is wired through the public Llama APIs.

Users can now pass reasoning-budget parameters from completion and chat entry points, and those options are propagated down into generate() and the sampling parameters.

The new public controls include:

reasoning_budget
reasoning_start
reasoning_end
reasoning_budget_message
reasoning_start_in_prompt
reasoning_start_max_tokens

The MTMD chat handler is also wired to the same reasoning-budget controls, so multimodal workflows can use the same mechanism.

This means reasoning budget control is not limited to one narrow path. It is available across the main text, chat, and multimodal generation flows.
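
As a rough illustration of how these options might be collected before a call, here is a small helper. The keyword names come from the list above, but the helper `reasoning_kwargs` is hypothetical, the value types (plain strings for the tags, an integer budget, a boolean flag) are assumptions, and `reasoning_start_max_tokens` is left out because its exact semantics are not described here. Check the README for the authoritative signature.

```python
def reasoning_kwargs(budget=-1, start="<think>", end="</think>",
                     message=None, start_in_prompt=False):
    """Collect reasoning-budget options into a kwargs dict (illustrative helper)."""
    if budget < -1:
        raise ValueError("budget must be -1 (unrestricted), 0, or a positive token count")
    kwargs = {
        "reasoning_budget": budget,
        "reasoning_start": start,
        "reasoning_end": end,
        "reasoning_start_in_prompt": start_in_prompt,
    }
    if message is not None:
        kwargs["reasoning_budget_message"] = message
    return kwargs


# Mistral-style tags with a 256-token reasoning budget.
opts = reasoning_kwargs(budget=256, start="[THINK]", end="[/THINK]")
print(opts)

# Assumed call shape, not executed here because it needs a loaded model:
# llm.create_chat_completion(messages=[...], **opts)
```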

For more information, see: https://github.com/JamePeng/llama-cpp-python#reasoning-budget-first-reasoning-block

Gemma 4 12B and Better HuggingFace Chat Template Compatibility

This release updates the google/gemma-4 chat template Jinja support, including Gemma 4 12B related usage.

At the same time, Jinja2ChatFormatter has been enhanced to better support HuggingFace-style chat templates. Recent model releases increasingly rely on Jinja features that simple renderers do not fully support, so this release improves that path significantly.

The formatter now supports:

{% generation %} blocks through IgnoreGenerationTags
Jinja loop controls such as {% break %} and {% continue %}
Transformers-compatible tojson behavior
raise_exception and strftime_now as Jinja globals
optional documents support for document-aware templates
special_tokens_map support for additional model-specific tokens
precomputed text stop sequences and token-id stopping criteria

The {% generation %} support is intentionally lightweight. Transformers uses this tag to compute assistant-token masks, but llama-cpp-python mainly needs the rendered prompt. So the formatter treats the tag as a transparent wrapper: it removes the tag pair and renders the inner body normally.

This improves compatibility with modern GGUF metadata templates without introducing unnecessary assistant-span tracking overhead.
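
The "transparent wrapper" idea can be sketched with a simple preprocessing step. Note that this is a regex-based stand-in, not the IgnoreGenerationTags mechanism itself: it removes the tag pair from the template source before rendering and keeps the inner body. The helper name `strip_generation_tags` is hypothetical, and the sketch ignores how whitespace-control markers on the removed tags would have trimmed neighboring text.

```python
import re

# Matches {% generation %} and {% endgeneration %}, including the
# whitespace-control forms such as {%- generation -%}.
_GENERATION_TAG = re.compile(r"\{%-?\s*(?:end)?generation\s*-?%\}")


def strip_generation_tags(template: str) -> str:
    """Drop the generation tag pair and keep the wrapped body unchanged."""
    return _GENERATION_TAG.sub("", template)


template = (
    "{% for m in messages %}"
    "{% generation %}{{ m['content'] }}{% endgeneration %}"
    "{% endfor %}"
)
print(strip_generation_tags(template))
```

After this step the template contains only standard Jinja constructs, so it renders the same prompt text without any assistant-span bookkeeping.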

The minimum Jinja2 dependency has also been updated to jinja2>=3.1.0 to align more closely with Transformers’ chat-template runtime behavior.

Note: In my personal testing, gemma4-12B performs worse than e4b and the higher-end 26BA4B/31B models, and its support for CJK audio and text is especially poor. To me it looks like an attempt by Google to experiment with the encoder.

More Complete Special Token Handling

Llama.__init__ now exposes more tokenizer special tokens to chat templates.

In addition to BOS and EOS, this release registers extra special tokens such as:

EOT
SEP
NL
PAD
MASK

These are passed through special_tokens_map to Jinja2ChatFormatter.

Stop token handling is also improved. Instead of always passing only EOS, the code now builds stop_token_ids from valid EOS and EOT token IDs while skipping invalid -1 values.

This is especially helpful for chat models where end-of-turn, rather than only EOS, is the real dialogue boundary.
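
The stop-token rule above can be sketched in a few lines. This is an illustration of the filtering logic only, not the code in Llama.__init__. The function name `build_stop_token_ids` is hypothetical, and dropping duplicates (for models where EOS and EOT share an ID) is my own assumption.

```python
def build_stop_token_ids(*candidates: int) -> list[int]:
    """Keep valid token IDs in order, skipping invalid (-1) and repeated IDs."""
    stop_ids: list[int] = []
    for token_id in candidates:
        if token_id < 0:
            continue  # -1 means the tokenizer does not define this special token
        if token_id not in stop_ids:
            stop_ids.append(token_id)
    return stop_ids


# Example IDs are made up for illustration.
eos_id, eot_id = 2, 107
print(build_stop_token_ids(eos_id, eot_id))
print(build_stop_token_ids(eos_id, -1))
```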

Faster Native Sampling

0.3.40 includes an important performance optimization in the generation hot path.

Previously, Python-side logits copying could happen more often than necessary. In normal native sampling, the sampler can read logits directly from the underlying C context, so copying the full n_vocab logits array into Python on every token is unnecessary unless Python-side hooks need it.

This release introduces the copy_logits parameter to Llama.eval() and disables unnecessary logit copies during generation unless required by:

logits_all
logits_processor
stopping_criteria
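
The decision can be expressed as a single predicate. This is a sketch of the rule listed above rather than the library's internal code: the helper name `needs_logit_copy` is hypothetical, and treating an empty processor or criteria list as "no hook" is an assumption.

```python
def needs_logit_copy(logits_all=False, logits_processor=None, stopping_criteria=None) -> bool:
    """Return True only when a Python-side consumer needs the logits array."""
    return bool(logits_all or logits_processor or stopping_criteria)


# Plain native sampling: nothing on the Python side reads the logits.
print(needs_logit_copy())

# A logits processor is installed, so the copy is still required.
print(needs_logit_copy(logits_processor=[lambda input_ids, scores: scores]))

# Assumed call shape, not executed here because it needs a loaded model:
# llm.eval(tokens, copy_logits=needs_logit_copy(...))
```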

In a PDF-reading summarization workload, this reduced the end-to-end completion
time from 41.32s to 25.93s, a ~37.2% improvement. The main generation hot path
also improved noticeably:

| Function | 0.3.39 | 0.3.40 |
|---|---|---|
| `_create_completion` | 41.32s | 25.93s |
| `generate` | 37.82s | below the top sampled entries |
| `eval` | 35.14s | 21.96s |
| logits retrieval/copy path | 29.89s (`get_logits()`) | 18.68s (`get_logits_ith()`) |
| `decode` | 3.89s | 2.25s |
| `detokenize` | 2.60s | 1.33s |
| `sample` | 2.35s | 2.03s |

This significantly reduces CPU overhead and memory bandwidth pressure during generation, especially for long-context and document-heavy workloads.
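
For anyone who wants to check the percentages, the relative reduction is (before - after) / before. The snippet below applies that formula to the timings reported above. The `generate` row is omitted because no 0.3.40 figure is given for it.

```python
# (before, after) wall-clock seconds from the profile above.
timings = {
    "_create_completion": (41.32, 25.93),
    "eval": (35.14, 21.96),
    "decode": (3.89, 2.25),
    "detokenize": (2.60, 1.33),
    "sample": (2.35, 2.03),
}


def reduction_percent(before: float, after: float) -> float:
    """Relative time saved, as a percentage of the original time."""
    return (before - after) / before * 100.0


for name, (before, after) in timings.items():
    saved = reduction_percent(before, after)
    print(f"{name:20s} {before:6.2f}s -> {after:6.2f}s  ({saved:.1f}% less time)")
```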

NGram k/k4v Speculative Decoding

The NGram speculative decoding path has been upgraded.

LlamaNGramMapDecoding now supports two modes:

k: stores historical positions for better memory efficiency
k4v: caches continuation values directly for faster lookup

This update also adds:

min_hits
max_entries_per_key
sync_check_tokens
better incremental history synchronization
explicit lifecycle methods such as clear(), close(), and accept()

This brings the Python implementation closer to the upstream llama.cpp ngram-map design and gives users better control over memory growth, draft quality, and long-context behavior.
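
To show the difference between the two modes, here is a toy n-gram draft table. It is not LlamaNGramMapDecoding and does not mirror its constructor. The class `ToyNGramMap` and its `n` and `draft_len` parameters are hypothetical, and `min_hits` is interpreted here simply as "number of stored entries for the key", which is an assumption.

```python
from collections import defaultdict


class ToyNGramMap:
    """Toy n-gram draft table for illustration only."""

    def __init__(self, n=2, draft_len=2, mode="k", min_hits=1, max_entries_per_key=4):
        if mode not in ("k", "k4v"):
            raise ValueError("mode must be 'k' or 'k4v'")
        self.n, self.draft_len, self.mode = n, draft_len, mode
        self.min_hits = min_hits
        self.max_entries_per_key = max_entries_per_key
        self.history = []
        self.table = defaultdict(list)

    def accept(self, tokens):
        """Append accepted tokens and index the n-gram each one completes."""
        # "k" needs one token after the key; "k4v" needs a full continuation.
        tail = 1 if self.mode == "k" else self.draft_len
        for tok in tokens:
            self.history.append(tok)
            start = len(self.history) - self.n - tail
            if start < 0:
                continue
            key = tuple(self.history[start:start + self.n])
            cont = start + self.n  # index of the first continuation token
            if self.mode == "k":
                entry = cont  # store only the position (small entries)
            else:
                entry = tuple(self.history[cont:cont + self.draft_len])  # store values
            entries = self.table[key]
            entries.append(entry)
            if len(entries) > self.max_entries_per_key:
                del entries[0]  # drop the oldest entry to bound memory growth

    def draft(self):
        """Propose tokens that followed the most recent occurrence of the last n-gram."""
        if len(self.history) < self.n:
            return []
        entries = self.table.get(tuple(self.history[-self.n:]), [])
        if len(entries) < self.min_hits:
            return []
        latest = entries[-1]
        if self.mode == "k":
            return self.history[latest:latest + self.draft_len]  # read from history
        return list(latest)  # values were cached directly

    def clear(self):
        self.history.clear()
        self.table.clear()


for mode in ("k", "k4v"):
    table = ToyNGramMap(mode=mode)
    table.accept([1, 2, 3, 4, 1, 2, 3, 4, 1, 2])
    print(mode, table.draft())
```

Both modes propose the same draft here. The trade-off is where the continuation lives: `k` keeps one integer per entry and reads the tokens back from history, while `k4v` spends more memory per entry to skip that lookup.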

A new benchmark script was also added under examples/benchmark to compare prompt lookup decoding and NGram map decoding across different workloads.

Multimodal and OCR Improvements

This release also improves multimodal and OCR workflows.

The MTMD fallback chat template now has clearer BOS/EOS handling, role-based formatting, and more readable prompt serialization. This improves behavior for multimodal GGUF models that do not provide a complete chat template.

OCR-oriented model support and documentation have also been updated, including:

DeepSeek-OCR-2-GGUF
MinerU2.5-Pro-2605-1.2B (SOTA OCR model)
PaddleOCR 1.6

The fallback template improvements are especially useful for OCR-style multimodal models that may not yet ship a complete GGUF chat template or custom chat handler.

Several multimodal chat handlers were also updated:

Qwen2.5-VL
Qwen3-VL
Qwen3-ASR
PaddleOCR

These updates include token configuration improvements, stop sequence updates, PaddleOCR 1.6 support, and standardized input_ids initialization behavior.

There is also a server-side fix to wire LFM VL chat handlers into the server loader.

Thanks to @JayAnderson360 for the LFM VL server loader fix.

Embedding Model Documentation Updates

The supported embeddings model table has also been updated.

This release adds documentation for:

jina-embeddings-v2-base-zh
jina-embeddings-v3

This should make it easier for users working with embedding and retrieval workflows to find tested or documented model options directly from the README/wiki.

Documentation and Build Updates

This release includes a large documentation refresh.

Highlights include:

Reasoning Budget Sampler README documentation
examples for default <think> tags, Mistral [THINK] tags, and Gemma-style channel tags
detailed installation and backend build guide
source-aligned CMake build notes
updated speculative decoding documentation
updated supported embeddings model table
updated OCR model links and notes
updated wiki schema and navigation
development helper docs for generating high-quality git commit messages
notes for CUDA PDL optimization on newer NVIDIA GPUs
Windows ROCm build instructions
removal of outdated macOS installation notes

The build system also disables the upstream unified llama binary target by setting LLAMA_BUILD_APP=OFF, reducing unnecessary build artifacts for the Python package.

Thanks to @0xDELUXA for contributing the Windows ROCm build instructions.

Upstream llama.cpp Sync

As usual, this release continues tracking upstream llama.cpp.

0.3.40 syncs to:

ggml-org/llama.cpp commit f71af352a52b8efe824c7a698d0632afa4794c01

It also updates the llama, mtmd, and ggml API bindings as of 2026-06-06.

Keeping the Python bindings close to upstream is important because many of the latest model, backend, sampler, and multimodal improvements land in llama.cpp first.

Thanks

This release touches many layers:

public Python APIs
sampling parameters
custom sampler lifecycle
chat handlers
Jinja template rendering
GGUF metadata templates
special token handling
multimodal formatting
OCR model workflows
embedding model documentation
native logit access
speculative decoding state
installation and backend documentation
upstream API bindings

The Reasoning Budget Sampler is the biggest new feature, but I see the broader goal of 0.3.40 as improving the whole experience around modern local inference: reasoning control, model-template compatibility, faster generation, speculative decoding, OCR/multimodal support, and better documentation.

Special thanks to:

@teux91, @asagi4, and @abdullah-cod9 for raising and discussing the community need for Reasoning Budget control
@JayAnderson360 for the LFM VL server loader fix
@0xDELUXA for the Windows ROCm build instructions
everyone who tested models, reported issues, reviewed behavior, and helped keep llama-cpp-python useful for real-world local inference

I hope 0.3.40 makes it easier to experiment with reasoning models, Gemma 4 12B, HuggingFace-style chat templates, OCR and multimodal GGUF models, embedding workflows, speculative decoding, and faster local generation.

— JamePeng

JamePeng · 2026-06-08T04:47:54Z

JamePeng
Jun 8, 2026
Maintainer Author

libomp140.x86_64.zip

Note: If ggml.dll cannot be opened, your system may be missing some required DLLs. You can replace the libomp140.x86_64.dll file in your Python installation directory under \Lib\site-packages\llama_cpp\lib with the one from the attached libomp140.x86_64.zip package.

Update 1: I reproduced this problem on some older computers. The main cause was that outdated VC++ runtime libraries were installed and conflicted with the ones the package needs. Uninstalling the outdated C++ runtime libraries and installing a current C++ runtime package resolved the problem, with no need to replace libomp140.x86_64.dll.

Some of the required DLLs are part of the VC++ 2015-2022 redistributable. Try installing both the x64 and x86 versions of the redistributable:

https://aka.ms/vs/17/release/vc_redist.x86.exe
https://aka.ms/vs/17/release/vc_redist.x64.exe


JamePeng · 2026-06-08T16:04:00Z

JamePeng
Jun 8, 2026
Maintainer Author

I finally found the problem:

The windows-2022 runner has two build environments (🤣), VC142 and VC143. The old CMakefile did not differentiate between the two versions, so it matched the VC142 DLL first even though the actual build environment is VC143, which led to library conflicts. The build actually needs the roughly 620 KB VC143 DLL, not the roughly 1.6 MB VC142 DLL. The new CMakefile fixes the version selection.

This is the log of the test workflow:

Run Write-Output "ProgramFiles=$env:ProgramFiles"
ProgramFiles=C:\Program Files
ProgramFiles(x86)=C:\Program Files (x86)

Checking root: C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC
  Exists: yes
  MSVC version directories:
    C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC\14.29.30133
    C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC\14.44.35112
    C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC\v143
  OpenMP runtime candidates:
    Path: C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC\14.29.30133\debug_nonredist\x64\Microsoft.VC142.OpenMP.LLVM\libomp140.x86_64.dll
    Size: 1666088 bytes / 1627.04 KB / 1.5889 MB
    Path: C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC\14.44.35112\debug_nonredist\x64\Microsoft.VC143.OpenMP.LLVM\libomp140.x86_64.dll
    Size: 634936 bytes / 620.05 KB / 0.6055 MB

Checking root: C:\Program Files\Microsoft Visual Studio\2022\BuildTools\VC\Redist\MSVC
  Exists: no

Checking root: C:\Program Files (x86)\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC
  Exists: no

Checking root: C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Redist\MSVC
  Exists: no

Checking System32 fallback:
  Path: C:\Windows\System32\libomp140.x86_64.dll
  Size: 634936 bytes / 620.05 KB / 0.6055 MB
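
The selection rule behind the fix can be illustrated in Python. The real fix lives in the project's CMake file, so this is only a sketch of the idea: among the OpenMP runtime candidates, pick the one whose directory matches the active toolset instead of taking the first match. The helper name `pick_openmp_runtime` is hypothetical.

```python
from pathlib import PureWindowsPath


def pick_openmp_runtime(candidates, toolset="VC143"):
    """Return the first candidate located in the given toolset's OpenMP directory."""
    wanted = f"Microsoft.{toolset}.OpenMP.LLVM".lower()
    for path in candidates:
        # PureWindowsPath parses backslash paths on any operating system.
        parts = [part.lower() for part in PureWindowsPath(path).parts]
        if wanted in parts:
            return path
    return None


root = r"C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC"
candidates = [
    root + r"\14.29.30133\debug_nonredist\x64\Microsoft.VC142.OpenMP.LLVM\libomp140.x86_64.dll",
    root + r"\14.44.35112\debug_nonredist\x64\Microsoft.VC143.OpenMP.LLVM\libomp140.x86_64.dll",
]
print(pick_openmp_runtime(candidates))
```

With the two candidates from the log, a first-match strategy would return the VC142 file, while matching on the toolset returns the VC143 one.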

