
# [Fixed] Native Windows on ARM / Snapdragon X Build Guide & Performance Regression (55+ Tok/s) #556


## 🚀 Running BitNet Natively on Windows on ARM (Snapdragon X)

This comprehensive guide details how to achieve high-performance (50-60+ tokens/sec), hardware-accelerated BitNet inference on Snapdragon X Elite/Plus processors.

The Problem: The current main branch of the BitNet repository contains regressions that break ARM math kernels on Windows, resulting in "token salad" (coherent-looking but random gibberish) or total build failures.

The Solution: By pinning the repository to a stable commit and applying targeted "Sniper" compiler flags, we can force the Snapdragon's native i8mm and dotprod instructions to handle the 1.58-bit ternary math properly.

## 🛠 Prerequisites
Before starting, ensure your environment is configured for native ARM64 development:

- **Visual Studio 2022**: You must have the "Desktop development with C++" workload installed.
- **v143 Build Tools**: Ensure the ARM64/ARM64EC build tools are selected in the VS Installer.
- **LLVM for Windows**: Crucial for the ClangCL toolset used in the build process.
- **Python 3.10+ & CMake**: Added to your System PATH.
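
A quick way to confirm the toolchain is actually visible before cloning is to query each tool from PowerShell. This is just a sanity-check sketch; it assumes `clang`, `cmake`, and `python` resolve to the installs described above:

```powershell
# Each command should succeed and print a version string
clang --version     # should identify a native ARM64 (aarch64) LLVM build
cmake --version
python --version    # should report 3.10 or newer
```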

## 🏗 Step 1: Clone and "Golden" Checkout
The latest updates to the Microsoft BitNet repo are currently unstable for Windows on ARM. We must rewind the codebase to the last known "Golden" version where the ternary logic was stable.

```powershell
# Clone the repository
git clone https://github.com/microsoft/BitNet.git
cd BitNet

# Rewind to the stable ARM-compatible commit (pre-regression)
git checkout 404980e

# Initialize submodules (this pulls the compatible llama.cpp version)
git submodule update --init --recursive
```
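
To confirm the rewind took effect, you can check where HEAD now points:

```powershell
# The short hash should start with the golden commit id
git rev-parse --short HEAD   # expected: 404980e
```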
## 🔧 Step 2: The Missing Header "Stub"
During the build, the compiler looks for bitnet-lut-kernels.h (Look-Up Tables). On Windows ARM64, the auto-generation script often fails. We bypass this by creating a 0-byte placeholder "stub" file, which allows the compiler to proceed using the fallback logic we will define in Step 3.

```powershell
# Create the missing include file manually
New-Item -Path "include\bitnet-lut-kernels.h" -ItemType File -Force
```
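
If you want to verify the stub before building, it should exist and be exactly 0 bytes:

```powershell
# Length should be 0 for the empty placeholder
Get-Item "include\bitnet-lut-kernels.h" | Select-Object Name, Length
```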
## 📝 Step 3: Manual Code Patches
We need to fix two specific issues: C++ standard library incompatibilities and the lack of hardware-specific optimization flags.

### A. The "Chrono" Include Fix
Standard time-tracking functions in llama.cpp throw "identifier not found" errors under Clang on Windows without the explicit chrono header. Add `#include <chrono>` near the top of each of these four files (a helper sketch follows the list):

- `3rdparty\llama.cpp\common\common.cpp`
- `3rdparty\llama.cpp\common\log.cpp`
- `3rdparty\llama.cpp\examples\imatrix\imatrix.cpp`
- `3rdparty\llama.cpp\examples\perplexity\perplexity.cpp`
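
If you would rather not open each file by hand, a small PowerShell loop can prepend the include. This is a convenience sketch, not part of the repo; run it from the BitNet root and adjust the paths if your tree differs:

```powershell
# Prepend #include <chrono> to each affected file (skips files already patched)
$files = @(
    "3rdparty\llama.cpp\common\common.cpp",
    "3rdparty\llama.cpp\common\log.cpp",
    "3rdparty\llama.cpp\examples\imatrix\imatrix.cpp",
    "3rdparty\llama.cpp\examples\perplexity\perplexity.cpp"
)
foreach ($f in $files) {
    $src = Get-Content $f -Raw
    if ($src -notmatch '#include <chrono>') {
        Set-Content -Path $f -Value ("#include <chrono>`r`n" + $src) -NoNewline
    }
}
```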

### B. The "Sniper" Math Optimization (Crucial)
This is the most important step. We are going to target the core math files and tell the compiler exactly what the Snapdragon X hardware is capable of.

Open `3rdparty\llama.cpp\ggml\src\CMakeLists.txt`, scroll to the very end of the file, and paste the following:

```cmake
# Force hardware acceleration for Snapdragon X Elite/Plus
set_source_files_properties(ggml.c ggml-quants.c ggml-aarch64.c PROPERTIES COMPILE_FLAGS "/clang:-march=armv8.2-a+fp16+i8mm+dotprod")
```
This forces the compiler to use Integer Matrix Multiplication (i8mm) and Dot Product instructions specifically for the heavy-lifting files.
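
After configuring in Step 4, you can sanity-check that the flag actually landed in the generated project files. This assumes the default Visual Studio generator, so the exact file locations may differ on your machine:

```powershell
# Run from the repo root after the cmake configure step;
# the i8mm march flag should appear in the generated ggml project files
Get-ChildItem build -Recurse -Filter *.vcxproj |
    Select-String -Pattern "i8mm" |
    Select-Object -Unique Path
```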

## 🔨 Step 4: Compile the Engine
We use the `-T ClangCL` flag to ensure we are using the LLVM compiler for the best possible ARM64 optimization on Windows.

```powershell
mkdir build
cd build

# Configure the project with the ClangCL toolset
cmake .. -T ClangCL -DCMAKE_BUILD_TYPE=Release

# Build the Release binaries using all 8 performance cores
cmake --build . --config Release -j 8
```
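
If your chip exposes a different core count, one option is to let Windows supply the job count instead of hard-coding 8:

```powershell
# Alternative: size the build to the machine's logical processor count
cmake --build . --config Release -j $env:NUMBER_OF_PROCESSORS
```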
## 🤖 Step 5: Run the Model
Local conversion scripts (`setup_env.py`) currently struggle with dependency paths on Windows on ARM. For guaranteed success, download the pre-converted GGUF model directly.

Download: Official BitNet 2B-4T GGUF (i2_s)
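
If you prefer fetching it from the command line, the Hugging Face CLI can pull the repo directly. This sketch assumes the official `microsoft/BitNet-b1.58-2B-4T-gguf` repo id and a working `pip`:

```powershell
# Install the Hugging Face CLI and download the i2_s GGUF into .\models
pip install -U "huggingface_hub[cli]"
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir .\models
```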

Execute Inference:

```powershell
.\bin\Release\llama-cli.exe -m "C:\Path\To\ggml-model-i2_s.gguf" -c 2048 -cnv -n 256
```
## 📊 Performance & Verification
Once the model starts, check the `system_info` line printed in the terminal. You should see:

`NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1`

If `MATMUL_INT8 = 1`, your Snapdragon's hardware acceleration is active. Expect an inference speed of 55+ tokens per second, providing a near-instantaneous chat experience.
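
For a repeatable throughput number rather than eyeballing chat output, llama.cpp's benchmark tool is built alongside `llama-cli` (assuming the `llama-bench` target was produced by your build):

```powershell
# Reports prompt-processing and generation speed in tokens/sec
.\bin\Release\llama-bench.exe -m "C:\Path\To\ggml-model-i2_s.gguf" -t 8
```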
