
# [Fixed] Native Windows on ARM / Snapdragon X Build Guide & Performance Regression (55+ Tok/s) #556


## 🚀 Running BitNet Natively on Windows on ARM (Snapdragon X)

This comprehensive guide details how to achieve high-performance (50-60+ tokens/sec), hardware-accelerated BitNet inference on Snapdragon X Elite/Plus processors.

The Problem: The current main branch of the BitNet repository contains regressions that break ARM math kernels on Windows, resulting in "token salad" (coherent-looking but random gibberish) or total build failures.

The Solution: By pinning the repository to a stable commit and applying targeted "Sniper" compiler flags, we can force the Snapdragon's native i8mm and dotprod instructions to handle the 1.58-bit ternary math properly.

## 🛠 Prerequisites
Before starting, ensure your environment is configured for native ARM64 development:

- **Visual Studio 2022**: You must have the "Desktop development with C++" workload installed.
- **v143 Build Tools**: Ensure the ARM64/ARM64EC build tools are selected in the VS Installer.
- **LLVM for Windows**: Crucial for the ClangCL toolset used in the build process.
- **Python 3.10+ & CMake**: Added to your System PATH.
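
A quick way to confirm the toolchain is actually visible before cloning is to query each tool from PowerShell. This is just a sanity-check sketch; it assumes `clang`, `cmake`, and `python` resolve to the installs described above:

```powershell
# Each command should succeed and print a version string
clang --version     # should identify a native ARM64 (aarch64) LLVM build
cmake --version
python --version    # should report 3.10 or newer
```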

## 🏗 Step 1: Clone and "Golden" Checkout
The latest updates to the Microsoft BitNet repo are currently unstable for Windows on ARM. We must rewind the codebase to the last known "Golden" version where the ternary logic was stable.

```powershell
# Clone the repository
git clone https://github.com/microsoft/BitNet.git
cd BitNet

# Rewind to the stable ARM-compatible commit (pre-regression)
git checkout 404980e

# Initialize submodules (this pulls the compatible llama.cpp version)
git submodule update --init --recursive
```
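
To confirm the rewind took effect, you can check where HEAD now points:

```powershell
# The short hash should start with the golden commit id
git rev-parse --short HEAD   # expected: 404980e
```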
## 🔧 Step 2: The Missing Header "Stub"
During the build, the compiler looks for bitnet-lut-kernels.h (Look-Up Tables). On Windows ARM64, the auto-generation script often fails. We bypass this by creating a 0-byte placeholder "stub" file, which allows the compiler to proceed using the fallback logic we will define in Step 3.

```powershell
# Create the missing include file manually
New-Item -Path "include\bitnet-lut-kernels.h" -ItemType File -Force
```
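
If you want to verify the stub before building, it should exist and be exactly 0 bytes:

```powershell
# Length should be 0 for the empty placeholder
Get-Item "include\bitnet-lut-kernels.h" | Select-Object Name, Length
```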
## 📝 Step 3: Manual Code Patches
We need to fix two specific issues: C++ standard library incompatibilities and the lack of hardware-specific optimization flags.

### A. The "Chrono" Include Fix
Standard time-tracking functions in llama.cpp throw "identifier not found" errors under Clang on Windows without the explicit chrono header. Add `#include <chrono>` near the top of each of these four files (a helper sketch follows the list):

- `3rdparty\llama.cpp\common\common.cpp`
- `3rdparty\llama.cpp\common\log.cpp`
- `3rdparty\llama.cpp\examples\imatrix\imatrix.cpp`
- `3rdparty\llama.cpp\examples\perplexity\perplexity.cpp`
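
If you would rather not open each file by hand, a small PowerShell loop can prepend the include. This is a convenience sketch, not part of the repo; run it from the BitNet root and adjust the paths if your tree differs:

```powershell
# Prepend #include <chrono> to each affected file (skips files already patched)
$files = @(
    "3rdparty\llama.cpp\common\common.cpp",
    "3rdparty\llama.cpp\common\log.cpp",
    "3rdparty\llama.cpp\examples\imatrix\imatrix.cpp",
    "3rdparty\llama.cpp\examples\perplexity\perplexity.cpp"
)
foreach ($f in $files) {
    $src = Get-Content $f -Raw
    if ($src -notmatch '#include <chrono>') {
        Set-Content -Path $f -Value ("#include <chrono>`r`n" + $src) -NoNewline
    }
}
```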

### B. The "Sniper" Math Optimization (Crucial)
This is the most important step. We are going to target the core math files and tell the compiler exactly what the Snapdragon X hardware is capable of.

Open `3rdparty\llama.cpp\ggml\src\CMakeLists.txt`, scroll to the very end of the file, and paste the following:

```cmake
# Force hardware acceleration for Snapdragon X Elite/Plus
set_source_files_properties(ggml.c ggml-quants.c ggml-aarch64.c PROPERTIES COMPILE_FLAGS "/clang:-march=armv8.2-a+fp16+i8mm+dotprod")
```
This forces the compiler to use Integer Matrix Multiplication (i8mm) and Dot Product instructions specifically for the heavy-lifting files.
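
After configuring in Step 4, you can sanity-check that the flag actually landed in the generated project files. This assumes the default Visual Studio generator, so the exact file locations may differ on your machine:

```powershell
# Run from the repo root after the cmake configure step;
# the i8mm march flag should appear in the generated ggml project files
Get-ChildItem build -Recurse -Filter *.vcxproj |
    Select-String -Pattern "i8mm" |
    Select-Object -Unique Path
```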

## 🔨 Step 4: Compile the Engine
We use the `-T ClangCL` flag to ensure we are using the LLVM compiler for the best possible ARM64 optimization on Windows.

```powershell
mkdir build
cd build

# Configure the project with the ClangCL toolset
cmake .. -T ClangCL -DCMAKE_BUILD_TYPE=Release

# Build the Release binaries using all 8 performance cores
cmake --build . --config Release -j 8
```
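
If your chip exposes a different core count, one option is to let Windows supply the job count instead of hard-coding 8:

```powershell
# Alternative: size the build to the machine's logical processor count
cmake --build . --config Release -j $env:NUMBER_OF_PROCESSORS
```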
## 🤖 Step 5: Run the Model
Local conversion scripts (`setup_env.py`) currently struggle with dependency paths on Windows on ARM. For guaranteed success, download the pre-converted GGUF model directly.

Download: Official BitNet 2B-4T GGUF (i2_s)
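
If you prefer fetching it from the command line, the Hugging Face CLI can pull the repo directly. This sketch assumes the official `microsoft/BitNet-b1.58-2B-4T-gguf` repo id and a working `pip`:

```powershell
# Install the Hugging Face CLI and download the i2_s GGUF into .\models
pip install -U "huggingface_hub[cli]"
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir .\models
```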

Execute Inference:

```powershell
.\bin\Release\llama-cli.exe -m "C:\Path\To\ggml-model-i2_s.gguf" -c 2048 -cnv -n 256
```
## 📊 Performance & Verification
Once the model starts, check the `system_info` line printed in the terminal. You should see:

`NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1`

If `MATMUL_INT8 = 1`, your Snapdragon's hardware acceleration is active. Expect an inference speed of 55+ tokens per second, providing a near-instantaneous chat experience.
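
For a repeatable throughput number rather than eyeballing chat output, llama.cpp's benchmark tool is built alongside `llama-cli` (assuming the `llama-bench` target was produced by your build):

```powershell
# Reports prompt-processing and generation speed in tokens/sec
.\bin\Release\llama-bench.exe -m "C:\Path\To\ggml-model-i2_s.gguf" -t 8
```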
