BigSmall — Lossless AI Model Compression

Make any AI model ~34% smaller. Bit-identical weights. Drop-in replacement for from_pretrained.

pip install bigsmall          # CLI + compression/decompression
pip install bigsmall[torch]   # add this for model loading (from_pretrained)

A 14 GB Mistral-7B becomes 9.3 GB. A fine-tuned model becomes a 5 GB patch on top of its 14 GB base. Every weight in the decompressed model is bit-for-bit identical to the original, and each tensor's md5 is verified on decompress. (Verification is tensor-level, not file-level: safetensors re-serializes the container wrapper, so the file's md5 changes, but every weight value is unchanged.)
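To see why tensor-level and file-level checksums can disagree, here is a minimal standard-library sketch. It writes two containers in the safetensors layout (8-byte little-endian header length, JSON header, raw tensor bytes) that hold the same tensors but different header metadata. The helper names `write_container` and `tensor_md5s` are made up for this illustration and are not part of BigSmall.

```python
import hashlib
import json
import struct


def write_container(tensors, metadata):
    """Serialize {name: raw_bytes} in a safetensors-style layout:
    8-byte little-endian header length, JSON header, then the raw tensor bytes."""
    header, payload, offset = {"__metadata__": metadata}, b"", 0
    for name, raw in tensors.items():
        header[name] = {
            "dtype": "BF16",
            "shape": [len(raw) // 2],  # 2 bytes per BF16 element
            "data_offsets": [offset, offset + len(raw)],
        }
        payload += raw
        offset += len(raw)
    head = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(head)) + head + payload


def tensor_md5s(blob):
    """Return {name: md5 hex digest} computed over each tensor's raw bytes only."""
    (head_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + head_len])
    data = blob[8 + head_len:]
    digests = {}
    for name, info in header.items():
        if name == "__metadata__":
            continue
        start, end = info["data_offsets"]
        digests[name] = hashlib.md5(data[start:end]).hexdigest()
    return digests


weights = {"layer.0.weight": bytes(range(16)), "layer.0.bias": bytes(8)}
original = write_container(weights, {"format": "pt"})
restored = write_container(weights, {"format": "pt", "note": "re-serialized"})

# The wrapper differs, so the whole-file digest differs ...
file_md5_matches = hashlib.md5(original).hexdigest() == hashlib.md5(restored).hexdigest()
# ... while the per-tensor digests are identical.
tensors_match = tensor_md5s(original) == tensor_md5s(restored)
print(file_md5_matches, tensors_match)
```

Comparing per-tensor digests is the check that matters for model behaviour, because the header carries no weight values.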

~34% smaller	~65% smaller as a delta patch	25+ ready-to-use models
any BF16 LLM	≥7B instruct fine-tunes vs their base (pair-dependent)	on HuggingFace

What BigSmall does

Three use cases. Pick the one that fits.

1. Make any model smaller

bigsmall compress mistral-7b/ -o mistral-7b.bs
bigsmall decompress mistral-7b.bs -o mistral-7b-restored/

Before: 14.2 GB of safetensors. After: 9.3 GB .bs file. Saved: 4.9 GB (34%).

Every weight is bit-for-bit identical. Every calculation the model does is identical to the original. Works on any safetensors model — LLMs, diffusion, audio, vision, anything.

2. Store fine-tunes as tiny patches

bigsmall compress qwen-instruct/ --delta-from qwen-base/ -o instruct.bs
bigsmall apply qwen-base/ instruct.bs -o qwen-instruct-restored/

Before: 14.2 GB Qwen2.5-7B-Instruct. After: ~5 GB patch. Saved: 9 GB (65%).

If your users already have the public base model, they only need to download what changed. This is the biggest win in BigSmall. Use it for any fine-tune: instruction tuning, DPO, RLHF, domain adaptation, LoRA-merged checkpoints.

How much a delta saves is pair-dependent. Measured patches range from under 1% of full size (the best ≥7B SFT pairs) to ~61% (small-model full tunes, barely below the ~66% that standalone compression reaches). The 65% saving above sits at the favourable end of the ≥7B official-instruct class, where patches measure 34–50% of full size (a 50–66% saving). Full measured table: docs/delta-compression.md. Since 3.15 the engine measures both codings per tensor and never ships a delta larger than standalone.
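The "measure both codings, keep the smaller" rule can be pictured with a small sketch. This is not BigSmall's codec: `zlib` stands in for the real entropy coder, the byte buffers are synthetic, and `encode_tensor` / `decode_tensor` are hypothetical names used only here.

```python
import random
import zlib


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_tensor(tuned, base):
    """Code the tensor both ways and keep the smaller blob, tagged with its mode."""
    standalone = zlib.compress(tuned, 9)
    delta = zlib.compress(xor_bytes(tuned, base), 9)
    if len(delta) < len(standalone):
        return "delta", delta
    return "standalone", standalone


def decode_tensor(mode, blob, base):
    """Invert encode_tensor; the delta mode needs the base to undo the XOR."""
    raw = zlib.decompress(blob)
    return xor_bytes(raw, base) if mode == "delta" else raw


rng = random.Random(0)
base = bytes(rng.randrange(256) for _ in range(4096))

# Light fine-tune stand-in: a handful of flipped bits, so the XOR is mostly zeros.
light = bytearray(base)
for i in range(0, len(light), 256):
    light[i] ^= 0x01
light = bytes(light)

# Heavy retrain stand-in: bytes unrelated to the base (all zeros here),
# so XOR against the base just reproduces the base's noise.
heavy = bytes(4096)

for name, tuned in [("light", light), ("heavy", heavy)]:
    mode, blob = encode_tensor(tuned, base)
    assert decode_tensor(mode, blob, base) == tuned  # bit-exact round trip
    print(name, mode, len(blob))
```

Because the choice is made per tensor after measuring both sizes, the result can never be worse than standalone coding, which is the property the paragraph above describes.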

3. Download smaller, use instantly

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "wpferrell/phi-3.5-mini-instruct-bigsmall"
)

Works exactly like a normal HuggingFace model — BigSmall decompresses transparently on load. 25+ pre-compressed models ready to use (browse them all).

Prefer the CLI? bigsmall decompress works on local .bs files — download first, then decompress:

hf download wpferrell/phi-3.5-mini-instruct-bigsmall --local-dir phi-3.5-mini-bs
bigsmall decompress phi-3.5-mini-bs/model-00001-of-00002.bs -o model.safetensors

(On older huggingface_hub the equivalent command is huggingface-cli download …; the huggingface-cli entrypoint is deprecated in huggingface_hub >= 1.0 in favour of hf.)

Compression numbers (every published model)

Every row is a real measurement. Click a model to download it.

Model	Original	BigSmall	Saved
Qwen2.5-14B-Instruct	29.5 GB	19.5 GB	34%
Gemma-3-12B-it	22.7 GB	14.8 GB	35%
Gemma-2-9B-it	17.2 GB	11.3 GB	34%
Llama-3.1-8B-Instruct	15.0 GB	9.7 GB	35%
Llama-3-8B-Instruct	15.0 GB	9.8 GB	34%
Qwen3-8B	15.3 GB	10.1 GB	34%
Mistral-7B-Instruct v0.3	14.2 GB	8.9 GB	37%
Mistral-7B-Instruct v0.2	14.2 GB	8.9 GB	37%
Qwen2.5-7B-Instruct	14.2 GB	9.4 GB	34%
Phi-3.5-mini-instruct	7.1 GB	4.7 GB	34%
Gemma-3-4B-it	8.0 GB	5.2 GB	35%
Qwen3-4B-Instruct	7.5 GB	5.0 GB	34%
Llama-3.2-3B-Instruct	6.4 GB	3.9 GB	39%
Gemma-2-2B-it	4.9 GB	3.2 GB	34%
Qwen2.5-3B-Instruct	5.7 GB	3.8 GB	34%
Qwen2.5-1.5B-Instruct	2.9 GB	1.9 GB	34%
Llama-3.2-1B-Instruct	2.3 GB	1.5 GB	34%
Gemma-3-1B-it	1.9 GB	1.2 GB	35%
Qwen2.5-0.5B-Instruct	920 MB	610 MB	34%
GPT-2 (117M)	548 MB	414 MB	24%
Gemma-3-270M-it	500 MB	330 MB	34%
Gemma-3-270M	500 MB	330 MB	34%
Gemma-2-2B	9.7 GB	8.1 GB	17%
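The Saved column is plain arithmetic on the two size columns, so any row can be checked by hand. A short sketch using four rows copied from the table (sizes in GB); because the published sizes are rounded to one decimal, a recomputed value can land up to about one percentage point away from the stated one.

```python
rows = [
    # (model, original GB, BigSmall GB, stated saving in %)
    ("Qwen2.5-14B-Instruct", 29.5, 19.5, 34),
    ("Mistral-7B-Instruct v0.3", 14.2, 8.9, 37),
    ("Llama-3.2-3B-Instruct", 6.4, 3.9, 39),
    ("Gemma-2-2B", 9.7, 8.1, 17),
]


def saved_percent(original, compressed):
    """Percentage of the original size removed by compression."""
    return 100.0 * (original - compressed) / original


for model, original, compressed, stated in rows:
    computed = saved_percent(original, compressed)
    print(f"{model}: {computed:.1f}% computed, {stated}% stated")
```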

Browse all 25+ models on HuggingFace →

What "lossless" actually means

Every weight in the model is mathematically identical to the original — same bit pattern, same floating-point value, same gradient, same output.

Not quantization. Quantization rounds weights to fewer bits and the model's behaviour changes.
Not pruning. Pruning deletes weights.
Not approximation. No tricks, no calibration data, no quality drop.

BigSmall finds redundancy in the bit pattern of neural weights and stores it more compactly: the same idea as ZIP for text, but tuned for BF16 floating-point distributions. An md5 is verified for every tensor at decompression; if a single bit differs, verification fails.
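Where that redundancy sits can be illustrated by measuring the empirical entropy of each BF16 field separately. The sketch below uses synthetic Gaussian values as a stand-in for trained weights (an assumption; real checkpoints will differ) and only measures entropy. It is not BigSmall's coder.

```python
import math
import random
import struct
from collections import Counter


def to_bf16(value):
    """Round a finite Python float to BF16 and return its 16-bit pattern."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    return (bits >> 16) & 0xFFFF


def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


rng = random.Random(0)
patterns = [to_bf16(rng.gauss(0.0, 0.02)) for _ in range(50_000)]

# BF16 layout: 1 sign bit, 8 exponent bits, 7 mantissa bits.
sign_h = entropy_bits(p >> 15 for p in patterns)
exponent_h = entropy_bits((p >> 7) & 0xFF for p in patterns)
mantissa_h = entropy_bits(p & 0x7F for p in patterns)

print(f"sign:     {sign_h:.2f} of 1 bit")
print(f"exponent: {exponent_h:.2f} of 8 bits")
print(f"mantissa: {mantissa_h:.2f} of 7 bits")
print(f"total:    {sign_h + exponent_h + mantissa_h:.2f} of 16 bits")
```

On data like this the exponent field carries far fewer than its 8 bits of information, while the sign and mantissa are close to incompressible, which is why a BF16-aware coder can shrink weights that a generic byte-level compressor barely touches.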

How it compares

Approach	Lossless?	Typical reduction	Behaviour change
BigSmall	Yes — bit-identical	~34% (65% as a delta, ≥7B instruct class)	None
Quantization (GPTQ / AWQ / bitsandbytes)	No	50–75%	Yes — weights are rounded
DFloat11 (entropy-coded BF16)	Yes — lossless	~30% (format-fixed)	None
ZipNN (entropy-coded BF16)	Yes — lossless	up to ~33% (authors' reported numbers)	None
ZIP / gzip on safetensors	Yes	~1–3%	None (but not model-aware)

Three of these are lossless weight-aware formats: BigSmall, DFloat11, and ZipNN. Head-to-head under the same accounting, BigSmall codes below DFloat11's bound on every layer type of every model measured (+0.45–0.55 pp model-level, +12–18 pp on norm scales — docs/dfloat11.md); ZipNN has not been independently measured by us, so its row carries its authors' numbers. BigSmall is also the only one of the three with delta patches, BF16-native-F32 detection, and streaming surfaces. Quantization compresses further but changes the model; generic ZIP keeps fidelity but barely shrinks BF16 weights. See docs/comparison.md for the full breakdown.

CLI reference

bigsmall compress SRC [-o OUT] [--delta-from BASE] [--auto-delta] [--resume] [--ecc]
bigsmall decompress SRC [-o OUT] [--base BASE]
bigsmall info SRC.bs                       size, ratio, codecs used
bigsmall scan SRC                          analyse before compressing
bigsmall verify SRC.bs [--fast|--sample N] integrity check
bigsmall diff A.bs B.bs [--patch P.bs]     compare or write a delta
bigsmall apply BASE PATCH.bs -o OUT        reconstruct from base + patch
bigsmall repair SRC.bs [-o OUT]            recover via Reed-Solomon ECC sidecar
bigsmall benchmark SRC                     encode/decode throughput
bigsmall migrate SRC.bs                    re-encode with current codecs
bigsmall status                            list your BigSmall HF repos
bigsmall pipeline run SRC DST              resumable download → compress → upload
bigsmall reshard SRC --output-dir DIR [--size-gb N|--shards N|--join]  reshard .bs by layer

Every command has --help. See docs/cli-reference.md for full examples.

Python API

import bigsmall

# Round-trip a model
bigsmall.compress("model/", "model.bs")
bigsmall.decompress("model.bs", "model_back/")

# Fine-tune as a delta patch
bigsmall.compress("finetune/", "patch.bs", delta_from="base/")
bigsmall.apply("base/", "patch.bs", "finetune_back/")

# Inspect before compressing
bigsmall.detect_bf16_native("model/")
bigsmall.scan_model("model/")

# Low-VRAM streaming inference (~12× less VRAM than from_pretrained)
from bigsmall import BigSmallStreamingModel
model = BigSmallStreamingModel.from_pretrained(
    "wpferrell/phi-3.5-mini-instruct-bigsmall",
    device="cuda",
    lru_max_vram_gb=2.0,
)

# Stream-decompress straight from the HF CDN — no .bs written to disk (V10)
state_dict = bigsmall.stream_from_hub("wpferrell/gpt2-bigsmall", device="cpu")

# Reshard .bs files along layer boundaries, no re-encoding (V11)
bigsmall.reshard(["model.bs"], "resharded/", target_shard_size_gb=2.0)

What's new in v3.15.0

Delta compression is now fail-safe — measure-then-choose. compress_delta encodes every matched tensor both ways (XOR-delta and standalone) and keeps the smaller, so a delta file can never come out larger than standalone compression. A pre-compression gate warns when >30% of matched bytes changed — the measured delta-doesn't-pay regime (--force-delta silences it).
KV cache entry format v3 — per-depth sequential dispatcher. Plain vs sequential-exponent coding is measured per tensor at encode time, the smaller kept, and the winning blob verified bit-exact before it returns. End-to-end wins over the shipped v1 format grow with context: 0.27% (128 tokens) → 0.58% (2048 tokens), up to +7.7% on early-layer K tensors. v1/v2 entries decode forever.
FP8 KV entries (e4m3 / e5m2) and chunked streaming KV (32k+ contexts encode and decode chunk-by-chunk, nothing materialized whole; default 4096 tokens/chunk) in the same v3 entry format.
AutoKVCache — get_kv_cache(mode="auto") routes each call by device: CUDA tensors through the GPU-resident lossless codec, CPU tensors through the entry codec.
bigsmall xray — checkpoint forensics. Per-tensor substream entropies vs a matched-random control, lineage and anomaly flags (mantissa_carved, sign_imbalance, exp_near_random, …), and a model-level "looks untrained" detector that catches silently-randomized loads:
```
bigsmall xray model_dir/ --json report.json
```
Role-group stream packing (opt-in --group-streams): small same-role tensors (norm/bias chains, GQA k/v projections, MoE routers) packed into one coded stream per role when measured smaller. Grouped files need ≥3.15 readers.
bf16_native_f32_v2 — near-gate F32 tensors pick the smallest verified codec per substream; 0.6–2.7% smaller on the affected class.
Measured DFloat11 comparison — BigSmall codes below DFloat11's bound on every layer type of every model measured: docs/dfloat11.md.
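The pre-compression gate in the delta item above can be sketched as a simple changed-byte count. This is an illustration under assumptions: `delta_gate` is a hypothetical name, "matched bytes" is read here as byte positions of tensors present in both checkpoints, and the real engine may count differently.

```python
def delta_gate(matched, threshold=0.30):
    """Given (base_bytes, tuned_bytes) pairs for matched tensors, return
    (changed_fraction, warn), where warn is True when the fraction exceeds threshold."""
    changed = total = 0
    for base, tuned in matched:
        if len(base) != len(tuned):
            raise ValueError("matched tensors must have the same byte length")
        changed += sum(a != b for a, b in zip(base, tuned))
        total += len(base)
    fraction = changed / total if total else 0.0
    return fraction, fraction > threshold


base = bytes(100)

# 10 of 100 bytes differ: below the 30% threshold, no warning.
light = bytes([1] * 10 + [0] * 90)
# 50 of 100 bytes differ: above the threshold, the gate warns.
heavy = bytes([1] * 50 + [0] * 50)

print(delta_gate([(base, light)]))
print(delta_gate([(base, heavy)]))
```

A gate like this is cheap compared with encoding, so it can run before any compression work starts.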

What's new in v3.14

GPU-resident KV cache (V9) — GPUCompressedKVCache keeps the compressed cache and the encode/decode passes entirely on the CUDA device, with no CPU round-trip. ~47× faster than the CPU KV codec on the reference shape, bit-identical round-trip. get_kv_cache(device, mode) auto-picks the GPU backend when CUDA is available. V9B adds fused Triton pack/unpack kernels.
Progressive HTTP streaming (V10) — stream_from_hub(repo_id) decompresses a model directly from the HuggingFace CDN over HTTP byte-range requests. With the default cache=False, zero .bs bytes are written to disk.
Reshard (V11) — bigsmall reshard splits, joins, or rebalances .bs shards along transformer-layer boundaries with no re-encoding. Every output tensor is md5-verified.
numba is now a hard dependency (numba>=0.61) — guarantees the JIT codec path runs everywhere instead of a silent slow fallback.
CI green across the full matrix — Ubuntu / Windows / macOS × Python 3.10 / 3.11 / 3.12.
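Resharding along layer boundaries means a layer's tensors are never split across shards. A minimal planning sketch follows, assuming HuggingFace-style tensor names containing `.layers.N.` and a greedy packing strategy; `plan_shards` is a hypothetical helper and says nothing about how `bigsmall reshard` is actually implemented.

```python
import re

LAYER_RE = re.compile(r"\.layers\.(\d+)\.")


def layer_key(name):
    """Layer index parsed from the tensor name, or -1 for non-layer tensors
    (embeddings, final norm), which are kept together as one group."""
    match = LAYER_RE.search(name)
    return int(match.group(1)) if match else -1


def plan_shards(tensor_sizes, target_bytes):
    """Group tensors by layer, then pack whole layers greedily into shards.
    A shard is closed when adding the next layer would exceed target_bytes."""
    layers = {}
    for name, size in tensor_sizes.items():
        layers.setdefault(layer_key(name), []).append((name, size))

    shards, current, current_size = [], [], 0
    for key in sorted(layers):
        layer_size = sum(size for _, size in layers[key])
        if current and current_size + layer_size > target_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.extend(name for name, _ in layers[key])
        current_size += layer_size
    if current:
        shards.append(current)
    return shards


sizes = {"model.embed_tokens.weight": 40}
for i in range(4):
    sizes[f"model.layers.{i}.attn.weight"] = 30
    sizes[f"model.layers.{i}.mlp.weight"] = 30

plan = plan_shards(sizes, target_bytes=130)
for index, shard in enumerate(plan):
    print(index, shard)
```

Since only the grouping changes, a plan like this can be applied by moving already-compressed tensor blobs between files, which is consistent with the "no re-encoding" property described above.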

Earlier highlights still current: delta compression (fine-tunes as ~34%-size patches), --auto-delta base detection, BF16-native F32 auto-routing (Whisper-class), --resume, verify --fast/--sample, mmap decode, Reed-Solomon --ecc + repair, and BigSmallStreamingModel(lru_max_vram_gb=…).

Full changelog →

Research

The lossless compression floor for BF16 neural weights has been measured: ~62% of raw BF16 size for any model, and ~34% for ≥7B instruct fine-tunes with delta compression. We ran 300+ experiments across every known mathematical approach (entropy coding, cross-tensor prediction, learned translators, persistent homology, optimal transport, quantum-inspired methods, and more) and proved that there is no further compression available within the strict bit-identity contract.

The floor is measured at the wall, not extrapolated: across 4,143 weight matrices in 8 architectures the per-tensor entropy floor is flat (coefficient of variation ≈ 0), and trained mantissa/sign bits are coder-equivalent to matched random controls in every family tested — training only writes the exponent. The floor exists at initialization and never moves during training. Details: docs/research.md.

Full findings, all experiments, all dead-ends: 10.5281/zenodo.20279247. Plain-English summary: docs/research.md.

Install

pip install bigsmall                  # core
pip install "bigsmall[hf]"            # + HuggingFace integration
pip install "bigsmall[ecc]"           # + Reed-Solomon error recovery
pip install "bigsmall[all]"           # everything

Requires Python 3.10+ (the hard dependency numba>=0.61 does not support Python 3.9). Works on Linux, macOS, and Windows. CPU, NVIDIA, AMD, and Apple Silicon.

License

Code: Elastic License 2.0. Free for personal, research, and commercial use. SaaS providers should see LICENSING.md.

Model weights distributed in .bs format keep the license of the original model.

Links

PyPI — https://pypi.org/project/bigsmall/
GitHub — https://github.com/wpferrell/Bigsmall
HuggingFace — https://huggingface.co/wpferrell
Paper / DOI — https://doi.org/10.5281/zenodo.20279247 (always resolves to the latest version)
Paper (PDF) — https://github.com/wpferrell/Bigsmall/blob/main/paper.pdf
Docs — docs/
Changelog — CHANGELOG.md
Contact — wpferrell@gmail.com

Feedback & Community

Did BigSmall work for your model? We'd love to know.

Open a Discussion — share your compression results, ask questions, or suggest improvements
File an Issue — if something didn't work, tell us exactly what happened
HuggingFace — all compressed models are at huggingface.co/wpferrell

We especially want to hear:

Which model you compressed and what ratio you got
Any errors or unexpected behaviour
Use cases we haven't thought of
