Make any AI model ~34% smaller. Bit-identical weights. Drop-in replacement for from_pretrained.
pip install bigsmall # CLI + compression/decompression
pip install "bigsmall[torch]" # add this for model loading (from_pretrained)

A 14 GB Mistral-7B becomes 9.3 GB. A fine-tuned model becomes a 5 GB patch on top of its 14 GB base. In the decompressed model, every weight is bit-for-bit identical to the original, and each tensor's md5 is verified on decompress. (Verification is tensor-level, not file-level: safetensors re-serializes the container wrapper, so the file's md5 changes, but every weight value is bit-for-bit identical.)
| ~34% smaller | ~65% smaller as a delta patch | 25+ ready-to-use models |
|---|---|---|
| any BF16 LLM | ≥7B instruct fine-tunes vs their base (pair-dependent) | on HuggingFace |
Three use cases. Pick the one that fits.
bigsmall compress mistral-7b/ -o mistral-7b.bs
bigsmall decompress mistral-7b.bs -o mistral-7b-restored/

Before: 14.2 GB of safetensors. After: 9.3 GB .bs file. Saved: 4.9 GB (34%).
Every weight is bit-for-bit identical. Every calculation the model does is identical to the original. Works on any safetensors model — LLMs, diffusion, audio, vision, anything.
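To make the tensor-level versus file-level distinction concrete, here is a minimal sketch using only the Python standard library. It does not use BigSmall or the safetensors library: `serialize` and `deserialize` are hypothetical helpers for a toy container (length-prefixed JSON header, then raw payload), chosen only to show why a re-serialized wrapper changes the file hash while per-tensor hashes still match.

```python
import hashlib
import json
import struct


def serialize(tensors, metadata):
    """Pack named byte blobs into a toy container: length-prefixed JSON header, then payload."""
    header = {"__metadata__": metadata}
    payload = b""
    for name, blob in tensors.items():
        header[name] = {"offsets": [len(payload), len(payload) + len(blob)]}
        payload += blob
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def deserialize(data):
    """Recover the name -> bytes mapping from the toy container."""
    (header_len,) = struct.unpack("<Q", data[:8])
    header = json.loads(data[8 : 8 + header_len])
    payload = data[8 + header_len :]
    return {
        name: payload[info["offsets"][0] : info["offsets"][1]]
        for name, info in header.items()
        if name != "__metadata__"
    }


def tensor_md5s(tensors):
    """One md5 per tensor, keyed by name, independent of container layout."""
    return {name: hashlib.md5(blob).hexdigest() for name, blob in tensors.items()}


# Stand-in "weights": arbitrary bytes, not real model data.
weights = {
    "layer0.weight": bytes(range(64)),
    "layer0.bias": bytes([7] * 16),
}

original = serialize(weights, {"format": "pt"})
# Same tensors, different wrapper: reversed order and extra metadata.
restored = serialize(dict(reversed(list(weights.items()))), {"format": "pt", "tool": "demo"})

file_md5_matches = hashlib.md5(original).digest() == hashlib.md5(restored).digest()
tensor_md5_matches = tensor_md5s(deserialize(original)) == tensor_md5s(deserialize(restored))

print("file md5 identical:  ", file_md5_matches)
print("tensor md5 identical:", tensor_md5_matches)
```

The two files differ byte-for-byte, so their file-level hashes disagree, while every per-tensor hash agrees. That is the property the tensor-level check relies on.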
bigsmall compress qwen-instruct/ --delta-from qwen-base/ -o instruct.bs
bigsmall apply qwen-base/ instruct.bs -o qwen-instruct-restored/

Before: 14.2 GB Qwen2.5-7B-Instruct. After: ~5 GB patch. Saved: 9 GB (65%).
If your users already have the public base model, they only need to download what changed. This is the biggest win in BigSmall. Use it for any fine-tune: instruction tuning, DPO, RLHF, domain adaptation, LoRA-merged checkpoints.
How much a delta saves is pair-dependent — measured from under 1% of full size (the best ≥7B SFT pairs) to ~61% (small-model full tunes, barely under standalone). The 65% saving above is the ≥7B official-instruct class, where patches measure 34–50% of full size. Full measured table: docs/delta-compression.md. Since 3.15 the engine measures both codings per tensor and never ships a delta larger than standalone.
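The measure-then-choose idea can be sketched in a few lines. This is an illustration under stated assumptions, not BigSmall's codec: `zlib` stands in for the real entropy coder, the one-byte `S`/`D` tag is a made-up framing, and `encode_tensor` / `decode_tensor` are hypothetical names.

```python
import random
import zlib


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_tensor(base, tuned):
    """Code the tensor both ways and keep the smaller blob.

    zlib is a stand-in codec; the 1-byte tag records which coding won.
    """
    standalone = b"S" + zlib.compress(tuned, 9)
    if base is None or len(base) != len(tuned):
        return standalone
    delta = b"D" + zlib.compress(xor_bytes(base, tuned), 9)
    return delta if len(delta) < len(standalone) else standalone


def decode_tensor(blob, base):
    """Invert encode_tensor; the base is only needed for delta-coded blobs."""
    body = zlib.decompress(blob[1:])
    return xor_bytes(base, body) if blob[:1] == b"D" else body


rng = random.Random(0)
base = rng.randbytes(4096)

# Lightly tuned tensor: a few bits changed, so the XOR is mostly zeros.
light = bytearray(base)
for i in range(0, len(light), 256):
    light[i] ^= 0x01
light = bytes(light)

# Heavily changed tensor: unrelated to the base, so the delta cannot help.
heavy = rng.randbytes(4096)

for name, tuned in [("light", light), ("heavy", heavy)]:
    blob = encode_tensor(base, tuned)
    assert decode_tensor(blob, base) == tuned  # round-trip is bit-exact
    kind = "delta" if blob[:1] == b"D" else "standalone"
    print(f"{name}: {kind}, {len(blob)} bytes")
```

Because the smaller of the two blobs is always kept, the output can never exceed the standalone coding, whatever the pair looks like.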
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "wpferrell/phi-3.5-mini-instruct-bigsmall"
)

Works exactly like a normal HuggingFace model: BigSmall decompresses transparently on load. 25+ pre-compressed models are ready to use (browse them all).
Prefer the CLI? bigsmall decompress works on local .bs files — download first, then decompress:
hf download wpferrell/phi-3.5-mini-instruct-bigsmall --local-dir phi-3.5-mini-bs
bigsmall decompress phi-3.5-mini-bs/model-00001-of-00002.bs -o model.safetensors

(On older huggingface_hub the equivalent command is huggingface-cli download …; the huggingface-cli entrypoint is deprecated in huggingface_hub >= 1.0 in favour of hf.)
Every row is a real measurement. Click a model to download it.
| Model | Original | BigSmall | Saved |
|---|---|---|---|
| Qwen2.5-14B-Instruct | 29.5 GB | 19.5 GB | 34% |
| Gemma-3-12B-it | 22.7 GB | 14.8 GB | 35% |
| Gemma-2-9B-it | 17.2 GB | 11.3 GB | 34% |
| Llama-3.1-8B-Instruct | 15.0 GB | 9.7 GB | 35% |
| Llama-3-8B-Instruct | 15.0 GB | 9.8 GB | 34% |
| Qwen3-8B | 15.3 GB | 10.1 GB | 34% |
| Mistral-7B-Instruct v0.3 | 14.2 GB | 8.9 GB | 37% |
| Mistral-7B-Instruct v0.2 | 14.2 GB | 8.9 GB | 37% |
| Qwen2.5-7B-Instruct | 14.2 GB | 9.4 GB | 34% |
| Phi-3.5-mini-instruct | 7.1 GB | 4.7 GB | 34% |
| Gemma-3-4B-it | 8.0 GB | 5.2 GB | 35% |
| Qwen3-4B-Instruct | 7.5 GB | 5.0 GB | 34% |
| Llama-3.2-3B-Instruct | 6.4 GB | 3.9 GB | 39% |
| Gemma-2-2B-it | 4.9 GB | 3.2 GB | 34% |
| Qwen2.5-3B-Instruct | 5.7 GB | 3.8 GB | 34% |
| Qwen2.5-1.5B-Instruct | 2.9 GB | 1.9 GB | 34% |
| Llama-3.2-1B-Instruct | 2.3 GB | 1.5 GB | 34% |
| Gemma-3-1B-it | 1.9 GB | 1.2 GB | 35% |
| Qwen2.5-0.5B-Instruct | 920 MB | 610 MB | 34% |
| GPT-2 (117M) | 548 MB | 414 MB | 24% |
| Gemma-3-270M-it | 500 MB | 330 MB | 34% |
| Gemma-3-270M | 500 MB | 330 MB | 34% |
| Gemma-2-2B | 9.7 GB | 8.1 GB | 17% |
Browse all 25+ models on HuggingFace →
Every weight in the model is mathematically identical to the original — same bit pattern, same floating-point value, same gradient, same output.
- Not quantization. Quantization rounds weights to fewer bits and the model's behaviour changes.
- Not pruning. Pruning deletes weights.
- Not approximation. No tricks, no calibration data, no quality drop.
BigSmall finds redundancy in the bit pattern of neural weights and stores it more compactly — the same idea as ZIP for text, but tuned for BF16 floating-point distributions. md5 is verified on every tensor at decompression. If a single bit differs, verify fails.
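As a rough illustration of where that redundancy sits, the sketch below splits BF16 values into sign, exponent, and mantissa fields and measures the empirical entropy of each. It rests on assumptions: the weights are synthetic Gaussian samples rather than a real model, BF16 is obtained by truncating float32 (real converters usually round), and summing the per-field entropies is only a simple estimate, not BigSmall's coder.

```python
import math
import random
import struct
from collections import Counter


def to_bf16_bits(value):
    """Return the top 16 bits of the float32 encoding (BF16 by truncation, a simplification)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    return bits32 >> 16


def entropy_bits(symbols):
    """Empirical Shannon entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


rng = random.Random(0)
# Synthetic stand-in for trained weights: zero-mean Gaussian with a small scale.
weights = [to_bf16_bits(rng.gauss(0.0, 0.02)) for _ in range(20000)]

sign_entropy = entropy_bits([w >> 15 for w in weights])              # 1-bit field
exponent_entropy = entropy_bits([(w >> 7) & 0xFF for w in weights])  # 8-bit field
mantissa_entropy = entropy_bits([w & 0x7F for w in weights])         # 7-bit field

print(f"sign:     {sign_entropy:.2f} of 1 bit")
print(f"exponent: {exponent_entropy:.2f} of 8 bits")
print(f"mantissa: {mantissa_entropy:.2f} of 7 bits")

# Estimated size if each field were coded separately at its entropy.
estimated_fraction = (sign_entropy + exponent_entropy + mantissa_entropy) / 16
print(f"estimated size: {100 * estimated_fraction:.0f}% of raw BF16")
```

On data like this the exponent field carries far fewer than its 8 stored bits, while sign and mantissa are close to incompressible. Figures from synthetic samples are not a substitute for the measured numbers in this README.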
| Approach | Lossless? | Typical reduction | Behaviour change |
|---|---|---|---|
| BigSmall | Yes — bit-identical | ~34% (65% as a delta, ≥7B instruct class) | None |
| Quantization (GPTQ / AWQ / bitsandbytes) | No | 50–75% | Yes — weights are rounded |
| DFloat11 (entropy-coded BF16) | Yes — lossless | ~30% (format-fixed) | None |
| ZipNN (entropy-coded BF16) | Yes — lossless | up to ~33% (authors' reported numbers) | None |
| ZIP / gzip on safetensors | Yes | ~1–3% | None (but not model-aware) |
Three of these are lossless weight-aware formats: BigSmall, DFloat11, and ZipNN. Head-to-head under the same accounting, BigSmall codes below DFloat11's bound on every layer type of every model measured (+0.45–0.55 pp model-level, +12–18 pp on norm scales — docs/dfloat11.md); ZipNN has not been independently measured by us, so its row carries its authors' numbers. BigSmall is also the only one of the three with delta patches, BF16-native-F32 detection, and streaming surfaces. Quantization compresses further but changes the model; generic ZIP keeps fidelity but barely shrinks BF16 weights. See docs/comparison.md for the full breakdown.
bigsmall compress SRC [-o OUT] [--delta-from BASE] [--auto-delta] [--resume] [--ecc]
bigsmall decompress SRC [-o OUT] [--base BASE]
bigsmall info SRC.bs # size, ratio, codecs used
bigsmall scan SRC # analyse before compressing
bigsmall verify SRC.bs [--fast|--sample N] # integrity check
bigsmall diff A.bs B.bs [--patch P.bs] # compare or write a delta
bigsmall apply BASE PATCH.bs -o OUT # reconstruct from base + patch
bigsmall repair SRC.bs [-o OUT] # recover via Reed-Solomon ECC sidecar
bigsmall benchmark SRC # encode/decode throughput
bigsmall migrate SRC.bs # re-encode with current codecs
bigsmall status # list your BigSmall HF repos
bigsmall pipeline run SRC DST # resumable download → compress → upload
bigsmall reshard SRC --output-dir DIR [--size-gb N|--shards N|--join] # reshard .bs by layer
Every command has --help. See docs/cli-reference.md for full examples.
import bigsmall
# Round-trip a model
bigsmall.compress("model/", "model.bs")
bigsmall.decompress("model.bs", "model_back/")
# Fine-tune as a delta patch
bigsmall.compress("finetune/", "patch.bs", delta_from="base/")
bigsmall.apply("base/", "patch.bs", "finetune_back/")
# Inspect before compressing
bigsmall.detect_bf16_native("model/")
bigsmall.scan_model("model/")
# Low-VRAM streaming inference (~12× less VRAM than from_pretrained)
from bigsmall import BigSmallStreamingModel
model = BigSmallStreamingModel.from_pretrained(
    "wpferrell/phi-3.5-mini-instruct-bigsmall",
    device="cuda",
    lru_max_vram_gb=2.0,
)
# Stream-decompress straight from the HF CDN — no .bs written to disk (V10)
state_dict = bigsmall.stream_from_hub("wpferrell/gpt2-bigsmall", device="cpu")
# Reshard .bs files along layer boundaries, no re-encoding (V11)
bigsmall.reshard(["model.bs"], "resharded/", target_shard_size_gb=2.0)

- Delta compression is now fail-safe: measure-then-choose. compress_delta encodes every matched tensor both ways (XOR-delta and standalone) and keeps the smaller, so a delta file can never come out larger than standalone compression. A pre-compression gate warns when >30% of matched bytes changed, the measured delta-doesn't-pay regime (--force-delta silences it).
- KV cache entry format v3: per-depth sequential dispatcher. Plain vs sequential-exponent coding is measured per tensor at encode time, the smaller kept, and the winning blob verified bit-exact before it returns. End-to-end wins over the shipped v1 format grow with context: 0.27% (128 tokens) → 0.58% (2048 tokens), up to +7.7% on early-layer K tensors. v1/v2 entries decode forever.
- FP8 KV entries (e4m3 / e5m2) and chunked streaming KV (32k+ contexts encode and decode chunk-by-chunk, nothing materialized whole; default 4096 tokens/chunk) in the same v3 entry format.
- AutoKVCache: get_kv_cache(mode="auto") routes each call by device, sending CUDA tensors through the GPU-resident lossless codec and CPU tensors through the entry codec.
- bigsmall xray: checkpoint forensics. Per-tensor substream entropies vs a matched-random control, lineage and anomaly flags (mantissa_carved, sign_imbalance, exp_near_random, …), and a model-level "looks untrained" detector that catches silently-randomized loads:
  bigsmall xray model_dir/ --json report.json
- Role-group stream packing (opt-in --group-streams): small same-role tensors (norm/bias chains, GQA k/v projections, MoE routers) packed into one coded stream per role when measured smaller. Grouped files need ≥3.15 readers.
- bf16_native_f32_v2: near-gate F32 tensors pick the smallest verified codec per substream; 0.6–2.7% smaller on the affected class.
- Measured DFloat11 comparison: BigSmall codes below DFloat11's bound on every layer type of every model measured (see docs/dfloat11.md).
- GPU-resident KV cache (V9): GPUCompressedKVCache keeps the compressed cache and the encode/decode passes entirely on the CUDA device, with no CPU round-trip. ~47× faster than the CPU KV codec on the reference shape, bit-identical round-trip. get_kv_cache(device, mode) auto-picks the GPU backend when CUDA is available. V9B adds fused Triton pack/unpack kernels.
- Progressive HTTP streaming (V10): stream_from_hub(repo_id) decompresses a model directly from the HuggingFace CDN over HTTP byte-range requests. With the default cache=False, zero .bs bytes are written to disk.
- Reshard (V11): bigsmall reshard splits, joins, or rebalances .bs shards along transformer-layer boundaries with no re-encoding. Every output tensor is md5-verified.
- numba is now a hard dependency (numba>=0.61), which guarantees the JIT codec path runs everywhere instead of a silent slow fallback.
- CI green across the full matrix: Ubuntu / Windows / macOS × Python 3.10 / 3.11 / 3.12.
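Resharding along transformer-layer boundaries, as described in the list above, can be pictured as a packing problem: tensors of one layer stay together and whole layers are assigned to shards. The sketch below is a hypothetical planner, not how bigsmall reshard is implemented. It assumes HuggingFace-style tensor names containing `.layers.N.`, and `group_by_layer` and `plan_shards` are made-up helper names.

```python
import re
from collections import OrderedDict

# Assumed naming convention: "...layers.<index>..." marks a transformer layer.
LAYER_PATTERN = re.compile(r"\.layers\.(\d+)\.")


def group_by_layer(tensor_sizes):
    """Group tensor names into units that must stay together, preserving order."""
    groups = OrderedDict()
    for name, size in tensor_sizes.items():
        match = LAYER_PATTERN.search(name)
        key = f"layer{int(match.group(1))}" if match else name
        groups.setdefault(key, []).append((name, size))
    return groups


def plan_shards(tensor_sizes, target_bytes):
    """Greedily pack whole groups into shards of roughly target_bytes each."""
    shards, current, current_size = [], [], 0
    for members in group_by_layer(tensor_sizes).values():
        group_size = sum(size for _, size in members)
        # Start a new shard rather than split a layer across two shards.
        if current and current_size + group_size > target_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.extend(name for name, _ in members)
        current_size += group_size
    if current:
        shards.append(current)
    return shards


# Toy size table (bytes); real sizes would come from the shard headers.
sizes = {"model.embed_tokens.weight": 300}
for layer in range(4):
    sizes[f"model.layers.{layer}.self_attn.q_proj.weight"] = 100
    sizes[f"model.layers.{layer}.mlp.up_proj.weight"] = 200
sizes["lm_head.weight"] = 300

plan = plan_shards(sizes, target_bytes=700)
for index, names in enumerate(plan):
    print(f"shard {index}: {len(names)} tensors")
```

A planner like this only decides which tensors go where. Since the coded tensors are moved rather than re-encoded, the per-tensor md5 check still applies to every output.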
Earlier highlights still current: delta compression (fine-tunes as ~34%-size patches), --auto-delta base detection, BF16-native F32 auto-routing (Whisper-class), --resume, verify --fast/--sample, mmap decode, Reed-Solomon --ecc + repair, and BigSmallStreamingModel(lru_max_vram_gb=…).
The lossless compression limit for BF16 neural weights has been measured: compressed size bottoms out at ~62% of raw BF16 for any model, and at ~34% for ≥7B instruct fine-tunes with delta compression. We ran 300+ experiments across every known mathematical approach (entropy coding, cross-tensor prediction, learned translators, persistent homology, optimal transport, quantum-inspired methods, and more) and proved that there is no further compression available within the strict bit-identity contract.
The floor is measured at the wall, not extrapolated: across 4,143 weight matrices in 8 architectures the per-tensor entropy floor is flat (coefficient of variation ≈ 0), and trained mantissa/sign bits are coder-equivalent to matched random controls in every family tested — training only writes the exponent. The floor exists at initialization and never moves during training. Details: docs/research.md.
Full findings, all experiments, all dead-ends: 10.5281/zenodo.20279247. Plain-English summary: docs/research.md.
pip install bigsmall # core
pip install "bigsmall[hf]" # + HuggingFace integration
pip install "bigsmall[ecc]" # + Reed-Solomon error recovery
pip install "bigsmall[all]" # everything

Requires Python 3.9+. Works on Linux, macOS, and Windows. CPU, NVIDIA, AMD, and Apple Silicon.
Code: Elastic License 2.0. Free for personal, research, and commercial use. SaaS providers should see LICENSING.md.
Model weights distributed in .bs format keep the license of the original model.
- PyPI — https://pypi.org/project/bigsmall/
- GitHub — https://github.com/wpferrell/Bigsmall
- HuggingFace — https://huggingface.co/wpferrell
- Paper / DOI — https://doi.org/10.5281/zenodo.20279247 (always resolves to the latest version)
- Paper (PDF) — https://github.com/wpferrell/Bigsmall/blob/main/paper.pdf
- Docs — docs/
- Changelog — CHANGELOG.md
- Contact — wpferrell@gmail.com
Did BigSmall work for your model? We'd love to know.
- Open a Discussion — share your compression results, ask questions, or suggest improvements
- File an Issue — if something didn't work, tell us exactly what happened
- HuggingFace — all compressed models are at huggingface.co/wpferrell
We especially want to hear:
- Which model you compressed and what ratio you got
- Any errors or unexpected behaviour
- Use cases we haven't thought of