wpferrell/Bigsmall

BigSmall — Lossless AI Model Compression

Make AI models ~34% smaller (typical for BF16 LLMs). Bit-identical weights. Drop-in replacement for from_pretrained.

pip install bigsmall          # CLI + compression/decompression
pip install "bigsmall[torch]" # add this for model loading (from_pretrained); quoted so shells like zsh do not glob the brackets

A 14 GB Mistral-7B becomes 9.3 GB. A fine-tuned model becomes a 5 GB patch on top of its 14 GB base. In the decompressed model, every weight is bit-for-bit identical to the original: each tensor's md5 is verified on decompress. (Verification is tensor-level, not file-level: safetensors re-serializes the container wrapper, so the file's md5 changes while the weight values do not.)
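The difference between tensor-level and file-level verification can be illustrated with a toy container. This is a minimal sketch, not BigSmall's or safetensors' actual code: `pack_container`, `unpack_container`, and `tensor_digests` are hypothetical helpers, and the layout (length-prefixed JSON header followed by raw tensor bytes) only loosely mimics safetensors.

```python
import hashlib
import json
import struct


def pack_container(tensors: dict, metadata: dict) -> bytes:
    """Toy container: 8-byte header length, JSON header, then raw tensor bytes."""
    names = sorted(tensors)
    header = json.dumps(
        {"metadata": metadata, "sizes": {n: len(tensors[n]) for n in names}},
        sort_keys=True,
    ).encode("utf-8")
    return struct.pack("<Q", len(header)) + header + b"".join(tensors[n] for n in names)


def unpack_container(blob: bytes) -> dict:
    """Read the raw tensor bytes back out of the toy container."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8 : 8 + header_len])
    tensors, offset = {}, 8 + header_len
    for name in sorted(header["sizes"]):
        size = header["sizes"][name]
        tensors[name] = blob[offset : offset + size]
        offset += size
    return tensors


def tensor_digests(tensors: dict) -> dict:
    """md5 of each tensor's raw bytes, independent of the container wrapper."""
    return {name: hashlib.md5(data).hexdigest() for name, data in tensors.items()}


# A few 16-bit (BF16-sized) values per tensor, stored as raw little-endian bytes.
weights = {
    "layer0.weight": struct.pack("<4H", 0x3F80, 0xBF00, 0x3E99, 0x0000),
    "layer0.bias": struct.pack("<2H", 0x3C23, 0xBC23),
}

# Same weights, different wrapper metadata (as after a re-serialization).
original_file = pack_container(weights, {"format": "pt"})
restored_file = pack_container(weights, {"format": "pt", "restored_by": "example"})

file_md5_matches = hashlib.md5(original_file).digest() == hashlib.md5(restored_file).digest()
tensors_match = tensor_digests(unpack_container(original_file)) == tensor_digests(
    unpack_container(restored_file)
)

print("file md5 identical:  ", file_md5_matches)
print("tensor md5 identical:", tensors_match)
```

The file digests differ because the header bytes differ, while the per-tensor digests agree because the weight bytes are untouched. That is the property a tensor-level check tests.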

  • ~34% smaller: any BF16 LLM
  • ~65% smaller as a delta patch: ≥7B instruct fine-tunes vs their base (pair-dependent)
  • 25+ ready-to-use models: on HuggingFace

What BigSmall does

Three use cases. Pick the one that fits.

1. Make any model smaller

bigsmall compress mistral-7b/ -o mistral-7b.bs
bigsmall decompress mistral-7b.bs -o mistral-7b-restored/

Before: 14.2 GB of safetensors. After: 9.3 GB .bs file. Saved: 4.9 GB (34%).

Every weight is bit-for-bit identical. Every calculation the model does is identical to the original. Works on any safetensors model — LLMs, diffusion, audio, vision, anything.

2. Store fine-tunes as tiny patches

bigsmall compress qwen-instruct/ --delta-from qwen-base/ -o instruct.bs
bigsmall apply qwen-base/ instruct.bs -o qwen-instruct-restored/

Before: 14.2 GB Qwen2.5-7B-Instruct. After: ~5 GB patch. Saved: 9 GB (65%).

If your users already have the public base model, they only need to download what changed. This is the biggest win in BigSmall. Use it for any fine-tune: instruction tuning, DPO, RLHF, domain adaptation, LoRA-merged checkpoints.

How much a delta saves is pair-dependent — measured from under 1% of full size (the best ≥7B SFT pairs) to ~61% (small-model full tunes, barely under standalone). The 65% saving above is the ≥7B official-instruct class, where patches measure 34–50% of full size. Full measured table: docs/delta-compression.md. Since 3.15 the engine measures both codings per tensor and never ships a delta larger than standalone.
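The "measure both codings, keep the smaller" rule can be sketched in a few lines. This is an illustration under stated assumptions, not BigSmall's implementation: `zlib` stands in for the real codecs, tensors are plain byte strings, and `encode_tensor` / `decode_tensor` are hypothetical names.

```python
import random
import zlib


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_tensor(tuned: bytes, base: bytes):
    """Code the tensor both ways and keep whichever blob is smaller."""
    standalone = zlib.compress(tuned, 9)
    if len(base) != len(tuned):  # sizes differ: a delta is not applicable
        return "standalone", standalone
    delta = zlib.compress(xor_bytes(tuned, base), 9)
    if len(delta) < len(standalone):
        return "delta", delta
    return "standalone", standalone


def decode_tensor(mode: str, blob: bytes, base: bytes) -> bytes:
    """Invert encode_tensor; XOR with the base undoes the delta."""
    raw = zlib.decompress(blob)
    return xor_bytes(raw, base) if mode == "delta" else raw


rng = random.Random(0)
base = bytes(rng.getrandbits(8) for _ in range(4096))

# A light fine-tune: about 1% of bytes differ from the base.
light = bytearray(base)
for i in rng.sample(range(len(base)), 40):
    light[i] ^= 0x01
light = bytes(light)

# An unrelated tensor: its XOR against the base is as random as the tensor itself.
unrelated = bytes(rng.getrandbits(8) for _ in range(4096))

for name, tuned in [("light", light), ("unrelated", unrelated)]:
    mode, blob = encode_tensor(tuned, base)
    assert decode_tensor(mode, blob, base) == tuned  # bit-exact round trip
    print(name, mode, len(blob))
```

Because the standalone blob is always a candidate, the chosen blob can never be larger than standalone coding, which is the fail-safe property described above.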

3. Download smaller, use instantly

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "wpferrell/phi-3.5-mini-instruct-bigsmall"
)

Works exactly like a normal HuggingFace model — BigSmall decompresses transparently on load. 25+ pre-compressed models ready to use (browse them all).

Prefer the CLI? bigsmall decompress works on local .bs files — download first, then decompress:

hf download wpferrell/phi-3.5-mini-instruct-bigsmall --local-dir phi-3.5-mini-bs
bigsmall decompress phi-3.5-mini-bs/model-00001-of-00002.bs -o model.safetensors

(On older huggingface_hub the equivalent command is huggingface-cli download …; the huggingface-cli entrypoint is deprecated in huggingface_hub >= 1.0 in favour of hf.)


Compression numbers (every published model)

Every row is a real measurement. Click a model to download it.

| Model | Original | BigSmall | Saved |
| --- | --- | --- | --- |
| Qwen2.5-14B-Instruct | 29.5 GB | 19.5 GB | 34% |
| Gemma-3-12B-it | 22.7 GB | 14.8 GB | 35% |
| Gemma-2-9B-it | 17.2 GB | 11.3 GB | 34% |
| Llama-3.1-8B-Instruct | 15.0 GB | 9.7 GB | 35% |
| Llama-3-8B-Instruct | 15.0 GB | 9.8 GB | 34% |
| Qwen3-8B | 15.3 GB | 10.1 GB | 34% |
| Mistral-7B-Instruct v0.3 | 14.2 GB | 8.9 GB | 37% |
| Mistral-7B-Instruct v0.2 | 14.2 GB | 8.9 GB | 37% |
| Qwen2.5-7B-Instruct | 14.2 GB | 9.4 GB | 34% |
| Phi-3.5-mini-instruct | 7.1 GB | 4.7 GB | 34% |
| Gemma-3-4B-it | 8.0 GB | 5.2 GB | 35% |
| Qwen3-4B-Instruct | 7.5 GB | 5.0 GB | 34% |
| Llama-3.2-3B-Instruct | 6.4 GB | 3.9 GB | 39% |
| Gemma-2-2B-it | 4.9 GB | 3.2 GB | 34% |
| Qwen2.5-3B-Instruct | 5.7 GB | 3.8 GB | 34% |
| Qwen2.5-1.5B-Instruct | 2.9 GB | 1.9 GB | 34% |
| Llama-3.2-1B-Instruct | 2.3 GB | 1.5 GB | 34% |
| Gemma-3-1B-it | 1.9 GB | 1.2 GB | 35% |
| Qwen2.5-0.5B-Instruct | 920 MB | 610 MB | 34% |
| GPT-2 (117M) | 548 MB | 414 MB | 24% |
| Gemma-3-270M-it | 500 MB | 330 MB | 34% |
| Gemma-3-270M | 500 MB | 330 MB | 34% |
| Gemma-2-2B | 9.7 GB | 8.1 GB | 17% |

Browse all 25+ models on HuggingFace →


What "lossless" actually means

Every weight in the model is mathematically identical to the original — same bit pattern, same floating-point value, same gradient, same output.

  • Not quantization. Quantization rounds weights to fewer bits and the model's behaviour changes.
  • Not pruning. Pruning deletes weights.
  • Not approximation. No tricks, no calibration data, no quality drop.

BigSmall finds redundancy in the bit pattern of neural weights and stores it more compactly — the same idea as ZIP for text, but tuned for BF16 floating-point distributions. md5 is verified on every tensor at decompression. If a single bit differs, verify fails.
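To see where that redundancy sits, the sketch below measures the empirical entropy of each BF16 field on synthetic weights. It rests on several assumptions: the weights are drawn from a Gaussian with standard deviation 0.02 rather than taken from a real model, BF16 is obtained by truncating float32 (real converters usually round), and summing per-field entropies is only a rough estimate of what an entropy coder could reach. It does not reproduce BigSmall's codec.

```python
import math
import random
import struct
from collections import Counter


def bf16_bits(x: float) -> int:
    """Top 16 bits of the float32 encoding (BF16 by truncation)."""
    (u32,) = struct.unpack("<I", struct.pack("<f", x))
    return u32 >> 16


def entropy_bits(symbols) -> float:
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


rng = random.Random(0)
words = [bf16_bits(rng.gauss(0.0, 0.02)) for _ in range(50_000)]

sign_h = entropy_bits(w >> 15 for w in words)         # 1 sign bit
exp_h = entropy_bits((w >> 7) & 0xFF for w in words)  # 8 exponent bits
man_h = entropy_bits(w & 0x7F for w in words)         # 7 mantissa bits

# Sum of per-field entropies over the 16 raw bits per weight.
ideal_fraction = (sign_h + exp_h + man_h) / 16
print(f"sign {sign_h:.2f}/1  exponent {exp_h:.2f}/8  mantissa {man_h:.2f}/7 bits")
print(f"estimated coded size: {ideal_fraction:.0%} of raw BF16")
```

On data like this the exponent field carries far fewer than its 8 bits, while the sign and mantissa fields look close to random, so most of the achievable saving comes from the exponent.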


How it compares

| Approach | Lossless? | Typical reduction | Behaviour change |
| --- | --- | --- | --- |
| BigSmall | Yes (bit-identical) | ~34% (65% as a delta, ≥7B instruct class) | None |
| Quantization (GPTQ / AWQ / bitsandbytes) | No | 50–75% | Yes (weights are rounded) |
| DFloat11 (entropy-coded BF16) | Yes (lossless) | ~30% (format-fixed) | None |
| ZipNN (entropy-coded BF16) | Yes (lossless) | up to ~33% (authors' reported numbers) | None |
| ZIP / gzip on safetensors | Yes | ~1–3% | None (but not model-aware) |

Three of these are lossless weight-aware formats: BigSmall, DFloat11, and ZipNN. Head-to-head under the same accounting, BigSmall codes below DFloat11's bound on every layer type of every model measured (+0.45–0.55 pp model-level, +12–18 pp on norm scales — docs/dfloat11.md); ZipNN has not been independently measured by us, so its row carries its authors' numbers. BigSmall is also the only one of the three with delta patches, BF16-native-F32 detection, and streaming surfaces. Quantization compresses further but changes the model; generic ZIP keeps fidelity but barely shrinks BF16 weights. See docs/comparison.md for the full breakdown.


CLI reference

bigsmall compress SRC [-o OUT] [--delta-from BASE] [--auto-delta] [--resume] [--ecc]
bigsmall decompress SRC [-o OUT] [--base BASE]
bigsmall info SRC.bs                       size, ratio, codecs used
bigsmall scan SRC                          analyse before compressing
bigsmall verify SRC.bs [--fast|--sample N] integrity check
bigsmall diff A.bs B.bs [--patch P.bs]     compare or write a delta
bigsmall apply BASE PATCH.bs -o OUT        reconstruct from base + patch
bigsmall repair SRC.bs [-o OUT]            recover via Reed-Solomon ECC sidecar
bigsmall benchmark SRC                     encode/decode throughput
bigsmall migrate SRC.bs                    re-encode with current codecs
bigsmall status                            list your BigSmall HF repos
bigsmall pipeline run SRC DST              resumable download → compress → upload
bigsmall reshard SRC --output-dir DIR [--size-gb N|--shards N|--join]  reshard .bs by layer

Every command has --help. See docs/cli-reference.md for full examples.
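When driving the CLI from a script, it can help to build the argument list in one place. The helper below is a hypothetical convenience wrapper, not part of the bigsmall package; it uses only flags that appear in the synopsis above and does not run anything unless `run_bigsmall` is called.

```python
import shutil
import subprocess


def compress_argv(src, out=None, delta_from=None, resume=False, ecc=False):
    """Build the argument list for `bigsmall compress` from the synopsis flags."""
    argv = ["bigsmall", "compress", str(src)]
    if delta_from is not None:
        argv += ["--delta-from", str(delta_from)]
    if out is not None:
        argv += ["-o", str(out)]
    if resume:
        argv.append("--resume")
    if ecc:
        argv.append("--ecc")
    return argv


def run_bigsmall(argv):
    """Execute the command, raising CalledProcessError on a non-zero exit."""
    if shutil.which(argv[0]) is None:
        raise FileNotFoundError("bigsmall CLI not found on PATH")
    return subprocess.run(argv, check=True)


argv = compress_argv("qwen-instruct/", out="instruct.bs", delta_from="qwen-base/")
print(" ".join(argv))
# prints: bigsmall compress qwen-instruct/ --delta-from qwen-base/ -o instruct.bs
```

Passing the list form to `subprocess.run` avoids shell quoting problems with paths that contain spaces.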


Python API

import bigsmall

# Round-trip a model
bigsmall.compress("model/", "model.bs")
bigsmall.decompress("model.bs", "model_back/")

# Fine-tune as a delta patch
bigsmall.compress("finetune/", "patch.bs", delta_from="base/")
bigsmall.apply("base/", "patch.bs", "finetune_back/")

# Inspect before compressing
bigsmall.detect_bf16_native("model/")
bigsmall.scan_model("model/")

# Low-VRAM streaming inference (~12× less VRAM than from_pretrained)
from bigsmall import BigSmallStreamingModel
model = BigSmallStreamingModel.from_pretrained(
    "wpferrell/phi-3.5-mini-instruct-bigsmall",
    device="cuda",
    lru_max_vram_gb=2.0,
)

# Stream-decompress straight from the HF CDN — no .bs written to disk (V10)
state_dict = bigsmall.stream_from_hub("wpferrell/gpt2-bigsmall", device="cpu")

# Reshard .bs files along layer boundaries, no re-encoding (V11)
bigsmall.reshard(["model.bs"], "resharded/", target_shard_size_gb=2.0)

What's new in v3.15.0

  • Delta compression is now fail-safe — measure-then-choose. compress_delta encodes every matched tensor both ways (XOR-delta and standalone) and keeps the smaller, so a delta file can never come out larger than standalone compression. A pre-compression gate warns when >30% of matched bytes changed — the measured delta-doesn't-pay regime (--force-delta silences it).
  • KV cache entry format v3 — per-depth sequential dispatcher. Plain vs sequential-exponent coding is measured per tensor at encode time, the smaller kept, and the winning blob verified bit-exact before it returns. End-to-end wins over the shipped v1 format grow with context: 0.27% (128 tokens) → 0.58% (2048 tokens), up to +7.7% on early-layer K tensors. v1/v2 entries decode forever.
  • FP8 KV entries (e4m3 / e5m2) and chunked streaming KV (32k+ contexts encode and decode chunk-by-chunk, nothing materialized whole; default 4096 tokens/chunk) in the same v3 entry format.
  • AutoKVCache: get_kv_cache(mode="auto") routes each call by device: CUDA tensors through the GPU-resident lossless codec, CPU tensors through the entry codec.
  • bigsmall xray — checkpoint forensics. Per-tensor substream entropies vs a matched-random control, lineage and anomaly flags (mantissa_carved, sign_imbalance, exp_near_random, …), and a model-level "looks untrained" detector that catches silently-randomized loads:
    bigsmall xray model_dir/ --json report.json
  • Role-group stream packing (opt-in --group-streams): small same-role tensors (norm/bias chains, GQA k/v projections, MoE routers) packed into one coded stream per role when measured smaller. Grouped files need ≥3.15 readers.
  • bf16_native_f32_v2 — near-gate F32 tensors pick the smallest verified codec per substream; 0.6–2.7% smaller on the affected class.
  • Measured DFloat11 comparison — BigSmall codes below DFloat11's bound on every layer type of every model measured: docs/dfloat11.md.
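The pre-compression gate from the first item in this list can be approximated as a changed-byte count. This is a sketch under assumptions: tensors are matched by name and byte length, each tensor is a plain byte string, and `changed_byte_fraction` / `delta_gate` are hypothetical names. The `force` argument mirrors the role of --force-delta.

```python
import warnings


def changed_byte_fraction(base: dict, tuned: dict) -> float:
    """Fraction of bytes that differ across tensors matched by name and size."""
    matched = changed = 0
    for name, tuned_bytes in tuned.items():
        base_bytes = base.get(name)
        if base_bytes is None or len(base_bytes) != len(tuned_bytes):
            continue  # assumed: unmatched tensors are left out of the measurement
        matched += len(tuned_bytes)
        changed += sum(a != b for a, b in zip(base_bytes, tuned_bytes))
    return changed / matched if matched else 0.0


def delta_gate(base: dict, tuned: dict, threshold: float = 0.30, force: bool = False) -> float:
    """Warn when more than `threshold` of matched bytes changed, unless forced."""
    fraction = changed_byte_fraction(base, tuned)
    if fraction > threshold and not force:
        warnings.warn(f"{fraction:.0%} of matched bytes changed; a delta may not pay off")
    return fraction


base = {"a": bytes(100), "b": bytes(100)}
light_tune = {"a": bytes(100), "b": b"\x01" * 50 + bytes(50), "c": b"\xff" * 10}
heavy_tune = {"a": bytes(100), "b": b"\x01" * 100}

print(changed_byte_fraction(base, light_tune))  # 50 of 200 matched bytes changed
print(changed_byte_fraction(base, heavy_tune))  # 100 of 200 matched bytes changed
```

Calling `delta_gate(base, heavy_tune)` emits the warning, while `delta_gate(base, light_tune)` stays silent because 25% is under the threshold.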

What's new in v3.14

  • GPU-resident KV cache (V9): GPUCompressedKVCache keeps the compressed cache and the encode/decode passes entirely on the CUDA device, with no CPU round-trip. ~47× faster than the CPU KV codec on the reference shape, bit-identical round-trip. get_kv_cache(device, mode) auto-picks the GPU backend when CUDA is available. V9B adds fused Triton pack/unpack kernels.
  • Progressive HTTP streaming (V10): stream_from_hub(repo_id) decompresses a model directly from the HuggingFace CDN over HTTP byte-range requests. With the default cache=False, zero .bs bytes are written to disk.
  • Reshard (V11): bigsmall reshard splits, joins, or rebalances .bs shards along transformer-layer boundaries with no re-encoding. Every output tensor is md5-verified.
  • numba is now a hard dependency (numba>=0.61) — guarantees the JIT codec path runs everywhere instead of a silent slow fallback.
  • CI green across the full matrix — Ubuntu / Windows / macOS × Python 3.10 / 3.11 / 3.12.
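Resharding along layer boundaries, as in the Reshard item above, amounts to a packing plan that never splits a layer across shards. The sketch below shows one greedy way to build such a plan. It assumes Llama-style tensor names containing `.layers.<n>.`, works on a name-to-size mapping only, and uses the hypothetical helpers `layer_key` and `plan_shards`; it is not how bigsmall reshard is implemented.

```python
import re

LAYER_RE = re.compile(r"\.layers\.(\d+)\.")


def layer_key(tensor_name: str) -> str:
    """Group key: 'layer-<n>' for per-layer tensors, else the tensor's own name."""
    match = LAYER_RE.search(tensor_name)
    return f"layer-{int(match.group(1))}" if match else tensor_name


def plan_shards(tensor_sizes: dict, target_bytes: int) -> list:
    """Greedily pack whole layers into shards of roughly `target_bytes`."""
    groups = {}
    for name, size in tensor_sizes.items():  # dicts keep insertion order
        groups.setdefault(layer_key(name), []).append((name, size))

    shards, current, current_size = [], [], 0
    for members in groups.values():
        group_size = sum(size for _, size in members)
        # Start a new shard rather than split a layer or exceed the target.
        if current and current_size + group_size > target_bytes:
            shards.append(current)
            current, current_size = [], 0
        current += [name for name, _ in members]
        current_size += group_size
    if current:
        shards.append(current)
    return shards


sizes = {"model.embed_tokens.weight": 300}
for i in range(4):
    sizes[f"model.layers.{i}.self_attn.q_proj.weight"] = 100
    sizes[f"model.layers.{i}.mlp.up_proj.weight"] = 100
sizes["lm_head.weight"] = 300

for index, shard in enumerate(plan_shards(sizes, target_bytes=500)):
    print(index, shard)
```

Since the plan only reassigns already-coded tensors to shards, nothing in it requires re-encoding. A single layer larger than the target still gets a shard of its own.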

Earlier highlights still current: delta compression (fine-tunes as ~34%-size patches), --auto-delta base detection, BF16-native F32 auto-routing (Whisper-class), --resume, verify --fast/--sample, mmap decode, Reed-Solomon --ecc + repair, and BigSmallStreamingModel(lru_max_vram_gb=…).

Full changelog →


Research

The lossless compression floor for BF16 neural weights has been measured: compressed size bottoms out at ~62% of raw BF16 for any model, and at ~34% for ≥7B instruct fine-tunes stored as delta patches. We ran 300+ experiments across every known mathematical approach (entropy coding, cross-tensor prediction, learned translators, persistent homology, optimal transport, quantum-inspired methods, and more) and proved that there is no further compression available within the strict bit-identity contract.

The floor is measured at the wall, not extrapolated: across 4,143 weight matrices in 8 architectures the per-tensor entropy floor is flat (coefficient of variation ≈ 0), and trained mantissa/sign bits are coder-equivalent to matched random controls in every family tested — training only writes the exponent. The floor exists at initialization and never moves during training. Details: docs/research.md.

Full findings, all experiments, all dead-ends: 10.5281/zenodo.20279247. Plain-English summary: docs/research.md.


Install

pip install bigsmall                  # core
pip install "bigsmall[hf]"            # + HuggingFace integration
pip install "bigsmall[ecc]"           # + Reed-Solomon error recovery
pip install "bigsmall[all]"           # everything

Requires Python 3.9+. Works on Linux, macOS, and Windows. CPU, NVIDIA, AMD, and Apple Silicon.


License

Code: Elastic License 2.0. Free for personal, research, and commercial use. SaaS providers should see LICENSING.md.

Model weights distributed in .bs format keep the license of the original model.


Feedback & Community

Did BigSmall work for your model? We'd love to know.

  • Open a Discussion — share your compression results, ask questions, or suggest improvements
  • File an Issue — if something didn't work, tell us exactly what happened
  • HuggingFace — all compressed models are at huggingface.co/wpferrell

We especially want to hear:

  • Which model you compressed and what ratio you got
  • Any errors or unexpected behaviour
  • Use cases we haven't thought of