@@ -10,23 +10,29 @@ tags:
 # GPT-2 117M (BigSmall compressed)
-**0.55 GB -> 0.39 GB (FP32). Full quality -- not quantization. Zero inference overhead.**
-Losslessly compressed with [BigSmall](https://github.com/wpferrell/Bigsmall). Every weight is bit-identical to the original. Decompresses once at load time then runs at full native speed -- no inference overhead, ever.
-## BigSmall vs DFloat11 -- what is the difference?
-Both are lossless. The difference is *when* decompression happens:
 | | BigSmall | DFloat11 |
 |--|--|--|
-| Decompresses | Once at load time | Every forward pass on GPU |
-| Inference overhead | **None** | ~2x slower at batch=1 |
 | Hardware | **CPU, Apple Silicon, AMD, any GPU** | CUDA only |
-| Use case | Smaller downloads, faster loads | Less VRAM during inference |
-**Use BigSmall if** you want to download less, load faster, and run at full native speed on any hardware.
-**Use DFloat11 if** you need the model to stay compressed in GPU memory during inference and have a CUDA GPU.
 ## Install
@@ -35,7 +41,7 @@ Both are lossless. The difference is *when* decompression happens:
 pip install bigsmall
 `
-## Load (transparent -- works like any HuggingFace model)
 ```python
 import bigsmall
@@ -45,7 +51,7 @@ from transformers import AutoModelForCausalLM
 model = AutoModelForCausalLM.from_pretrained("wpferrell/gpt2-bigsmall")
 ```
-## Or stream layer by layer (peak RAM under 2GB even for 7B models)
 `python
 from bigsmall import StreamingLoader
@@ -57,9 +63,9 @@ with StreamingLoader("wpferrell/gpt2-bigsmall", device="cuda") as loader:
 ## Compression stats
-| Original | Compressed | Ratio | Format | Lossless |
 |----------|------------|-------|--------|---------|
-| 0.55 GB | 0.39 GB | 70.9% | FP32 | md5 verified every tensor |
 - GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
 - PyPI: pip install bigsmall

 # GPT-2 117M (BigSmall compressed)
+**0.55 GB -> 0.39 GB (FP32). Lossless. Zero inference overhead. Any hardware.**
+Compressed with [BigSmall](https://github.com/wpferrell/Bigsmall). The model decompresses once at load time, then runs at full native speed. Every weight is bit-identical to the original.
+## Why BigSmall
+### vs quantization (llama.cpp, GGUF, AWQ, bitsandbytes)
+Quantization is lossy: it rounds weights to lower precision, and the original values cannot be recovered. BigSmall is lossless -- bit-identical weights, no accuracy loss, safe for fine-tuning, fully reproducible.
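
The difference between a lossy and a lossless round trip can be sketched in a few lines. This is only an illustration: the four weights are made up, the int8 scheme is a generic symmetric quantizer, and `zlib` stands in for a lossless codec (it is not BigSmall's actual algorithm).

```python
import struct
import zlib

# Four made-up FP32 weights, stored as raw little-endian bytes.
weights = [0.1234567, -1.5, 0.0009765, 3.1415927]
raw = struct.pack("<4f", *weights)

# Lossy path: symmetric int8 quantization, then dequantization back to FP32.
scale = max(abs(w) for w in weights) / 127
quantized = [round(w / scale) for w in weights]
dequantized = struct.pack("<4f", *(q * scale for q in quantized))

# Lossless path: compress and decompress the raw bytes.
# zlib is an assumption here, used only as a stand-in lossless codec.
restored = zlib.decompress(zlib.compress(raw))

quantization_is_bit_identical = dequantized == raw
lossless_is_bit_identical = restored == raw
print(quantization_is_bit_identical, lossless_is_bit_identical)
```

The quantized path cannot reproduce the original bytes (the small weight collapses to zero), while the lossless path returns them unchanged.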
+### vs DFloat11 (runtime lossless compression)
+DFloat11 keeps weights compressed in GPU memory during inference. That saves VRAM, but inference is ~2x slower at batch=1 and it requires CUDA. BigSmall decompresses once at load time and then runs at full native speed on any hardware.
 | | BigSmall | DFloat11 |
 |--|--|--|
+| Compressed size, % of original (BF16, lower is better) | **65-66%** | ~70% |
+| Inference overhead | **None** | ~2x at batch=1 |
 | Hardware | **CPU, Apple Silicon, AMD, any GPU** | CUDA only |
+| FP32 / FP16 / FP8 support | **Yes** | BF16 only |
+| Fine-tuning safe | **Yes** | No |
+| Streaming loader (< 2GB RAM) | **Yes** | No |
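
The trade-off between decompressing at load time and decompressing during inference can be sketched with two toy classes. Everything here is hypothetical: `LoadTimeModel` and `RuntimeModel` are illustrative names, `zlib` is a stand-in codec, and neither class reflects how BigSmall or DFloat11 is actually implemented.

```python
import zlib


class LoadTimeModel:
    """Decompress once at load; forward passes read plain bytes."""

    def __init__(self, blob):
        self.weights = zlib.decompress(blob)  # cost paid once
        self.decompressions = 1

    def forward(self, x):
        return x + self.weights[0]  # placeholder for real compute


class RuntimeModel:
    """Keep the blob compressed; decompress on every forward pass."""

    def __init__(self, blob):
        self.blob = blob
        self.decompressions = 0

    def forward(self, x):
        weights = zlib.decompress(self.blob)  # cost paid per call
        self.decompressions += 1
        return x + weights[0]


blob = zlib.compress(bytes([7]) * 100_000)
load_time, runtime = LoadTimeModel(blob), RuntimeModel(blob)
outputs_match = all(load_time.forward(i) == runtime.forward(i) for i in range(100))
print(load_time.decompressions, runtime.decompressions, outputs_match)
print(len(load_time.weights), len(runtime.blob))  # resident bytes in each design
```

Both designs produce the same outputs. The first holds the full weights in memory and decompresses once; the second holds only the compressed blob but pays for decompression on each call.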
+### vs ZipNN (storage lossless compression)
+ZipNN is in the same category as BigSmall: it decompresses at load time. BigSmall compresses better (65% vs 67% of original size on BF16) and supports more formats. BigSmall also has a streaming loader, so you can run 70B models with under 2GB peak RAM.
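
The idea behind a streaming loader is that only one layer needs to be decompressed at a time. A minimal sketch, assuming a hypothetical `stream_layers` generator and `zlib` as a stand-in codec (this is not the `StreamingLoader` API):

```python
import zlib


def stream_layers(compressed_layers):
    """Yield (name, weights) pairs, decompressing one layer at a time."""
    for name, blob in compressed_layers:
        yield name, zlib.decompress(blob)


# Twelve made-up layers of 1 MB each, stored compressed.
compressed_layers = [
    (f"layer_{i}", zlib.compress(bytes(1_000_000))) for i in range(12)
]

largest_layer = 0  # size of the biggest single layer seen so far
total_bytes = 0    # what loading everything at once would need
for name, weights in stream_layers(compressed_layers):
    largest_layer = max(largest_layer, len(weights))
    total_bytes += len(weights)

print(largest_layer, total_bytes)
```

Memory use scales with the largest layer rather than with the whole model, which is why peak RAM can stay small even when the model is large.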
 ## Install
 pip install bigsmall
 `
+## Load
 ```python
 import bigsmall
 from transformers import AutoModelForCausalLM
 model = AutoModelForCausalLM.from_pretrained("wpferrell/gpt2-bigsmall")
 ```
+## Stream layer by layer (peak RAM under 2GB even for 7B models)
 `python
 from bigsmall import StreamingLoader
 ## Compression stats
+| Original | Compressed | Ratio (compressed / original) | Format | Verified |
 |----------|------------|-------|--------|---------|
+| 0.55 GB | 0.39 GB | 70.9% | FP32 | MD5 on every tensor |
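
The numbers in the table can be reproduced, and a per-tensor MD5 check can be sketched, roughly as follows. The tensor names and contents are made up, `md5_per_tensor` is a hypothetical helper, and `zlib` again stands in for the real codec; this is not BigSmall's own verification code.

```python
import hashlib
import zlib

# Ratio from the (rounded) sizes in the table: compressed / original.
original_gb, compressed_gb = 0.55, 0.39
ratio_pct = 100 * compressed_gb / original_gb
print(f"{ratio_pct:.1f}%")


def md5_per_tensor(tensors):
    """Map each tensor name to the MD5 hex digest of its raw bytes."""
    return {name: hashlib.md5(data).hexdigest() for name, data in tensors.items()}


# Made-up tensors as raw bytes; the names are illustrative only.
original = {"wte.weight": bytes(range(256)) * 4, "ln_f.bias": b"\x00\x01" * 64}
compressed = {name: zlib.compress(data) for name, data in original.items()}
restored = {name: zlib.decompress(blob) for name, blob in compressed.items()}

before, after = md5_per_tensor(original), md5_per_tensor(restored)
mismatched = [name for name in before if before[name] != after[name]]
print(mismatched)  # an empty list means every tensor round-tripped bit-identically
```

Checking every tensor separately, rather than hashing the whole file, pinpoints which tensor is corrupted if a digest ever differs.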
 - GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
 - PyPI: pip install bigsmall