wpferrell/gpt2-bigsmall

@@ -12,46 +12,24 @@ tags:
 **0.55 GB -> 0.39 GB (FP32). Lossless. Zero inference overhead. Any hardware.**
-Compressed with [BigSmall](https://github.com/wpferrell/Bigsmall) -- decompresses once at load time, then runs at full native speed. Every weight is bit-identical to the original.
-## Why BigSmall
-### vs quantization (llama.cpp, GGUF, AWQ, bitsandbytes)
-Quantization permanently degrades weights. BigSmall is lossless -- bit-identical weights, no accuracy loss, fine-tuning safe, fully reproducible.
-### vs DFloat11 (runtime lossless compression)
-DFloat11 keeps weights compressed during inference -- saves VRAM but adds ~2x overhead at batch=1, CUDA only. BigSmall decompresses once at load time and runs at full native speed on any hardware.
-| | BigSmall | DFloat11 |
-|--|--|--|
-| Compression ratio (BF16) | **65-66%** | ~70% |
-| Inference overhead | **None** | ~2x at batch=1 |
-| Hardware | **CPU, Apple Silicon, AMD, any GPU** | CUDA only |
-| FP32 / FP16 / FP8 support | **Yes** | BF16 only |
-| Fine-tuning safe | **Yes** | No |
-| Streaming loader (< 2GB RAM) | **Yes** | No |
-### vs ZipNN (storage lossless compression)
-Same category as BigSmall -- decompresses at load time. BigSmall compresses better (65% vs 67% BF16) and supports more formats. BigSmall also has a streaming loader so you can run 70B models with under 2GB peak RAM.
-## Install
 ```bash
 pip install bigsmall
 ```
-## Load
 ```python
 import bigsmall
 bigsmall.install_hook()
 from transformers import AutoModelForCausalLM
 model = AutoModelForCausalLM.from_pretrained("wpferrell/gpt2-bigsmall")
 ```
-## Stream layer by layer (peak RAM under 2GB even for 7B models)
 ```python
 from bigsmall import StreamingLoader
@@ -61,6 +39,30 @@ with StreamingLoader("wpferrell/gpt2-bigsmall", device="cuda") as loader:
     model = loader.load_model(AutoModelForCausalLM)
 ```
 ## Compression stats
 | Original | Compressed | Ratio | Format | Verified |
@@ -68,5 +70,4 @@ with StreamingLoader("wpferrell/gpt2-bigsmall", device="cuda") as loader:
 | 0.55 GB | 0.39 GB | 70.9% | FP32 | md5 every tensor |
 - GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
-- PyPI: pip install bigsmall
-- All pre-compressed models: [huggingface.co/wpferrell](https://huggingface.co/wpferrell)

 **0.55 GB -> 0.39 GB (FP32). Lossless. Zero inference overhead. Any hardware.**
+Compressed with [BigSmall](https://github.com/wpferrell/Bigsmall) -- decompresses once at load time, then runs at full native speed. Every weight is bit-identical to the original.
+## Quick start
 ```bash
 pip install bigsmall
 ```
 ```python
 import bigsmall
 bigsmall.install_hook()
 from transformers import AutoModelForCausalLM
 model = AutoModelForCausalLM.from_pretrained("wpferrell/gpt2-bigsmall")
 ```
+## Streaming loader -- run on any hardware
+BigSmall's streaming loader decompresses one layer at a time directly into VRAM. Peak memory is one layer -- not the whole model. A 4 GB GPU can run Mistral 7B losslessly.
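 ```python
 from transformers import AutoModelForCausalLM
 from bigsmall import StreamingLoader
 with StreamingLoader("wpferrell/gpt2-bigsmall", device="cuda") as loader:
     model = loader.load_model(AutoModelForCausalLM)
 ```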
+| Your GPU | Models you can run |
+|----------|--------------------|
+| 2 GB | Small models, GPT-2, Gemma 270M |
+| 4 GB | Mistral 7B, Llama 3.1 8B, Gemma 2B, Llama 3.2 3B |
+| 8 GB | Qwen 2.5 14B, Gemma 2 9B |
+| 24 GB | Llama 70B, Qwen 72B, DeepSeek V4-Flash |
+| CPU only | Everything -- slower but full quality |
+BigSmall is the only lossless compression tool with a streaming loader. DFloat11 and ZipNN load the full model into memory.
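
The layer-at-a-time idea described in this section can be illustrated with a toy sketch. This is not BigSmall's implementation: `zlib` stands in for the real codec, the layer names and sizes are hypothetical, and plain `array` buffers stand in for tensors. It only shows why the decompressed working set stays at one layer when layers are restored one by one.

```python
import zlib
from array import array


def compress_layers(layers):
    """Compress each layer's raw bytes independently so it can be restored on its own."""
    return {name: zlib.compress(weights.tobytes()) for name, weights in layers.items()}


def stream_layers(compressed):
    """Yield (name, weights) one layer at a time instead of restoring everything up front."""
    for name, blob in compressed.items():
        weights = array("f")
        weights.frombytes(zlib.decompress(blob))
        yield name, weights


# Toy "model": three float32 layers (hypothetical names and sizes).
layers = {
    f"layer.{i}": array("f", [0.1 * i + 0.001 * j for j in range(1000)])
    for i in range(3)
}
compressed = compress_layers(layers)

peak_bytes = 0
restored_ok = True
for name, weights in stream_layers(compressed):
    # Track the largest single decompressed layer seen so far.
    peak_bytes = max(peak_bytes, len(weights) * weights.itemsize)
    # zlib is lossless, so the restored bytes should match the originals exactly.
    restored_ok = restored_ok and weights.tobytes() == layers[name].tobytes()

total_bytes = sum(len(w) * w.itemsize for w in layers.values())
print(f"largest decompressed layer: {peak_bytes} bytes, whole model: {total_bytes} bytes")
```

In a real loader each restored layer would be moved to the target device before the next one is decompressed; the sketch leaves that step out.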
+## Why BigSmall vs DFloat11
+| | BigSmall | DFloat11 |
+|--|--|--|
+| Inference overhead | **None** | ~2x at batch=1 |
+| Hardware | **CPU, Apple Silicon, AMD, any GPU** | CUDA only |
+| FP32 support | **Yes** | No |
+| Fine-tuning safe | **Yes** | No |
+| Streaming loader | **Yes -- peak RAM < 2 GB** | No |
+## Why BigSmall vs quantization
+Quantization permanently degrades weights. BigSmall is lossless -- bit-identical weights, no accuracy loss, fine-tuning safe, reproducible outputs.
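
The difference between lossless compression and quantization can be shown with a small, self-contained comparison. This is a sketch under stated assumptions: `zlib` stands in for a lossless codec (it is not BigSmall's algorithm), and the symmetric int8 scheme is a deliberately simple illustration rather than the method used by any particular quantization library.

```python
import zlib
from array import array

# A handful of float32 "weights" (made-up values for illustration).
weights = array("f", [0.013, -1.27, 0.5, 3.14159, -0.00042])
raw = weights.tobytes()

# Lossless path: compress, decompress, and compare the raw bytes.
restored = zlib.decompress(zlib.compress(raw))
lossless_identical = restored == raw

# Lossy path: symmetric int8 quantization, then dequantization.
scale = max(abs(w) for w in weights) / 127
quantized = [round(w / scale) for w in weights]
dequantized = array("f", [q * scale for q in quantized])
quantized_identical = dequantized.tobytes() == raw

print("lossless round trip bit-identical:", lossless_identical)
print("int8 round trip bit-identical:", quantized_identical)
```

The lossless round trip returns the same bytes, while the quantized round trip does not, which is why only the former is safe for bit-exact reproducibility.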
 ## Compression stats
 | Original | Compressed | Ratio | Format | Verified |
 |--|--|--|--|--|
 | 0.55 GB | 0.39 GB | 70.9% | FP32 | md5 every tensor |
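
The "md5 every tensor" check and the ratio in the table can be reproduced in spirit with the standard library. The sketch below is an assumption about how such a check could look, not BigSmall's verification code: the tensor names are illustrative, `array` buffers stand in for tensors, and `restored` is a stand-in for weights that have been decompressed.

```python
import hashlib
from array import array


def tensor_md5s(tensors):
    """Return an md5 hex digest of the raw bytes of every tensor, keyed by name."""
    return {name: hashlib.md5(t.tobytes()).hexdigest() for name, t in tensors.items()}


# Illustrative tensors; names and values are hypothetical.
original = {
    "wte.weight": array("f", [0.1, 0.2, 0.3]),
    "ln_f.bias": array("f", [0.0, -1.5]),
}
# Stand-in for the decompressed weights: an exact copy of the originals.
restored = {name: array("f", t) for name, t in original.items()}

expected = tensor_md5s(original)
actual = tensor_md5s(restored)
mismatched = [name for name in expected if actual[name] != expected[name]]

# Changing a single value makes that tensor's digest differ.
tampered = {name: array("f", t) for name, t in original.items()}
tampered["ln_f.bias"][0] = 1e-6
tampered_mismatches = [
    name for name, digest in tensor_md5s(tampered).items() if digest != expected[name]
]

# Ratio as used in the table: compressed size divided by original size.
ratio = 0.39 / 0.55
print(f"mismatched tensors: {mismatched}, ratio: {ratio:.1%}")
```

With the rounded sizes from the table, the computed ratio matches the listed 70.9%.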
 - GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
+- All models: [huggingface.co/wpferrell](https://huggingface.co/wpferrell)