wpferrell committed on
Commit
307423d
·
verified ·
1 Parent(s): d0677c7

Clarify BigSmall vs DFloat11 positioning

Files changed (1)
  1. README.md +38 -42
README.md CHANGED
@@ -5,66 +5,62 @@ tags:
5
  - compression
6
  - lossless
7
  - gpt2
 
8
  ---
9
 
10
- # GPT-2 (BigSmall compressed)
11
 
12
- **548MB -> 414MB (75.5%). Bit-identical. Under 500MB peak RAM with streaming.**
13
 
14
- This is GPT-2 117M compressed with [BigSmall](https://github.com/wpferrell/Bigsmall) — lossless neural network weight compression. Not quantization. Every weight is bit-identical to the original.
15
 
16
- ## Install
17
-
18
- ```bash
19
- pip install bigsmall
20
- ```
21
 
22
- ## Load and run inference (streaming)
23
 
24
- ```python
25
- from bigsmall import StreamingLoader
26
- from transformers import GPT2LMHeadModel, GPT2Tokenizer
 
 
 
27
 
28
- # Streams one layer at a time under 500MB peak RAM
29
- loader = StreamingLoader("wpferrell/gpt2-bigsmall")
30
- model = loader.load_model(GPT2LMHeadModel)
31
- tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
32
 
33
- inputs = tokenizer("Hello, I'm a language model", return_tensors="pt")
34
- outputs = model.generate(**inputs, max_new_tokens=50)
35
- print(tokenizer.decode(outputs[0]))
36
- ```
37
 
38
- ## Or decompress to disk first
39
 
40
- ```python
41
- from bigsmall import from_pretrained
42
- from transformers import GPT2LMHeadModel
43
- model = from_pretrained("wpferrell/gpt2-bigsmall", model_class=GPT2LMHeadModel)
44
- ```
45
 
46
- ## What's inside
47
 
48
- | File | Original | Compressed | Ratio |
49
- |------|----------|------------|-------|
50
- | model.safetensors (FP32) | 548 MB | 414 MB | 75.5% |
51
 
52
- Verified lossless: md5 of every weight tensor matches original after decompression.
 
 
53
 
54
- ## Comparison
55
 
56
- | Tool | BF16 Ratio | FP32 Ratio | Inference Overhead | Hardware |
57
- |------|------------|------------|-------------------|---------|
58
- | [ZipNN](https://arxiv.org/abs/2411.05239) | 67% | 83% | None | CPU |
59
- | [DFloat11](https://arxiv.org/abs/2504.11651) | ~70% | BF16 only | ~2x at batch=1 | CUDA only |
60
- | [ZipServ](https://arxiv.org/abs/2603.17435) | ~70% | BF16 only | 1.22x faster | GDDR GPU |
61
- | **BigSmall** | **65.6%** | **75.5%** | **None** | **CPU + any GPU** |
62
 
63
- *Lower ratio = better compression. BigSmall BF16 measured on Mistral 7B.*
 
 
64
 
65
- ## About BigSmall
66
 
67
- BigSmall compresses at the joint entropy floor for neural network weights. It codes sign+exponent jointly and mantissa conditioned on exponent, achieving the information-theoretic minimum. The streaming loader decompresses one transformer layer at a time directly into VRAM.
 
 
68
 
69
  - GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
70
- - PyPI: `pip install bigsmall`
 
 
5
  - compression
6
  - lossless
7
  - gpt2
8
+ - openai
9
  ---
10
 
11
+ # GPT-2 117M (BigSmall compressed)
12
 
13
+ **0.55 GB -> 0.39 GB (FP32). Full quality -- not quantization. Zero inference overhead.**
14
 
15
+ Losslessly compressed with [BigSmall](https://github.com/wpferrell/Bigsmall). Every weight is bit-identical to the original. Decompresses once at load time then runs at full native speed -- no inference overhead, ever.
16
 
17
+ ## BigSmall vs DFloat11 -- what is the difference?
 
 
 
 
18
 
19
+ Both are lossless. The difference is *when* decompression happens:
20
 
21
+ | | BigSmall | DFloat11 |
22
+ |--|--|--|
23
+ | Decompresses | Once at load time | Every forward pass on GPU |
24
+ | Inference overhead | **None** | ~2x slower at batch=1 |
25
+ | Hardware | **CPU, Apple Silicon, AMD, any GPU** | CUDA only |
26
+ | Use case | Smaller downloads, faster loads | Less VRAM during inference |
27
 
28
+ **Use BigSmall if** you want to download less, load faster, and run at full native speed on any hardware.
29
+ **Use DFloat11 if** you need the model to stay compressed in GPU memory during inference and have a CUDA GPU.
 
 
30
 
 
 
 
 
31
 
32
+ ## Install
33
 
34
+ ```bash
35
+ pip install bigsmall
36
+ ```
 
 
37
 
38
+ ## Load (transparent -- works like any HuggingFace model)
39
 
40
+ ```python
41
+ import bigsmall
42
+ bigsmall.install_hook()
43
 
44
+ from transformers import AutoModelForCausalLM
45
+ model = AutoModelForCausalLM.from_pretrained("wpferrell/gpt2-bigsmall")
46
+ ```
47
 
48
+ ## Or stream layer by layer (peak RAM under 2GB even for 7B models)
49
 
50
+ ```python
51
+ from bigsmall import StreamingLoader
52
+ from transformers import AutoModelForCausalLM
 
 
 
53
 
54
+ with StreamingLoader("wpferrell/gpt2-bigsmall", device="cuda") as loader:
55
+     model = loader.load_model(AutoModelForCausalLM)
56
+ ```
57
 
58
+ ## Compression stats
59
 
60
+ | Original | Compressed | Ratio | Format | Lossless |
61
+ |----------|------------|-------|--------|---------|
62
+ | 0.55 GB | 0.39 GB | 70.9% | FP32 | md5 verified every tensor |
63
 
64
  - GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
65
+ - PyPI: `pip install bigsmall`
66
+ - All pre-compressed models: [huggingface.co/wpferrell](https://huggingface.co/wpferrell)
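
The stats table claims "md5 verified every tensor". A minimal sketch of such a check is shown below. It is not BigSmall's own verification code: the helper names `tensor_md5s` and `find_mismatches` are hypothetical, and the sketch assumes each tensor is available as raw bytes (for FP32 PyTorch weights, something like `t.detach().cpu().numpy().tobytes()` would be one way to obtain them).

```python
import hashlib
import struct


def tensor_md5s(tensors):
    """Map each tensor name to the md5 hex digest of its raw bytes.

    `tensors` is assumed to be a dict of name -> bytes (hypothetical layout).
    """
    return {name: hashlib.md5(raw).hexdigest() for name, raw in tensors.items()}


def find_mismatches(original, restored):
    """Return the sorted names whose digests differ or that exist on one side only."""
    a, b = tensor_md5s(original), tensor_md5s(restored)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


# Toy stand-ins for two FP32 tensors, packed as little-endian float32 values.
original = {
    "wte.weight": struct.pack("<4f", 0.1, -0.2, 0.3, 0.4),
    "ln_f.bias": struct.pack("<2f", 0.0, 1.0),
}

# A bit-identical round trip reports no mismatches.
print(find_mismatches(original, dict(original)))

# Flipping a single bit in one tensor is enough to flag it.
corrupted = bytearray(original["ln_f.bias"])
corrupted[0] ^= 1
print(find_mismatches(original, {**original, "ln_f.bias": bytes(corrupted)}))
```

Hashing raw bytes rather than comparing values with a tolerance is what makes this a bit-identical check: any single flipped bit changes the digest.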