wpferrell committed on
Commit
840d2b4
·
verified ·
1 Parent(s): f7335f8

Add streaming loader guide and hardware table

  1. README.md +30 -29
README.md CHANGED
@@ -12,46 +12,24 @@ tags:
12
 
13
  **0.55 GB -> 0.39 GB (FP32). Lossless. Zero inference overhead. Any hardware.**
14
 
15
- Compressed with [BigSmall](https://github.com/wpferrell/Bigsmall) -- decompresses once at load time, then runs at full native speed. Every weight is bit-identical to the original.
16
 
17
- ## Why BigSmall
18
-
19
- ### vs quantization (llama.cpp, GGUF, AWQ, bitsandbytes)
20
- Quantization permanently degrades weights. BigSmall is lossless -- bit-identical weights, no accuracy loss, fine-tuning safe, fully reproducible.
21
-
22
- ### vs DFloat11 (runtime lossless compression)
23
- DFloat11 keeps weights compressed during inference -- saves VRAM but adds ~2x overhead at batch=1, CUDA only. BigSmall decompresses once at load time and runs at full native speed on any hardware.
24
-
25
- | | BigSmall | DFloat11 |
26
- |--|--|--|
27
- | Compression ratio (BF16) | **65-66%** | ~70% |
28
- | Inference overhead | **None** | ~2x at batch=1 |
29
- | Hardware | **CPU, Apple Silicon, AMD, any GPU** | CUDA only |
30
- | FP32 / FP16 / FP8 support | **Yes** | BF16 only |
31
- | Fine-tuning safe | **Yes** | No |
32
- | Streaming loader (< 2GB RAM) | **Yes** | No |
33
-
34
- ### vs ZipNN (storage lossless compression)
35
- Same category as BigSmall -- decompresses at load time. BigSmall compresses better (65% vs 67% BF16) and supports more formats. BigSmall also has a streaming loader so you can run 70B models with under 2GB peak RAM.
36
-
37
-
38
- ## Install
39
 
40
  ```bash
41
  pip install bigsmall
42
  ```
43
 
44
- ## Load
45
-
46
  ```python
47
  import bigsmall
48
  bigsmall.install_hook()
49
-
50
  from transformers import AutoModelForCausalLM
51
  model = AutoModelForCausalLM.from_pretrained("wpferrell/gpt2-bigsmall")
52
  ```
53
 
54
- ## Stream layer by layer (peak RAM under 2GB even for 7B models)
 
 
55
 
56
  ```python
57
  from bigsmall import StreamingLoader
@@ -61,6 +39,30 @@ with StreamingLoader("wpferrell/gpt2-bigsmall", device="cuda") as loader:
61
  model = loader.load_model(AutoModelForCausalLM)
62
  ```
63
 
64
  ## Compression stats
65
 
66
  | Original | Compressed | Ratio | Format | Verified |
@@ -68,5 +70,4 @@ with StreamingLoader("wpferrell/gpt2-bigsmall", device="cuda") as loader:
68
  | 0.55 GB | 0.39 GB | 70.9% | FP32 | md5 every tensor |
69
 
70
  - GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
71
- - PyPI: pip install bigsmall
72
- - All pre-compressed models: [huggingface.co/wpferrell](https://huggingface.co/wpferrell)
 
12
 
13
  **0.55 GB -> 0.39 GB (FP32). Lossless. Zero inference overhead. Any hardware.**
14
 
15
+ Compressed with [BigSmall](https://github.com/wpferrell/Bigsmall) -- decompresses once at load time, runs at full native speed. Every weight is bit-identical to the original.
16
 
17
+ ## Quick start
18
 
19
  ```bash
20
  pip install bigsmall
21
  ```
22
 
 
 
23
  ```python
24
  import bigsmall
25
  bigsmall.install_hook()
 
26
  from transformers import AutoModelForCausalLM
27
  model = AutoModelForCausalLM.from_pretrained("wpferrell/gpt2-bigsmall")
28
  ```
29
 
30
+ ## Streaming loader -- run on any hardware
31
+
32
+ BigSmall's streaming loader decompresses one layer at a time directly into VRAM. Peak memory is the size of a single layer, not the whole model, so a 4 GB GPU can run Mistral 7B losslessly.
33
 
34
  ```python
35
  from bigsmall import StreamingLoader
 
39
  model = loader.load_model(AutoModelForCausalLM)
40
  ```
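
To give a feel for the "one layer, not the whole model" figure, here is a back-of-the-envelope estimate. It is only a sketch: the helper `layer_footprint_gb` is hypothetical and not part of BigSmall, the parameter count (about 7.2e9) and layer count (32) are assumed round figures for a 7B-class model, and the calculation spreads parameters evenly across layers while ignoring embeddings, activations, and the KV cache.

```python
def layer_footprint_gb(total_params: float, num_layers: int, bytes_per_param: int) -> float:
    """Rough size of one layer in GB, assuming parameters are spread evenly across layers."""
    return total_params * bytes_per_param / num_layers / 1e9


# Assumed figures for illustration only: ~7.2e9 parameters, 32 layers, BF16 (2 bytes each).
TOTAL_PARAMS = 7.2e9
NUM_LAYERS = 32
BYTES_PER_PARAM = 2

full_model_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
one_layer_gb = layer_footprint_gb(TOTAL_PARAMS, NUM_LAYERS, BYTES_PER_PARAM)

print(f"full model: {full_model_gb:.1f} GB, one layer: {one_layer_gb:.2f} GB")
```

Under these assumptions a single layer is a small fraction of the full model, which is the gap a layer-by-layer loader can exploit.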
41
 
42
+ | Your GPU | Models you can run |
43
+ |----------|--------------------|
44
+ | 2 GB | Small models, GPT-2, Gemma 270M |
45
+ | 4 GB | Mistral 7B, Llama 3.1 8B, Gemma 2B, Llama 3.2 3B |
46
+ | 8 GB | Qwen 2.5 14B, Gemma 2 9B |
47
+ | 24 GB | Llama 70B, Qwen 72B, DeepSeek V4-Flash |
48
+ | CPU only | Everything -- slower but full quality |
49
+
50
+ BigSmall is the only lossless compression tool with a streaming loader. DFloat11 and ZipNN load the full model into memory.
51
+
52
+ ## Why BigSmall vs DFloat11
53
+
54
+ | | BigSmall | DFloat11 |
55
+ |--|--|--|
56
+ | Inference overhead | **None** | ~2x at batch=1 |
57
+ | Hardware | **CPU, Apple Silicon, AMD, any GPU** | CUDA only |
58
+ | FP32 support | **Yes** | No |
59
+ | Fine-tuning safe | **Yes** | No |
60
+ | Streaming loader | **Yes -- peak RAM < 2 GB** | No |
61
+
62
+ ## Why BigSmall vs quantization
63
+
64
+ Lossless -- bit-identical weights, no accuracy loss, fine-tuning safe, reproducible outputs.
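
The difference between the two approaches can be illustrated with a toy round trip. This sketch does not use BigSmall's codec: `zlib` stands in for any lossless compressor, and the int8 scheme is a simplified symmetric quantizer chosen for illustration, not the method of any particular quantization library. The weight values are made up.

```python
import struct
import zlib

# Made-up weights, stored as little-endian FP32 bytes.
weights = [0.1234567, -1.5, 3.1415927, 0.0009765625, -0.3333333]
fmt = f"<{len(weights)}f"
raw = struct.pack(fmt, *weights)

# Lossless path (zlib as a stand-in): compress, decompress, compare bytes.
restored = zlib.decompress(zlib.compress(raw))
lossless_ok = restored == raw

# Quantized path: simplified symmetric int8 quantization, then dequantization.
fp32 = struct.unpack(fmt, raw)
scale = max(abs(w) for w in fp32) / 127
quantized = [round(w / scale) for w in fp32]
dequantized = struct.pack(fmt, *(q * scale for q in quantized))
quantized_ok = dequantized == raw

print("lossless round trip bit-identical:", lossless_ok)
print("int8 round trip bit-identical:", quantized_ok)
```

The lossless path returns exactly the bytes it was given, while the quantized path snaps each weight to one of 255 levels and cannot recover the original bits.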
65
+
66
  ## Compression stats
67
 
68
  | Original | Compressed | Ratio | Format | Verified |
 
70
  | 0.55 GB | 0.39 GB | 70.9% | FP32 | md5 every tensor |
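
A per-tensor MD5 check like the one named in the "Verified" column could look roughly like the sketch below. This is an assumption about the general technique, not BigSmall's actual verification code: `tensor_digests` and `mismatched` are hypothetical helpers, tensors are represented as raw bytes to keep the example dependency-free, and the tensor names are illustrative.

```python
import hashlib
import struct


def tensor_digests(tensors):
    """Map each tensor name to the MD5 hex digest of its raw bytes."""
    return {name: hashlib.md5(data).hexdigest() for name, data in tensors.items()}


def mismatched(original, restored):
    """Return the sorted names whose digests differ or that exist on one side only."""
    a, b = tensor_digests(original), tensor_digests(restored)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


# Illustrative tensors stored as little-endian FP32 bytes.
original = {
    "wte.weight": struct.pack("<4f", 0.1, 0.2, 0.3, 0.4),
    "ln_f.bias": struct.pack("<2f", 0.0, 1.0),
}
restored = dict(original)

# Same tensors, but one value is off by a single FP32 step above 1.0.
corrupted = dict(original)
corrupted["ln_f.bias"] = struct.pack("<2f", 0.0, 1.0 + 2**-23)

print(mismatched(original, restored))
print(mismatched(original, corrupted))
```

Hashing the raw bytes rather than comparing values with a tolerance means even a one-bit difference in a single weight is reported.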
71
 
72
  - GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
73
+ - All models: [huggingface.co/wpferrell](https://huggingface.co/wpferrell)