johnsonchromia committed
Commit 64c1f18 · verified · 1 Parent(s): 4992ee1

README: add wllama-safe Q4_K_M + Q2_K variants (embed @ q5_K)

  1. README.md +47 -16
README.md CHANGED
@@ -24,8 +24,9 @@ pipeline_tag: text-generation
24
  > use it and for complying with all applicable laws.
25
 
26
  GGUF quantizations of [`evalengine/unbound-e4b`](https://huggingface.co/evalengine/unbound-e4b)
27
- for on-device deployment via Ollama, llama.cpp, LM Studio, and similar
28
- desktop runtimes.
29
 
30
  Built by [Chromia](https://x.com/Chromia) and [Eval Engine](https://x.com/eval_engine).
31
 
@@ -35,20 +36,35 @@ All quants ship as **split multi-part GGUFs** (`*-00001-of-0000N.gguf` ...)
35
  so desktop runtimes can parallel-download chunks. Ollama, llama.cpp, and LM
36
  Studio auto-stitch on the first part, so the UX is the same as a single file.
37
 
38
- | Quant | Parts | Total | Largest part | wllama (browser) | Desktop (Ollama / llama.cpp / LM Studio) | Notes |
39
- |---------|-------|---------|--------------|------------------|------------------------------------------|-------|
40
- | Q2_K | 4 | 4.08 GB | ~2.15 GB | ❌ | ✅ | Smallest disk footprint; biggest quality drop |
41
- | Q3_K_M | 4 | 4.49 GB | ~2.15 GB | ❌ | ✅ | Modest size win over Q4 (embedding precision dominates) |
42
- | Q4_K_M | 4 | 4.94 GB | ~2.15 GB | ❌ | ✅ | **Recommended on-device default: best size/quality** |
43
- | Q6_K | 5 | 5.75 GB | ~2.2 GB | ❌ | ✅ | Higher fidelity |
44
- | Q8_0 | 6 | 7.43 GB | ~2.7 GB | ❌ | ✅ | Highest fidelity |
45
-
46
- **Why no wllama:** E4B's `per_layer_token_embd` is a single atomic tensor
47
- that exceeds 2 GB in every quant we ship (the smallest one, Q4_K_M, lands
48
- at ~2.2 GB for that tensor). wllama's underlying browser ArrayBuffer caps at
49
- 2 GB, so no split scheme can fit. For in-browser inference, use
50
- [`evalengine/unbound-e2b-GGUF`](https://huggingface.co/evalengine/unbound-e2b-GGUF)
51
- instead; its tensors are small enough that Q4_K_M and Q6_K both fit.
52
 
53
  ## Recommended sampling
54
 
@@ -76,6 +92,21 @@ ollama run hf.co/evalengine/unbound-e4b-GGUF
76
  ./llama-cli -m unbound-e4b-Q4_K_M-00001-of-00004.gguf -p "your prompt"
77
  ```
78
 
79
  ## About the base
80
 
81
  See [`evalengine/unbound-e4b`](https://huggingface.co/evalengine/unbound-e4b)
 
24
  > use it and for complying with all applicable laws.
25
 
26
  GGUF quantizations of [`evalengine/unbound-e4b`](https://huggingface.co/evalengine/unbound-e4b)
27
+ for on-device deployment via Ollama, llama.cpp, LM Studio,
28
+ [wllama](https://github.com/ngxson/wllama) (in-browser; see the wllama
29
+ section below), and similar runtimes.
30
 
31
  Built by [Chromia](https://x.com/Chromia) and [Eval Engine](https://x.com/eval_engine).
32
 
 
36
  so desktop runtimes can parallel-download chunks. Ollama, llama.cpp, and LM
37
  Studio auto-stitch on the first part, so the UX is the same as a single file.
38
 
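The split naming scheme described above (`*-00001-of-0000N.gguf`) can be illustrated with a small sketch that expands the first part's filename into the full list of parts. `splitPartNames` is a hypothetical helper written for this note, not part of any runtime's API; the runtimes named above resolve the remaining parts internally, so this is only useful if you want to pre-fetch or verify the files yourself.

```js
// Hypothetical helper: expand the first part of a split GGUF into all part names.
// Assumes the five-digit "-00001-of-0000N.gguf" suffix shown in this README.
function splitPartNames(firstPart) {
  const match = /^(.*)-(\d{5})-of-(\d{5})\.gguf$/.exec(firstPart);
  if (!match) return [firstPart]; // no split suffix: treat as a single file

  // match[1] is the stem, match[2] the part index, match[3] the part count.
  const [, stem, , total] = match;
  const count = Number(total);

  return Array.from({ length: count }, (_, i) =>
    `${stem}-${String(i + 1).padStart(5, '0')}-of-${total}.gguf`
  );
}

console.log(splitPartNames('unbound-e4b-Q4_K_M-00001-of-00004.gguf'));
```

Passing a name without the split suffix returns that name unchanged, so the helper can be applied to single-file GGUFs as well.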
39
+ ### Desktop builds (Ollama / llama.cpp / LM Studio)
40
+
41
+ These keep `per_layer_token_embd` at the llama.cpp default of Q6_K, which
42
+ maximizes quality but pushes the largest split part above 2 GB. That is fine on
43
+ desktop, but these builds won't load in a browser.
44
+
45
+ | Quant | Parts | Total | Largest part | Notes |
46
+ |---------|-------|---------|--------------|-------|
47
+ | Q2_K | 4 | 4.08 GB | ~2.15 GB | Smallest disk footprint; biggest quality drop |
48
+ | Q3_K_M | 4 | 4.49 GB | ~2.15 GB | Modest size win over Q4 (embedding precision dominates total size) |
49
+ | Q4_K_M | 4 | 4.94 GB | ~2.15 GB | **Recommended desktop default: best size/quality** |
50
+ | Q6_K | 5 | 5.75 GB | ~2.2 GB | Higher fidelity |
51
+ | Q8_0 | 6 | 7.43 GB | ~2.7 GB | Highest fidelity |
52
+
53
+ ### Browser builds (wllama)
54
+
55
+ E4B's `per_layer_token_embd` is a single 2.82-billion-value tensor; at the
56
+ default Q6_K precision it lands at ~2.2 GB, just over the browser
57
+ ArrayBuffer cap. These variants force the embedding tensors to `q5_K`,
58
+ which shrinks the largest part below 2 GB at near-zero quality cost.
59
+
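The size argument in the paragraph above can be checked with a few lines of arithmetic. The sketch below assumes the rounded 2.82-billion element count quoted in the README, the llama.cpp k-quant super-block layouts (256 weights per block; 210 bytes for Q6_K, 176 bytes for Q5_K), and that the "2 GB" ArrayBuffer cap means 2 GiB (2^31 bytes). The function names are illustrative only.

```js
// Each k-quant super-block packs 256 weights into a fixed number of bytes.
const BLOCK_WEIGHTS = 256;
const BLOCK_BYTES = {
  Q6_K: 210, // 6.5625 bits per weight
  Q5_K: 176, // 5.5 bits per weight
};

const ELEMENTS = 2.82e9;          // rounded element count quoted in the README
const ARRAY_BUFFER_CAP = 2 ** 31; // assumption: the "2 GB" cap is 2 GiB

// Bytes needed to store `elements` weights at the given quant type.
function tensorBytes(elements, quant) {
  return Math.ceil(elements / BLOCK_WEIGHTS) * BLOCK_BYTES[quant];
}

// True when the whole tensor fits under the assumed browser cap.
function fitsInBrowser(elements, quant) {
  return tensorBytes(elements, quant) < ARRAY_BUFFER_CAP;
}

for (const quant of Object.keys(BLOCK_BYTES)) {
  const gib = tensorBytes(ELEMENTS, quant) / 2 ** 30;
  console.log(`${quant}: ${gib.toFixed(2)} GiB, fits: ${fitsInBrowser(ELEMENTS, quant)}`);
}
```

Under these assumptions the tensor comes to roughly 2.15 GiB at Q6_K and roughly 1.8 GiB at q5_K, consistent with the figures in the tables: the first is just over the cap and the second is under it.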
60
+ | Quant variant | Parts | Total | Largest part | wllama | Notes |
61
+ |---------------------|-------|---------|--------------|--------|-------|
62
+ | **Q4_K_M-wllama** | 4 | 4.51 GB | **1848 MB** | ✅ | **Recommended browser default**: layers @ Q4_K_M, embed @ q5_K |
63
+ | **Q2_K-wllama** | 4 | 3.69 GB | **1848 MB** | ✅ | Smallest browser-loadable build: layers @ Q2_K, embed @ q5_K |
64
+
65
+ The wllama files use the same split-multi-part naming, so Ollama / llama.cpp
66
+ will auto-stitch them too if you prefer the smaller embed quant for any
67
+ reason.
68
 
69
  ## Recommended sampling
70
 
 
92
  ./llama-cli -m unbound-e4b-Q4_K_M-00001-of-00004.gguf -p "your prompt"
93
  ```
94
 
95
+ ## Run in the browser (wllama)
96
+
97
+ [wllama](https://github.com/ngxson/wllama) is a WebAssembly port of llama.cpp
98
+ that runs entirely in the browser. Use one of the wllama-safe variants
99
+ above:
100
+
101
+ ```js
102
+ import { Wllama } from '@wllama/wllama';
103
+ const wllama = new Wllama(/* … */);
104
+ await wllama.loadModelFromHF(
105
+ 'evalengine/unbound-e4b-GGUF',
106
+ 'unbound-e4b-Q4_K_M-wllama-00001-of-00004.gguf' // wllama follows the chain
107
+ );
108
+ ```
109
+
110
  ## About the base
111
 
112
  See [`evalengine/unbound-e4b`](https://huggingface.co/evalengine/unbound-e4b)