johnsonchromia committed
Commit 6e58098 · verified · 1 Parent(s): 377f26c

README: folder-per-quant layout + wllama -> -web naming

Files changed (1): README.md +23 -21
README.md CHANGED
@@ -32,9 +32,9 @@ Built by [Chromia](https://x.com/Chromia) and [Eval Engine](https://x.com/eval_e

  ## Available quants

- All quants ship as **split multi-part GGUFs** (`*-00001-of-0000N.gguf` ...)
- so desktop runtimes can parallel-download chunks. Ollama, llama.cpp, and LM
- Studio auto-stitch on the first part - same UX as a single file.
+ Each quant lives in its own folder; inside, the model is split into multi-part
+ GGUFs (`*-00001-of-0000N.gguf` ...). Ollama, llama.cpp, LM Studio, and wllama
+ auto-stitch on the first part - same UX as a single file.

  ### Desktop builds (Ollama / llama.cpp / LM Studio)

@@ -42,28 +42,30 @@ These keep `per_layer_token_embd` at the llama.cpp default of Q6_K, which
  maximizes quality but pushes the largest split part above 2 GB - fine for
  desktop, won't load in browser.

- | Quant | Parts | Total | Largest part | Notes |
- |---------|-------|---------|--------------|-------|
- | Q2_K | 4 | 4.08 GB | ~2.15 GB | Smallest disk footprint; biggest quality drop |
- | Q3_K_M | 4 | 4.49 GB | ~2.15 GB | Modest size win over Q4 (embedding precision dominates total size) |
- | Q4_K_M | 4 | 4.94 GB | ~2.15 GB | **Recommended desktop default - best size/quality** |
- | Q6_K | 5 | 5.75 GB | ~2.2 GB | Higher fidelity |
- | Q8_0 | 6 | 7.43 GB | ~2.7 GB | Highest fidelity |
+ | Quant | Folder | Parts | Total | Notes |
+ |---------|-------------|-------|---------|-------|
+ | Q2_K | `Q2_K/` | 4 | 4.08 GB | Smallest disk footprint; biggest quality drop |
+ | Q3_K_M | `Q3_K_M/` | 4 | 4.49 GB | Modest size win over Q4 (embedding precision dominates total size) |
+ | Q4_K_M | `Q4_K_M/` | 4 | 4.94 GB | **Recommended desktop default - best size/quality** |
+ | Q6_K | `Q6_K/` | 5 | 5.75 GB | Higher fidelity |
+ | Q8_0 | `Q8_0/` | 6 | 7.43 GB | Highest fidelity |

  ### Browser builds (wllama)

  E4B's `per_layer_token_embd` is a single 2.82-billion-value tensor; at the
  default Q6_K precision it lands at ~2.2 GB, just over the browser
  ArrayBuffer cap. These variants force the embedding tensors to `q5_K`,
- which shrinks the largest part below 2 GB at near-zero quality cost.
+ which shrinks the largest part below 2 GB at near-zero quality cost. The
+ folder names use a `-web` suffix to mark them.

- | Quant variant | Parts | Total | Largest part | wllama | Notes |
- |---------------------|-------|---------|--------------|--------|-------|
- | **Q4_K_M-wllama** | 4 | 4.51 GB | **1848 MB** | ✅ | **Recommended browser default** - layers @ Q4_K_M, embed @ q5_K |
- | **Q2_K-wllama** | 4 | 3.69 GB | **1848 MB** | ✅ | Smallest browser-loadable build - layers @ Q2_K, embed @ q5_K |
+ | Quant variant | Folder | Parts | Total | wllama | Notes |
+ |-------------------|----------------|-------|---------|--------|-------|
+ | **Q4_K_M-web** | `Q4_K_M-web/` | 4 | 4.51 GB | ✅ | **Recommended browser default** - layers @ Q4_K_M, embed @ q5_K |
+ | **Q2_K-web** | `Q2_K-web/` | 4 | 3.69 GB | ✅ | Smallest browser-loadable build - layers @ Q2_K, embed @ q5_K |

- The wllama files use the same split-multi-part naming, so Ollama / llama.cpp
- will auto-stitch them too if you prefer the smaller embed quant for any
+ The web files keep the canonical quant tag in the filename (so HF GGUF cards
+ render correctly) and use the same split-multi-part scheme, so Ollama and
+ llama.cpp will auto-stitch them too if you prefer the smaller embed quant for any
  reason.

  ## Recommended sampling
@@ -89,7 +91,7 @@ ollama run hf.co/evalengine/unbound-e4b-GGUF

  ```bash
  # point at the FIRST part - llama.cpp follows the chain automatically
- ./llama-cli -m unbound-e4b-Q4_K_M-00001-of-00004.gguf -p "your prompt"
+ ./llama-cli -m Q4_K_M/unbound-e4b-Q4_K_M-00001-of-00004.gguf -p "your prompt"
  ```

  ## Vision / image input (optional)
@@ -114,7 +116,7 @@ image-to-text inference.

  ```bash
  ./llama-mtmd-cli \
- -m unbound-e4b-Q4_K_M-00001-of-00004.gguf \
+ -m Q4_K_M/unbound-e4b-Q4_K_M-00001-of-00004.gguf \
  --mmproj mmproj-unbound-e4b.gguf \
  --image path/to/your/image.png \
  -p "What is in this image?"
@@ -125,7 +127,7 @@ image-to-text inference.
  ### Run text-only (no `--mmproj`)

  ```bash
- ./llama-cli -m unbound-e4b-Q4_K_M-00001-of-00004.gguf -p "your prompt"
+ ./llama-cli -m Q4_K_M/unbound-e4b-Q4_K_M-00001-of-00004.gguf -p "your prompt"
  ```

  The LM quants work standalone - you do **not** need `mmproj-unbound-e4b.gguf`
@@ -145,7 +147,7 @@ import { Wllama } from '@wllama/wllama';
  const wllama = new Wllama(/* … */);
  await wllama.loadModelFromHF(
  'evalengine/unbound-e4b-GGUF',
- 'unbound-e4b-Q4_K_M-wllama-00001-of-00004.gguf' // wllama follows the chain
+ 'Q4_K_M-web/unbound-e4b-Q4_K_M-00001-of-00004.gguf' // wllama follows the chain
  );
  ```

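The size claim in the README's "Browser builds" paragraph can be sanity-checked with a little arithmetic. The sketch below is only an estimate under stated assumptions: the block sizes (210 bytes per 256 weights for Q6_K, 176 bytes per 256 weights for Q5_K) are taken from llama.cpp's k-quant layouts as I understand them, the ArrayBuffer cap is assumed to be 2 GiB, and 2.82 billion is the README's rounded value count. The names `BLOCK_BYTES`, `ARRAY_BUFFER_CAP`, and `tensorBytes` are illustrative and not part of wllama or llama.cpp.

```javascript
// Bytes per 256-weight super-block (assumed from llama.cpp's k-quant layouts).
const BLOCK_BYTES = { Q6_K: 210, Q5_K: 176 };
const BLOCK_WEIGHTS = 256;

// Assumed browser ArrayBuffer cap of 2 GiB; real limits vary by browser.
const ARRAY_BUFFER_CAP = 2 * 1024 ** 3;

// Approximate size in bytes of a tensor with `nValues` weights at a given quant.
function tensorBytes(nValues, quant) {
  return (nValues / BLOCK_WEIGHTS) * BLOCK_BYTES[quant];
}

const N_VALUES = 2.82e9; // rounded figure quoted in the README

for (const quant of ["Q6_K", "Q5_K"]) {
  const bytes = tensorBytes(N_VALUES, quant);
  const gib = (bytes / 1024 ** 3).toFixed(2);
  const verdict = bytes < ARRAY_BUFFER_CAP ? "fits under" : "exceeds";
  console.log(`${quant}: ~${gib} GiB, ${verdict} the assumed 2 GiB cap`);
}
```

Under these assumptions the Q6_K tensor comes out at roughly 2.15 GiB and the Q5_K tensor at roughly 1.81 GiB (about 1848 MiB), which is consistent with the README's statement that forcing the embedding tensors to `q5_K` brings the largest part below 2 GB.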