johnsonchromia committed on
Commit f38886b · verified · 1 Parent(s): 20475c0

README: rewrite paths for canonical flat layout

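The path rewrite in this commit follows a regular pattern, which can be sketched as a small helper. `toFlatPath` is a hypothetical name, and the mapping is inferred only from the example paths changed in this diff (`Q4_K_M/...` and `Q4_K_M-web/...`), so treat it as an illustration rather than the tooling actually used.

```js
// Hypothetical helper: map an old per-folder path to the flat layout.
// Assumes old paths look like "<QUANT>[-web]/unbound-e4b-<QUANT>-NNNNN-of-NNNNN.gguf".
function toFlatPath(oldPath) {
  const m = /^([^/]+?)(-web)?\/unbound-e4b-(.+)$/.exec(oldPath);
  if (!m) return oldPath; // files already at the repo root are left alone
  const [, , web, rest] = m;
  // Web builds get the distinct "unbound-e4b-web" prefix; desktop builds keep "unbound-e4b".
  return `unbound-e4b${web ?? ''}.${rest}`;
}

console.log(toFlatPath('Q4_K_M/unbound-e4b-Q4_K_M-00001-of-00004.gguf'));
console.log(toFlatPath('Q4_K_M-web/unbound-e4b-Q4_K_M-00001-of-00004.gguf'));
```

Paths without a folder component, such as `mmproj-unbound-e4b.gguf`, pass through unchanged.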
Files changed (1)
  1. README.md +27 -29
README.md CHANGED
@@ -28,36 +28,34 @@ for Ollama, llama.cpp, LM Studio, and [wllama](https://github.com/ngxson/wllama)
28
 
29
  ## Available quants
30
 
31
- Each quant lives in its own folder; inside, the model is split into
32
- multi-part GGUFs. All runtimes auto-stitch on the first part, so it's the same UX as a
33
- single file.
34
 
35
- ### Desktop builds (Ollama / llama.cpp / LM Studio)
36
 
37
- Embedding tensor kept at the llama.cpp default of Q6_K; largest split part
38
- ~2.15 GB (fine for desktop, **won't load in browser**).
39
 
40
- | Quant | Folder | Parts | Total | Notes |
41
- |---------|-------------|-------|---------|-------|
42
- | Q2_K | `Q2_K/` | 4 | 4.08 GB | Smallest, biggest quality drop |
43
- | Q3_K_M | `Q3_K_M/` | 4 | 4.49 GB | Modest size win over Q4 |
44
- | Q4_K_M | `Q4_K_M/` | 4 | 4.94 GB | **Recommended desktop default** |
45
- | Q6_K | `Q6_K/` | 5 | 5.75 GB | Higher fidelity |
46
- | Q8_0 | `Q8_0/` | 6 | 7.43 GB | Highest fidelity |
47
 
48
- ### Browser builds (wllama)
49
 
50
- `per_layer_token_embd` is a 2.82B-value tensor; at the default Q6_K it
51
- lands at ~2.2 GB, over wllama's 2 GB ArrayBuffer cap. These variants force
52
- embeddings to `q5_K` (~1848 MB) so the largest part fits.
 
 
53
 
54
- | Quant variant | Folder | Parts | Total | Notes |
55
- |---------------|----------------|-------|---------|-------|
56
- | Q4_K_M-web | `Q4_K_M-web/` | 4 | 4.51 GB | **Recommended browser default**: layers @ Q4_K_M, embed @ q5_K |
57
- | Q2_K-web | `Q2_K-web/` | 4 | 3.69 GB | Smallest browser-loadable: layers @ Q2_K, embed @ q5_K |
58
-
59
- `mmproj-unbound-e4b.gguf` (vision projector, ~946 MB) sits at the repo
60
- root. See **Vision** below.
61
 
62
  ## Sampling
63
 
@@ -66,7 +64,7 @@ root. See **Vision** below.
66
  - llama.cpp: pass `--jinja`. Gemma 4 thinking mode is on by default; set
67
  `enable_thinking: false` in chat-template kwargs for shorter replies.
68
 
69
- For Ollama specifically, pull from the **Ollama Registry**:
70
  `ollama pull hf.co/...` [doesn't yet support sharded GGUFs](https://github.com/ollama/ollama/issues/5245).
71
  The registry version is a single-file Q4_K_M with a bundled Modelfile
72
  (`temperature=0.6, top_p=0.95, top_k=64, repeat_penalty=1.05, num_ctx=8192`
@@ -81,8 +79,8 @@ ollama run evalengine/unbound-e4b
81
  ```
82
 
83
  ```bash
84
- # llama.cpp: point at FIRST split part
85
- ./llama-cli -m Q4_K_M/unbound-e4b-Q4_K_M-00001-of-00004.gguf -p "your prompt"
86
  ```
87
 
88
  ```js
@@ -91,7 +89,7 @@ import { Wllama } from '@wllama/wllama';
91
  const wllama = new Wllama(/* … */);
92
  await wllama.loadModelFromHF(
93
  'evalengine/unbound-e4b-GGUF',
94
- 'Q4_K_M-web/unbound-e4b-Q4_K_M-00001-of-00004.gguf'
95
  );
96
  ```
97
 
@@ -102,7 +100,7 @@ await wllama.loadModelFromHF(
102
 
103
  ```bash
104
  ./llama-mtmd-cli \
105
- -m Q4_K_M/unbound-e4b-Q4_K_M-00001-of-00004.gguf \
106
  --mmproj mmproj-unbound-e4b.gguf \
107
  --image path/to/your/image.png \
108
  -p "What is in this image?"
 
28
 
29
  ## Available quants
30
 
31
+ Each quant is shipped as a sharded multi-part GGUF. Ollama, llama.cpp, LM
32
+ Studio, and wllama auto-stitch on the first part, so it's the same UX as a single file.
 
33
 
34
+ ### Desktop builds: `unbound-e4b.<QUANT>-NNNNN-of-NNNNN.gguf`
35
 
36
+ Embedding tensor kept at the llama.cpp default of Q6_K; largest part ~2.15 GB
37
+ (fine for desktop, **won't load in browser**).
38
 
39
+ | Quant | Parts | Total | Notes |
40
+ |---------|-------|---------|-------|
41
+ | Q2_K | 4 | 4.08 GB | Smallest, biggest quality drop |
42
+ | Q3_K_M | 4 | 4.49 GB | Modest size win over Q4 (embedding precision dominates) |
43
+ | Q4_K_M | 4 | 4.94 GB | **Recommended desktop default** |
44
+ | Q6_K | 5 | 5.75 GB | Higher fidelity |
45
+ | Q8_0 | 6 | 7.43 GB | Highest fidelity |
46
 
47
+ ### Browser builds: `unbound-e4b-web.<QUANT>-NNNNN-of-NNNNN.gguf`
48
 
49
+ E4B's `per_layer_token_embd` is a 2.82-billion-value tensor; at the default
50
+ Q6_K precision it lands at ~2.2 GB, over wllama's 2 GB ArrayBuffer cap.
51
+ These variants force embeddings to `q5_K` (~1848 MB) so the largest part
52
+ fits. They use a distinct `unbound-e4b-web` model prefix so HF's GGUF UI
53
+ doesn't aggregate them with the same-quant desktop files.
54
 
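The size figures above can be sanity-checked with a little arithmetic. The sketch below assumes llama.cpp's k-quant block layout (256 weights per super-block, 210 bytes for Q6_K and 176 bytes for Q5_K), which this README does not state, and it ignores GGUF metadata and alignment. `quantizedBytes` is a hypothetical helper, not part of wllama or llama.cpp.

```js
// Rough size estimate for a quantized tensor.
// Assumption: k-quant super-blocks hold 256 weights each (210 bytes for Q6_K, 176 for Q5_K).
const BYTES_PER_BLOCK = { Q6_K: 210, Q5_K: 176 };
const WEIGHTS_PER_BLOCK = 256;
const MIB = 1024 ** 2;

function quantizedBytes(numValues, quant) {
  return Math.ceil(numValues / WEIGHTS_PER_BLOCK) * BYTES_PER_BLOCK[quant];
}

const values = 2.82e9;      // approximate element count of per_layer_token_embd
const cap = 2 * 1024 * MIB; // the 2 GB ArrayBuffer cap, taken here as 2 GiB

for (const quant of ['Q6_K', 'Q5_K']) {
  const bytes = quantizedBytes(values, quant);
  const verdict = bytes <= cap ? 'fits under the cap' : 'exceeds the cap';
  console.log(`${quant}: ${(bytes / MIB).toFixed(0)} MiB, ${verdict}`);
}
```

With these assumptions Q6_K comes out a little over 2 GiB and Q5_K at roughly 1.8 GiB, which is consistent with the figures quoted above when "GB" and "MB" are read as binary units.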
55
+ | Variant | Parts | Total | Notes |
56
+ |-----------------|-------|---------|-------|
57
+ | Q4_K_M (web) | 4 | 4.51 GB | **Recommended browser default**: layers @ Q4_K_M, embed @ q5_K |
58
+ | Q2_K (web) | 4 | 3.69 GB | Smallest browser-loadable: layers @ Q2_K, embed @ q5_K |
 
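To make the flat naming scheme concrete, here is a minimal sketch that expands a prefix, quant, and part count into the full list of shard filenames. `shardNames` is a hypothetical helper; the five-digit zero padding is inferred from the example filenames in this README.

```js
// Hypothetical helper: list every shard filename for one build.
function shardNames(prefix, quant, parts) {
  const pad = (n) => String(n).padStart(5, '0');
  return Array.from(
    { length: parts },
    (_, i) => `${prefix}.${quant}-${pad(i + 1)}-of-${pad(parts)}.gguf`
  );
}

// Desktop builds use the "unbound-e4b" prefix, browser builds "unbound-e4b-web".
console.log(shardNames('unbound-e4b-web', 'Q4_K_M', 4)[0]);
// prints "unbound-e4b-web.Q4_K_M-00001-of-00004.gguf"
```

Only the first name in the list needs to be passed to a runtime; the remaining parts are located from it.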
 
 
59
 
60
  ## Sampling
61
 
 
64
  - llama.cpp: pass `--jinja`. Gemma 4 thinking mode is on by default; set
65
  `enable_thinking: false` in chat-template kwargs for shorter replies.
66
 
67
+ For Ollama, pull from the **Ollama Registry**:
68
  `ollama pull hf.co/...` [doesn't yet support sharded GGUFs](https://github.com/ollama/ollama/issues/5245).
69
  The registry version is a single-file Q4_K_M with a bundled Modelfile
70
  (`temperature=0.6, top_p=0.95, top_k=64, repeat_penalty=1.05, num_ctx=8192`
 
79
  ```
80
 
81
  ```bash
82
+ # llama.cpp: point at FIRST shard
83
+ ./llama-cli -m unbound-e4b.Q4_K_M-00001-of-00004.gguf -p "your prompt"
84
  ```
85
 
86
  ```js
 
89
  const wllama = new Wllama(/* … */);
90
  await wllama.loadModelFromHF(
91
  'evalengine/unbound-e4b-GGUF',
92
+ 'unbound-e4b-web.Q4_K_M-00001-of-00004.gguf'
93
  );
94
  ```
95
 
 
100
 
101
  ```bash
102
  ./llama-mtmd-cli \
103
+ -m unbound-e4b.Q4_K_M-00001-of-00004.gguf \
104
  --mmproj mmproj-unbound-e4b.gguf \
105
  --image path/to/your/image.png \
106
  -p "What is in this image?"