johnsonchromia commited on
Commit
28c3815
Β·
verified Β·
1 Parent(s): 357c043

README: 3-quant lineup, desktop-only (per_layer_token_embd > 2GB)

Browse files
Files changed (1) hide show
  1. README.md +19 -24
README.md CHANGED
@@ -24,23 +24,29 @@ pipeline_tag: text-generation
24
  > use it and for complying with all applicable laws.
25
 
26
  GGUF quantizations of [`evalengine/unbound-e4b`](https://huggingface.co/evalengine/unbound-e4b)
27
- for on-device deployment via Ollama, llama.cpp, LM Studio, [wllama](https://github.com/ngxson/wllama)
28
- (in-browser), and similar runtimes.
29
 
30
  Built by [Chromia](https://x.com/Chromia) and [Eval Engine](https://x.com/eval_engine).
31
 
32
  ## Available quants
33
 
34
- All quants ship as **split multi-part GGUFs** (`*-00001-of-0000N.gguf` ...) so
35
- they work in browsers (wllama's 2 GB ArrayBuffer cap) and let desktop
36
- runtimes parallel-download chunks. Ollama, llama.cpp, and LM Studio
37
- auto-stitch on the first part β€” same UX as a single file.
38
 
39
- | Quant | Parts | Total | Largest part | wllama (browser) | Desktop (Ollama/llama.cpp/LM Studio) | Notes |
40
- |---------|--------|------------|--------------|------------------|--------------------------------------|-------|
41
- | Q4_K_M | [TBD] | ~5 GB | [TBD] | [TBD] | βœ… | Recommended on-device default β€” best size/quality |
42
- | Q6_K | [TBD] | [TBD] | [TBD] | [TBD] | βœ… | Higher fidelity |
43
- | Q8_0 | [TBD] | [TBD] | [TBD] | [TBD] | βœ… | Highest fidelity; large-tensor quants typically exceed the 2 GB browser ArrayBuffer limit β€” desktop only |
 
 
 
 
 
 
 
44
 
45
  ## Recommended sampling
46
 
@@ -50,7 +56,7 @@ auto-stitch on the first part β€” same UX as a single file.
50
  for sharper recall.
51
  - **llama.cpp**: pass `--jinja` for proper chat-template handling.
52
  - **Gemma 4 thinking mode** is on by default. Set `enable_thinking: false`
53
- in the chat-template kwargs.
54
 
55
  ## Run with Ollama
56
 
@@ -65,18 +71,7 @@ ollama run hf.co/evalengine/unbound-e4b-GGUF
65
 
66
  ```bash
67
  # point at the FIRST part β€” llama.cpp follows the chain automatically
68
- ./llama-cli -m unbound-e4b-Q4_K_M-00001-of-0000N.gguf -p "your prompt"
69
- ```
70
-
71
- ## Run in the browser (wllama)
72
-
73
- ```js
74
- import { Wllama } from '@wllama/wllama';
75
- const wllama = new Wllama(/* … */);
76
- await wllama.loadModelFromHF(
77
- 'evalengine/unbound-e4b-GGUF',
78
- 'unbound-e4b-Q4_K_M-00001-of-0000N.gguf' // wllama follows the chain
79
- );
80
  ```
81
 
82
  ## About the base
 
24
  > use it and for complying with all applicable laws.
25
 
26
  GGUF quantizations of [`evalengine/unbound-e4b`](https://huggingface.co/evalengine/unbound-e4b)
27
+ for on-device deployment via Ollama, llama.cpp, LM Studio, and similar
28
+ desktop runtimes.
29
 
30
  Built by [Chromia](https://x.com/Chromia) and [Eval Engine](https://x.com/eval_engine).
31
 
32
  ## Available quants
33
 
34
+ All quants ship as **split multi-part GGUFs** (`*-00001-of-0000N.gguf` ...)
35
+ so desktop runtimes can parallel-download chunks. Ollama, llama.cpp, and LM
36
+ Studio auto-stitch on the first part β€” same UX as a single file.
 
37
 
38
+ | Quant | Parts | Total | Largest part | wllama (browser) | Desktop (Ollama / llama.cpp / LM Studio) | Notes |
39
+ |---------|-------|---------|--------------|------------------|------------------------------------------|-------|
40
+ | Q4_K_M | 4 | 4.92 GB | ~2.15 GB | ❌ | βœ… | Recommended on-device default β€” best size/quality |
41
+ | Q6_K | 5 | 5.73 GB | ~2.2 GB | ❌ | βœ… | Higher fidelity |
42
+ | Q8_0 | 6 | 7.41 GB | ~2.7 GB | ❌ | βœ… | Highest fidelity |
43
+
44
+ **Why no wllama:** E4B's `per_layer_token_embd` is a single atomic tensor
45
+ that exceeds 2 GB in every quant we ship (the smallest one, Q4_K_M, lands
46
+ at ~2.2 GB for that tensor). wllama's underlying browser ArrayBuffer caps at
47
+ 2 GB, so no split scheme can fit. For in-browser inference, use
48
+ [`evalengine/unbound-e2b-GGUF`](https://huggingface.co/evalengine/unbound-e2b-GGUF)
49
+ instead β€” its tensors are small enough that Q4_K_M and Q6_K both fit.
50
 
51
  ## Recommended sampling
52
 
 
56
  for sharper recall.
57
  - **llama.cpp**: pass `--jinja` for proper chat-template handling.
58
  - **Gemma 4 thinking mode** is on by default. Set `enable_thinking: false`
59
+ in the chat-template kwargs for shorter/faster replies.
60
 
61
  ## Run with Ollama
62
 
 
71
 
72
  ```bash
73
  # point at the FIRST part β€” llama.cpp follows the chain automatically
74
+ ./llama-cli -m unbound-e4b-Q4_K_M-00001-of-00004.gguf -p "your prompt"
 
 
 
 
 
 
 
 
 
 
 
75
  ```
76
 
77
  ## About the base