johnsonchromia committed
Commit 252cc13 · verified
1 Parent(s): 6e58098

README: compact pass — keep essentials, drop redundancy

Files changed (1)
  1. README.md +56 -99
README.md CHANGED
@@ -18,101 +18,81 @@ pipeline_tag: text-generation
18
 
19
  # Unbound E4B GGUF — *because there is no boundary*
20
 
21
- > **No guarantee — use at your own risk.** This model has reduced safety filtering
22
- > and can produce harmful, false, biased, or otherwise unsafe output. Provided
23
- > as-is, with no warranty of any kind. You are solely responsible for how you
24
- > use it and for complying with all applicable laws.
25
 
26
- GGUF quantizations of [`evalengine/unbound-e4b`](https://huggingface.co/evalengine/unbound-e4b)
27
- for on-device deployment via Ollama, llama.cpp, LM Studio,
28
- [wllama](https://github.com/ngxson/wllama) (in-browser — see the wllama
29
- section below), and similar runtimes.
30
-
31
- Built by [Chromia](https://x.com/Chromia) and [Eval Engine](https://x.com/eval_engine).
32
 
33
  ## Available quants
34
 
35
- Each quant lives in its own folder; inside, the model is split into multi-part
36
- GGUFs (`*-00001-of-0000N.gguf` ...). Ollama, llama.cpp, LM Studio, and wllama
37
- auto-stitch on the first part — same UX as a single file.
38
 
39
  ### Desktop builds (Ollama / llama.cpp / LM Studio)
40
 
41
- These keep `per_layer_token_embd` at the llama.cpp default of Q6_K, which
42
- maximizes quality but pushes the largest split part above 2 GB — fine for
43
- desktop, won't load in browser.
44
 
45
  | Quant | Folder | Parts | Total | Notes |
46
  |---------|-------------|-------|---------|-------|
47
- | Q2_K | `Q2_K/` | 4 | 4.08 GB | Smallest disk footprint; biggest quality drop |
48
- | Q3_K_M | `Q3_K_M/` | 4 | 4.49 GB | Modest size win over Q4 (embedding precision dominates total size) |
49
- | Q4_K_M | `Q4_K_M/` | 4 | 4.94 GB | **Recommended desktop default — best size/quality** |
50
  | Q6_K | `Q6_K/` | 5 | 5.75 GB | Higher fidelity |
51
  | Q8_0 | `Q8_0/` | 6 | 7.43 GB | Highest fidelity |
52
 
53
  ### Browser builds (wllama)
54
 
55
- E4B's `per_layer_token_embd` is a single 2.82-billion-value tensor; at the
56
- default Q6_K precision it lands at ~2.2 GB, just over the browser
57
- ArrayBuffer cap. These variants force the embedding tensors to `q5_K`,
58
- which shrinks the largest part below 2 GB at near-zero quality cost. The
59
- folder names use a `-web` suffix to mark them.
60
 
61
- | Quant variant | Folder | Parts | Total | wllama | Notes |
62
- |-------------------|----------------|-------|---------|--------|-------|
63
- | **Q4_K_M-web** | `Q4_K_M-web/` | 4 | 4.51 GB | ✅ | **Recommended browser default** — layers @ Q4_K_M, embed @ q5_K |
64
- | **Q2_K-web** | `Q2_K-web/` | 4 | 3.69 GB | ✅ | Smallest browser-loadable build — layers @ Q2_K, embed @ q5_K |
65
 
66
- The web files keep the canonical quant tag in the filename (so HF GGUF cards
67
- render correctly) and use the same split-multi-part scheme, so Ollama and
68
- llama.cpp will auto-stitch them too if you prefer the smaller embed quant for any
69
- reason.
70
 
71
- ## Recommended sampling
72
 
73
- - **Creative writing / open-ended / general chat** → Gemma defaults:
74
- `temperature=1.0, top_p=0.95, top_k=64`.
75
- - **Factual or brand/identity questions** → lower `temperature` to ~0.3–0.5
76
- for sharper recall.
77
- - **llama.cpp**: pass `--jinja` for proper chat-template handling.
78
- - **Gemma 4 thinking mode** is on by default. Set `enable_thinking: false`
79
- in the chat-template kwargs for shorter/faster replies.
80
 
81
- ## Run with Ollama
82
 
83
  ```bash
84
  ollama pull hf.co/evalengine/unbound-e4b-GGUF
85
  ollama run hf.co/evalengine/unbound-e4b-GGUF
86
  ```
87
 
88
- (Defaults to Q4_K_M. Ollama auto-stitches the split parts on load.)
89
-
90
- ## Run with llama.cpp
91
-
92
  ```bash
93
- # point at the FIRST part — llama.cpp follows the chain automatically
94
  ./llama-cli -m Q4_K_M/unbound-e4b-Q4_K_M-00001-of-00004.gguf -p "your prompt"
95
  ```
96
 
97
- ## Vision / image input (optional)
98
-
99
- Gemma 4 E4B ships a vision tower; we extracted it as `mmproj-unbound-e4b.gguf`
100
- (946 MB) in this repo. Pair it with any of the LM quants above to enable
101
- image-to-text inference.
102
 
103
- > **Disclaimer.** The vision encoder is **Google's original weights, unchanged**.
104
- > Unbound's abliteration + SFT-heal only touched the *language model* — the
105
- > vision tower was frozen during training. Practical consequences:
106
- >
107
- > - The LM is uncensored, so it will discuss whatever it *sees* directly.
108
- > - But the vision encoder still has Google's original alignment baked into
109
- > visual feature extraction. It may down-weight or distort features for
110
- > content classes Google's base model was tuned to suppress.
111
- > - We have **not benchmarked the visual axis** (no measured refusal rate /
112
- > coherence / hallucination on image inputs). Treat vision as a preview
113
- > feature, not a flagship one.
114
 
115
- ### Run with vision (llama.cpp `llama-mtmd-cli`)
116
 
117
  ```bash
118
  ./llama-mtmd-cli \
@@ -122,46 +102,23 @@ image-to-text inference.
122
  -p "What is in this image?"
123
  ```
124
 
125
- `llama-gemma3-cli` works the same way and is Gemma-specific.
126
-
127
- ### Run text-only (no `--mmproj`)
128
-
129
- ```bash
130
- ./llama-cli -m Q4_K_M/unbound-e4b-Q4_K_M-00001-of-00004.gguf -p "your prompt"
131
- ```
132
-
133
- The LM quants work standalone — you do **not** need `mmproj-unbound-e4b.gguf`
134
- unless you want image input. Ollama / LM Studio's standard text chat works
135
- out of the box; the mmproj file is only loaded when you point a multimodal
136
- runtime at it.
137
-
138
- ## Run in the browser (wllama)
139
-
140
- [wllama](https://github.com/ngxson/wllama) is a WebAssembly port of llama.cpp
141
- that runs entirely in the browser. Use one of the wllama-safe variants
142
- above. Browser inference is **text-only** for this model (wllama doesn't
143
- currently load `mmproj` for vision):
144
-
145
- ```js
146
- import { Wllama } from '@wllama/wllama';
147
- const wllama = new Wllama(/* … */);
148
- await wllama.loadModelFromHF(
149
- 'evalengine/unbound-e4b-GGUF',
150
- 'Q4_K_M-web/unbound-e4b-Q4_K_M-00001-of-00004.gguf' // wllama follows the chain
151
- );
152
- ```
153
-
154
- ## About the base
155
 
156
- See [`evalengine/unbound-e4b`](https://huggingface.co/evalengine/unbound-e4b)
157
- for the full model card, benchmarks, intended use, and the merged HF weights.
158
 
159
  ## Acknowledgements
160
 
161
- - Fine-tuned with [Unsloth](https://github.com/unslothai/unsloth) + Huggingface's [TRL](https://github.com/huggingface/trl).
162
- - Abliteration via [heretic](https://github.com/p-e-w/heretic).
163
- - Environment and training discipline ported from [autoresearch](https://github.com/karpathy/autoresearch).
164
 
165
  ## License
166
 
167
- Apache-2.0, inherited from `google/gemma-4-E4B-it`.
 
 
18
 
19
  # Unbound E4B GGUF — *because there is no boundary*
20
 
21
+ > **No guarantee — use at your own risk.** Reduced safety filtering; can
22
+ > produce harmful or false output. Provided as-is.
23
 
24
+ GGUF quants of [`evalengine/unbound-e4b`](https://huggingface.co/evalengine/unbound-e4b)
25
+ for Ollama, llama.cpp, LM Studio, and [wllama](https://github.com/ngxson/wllama)
26
+ (in-browser). Built by [Chromia](https://x.com/Chromia) and
27
+ [Eval Engine](https://x.com/eval_engine).
28
 
29
  ## Available quants
30
 
31
+ Each quant lives in its own folder; inside, the model is split into
32
+ multi-part GGUFs. All runtimes auto-stitch on the first part, same UX as a
33
+ single file.
34
 
35
  ### Desktop builds (Ollama / llama.cpp / LM Studio)
36
 
37
+ Embedding tensor kept at the llama.cpp default of Q6_K; largest split part
38
+ ~2.15 GB — fine for desktop, **won't load in browser**.
39
 
40
  | Quant | Folder | Parts | Total | Notes |
41
  |---------|-------------|-------|---------|-------|
42
+ | Q2_K | `Q2_K/` | 4 | 4.08 GB | Smallest, biggest quality drop |
43
+ | Q3_K_M | `Q3_K_M/` | 4 | 4.49 GB | Modest size win over Q4 |
44
+ | Q4_K_M | `Q4_K_M/` | 4 | 4.94 GB | **Recommended desktop default** |
45
  | Q6_K | `Q6_K/` | 5 | 5.75 GB | Higher fidelity |
46
  | Q8_0 | `Q8_0/` | 6 | 7.43 GB | Highest fidelity |
47
 
48
  ### Browser builds (wllama)
49
 
50
+ `per_layer_token_embd` is a 2.82B-value tensor; at the default Q6_K it
51
+ lands at ~2.2 GB, over wllama's 2 GB ArrayBuffer cap. These variants force
52
+ embeddings to `q5_K` (~1848 MB) so the largest part fits.
53
 
54
+ | Quant variant | Folder | Parts | Total | Notes |
55
+ |---------------|----------------|-------|---------|-------|
56
+ | Q4_K_M-web | `Q4_K_M-web/` | 4 | 4.51 GB | **Recommended browser default** — layers @ Q4_K_M, embed @ q5_K |
57
+ | Q2_K-web | `Q2_K-web/` | 4 | 3.69 GB | Smallest browser-loadable — layers @ Q2_K, embed @ q5_K |
58
 
59
+ `mmproj-unbound-e4b.gguf` (vision projector, ~946 MB) sits at the repo
60
+ root. See **Vision** below.
61
 
62
+ ## Sampling
63
 
64
+ - **Creative / open-ended** → `temperature=1.0, top_p=0.95, top_k=64`.
65
+ - **Factual / brand questions** → drop `temperature` to ~0.3–0.5.
66
+ - llama.cpp: pass `--jinja`. Gemma 4 thinking mode is on by default; set
67
+ `enable_thinking: false` in chat-template kwargs for shorter replies.
68
 
69
+ ## Run
70
 
71
  ```bash
72
+ # Ollama (defaults to Q4_K_M)
73
  ollama pull hf.co/evalengine/unbound-e4b-GGUF
74
  ollama run hf.co/evalengine/unbound-e4b-GGUF
75
  ```
76
 
77
  ```bash
78
+ # llama.cpp: point at the FIRST split part
79
  ./llama-cli -m Q4_K_M/unbound-e4b-Q4_K_M-00001-of-00004.gguf -p "your prompt"
80
  ```
81
 
82
+ ```js
83
+ // wllama (browser) — use a -web variant; desktop builds won't fit
84
+ import { Wllama } from '@wllama/wllama';
85
+ const wllama = new Wllama(/* … */);
86
+ await wllama.loadModelFromHF(
87
+ 'evalengine/unbound-e4b-GGUF',
88
+ 'Q4_K_M-web/unbound-e4b-Q4_K_M-00001-of-00004.gguf'
89
+ );
90
+ ```
91
 
92
+ ## Vision / image input (optional)
93
 
94
+ `mmproj-unbound-e4b.gguf` enables image-to-text. Pair with any LM quant via
95
+ `llama-mtmd-cli` or `llama-gemma3-cli`:
96
 
97
  ```bash
98
  ./llama-mtmd-cli \
@@ -122,46 +102,23 @@ image-to-text inference.
102
  -p "What is in this image?"
103
  ```
104
 
105
+ > **Disclaimer.** The vision encoder is **Google's original weights,
106
+ > unchanged** — abliteration only touched the language model. The LM is
107
+ > uncensored, but the vision encoder may still suppress features for
108
+ > content classes Google's base was tuned against. We have **not
109
+ > benchmarked the visual axis**. Treat as preview.
110
 
111
+ Text-only: skip `--mmproj`. Standard `llama-cli` / Ollama / LM Studio do
112
+ not need the mmproj file.
113
 
114
  ## Acknowledgements
115
 
116
+ Fine-tuned with [Unsloth](https://github.com/unslothai/unsloth) + HF
117
+ [TRL](https://github.com/huggingface/trl). Abliteration via
118
+ [heretic](https://github.com/p-e-w/heretic). Environment from
119
+ [autoresearch](https://github.com/karpathy/autoresearch).
120
 
121
  ## License
122
 
123
+ Apache-2.0, inherited from `google/gemma-4-E4B-it`. Full model card +
124
+ benchmarks at [`evalengine/unbound-e4b`](https://huggingface.co/evalengine/unbound-e4b).
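
The size figures quoted in the "Browser builds" text (a 2.82B-value `per_layer_token_embd` tensor, about 2.15 to 2.2 GB at Q6_K and about 1848 MB at `q5_K`) can be sanity-checked with a little arithmetic. The sketch below is an illustration only and is not part of the commit. It assumes that llama.cpp k-quant super-blocks hold 256 weights, that a Q6_K block occupies 210 bytes and a Q5_K block 176 bytes, and that the "2 GB" browser cap means 2 GiB; none of these values are stated in the README, and the helper names are hypothetical.

```python
# Rough size check for the per_layer_token_embd figures quoted in the README.
# Assumptions (not from the README): k-quant super-blocks hold 256 weights,
# a Q6_K block is 210 bytes, a Q5_K block is 176 bytes, and the "2 GB"
# browser cap is taken to mean 2 GiB (2**31 bytes).

BLOCK_WEIGHTS = 256
BLOCK_BYTES = {"Q6_K": 210, "Q5_K": 176}
BROWSER_CAP_BYTES = 2 * 1024**3


def tensor_bytes(n_values: int, quant: str) -> int:
    """Approximate stored size of a tensor holding n_values weights."""
    n_blocks = n_values // BLOCK_WEIGHTS
    return n_blocks * BLOCK_BYTES[quant]


def fits_in_browser(n_bytes: int) -> bool:
    """True if a single buffer of n_bytes stays under the assumed cap."""
    return n_bytes < BROWSER_CAP_BYTES


# 2.82 billion is the rounded value count quoted in the README.
N_VALUES = 2_820_000_000

if __name__ == "__main__":
    for quant in ("Q6_K", "Q5_K"):
        size = tensor_bytes(N_VALUES, quant)
        print(f"{quant}: {size / 1024**3:.2f} GiB "
              f"({size / 1024**2:.0f} MiB), fits: {fits_in_browser(size)}")
```

Under these assumptions the tensor alone comes to roughly 2.15 GiB at Q6_K (over the cap) and roughly 1849 MiB at Q5_K (under it), which is in line with the README's figures if its "GB" and "MB" are read as binary units.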