johnsonchromia committed on
Commit 3301a7d · verified · 1 Parent(s): 6975824

README: 3-quant lineup, split-file UX, wllama browser support

Files changed (1)
  1. README.md +35 -10
README.md CHANGED
@@ -24,17 +24,23 @@ pipeline_tag: text-generation
24
  > use it and for complying with all applicable laws.
25
 
26
  GGUF quantizations of [`evalengine/unbound-e2b`](https://huggingface.co/evalengine/unbound-e2b)
27
- for on-device deployment via Ollama, llama.cpp, LM Studio, and similar runtimes.
 
28
 
29
  Built by [Chromia](https://x.com/Chromia) and [Eval Engine](https://x.com/eval_engine).
30
 
31
  ## Available quants
32
 
33
- | File | Size | Notes |
34
- |---|---|---|
35
- | `unbound-e2b-Q4_K_M.gguf` | 3.2 GB | Recommended phone-deployable build |
 
36
 
37
- More quants (Q5_K_M, Q8_0, etc.) may be added on request.
38
 
39
  ## Recommended sampling
40
 
@@ -51,7 +57,7 @@ through.
51
 
52
  ## Default sampling (Ollama)
53
 
54
- When you `ollama pull hf.co/evalengine/unbound-e2b-gguf`, the bundled
55
  `Modelfile` sets these defaults, tuned for factual recall:
56
 
57
  - `temperature = 0.6` (lower than Gemma's training default of 1.0; keeps
@@ -63,7 +69,7 @@ When you `ollama pull hf.co/evalengine/unbound-e2b-gguf`, the bundled
63
  **To override per-session in Ollama:**
64
 
65
  ```
66
- ollama run hf.co/evalengine/unbound-e2b-gguf
67
  >>> /set parameter temperature 1.0 # creative / open-ended
68
  >>> /set parameter temperature 0.3 # max factual / brand questions
69
  ```
@@ -74,20 +80,39 @@ include the SYSTEM line from the `Modelfile` as your `--system` argument.
74
  ## Run with Ollama
75
 
76
  ```bash
77
- ollama pull hf.co/evalengine/unbound-e2b-gguf
78
- ollama run hf.co/evalengine/unbound-e2b-gguf
79
  ```
80
 
 
 
81
  ## Run with llama.cpp
82
 
83
  ```bash
84
- ./llama-cli -m unbound-e2b-Q4_K_M.gguf -p "your prompt"
85
  ```
86
 
87
  ## About the base
88
 
89
  See [`evalengine/unbound-e2b`](https://huggingface.co/evalengine/unbound-e2b)
90
  for the full model card, benchmarks, intended use, and the merged HF weights.
 
91
  ## Acknowledgements
92
 
93
  - Fine-tuned with [Unsloth](https://github.com/unslothai/unsloth) + Huggingface's [TRL](https://github.com/huggingface/trl).
 
24
  > use it and for complying with all applicable laws.
25
 
26
  GGUF quantizations of [`evalengine/unbound-e2b`](https://huggingface.co/evalengine/unbound-e2b)
27
+ for on-device deployment via Ollama, llama.cpp, LM Studio, [wllama](https://github.com/ngxson/wllama)
28
+ (in-browser), and similar runtimes.
29
 
30
  Built by [Chromia](https://x.com/Chromia) and [Eval Engine](https://x.com/eval_engine).
31
 
32
  ## Available quants
33
 
34
+ All quants ship as **split multi-part GGUFs** (`*-00001-of-0000N.gguf` ...) so
35
+ they work in browsers (wllama's 2 GB ArrayBuffer cap) and let desktop
36
+ runtimes parallel-download chunks. Ollama, llama.cpp, and LM Studio
37
+ auto-stitch from the first part, so the UX is the same as a single file.
38
 
39
+ | Quant | Parts | Total | Largest part | wllama (browser) | Desktop (Ollama/llama.cpp/LM Studio) | Notes |
40
+ |---------|-------|--------|--------------|------------------|--------------------------------------|-------|
41
+ | Q4_K_M | 3 | 3.2 GB | 1.79 GB | ✅ | ✅ | Recommended on-device default; best size/quality |
42
+ | Q6_K | 4 | 3.6 GB | 1.79 GB | ✅ | ✅ | Higher fidelity, still browser-safe |
43
+ | Q8_0 | 4 | 4.6 GB | **2.32 GB** | ❌ (over 2 GB) | ✅ | Highest fidelity; one tensor exceeds the browser ArrayBuffer limit, so desktop runtimes only |
44
 
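The browser column in the table follows from one comparison: the largest part of each quant against the 2 GB ArrayBuffer cap. A minimal sketch of that check, assuming the sizes are decimal GB exactly as printed in the table and treating the cap as a flat 2.0 GB; the names `quants`, `BROWSER_PART_LIMIT_GB`, and `browserSafe` are hypothetical and not part of any runtime's API:

```js
// Sizes copied from the table above (GB, as printed in the README).
const quants = [
  { name: 'Q4_K_M', parts: 3, totalGB: 3.2, largestPartGB: 1.79 },
  { name: 'Q6_K', parts: 4, totalGB: 3.6, largestPartGB: 1.79 },
  { name: 'Q8_0', parts: 4, totalGB: 4.6, largestPartGB: 2.32 },
];

// Assumption: the browser limit is modelled as a flat 2 GB per part.
const BROWSER_PART_LIMIT_GB = 2.0;

// A quant is browser-safe only if its largest part fits under the cap.
function browserSafe(quant, limitGB = BROWSER_PART_LIMIT_GB) {
  return quant.largestPartGB <= limitGB;
}

const usableInBrowser = quants.filter((q) => browserSafe(q)).map((q) => q.name);
console.log(usableInBrowser); // Q4_K_M and Q6_K pass; Q8_0 does not
```

The check is per part rather than per total because each part is loaded into its own buffer, which is why a 3.6 GB quant can still be browser-safe.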
45
  ## Recommended sampling
46
 
 
57
 
58
  ## Default sampling (Ollama)
59
 
60
+ When you `ollama pull hf.co/evalengine/unbound-e2b-GGUF`, the bundled
61
  `Modelfile` sets these defaults, tuned for factual recall:
62
 
63
  - `temperature = 0.6` (lower than Gemma's training default of 1.0; keeps
 
69
  **To override per-session in Ollama:**
70
 
71
  ```
72
+ ollama run hf.co/evalengine/unbound-e2b-GGUF
73
  >>> /set parameter temperature 1.0 # creative / open-ended
74
  >>> /set parameter temperature 0.3 # max factual / brand questions
75
  ```
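
The same override can be sent per request through Ollama's local REST API instead of the interactive prompt. A minimal sketch, assuming a default Ollama install listening on `http://localhost:11434` and Node 18+ for the built-in `fetch`; the helper `buildGenerateRequest` and the preset labels are hypothetical, and only the temperature values come from this README:

```js
// Temperature presets taken from the README; the labels are illustrative.
const TEMPERATURE_PRESETS = {
  default: 0.6, // bundled Modelfile default
  creative: 1.0, // creative / open-ended
  factual: 0.3, // max factual / brand questions
};

// Build the JSON body for Ollama's POST /api/generate endpoint.
function buildGenerateRequest(prompt, preset = 'default') {
  if (!Object.hasOwn(TEMPERATURE_PRESETS, preset)) {
    throw new RangeError(`unknown preset: ${preset}`);
  }
  return {
    model: 'hf.co/evalengine/unbound-e2b-GGUF',
    prompt,
    stream: false, // ask for a single JSON object instead of a stream
    options: { temperature: TEMPERATURE_PRESETS[preset] },
  };
}

// Usage (needs a running Ollama server, so it is left commented out):
// const res = await fetch('http://localhost:11434/api/generate', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(buildGenerateRequest('your prompt', 'factual')),
// });
// console.log((await res.json()).response);
```

Values passed in `options` apply to that request only, so the `Modelfile` defaults stay in place for other callers.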
 
80
  ## Run with Ollama
81
 
82
  ```bash
83
+ ollama pull hf.co/evalengine/unbound-e2b-GGUF
84
+ ollama run hf.co/evalengine/unbound-e2b-GGUF
85
  ```
86
 
87
+ (Defaults to Q4_K_M. Ollama auto-stitches the split parts on load.)
88
+
89
  ## Run with llama.cpp
90
 
91
  ```bash
92
+ # point at the FIRST part; llama.cpp follows the chain automatically
93
+ ./llama-cli -m unbound-e2b-Q4_K_M-00001-of-00003.gguf -p "your prompt"
94
+ ```
95
+
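For scripts that need to list or pre-download every part, the filenames can be derived from the `*-00001-of-0000N.gguf` pattern. A minimal sketch, assuming every part uses the same five-digit zero padding seen in the first-part name above; `splitPartNames` is a hypothetical helper, not a llama.cpp or Hugging Face API:

```js
// Generate the expected filenames for a split GGUF.
// Assumption: parts are numbered from 1 and zero-padded to five digits.
function splitPartNames(prefix, totalParts) {
  if (!Number.isInteger(totalParts) || totalParts < 1) {
    throw new RangeError('totalParts must be a positive integer');
  }
  const pad = (n) => String(n).padStart(5, '0');
  return Array.from(
    { length: totalParts },
    (_, i) => `${prefix}-${pad(i + 1)}-of-${pad(totalParts)}.gguf`
  );
}

const q4Parts = splitPartNames('unbound-e2b-Q4_K_M', 3);
console.log(q4Parts[0]); // the file to pass to `llama-cli -m`
```

Only the first name needs to be passed to the runtime; the rest are useful for checking that a download is complete.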
96
+ ## Run in the browser (wllama)
97
+
98
+ [wllama](https://github.com/ngxson/wllama) is a WebAssembly port of llama.cpp
99
+ that runs entirely in the browser: no server, no install. Use Q4_K_M or
100
+ Q6_K (Q8_0 has a tensor above the 2 GB ArrayBuffer limit):
101
+
102
+ ```js
103
+ import { Wllama } from '@wllama/wllama';
104
+ const wllama = new Wllama(/* … */);
105
+ await wllama.loadModelFromHF(
106
+ 'evalengine/unbound-e2b-GGUF',
107
+ 'unbound-e2b-Q4_K_M-00001-of-00003.gguf' // wllama follows the chain
108
+ );
109
  ```
110
 
111
  ## About the base
112
 
113
  See [`evalengine/unbound-e2b`](https://huggingface.co/evalengine/unbound-e2b)
114
  for the full model card, benchmarks, intended use, and the merged HF weights.
115
+
116
  ## Acknowledgements
117
 
118
  - Fine-tuned with [Unsloth](https://github.com/unslothai/unsloth) + Huggingface's [TRL](https://github.com/huggingface/trl).