igorls committed
Commit 61d9293 · verified · 1 Parent(s): a3624e6

docs: reframe model card around the model itself (drop project-specific shipping framing)

Files changed (1)
  1. README.md +6 -21
README.md CHANGED
@@ -21,7 +21,7 @@ A modality-stripped variant of [`google/gemma-4-E4B-it`](https://huggingface.co/
 
 **Headline:** Same instruction-tuned text behavior as the official Gemma 4 E4B-it — including its multilingual coverage — but at **6.5 GB resident VRAM instead of 10.6 GB** (Ollama Q4_K_M, RTX 3090, Linux). All safety alignment is preserved — this is **not** an abliterated or uncensored variant.
 
- **For 8 GB GPU users:** this is the recommended Gemma 4 E4B variant. The official `gemma4:e4b-it-q4_K_M` does not fit on 8 GB cards even at short contexts (10.2 GB resident at ctx=8192). This variant fits with ~2 GB headroom and preserves every measured capability of the base model.
+ Fits comfortably on **8 GB GPUs at Q4_K_M** with realistic context lengths (5.85 GB resident at ctx=4096, 5.96 GB at ctx=8192). The official multimodal Q4_K_M sits at 10.2 GB resident even at ctx=8192 and won't load on 8 GB cards. A quick load-check sketch follows this hunk.
 
 ## Why this exists
 
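To sanity-check the fit claim on your own card, here is a minimal load-check sketch against a local Ollama server. The `MODEL_TAG` value is a hypothetical placeholder for whatever tag you imported this variant's GGUF under (the card does not publish an Ollama tag), and Ollama is assumed to be listening on its default port:

```python
import requests

# Hypothetical local tag -- substitute whatever tag you imported the GGUF under.
MODEL_TAG = "gemma4-e4b-it-text:q4_K_M"

# One short generation at the context length used for the VRAM figures above.
# Ollama sizes the KV cache from num_ctx at load time, so a successful reply
# means the weights + cache fit at that context size.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": MODEL_TAG,
        "prompt": "Reply with the single word: ready",
        "stream": False,
        "options": {"num_ctx": 8192},
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```

After the call returns, `ollama ps` (or `nvidia-smi`) reports the actual resident size, which is where the 5.96 GB figure above comes from.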
@@ -47,32 +47,17 @@ All accuracy deltas are within statistical noise at n=100. The 4.1 GB VRAM win i
 
 ## Multilingual robustness
 
- The strip preserves Gemma 4's multilingual capability. Measured against the MemPalace harness translated to Portuguese (pt-BR), Spanish (es), and Chinese (zh), at parity context (ctx=8192) and with a multilingual scoring embedding (`embeddinggemma`) so cross-lingual cosine isn't penalized by the EN-only `nomic-embed-text` v1:
+ The strip preserves the base model's multilingual capability. The same classification + extraction tasks were run with inputs translated into Portuguese (pt-BR), Spanish (es), and Chinese (zh), with labels and the slug taxonomy kept in English to test the realistic cross-lingual mapping case. Scoring uses `embeddinggemma` for semantic similarity so cross-lingual cosine isn't artificially penalized (a scoring sketch follows this hunk).
 
 | Task | en | pt-BR | es | zh |
 |---|---:|---:|---:|---:|
 | Calibration | 1.000 | 0.950 | 0.950 | 0.950 |
- | Room classification (closed) | 0.624 | 0.584 | 0.584 | 0.584 |
- | Room classification (open) | **0.676** | **0.636** | **0.641** | **0.639** |
+ | Room classification (closed-set) | 0.624 | 0.584 | 0.584 | 0.584 |
+ | Room classification (open-set) | 0.676 | 0.636 | 0.641 | 0.639 |
 | Entity extraction (F1) | 0.732 | 0.747 | 0.747 | 0.694 |
 | Memory coverage | 0.912 | 0.850 | 0.850 | 0.912 |
 
- This model is the **most language-stable** of the four 4B-class local candidates evaluated: closed/open room classification stays within ±0.02 across languages, where competing Qwen 3 variants degrade visibly on zh (closed-set drops to 0.535 for `qwen3:4b-instruct-2507-q8_0`).
-
- ### When to pick this model vs Qwen 3 4B alternatives
-
- Same harness, same matrix, ctx=8192, full datasets:
-
- | Capability | Winner | Notes |
- |---|---|---|
- | **Open-set room classification** ★ | **this model** | 0.636-0.676 across 4 languages vs Qwen 0.56-0.63. The unique Gemma 4 strength, replicating across every language tested. |
- | Closed-set room classification | rough tie | This model and `qwen3.5:4b-q4_K_M` trade the lead by 1-3 points. |
- | Memory extraction | rough tie (~0.85) | This model, `qwen3:4b-instruct-2507-q8_0`, and official Gemma 4 within 0.02 of each other. |
- | Entity extraction (F1) | Qwen 3 4B Q8 | `qwen3:4b-instruct-2507-q8_0` leads by 5-7 points on entity extraction across all 4 languages. |
- | TPS (output throughput) | Qwen 3 4B Q8 | 2x faster (220+ TPS vs ~130 TPS at ctx=8192). |
- | VRAM resident at ctx=8192 | rough tie | This model 6.1 GB, qwen3:4b-q8 5.8 GB, qwen3.5:4b-q4 6.0 GB. |
-
- **Pick this model** when slug quality matters (open-set room routing / "what room does this conversation go in" UX) or when multilingual stability matters. **Pick `qwen3:4b-instruct-2507-q8_0`** when speed matters more than open-set slug quality, or when entity extraction is the dominant load.
+ Closed/open room classification stays within ±0.02 across all four languages; entity F1 within ±0.05; memory coverage within ±0.06. The strip did not introduce a multilingual regression. The model still emits responses in the input language by default — if your application needs same-language extraction (e.g. memories phrased in Portuguese for Portuguese conversations), the model does that natively.
 
 ## What was actually dropped
 
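To make the scoring note above concrete, here is a minimal sketch of the cross-lingual cosine check, assuming a recent local Ollama with `embeddinggemma` pulled and the batch `/api/embed` endpoint available. The two example strings are illustrative, not taken from the harness:

```python
import math
import requests

def embed(texts: list[str]) -> list[list[float]]:
    # Batch-embed via Ollama's /api/embed endpoint (recent Ollama releases).
    r = requests.post(
        "http://localhost:11434/api/embed",
        json={"model": "embeddinggemma", "input": texts},
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["embeddings"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical pair: a pt-BR model answer graded against an English reference.
# A multilingual embedder keeps these close; an EN-only one would not.
answer, reference = embed([
    "O usuário prefere reuniões de manhã.",
    "The user prefers morning meetings.",
])
print(f"cross-lingual cosine: {cosine(answer, reference):.3f}")
```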
@@ -167,7 +152,7 @@ The official MTP drafter [`google/gemma-4-E4B-it-assistant`](https://huggingface
 | JSON entity list (128 tok) | 128 | 12291 ms | 6712 ms | 1.83x |
 | JSON memories (114 tok) | 114 | 8425 ms | **2771 ms** | **3.04x** |
 
- Speedup tracks output predictability — structured JSON outputs (the most common MemPalace surface) land at the high end (3x).
+ Speedup tracks output predictability — structured JSON outputs land at the high end (3x), short slug/letter classifications around 1.5-2x, free-form continuations near 1x.
 
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 