---
license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
- google/gemma-3-1b-it
tags:
- text-generation-inference
---


# gemma-3-1b-it-FlashHead

![FlashHead](https://huggingface.co/datasets/embedl/documentation-images/resolve/main/flashhead.png)

[![GitHub](https://img.shields.io/badge/GitHub-flash--head-black?logo=github)](https://github.com/embedl/flash-head)

**Optimized version of gemma-3-1b-it using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy.**
Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:

- FlashHead
- vLLM plugin via [`flash-head`](https://github.com/embedl/flash-head)

FlashHead closely matches the gemma-3-1b-it baseline on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers state-of-the-art on-device latency.

### Quickstart

```bash
pip install flash-head
vllm serve embedl/gemma-3-1b-it-FlashHead
```
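
Once the server is up, it exposes vLLM's standard OpenAI-compatible API (default `http://localhost:8000`). A minimal sketch of querying it with `curl`, assuming default host and port:

```shell
# Query the OpenAI-compatible Chat Completions endpoint started by
# `vllm serve`; adjust host/port if you changed the defaults.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "embedl/gemma-3-1b-it-FlashHead",
    "messages": [{"role": "user", "content": "Write a haiku about coffee."}],
    "max_tokens": 128
  }'
```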
---

## Model Details
| **Field** | **Value** |
|------------|------------|
| **Base Model** | gemma-3-1b-it |
| **Input / Output** | Text → Text |
| **Release Date** | 2025-12-08 |
| **Version** | 1.0 |
| **Optimizations** | FlashHead LM Head|
| **Developers** | Embedl |
| **Licenses** | Upstream: Gemma Terms of Use. <br>Optimized components: Embedl Models Community Licence v1.0 *(no redistribution)* |
| **Intended Use** | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |

---

## Optimizations

- **FlashHead LM Head** - lightweight replacement for the dense LM head, significantly improving throughput.
- **vLLM Plugin Integration** - compatible with **vLLM (0.14.0+)** via the [`flash-head`](https://github.com/embedl/flash-head) plugin.

---

## Performance

<a href="https://huggingface.co/spaces/embedl/Edge-Inference-Benchmarks" target="_blank" rel="noopener">
  <img
    src="https://huggingface.co/datasets/embedl/documentation-images/resolve/main/Edge-Inference-Benchmarks/Gemma-3__agx_thor.svg"
    alt="Edge Inference Benchmarks for Gemma-3"
    width="100%"
  />
</a>

### Token Generation Speed (RTX 3500 Ada, batch size = 1)

| **Precision** | **Tokens/sec** | **Speedup vs BF16** |
|----------------|----------------|----------------------|
| BF16 baseline | 148 | 1.0× |
| **FlashHead (Embedl)** | **178** | **1.20×** |
| W4A16 baseline | 243 | 1.64× |
| **FlashHead W4A16 (Embedl)** | **336** | **2.27×** |

FlashHead W4A16 improves end-to-end generation speed by **1.38×** over the W4A16 baseline, while closely matching baseline accuracy.

**Measurement setup:** vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.
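
The speedup column follows directly from the tokens/sec figures; a quick sanity check:

```python
# Recompute the speedup column of the table above from the tokens/sec figures.
bf16 = 148             # BF16 baseline
flashhead = 178        # FlashHead (BF16)
w4a16 = 243            # W4A16 baseline
flashhead_w4a16 = 336  # FlashHead W4A16

for name, tps in [("FlashHead", flashhead),
                  ("W4A16", w4a16),
                  ("FlashHead W4A16", flashhead_w4a16)]:
    print(f"{name}: {tps / bf16:.2f}x vs BF16")

# FlashHead W4A16 vs the W4A16 baseline (the 1.38x end-to-end figure):
print(f"{flashhead_w4a16 / w4a16:.2f}x")
```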

---

## Accuracy (Parity with Baseline)

| **Method** | **MMLU-Pro** | **IFEval** | **BBH** | **TruthfulQA** | **GSM8K** |
|-------------|---------------|--------------|-------------|----------------|--------------|
| **Baseline** | 0.15 | 0.55 | 0.38 | 0.31 | 0.42 |
| **FlashHead** | 0.15 | 0.49 | 0.38 | 0.31 | 0.39 |

FlashHead closely matches baseline accuracy, with small deviations on IFEval (−0.06) and GSM8K (−0.03).

---

## Installation

```bash
pip install flash-head
```

The [`flash-head`](https://github.com/embedl/flash-head) vLLM plugin is required. It activates automatically at startup.

---

## Usage Examples
**Note (vLLM context length):** `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`).
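
As a sketch, a serving command for a VRAM-constrained GPU might cap the context window and raise the VRAM fraction vLLM may claim (the `8192` / `0.90` values here are illustrative, not tuned recommendations):

```shell
# Cap the context window and let vLLM use up to 90% of GPU memory
# for weights + KV cache.
vllm serve embedl/gemma-3-1b-it-FlashHead \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```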

### vLLM Inference

```python
from vllm import LLM, SamplingParams

model_id = "embedl/gemma-3-1b-it-FlashHead"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```

---

## Limitations

- Requires **vLLM >= 0.14.0**
- Currently optimized for **NVIDIA RTX GPUs**

---

## Roadmap

Planned improvements:

- Advanced mixed precision quantization
- Hugging Face Transformers generation support
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in **Transformers** and **vLLM**
- Compatibility with **GGUF**, **MLC**, **llama.cpp**, **Ollama**, etc.
- Broader model coverage (larger models, VLMs, VLAs)

---

## License

- **Upstream:** Gemma Terms of Use.
- **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)*

---

## Contact

**Enterprise & Commercial Inquiries**
[models@embedl.com](mailto:models@embedl.com)

**Technical Issues & Early Access**
[https://github.com/embedl/flash-head](https://github.com/embedl/flash-head)

**More Information & Model Releases**
[https://embedl.com](https://embedl.com)

---

### Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities

Contact: [models@embedl.com](mailto:models@embedl.com)