---
license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
- google/gemma-3-1b-it
tags:
- text-generation-inference
---
# gemma-3-1b-it-FlashHead

[GitHub: embedl/flash-head](https://github.com/embedl/flash-head)
**Optimized version of gemma-3-1b-it using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy.**
Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:
- FlashHead
- vLLM plugin via [`flash-head`](https://github.com/embedl/flash-head)
FlashHead closely matches the gemma-3-1b-it baseline on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers state-of-the-art on-device latency.
### Quickstart
```bash
pip install flash-head
vllm serve embedl/gemma-3-1b-it-FlashHead
```
---
## Model Details
| **Field** | **Value** |
|------------|------------|
| **Base Model** | gemma-3-1b-it |
| **Input / Output** | Text → Text |
| **Release Date** | 2025-12-08 |
| **Version** | 1.0 |
| **Optimizations** | FlashHead LM Head|
| **Developers** | Embedl |
| **Licenses** | Upstream: Gemma Terms of Use. <br>Optimized components: Embedl Models Community Licence v1.0 *(no redistribution)* |
| **Intended Use** | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |
---
## Optimizations
- **FlashHead LM Head** - lightweight replacement for the dense LM head, significantly improving throughput.
- **vLLM Plugin Integration** - compatible with **vLLM (0.14.0+)** via the [`flash-head`](https://github.com/embedl/flash-head) plugin.
---
## Performance
<a href="https://huggingface.co/spaces/embedl/Edge-Inference-Benchmarks" target="_blank" rel="noopener">
<img
src="https://huggingface.co/datasets/embedl/documentation-images/resolve/main/Edge-Inference-Benchmarks/Gemma-3__agx_thor.svg"
alt="Edge Inference Benchmarks for Gemma-3"
width="100%"
/>
</a>
### Token Generation Speed (RTX 3500 Ada, batch size = 1)
| **Precision** | **Tokens/sec** | **Speedup vs BF16** |
|----------------|----------------|----------------------|
| BF16 baseline | 148 | 1.0× |
| **FlashHead (Embedl)** | **178** | **1.20×** |
| W4A16 baseline | 243 | 1.64× |
| **FlashHead W4A16 (Embedl)** | **336** | **2.27×** |
In W4A16, FlashHead improves end-to-end speed by **1.38×** over the quantized baseline while closely matching baseline accuracy.
**Measurement setup:** vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.
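The speedup column follows directly from the measured tokens/sec; a quick check:

```python
# Reproduce the "Speedup vs BF16" column from the measured tokens/sec above.
tokens_per_sec = {
    "BF16 baseline": 148,
    "FlashHead": 178,
    "W4A16 baseline": 243,
    "FlashHead W4A16": 336,
}

bf16 = tokens_per_sec["BF16 baseline"]
speedups = {name: round(tps / bf16, 2) for name, tps in tokens_per_sec.items()}
print(speedups)
# {'BF16 baseline': 1.0, 'FlashHead': 1.2, 'W4A16 baseline': 1.64, 'FlashHead W4A16': 2.27}

# FlashHead W4A16 vs. the W4A16 baseline (the 1.38x end-to-end figure):
print(round(336 / 243, 2))  # 1.38
```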
---
## Accuracy (Parity with Baseline)
| **Method** | **MMLU-Pro** | **IFEval** | **BBH** | **TruthfulQA** | **GSM8K** |
|-------------|---------------|--------------|-------------|----------------|--------------|
| **Baseline** | 0.15 | 0.55 | 0.38 | 0.31 | 0.42 |
| **FlashHead** | 0.15 | 0.49 | 0.38 | 0.31 | 0.39 |
FlashHead closely matches baseline accuracy.
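The gap to baseline can be quantified directly from the table above:

```python
# Per-benchmark deltas between the Baseline and FlashHead rows above.
benchmarks = ["MMLU-Pro", "IFEval", "BBH", "TruthfulQA", "GSM8K"]
baseline   = [0.15, 0.55, 0.38, 0.31, 0.42]
flashhead  = [0.15, 0.49, 0.38, 0.31, 0.39]

deltas = {b: round(f - s, 2) for b, s, f in zip(benchmarks, baseline, flashhead)}
print(deltas)  # three of five benchmarks are unchanged
print(max(abs(d) for d in deltas.values()))  # largest gap: 0.06 (IFEval)
```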
---
## Installation
```bash
pip install flash-head
```
The [`flash-head`](https://github.com/embedl/flash-head) vLLM plugin is required. It activates automatically at startup.
---
## Usage Examples
**Note (vLLM context length):** `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`).
### vLLM Inference
```python
from vllm import LLM, SamplingParams

model_id = "embedl/gemma-3-1b-it-FlashHead"

if __name__ == "__main__":
    # Greedy decoding, capped at 128 new tokens.
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)

    prompt = "Write a haiku about coffee."
    outputs = llm.generate([prompt], sampling)
    print(outputs[0].outputs[0].text)
```
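As a rough guide for the context-length note above, the KV cache footprint can be estimated from the model config. This is a generic sketch, not part of the `flash-head` API: the default parameters below (26 layers, 1 KV head, head dimension 256) are taken from the upstream gemma-3-1b config and should be double-checked, and the estimate ignores Gemma 3's sliding-window layers, so it is an upper bound on the KV cache alone (weights and activations add more).

```python
def kv_cache_gib(num_tokens, num_layers=26, num_kv_heads=1,
                 head_dim=256, dtype_bytes=2):
    """Estimate KV cache size in GiB: two tensors (K and V) per layer,
    each holding num_kv_heads * head_dim values per token."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return num_tokens * bytes_per_token / 2**30

# Full 131072-token context in BF16 (2 bytes per value):
print(round(kv_cache_gib(131072), 2))  # 3.25
```

If that figure does not fit in free VRAM alongside the weights, lowering `max_model_len` shrinks it proportionally.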
---
## Limitations
- Requires **vLLM >= 0.14.0**
- Currently optimized for **NVIDIA RTX GPUs**
---
## Roadmap
Planned improvements:
- Advanced mixed precision quantization
- Hugging Face Transformers generation support
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in **Transformers** and **vLLM**
- Compatibility with **GGUF**, **MLC**, **Llama.cpp**, **Ollama**, etc.
- Broader model coverage (larger models, VLMs, VLAs)
---
## License
- **Upstream:** Gemma Terms of Use.
- **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)*
---
## Contact
**Enterprise & Commercial Inquiries**
[models@embedl.com](mailto:models@embedl.com)
**Technical Issues & Early Access**
[https://github.com/embedl/flash-head](https://github.com/embedl/flash-head)
**More Information & Model Releases**
[https://embedl.com](https://embedl.com)
---
### Partner & Developer Opportunities
If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:
- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities
Contact: [models@embedl.com](mailto:models@embedl.com)