---
license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
- meta-llama/Llama-3.2-1B-Instruct
tags:
- text-generation-inference
---


# Llama-3.2-1B-Instruct-FlashHead-W4A16

![FlashHead banner](assets/FlashHead.png)

**An optimized version of Llama-3.2-1B-Instruct using W4A16 quantization and FlashHead, Embedl's efficient replacement for the language-model head, reducing model size while preserving accuracy.**
Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:

- FlashHead
- Quantization (W4A16)
- Custom vLLM generation via `embedl-models`

FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, achieves H200-class throughput on RTX Ada GPUs.

---

## Model Details
| **Field** | **Value** |
|------------|------------|
| **Base Model** | Llama-3.2-1B-Instruct |
| **Input / Output** | Text → Text |
| **Release Date** | 2025-12-08 |
| **Version** | 1.0 |
| **Optimizations** | FlashHead LM Head, Quantization (W4A16)|
| **Developers** | Embedl |
| **Licenses** | Upstream: Meta Llama 3.2 License. Built with Llama. <br>Optimized components: Embedl Models Community Licence v1.0 *(no redistribution)* |
| **Intended Use** | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |

---

## Optimizations

- **FlashHead LM Head** - lightweight replacement for the dense LM head, significantly improving throughput.
- **Quantization (W4A16)** - large reduction in memory footprint with minimal impact on accuracy.
- **Custom Runtime Integration** - compatible with **vLLM (0.10.2)** via the `embedl-models` package.

---

## Performance

### Token Generation Speed (RTX 3500 Ada, batch size = 1)

| **Precision** | **Tokens/sec** | **Speedup vs BF16** |
|----------------|----------------|----------------------|
| BF16 baseline | 130 | 1.0× |
| **FlashHead (Embedl)** | **163** | **1.25×** |
| W4A16 baseline | 278 | 2.14× |
| **FlashHead W4A16 (Embedl)** | **485** | **3.73×** |

FlashHead improves end-to-end speed by **1.75×** over the state-of-the-art W4A16 baseline (278 → 485 tokens/sec) while maintaining accuracy parity with the baseline.

**Measurement setup:** vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.
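The measurement protocol above can be sketched as a small timing helper. This is illustrative, not the benchmark harness used for the table; `generate` is a stand-in for a call such as `llm.generate([prompt], sampling)` with `max_tokens=128`:

```python
import time

def measure_tokens_per_sec(generate, max_new_tokens=128, warmup=10, runs=100):
    """Throughput-measurement sketch: warm up, then average over timed runs."""
    for _ in range(warmup):   # warm-up runs, not timed
        generate()
    start = time.perf_counter()
    for _ in range(runs):     # timed runs
        generate()
    elapsed = time.perf_counter() - start
    # Assumes each run decodes exactly max_new_tokens tokens.
    return runs * max_new_tokens / elapsed
```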

**NVIDIA H200 reference measurement (FP8):** 512 tokens/sec.

---

## Accuracy (Parity with Baseline)

| **Method** | **MMLU-Pro** | **HellaSwag** | **IFEval** | **BoolQ** | **BBH** | **TruthfulQA** | **GSM8K** |
|-------------|---------------|----------------|--------------|-------------|-------------|----------------|--------------|
| **Baseline** | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |
| **FlashHead** | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |

FlashHead matches baseline performance across all evaluation benchmarks to the reported precision.

---

## Installation

```bash
pip install embedl-models
```

The `embedl-models` package is required; it provides the optimized FlashHead implementation and the quantized model runtime.

---

## Usage Examples
**Note (vLLM context length):** `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`).
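To gauge whether a given `max_model_len` fits, you can estimate the per-sequence KV-cache size from the model configuration. A rough sketch; the default values below (16 layers, 8 KV heads, head dimension 64 for Llama-3.2-1B) are assumptions, so check the model's `config.json`:

```python
def kv_cache_bytes(seq_len, num_layers=16, num_kv_heads=8,
                   head_dim=64, dtype_bytes=2):
    """Approximate per-sequence KV-cache size: two tensors (K and V)
    per layer, each of shape [seq_len, num_kv_heads, head_dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Full 131,072-token context at 16-bit precision:
print(kv_cache_bytes(131_072) / 2**30)  # → 4.0 (GiB)
```

If that figure exceeds the VRAM left after loading the weights, lower `max_model_len` accordingly.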

### vLLM Inference

```python
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```
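For multi-turn conversations, the instruct model expects prompts in the Llama 3 chat format. Below is a simplified sketch of that format for illustration; in practice, prefer `tokenizer.apply_chat_template` from `transformers`, which renders the template exactly:

```python
def build_llama3_chat_prompt(messages):
    """Render chat messages into the Llama 3 instruct format (simplified)."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    # Leave the prompt open at an assistant turn so the model completes it.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = build_llama3_chat_prompt([
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Write a haiku about coffee."},
])
```

The resulting string can then be passed to `llm.generate([prompt], sampling)` as in the example above.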

---

### Interactive REPL Example

The `run_repl()` coroutine launches an **interactive, streaming chat interface** using the vLLM backend with FlashHead enabled.  
It maintains an in-memory chat history and supports simple commands such as `/exit` to quit and `/reset` to clear context.

```python
import asyncio
from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072
        )
    )
```

---

## ⚠️ Important Warning: Hugging Face Transformers Support

> **FlashHead is currently not applied when using the Hugging Face `transformers` pipeline.**  
> Generation through `transformers` will fall back to the standard dense LM head, **disabling FlashHead acceleration**.  
> 
> For now, **we strongly recommend using the vLLM integration** (`embedl.models.vllm.LLM`) to ensure FlashHead is active and optimized for low-latency inference.
>
> Full support for the Hugging Face `transformers` pipeline with FlashHead integration will be released **in the coming days**.

---

## Limitations

- Limited to **vLLM 0.10.2** (pinned dependency)
- **Batch size = 1** (real-time generation)
- Currently optimized for **NVIDIA RTX GPUs**

---

## Roadmap

Planned improvements:

- Hugging Face `transformers` generation support
- Advanced mixed precision quantization
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in **Transformers** and **vLLM**
- Compatibility with **GGUF**, **MLC**, **llama.cpp**, **Ollama**, etc.
- Broader model coverage (larger models, VLMs, VLAs)

---

## License

- **Upstream:** Meta Llama 3.2 License
- **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)*

---

## Contact

**Enterprise & Commercial Inquiries**
[sales@embedl.com](mailto:sales@embedl.com)

**Technical Issues & Early Access**
[https://github.com/embedl/embedl-models](https://github.com/embedl/embedl-models)

**More Information & Model Releases**
[https://embedl.com](https://embedl.com)

---

### Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities

Contact: [sales@embedl.com](mailto:sales@embedl.com)