---
license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
- meta-llama/Llama-3.2-3B-Instruct
tags:
- text-generation-inference
---


# Llama-3.2-3B-Instruct-FlashHead-W4A16

![FlashHead banner](assets/FlashHead.png)

**An optimized version of Llama-3.2-3B-Instruct combining W4A16 quantization with FlashHead, Embedl’s efficient replacement for the language-model head, which reduces model size while preserving accuracy.**
Designed for **low-latency inference** on **NVIDIA RTX GPUs**, it leverages:

- FlashHead
- Quantization (W4A16)
- Custom vLLM generation via `embedl-models`

FlashHead matches the Llama-3.2-3B-Instruct baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with W4A16 quantization, delivers state-of-the-art on-device latency.

---

## Model Details
| **Field** | **Value** |
|------------|------------|
| **Base Model** | Llama-3.2-3B-Instruct |
| **Input / Output** | Text → Text |
| **Release Date** | 2025-12-08 |
| **Version** | 1.0 |
| **Optimizations** | FlashHead LM Head, Quantization (W4A16)|
| **Developers** | Embedl |
| **Licenses** | Upstream: Meta Llama 3.2 License. Built with Llama. <br>Optimized components: Embedl Models Community Licence v1.0 *(no redistribution)* |
| **Intended Use** | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |

---

## Optimizations

- **FlashHead LM Head** - lightweight replacement for the dense LM head, significantly improving throughput.
- **Quantization (W4A16)** - large reduction in memory footprint and latency.
- **Custom Runtime Integration** - compatible with **vLLM (0.10.2)** via the `embedl-models` package.

---

## Performance

### Token Generation Speed (RTX 3500 Ada, batch size = 1)

| **Precision** | **Tokens/sec** | **Speedup vs BF16** |
|----------------|----------------|----------------------|
| BF16 baseline | 54 | 1.0× |
| **FlashHead (Embedl)** | **58** | **1.07×** |
| W4A16 baseline | 141 | 2.61× |
| **FlashHead W4A16 (Embedl)** | **177** | **3.28×** |

FlashHead W4A16 improves end-to-end generation speed by **1.26×** over the state-of-the-art W4A16 baseline (177 vs. 141 tokens/sec), while maintaining accuracy parity.
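
Throughput figures like those above can be approximated with a simple wall-clock measurement. This is a minimal sketch, not the harness used to produce the table; the `max_model_len=4096` value is an arbitrary small context chosen to fit comfortably in VRAM.

```python
import time

from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=4096)
    sampling = SamplingParams(max_tokens=256, temperature=0.0)

    # Warm-up run so one-time startup costs are excluded from the timing.
    llm.generate(["Warm-up prompt."], sampling)

    start = time.perf_counter()
    outputs = llm.generate(["Explain KV caching in one paragraph."], sampling)
    elapsed = time.perf_counter() - start

    # Count only generated (completion) tokens, matching a tokens/sec metric.
    generated = len(outputs[0].outputs[0].token_ids)
    print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tok/s")
```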

---

## Accuracy (Parity with Baseline)

| **Method** | **MMLU-Pro** | **IFEval** | **BBH** | **TruthfulQA** | **GSM8K** |
|-------------|---------------|-------------|-------------|----------------|--------------|
| **Baseline** | 0.31 | 0.57 | 0.57 | 0.57 | 0.77 |
| **FlashHead** | 0.31 | 0.56 | 0.57 | 0.58 | 0.77 |

FlashHead closely matches baseline accuracy.

---

## Installation

```bash
pip install embedl-models
```

The `embedl-models` package is required; it provides the optimized FlashHead implementation and the quantized model runtime.
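
As a quick sanity check after installation, the import below should succeed. This is a minimal smoke test assuming only the `embedl.models.vllm` entry point used in the usage examples below.

```python
# Smoke test: verifies that embedl-models and its vLLM integration import cleanly.
from vllm import SamplingParams          # pinned vLLM dependency
from embedl.models.vllm import LLM       # FlashHead-enabled engine wrapper

print("embedl-models import OK:", LLM is not None)
```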

---

## Usage Examples
**Note (vLLM context length):** `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`).

### vLLM Inference

```python
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```
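
If the 131,072-token context above triggers a KV cache memory error (see the note at the top of this section), the same constructor can be called with a smaller window. The values below are illustrative, and this sketch assumes the `embedl-models` wrapper forwards standard vLLM engine arguments such as `gpu_memory_utilization`:

```python
# Reduced-context configuration for GPUs with limited free VRAM.
llm = LLM(
    model=model_id,
    trust_remote_code=True,
    max_model_len=8192,           # illustrative; pick the largest value that fits
    gpu_memory_utilization=0.95,  # fraction of VRAM vLLM may claim (default 0.9)
)
```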

---

### Interactive REPL Example

The `run_repl()` coroutine launches an **interactive, streaming chat interface** using the vLLM backend with FlashHead enabled.  
It maintains an in-memory chat history and supports simple commands such as `/exit` to quit and `/reset` to clear context.

```python
import asyncio
from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072
        )
    )
```

---

## ⚠️ Important Warning: Hugging Face Transformers Support

> **FlashHead is currently not applied when using the Hugging Face `transformers` pipeline.**  
> Generation through `transformers` will fall back to the standard dense LM head, **disabling FlashHead acceleration**.  
> 
> For now, **we strongly recommend using the vLLM integration** (`embedl.models.vllm.LLM`) to ensure FlashHead is active and optimized for low-latency inference.
>
> Full support for the Hugging Face `transformers` pipeline with FlashHead integration will be released **in the coming days**.
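
For chat-style generation through the recommended vLLM path, multi-turn messages can be passed directly. This sketch assumes the `embedl-models` wrapper preserves vLLM's standard `LLM.chat` API; the 8,192-token context is illustrative.

```python
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=8192)
    sampling = SamplingParams(max_tokens=256, temperature=0.7)

    # vLLM applies the model's chat template to role/content messages.
    messages = [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize FlashHead in two sentences."},
    ]
    outputs = llm.chat(messages, sampling)
    print(outputs[0].outputs[0].text)
```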

---

## Limitations

- Limited to **vLLM 0.10.2** (pinned dependency)
- **Batch size = 1** (real-time generation)
- Currently optimized for **NVIDIA RTX GPUs**

---

## Roadmap

Planned improvements:

- Hugging Face `transformers` generation support
- Advanced mixed-precision quantization
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in **Transformers** and **vLLM**
- Compatibility with **GGUF**, **MLC**, **Llama.cpp**, **Ollama**, etc.
- Broader model coverage (larger models, VLMs, VLAs)

---

## License

- **Upstream:** Meta Llama 3.2 License
- **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)*

---

## Contact

**Enterprise & Commercial Inquiries**
[sales@embedl.com](mailto:sales@embedl.com)

**Technical Issues & Early Access**
[https://github.com/embedl/embedl-models](https://github.com/embedl/embedl-models)

**More Information & Model Releases**
[https://embedl.com](https://embedl.com)

---

### Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities

Contact: [sales@embedl.com](mailto:sales@embedl.com)