---
license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
- meta-llama/Llama-3.2-3B-Instruct
tags:
- text-generation-inference
---
# Llama-3.2-3B-Instruct-FlashHead

**An optimized version of Llama-3.2-3B-Instruct using FlashHead, Embedl's efficient replacement for the language-model head, which reduces model size while preserving accuracy.**

Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:
- FlashHead
- Custom vLLM generation via `embedl-models`

FlashHead matches the Llama-3.2-3B-Instruct baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers state-of-the-art on-device latency.
### Quickstart
Launch an interactive chat window (supporting the `/reset` and `/exit` commands) with:
```shell
pip install embedl-models
python3 -m embedl.models.vllm.demo --model embedl/Llama-3.2-3B-Instruct-FlashHead
```
---
## Model Details
| **Field** | **Value** |
|------------|------------|
| **Base Model** | Llama-3.2-3B-Instruct |
| **Input / Output** | Text → Text |
| **Release Date** | 2025-12-08 |
| **Version** | 1.0 |
| **Optimizations** | FlashHead LM Head|
| **Developers** | Embedl |
| **Licenses** | Upstream: Meta Llama 3.2 License. Built with Llama. <br>Optimized components: Embedl Models Community Licence v1.0 *(no redistribution)* |
| **Intended Use** | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |
---
## Optimizations
- **FlashHead LM Head** - lightweight replacement for the dense LM head, significantly improving throughput.
- **Custom Runtime Integration** - compatible with **vLLM (0.10.2)** via the `embedl-models` package.
---
## Performance
### Token Generation Speed (RTX 3500 Ada, batch size = 1)
| **Precision** | **Tokens/sec** | **Speedup vs BF16** |
|----------------|----------------|----------------------|
| BF16 baseline | 54 | 1.0× |
| **FlashHead (Embedl)** | **58** | **1.07×** |
| W4A16 baseline | 141 | 2.61× |
| **FlashHead W4A16 (Embedl)** | **177** | **3.28×** |
FlashHead W4A16 improves end-to-end generation speed by **1.26×** over the state-of-the-art quantized baseline while matching baseline accuracy within rounding error.
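The speedup columns above follow directly from the tokens/sec figures; a quick sanity check using only the table values (no model or GPU required):

```python
# Tokens/sec figures copied from the table above (RTX 3500 Ada, batch size 1).
rates = {
    "BF16 baseline": 54,
    "FlashHead": 58,
    "W4A16 baseline": 141,
    "FlashHead W4A16": 177,
}

# Speedup of each configuration relative to the BF16 baseline.
speedups = {name: round(tps / rates["BF16 baseline"], 2) for name, tps in rates.items()}
print(speedups)  # FlashHead W4A16 comes out at 3.28

# End-to-end gain of FlashHead W4A16 over the quantized W4A16 baseline.
print(round(rates["FlashHead W4A16"] / rates["W4A16 baseline"], 2))  # 1.26
```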
---
## Accuracy (Parity with Baseline)
| **Method** | **MMLU-Pro** | **IFEval** | **BBH** | **TruthfulQA** | **GSM8K** |
|-------------|---------------|-------------|-------------|----------------|--------------|
| **Baseline** | 0.31 | 0.57 | 0.57 | 0.57 | 0.77 |
| **FlashHead** | 0.31 | 0.56 | 0.57 | 0.58 | 0.77 |
FlashHead matches baseline accuracy to within 0.01 on every benchmark.
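The parity claim can be verified directly from the table; this small script (scores copied from the table above, not re-measured) computes the largest deviation:

```python
# Benchmark scores from the accuracy table above.
baseline  = {"MMLU-Pro": 0.31, "IFEval": 0.57, "BBH": 0.57, "TruthfulQA": 0.57, "GSM8K": 0.77}
flashhead = {"MMLU-Pro": 0.31, "IFEval": 0.56, "BBH": 0.57, "TruthfulQA": 0.58, "GSM8K": 0.77}

# Largest absolute difference across the five benchmarks.
max_delta = max(round(abs(baseline[k] - flashhead[k]), 2) for k in baseline)
print(max_delta)  # 0.01
```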
---
## Installation
```bash
pip install embedl-models
```
The `embedl-models` package is required; it provides the optimized FlashHead implementation and the quantized model runtime.
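Since this card pins vLLM to 0.10.2 (see Limitations below), it can be useful to check what is installed before running. A minimal sketch using only the standard library; the distribution names `embedl-models` and `vllm` are assumptions based on the `pip install` command and the pinned version mentioned in this card:

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(dist_name: str):
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

if __name__ == "__main__":
    # Distribution names are assumptions; adjust if your environment differs.
    for dist in ("embedl-models", "vllm"):
        print(f"{dist}: {installed_version(dist) or 'not installed'}")
```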
---
## Usage Examples
**Note (vLLM context length):** `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`).
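As a rough guide to why the full 131,072-token context can exceed available VRAM, here is a back-of-the-envelope KV-cache estimate. The architecture constants (28 layers, 8 KV heads, head dim 128, 2-byte BF16 entries) are taken from the published Llama-3.2-3B config, not from this card, and the estimate ignores vLLM's weights, activations, and allocator overhead:

```python
def kv_cache_bytes(num_tokens: int,
                   n_layers: int = 28,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Approximate KV-cache size: 2 tensors (K and V) per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * num_tokens

full = kv_cache_bytes(131072)
print(f"{full / 2**30:.1f} GiB")  # 14.0 GiB for the full context in BF16
```

Halving `max_model_len` halves this figure, which is why lowering it is the first remedy for KV-cache memory errors.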
### vLLM Inference
```python
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```
---
### Interactive REPL Example
The `run_repl()` coroutine launches an **interactive, streaming chat interface** using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as `/exit` to quit and `/reset` to clear context.
```python
import asyncio
from embedl.models.vllm.demo import run_repl
model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead"

if __name__ == "__main__":
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072,
        )
    )
```
---
## ⚠️ Important Warning: Hugging Face Transformers Support
> **FlashHead is currently not applied when using the Hugging Face `transformers` pipeline.**
> Generation through `transformers` will fall back to the standard dense LM head, **disabling FlashHead acceleration**.
>
> For now, **we strongly recommend using the vLLM integration** (`embedl.models.vllm.LLM`) to ensure FlashHead is active and optimized for low-latency inference.
>
> Full support for the Hugging Face `transformers` pipeline with FlashHead integration will be released **in the coming days**.
---
## Limitations
- Limited to **vLLM 0.10.2** (pinned dependency)
- **Batch size = 1** (real-time generation)
- Currently optimized for **NVIDIA RTX GPUs**
---
## Roadmap
Planned improvements:
- Advanced mixed precision quantization
- Hugging Face `transformers` generation support
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in **Transformers** and **vLLM**
- Compatibility with **GGUF**, **MLC**, **Llama.cpp**, **Ollama**, etc.
- Broader model coverage (larger models, VLMs, VLAs)
---
## License
- **Upstream:** Meta Llama 3.2 License
- **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)*
---
## Contact
**Enterprise & Commercial Inquiries**
[sales@embedl.com](mailto:sales@embedl.com)
**Technical Issues & Early Access**
[https://github.com/embedl/embedl-models](https://github.com/embedl/embedl-models)
**More Information & Model Releases**
[https://embedl.com](https://embedl.com)
---
### Partner & Developer Opportunities
If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:
- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities
Contact: [sales@embedl.com](mailto:sales@embedl.com)