---
license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
- meta-llama/Llama-3.2-3B-Instruct
tags:
- text-generation-inference
---
# Llama-3.2-3B-Instruct-FlashHead-W4A16

**An optimized version of Llama-3.2-3B-Instruct using W4A16 quantization and FlashHead, Embedl’s efficient replacement for the language-model head, reducing model size while preserving accuracy.**
Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:
- FlashHead
- Quantization (W4A16)
- Custom vLLM generation via `embedl-models`
FlashHead matches the Llama-3.2-3B-Instruct baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers state-of-the-art on-device latency.
---
## Model Details
| **Field** | **Value** |
|------------|------------|
| **Base Model** | Llama-3.2-3B-Instruct |
| **Input / Output** | Text → Text |
| **Release Date** | 2025-12-08 |
| **Version** | 1.0 |
| **Optimizations** | FlashHead LM Head, Quantization (W4A16) |
| **Developers** | Embedl |
| **Licenses** | Upstream: Meta Llama 3.2 License. Built with Llama. <br>Optimized components: Embedl Models Community Licence v1.0 *(no redistribution)* |
| **Intended Use** | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |
---
## Optimizations
- **FlashHead LM Head** - lightweight replacement for the dense LM head, significantly improving throughput.
- **Quantization (W4A16)** - large reduction in memory footprint and latency.
- **Custom Runtime Integration** - compatible with **vLLM (0.10.2)** via the `embedl-models` package.
---
## Performance
### Token Generation Speed (RTX 3500 Ada, batch size = 1)
| **Precision** | **Tokens/sec** | **Speedup vs BF16** |
|----------------|----------------|----------------------|
| BF16 baseline | 54 | 1.0× |
| **FlashHead (Embedl)** | **58** | **1.07×** |
| W4A16 baseline | 141 | 2.61× |
| **FlashHead W4A16 (Embedl)** | **177** | **3.28×** |
Combined with W4A16 quantization, FlashHead improves end-to-end generation speed by **1.26×** over the W4A16 baseline (177 vs. 141 tokens/sec), while maintaining full accuracy parity.
---
## Accuracy (Parity with Baseline)
| **Method** | **MMLU-Pro** | **IFEval** | **BBH** | **TruthfulQA** | **GSM8K** |
|-------------|---------------|-------------|-------------|----------------|--------------|
| **Baseline** | 0.31 | 0.57 | 0.57 | 0.57 | 0.77 |
| **FlashHead** | 0.31 | 0.56 | 0.57 | 0.58 | 0.77 |
FlashHead closely matches baseline accuracy.
---
## Installation
```bash
pip install embedl-models
```
The `embedl-models` package is required; it provides the optimized FlashHead implementation and the quantized model runtime.
---
## Usage Examples
**Note (vLLM context length):** `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`); a reduced-context variant is sketched after the example below.
### vLLM Inference
```python
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    # Greedy decoding, up to 128 new tokens
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    # FlashHead-enabled engine; see the context-length note above
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)

    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```
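If the full 131,072-token window does not fit on your GPU, shrinking it is usually enough. A minimal sketch, assuming only the constructor arguments change; the specific values (8,192 tokens, 0.90 utilization) are illustrative, not tuned recommendations:
```python
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    # Smaller context window -> smaller KV cache (illustrative value)
    llm = LLM(
        model=model_id,
        trust_remote_code=True,
        max_model_len=8192,
        gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim
    )
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    output = llm.generate(["Explain KV caching in one sentence."], sampling)
    print(output[0].outputs[0].text)
```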
---
### Interactive REPL Example
The `run_repl()` coroutine launches an **interactive, streaming chat interface** using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as `/exit` to quit and `/reset` to clear context.
```python
import asyncio
from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    # Start the streaming chat loop; type /exit to quit, /reset to clear history
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072,
        )
    )
```
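If you need chat history in a plain script rather than the interactive REPL, the batch API from the first example can be combined with the model's chat template. This is a minimal, non-streaming sketch; formatting prompts with `AutoTokenizer.apply_chat_template` is our assumption about a workable approach, not part of the `embedl-models` API:
```python
from transformers import AutoTokenizer
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    # Assumption: the model repo ships a standard tokenizer with a chat template
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=8192)
    sampling = SamplingParams(max_tokens=256, temperature=0.0)

    history = []  # in-memory chat history, analogous to what run_repl keeps
    for user_turn in ["Who trained you?", "Summarize that in five words."]:
        history.append({"role": "user", "content": user_turn})
        # Render the accumulated turns with the model's own chat template
        prompt = tokenizer.apply_chat_template(
            history, tokenize=False, add_generation_prompt=True
        )
        reply = llm.generate([prompt], sampling)[0].outputs[0].text
        history.append({"role": "assistant", "content": reply})
        print(f"user: {user_turn}\nassistant: {reply}\n")
```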
---
## ⚠️ Important Warning: Hugging Face Transformers Support
> **FlashHead is currently not applied when using the Hugging Face `transformers` pipeline.**
> Generation through `transformers` will fall back to the standard dense LM head, **disabling FlashHead acceleration**.
>
> For now, **we strongly recommend using the vLLM integration** (`embedl.models.vllm.LLM`) to ensure FlashHead is active and optimized for low-latency inference.
>
> Full support for the Hugging Face `transformers` pipeline with FlashHead integration will be released **in the coming days**.
---
## Limitations
- Limited to **vLLM 0.10.2** (pinned dependency)
- **Batch size = 1** (real-time generation)
- Currently optimized for **NVIDIA RTX GPUs**
---
## Roadmap
Planned improvements:
- Hugging Face `transformers` generation
- Advanced mixed-precision quantization
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in **Transformers** and **vLLM**
- Compatibility with **GGUF**, **MLC**, **Llama.cpp**, **Ollama**, etc.
- Broader model coverage (larger models, VLMs, VLAs)
---
## License
- **Upstream:** Meta Llama 3.2 License
- **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)*
---
## Contact
**Enterprise & Commercial Inquiries**
[sales@embedl.com](mailto:sales@embedl.com)
**Technical Issues & Early Access**
[https://github.com/embedl/embedl-models](https://github.com/embedl/embedl-models)
**More Information & Model Releases**
[https://embedl.com](https://embedl.com)
---
### Partner & Developer Opportunities
If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:
- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities
Contact: [sales@embedl.com](mailto:sales@embedl.com)