|
|
--- |
|
|
license: other |
|
|
license_name: embedl-models-community-licence-1.0 |
|
|
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE |
|
|
base_model: |
|
|
- meta-llama/Llama-3.2-3B-Instruct |
|
|
tags: |
|
|
- text-generation-inference |
|
|
--- |
|
|
|
|
|
|
|
|
# Llama-3.2-3B-Instruct-FlashHead-W4A16 |
|
|
|
|
|
 |
|
|
|
|
|
**Optimized version of Llama-3.2-3B-Instruct using W4A16 quantization and FlashHead, Embedl's efficient replacement for the language-model head, reducing model size while preserving accuracy.**
|
|
Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging: |
|
|
|
|
|
- FlashHead |
|
|
- Quantization (W4A16) |
|
|
- Custom vLLM generation via `embedl-models` |
|
|
|
|
|
FlashHead matches the Llama-3.2-3B-Instruct baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers state-of-the-art on-device latency.
|
|
|
|
|
--- |
|
|
|
|
|
## Model Details |
|
|
| **Field** | **Value** | |
|
|
|------------|------------| |
|
|
| **Base Model** | Llama-3.2-3B-Instruct | |
|
|
| **Input / Output** | Text → Text | |
|
|
| **Release Date** | 2025-12-08 | |
|
|
| **Version** | 1.0 | |
|
|
| **Optimizations** | FlashHead LM Head, Quantization (W4A16)| |
|
|
| **Developers** | Embedl | |
|
|
| **Licenses** | Upstream: Meta Llama 3.2 License. Built with Llama. <br>Optimized components: Embedl Models Community Licence v1.0 *(no redistribution)* | |
|
|
| **Intended Use** | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs | |
|
|
|
|
|
--- |
|
|
|
|
|
## Optimizations |
|
|
|
|
|
- **FlashHead LM Head** - lightweight replacement for the dense LM head, significantly improving throughput. |
|
|
- **Quantization (W4A16)** - 4-bit weights with 16-bit activations; large reduction in memory footprint and latency (see the illustrative sketch after this list).
|
|
- **Custom Runtime Integration** - compatible with **vLLM (0.10.2)** via the `embedl-models` package. |
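
To make **W4A16** concrete, below is a minimal, purely illustrative sketch of weight-only 4-bit quantization (4-bit weights, 16-bit activations). It is **not** Embedl's actual quantization scheme or kernel, only the numerical idea:

```python
# Illustrative only: per-output-channel symmetric 4-bit weight quantization.
# NOT the scheme or kernel shipped with this model; it just shows what
# "W4A16" (4-bit weights, 16-bit activations) means numerically.
import torch

def quantize_w4(weight: torch.Tensor):
    """Quantize a weight matrix [out, in] to int4 values stored in int8."""
    scale = (weight.abs().amax(dim=1, keepdim=True) / 7.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(weight / scale), -8, 7).to(torch.int8)
    return q, scale

def w4a16_linear(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor):
    """Matmul with 16-bit activations and on-the-fly dequantized weights."""
    w = q.to(torch.float16) * scale.to(torch.float16)
    return x @ w.t()

weight = torch.randn(128, 64)                 # one linear layer's weights
q, scale = quantize_w4(weight)                # ~4x smaller weight storage
x = torch.randn(4, 64, dtype=torch.float16)   # activations stay 16-bit
y = w4a16_linear(x, q, scale)
```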
|
|
|
|
|
--- |
|
|
|
|
|
## Performance |
|
|
|
|
|
### Token Generation Speed (RTX 3500 Ada, batch size = 1) |
|
|
|
|
|
| **Precision** | **Tokens/sec** | **Speedup vs BF16** | |
|
|
|----------------|----------------|----------------------| |
|
|
| BF16 baseline | 54 | 1.0× | |
|
|
| **FlashHead (Embedl)** | **58** | **1.07×** | |
|
|
| W4A16 baseline | 141 | 2.61× | |
|
|
| **FlashHead W4A16 (Embedl)** | **177** | **3.28×** | |
|
|
|
|
|
FlashHead improves end-to-end generation speed by **1.26×** over the W4A16 baseline (177 vs. 141 tokens/sec) while maintaining accuracy parity.
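
The speedups follow directly from the table; a quick recomputation (values copied from above):

```python
# Speedups derived from the table above (tokens/sec, RTX 3500 Ada, batch size 1).
rates = {"BF16": 54, "FlashHead": 58, "W4A16": 141, "FlashHead W4A16": 177}
for name, tps in rates.items():
    print(f"{name}: {tps / rates['BF16']:.2f}x vs BF16")      # 1.00, 1.07, 2.61, 3.28
print(f"FlashHead W4A16 vs W4A16: {rates['FlashHead W4A16'] / rates['W4A16']:.2f}x")  # 1.26
```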
|
|
|
|
|
--- |
|
|
|
|
|
## Accuracy (Parity with Baseline) |
|
|
|
|
|
| **Method** | **MMLU-Pro** | **IFEval** | **BBH** | **TruthfulQA** | **GSM8K** | |
|
|
|-------------|---------------|-------------|-------------|----------------|--------------| |
|
|
| **Baseline** | 0.31 | 0.57 | 0.57 | 0.57 | 0.77 | |
|
|
| **FlashHead** | 0.31 | 0.56 | 0.57 | 0.58 | 0.77 | |
|
|
|
|
|
FlashHead matches baseline accuracy within ±0.01 on every benchmark.
|
|
|
|
|
--- |
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
pip install embedl-models |
|
|
``` |
|
|
|
|
|
The `embedl-models` package is required; it provides the optimized FlashHead implementation and the quantized model runtime.
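
Since the integration currently pins **vLLM 0.10.2** (see Limitations), a quick sanity check of the installed version can save debugging time:

```python
# Sanity check: the FlashHead integration targets vLLM 0.10.2.
import vllm
print(vllm.__version__)  # expected: 0.10.2
```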
|
|
|
|
|
--- |
|
|
|
|
|
## Usage Examples |
|
|
**Note (vLLM context length):** `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`). |
|
|
|
|
|
### vLLM Inference |
|
|
|
|
|
```python
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    # Greedy decoding, up to 128 new tokens.
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    # Load the quantized FlashHead model through the custom vLLM integration.
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)

    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```
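
If the full 131,072-token context does not fit in VRAM (see the note at the top of this section), lowering `max_model_len` is the usual fix. A minimal sketch, assuming the `embedl.models.vllm.LLM` wrapper forwards standard vLLM engine arguments such as `gpu_memory_utilization`:

```python
# Sketch: reduced-context configuration for GPUs with limited VRAM.
# Assumes the embedl LLM wrapper forwards standard vLLM engine args
# (max_model_len, gpu_memory_utilization) unchanged.
from embedl.models.vllm import LLM

llm = LLM(
    model="embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16",
    trust_remote_code=True,
    max_model_len=8192,           # far smaller KV cache than 131072
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim
)
```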
|
|
|
|
|
--- |
|
|
|
|
|
### Interactive REPL Example |
|
|
|
|
|
The `run_repl()` coroutine launches an **interactive, streaming chat interface** using the vLLM backend with FlashHead enabled. |
|
|
It maintains an in-memory chat history and supports simple commands such as `/exit` to quit and `/reset` to clear context. |
|
|
|
|
|
```python
import asyncio

from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    # Starts a streaming chat REPL; /reset clears history, /exit quits.
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072,
        )
    )
```
|
|
---
|
|
|
|
|
## ⚠️ Important Warning: Hugging Face Transformers Support |
|
|
|
|
|
> **FlashHead is currently not applied when using the Hugging Face `transformers` pipeline.** |
|
|
> Generation through `transformers` will fall back to the standard dense LM head, **disabling FlashHead acceleration**. |
|
|
> |
|
|
> For now, **we strongly recommend using the vLLM integration** (`embedl.models.vllm.LLM`) to ensure FlashHead is active and optimized for low-latency inference. |
|
|
> |
|
|
> Full support for the Hugging Face `transformers` pipeline with FlashHead integration will be released **in the coming days**. |
|
|
|
|
|
--- |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Limited to **vLLM 0.10.2** (pinned dependency) |
|
|
- Tuned for **batch size = 1** (real-time, single-stream generation)
|
|
- Currently optimized for **NVIDIA RTX GPUs** |
|
|
|
|
|
--- |
|
|
|
|
|
## Roadmap |
|
|
|
|
|
Planned improvements: |
|
|
|
|
|
- Hugging Face `transformers` generation support
|
|
- Advanced mixed precision quantization |
|
|
- vLLM CLI benchmarking for detailed latency evaluation |
|
|
- `lm-eval-harness` integration for detailed accuracy evaluation |
|
|
- Upstream support in **Transformers** and **vLLM** |
|
|
- Compatibility with **GGUF**, **MLC**, **llama.cpp**, **Ollama**, etc.
|
|
- Broader model coverage (larger models, VLMs, VLAs) |
|
|
|
|
|
--- |
|
|
|
|
|
## License |
|
|
|
|
|
- **Upstream:** Meta Llama 3.2 License |
|
|
- **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)* |
|
|
|
|
|
--- |
|
|
|
|
|
## Contact |
|
|
|
|
|
**Enterprise & Commercial Inquiries** |
|
|
[sales@embedl.com](mailto:sales@embedl.com) |
|
|
|
|
|
**Technical Issues & Early Access** |
|
|
[https://github.com/embedl/embedl-models](https://github.com/embedl/embedl-models) |
|
|
|
|
|
**More Information & Model Releases** |
|
|
[https://embedl.com](https://embedl.com) |
|
|
|
|
|
--- |
|
|
|
|
|
### Partner & Developer Opportunities |
|
|
|
|
|
If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for: |
|
|
|
|
|
- Embedl SDK - AI optimization tools & profiling |
|
|
- Embedl HUB - benchmarking platform |
|
|
- Engineering support for on-prem/edge deployments |
|
|
- Migration guidance (Llama / Qwen / Gemma) |
|
|
- Early access & partner co-marketing opportunities |
|
|
|
|
|
Contact: [sales@embedl.com](mailto:sales@embedl.com) |