---
license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
- meta-llama/Llama-3.2-1B-Instruct
tags:
- text-generation-inference
---

# Llama-3.2-1B-Instruct-FlashHead-W4A16

![Model banner](assets/FlashHead.png)

**Optimized version of Llama-3.2-1B-Instruct using quantization and FlashHead, Embedl's efficient replacement for the language-model head, reducing size while preserving accuracy.**

Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:

- FlashHead
- Quantization (W4A16)
- Custom vLLM generation via `embedl-models`

FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, achieves H200-class throughput on RTX Ada GPUs.

### Quickstart

Launch an interactive chat window (supporting `/reset` and `/exit` commands) with:

```shell
pip install embedl-models
python3 -m embedl.models.vllm.demo --model embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16
```

---

## Model Details

| **Field** | **Value** |
|------------|------------|
| **Base Model** | Llama-3.2-1B-Instruct |
| **Input / Output** | Text → Text |
| **Release Date** | 2025-12-08 |
| **Version** | 1.0 |
| **Optimizations** | FlashHead LM Head, Quantization (W4A16) |
| **Developers** | Embedl |
| **Licenses** | Upstream: Meta Llama 3.2 License. Built with Llama. Optimized components: Embedl Models Community Licence v1.0 *(no redistribution)* |
| **Intended Use** | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |

---

## Optimizations

- **FlashHead LM Head** - lightweight replacement for the dense LM head, significantly improving throughput.
- **Quantization (W4A16)** - large reduction in memory footprint and latency.
- **Custom Runtime Integration** - compatible with **vLLM (0.10.2)** via the `embedl-models` package.

---

## Performance

### Token Generation Speed (RTX 3500 Ada, batch size = 1)

| **Precision** | **Tokens/sec** | **Speedup vs BF16** |
|----------------|----------------|----------------------|
| BF16 baseline | 130 | 1.0× |
| **FlashHead (Embedl)** | **163** | **1.25×** |
| W4A16 baseline | 278 | 2.14× |
| **FlashHead W4A16 (Embedl)** | **485** | **3.73×** |

FlashHead W4A16 improves end-to-end speed by **1.75×** over the state-of-the-art W4A16 baseline, while maintaining full accuracy parity.

**Measurement setup:** vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.

**NVIDIA H200 measurement:** **FP8**, **512 tokens/sec**.

---

## Accuracy (Parity with Baseline)

| **Method** | **MMLU-Pro** | **HellaSwag** | **IFEval** | **BoolQ** | **BBH** | **TruthfulQA** | **GSM8K** |
|-------------|---------------|----------------|--------------|-------------|-------------|----------------|--------------|
| **Baseline** | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |
| **FlashHead** | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |

FlashHead matches baseline accuracy to the reported precision.

---

## Installation

```bash
pip install embedl-models
```

The `embedl-models` package is required; it provides the optimized FlashHead implementation and the quantized model runtime.

---

## Usage Examples

**Note (vLLM context length):** `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache.
If you see a KV-cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`).

### vLLM Inference

```python
from vllm import SamplingParams

from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```

---

### Interactive REPL Example

The `run_repl()` coroutine launches an **interactive, streaming chat interface** using the vLLM backend with FlashHead enabled. It maintains an in-memory chat history and supports simple commands such as `/exit` to quit and `/reset` to clear context.

```python
import asyncio

from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072,
        )
    )
```

---

## ⚠️ Important Warning: Hugging Face Transformers Support

> **FlashHead is currently not applied when using the Hugging Face `transformers` pipeline.**
> Generation through `transformers` falls back to the standard dense LM head, **disabling FlashHead acceleration**.
>
> For now, **we strongly recommend using the vLLM integration** (`embedl.models.vllm.LLM`) to ensure FlashHead is active and optimized for low-latency inference.
>
> Full support for the Hugging Face `transformers` pipeline with FlashHead integration will be released **in the coming days**.
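As a rough illustration of why a full 131072-token context can exhaust VRAM, the sketch below estimates the KV-cache footprint per token from the published Llama-3.2-1B architecture (16 layers, 8 grouped-query KV heads, head dimension 64), assuming an FP16/BF16 cache. The helper names are hypothetical and the arithmetic is a back-of-the-envelope estimate only; vLLM's actual accounting (paged block allocation, `gpu_memory_utilization`, weights and activations) will differ.

```python
def kv_cache_bytes_per_token(
    num_layers: int = 16,   # Llama-3.2-1B transformer layers (assumed config)
    num_kv_heads: int = 8,  # grouped-query attention KV heads (assumed config)
    head_dim: int = 64,     # per-head dimension (assumed config)
    dtype_bytes: int = 2,   # FP16/BF16 cache entries
) -> int:
    """Estimate KV-cache bytes per token: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_len_for_budget(vram_budget_bytes: int) -> int:
    """Largest context length whose estimated KV cache fits the VRAM budget."""
    return vram_budget_bytes // kv_cache_bytes_per_token()


if __name__ == "__main__":
    per_token = kv_cache_bytes_per_token()
    print(f"KV cache per token: {per_token / 1024:.0f} KiB")          # 32 KiB
    print(f"131072-token context: {131_072 * per_token / 2**30:.0f} GiB")  # 4 GiB
    print(f"Tokens fitting in 2 GiB: {max_len_for_budget(2 * 2**30)}")
```

Under these assumptions the full context needs roughly 4 GiB of KV cache on top of the model weights, which is why lowering `max_model_len` is the first remedy on smaller RTX cards.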
---

## Limitations

- Limited to **vLLM 0.10.2** (pinned dependency)
- **Batch size = 1** (real-time generation)
- Currently optimized for **NVIDIA RTX GPUs**

---

## Roadmap

Planned improvements:

- Hugging Face `transformers` generation
- Advanced mixed-precision quantization
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in **Transformers** and **vLLM**
- Compatibility with **GGUF**, **MLC**, **llama.cpp**, **Ollama**, etc.
- Broader model coverage (larger models, VLMs, VLAs)

---

## License

- **Upstream:** Meta Llama 3.2 License
- **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)*

---

## Contact

**Enterprise & Commercial Inquiries**
[sales@embedl.com](mailto:sales@embedl.com)

**Technical Issues & Early Access**
[https://github.com/embedl/embedl-models](https://github.com/embedl/embedl-models)

**More Information & Model Releases**
[https://embedl.com](https://embedl.com)

---

### Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities

Contact: [sales@embedl.com](mailto:sales@embedl.com)