---
license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
- google/gemma-3-1b-it
tags:
- text-generation-inference
---

# gemma-3-1b-it-FlashHead

![Embedl](images/Embedl_Logo_RGB_Rosa.png)

**Optimized version of gemma-3-1b-it using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy.**
Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:

- FlashHead
- Custom vLLM generation via `embedl-models`

FlashHead closely matches the gemma-3-1b-it baseline on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers state-of-the-art on-device latency.

### Quickstart

Launch an interactive chat window (with `/reset` and `/exit` commands) using:

```shell
pip install embedl-models
python3 -m embedl.models.vllm.demo --model embedl/gemma-3-1b-it-FlashHead
```

---

## Model Details

| **Field** | **Value** |
|------------|------------|
| **Base Model** | gemma-3-1b-it |
| **Input / Output** | Text → Text |
| **Release Date** | 2025-12-08 |
| **Version** | 1.0 |
| **Optimizations** | FlashHead LM Head |
| **Developers** | Embedl |
| **Licenses** | Upstream: Gemma Terms of Use. <br>Optimized components: Embedl Models Community Licence v1.0 *(no redistribution)* |
| **Intended Use** | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |

---

## Optimizations

- **FlashHead LM Head** - lightweight replacement for the dense LM head, significantly improving throughput (see the illustrative sketch below).
- **Custom Runtime Integration** - compatible with **vLLM (0.10.2)** via the `embedl-models` package.
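
FlashHead's internal design is not documented in this card, so the snippet below is *not* Embedl's actual method. It is only a generic, hypothetical illustration of the idea behind replacing the dense hidden-to-vocabulary projection with a cheaper head, here a low-rank factorization; the class name, rank, and dimensions are all made up (ballpark figures for a Gemma-3-1B-class model):

```python
import torch
import torch.nn as nn

class FactorizedLMHead(nn.Module):
    """Hypothetical low-rank LM head: hidden -> rank -> vocab.

    Replaces one (hidden x vocab) matmul with two smaller ones,
    cutting parameters and per-token FLOPs when rank << hidden.
    """

    def __init__(self, hidden_size: int, vocab_size: int, rank: int):
        super().__init__()
        self.down = nn.Linear(hidden_size, rank, bias=False)
        self.up = nn.Linear(rank, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(hidden_states))

# Gemma 3 uses a ~262k-token vocabulary, so at batch size 1 the dense head
# is a large share of per-token decode compute; shrinking it helps latency.
head = FactorizedLMHead(hidden_size=1152, vocab_size=262144, rank=256)
logits = head(torch.randn(1, 1152))
print(logits.shape)  # torch.Size([1, 262144])
```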

---

## Performance

### Token Generation Speed (RTX 3500 Ada, batch size = 1)

| **Precision** | **Tokens/sec** | **Speedup vs BF16** |
|----------------|----------------|----------------------|
| BF16 baseline | 148 | 1.0× |
| **FlashHead (Embedl)** | **178** | **1.20×** |
| W4A16 baseline | 243 | 1.64× |
| **FlashHead W4A16 (Embedl)** | **336** | **2.27×** |

FlashHead improves end-to-end speed by **1.38×** over the state-of-the-art W4A16 baseline, while maintaining accuracy parity.

**Measurement setup:** vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.
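
The numbers above can be sanity-checked with a small timing harness. The sketch below is an assumption-laden reproduction, not Embedl's benchmark script: it assumes the `embedl.models.vllm.LLM` wrapper accepts the same arguments as vLLM's `LLM` (as in the usage example below), and it approximates the 32-token prompt with repeated text:

```python
import time

from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/gemma-3-1b-it-FlashHead"

if __name__ == "__main__":
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=4096)
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    # Roughly a 32-token prompt (exact count depends on the tokenizer).
    prompt = "The quick brown fox jumps over the lazy dog. " * 4

    # Warm-up runs (excluded from timing).
    for _ in range(10):
        llm.generate([prompt], sampling)

    # Timed runs, batch size 1.
    n_runs = 100
    start = time.perf_counter()
    for _ in range(n_runs):
        outputs = llm.generate([prompt], sampling)
    elapsed = time.perf_counter() - start

    generated = n_runs * len(outputs[0].outputs[0].token_ids)
    print(f"{generated / elapsed:.1f} tokens/sec")
```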

---

## Accuracy (Parity with Baseline)

| **Method** | **MMLU-Pro** | **IFEval** | **BBH** | **TruthfulQA** | **GSM8K** |
|-------------|---------------|--------------|-------------|----------------|--------------|
| **Baseline** | 0.15 | 0.55 | 0.38 | 0.31 | 0.42 |
| **FlashHead** | 0.15 | 0.49 | 0.38 | 0.31 | 0.39 |

FlashHead closely matches baseline accuracy.

---

## Installation

```bash
pip install embedl-models
```

The `embedl-models` package is required; it provides the optimized FlashHead implementation and the quantized model runtime.

---

## Usage Examples

**Note (vLLM context length):** `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`), as sketched below.
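
For example, a minimal sketch of such a fallback, assuming the `embedl.models.vllm.LLM` wrapper (used in the examples below) forwards these standard vLLM engine arguments; the values are illustrative, not tuned:

```python
from embedl.models.vllm import LLM

# Illustrative fallback values; the right numbers depend on your GPU.
llm = LLM(
    model="embedl/gemma-3-1b-it-FlashHead",
    trust_remote_code=True,
    max_model_len=8192,           # reduced from 131072
    gpu_memory_utilization=0.95,  # vLLM's default is 0.90
)
```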

### vLLM Inference

```python
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/gemma-3-1b-it-FlashHead"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)

    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```
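
Since gemma-3-1b-it is an instruction-tuned model, chat-style prompts generally work best when rendered through the tokenizer's chat template. The following is a minimal sketch assuming the repository ships the base model's tokenizer and chat template; it is otherwise the same flow as the example above:

```python
from transformers import AutoTokenizer
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/gemma-3-1b-it-FlashHead"

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    messages = [{"role": "user", "content": "Write a haiku about coffee."}]

    # Render the conversation into the model's expected prompt format.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=8192)
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```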

---

### Interactive REPL Example

The `run_repl()` coroutine launches an **interactive, streaming chat interface** using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as `/exit` to quit and `/reset` to clear context.

```python
import asyncio
from embedl.models.vllm.demo import run_repl

model_id = "embedl/gemma-3-1b-it-FlashHead"

if __name__ == "__main__":
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072,
        )
    )
```

---

## ⚠️ Important Warning: Hugging Face Transformers Support

> **FlashHead is currently not applied when using the Hugging Face `transformers` pipeline.**
> Generation through `transformers` will fall back to the standard dense LM head, **disabling FlashHead acceleration**.
>
> For now, **we strongly recommend using the vLLM integration** (`embedl.models.vllm.LLM`) to ensure FlashHead is active and optimized for low-latency inference.
>
> Full support for the Hugging Face `transformers` pipeline with FlashHead integration will be released **in the coming days**.

---

## Limitations

- Limited to **vLLM 0.10.2** (pinned dependency)
- **Batch size = 1** (real-time generation)
- Currently optimized for **NVIDIA RTX GPUs**

---

## Roadmap

Planned improvements:

- Advanced mixed-precision quantization
- Hugging Face `transformers` generation
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in **Transformers** and **vLLM**
- Compatibility with **GGUF**, **MLC**, **llama.cpp**, **Ollama**, etc.
- Broader model coverage (larger models, VLMs, VLAs)

---

## License

- **Upstream:** Gemma Terms of Use.
- **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)*

---
|
| | ## Contact |
| |
|
| | **Enterprise & Commercial Inquiries** |
| | [sales@embedl.com](mailto:sales@embedl.com) |
| |
|
| | **Technical Issues & Early Access** |
| | [https://github.com/embedl/embedl-models](https://github.com/embedl/embedl-models) |
| |
|
| | **More Information & Model Releases** |
| | [https://embedl.com](https://embedl.com) |
| |
|
| | --- |
| |
|
| | ### Partner & Developer Opportunities |
| |
|
| | If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for: |
| |
|
| | - Embedl SDK - AI optimization tools & profiling |
| | - Embedl HUB - benchmarking platform |
| | - Engineering support for on-prem/edge deployments |
| | - Migration guidance (Llama / Qwen / Gemma) |
| | - Early access & partner co-marketing opportunities |
| |
|
| | Contact: [sales@embedl.com](mailto:sales@embedl.com) |
| |
|