---
license: gemma
library_name: vllm
pipeline_tag: image-text-to-text
extra_gated_heading: Access Gemma on Hugging Face
extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and
  agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging
  Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
base_model: google/gemma-3-12b-it
---

# FP8 Dynamic Quantized Gemma-3-12b-it

### Features

- Image text to text
- Tool chain

## 1. What FP8-Dynamic Quantization Is

* **FP8 format**
  * 8-bit floating point, in the E4M3 layout (1 sign bit + 4 exponent bits + 3 mantissa bits) used for FP8 inference.
  * Drastically shrinks weight/activation size while keeping floating-point behavior.
* **Dynamic scheme (`FP8_DYNAMIC`)**
  * **Weights:** *static*, **per-channel** quantization (each output-feature channel has its own scale).
  * **Activations:** *dynamic*, **per-token** quantization (scales are recomputed on the fly for every input token).
* **RTN (Round-To-Nearest) PTQ**
  * Post-training; no back-propagation required.
  * No calibration dataset is needed because:
    * Weights use symmetric RTN.
    * Activations are quantized dynamically at inference time.
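
The scheme above can be sketched numerically. The snippet below is a simplified NumPy simulation (not vLLM's actual FP8 kernels): it rounds values onto the FP8 E4M3 grid, applies static per-channel scales to a weight matrix and dynamic per-token scales to activations, then checks the dequantized matmul against full precision. All names and shapes are illustrative.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_round(x):
    """Round to the nearest FP8 E4M3 value (simplified: normal numbers only,
    tiny values flushed to zero). Each value keeps 3 mantissa bits."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    mag = np.abs(x)
    out = np.zeros_like(x)
    nz = mag > 2.0 ** -6                 # smallest normal exponent in E4M3
    e = np.floor(np.log2(mag[nz]))
    step = 2.0 ** (e - 3)                # spacing between representable values
    out[nz] = np.sign(x[nz]) * np.round(mag[nz] / step) * step
    return out

def quantize_weight_per_channel(w):
    """Static, symmetric RTN: one scale per output channel (row)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / E4M3_MAX
    return fp8_round(w / scale), scale

def quantize_act_per_token(a):
    """Dynamic, symmetric quantization: one scale per token (row),
    computed on the fly from the token's own values."""
    scale = np.abs(a).max(axis=1, keepdims=True) / E4M3_MAX
    return fp8_round(a / scale), scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)   # [out_features, in_features]
a = rng.normal(size=(4, 16)).astype(np.float32)   # [tokens, in_features]

wq, ws = quantize_weight_per_channel(w)
aq, as_ = quantize_act_per_token(a)

# The dequantized matmul closely approximates the full-precision one.
y_ref = a @ w.T
y_q = (aq * as_) @ (wq * ws).T
err = np.abs(y_ref - y_q).max() / np.abs(y_ref).max()
print(f"max relative error: {err:.4f}")
```

Because each activation row gets its own scale at inference time, no calibration data is ever needed — which is exactly why RTN PTQ works here without a dataset.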

## 2. Serving the FP8 Model with vLLM

```bash
vllm serve BCCard/gemma-3-12b-it-FP8-Dynamic \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192 \
    --enforce-eager \
    --api-key bccard \
    --served-model-name gemma-3-12b-it
```
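
Once running, the server exposes an OpenAI-compatible API. Below is a minimal sketch of building an image+text chat request against it; the image URL is a placeholder, and the host/port assume vLLM's default of `localhost:8000` with the `--api-key` and `--served-model-name` values from the command above.

```python
import json

# OpenAI-style chat payload mixing an image and a text prompt.
payload = {
    "model": "gemma-3-12b-it",  # must match --served-model-name
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.png"}},  # placeholder
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "max_tokens": 256,
}
print(json.dumps(payload, indent=2))

# To actually send it (requires the `requests` package and a running server):
# import requests
# r = requests.post("http://localhost:8000/v1/chat/completions",
#                   headers={"Authorization": "Bearer bccard"},
#                   json=payload)
# print(r.json()["choices"][0]["message"]["content"])
```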

## 3. Gemma 3 model card

**Model Page**: [Gemma](https://ai.google.dev/gemma/docs/core)

**Terms of Use**: [Terms](https://ai.google.dev/gemma/terms)

**Authors**: Google DeepMind, BC Card (quantization)

### Description

Gemma is a family of lightweight, state-of-the-art open models from Google,
built from the same research and technology used to create the Gemini models.
Gemma 3 models are multimodal, handling text and image input and generating text
output, with open weights for both pre-trained and instruction-tuned
variants. Gemma 3 has a large 128K context window, multilingual support in over
140 languages, and is available in more sizes than previous versions. Gemma 3
models are well suited for a variety of text generation and image understanding
tasks, including question answering, summarization, and reasoning. Their
relatively small size makes it possible to deploy them in environments with
limited resources such as laptops, desktops, or your own cloud infrastructure,
democratizing access to state-of-the-art AI models and helping foster innovation
for everyone.

### Inputs and outputs

- **Input:**
  - Text string, such as a question, a prompt, or a document to be summarized
  - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
  - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and
    32K tokens for the 1B size

- **Output:**
  - Generated text in response to the input, such as an answer to a
    question, analysis of image content, or a summary of a document
  - Total output context of 8192 tokens

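The numbers above make context budgeting easy to work out. The sketch below uses the 12B model's 128K-token input window and the 256-token cost per 896 x 896 image; note that the serve command in section 2 caps the effective window lower via `--max-model-len 8192`.

```python
CONTEXT_TOKENS = 128 * 1024   # 128K input window (4B/12B/27B sizes)
IMAGE_TOKENS = 256            # per image after 896x896 normalization

def remaining_text_budget(num_images: int) -> int:
    """Tokens left for text after accounting for the attached images."""
    return CONTEXT_TOKENS - num_images * IMAGE_TOKENS

# e.g. 4 images leave 131072 - 1024 = 130048 tokens for text
print(remaining_text_budget(4))
```
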
### Citation

```none
@article{gemma_2025,
    title={Gemma 3 FP8 Dynamic},
    url={https://bccard.ai},
    author={BC Card},
    year={2025}
}
```