---
license: gemma
library_name: vllm
pipeline_tag: image-text-to-text
extra_gated_heading: Access Gemma on Hugging Face
extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and
  agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging
  Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
base_model: google/gemma-3-27b-it
---

# FP8 Dynamic Quantized Gemma-3-27b-it

### Features

- Image-text-to-text
- Tool calling

## 1. What FP8-Dynamic Quantization Is

* **FP8 format**
  * 8-bit floating point, using the E4M3 layout (1 sign bit + 4 exponent bits + 3 mantissa bits).
  * Drastically shrinks weight/activation size while keeping floating-point behavior.
* **Dynamic scheme (`FP8_DYNAMIC`)**
  * **Weights:** *static*, **per-channel** quantization (each output-feature channel has its own scale).
  * **Activations:** *dynamic*, **per-token** quantization (scales are recomputed on the fly for every input token); see the sketch after this list.
* **RTN (Round-To-Nearest) PTQ**
  * Post-training; no back-propagation required.
  * No calibration dataset is needed because:
    * Weights use symmetric RTN.
    * Activations are quantized dynamically at inference time.

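To make the weight/activation split concrete, here is a minimal PyTorch sketch of both steps: static per-channel RTN for weights and dynamic per-token scales for activations, both in FP8 E4M3. The tensor shapes and the ±448 E4M3 range are illustrative assumptions; this is not the llm-compressor implementation.

```
import torch

FP8_MAX = 448.0  # largest finite value in torch.float8_e4m3fn (requires PyTorch >= 2.1)

def quantize_weight_per_channel(w: torch.Tensor):
    """Static per-channel RTN: one scale per output-feature row of a Linear weight."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX   # [out_features, 1]
    w_fp8 = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale                                                     # scales are stored in the checkpoint

def quantize_activation_per_token(x: torch.Tensor):
    """Dynamic per-token quantization: one scale per input token, computed at inference time."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX   # [num_tokens, 1]
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale                                                     # recomputed for every forward pass

# Dequantized matmul approximates the original x @ w.T
w_fp8, w_scale = quantize_weight_per_channel(torch.randn(64, 128))
x_fp8, x_scale = quantize_activation_per_token(torch.randn(4, 128))
y = (x_fp8.float() * x_scale) @ (w_fp8.float() * w_scale).T
```
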
## 2. Serving the FP8 Model with vLLM

```
vllm serve BCCard/gemma-3-27b-it-FP8-Dynamic \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --enforce-eager \
  --api-key bccard \
  --served-model-name gemma-3-27b-it
```

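The server exposes an OpenAI-compatible API. A minimal client sketch, assuming the default port 8000 and the `--api-key` / `--served-model-name` values used above:

```
from openai import OpenAI

# Assumes the `vllm serve` command above is running locally on the default port 8000
client = OpenAI(base_url="http://localhost:8000/v1", api_key="bccard")

response = client.chat.completions.create(
    model="gemma-3-27b-it",  # must match --served-model-name
    messages=[{"role": "user", "content": "Explain FP8 dynamic quantization in one sentence."}],
)
print(response.choices[0].message.content)
```
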
## 3. Quantization Code Walk-Through (Shared Knowledge)

[LLM Compressor](https://github.com/vllm-project/llm-compressor) is an easy-to-use library for optimizing models for deployment with vLLM, including:

- A comprehensive set of quantization algorithms for weight-only and activation quantization
- Seamless integration with Hugging Face models and repositories
- A safetensors-based file format compatible with vLLM
- Large model support via `accelerate`

```
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_name = "google/gemma-3-27b-it"

# Load the original instruction-tuned checkpoint and its processor
processor = AutoProcessor.from_pretrained(model_name)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_name, device_map="auto", torch_dtype="auto", trust_remote_code=True
)

# FP8 dynamic recipe: quantize every Linear layer, but keep the lm_head
# and the vision stack (vision_tower, multi_modal_projector) in full precision
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=['re:.*lm_head', 're:vision_tower.*', 're:multi_modal_projector.*'],
)

# Apply the recipe in one shot (no calibration data needed) and save the result
SAVE_DIR = "gemma-3-27b-it-FP8-Dynamic"
oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
```

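As a quick sanity check, the output directory written by `oneshot()` should contain a `config.json` with a compressed-tensors `quantization_config` that vLLM detects automatically. A hedged sketch (the key names are assumptions about the current llm-compressor output format):

```
import json
import os

# Hypothetical check of the directory written by oneshot() above
with open(os.path.join("gemma-3-27b-it-FP8-Dynamic", "config.json")) as f:
    cfg = json.load(f)

# Expected to print "compressed-tensors" if the FP8 recipe was applied
print(cfg.get("quantization_config", {}).get("quant_method"))
```
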
## 4. Gemma 3 model card

**Model Page**: [Gemma](https://ai.google.dev/gemma/docs/core)

**Terms of Use**: [Terms](https://ai.google.dev/gemma/terms)

**Authors**: Google DeepMind, BC Card (Quantization)

### Description

Gemma is a family of lightweight, state-of-the-art open models from Google,
built from the same research and technology used to create the Gemini models.
Gemma 3 models are multimodal, handling text and image input and generating text
output, with open weights for both pre-trained variants and instruction-tuned
variants. Gemma 3 has a large, 128K context window, multilingual support in over
140 languages, and is available in more sizes than previous versions. Gemma 3
models are well-suited for a variety of text generation and image understanding
tasks, including question answering, summarization, and reasoning. Their
relatively small size makes it possible to deploy them in environments with
limited resources such as laptops, desktops or your own cloud infrastructure,
democratizing access to state of the art AI models and helping foster innovation
for everyone.

### Inputs and outputs

- **Input:**
  - Text string, such as a question, a prompt, or a document to be summarized
  - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
  - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size

- **Output:**
  - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
  - Total output context of 8192 tokens

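Since images are part of the input contract above, here is a hedged sketch of sending one image together with text through the OpenAI-compatible endpoint from section 2. The image URL is a placeholder, and the server-side processor performs the 896 x 896 normalization described above.

```
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="bccard")

# Placeholder image URL; any reachable image works
response = client.chat.completions.create(
    model="gemma-3-27b-it",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/receipt.png"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
)
print(response.choices[0].message.content)
```
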
### Citation

```none
@article{gemma_2025,
  title={Gemma 3 FP8 Dynamic},
  url={https://bccard.ai},
  author={BC Card},
  year={2025}
}
```