RedHatAI/gemma-3-4b-it-quantized.w4a16


nm-research committed on Jun 5, 2025 · Commit a6adb5f (verified) · 1 Parent(s): 4962ff6

Update README.md

Files changed (1)

README.md +24 -22


@@ -34,32 +34,34 @@ This model was obtained by quantizing the weights of [google/gemma-3-4b-it](http
 This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
 ```python
-from vllm.assets.image import ImageAsset
 from vllm import LLM, SamplingParams
-# prepare model
-llm = LLM(
-    model="nm-testing/gemma-3-4b-it-quantized.w4a16",
-    trust_remote_code=True,
-    max_model_len=4096,
-    max_num_seqs=2,
-)
-# prepare inputs
-question = "What is the content of this image?"
-inputs = {
-    "prompt": f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n",
-    "multi_modal_data": {
-        "image": ImageAsset("cherry_blossom").pil_image.convert("RGB")
-    },
-}
-# generate response
-print("========== SAMPLE GENERATION ==============")
+from vllm.assets.image import ImageAsset
+from transformers import AutoProcessor
+# Define model name once
+model_name = "RedHatAI/gemma-3-4b-it-quantized.w8a8"
+# Load image and processor
+image = ImageAsset("cherry_blossom").pil_image.convert("RGB")
+processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
+# Build multimodal prompt
+chat = [
+    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is the content of this image?"}]},
+    {"role": "assistant", "content": []}
+]
+prompt = processor.apply_chat_template(chat, add_generation_prompt=True)
+# Initialize model
+llm = LLM(model=model_name, trust_remote_code=True)
+# Run inference
+inputs = {"prompt": prompt, "multi_modal_data": {"image": [image]}}
 outputs = llm.generate(inputs, SamplingParams(temperature=0.2, max_tokens=64))
-print(f"PROMPT  : {outputs[0].prompt}")
-print(f"RESPONSE: {outputs[0].outputs[0].text}")
-print("==========================================")
+# Display result
+print("RESPONSE:", outputs[0].outputs[0].text)
 ```
 vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
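
The OpenAI-compatible serving mentioned in the README can be exercised with a small client. What follows is a minimal sketch and not part of this commit. It assumes a server was started separately (for example with `vllm serve RedHatAI/gemma-3-4b-it-quantized.w4a16`) and listens on vLLM's default port 8000. The model name is taken from this repository's name; the helper names `build_chat_request` and `send_chat_request` and the image URL are hypothetical.

```python
import json
import urllib.request

# Assumptions: server address and model name; adjust both to your deployment.
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "RedHatAI/gemma-3-4b-it-quantized.w4a16"


def build_chat_request(question, image_url, model=MODEL_NAME):
    """Build an OpenAI-style chat completion payload with one image part and one text part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        # Same sampling settings as the README example.
        "temperature": 0.2,
        "max_tokens": 64,
    }


def send_chat_request(payload, base_url=BASE_URL):
    """POST the payload to the chat completions endpoint and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Build the request only; no network call is made until send_chat_request is invoked.
payload = build_chat_request(
    "What is the content of this image?",
    "https://example.com/cherry_blossom.jpg",  # hypothetical image URL
)
print(json.dumps(payload, indent=2))
```

With a server running, `send_chat_request(payload)` would return the generated text. Keeping payload construction separate from the HTTP call lets the request be inspected without a live server.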