VRAM not freed during long generations (Gemma, max_new_tokens=3000)
When using the official Gemma example code but changing max_new_tokens=200 to 3000, I get a CUDA error:
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED during cublasSgemm call.
Additionally, even when the model gives a short response, VRAM remains occupied until all 3000 tokens are processed.
Hi @Nessit ,
Specifying max_new_tokens=3000 tells the model to prepare memory for generating up to 3000 tokens, regardless of how many are actually generated.
Even if the model replies with only a few tokens, the full memory buffer is still allocated and that memory stays locked until the process is done.
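To get a feel for how a reserved buffer scales with the token budget, here is a rough back-of-the-envelope sketch. It assumes a key/value cache that is preallocated for the full sequence length, and the layer and head sizes are illustrative placeholders, not values read from the Gemma config.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Approximate size of a key/value cache that holds seq_len tokens."""
    # Factor 2: each layer stores one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Placeholder shapes for illustration only (assumption, not the real config).
EXAMPLE_SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)

# seq_len here counts generated tokens only; prompt and image tokens add more.
for budget in (200, 500, 1000, 3000):
    size_mib = kv_cache_bytes(seq_len=budget, **EXAMPLE_SHAPE) / 2**20
    print(f"{budget:>5} tokens -> {size_mib:6.1f} MiB")
```

Under these assumptions the cache grows linearly with the budget, so going from 200 to 3000 tokens multiplies the reserved cache by 15. The real figure depends on the model's actual shapes and on which cache implementation is in use.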
To solve this issue, try increasing max_new_tokens gradually: 200 → 500 → 1000, and monitor usage.
Also, using half-precision or quantized versions of the model can help save memory and improve performance.
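A minimal sketch of the half-precision suggestion, assuming a recent transformers release with accelerate installed for device_map="auto". The parameter count is a round illustrative figure rather than the exact model size, and bfloat16 is just one possible choice of dtype.

```python
# Bytes needed to store one weight in each format (weights only, no overhead).
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "float16": 2.0,
                   "int8": 1.0, "int4": 0.5}


def weight_gib(num_params, dtype):
    """Rough memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def load_half_precision(model_id="google/gemma-3-4b-it"):
    """Load the processor and model with bfloat16 weights."""
    # Local imports so the estimate above runs without torch installed.
    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # newer releases also accept dtype=
        device_map="auto",           # assumption: accelerate is installed
    )
    return processor, model


ASSUMED_PARAMS = 4_000_000_000  # round figure for illustration only
for name in ("float32", "bfloat16", "int4"):
    print(f"{name:>8}: about {weight_gib(ASSUMED_PARAMS, name):.1f} GiB")
```

Calling processor, model = load_half_precision() then replaces the two from_pretrained calls in the example code. The estimate covers weights only; activations, the cache, and quantization metadata come on top.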
I successfully ran the official Gemma example code in Google Colab (Runtime Type: T4 GPU) with max_new_tokens=3000. Could you please refer to this gist file?
Thank you.
Thank you for your answer! I understand the explanation, but I'm encountering an issue with GPU utilization. When I ask short questions, I receive short responses, but the GPU remains occupied for an extended period after the answer is complete. I can't perform any other operations until this process finishes, which suggests the stop token might not be functioning properly.
For comparison:
With Qwen, using 3000 tokens allows me to ask both long and short questions - the GPU releases immediately after the answer appears.
With Gemma, regardless of question length or answer size, the GPU stays busy for the full duration needed to process 3000 tokens, blocking further operations.
This behavior significantly impacts workflow efficiency. Is there a way to make Gemma release GPU resources immediately after generating the complete answer, like Qwen does?
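For reference, the behaviour being compared can be sketched with a toy decoding loop: generation is expected to stop as soon as a stop token appears, and max_new_tokens only acts as an upper bound. The token ids and the toy_model function below are hypothetical stand-ins, not Gemma's real vocabulary or model.

```python
def generate(next_token, prompt_ids, max_new_tokens, eos_token_ids):
    """Greedy-style loop: stop at the first stop token or at the budget."""
    output = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token(output)
        output.append(token)
        if token in eos_token_ids:
            break  # early exit: the remaining budget is never used
    return output[len(prompt_ids):]


EOS = 1  # hypothetical stop-token id


def toy_model(ids):
    """Stand-in model: emits filler tokens, then the stop token."""
    return EOS if len(ids) >= 8 else 7


prompt = [2, 3, 4]
with_stop = generate(toy_model, prompt, max_new_tokens=3000, eos_token_ids={EOS})
without_stop = generate(toy_model, prompt, max_new_tokens=50, eos_token_ids=set())
print(len(with_stop), len(without_stop))
```

If the stop ids the loop checks do not include the token the model actually emits at the end of its turn, the loop runs to the full budget, which would match the symptom described. In transformers the stop ids come from the model's generation config and can be overridden through the eos_token_id argument of generate.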
The issue seems to be memory allocation for 3000 tokens. Try gradually reducing max_new_tokens, using half-precision (FP16) or quantized models, and manually releasing memory with torch.cuda.empty_cache() after each generation.
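A minimal sketch of the manual cleanup step, assuming PyTorch. The helper name release_cuda_memory is made up for this example.

```python
import gc


def release_cuda_memory():
    """Free cached CUDA blocks; returns True if a CUDA cache was cleared."""
    # Collect unreachable Python objects first so their tensors are freed.
    gc.collect()
    try:
        import torch
    except ImportError:
        return False  # PyTorch not installed, nothing to release
    if not torch.cuda.is_available():
        return False  # CPU-only environment
    # Hands unused cached blocks back to the driver. Tensors that are
    # still referenced (for example the model weights) are not affected.
    torch.cuda.empty_cache()
    return True


print("CUDA cache cleared:", release_cuda_memory())
```

In the example code this would be called after each generation, once the per-request tensors have been dropped with del inputs, outputs. It only returns memory that PyTorch is caching but no longer using, so it does not shorten a generation that is still running.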
I see it's possible to do memory optimizations. I am using a quantized 27b model (GGUF), which works fantastically on a 24 GB RTX Quadro (passive) in LM Studio. It's certainly possible to push this further by tweaking token allocation. For example, increasing the context window notably increases memory use with respect to the chat log, which was quite surprising. However, I wonder whether it's possible for the model to "forget", possibly with a smart selection of what is important, or maybe to compress information somehow. Since it's a fully vision-enabled model, you can overload it fairly quickly by showing it some high-res visual data. Is there any other mechanism to lose tokens besides the PyTorch cache cleanup?