# gemma4-e4b-webvid4K_FT

Fine-tuned `google/gemma-4-e4b-it` checkpoint for video action recognition and short-form video question answering on a WebVid4K-style training set.

## Model Specs

| Item | Value |
|---|---|
| Base model | `google/gemma-4-e4b-it` |
| Architecture | `Gemma4ForConditionalGeneration` |
| Model type | `gemma4` multimodal causal generation |
| Fine-tuning type | Full-model checkpoint (`use_lora=False`) |
| Training dtype | bf16 |
| Output dtype | bfloat16 / safetensors |
| Final checkpoint | `model.safetensors` |
| Dataset | `bear7011/gemma-4-e4b-webvid-4K` local training split |
| Training samples | 3,941 |
| Task format | Video + text prompt → short text answer |

## Architecture Details

| Component | Spec |
|---|---|
| Text model | Gemma4 text decoder |
| Text layers | 42 |
| Text hidden size | 2,560 |
| Text FFN intermediate size | 10,240 |
| Text attention heads | 8 |
| Text vocabulary size | 262,144 |
| Vision tower | Gemma4 vision encoder |
| Vision layers | 16 |
| Vision hidden size | 768 |
| Vision attention heads | 12 |
| Vision patch size | 16 |
| Vision FFN intermediate size | 3,072 |
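With a patch size of 16, each video frame is split into a grid of 16×16-pixel patches before the vision encoder. The per-frame patch count depends on the input resolution, which this card does not state; the 224×224 figure below is an assumption for illustration only:

```python
# Patch tokens per frame for a square input; the 224x224 resolution is an
# assumption for illustration, the card only specifies patch size 16.
def patches_per_frame(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping patch_size x patch_size patches in a square image."""
    return (image_size // patch_size) ** 2

print(patches_per_frame(224))  # 196
```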

## Training Specs

| Item | Value |
|---|---|
| Hardware | 4 × NVIDIA Tesla V100-SXM2 32 GB |
| Distributed training | DeepSpeed |
| Epochs | 1 |
| Global steps | 124 |
| Per-device train batch size | 1 |
| Gradient accumulation steps | 8 |
| Effective global batch size | 32 |
| Optimizer | `adamw_torch` |
| LR scheduler | cosine |
| Learning rate | 5e-6 |
| Projector LR | 5e-6 |
| Image encoder LR | 0.0 (frozen) |
| Weight decay | 0.01 |
| Warmup ratio | 0.03 |
| Gradient checkpointing | enabled |
| Evaluation strategy | none during training |
| Final train loss | 1.6628 |
| Training runtime | 18,750.99 s (~5.2 h) |
| Throughput | 0.21 samples/s |
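The step count is consistent with the batch configuration: 1 sample per device × 8 accumulation steps × 4 GPUs gives a global batch of 32, and one epoch over 3,941 samples at that batch size takes ⌈3941 / 32⌉ = 124 optimizer steps. A minimal sketch (the helper names are illustrative, not part of the training script):

```python
import math

# Illustrative helpers; the names are not from the actual training code.
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Global batch = per-device batch x gradient accumulation x data-parallel ranks."""
    return per_device * grad_accum * num_gpus

def steps_per_epoch(num_samples: int, global_batch: int) -> int:
    """Optimizer steps for one pass over the data (a final partial batch still steps)."""
    return math.ceil(num_samples / global_batch)

global_batch = effective_batch_size(per_device=1, grad_accum=8, num_gpus=4)
print(global_batch)                          # 32
print(steps_per_epoch(3941, global_batch))   # 124
```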

## Expected Input Format

The model was fine-tuned with message-style multimodal examples:

```json
[
  {
    "video_metadata": {
      "fps": 25.0,
      "duration_sec": 8.3
    },
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "video", "video": "clips/example.mp4"},
          {"type": "text", "text": "What action is performed?"}
        ]
      },
      {
        "role": "assistant",
        "content": [
          {"type": "text", "text": "riding a bicycle"}
        ]
      }
    ]
  }
]
```
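Records in this layout can be assembled programmatically when converting a raw clip/caption dataset. The helper below is a hypothetical sketch (its name and signature are not part of the released training code), showing only the structure of one example:

```python
# Hypothetical helper: builds one message-style training example in the
# layout shown above. Not part of the released training code.
def make_example(video_path: str, question: str, answer: str,
                 fps: float, duration_sec: float) -> dict:
    return {
        "video_metadata": {"fps": fps, "duration_sec": duration_sec},
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video", "video": video_path},
                    {"type": "text", "text": question},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": answer}],
            },
        ],
    }

example = make_example("clips/example.mp4", "What action is performed?",
                       "riding a bicycle", fps=25.0, duration_sec=8.3)
print(example["messages"][1]["content"][0]["text"])  # riding a bicycle
```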

## Usage

```python
import torch
from transformers import AutoProcessor, Gemma4ForConditionalGeneration

model_id = "bear7011/gemma4-e4b-webvid4K_FT"

processor = AutoProcessor.from_pretrained(model_id)
model = Gemma4ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/video.mp4"},
            {"type": "text", "text": "What action is performed in this video?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```
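Note that `generate` returns the prompt tokens followed by the completion, so decoding `output_ids[0]` directly also reproduces the question text. If only the answer is wanted, a common pattern is to slice off the prompt length before decoding. The helper below illustrates the index arithmetic on plain token-ID lists (names and sample IDs are illustrative):

```python
def completion_ids(prompt_len: int, sequence: list) -> list:
    """Drop the first prompt_len token IDs, keeping only newly generated ones."""
    return sequence[prompt_len:]

# With the real model this would typically be:
#   new_ids = output_ids[0][inputs["input_ids"].shape[-1]:]
#   print(processor.decode(new_ids, skip_special_tokens=True))
prompt = [101, 7, 42]            # stand-in prompt token IDs
full = prompt + [9, 13, 102]     # stand-in generate() output: prompt + completion
print(completion_ids(len(prompt), full))  # [9, 13, 102]
```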

## Limitations

This checkpoint is optimized for short WebVid-style clips and action-centric prompts. It was not evaluated here for long-form video reasoning, safety-sensitive decisions, or broad multilingual video QA.
