# gemma4-e4b-webvid4K_FT
Fine-tuned google/gemma-4-e4b-it checkpoint for video action recognition and short video question answering on a WebVid4K-style training set.
## Model Specs
| Item | Value |
|---|---|
| Base model | google/gemma-4-e4b-it |
| Architecture | Gemma4ForConditionalGeneration |
| Model type | gemma4 multimodal causal generation |
| Fine-tuning type | Full model checkpoint (use_lora=False) |
| Training dtype | bf16 |
| Output dtype | bfloat16 / safetensors |
| Final checkpoint | model.safetensors |
| Dataset | bear7011/gemma-4-e4b-webvid-4K local training split |
| Training samples | 3,941 |
| Task format | Video + text prompt to short text answer |
## Architecture Details
| Component | Spec |
|---|---|
| Text model | Gemma4 text decoder |
| Text layers | 42 |
| Text hidden size | 2,560 |
| Text FFN intermediate size | 10,240 |
| Text attention heads | 8 |
| Text vocabulary size | 262,144 |
| Vision tower | Gemma4 vision encoder |
| Vision layers | 16 |
| Vision hidden size | 768 |
| Vision attention heads | 12 |
| Vision patch size | 16 |
| Vision FFN intermediate size | 3,072 |
## Training Specs
| Item | Value |
|---|---|
| Hardware | 4 x NVIDIA Tesla V100-SXM2 32GB |
| Distributed training | DeepSpeed |
| Epochs | 1 |
| Global steps | 124 |
| Per-device train batch size | 1 |
| Gradient accumulation steps | 8 |
| Effective global batch size | 32 |
| Optimizer | adamw_torch |
| LR scheduler | cosine |
| Learning rate | 5e-6 |
| Projector LR | 5e-6 |
| Image encoder LR | 0.0 |
| Weight decay | 0.01 |
| Warmup ratio | 0.03 |
| Gradient checkpointing | enabled |
| Evaluation strategy | none during training |
| Final train loss | 1.6628 |
| Training runtime | 18,750.99 seconds |
| Throughput | 0.21 samples/sec |
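The derived values in the table are internally consistent and can be sanity-checked from the raw settings. A minimal arithmetic sketch (all numbers taken from the table above):

```python
import math

# Values from the training table above.
per_device_batch = 1
grad_accum = 8
num_gpus = 4
train_samples = 3941
runtime_sec = 18750.99

# Effective global batch = per-device batch x accumulation steps x GPUs.
global_batch = per_device_batch * grad_accum * num_gpus
print(global_batch)  # 32

# One epoch => ceil(samples / global batch) optimizer steps.
steps = math.ceil(train_samples / global_batch)
print(steps)  # 124

# Average throughput over the whole run.
print(round(train_samples / runtime_sec, 2))  # 0.21 samples/sec
```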
## Expected Input Format
The model was fine-tuned with message-style multimodal examples:
```json
[
  {
    "video_metadata": {
      "fps": 25.0,
      "duration_sec": 8.3
    },
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "video", "video": "clips/example.mp4"},
          {"type": "text", "text": "What action is performed?"}
        ]
      },
      {
        "role": "assistant",
        "content": [
          {"type": "text", "text": "riding a bicycle"}
        ]
      }
    ]
  }
]
```
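When preparing your own training split in this format, it can help to build records programmatically rather than by hand. A minimal sketch (the `make_record` helper is hypothetical, not part of the released code):

```python
import json

def make_record(video_path, question, answer, fps, duration_sec):
    """Build one training record in the message format shown above."""
    return {
        "video_metadata": {"fps": fps, "duration_sec": duration_sec},
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video", "video": video_path},
                    {"type": "text", "text": question},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": answer}],
            },
        ],
    }

record = make_record(
    "clips/example.mp4", "What action is performed?",
    "riding a bicycle", fps=25.0, duration_sec=8.3,
)
# Serialize a one-record dataset exactly like the example above.
print(json.dumps([record], indent=2))
```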
## Usage
```python
import torch
from transformers import AutoProcessor, Gemma4ForConditionalGeneration

model_id = "bear7011/gemma4-e4b-webvid4K_FT"

# Load the processor and the bf16 checkpoint, sharding across available GPUs.
processor = AutoProcessor.from_pretrained(model_id)
model = Gemma4ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/video.mp4"},
            {"type": "text", "text": "What action is performed in this video?"},
        ],
    }
]

# Render the chat template, tokenize, and move tensors to the model device.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```
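Note that `generate` returns the prompt tokens followed by the newly generated ones, so decoding `output_ids[0]` directly echoes the prompt. A minimal sketch of trimming the prompt before decoding, assuming the prompt length comes from `inputs["input_ids"].shape[-1]` in the snippet above (`decode_new_tokens` is a hypothetical helper; toy integer ids stand in for real tokens):

```python
def decode_new_tokens(output_ids, prompt_len, decode_fn):
    """Strip the echoed prompt tokens, then decode only the new ones."""
    return decode_fn(output_ids[prompt_len:])

# Toy example: ids 5-7 play the role of the prompt, 8-9 the answer.
ids = [5, 6, 7, 8, 9]
print(decode_new_tokens(ids, 3, lambda t: list(t)))  # [8, 9]
```

With the real snippet, `decode_fn` would be `lambda t: processor.decode(t, skip_special_tokens=True)`.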
## Limitations
This checkpoint is optimized for short WebVid-style clips and action-centric prompts. It was not evaluated here for long-form video reasoning, safety-sensitive decisions, or broad multilingual video QA.