# gemma4-e4b-webvid4K_FT

Fine-tuned `google/gemma-4-e4b-it` checkpoint for video action recognition and short-form video question answering on a WebVid4K-style training set.

## Model Specs

| Item | Value |
|---|---|
| Base model | `google/gemma-4-e4b-it` |
| Architecture | `Gemma4ForConditionalGeneration` |
| Model type | `gemma4` multimodal causal generation |
| Fine-tuning type | Full-model checkpoint (`use_lora=False`) |
| Training dtype | bf16 |
| Output dtype | bfloat16 / safetensors |
| Final checkpoint | `model.safetensors` |
| Dataset | `bear7011/gemma-4-e4b-webvid-4K` local training split |
| Training samples | 3,941 |
| Task format | Video + text prompt → short text answer |

## Architecture Details

| Component | Spec |
|---|---|
| Text model | Gemma4 text decoder |
| Text layers | 42 |
| Text hidden size | 2,560 |
| Text FFN intermediate size | 10,240 |
| Text attention heads | 8 |
| Text vocabulary size | 262,144 |
| Vision tower | Gemma4 vision encoder |
| Vision layers | 16 |
| Vision hidden size | 768 |
| Vision attention heads | 12 |
| Vision patch size | 16 |
| Vision FFN intermediate size | 3,072 |
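With a patch size of 16, each video frame is split into a grid of 16×16-pixel patches before the vision encoder. The per-frame patch count depends on the input resolution, which this card does not state; the 224×224 figure below is an assumption for illustration only:

```python
# Patch tokens per frame for a square input; the 224x224 resolution is an
# assumption for illustration, the card only specifies patch size 16.
def patches_per_frame(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping patch_size x patch_size patches in a square image."""
    return (image_size // patch_size) ** 2

print(patches_per_frame(224))  # 196
```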

## Training Specs

| Item | Value |
|---|---|
| Hardware | 4 × NVIDIA Tesla V100-SXM2 32 GB |
| Distributed training | DeepSpeed |
| Epochs | 1 |
| Global steps | 124 |
| Per-device train batch size | 1 |
| Gradient accumulation steps | 8 |
| Effective global batch size | 32 |
| Optimizer | `adamw_torch` |
| LR scheduler | cosine |
| Learning rate | 5e-6 |
| Projector LR | 5e-6 |
| Image encoder LR | 0.0 (frozen) |
| Weight decay | 0.01 |
| Warmup ratio | 0.03 |
| Gradient checkpointing | enabled |
| Evaluation strategy | none during training |
| Final train loss | 1.6628 |
| Training runtime | 18,750.99 s (~5.2 h) |
| Throughput | 0.21 samples/s |
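The step count is consistent with the batch configuration: 1 sample per device × 8 accumulation steps × 4 GPUs gives a global batch of 32, and one epoch over 3,941 samples at that batch size takes ⌈3941 / 32⌉ = 124 optimizer steps. A minimal sketch (the helper names are illustrative, not part of the training script):

```python
import math

# Illustrative helpers; the names are not from the actual training code.
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Global batch = per-device batch x gradient accumulation x data-parallel ranks."""
    return per_device * grad_accum * num_gpus

def steps_per_epoch(num_samples: int, global_batch: int) -> int:
    """Optimizer steps for one pass over the data (a final partial batch still steps)."""
    return math.ceil(num_samples / global_batch)

global_batch = effective_batch_size(per_device=1, grad_accum=8, num_gpus=4)
print(global_batch)                          # 32
print(steps_per_epoch(3941, global_batch))   # 124
```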

## Expected Input Format

The model was fine-tuned with message-style multimodal examples:

```json
[
  {
    "video_metadata": {
      "fps": 25.0,
      "duration_sec": 8.3
    },
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "video", "video": "clips/example.mp4"},
          {"type": "text", "text": "What action is performed?"}
        ]
      },
      {
        "role": "assistant",
        "content": [
          {"type": "text", "text": "riding a bicycle"}
        ]
      }
    ]
  }
]
```
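Records in this layout can be assembled programmatically when converting a raw clip/caption dataset. The helper below is a hypothetical sketch (its name and signature are not part of the released training code), showing only the structure of one example:

```python
# Hypothetical helper: builds one message-style training example in the
# layout shown above. Not part of the released training code.
def make_example(video_path: str, question: str, answer: str,
                 fps: float, duration_sec: float) -> dict:
    return {
        "video_metadata": {"fps": fps, "duration_sec": duration_sec},
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video", "video": video_path},
                    {"type": "text", "text": question},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": answer}],
            },
        ],
    }

example = make_example("clips/example.mp4", "What action is performed?",
                       "riding a bicycle", fps=25.0, duration_sec=8.3)
print(example["messages"][1]["content"][0]["text"])  # riding a bicycle
```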

## Usage

```python
import torch
from transformers import AutoProcessor, Gemma4ForConditionalGeneration

model_id = "bear7011/gemma4-e4b-webvid4K_FT"

processor = AutoProcessor.from_pretrained(model_id)
model = Gemma4ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/video.mp4"},
            {"type": "text", "text": "What action is performed in this video?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```
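Note that `generate` returns the prompt tokens followed by the completion, so decoding `output_ids[0]` directly also reproduces the question text. If only the answer is wanted, a common pattern is to slice off the prompt length before decoding. The helper below illustrates the index arithmetic on plain token-ID lists (names and sample IDs are illustrative):

```python
def completion_ids(prompt_len: int, sequence: list) -> list:
    """Drop the first prompt_len token IDs, keeping only newly generated ones."""
    return sequence[prompt_len:]

# With the real model this would typically be:
#   new_ids = output_ids[0][inputs["input_ids"].shape[-1]:]
#   print(processor.decode(new_ids, skip_special_tokens=True))
prompt = [101, 7, 42]            # stand-in prompt token IDs
full = prompt + [9, 13, 102]     # stand-in generate() output: prompt + completion
print(completion_ids(len(prompt), full))  # [9, 13, 102]
```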

## Limitations

This checkpoint is optimized for short WebVid-style clips and action-centric prompts. It was not evaluated here for long-form video reasoning, safety-sensitive decisions, or broad multilingual video QA.
