# bear7011/gemma4-e4b-kinetic3K_FT

This repository contains a LoRA adapter fine-tuned from `google/gemma-4-e4b-it` for action recognition on a Kinetics-3K-style dataset.

The training code supports both image and video inputs, but this specific checkpoint was trained on 4-frame image sequences extracted from Kinetics clips, not on raw videos.
## What Was Trained

- Base model: `google/gemma-4-e4b-it`
- Adapter type: LoRA
- Output artifact: adapter-only checkpoint (`adapter_model.safetensors`)
- Task: action recognition / short event description from a short frame sequence

The saved adapter applies LoRA to all 42 text transformer layers of Gemma 4 E4B with:

- `r=16`
- `lora_alpha=32`
- `lora_dropout=0.05`
- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`

The vision tower and projector were kept frozen in this run, so this is effectively a pure LoRA fine-tune of the language backbone, conditioned on visual inputs.
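For reference, these hyperparameters map onto a `peft` `LoraConfig` roughly like the sketch below. This is reconstructed from the values listed above, not taken from the project's training code; the `task_type` in particular is an assumption.

```python
from peft import LoraConfig

# Sketch of a LoraConfig matching the hyperparameters above.
# Reconstructed from this model card; task_type is an assumption.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```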
## Training Data

This model was trained with:

- Dataset file: `./dataset/kinetics_3k/kinetic_3K.json`
- Image root: `./dataset/kinetics_3k`
Dataset summary:
- 3,115 training samples
- Each sample contains 4 sequential frames from a video clip
- The user prompt asks the model to identify the action or event in the frame sequence
- The assistant target is a short natural-language action description
Example prompt format:

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image", "image": "frames/<clip_id>/frame_1.jpg"},
        {"type": "image", "image": "frames/<clip_id>/frame_2.jpg"},
        {"type": "image", "image": "frames/<clip_id>/frame_3.jpg"},
        {"type": "image", "image": "frames/<clip_id>/frame_4.jpg"},
        {
          "type": "text",
          "text": "Please analyze the sequence of frames from this video. What specific action or event is happening?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "<action description>"}
      ]
    }
  ]
}
```
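For illustration, the snippet below shows one hypothetical way to turn a record in this format into model inputs via the processor's chat template. It assumes the JSON file is a list of such records and that the processor accepts `{"type": "image", ...}` entries in `apply_chat_template`, as recent multimodal `transformers` processors do; the project's actual data loader may differ.

```python
import json
import os

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/gemma-4-e4b-it")

image_folder = "./dataset/kinetics_3k"
# Assumption: the data file is a JSON list of records in the format above.
with open("./dataset/kinetics_3k/kinetic_3K.json") as f:
    sample = json.load(f)[0]

user_turn = sample["messages"][0]
for part in user_turn["content"]:
    if part["type"] == "image":
        # Frame paths in the JSON are relative to the image root.
        part["image"] = os.path.join(image_folder, part["image"])

inputs = processor.apply_chat_template(
    [user_turn],  # user turn only; the assistant turn is the training target
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
```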
## How It Was Trained

Training was performed with a custom supervised fine-tuning pipeline built around:

- `transformers`
- `peft`
- `deepspeed`
- `bitsandbytes` (providing the `paged_adamw_8bit` optimizer)
Core training setup used for this checkpoint:

- Precision: `bf16`
- DeepSpeed: ZeRO Stage 2
- Epochs: `3`
- Total training steps: `1170` (at an effective batch size of 1 × 8 = 8, the 3,115 samples yield ⌈3115 / 8⌉ = 390 steps per epoch, × 3 epochs = 1,170)
- Per-device batch size: `1`
- Gradient accumulation: `8`
- Optimizer: `paged_adamw_8bit`
- Learning rate: `2e-4`
- Weight decay: `0.0`
- Warmup ratio: `0.03`
- LR scheduler: `cosine`
- Gradient checkpointing: enabled
- Save every `200` steps
- Keep last `2` checkpoints
Final trainer summary:

- Train loss: `14.4465`
- Train runtime: `5026.56` seconds
- Train samples/sec: `1.859`
- Train steps/sec: `0.233`
## Training Command

The project launcher was based on:

```bash
MODEL_NAME=google/gemma-4-e4b-it \
DATA_PATH=./dataset/kinetics_3k/kinetic_3K.json \
IMAGE_FOLDER=./dataset/kinetics_3k \
OUTPUT_DIR=./output/gemma4_e4b_lora_only \
RUN_NAME=gemma4-e4b-lora-only \
uv run deepspeed \
--num_gpus 1 \
--master_port 29500 \
stage1/train.py \
--deepspeed deepspeed_config/stage1.json \
--model_id google/gemma-4-e4b-it \
--data_path ./dataset/kinetics_3k/kinetic_3K.json \
--image_folder ./dataset/kinetics_3k \
--output_dir ./output/gemma4_e4b_lora_only \
--run_name gemma4-e4b-lora-only \
--bf16 True \
--use_lora True \
--lora_r 16 \
--lora_alpha 32 \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--optim paged_adamw_8bit \
--learning_rate 2e-4 \
--image_encoder_lr 0.0 \
--projector_lr 0.0 \
--weight_decay 0.0 \
--warmup_ratio 0.03 \
--lr_scheduler_type cosine \
--save_strategy steps \
--save_steps 200 \
--save_total_limit 2 \
--gradient_checkpointing True \
--logging_steps 10 \
--dataloader_num_workers 4 \
--report_to none
```

DeepSpeed config:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e7
  },
  "bf16": {
    "enabled": true
  }
}
```
## Usage

This repo contains adapter weights only. Load the base model first, then attach the LoRA adapter:
```python
from transformers import AutoProcessor, Gemma4ForConditionalGeneration
from peft import PeftModel

# Load the frozen base model.
base_model = Gemma4ForConditionalGeneration.from_pretrained(
    "google/gemma-4-e4b-it",
    torch_dtype="auto",
    device_map="auto",
)

# Attach the LoRA adapter from this repository.
model = PeftModel.from_pretrained(
    base_model,
    "bear7011/gemma4-e4b-kinetic3K_FT",
)

processor = AutoProcessor.from_pretrained("google/gemma-4-e4b-it")
```
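Continuing from the snippet above, a hedged end-to-end generation example: the frame paths and generation settings below are illustrative placeholders, and the chat-template interface is assumed to match other recent multimodal `transformers` processors.

```python
# Illustrative 4-frame query; replace the placeholder paths with real frames.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "frames/<clip_id>/frame_1.jpg"},
            {"type": "image", "image": "frames/<clip_id>/frame_2.jpg"},
            {"type": "image", "image": "frames/<clip_id>/frame_3.jpg"},
            {"type": "image", "image": "frames/<clip_id>/frame_4.jpg"},
            {
                "type": "text",
                "text": "Please analyze the sequence of frames from this "
                        "video. What specific action or event is happening?",
            },
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens.
answer = processor.decode(
    output_ids[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(answer)
```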
## Notes
- This checkpoint is an adapter, not a merged full model (see the merge sketch below).
- The repo currently stores final adapter artifacts only; intermediate training checkpoints are intentionally excluded.
- No separate benchmark or held-out evaluation report is included in this repository.
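If a standalone (non-adapter) model is needed, the adapter loaded in the Usage snippet can be merged into the base weights with standard `peft` APIs; the save path below is illustrative:

```python
# Merge the LoRA weights into the base model for standalone use.
merged = model.merge_and_unload()
merged.save_pretrained("./gemma4-e4b-kinetic3K-merged")
processor.save_pretrained("./gemma4-e4b-kinetic3K-merged")
```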