bear7011/gemma4-e4b-kinetic3K_FT

This repository contains a LoRA adapter fine-tuned from google/gemma-4-e4b-it for action recognition on a Kinetics-3K-style dataset.

The training code supports both image and video inputs, but this specific checkpoint was trained on 4-frame image sequences extracted from Kinetics clips, not on raw videos.

What Was Trained

  • Base model: google/gemma-4-e4b-it
  • Adapter type: LoRA
  • Output artifact: adapter-only checkpoint (adapter_model.safetensors)
  • Task: action recognition / short event description from a 4-frame sequence

The saved adapter applies LoRA to all 42 text transformer layers of Gemma 4 E4B with:

  • r=16
  • lora_alpha=32
  • lora_dropout=0.05
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

The vision tower and projector were kept frozen in this run, so this is effectively a pure LoRA fine-tune on the language backbone conditioned on visual inputs.
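For reference, the LoRA settings above can be mirrored as a plain config along these lines (an illustrative sketch; the adapter_config.json shipped in this repo is the authoritative source, and the `bias` and `task_type` values below are assumptions, not confirmed by it):

```python
# Illustrative mirror of the LoRA hyperparameters described above.
# The adapter_config.json in the repo is the authoritative source.
lora_config = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "bias": "none",           # assumption: biases left untouched
    "task_type": "CAUSAL_LM", # assumption: standard causal-LM task type
}

# Effective LoRA scaling factor is alpha / r
scaling = lora_config["lora_alpha"] / lora_config["r"]  # 2.0
```

With alpha = 2r, the adapter updates are scaled by 2.0, a common default pairing for r=16.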

Training Data

This model was trained with the dataset at:

  • ./dataset/kinetics_3k/kinetic_3K.json
  • Image root: ./dataset/kinetics_3k

Dataset summary:

  • 3,115 training samples
  • Each sample contains 4 sequential frames from a video clip
  • The user prompt asks the model to identify the action or event in the frame sequence
  • The assistant target is a short natural-language action description

Example prompt format:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image", "image": "frames/<clip_id>/frame_1.jpg"},
        {"type": "image", "image": "frames/<clip_id>/frame_2.jpg"},
        {"type": "image", "image": "frames/<clip_id>/frame_3.jpg"},
        {"type": "image", "image": "frames/<clip_id>/frame_4.jpg"},
        {
          "type": "text",
          "text": "Please analyze the sequence of frames from this video. What specific action or event is happening?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "<action description>"}
      ]
    }
  ]
}
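A minimal sketch of validating one record in this format, using the example above (the `check_sample` helper is hypothetical and not part of the training code):

```python
# Hypothetical validator for one kinetic_3K.json record in the format above.
def check_sample(sample):
    user, assistant = sample["messages"]
    assert user["role"] == "user" and assistant["role"] == "assistant"
    images = [c for c in user["content"] if c["type"] == "image"]
    texts = [c for c in user["content"] if c["type"] == "text"]
    # Each sample carries 4 sequential frames plus one text prompt.
    assert len(images) == 4 and len(texts) == 1
    return len(images)

sample = {
    "messages": [
        {"role": "user", "content": [
            {"type": "image", "image": "frames/<clip_id>/frame_1.jpg"},
            {"type": "image", "image": "frames/<clip_id>/frame_2.jpg"},
            {"type": "image", "image": "frames/<clip_id>/frame_3.jpg"},
            {"type": "image", "image": "frames/<clip_id>/frame_4.jpg"},
            {"type": "text",
             "text": "Please analyze the sequence of frames from this video. "
                     "What specific action or event is happening?"},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "<action description>"},
        ]},
    ]
}

assert check_sample(sample) == 4
```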

How It Was Trained

Training was performed with a custom supervised fine-tuning pipeline built around:

  • transformers
  • peft
  • deepspeed
  • bitsandbytes optimizer (paged_adamw_8bit)

Core training setup used for this checkpoint:

  • Precision: bf16
  • DeepSpeed: ZeRO Stage 2
  • Epochs: 3
  • Total training steps: 1170
  • Per-device batch size: 1
  • Gradient accumulation: 8
  • Effective optimizer: paged_adamw_8bit
  • Learning rate: 2e-4
  • Weight decay: 0.0
  • Warmup ratio: 0.03
  • LR scheduler: cosine
  • Gradient checkpointing: enabled
  • Save every 200 steps
  • Keep last 2 checkpoints

Final trainer summary:

  • Train loss: 14.4465
  • Train runtime: 5026.56 seconds
  • Train samples/sec: 1.859
  • Train steps/sec: 0.233
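The step count and throughput numbers are mutually consistent with the settings above; a quick arithmetic check:

```python
import math

samples, epochs = 3115, 3
micro_batch, grad_accum, gpus = 1, 8, 1

# Optimizer steps per epoch = samples / (micro_batch * grad_accum * gpus), rounded up
steps_per_epoch = math.ceil(samples / (micro_batch * grad_accum * gpus))  # 390
total_steps = steps_per_epoch * epochs                                    # 1170

runtime = 5026.56
samples_per_sec = samples * epochs / runtime  # ~1.859
steps_per_sec = total_steps / runtime         # ~0.233
```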

Training Command

The training run was launched with:

MODEL_NAME=google/gemma-4-e4b-it \
DATA_PATH=./dataset/kinetics_3k/kinetic_3K.json \
IMAGE_FOLDER=./dataset/kinetics_3k \
OUTPUT_DIR=./output/gemma4_e4b_lora_only \
RUN_NAME=gemma4-e4b-lora-only \
uv run deepspeed \
  --num_gpus 1 \
  --master_port 29500 \
  stage1/train.py \
  --deepspeed deepspeed_config/stage1.json \
  --model_id google/gemma-4-e4b-it \
  --data_path ./dataset/kinetics_3k/kinetic_3K.json \
  --image_folder ./dataset/kinetics_3k \
  --output_dir ./output/gemma4_e4b_lora_only \
  --run_name gemma4-e4b-lora-only \
  --bf16 True \
  --use_lora True \
  --lora_r 16 \
  --lora_alpha 32 \
  --num_train_epochs 3 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --optim paged_adamw_8bit \
  --learning_rate 2e-4 \
  --image_encoder_lr 0.0 \
  --projector_lr 0.0 \
  --weight_decay 0.0 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type cosine \
  --save_strategy steps \
  --save_steps 200 \
  --save_total_limit 2 \
  --gradient_checkpointing True \
  --logging_steps 10 \
  --dataloader_num_workers 4 \
  --report_to none

DeepSpeed config:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e7
  },
  "bf16": {
    "enabled": true
  }
}

Usage

This repo contains adapter weights only. Load the base model first, then attach the LoRA adapter.

from transformers import AutoProcessor, Gemma4ForConditionalGeneration
from peft import PeftModel

base_model = Gemma4ForConditionalGeneration.from_pretrained(
    "google/gemma-4-e4b-it",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(
    base_model,
    "bear7011/gemma4-e4b-kinetic3K_FT",
)
processor = AutoProcessor.from_pretrained("google/gemma-4-e4b-it")
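At inference time the model expects the same 4-frame chat layout used in training; a sketch of building that payload (the frame paths are placeholders, not real files):

```python
# Build the 4-frame chat payload used at inference time.
# Frame paths are placeholders; substitute your own extracted frames.
frame_paths = [f"frames/<clip_id>/frame_{i}.jpg" for i in range(1, 5)]

messages = [
    {
        "role": "user",
        "content": [
            *[{"type": "image", "image": p} for p in frame_paths],
            {
                "type": "text",
                "text": "Please analyze the sequence of frames from this video. "
                        "What specific action or event is happening?",
            },
        ],
    }
]

# The payload would then be tokenized via the processor's chat template,
# e.g. processor.apply_chat_template(messages, add_generation_prompt=True, ...)
```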

Notes

  • This checkpoint is an adapter, not a merged full model.
  • The repo currently stores final adapter artifacts only; intermediate training checkpoints are intentionally excluded.
  • No separate benchmark or held-out evaluation report is included in this repository.
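If a merged full model is needed (e.g. for serving without peft installed), the adapter can be folded into the base weights with PEFT's merge_and_unload. A hedged sketch only: it requires downloading the base model, and the output directory name below is an assumption.

```python
def merge_adapter(out_dir="./gemma4-e4b-kinetic3K-merged"):
    """Fold the LoRA adapter into the base weights and save a standalone model.

    Sketch under assumptions: requires the base model to be downloadable,
    and out_dir is a hypothetical path.
    """
    from transformers import AutoProcessor, Gemma4ForConditionalGeneration
    from peft import PeftModel

    base = Gemma4ForConditionalGeneration.from_pretrained(
        "google/gemma-4-e4b-it", torch_dtype="auto"
    )
    merged = PeftModel.from_pretrained(
        base, "bear7011/gemma4-e4b-kinetic3K_FT"
    ).merge_and_unload()  # materialize LoRA deltas into the base weights
    merged.save_pretrained(out_dir)
    AutoProcessor.from_pretrained("google/gemma-4-e4b-it").save_pretrained(out_dir)
```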