How to use from the Transformers library
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="egotools-dev/egotools-8b-v3_3")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("egotools-dev/egotools-8b-v3_3")
model = AutoModelForImageTextToText.from_pretrained("egotools-dev/egotools-8b-v3_3")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

EgoTools 8B v3.3

This repository stores intermediate checkpoints from full-parameter SFT of Qwen/Qwen3-VL-8B-Instruct on EgoTools v3.3.

Available checkpoints:

| Checkpoint | Location | Step | Epoch | Notes |
| --- | --- | --- | --- | --- |
| checkpoint-300 | repository root | 300 / 907 | 0.3309 | First uploaded intermediate checkpoint. |
| checkpoint-600 | checkpoint-600/ | 600 / 907 | 0.6619 | Second uploaded intermediate checkpoint. |

The repository root currently contains the checkpoint-300 model files. checkpoint-600 is stored in the checkpoint-600/ subdirectory.
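
Since checkpoint-600 sits in a subdirectory, loading it likely requires the `subfolder` argument of `from_pretrained`. The following is a minimal sketch, assuming the checkpoint-600/ directory holds a complete set of model and processor files (this card does not confirm that). The helper names `hub_kwargs` and `load_checkpoint` are hypothetical.

```python
REPO_ID = "egotools-dev/egotools-8b-v3_3"

# Where each checkpoint lives inside the repository, per the table above.
# None means the repository root.
CHECKPOINT_SUBFOLDERS = {
    "checkpoint-300": None,
    "checkpoint-600": "checkpoint-600",
}


def hub_kwargs(checkpoint: str) -> dict:
    """Build from_pretrained keyword arguments for a checkpoint (hypothetical helper)."""
    subfolder = CHECKPOINT_SUBFOLDERS[checkpoint]
    return {} if subfolder is None else {"subfolder": subfolder}


def load_checkpoint(checkpoint: str):
    """Download and load the processor and model for one checkpoint."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    kwargs = hub_kwargs(checkpoint)
    # Assumption: the processor files sit next to the weights in the same folder.
    processor = AutoProcessor.from_pretrained(REPO_ID, **kwargs)
    model = AutoModelForImageTextToText.from_pretrained(REPO_ID, **kwargs)
    return processor, model


if __name__ == "__main__":
    print(hub_kwargs("checkpoint-300"))
    print(hub_kwargs("checkpoint-600"))
```

If the processor files turn out to exist only at the repository root, loading the processor without `subfolder` and passing `subfolder` only to the model call should work instead.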

Training Setup

| Field | Value |
| --- | --- |
| Base model | Qwen/Qwen3-VL-8B-Instruct |
| Framework | ms-swift / Transformers |
| Tuning type | Full-parameter SFT |
| Trainable params | 8.19B / 8.77B (LLM trainable; ViT and aligner frozen) |
| GPUs | 8 x NVIDIA A100-SXM4-40GB |
| Precision | BF16 |
| DeepSpeed | ZeRO-3, no optimizer/parameter offload |
| Attention | FlashAttention |
| Per-device batch size | 2 |
| Gradient accumulation | 8 |
| Effective batch size | 128 samples |
| Epochs | 1 |
| Max steps | 907 |
| Learning rate | 2.3e-6 |
| LR scheduler | constant |
| Warmup | 0 |
| Weight decay | 0.1 |
| Max sequence length | 8192 |
| Video frame sampling | up to 64 frames |
| Video token budget | 128 |
| Image token budget | 1024 |
| Save interval | every 300 steps |
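
The effective batch size, the step count, and the epoch fractions in the checkpoint table can be reproduced from the other fields. The sketch below takes the row count of 116,031 from the Training Data section and assumes the final partial batch counts as a step, which is what makes the numbers come out to 907.

```python
import math

num_gpus = 8
per_device_batch_size = 2
gradient_accumulation_steps = 8
dataset_rows = 116_031  # total rows reported in the Training Data section

# Samples consumed per optimizer step across all GPUs.
effective_batch_size = num_gpus * per_device_batch_size * gradient_accumulation_steps

# Assumption: the last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(dataset_rows / effective_batch_size)


def epoch_fraction(step: int) -> float:
    """Fraction of one epoch seen after `step` optimizer steps."""
    return step * effective_batch_size / dataset_rows


print(effective_batch_size)  # 128
print(steps_per_epoch)       # 907
for step in (300, 600):
    print(step, round(epoch_fraction(step), 4))
```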

Important note: this run used a constant 2.3e-6 LR. Earlier V2 exploratory runs used 5e-6 with cosine decay and 3% warmup; these v3.3 checkpoints do not use that schedule.
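
To make the difference between the two schedules concrete, here is a rough sketch of both as plain functions. It assumes linear warmup and cosine decay to zero for the V2-style schedule, and it reuses the v3.3 step count of 907 purely for illustration; the actual V2 step count and the framework's exact rounding of warmup steps are not given in this card.

```python
import math

TOTAL_STEPS = 907  # v3.3 step count, reused for the V2-style curve for illustration only


def constant_lr(step: int, lr: float = 2.3e-6) -> float:
    """Schedule used for the v3.3 checkpoints: the same LR at every step."""
    return lr


def cosine_with_warmup_lr(
    step: int,
    peak_lr: float = 5e-6,
    total_steps: int = TOTAL_STEPS,
    warmup_ratio: float = 0.03,
) -> float:
    """V2-style schedule: linear warmup, then cosine decay (assumed to end at zero)."""
    warmup_steps = int(total_steps * warmup_ratio)  # rounding is an assumption
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 27, 300, 600, 907):
    print(step, constant_lr(step), cosine_with_warmup_lr(step))
```

Under these assumptions the constant schedule trains at 2.3e-6 throughout, while the cosine schedule peaks at 5e-6 right after warmup and falls towards zero by the final step.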

Training Data

Dataset: EgoTools v3.3 SFT, converted to ms-swift video-clip format.

Main local training file:

data_v3_3/egotools_v3_3_sft_final_clips.swift.jsonl

Overall Mix

| Family | Rows | Ratio |
| --- | --- | --- |
| Multiple-choice QA | 104,613 | 90.16% |
| Caption / narration completion | 9,473 | 8.16% |
| Open-ended QA | 1,945 | 1.68% |
| Total | 116,031 | 100.00% |

Sample Type Mix

| Sample type | Rows | Ratio |
| --- | --- | --- |
| mcq | 63,276 | 54.53% |
| narration_mcq | 17,591 | 15.16% |
| egoschema_caption_mcq | 11,830 | 10.20% |
| egoplan_next_action_mcq | 7,990 | 6.89% |
| caption_completion | 7,532 | 6.49% |
| egoschema_fused_mcq | 3,926 | 3.38% |
| egothink_open_qa | 1,945 | 1.68% |
| narration_completion | 1,941 | 1.67% |
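
The sample-type rows roll up into the three families of the Overall Mix. The sketch below assumes the mapping follows the name suffixes (`*mcq`, `*completion`, everything else open-ended); that mapping is inferred from the totals matching, not stated in this card.

```python
from collections import Counter

sample_type_rows = {
    "mcq": 63_276,
    "narration_mcq": 17_591,
    "egoschema_caption_mcq": 11_830,
    "egoplan_next_action_mcq": 7_990,
    "caption_completion": 7_532,
    "egoschema_fused_mcq": 3_926,
    "egothink_open_qa": 1_945,
    "narration_completion": 1_941,
}


def family_of(sample_type: str) -> str:
    """Map a sample type to its family by name suffix (inferred mapping)."""
    if sample_type.endswith("mcq"):
        return "Multiple-choice QA"
    if sample_type.endswith("completion"):
        return "Caption / narration completion"
    return "Open-ended QA"


total_rows = sum(sample_type_rows.values())

family_rows = Counter()
for sample_type, rows in sample_type_rows.items():
    family_rows[family_of(sample_type)] += rows

for family, rows in family_rows.items():
    print(f"{family}: {rows:,} ({100 * rows / total_rows:.2f}%)")
print(f"Total: {total_rows:,}")
```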

Option / Answer Balance

The MCQ portion was deterministically balanced by option count.

| Option count | Answer distribution |
| --- | --- |
| 4 options | A: 1,998; B: 1,997; C: 1,998; D: 1,997 |
| 5 options | A: 6,669; B: 6,669; C: 6,670; D: 6,669; E: 6,670 |
| 8 options | A: 7,910; B: 7,909; C: 7,910; D: 7,910; E: 7,909; F: 7,910; G: 7,909; H: 7,909 |
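
A quick way to check the balance is to sum each row and measure the spread between the most and least frequent answer letter. This sketch only verifies the reported counts and does not reproduce the balancing procedure itself, which this card does not describe.

```python
answer_counts = {
    4: {"A": 1_998, "B": 1_997, "C": 1_998, "D": 1_997},
    5: {"A": 6_669, "B": 6_669, "C": 6_670, "D": 6_669, "E": 6_670},
    8: {
        "A": 7_910, "B": 7_909, "C": 7_910, "D": 7_910,
        "E": 7_909, "F": 7_910, "G": 7_909, "H": 7_909,
    },
}


def summarize(counts: dict) -> tuple:
    """Return (total rows, max-min spread) for one option-count group."""
    values = list(counts.values())
    return sum(values), max(values) - min(values)


mcq_total = 0
for option_count, counts in answer_counts.items():
    total, spread = summarize(counts)
    mcq_total += total
    print(f"{option_count} options: {total:,} rows, spread {spread}")
print(f"All MCQ rows: {mcq_total:,}")
```

Each group differs by at most one row between answer letters, and the groups sum to the 104,613 rows of the Multiple-choice QA family.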

Video Coverage

| Field | Value |
| --- | --- |
| Unique video references | 362 |
| Unique generated clips | 13,100 |
| Missing video rows | 0 |
| Full train-video references | 92,572 |
| Train-segment clip references | 23,459 |

Checkpoint Metrics

| Checkpoint | Loss | Token accuracy | LR |
| --- | --- | --- | --- |
| checkpoint-300 | 0.8521 | 0.7638 | 2.3e-6 |
| checkpoint-600 | 0.8500 | 0.7705 | 2.3e-6 |

These intermediate checkpoints were not run against an evaluation set.
