---
base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- qwen3-vl
- video-language-model
- egocentric-video
- ms-swift
- sft
---

# EgoTools 8B v3.3

This repository stores intermediate checkpoints from full-parameter SFT of `Qwen/Qwen3-VL-8B-Instruct` on EgoTools v3.3.

Available checkpoints:

| Checkpoint | Location | Step | Epoch | Notes |
|---|---|---:|---:|---|
| checkpoint-300 | repository root | 300 / 907 | 0.3309 | First uploaded intermediate checkpoint. |
| checkpoint-600 | `checkpoint-600/` | 600 / 907 | 0.6619 | Second uploaded intermediate checkpoint. |

The repository root currently contains the `checkpoint-300` model files. `checkpoint-600` is stored in the `checkpoint-600/` subdirectory.

## Training Setup

| Field | Value |
|---|---:|
| Base model | `Qwen/Qwen3-VL-8B-Instruct` |
| Framework | `ms-swift` / Transformers |
| Tuning type | Full-parameter SFT |
| Trainable params | 8.19B / 8.77B (LLM component trainable; ViT and aligner frozen) |
| GPUs | 8 x NVIDIA A100-SXM4-40GB |
| Precision | BF16 |
| DeepSpeed | ZeRO-3, no optimizer/parameter offload |
| Attention | FlashAttention |
| Per-device batch size | 2 |
| Gradient accumulation | 8 |
| Effective batch size | 128 samples |
| Epochs | 1 |
| Max steps | 907 |
| Learning rate | `2.3e-6` |
| LR scheduler | `constant` |
| Warmup | 0 |
| Weight decay | 0.1 |
| Max sequence length | 8192 |
| Video frame sampling | up to 64 frames |
| Video token budget | 128 |
| Image token budget | 1024 |
| Save interval | every 300 steps |

Important note: this run used a constant `2.3e-6` LR. Earlier V2 exploratory runs used `5e-6` with cosine decay and 3% warmup; these v3.3 checkpoints do not use that schedule.

## Training Data

Dataset: EgoTools v3.3 SFT, converted to ms-swift video-clip format.
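
The effective batch size, step count, and epoch fractions reported in this card can be cross-checked from the batch settings and the 116,031-row dataset total. The following is a minimal sketch of that arithmetic. The variable and function names are illustrative, and rounding the step count up is an assumption about how the trainer counts a final partial batch, chosen because it reproduces the 907 steps listed in this card.

```python
import math

# Values copied from the Training Setup table and the dataset total in this card.
NUM_GPUS = 8
PER_DEVICE_BATCH_SIZE = 2
GRADIENT_ACCUMULATION = 8
TOTAL_ROWS = 116_031

# Samples consumed per optimizer step across all devices.
effective_batch = NUM_GPUS * PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION

# Assumption: a final partial batch still counts as one optimizer step.
steps_per_epoch = math.ceil(TOTAL_ROWS / effective_batch)


def epoch_at(step: int) -> float:
    """Fraction of one epoch consumed after `step` optimizer steps."""
    return step * effective_batch / TOTAL_ROWS


print(effective_batch, steps_per_epoch)
print(round(epoch_at(300), 4), round(epoch_at(600), 4))
```

Under these assumptions the computed values agree with the table entries for effective batch size, max steps, and the epoch fractions of both checkpoints.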

Main local training file: `data_v3_3/egotools_v3_3_sft_final_clips.swift.jsonl`

### Overall Mix

| Family | Rows | Ratio |
|---|---:|---:|
| Multiple-choice QA | 104,613 | 90.16% |
| Caption / narration completion | 9,473 | 8.16% |
| Open-ended QA | 1,945 | 1.68% |
| Total | 116,031 | 100.00% |

### Sample Type Mix

| Sample type | Rows | Ratio |
|---|---:|---:|
| `mcq` | 63,276 | 54.53% |
| `narration_mcq` | 17,591 | 15.16% |
| `egoschema_caption_mcq` | 11,830 | 10.20% |
| `egoplan_next_action_mcq` | 7,990 | 6.89% |
| `caption_completion` | 7,532 | 6.49% |
| `egoschema_fused_mcq` | 3,926 | 3.38% |
| `egothink_open_qa` | 1,945 | 1.68% |
| `narration_completion` | 1,941 | 1.67% |

### Option / Answer Balance

The MCQ portion was deterministically balanced by option count.

| Option count | Answer distribution |
|---:|---|
| 4 options | A: 1,998; B: 1,997; C: 1,998; D: 1,997 |
| 5 options | A: 6,669; B: 6,669; C: 6,670; D: 6,669; E: 6,670 |
| 8 options | A: 7,910; B: 7,909; C: 7,910; D: 7,910; E: 7,909; F: 7,910; G: 7,909; H: 7,909 |

### Video Coverage

| Field | Value |
|---|---:|
| Unique video references | 362 |
| Unique generated clips | 13,100 |
| Missing video rows | 0 |
| Full train-video references | 92,572 |
| Train-segment clip references | 23,459 |

## Checkpoint Metrics

| Checkpoint | Loss | Token accuracy | LR |
|---|---:|---:|---:|
| checkpoint-300 | 0.8521 | 0.7638 | 2.3e-6 |
| checkpoint-600 | 0.8500 | 0.7705 | 2.3e-6 |

No evaluation set was run for these intermediate checkpoints.
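
Because `checkpoint-300` sits at the repository root and `checkpoint-600` in a subdirectory, loading them should differ only in the `subfolder` argument. Below is a minimal sketch using Transformers, assuming a release recent enough to include Qwen3-VL support and the `dtype` keyword. `REPO_ID` is a placeholder (this card does not state the repository id), and the helper names `subfolder_for` and `load_checkpoint` are hypothetical. The processor is loaded from the base model on the assumption that full-parameter SFT left the tokenizer and preprocessing unchanged.

```python
# Placeholder: replace with the actual Hub id of this repository.
REPO_ID = "your-org/your-egotools-repo"
BASE_MODEL = "Qwen/Qwen3-VL-8B-Instruct"

# Layout described in this card: checkpoint-300 at the root, checkpoint-600 in a subdirectory.
CHECKPOINT_SUBFOLDERS = {
    "checkpoint-300": "",
    "checkpoint-600": "checkpoint-600",
}


def subfolder_for(checkpoint: str) -> str:
    """Map a checkpoint name to its location inside the repository."""
    if checkpoint not in CHECKPOINT_SUBFOLDERS:
        raise ValueError(f"Unknown checkpoint: {checkpoint!r}")
    return CHECKPOINT_SUBFOLDERS[checkpoint]


def load_checkpoint(repo_id: str, checkpoint: str):
    """Load model weights for one checkpoint, plus the base model's processor."""
    # Imported lazily so the helpers above work without torch/transformers installed.
    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model = AutoModelForImageTextToText.from_pretrained(
        repo_id,
        subfolder=subfolder_for(checkpoint),
        dtype=torch.bfloat16,  # matches the BF16 training precision
    )
    # Assumption: the base model's processor is compatible with these checkpoints.
    processor = AutoProcessor.from_pretrained(BASE_MODEL)
    return model, processor


# Example call (downloads the weights):
# model, processor = load_checkpoint(REPO_ID, "checkpoint-600")
```

Since no evaluation set was run, either checkpoint should be validated on held-out data before use.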