---
base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - qwen3-vl
  - video-language-model
  - egocentric-video
  - ms-swift
  - sft
---

# EgoTools 8B v3.3

This repository stores intermediate checkpoints from full-parameter SFT of `Qwen/Qwen3-VL-8B-Instruct` on EgoTools v3.3.

Available checkpoints:

| Checkpoint | Location | Step | Epoch | Notes |
|---|---|---:|---:|---|
| checkpoint-300 | repository root | 300 / 907 | 0.3309 | First uploaded intermediate checkpoint. |
| checkpoint-600 | `checkpoint-600/` | 600 / 907 | 0.6619 | Second uploaded intermediate checkpoint. |

The repository root currently contains the `checkpoint-300` model files. `checkpoint-600` is stored in the `checkpoint-600/` subdirectory.

## Training Setup

| Field | Value |
|---|---:|
| Base model | `Qwen/Qwen3-VL-8B-Instruct` |
| Framework | `ms-swift` / Transformers |
| Tuning type | Full-parameter SFT |
| Trainable params | 8.19B / 8.77B, VLM LLM trainable; ViT and aligner frozen |
| GPUs | 8 x NVIDIA A100-SXM4-40GB |
| Precision | BF16 |
| DeepSpeed | ZeRO-3, no optimizer/parameter offload |
| Attention | FlashAttention |
| Per-device batch size | 2 |
| Gradient accumulation | 8 |
| Effective batch size | 128 samples |
| Epochs | 1 |
| Max steps | 907 |
| Learning rate | `2.3e-6` |
| LR scheduler | `constant` |
| Warmup | 0 |
| Weight decay | 0.1 |
| Max sequence length | 8192 |
| Video frame sampling | up to 64 frames |
| Video token budget | 128 |
| Image token budget | 1024 |
| Save interval | every 300 steps |

Important note: this run used a constant `2.3e-6` LR. Earlier V2 exploratory runs used `5e-6` with cosine decay and 3% warmup; these v3.3 checkpoints do not use that schedule.

## Training Data

Dataset: EgoTools v3.3 SFT, converted to ms-swift video-clip format.

Main local training file:

`data_v3_3/egotools_v3_3_sft_final_clips.swift.jsonl`

### Overall Mix

| Family | Rows | Ratio |
|---|---:|---:|
| Multiple-choice QA | 104,613 | 90.16% |
| Caption / narration completion | 9,473 | 8.16% |
| Open-ended QA | 1,945 | 1.68% |
| Total | 116,031 | 100.00% |

### Sample Type Mix

| Sample type | Rows | Ratio |
|---|---:|---:|
| `mcq` | 63,276 | 54.53% |
| `narration_mcq` | 17,591 | 15.16% |
| `egoschema_caption_mcq` | 11,830 | 10.20% |
| `egoplan_next_action_mcq` | 7,990 | 6.89% |
| `caption_completion` | 7,532 | 6.49% |
| `egoschema_fused_mcq` | 3,926 | 3.38% |
| `egothink_open_qa` | 1,945 | 1.68% |
| `narration_completion` | 1,941 | 1.67% |

### Option / Answer Balance

The MCQ portion was deterministically balanced by option count.

| Option count | Answer distribution |
|---:|---|
| 4 options | A: 1,998; B: 1,997; C: 1,998; D: 1,997 |
| 5 options | A: 6,669; B: 6,669; C: 6,670; D: 6,669; E: 6,670 |
| 8 options | A: 7,910; B: 7,909; C: 7,910; D: 7,910; E: 7,909; F: 7,910; G: 7,909; H: 7,909 |

### Video Coverage

| Field | Value |
|---|---:|
| Unique video references | 362 |
| Unique generated clips | 13,100 |
| Missing video rows | 0 |
| Full train-video references | 92,572 |
| Train-segment clip references | 23,459 |

## Checkpoint Metrics

| Checkpoint | Loss | Token accuracy | LR |
|---|---:|---:|---:|
| checkpoint-300 | 0.8521 | 0.7638 | 2.3e-6 |
| checkpoint-600 | 0.8500 | 0.7705 | 2.3e-6 |

No evaluation set was run for these intermediate checkpoints.