---
base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - qwen3-vl
  - video-language-model
  - egocentric-video
  - ms-swift
  - sft
---

# EgoTools 8B v3.3

This repository stores intermediate checkpoints from full-parameter SFT of `Qwen/Qwen3-VL-8B-Instruct` on EgoTools v3.3.

Available checkpoints:

| Checkpoint | Location | Step | Epoch | Notes |
|---|---|---:|---:|---|
| checkpoint-300 | repository root | 300 / 907 | 0.3309 | First uploaded intermediate checkpoint. |
| checkpoint-600 | `checkpoint-600/` | 600 / 907 | 0.6619 | Second uploaded intermediate checkpoint. |

The repository root currently contains the `checkpoint-300` model files. `checkpoint-600` is stored in the `checkpoint-600/` subdirectory.
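Given this layout, a minimal sketch of selecting a checkpoint with Transformers might look like the following. It assumes `checkpoint-600/` holds a complete set of model weights and config that `from_pretrained` can read through its `subfolder` argument, and that the processor files in the repository root are valid for both checkpoints; neither assumption is confirmed here. The helper names are illustrative only.

```python
REPO_ID = "egotools-dev/egotools-8b-v3_3"

# Checkpoint name -> subfolder inside the repository (None means repository root).
CHECKPOINT_SUBFOLDERS = {
    "checkpoint-300": None,
    "checkpoint-600": "checkpoint-600",
}


def from_pretrained_kwargs(checkpoint: str) -> dict:
    """Return the extra keyword arguments needed to load the given checkpoint."""
    if checkpoint not in CHECKPOINT_SUBFOLDERS:
        raise ValueError(f"unknown checkpoint: {checkpoint!r}")
    subfolder = CHECKPOINT_SUBFOLDERS[checkpoint]
    return {} if subfolder is None else {"subfolder": subfolder}


def load_checkpoint(checkpoint: str):
    """Load processor and model for one checkpoint (downloads the weights)."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    # Assumption: the processor is unchanged by SFT, so the root copy is reused.
    processor = AutoProcessor.from_pretrained(REPO_ID)
    model = AutoModelForImageTextToText.from_pretrained(
        REPO_ID, **from_pretrained_kwargs(checkpoint)
    )
    return processor, model


# Example (not executed here because it downloads the full model):
# processor, model = load_checkpoint("checkpoint-600")
```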

## Training Setup

| Field | Value |
|---|---|
| Base model | `Qwen/Qwen3-VL-8B-Instruct` |
| Framework | `ms-swift` / Transformers |
| Tuning type | Full-parameter SFT |
| Trainable params | 8.19B / 8.77B (LLM trainable; ViT and aligner frozen) |
| GPUs | 8 x NVIDIA A100-SXM4-40GB |
| Precision | BF16 |
| DeepSpeed | ZeRO-3, no optimizer/parameter offload |
| Attention | FlashAttention |
| Per-device batch size | 2 |
| Gradient accumulation | 8 |
| Effective batch size | 128 samples |
| Epochs | 1 |
| Max steps | 907 |
| Learning rate | `2.3e-6` |
| LR scheduler | `constant` |
| Warmup | 0 |
| Weight decay | 0.1 |
| Max sequence length | 8192 |
| Video frame sampling | up to 64 frames |
| Video token budget | 128 |
| Image token budget | 1024 |
| Save interval | every 300 steps |

Important note: this run used a constant `2.3e-6` LR. Earlier V2 exploratory runs used `5e-6` with cosine decay and 3% warmup; these v3.3 checkpoints do not use that schedule.
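To make the difference between the two schedules concrete, here is a small sketch in plain Python. The cosine formula is the common linear-warmup, half-cosine-decay form and is an assumption about how the earlier runs were configured. The step count of those runs is not stated, so the 907-step total from this run is reused purely for illustration.

```python
import math


def constant_lr(step: int, lr: float = 2.3e-6) -> float:
    """Schedule used for the v3.3 checkpoints: the same LR at every step."""
    return lr


def cosine_with_warmup_lr(
    step: int,
    total_steps: int,
    peak_lr: float = 5e-6,
    warmup_ratio: float = 0.03,
) -> float:
    """Linear warmup to peak_lr, then half-cosine decay to zero (assumed form)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 907  # illustrative: borrowed from this run, not from the earlier runs

for step in (0, 300, 600, TOTAL_STEPS):
    print(step, constant_lr(step), cosine_with_warmup_lr(step, TOTAL_STEPS))
```

Under the constant schedule both saved checkpoints were trained at the same LR, whereas under a decaying schedule a later checkpoint would have seen a smaller LR than an earlier one.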

## Training Data

Dataset: EgoTools v3.3 SFT, converted to ms-swift video-clip format.

Main local training file:

`data_v3_3/egotools_v3_3_sft_final_clips.swift.jsonl`

### Overall Mix

| Family | Rows | Ratio |
|---|---:|---:|
| Multiple-choice QA | 104,613 | 90.16% |
| Caption / narration completion | 9,473 | 8.16% |
| Open-ended QA | 1,945 | 1.68% |
| Total | 116,031 | 100.00% |

### Sample Type Mix

| Sample type | Rows | Ratio |
|---|---:|---:|
| `mcq` | 63,276 | 54.53% |
| `narration_mcq` | 17,591 | 15.16% |
| `egoschema_caption_mcq` | 11,830 | 10.20% |
| `egoplan_next_action_mcq` | 7,990 | 6.89% |
| `caption_completion` | 7,532 | 6.49% |
| `egoschema_fused_mcq` | 3,926 | 3.38% |
| `egothink_open_qa` | 1,945 | 1.68% |
| `narration_completion` | 1,941 | 1.67% |
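The two tables above describe the same rows at different granularity. The check below regroups the sample types into the three families and recomputes the ratios. The type-to-family mapping is inferred from the names and from the fact that the sums match; it is not stated explicitly in the tables.

```python
# Row counts copied from the Sample Type Mix table.
SAMPLE_ROWS = {
    "mcq": 63_276,
    "narration_mcq": 17_591,
    "egoschema_caption_mcq": 11_830,
    "egoplan_next_action_mcq": 7_990,
    "caption_completion": 7_532,
    "egoschema_fused_mcq": 3_926,
    "egothink_open_qa": 1_945,
    "narration_completion": 1_941,
}


def family_of(sample_type: str) -> str:
    """Inferred grouping of sample types into the Overall Mix families."""
    if sample_type.endswith("mcq"):
        return "Multiple-choice QA"
    if sample_type.endswith("completion"):
        return "Caption / narration completion"
    return "Open-ended QA"


total = sum(SAMPLE_ROWS.values())

family_rows: dict[str, int] = {}
for sample_type, rows in SAMPLE_ROWS.items():
    family = family_of(sample_type)
    family_rows[family] = family_rows.get(family, 0) + rows

print(total)  # 116031
for family, rows in family_rows.items():
    print(f"{family}: {rows} ({100 * rows / total:.2f}%)")
```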

### Option / Answer Balance

The MCQ portion was deterministically balanced within each option count, so every answer letter is the correct one equally often (to within one row).

| Option count | Answer distribution |
|---:|---|
| 4 options | A: 1,998; B: 1,997; C: 1,998; D: 1,997 |
| 5 options | A: 6,669; B: 6,669; C: 6,670; D: 6,669; E: 6,670 |
| 8 options | A: 7,910; B: 7,909; C: 7,910; D: 7,910; E: 7,909; F: 7,910; G: 7,909; H: 7,909 |
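The balancing procedure itself is not described here. As one hypothetical way to get counts like these, the sketch below walks the items in a fixed order and, per option count, assigns the correct answer to positions round-robin by swapping it into the target slot. The item structure (`options`, `answer`) is invented for illustration, and the exact per-letter counts would differ from the table.

```python
from collections import Counter


def balance_answers(items: list[dict]) -> list[dict]:
    """Return copies of items with the correct answer moved round-robin.

    Each item is a dict with 'options' (list of str) and 'answer' (index of
    the correct option). Items are processed in the given order, so the
    result is deterministic for a fixed input order.
    """
    seen_per_count: dict[int, int] = {}
    balanced = []
    for item in items:
        options = list(item["options"])  # copy so the input is not mutated
        n = len(options)
        seen = seen_per_count.get(n, 0)
        target = seen % n  # next answer slot for this option count
        seen_per_count[n] = seen + 1

        # Swap the correct option into the target slot.
        correct = item["answer"]
        options[correct], options[target] = options[target], options[correct]
        balanced.append({**item, "options": options, "answer": target})
    return balanced


# Ten 4-option items that all start with the correct answer in slot A.
demo = [{"options": ["right", "w1", "w2", "w3"], "answer": 0} for _ in range(10)]
result = balance_answers(demo)
print(Counter("ABCD"[item["answer"]] for item in result))
```

With round-robin assignment the per-letter counts within one option count can differ by at most one, which matches the pattern in the table.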

### Video Coverage

| Field | Value |
|---|---:|
| Unique video references | 362 |
| Unique generated clips | 13,100 |
| Missing video rows | 0 |
| Full train-video references | 92,572 |
| Train-segment clip references | 23,459 |

## Checkpoint Metrics

| Checkpoint | Loss | Token accuracy | LR |
|---|---:|---:|---:|
| checkpoint-300 | 0.8521 | 0.7638 | 2.3e-6 |
| checkpoint-600 | 0.8500 | 0.7705 | 2.3e-6 |

No held-out evaluation was run for these intermediate checkpoints, so the loss and token accuracy above are training-time values.