Instructions to use egotools-dev/egotools-8b-v3_3 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use egotools-dev/egotools-8b-v3_3 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="egotools-dev/egotools-8b-v3_3")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("egotools-dev/egotools-8b-v3_3")
model = AutoModelForImageTextToText.from_pretrained("egotools-dev/egotools-8b-v3_3")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
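# Apply the chat template and tokenize the text and image in one call
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly generated tokens (the prompt is sliced off)
outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))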

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use egotools-dev/egotools-8b-v3_3 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "egotools-dev/egotools-8b-v3_3"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "egotools-dev/egotools-8b-v3_3",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
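
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the vLLM server started above is listening on `localhost:8000`; the helper names `build_chat_request` and `post_chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "egotools-dev/egotools-8b-v3_3"


def build_chat_request(prompt: str, image_url: str, model: str = MODEL_ID) -> dict:
    """Build an OpenAI-style chat payload with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to /v1/chat/completions and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (needs a running server):
#   reply = post_chat(build_chat_request("Describe this image in one sentence.", "<image url>"))
#   print(reply["choices"][0]["message"]["content"])
```

Passing `base_url="http://localhost:30000"` should target an SGLang server started with the port used elsewhere on this page.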

Use Docker

docker model run hf.co/egotools-dev/egotools-8b-v3_3

SGLang

How to use egotools-dev/egotools-8b-v3_3 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "egotools-dev/egotools-8b-v3_3" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "egotools-dev/egotools-8b-v3_3",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "egotools-dev/egotools-8b-v3_3" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "egotools-dev/egotools-8b-v3_3",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use egotools-dev/egotools-8b-v3_3 with Docker Model Runner:
```
docker model run hf.co/egotools-dev/egotools-8b-v3_3
```

egotools-8b-v3_3 / README.md (commit cd3f102 by shulin16: "Add files using upload-large-folder tool")

	---
	base_model: Qwen/Qwen3-VL-8B-Instruct
	library_name: transformers
	pipeline_tag: image-text-to-text
	tags:
	- qwen3-vl
	- video-language-model
	- egocentric-video
	- ms-swift
	- sft
	---

	# EgoTools 8B v3.3

	This repository stores intermediate checkpoints from full-parameter SFT of `Qwen/Qwen3-VL-8B-Instruct` on EgoTools v3.3.

	Available checkpoints:

	| Checkpoint | Location | Step | Epoch | Notes |
	|---|---|---:|---:|---|
	| checkpoint-300 | repository root | 300 / 907 | 0.3309 | First uploaded intermediate checkpoint. |
	| checkpoint-600 | `checkpoint-600/` | 600 / 907 | 0.6619 | Second uploaded intermediate checkpoint. |

	The repository root currently contains the `checkpoint-300` model files. `checkpoint-600` is stored in the `checkpoint-600/` subdirectory.
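
Because the two checkpoints live in different places, loading `checkpoint-600` takes one extra step. The following is a minimal sketch, assuming the `checkpoint-600/` subdirectory holds a complete set of model and processor files (the README does not confirm this); `checkpoint_location` and `load_checkpoint` are hypothetical helper names.

```python
from pathlib import Path

REPO_ID = "egotools-dev/egotools-8b-v3_3"

# Locations as listed in the checkpoint table ("" means the repository root).
CHECKPOINT_LOCATIONS = {"checkpoint-300": "", "checkpoint-600": "checkpoint-600"}


def checkpoint_location(name: str) -> str:
    """Return the subfolder that holds the named checkpoint."""
    return CHECKPOINT_LOCATIONS[name]


def load_checkpoint(name: str):
    """Download only the files for one checkpoint and load processor and model."""
    # Imported lazily so checkpoint_location() works without these packages.
    from huggingface_hub import snapshot_download
    from transformers import AutoModelForImageTextToText, AutoProcessor

    subfolder = checkpoint_location(name)
    if subfolder:
        # Fetch just the subdirectory for this checkpoint.
        local_root = snapshot_download(REPO_ID, allow_patterns=[f"{subfolder}/*"])
    else:
        # Root checkpoint: skip the files of the other checkpoint.
        local_root = snapshot_download(REPO_ID, ignore_patterns=["checkpoint-600/*"])

    path = str(Path(local_root) / subfolder)
    processor = AutoProcessor.from_pretrained(path)
    model = AutoModelForImageTextToText.from_pretrained(path)
    return processor, model


# Usage: processor, model = load_checkpoint("checkpoint-600")
```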

	## Training Setup

	| Field | Value |
	|---|---|
	| Base model | `Qwen/Qwen3-VL-8B-Instruct` |
	| Framework | `ms-swift` / Transformers |
	| Tuning type | Full-parameter SFT |
	| Trainable params | 8.19B / 8.77B (LLM trainable; ViT and aligner frozen) |
	| GPUs | 8 x NVIDIA A100-SXM4-40GB |
	| Precision | BF16 |
	| DeepSpeed | ZeRO-3, no optimizer/parameter offload |
	| Attention | FlashAttention |
	| Per-device batch size | 2 |
	| Gradient accumulation | 8 |
	| Effective batch size | 128 samples |
	| Epochs | 1 |
	| Max steps | 907 |
	| Learning rate | `2.3e-6` |
	| LR scheduler | `constant` |
	| Warmup | 0 |
	| Weight decay | 0.1 |
	| Max sequence length | 8192 |
	| Video frame sampling | up to 64 frames |
	| Video token budget | 128 |
	| Image token budget | 1024 |
	| Save interval | every 300 steps |
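
The batch size, step count, and epoch fractions in this README are consistent with each other. The arithmetic below is a sketch that reproduces them from the table values and the 116,031-row dataset total reported under Training Data; it assumes the final partial batch still counts as one step.

```python
import math

per_device_batch = 2
grad_accum = 8
num_gpus = 8
total_rows = 116_031  # dataset total from the Training Data section

# Samples consumed per optimizer step across all GPUs.
effective_batch = per_device_batch * grad_accum * num_gpus

# One epoch, rounding the last partial batch up to a full step.
steps_per_epoch = math.ceil(total_rows / effective_batch)


def epoch_at(step: int) -> float:
    """Fraction of the epoch completed after the given optimizer step."""
    return round(step * effective_batch / total_rows, 4)


print(effective_batch)                # 128
print(steps_per_epoch)                # 907
print(epoch_at(300), epoch_at(600))   # 0.3309 0.6619
```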

	Important note: this run used a constant `2.3e-6` LR. Earlier V2 exploratory runs used `5e-6` with cosine decay and 3% warmup; these v3.3 checkpoints do not use that schedule.
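
To make the difference between the two schedules concrete, here is a minimal sketch of both. The warmup-plus-cosine shape (linear warmup, cosine decay to zero, warmup steps rounded up) is an assumption modelled on common trainer defaults, and the 907-step horizon is borrowed from this run purely for illustration; the README does not give the exact V2 settings beyond the peak LR and warmup ratio.

```python
import math


def constant_lr(step: int, base_lr: float = 2.3e-6) -> float:
    """Schedule used for these v3.3 checkpoints: the same LR at every step."""
    return base_lr


def warmup_cosine_lr(step: int, total_steps: int,
                     base_lr: float = 5e-6, warmup_ratio: float = 0.03) -> float:
    """Assumed shape of the earlier V2 schedule: linear warmup, then cosine decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Compare the two schedules at the checkpoint steps (907 steps is illustrative).
for step in (0, 300, 600, 907):
    print(step, constant_lr(step), warmup_cosine_lr(step, total_steps=907))
```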

	## Training Data

	Dataset: EgoTools v3.3 SFT, converted to ms-swift video-clip format.

	Main local training file:

	`data_v3_3/egotools_v3_3_sft_final_clips.swift.jsonl`

	### Overall Mix

	| Family | Rows | Ratio |
	|---|---:|---:|
	| Multiple-choice QA | 104,613 | 90.16% |
	| Caption / narration completion | 9,473 | 8.16% |
	| Open-ended QA | 1,945 | 1.68% |
	| Total | 116,031 | 100.00% |
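
The ratios follow directly from the row counts. As a quick check, the snippet below recomputes them; the dictionary is simply the table above typed in by hand.

```python
family_rows = {
    "Multiple-choice QA": 104_613,
    "Caption / narration completion": 9_473,
    "Open-ended QA": 1_945,
}

total = sum(family_rows.values())

# Percentage of the dataset per family, rounded to two decimals as in the table.
ratios = {name: round(100 * rows / total, 2) for name, rows in family_rows.items()}

print(total)
for name, pct in ratios.items():
    print(f"{name}: {pct:.2f}%")
```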

	### Sample Type Mix

	| Sample type | Rows | Ratio |
	|---|---:|---:|
	| `mcq` | 63,276 | 54.53% |
	| `narration_mcq` | 17,591 | 15.16% |
	| `egoschema_caption_mcq` | 11,830 | 10.20% |
	| `egoplan_next_action_mcq` | 7,990 | 6.89% |
	| `caption_completion` | 7,532 | 6.49% |
	| `egoschema_fused_mcq` | 3,926 | 3.38% |
	| `egothink_open_qa` | 1,945 | 1.68% |
	| `narration_completion` | 1,941 | 1.67% |
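
The sample types roll up into the three families of the Overall Mix table. The grouping rule below (by name suffix) is an inference from the type names, not something the README states, but the resulting sums match the family totals exactly.

```python
sample_rows = {
    "mcq": 63_276,
    "narration_mcq": 17_591,
    "egoschema_caption_mcq": 11_830,
    "egoplan_next_action_mcq": 7_990,
    "caption_completion": 7_532,
    "egoschema_fused_mcq": 3_926,
    "egothink_open_qa": 1_945,
    "narration_completion": 1_941,
}


def family_of(sample_type: str) -> str:
    """Assumed mapping from sample type to family, based on the name suffix."""
    if sample_type.endswith("mcq"):
        return "Multiple-choice QA"
    if sample_type.endswith("completion"):
        return "Caption / narration completion"
    return "Open-ended QA"


family_totals = {}
for sample_type, rows in sample_rows.items():
    family = family_of(sample_type)
    family_totals[family] = family_totals.get(family, 0) + rows

print(family_totals)
```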

	### Option / Answer Balance

	The MCQ portion was deterministically balanced by option count.

	| Option count | Answer distribution |
	|---:|---|
	| 4 options | A: 1,998; B: 1,997; C: 1,998; D: 1,997 |
	| 5 options | A: 6,669; B: 6,669; C: 6,670; D: 6,669; E: 6,670 |
	| 8 options | A: 7,910; B: 7,909; C: 7,910; D: 7,910; E: 7,909; F: 7,910; G: 7,909; H: 7,909 |
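
The README does not describe how the balancing was done. One simple deterministic approach, sketched here purely as an illustration, is to cycle the target answer slot within each option-count group and swap the correct option into it. This is not necessarily the procedure used for EgoTools v3.3: a plain round-robin would give the extra rows to the first letters, which differs slightly from the counts in the table.

```python
from collections import Counter


def balance_answers(questions):
    """Move each correct option to a round-robin slot within its option-count group.

    `questions` is a list of (options, correct_index) pairs; a new list of
    (options, correct_index) pairs is returned and the input is left untouched.
    """
    seen_per_count = Counter()
    balanced = []
    for options, correct in questions:
        k = len(options)
        target = seen_per_count[k] % k  # next slot for this option count
        seen_per_count[k] += 1
        new_options = list(options)
        # Swap the correct option into the target slot.
        new_options[correct], new_options[target] = new_options[target], new_options[correct]
        balanced.append((new_options, target))
    return balanced


# Ten toy 4-option questions whose correct answer always starts in slot A.
toy = [(["right", "w1", "w2", "w3"], 0) for _ in range(10)]
result = balance_answers(toy)
counts = Counter("ABCD"[idx] for _, idx in result)
print(dict(counts))
```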

	### Video Coverage

	| Field | Value |
	|---|---:|
	| Unique video references | 362 |
	| Unique generated clips | 13,100 |
	| Missing video rows | 0 |
	| Full train-video references | 92,572 |
	| Train-segment clip references | 23,459 |

	## Checkpoint Metrics

	| Checkpoint | Loss | Token accuracy | LR |
	|---|---:|---:|---:|
	| checkpoint-300 | 0.8521 | 0.7638 | 2.3e-6 |
	| checkpoint-600 | 0.8500 | 0.7705 | 2.3e-6 |
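
The README does not define these metrics. The sketch below shows the usual reading, assuming the loss is the mean next-token cross-entropy in nats and token accuracy is the share of supervised positions where the predicted token equals the label; `IGNORE_INDEX = -100` is the common convention for masked positions and is an assumption here.

```python
import math

IGNORE_INDEX = -100  # label value commonly used for positions excluded from the loss


def token_accuracy(predictions, labels):
    """Fraction of supervised positions where the predicted token id equals the label."""
    kept = [(p, y) for p, y in zip(predictions, labels) if y != IGNORE_INDEX]
    if not kept:
        return 0.0
    return sum(p == y for p, y in kept) / len(kept)


def perplexity(mean_loss: float) -> float:
    """Perplexity implied by a mean cross-entropy loss measured in nats."""
    return math.exp(mean_loss)


# Toy example: three supervised positions, two predicted correctly.
print(token_accuracy([5, 7, 9, 1], [5, 7, 2, IGNORE_INDEX]))

# Perplexities implied by the two reported training losses.
print(perplexity(0.8521), perplexity(0.8500))
```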

	No evaluation set was run for these intermediate checkpoints.
