GPT-OSS-20B-Vision Preview (Proof of Concept)
A vision-language model for GPT-OSS built from scratch on a single NVIDIA DGX Spark in a Dubai hotel room. Features PseudoDeepStack multi-scale visual features and the first documented analysis of why projector-only training fails on MoE architectures.
Training GPT-OSS-20B-Vision on a DGX Spark. Dubai, February 2026.
What This Is
This is a proof of concept — not a production model. It demonstrates that the GPT-OSS Mixture-of-Experts architecture can be given vision capabilities using a novel multi-scale feature injection method we call PseudoDeepStack.
At 22% through training (step 9,000 of 40,461), the model already:
- Identifies objects, scenes, and spatial relationships in images
- Generates coherent multi-sentence descriptions
- Understands food, people, indoor/outdoor scenes
It also hallucinates details and misses fine-grained elements — expected at this training stage.
We need compute to finish training and scale to 120B. See below.
Architecture
| Component | Details |
|---|---|
| Vision Encoder | SigLIP-SO400M-patch14-384 (frozen) |
| Feature Method | PseudoDeepStack — multi-scale visual features from multiple encoder depths |
| Projector | 2-layer MLP, 18.2M parameters |
| Language Model | GPT-OSS-20B MoE (4-bit QLoRA, rank 128, alpha 256) |
| Visual Tokens | 729 per image (27x27 patches at 384px) |
| Training Data | 647K samples — LLaVA-Instruct + Infinity-MM Stage 4 |
| Hardware | Single NVIDIA DGX Spark GB10 (128 GB unified memory) |
PseudoDeepStack
Standard VLMs extract features from a single vision-encoder layer (typically the last or second-to-last). We extract from multiple depths — capturing low-level edges and textures, mid-level shapes and parts, and high-level semantic features — then concatenate them into enriched visual tokens. This gives the language model a richer visual representation at zero additional inference cost (same 729 tokens).
How it works: SigLIP-SO400M has 27 transformer layers. Instead of using only the final layer's output, we extract hidden states from layers 9, 18, and 27 — representing three levels of visual understanding. Layer 9 captures low-level features like edges and textures. Layer 18 captures mid-level structure like shapes and object parts. Layer 27 captures high-level semantics. These three [729, 1152] feature maps are concatenated along the feature dimension into a single [729, 3456] tensor, then projected down to [729, 2880] by a 2-layer MLP to match the LLM's hidden size. The result: each of the 729 visual tokens carries information from three scales of understanding, at zero additional inference cost compared to standard single-layer extraction.
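The shape flow above can be sketched with random tensors standing in for SigLIP's hidden states (a sketch only — the real projector weights ship in the released checkpoint; `hidden_states` here is a synthetic stand-in for the encoder output):

```python
import torch
import torch.nn as nn

TOKENS, VIS_DIM, LLM_DIM = 729, 1152, 2880   # 27x27 patches; SigLIP-SO400M width; GPT-OSS hidden size
DEPTHS = [9, 18, 27]                         # low-, mid-, and high-level encoder layers

# Stand-ins for SigLIP's per-layer hidden states (index 0 is the embedding output)
hidden_states = [torch.randn(1, TOKENS, VIS_DIM) for _ in range(28)]

# PseudoDeepStack: concatenate three depths along the feature dimension
features = torch.cat([hidden_states[d] for d in DEPTHS], dim=-1)   # [1, 729, 3456]

# 2-layer MLP projector down to the LLM hidden size (~18.2M parameters, matching the table)
projector = nn.Sequential(
    nn.Linear(len(DEPTHS) * VIS_DIM, LLM_DIM), nn.GELU(), nn.Linear(LLM_DIM, LLM_DIM)
)
visual_tokens = projector(features)          # [1, 729, 2880] -- still only 729 tokens
```

Note that the token count never changes: all of the extra information rides along the feature dimension, which is why inference cost stays identical to single-layer extraction.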
Inspired by Qwen3-VL's DeepStack, but designed to work with frozen/quantized LLMs without architectural modifications.
Key Finding: MoE Models Need LoRA for Vision
We discovered that projector-only training fails for Mixture-of-Experts architectures. Unlike dense models where the LLM can sometimes process visual tokens without adaptation, MoE models produce incoherent output when visual tokens bypass the expert routing learned during pretraining. QLoRA adaptation of the attention layers allows the router to learn how to handle this new modality.
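To make the adaptation concrete, here is a minimal LoRA wrapper around a frozen linear layer using the project's stated hyperparameters (rank 128, alpha 256). This is an illustrative sketch, not the project's training code — the actual run applies this to 4-bit quantized GPT-OSS weights via QLoRA:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 128, alpha: int = 256):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # base weights stay frozen (quantized in QLoRA)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no-op at step 0
        self.scale = alpha / r                         # rank 128, alpha 256 -> scale 2.0

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Hypothetical 2880x2880 projection standing in for one attention layer
layer = LoRALinear(nn.Linear(2880, 2880), r=128, alpha=256)
x = torch.randn(1, 729, 2880)
y = layer(x)

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
```

Only the low-rank A/B pair trains (well under 10% of this layer's parameters), yet that is enough headroom for the MoE router and attention to absorb the new visual modality.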
How This Compares to InternVL3.5-GPT-OSS-20B
OpenGVLab (Shanghai AI Laboratory) released InternVL3.5-GPT-OSS-20B-A4B in August 2025, backed by a team of dozens of researchers, large-scale A100 clusters, months of development, and a 4-stage training pipeline that includes reinforcement learning. Their model is also labeled a preview.
This project reached a comparable preview milestone in seven days with one person, one DGX Spark, and a feature-extraction method InternVL does not use. It is a different project with a different approach:
| | This Project | InternVL3.5-GPT-OSS-20B |
|---|---|---|
| Training method | QLoRA (~2% of parameters) | Full model training (4 stages incl. RL) |
| Vision encoder | SigLIP-SO400M (frozen, off-the-shelf) | InternViT-300M (custom, trained) |
| Feature extraction | PseudoDeepStack (multi-scale, 3 depths) | Single-layer |
| Resolution | Fixed 384px | Dynamic 448px, up to 12 tiles |
| Video support | No | Yes |
| Training hardware | Single NVIDIA DGX Spark ($3,999) | Multi-GPU cluster (many thousands of dollars) |
| Reproducibility | Novel architecture, single-device training pipeline | Standard multi-GPU distributed training |
Why this project matters:
- Efficiency through ingenuity: Achieved vision capability on a single consumer device by designing a novel training pipeline that works within extreme hardware constraints
- PseudoDeepStack: A new multi-scale feature extraction method that captures richer visual information than single-layer approaches, not used by InternVL
- MoE routing analysis: First documented explanation of why projector-only training fails on MoE architectures, saving the community from a dead-end approach
- Parameter-efficient adaptation: Trained only ~2% of model parameters to achieve vision, demonstrating that brute-force full training isn't the only path
Usage
Requirements
```bash
pip install torch transformers accelerate pillow
```
Quick Start
```python
import torch
from PIL import Image
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    SiglipImageProcessor, SiglipVisionModel,
)
from huggingface_hub import hf_hub_download

REPO = "vincentkaufmann/gpt-oss-20b-vision-preview"
DEVICE = "cuda"

# 1. Load vision encoder (SigLIP -- kept on CPU in float32 to save GPU memory)
processor = SiglipImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")
vision = SiglipVisionModel.from_pretrained(
    "google/siglip-so400m-patch14-384", torch_dtype=torch.float32
).eval()

# 2. Load projector (PseudoDeepStack: multi-scale visual features)
proj_path = hf_hub_download(REPO, "projector-step9000.pt")
proj_ckpt = torch.load(proj_path, map_location="cpu", weights_only=False)
projector = torch.nn.Sequential(
    torch.nn.Linear(3456, 2880), torch.nn.GELU(), torch.nn.Linear(2880, 2880)
)
projector.load_state_dict({
    k.replace("projector.", ""): v
    for k, v in proj_ckpt["state_dict"].items()
})
projector = projector.to(DEVICE).to(torch.bfloat16).eval()

# 3. Load merged LLM (LoRA already merged into the weights)
tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    REPO, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model.eval()

# 4. Process the image
image = Image.open("your_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    out = vision(**inputs, output_hidden_states=True)
# PseudoDeepStack: concatenate hidden states from three encoder depths
features = torch.cat([out.hidden_states[l] for l in [9, 18, 27]], dim=-1)
visual_tokens = projector(features.to(torch.bfloat16).to(DEVICE))

# 5. Generate (visual tokens are prepended to the text embeddings)
prompt = tokenizer("Describe this image in detail.", return_tensors="pt").to(DEVICE)
embeds = model.get_input_embeddings()(prompt["input_ids"])
input_embeds = torch.cat([visual_tokens, embeds], dim=1)
with torch.no_grad():
    output = model.generate(
        inputs_embeds=input_embeds,
        attention_mask=torch.ones(1, input_embeds.shape[1], dtype=torch.long, device=DEVICE),
        max_new_tokens=256, do_sample=True,
        temperature=0.7, top_p=0.9, repetition_penalty=1.1,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Note: The full BF16 model requires ~40 GB of VRAM. For 4-bit quantized loading, pass a `BitsAndBytesConfig` with `load_in_4bit=True` to `from_pretrained`.
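The 4-bit loading path mentioned above can be sketched as follows (a configuration sketch, assuming `bitsandbytes` is installed; untested against this checkpoint — NF4 with bfloat16 compute is chosen here to mirror the QLoRA training setup described below):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization mirroring the 4-bit NormalFloat setup used in training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "vincentkaufmann/gpt-oss-20b-vision-preview",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```

This cuts the LLM's memory footprint to roughly a quarter of the BF16 size at some cost in output quality.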
Training Details
| Parameter | Value |
|---|---|
| Training stage | Visual instruction tuning (single-stage QLoRA) |
| Dataset | 647K samples (LLaVA-Instruct-150K + Infinity-MM Stage 4 curated) |
| Epochs | 1 (22% complete at this checkpoint) |
| LoRA rank | 128 |
| LoRA alpha | 256 |
| LoRA targets | Q, K, V, O, gate, up, down projections |
| Quantization | 4-bit NormalFloat (QLoRA) |
| Optimizer | AdamW |
| Hardware | Single NVIDIA DGX Spark GB10 Blackwell |
| Training time | ~3.5 days to this checkpoint (of ~15 days total) |
Limitations
This is an early proof of concept at 22% training:
- Hallucinations: The model invents objects and details not present in images
- Fine-grained understanding: Struggles with text in images, counting, and spatial precision
- Single image only: No multi-image or video support
- Resolution: Fixed 384x384 input (no dynamic resolution)
All of these are expected to improve significantly once the training run completes and the planned architectural upgrades land.
The Story
I'm Vincent — a solo builder who trained a VLM for GPT-OSS on a DGX Spark from my hotel room in Dubai. No lab, no cluster, no team of PhDs. Just a Spark, a laptop, and stubbornness.
While OpenGVLab's InternVL3.5 brought vision to GPT-OSS using their full training pipeline, this project takes a different approach: parameter-efficient QLoRA adaptation with a novel multi-scale feature method, built and trained on a single consumer device.
This proof of concept demonstrates three things:
- PseudoDeepStack works — multi-scale visual features improve understanding at zero cost
- MoE architectures can see — with the right adaptation approach (QLoRA, not projector-only)
- Hardware constraints drive innovation — the right architecture lets a single DGX Spark do what typically requires a GPU cluster
What's needed to go from proof of concept to production:
- Complete training (remaining ~31,000 steps) — estimated 11 more days on Spark
- Scale to GPT-OSS-120B — same projector works due to shared hidden dimensions
- Benchmark and evaluate against LLaVA-1.5, Qwen3-VL, and other VLMs
- Dynamic resolution (AnyRes tiling) for higher-quality image understanding
Help Us Ship the Real Thing
This project needs compute. The DGX Spark is powerful for its size, but finishing training and scaling to 120B requires GPU hours I can't self-fund.
What your support enables:
| Tier | Cost | What It Buys |
|---|---|---|
| Finish 20B training | ~$500 | Complete the remaining 78% of training on cloud GPUs |
| Train 120B version | ~$2,000 | Full GPT-OSS-120B-Vision with the same architecture |
| Production quality | ~$5,000 | Extended training on 3M+ samples, benchmarking, GGUF release |
Every dollar goes directly to GPU time. No overhead, no team salaries — just compute.
Contact: vincentkaufmann@protonmail.com
Roadmap
- PseudoDeepStack architecture design
- Stage 1: Projector alignment (558K image-caption pairs)
- Discovery: Projector-only fails for MoE → QLoRA required
- Stage 2: QLoRA visual instruction tuning (647K samples)
- Proof of concept checkpoint (this release)
- Complete full training epoch
- GPT-OSS-120B-Vision (same projector, larger LLM)
- GGUF format for llama.cpp / LM Studio compatibility
- Dynamic resolution (AnyRes tiling)
- Comprehensive benchmark evaluation
Citation
```bibtex
@misc{kaufmann2026gptossvision,
  title={GPT-OSS-Vision: Efficient Vision-Language Adaptation of Sparse MoE Models via PseudoDeepStack},
  author={Vincent Kaufmann},
  year={2026},
  howpublished={\url{https://huggingface.co/vincentkaufmann/gpt-oss-20b-vision-preview}},
}
```
License
Apache 2.0 — same as the base GPT-OSS model.
Base model: axolotl-ai-co/gpt-oss-20b-dequantized