GPT-OSS-20B-Vision Preview (Proof of Concept)
A vision-language model for GPT-OSS built from scratch on a single NVIDIA DGX Spark in a Dubai hotel room. Features PseudoDeepStack multi-scale visual features and the first documented analysis of why projector-only training fails on MoE architectures.
Training GPT-OSS-20B-Vision on a DGX Spark. Dubai, February 2026.
What This Is
This is a proof of concept — not a production model. It demonstrates that the GPT-OSS Mixture-of-Experts architecture can be given vision capabilities using a novel multi-scale feature injection method we call PseudoDeepStack.
At 22% through training (step 9,000 of 40,461), the model already:
- Identifies objects, scenes, and spatial relationships in images
- Generates coherent multi-sentence descriptions
- Understands food, people, indoor/outdoor scenes
It also hallucinates details and misses fine-grained elements — expected at this training stage.
We need compute to finish training and scale to 120B. See below.
Architecture
| Component | Details |
|---|---|
| Vision Encoder | SigLIP-SO400M-patch14-384 (frozen) |
| Feature Method | PseudoDeepStack — multi-scale visual features from multiple encoder depths |
| Projector | 2-layer MLP, 18.2M parameters |
| Language Model | GPT-OSS-20B MoE (4-bit QLoRA, rank 128, alpha 256) |
| Visual Tokens | 729 per image (27x27 patches at 384px) |
| Training Data | 647K samples — LLaVA-Instruct + Infinity-MM Stage 4 |
| Hardware | Single NVIDIA DGX Spark GB10 (128 GB unified memory) |
PseudoDeepStack
Standard VLMs extract features from a single vision-encoder layer (typically the last or second-to-last). We extract from multiple depths — capturing low-level edges and textures, mid-level shapes and parts, and high-level semantic features — then concatenate them into enriched visual tokens. This gives the language model a richer visual representation at zero additional inference cost (same 729 tokens).
How it works: SigLIP-SO400M has 27 transformer layers. Instead of using only the final layer's output, we extract hidden states from layers 9, 18, and 27 — representing three levels of visual understanding. Layer 9 captures low-level features like edges and textures. Layer 18 captures mid-level structure like shapes and object parts. Layer 27 captures high-level semantics. These three [729, 1152] feature maps are concatenated along the feature dimension into a single [729, 3456] tensor, then projected down to [729, 2880] by a 2-layer MLP to match the LLM's hidden size. The result: each of the 729 visual tokens carries information from three scales of understanding, at zero additional inference cost compared to standard single-layer extraction.
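The shape flow above can be sketched with random tensors standing in for SigLIP's hidden states (a sketch only — the real projector weights ship in the released checkpoint; `hidden_states` here is a synthetic stand-in for the encoder output):

```python
import torch
import torch.nn as nn

TOKENS, VIS_DIM, LLM_DIM = 729, 1152, 2880   # 27x27 patches; SigLIP-SO400M width; GPT-OSS hidden size
DEPTHS = [9, 18, 27]                         # low-, mid-, and high-level encoder layers

# Stand-ins for SigLIP's per-layer hidden states (index 0 is the embedding output)
hidden_states = [torch.randn(1, TOKENS, VIS_DIM) for _ in range(28)]

# PseudoDeepStack: concatenate three depths along the feature dimension
features = torch.cat([hidden_states[d] for d in DEPTHS], dim=-1)   # [1, 729, 3456]

# 2-layer MLP projector down to the LLM hidden size (~18.2M parameters, matching the table)
projector = nn.Sequential(
    nn.Linear(len(DEPTHS) * VIS_DIM, LLM_DIM), nn.GELU(), nn.Linear(LLM_DIM, LLM_DIM)
)
visual_tokens = projector(features)          # [1, 729, 2880] -- still only 729 tokens
```

Note that the token count never changes: all of the extra information rides along the feature dimension, which is why inference cost stays identical to single-layer extraction.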
Inspired by Qwen3-VL's DeepStack, but designed to work with frozen/quantized LLMs without architectural modifications.
Key Finding: MoE Models Need LoRA for Vision
We discovered that projector-only training fails for Mixture-of-Experts architectures. Unlike dense models where the LLM can sometimes process visual tokens without adaptation, MoE models produce incoherent output when visual tokens bypass the expert routing learned during pretraining. QLoRA adaptation of the attention layers allows the router to learn how to handle this new modality.
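To make the adaptation concrete, here is a minimal LoRA wrapper around a frozen linear layer using the project's stated hyperparameters (rank 128, alpha 256). This is an illustrative sketch, not the project's training code — the actual run applies this to 4-bit quantized GPT-OSS weights via QLoRA:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 128, alpha: int = 256):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # base weights stay frozen (quantized in QLoRA)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no-op at step 0
        self.scale = alpha / r                         # rank 128, alpha 256 -> scale 2.0

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Hypothetical 2880x2880 projection standing in for one attention layer
layer = LoRALinear(nn.Linear(2880, 2880), r=128, alpha=256)
x = torch.randn(1, 729, 2880)
y = layer(x)

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
```

Only the low-rank A/B pair trains (well under 10% of this layer's parameters), yet that is enough headroom for the MoE router and attention to absorb the new visual modality.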
How This Compares to InternVL3.5-GPT-OSS-20B
OpenGVLab (Shanghai AI Laboratory) released InternVL3.5-GPT-OSS-20B-A4B in August 2025, backed by a team of dozens of researchers, large-scale A100 clusters, months of development, and a 4-stage training pipeline that includes reinforcement learning. Their model is also labeled a preview.
This project reached a comparable preview milestone in seven days with one person, one DGX Spark, and a feature-extraction method InternVL does not use. It is a different project with a different approach:
| | This Project | InternVL3.5-GPT-OSS-20B |
|---|---|---|
| Training method | QLoRA (~2% of parameters) | Full model training (4 stages incl. RL) |
| Vision encoder | SigLIP-SO400M (frozen, off-the-shelf) | InternViT-300M (custom, trained) |
| Feature extraction | PseudoDeepStack (multi-scale, 3 depths) | Single-layer |
| Resolution | Fixed 384px | Dynamic 448px, up to 12 tiles |
| Video support | No | Yes |
| Training hardware | Single NVIDIA DGX Spark ($3,999) | Multi-GPU cluster (many thousands of dollars) |
| Reproducibility | Novel architecture, single-device training pipeline | Standard multi-GPU distributed training |
Why this project matters:
- Efficiency through ingenuity: Achieved vision capability on a single consumer device by designing a novel training pipeline that works within extreme hardware constraints
- PseudoDeepStack: A new multi-scale feature extraction method that captures richer visual information than single-layer approaches, not used by InternVL
- MoE routing analysis: First documented explanation of why projector-only training fails on MoE architectures, saving the community from a dead-end approach
- Parameter-efficient adaptation: Trained only ~2% of model parameters to achieve vision, demonstrating that brute-force full training isn't the only path
Usage
Requirements
```bash
pip install torch transformers accelerate pillow
```
Quick Start
```python
import torch
from PIL import Image
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    SiglipImageProcessor, SiglipVisionModel,
)
from huggingface_hub import hf_hub_download

REPO = "vincentkaufmann/gpt-oss-20b-vision-preview"
DEVICE = "cuda"

# 1. Load vision encoder (SigLIP -- kept on CPU in float32 to save GPU memory)
processor = SiglipImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")
vision = SiglipVisionModel.from_pretrained(
    "google/siglip-so400m-patch14-384", torch_dtype=torch.float32
).eval()

# 2. Load projector (PseudoDeepStack: multi-scale visual features)
proj_path = hf_hub_download(REPO, "projector-step9000.pt")
proj_ckpt = torch.load(proj_path, map_location="cpu", weights_only=False)
projector = torch.nn.Sequential(
    torch.nn.Linear(3456, 2880), torch.nn.GELU(), torch.nn.Linear(2880, 2880)
)
projector.load_state_dict({
    k.replace("projector.", ""): v
    for k, v in proj_ckpt["state_dict"].items()
})
projector = projector.to(DEVICE).to(torch.bfloat16).eval()

# 3. Load merged LLM (LoRA already merged into the weights)
tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    REPO, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model.eval()

# 4. Process the image
image = Image.open("your_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    out = vision(**inputs, output_hidden_states=True)
# PseudoDeepStack: concatenate hidden states from three encoder depths
features = torch.cat([out.hidden_states[l] for l in [9, 18, 27]], dim=-1)
visual_tokens = projector(features.to(torch.bfloat16).to(DEVICE))

# 5. Generate (visual tokens are prepended to the text embeddings)
prompt = tokenizer("Describe this image in detail.", return_tensors="pt").to(DEVICE)
embeds = model.get_input_embeddings()(prompt["input_ids"])
input_embeds = torch.cat([visual_tokens, embeds], dim=1)
with torch.no_grad():
    output = model.generate(
        inputs_embeds=input_embeds,
        attention_mask=torch.ones(1, input_embeds.shape[1], dtype=torch.long, device=DEVICE),
        max_new_tokens=256, do_sample=True,
        temperature=0.7, top_p=0.9, repetition_penalty=1.1,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Note: The full BF16 model requires ~40 GB of VRAM. For 4-bit quantized loading, pass a `BitsAndBytesConfig` with `load_in_4bit=True` to `from_pretrained`.
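The 4-bit loading path mentioned above can be sketched as follows (a configuration sketch, assuming `bitsandbytes` is installed; untested against this checkpoint — NF4 with bfloat16 compute is chosen here to mirror the QLoRA training setup described below):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization mirroring the 4-bit NormalFloat setup used in training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "vincentkaufmann/gpt-oss-20b-vision-preview",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```

This cuts the LLM's memory footprint to roughly a quarter of the BF16 size at some cost in output quality.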
Training Details
| Parameter | Value |
|---|---|
| Training stage | Visual instruction tuning (single-stage QLoRA) |
| Dataset | 647K samples (LLaVA-Instruct-150K + Infinity-MM Stage 4 curated) |
| Epochs | 1 (22% complete at this checkpoint) |
| LoRA rank | 128 |
| LoRA alpha | 256 |
| LoRA targets | Q, K, V, O, gate, up, down projections |
| Quantization | 4-bit NormalFloat (QLoRA) |
| Optimizer | AdamW |
| Hardware | Single NVIDIA DGX Spark GB10 Blackwell |
| Training time | ~3.5 days to this checkpoint (of ~15 days total) |
Limitations
This is an early proof of concept at 22% training:
- Hallucinations: The model invents objects and details not present in images
- Fine-grained understanding: Struggles with text in images, counting, and spatial precision
- Single image only: No multi-image or video support
- Resolution: Fixed 384x384 input (no dynamic resolution)
All of these are expected to improve significantly once the training run completes and the planned architectural upgrades land.
The Story
I'm Vincent — a solo builder who trained a VLM for GPT-OSS on a DGX Spark from my hotel room in Dubai. No lab, no cluster, no team of PhDs. Just a Spark, a laptop, and stubbornness.
While OpenGVLab's InternVL3.5 brought vision to GPT-OSS using their full training pipeline, this project takes a different approach: parameter-efficient QLoRA adaptation with a novel multi-scale feature method, built and trained on a single consumer device.
This proof of concept demonstrates three things:
- PseudoDeepStack works — multi-scale visual features improve understanding at zero cost
- MoE architectures can see — with the right adaptation approach (QLoRA, not projector-only)
- Hardware constraints drive innovation — the right architecture lets a single DGX Spark do what typically requires a GPU cluster
What's needed to go from proof of concept to production:
- Complete training (remaining ~31,000 steps) — estimated 11 more days on Spark
- Scale to GPT-OSS-120B — same projector works due to shared hidden dimensions
- Benchmark and evaluate against LLaVA-1.5, Qwen3-VL, and other VLMs
- Dynamic resolution (AnyRes tiling) for higher-quality image understanding
Help Us Ship the Real Thing
This project needs compute. The DGX Spark is powerful for its size, but finishing training and scaling to 120B requires GPU hours I can't self-fund.
What your support enables:
| Tier | Cost | What It Buys |
|---|---|---|
| Finish 20B training | ~$500 | Complete the remaining 78% of training on cloud GPUs |
| Train 120B version | ~$2,000 | Full GPT-OSS-120B-Vision with the same architecture |
| Production quality | ~$5,000 | Extended training on 3M+ samples, benchmarking, GGUF release |
Every dollar goes directly to GPU time. No overhead, no team salaries — just compute.
Contact: vincentkaufmann@protonmail.com
Roadmap
- PseudoDeepStack architecture design
- Stage 1: Projector alignment (558K image-caption pairs)
- Discovery: Projector-only fails for MoE → QLoRA required
- Stage 2: QLoRA visual instruction tuning (647K samples)
- Proof of concept checkpoint (this release)
- Complete full training epoch
- GPT-OSS-120B-Vision (same projector, larger LLM)
- GGUF format for llama.cpp / LM Studio compatibility
- Dynamic resolution (AnyRes tiling)
- Comprehensive benchmark evaluation
Citation
```bibtex
@misc{kaufmann2026gptossvision,
  title={GPT-OSS-Vision: Efficient Vision-Language Adaptation of Sparse MoE Models via PseudoDeepStack},
  author={Vincent Kaufmann},
  year={2026},
  howpublished={\url{https://huggingface.co/vincentkaufmann/gpt-oss-20b-vision-preview}},
}
```
License
Apache 2.0 — same as the base GPT-OSS model.
Base model: axolotl-ai-co/gpt-oss-20b-dequantized