GPT-OSS-20B-Vision Preview (Proof of Concept)

A vision-language model for GPT-OSS, built from scratch on a single NVIDIA DGX Spark in a Dubai hotel room. It introduces PseudoDeepStack, a multi-scale visual-feature injection method, and the first documented analysis of why projector-only training fails on MoE architectures.

[Photo] Training GPT-OSS-20B-Vision on a DGX Spark. Dubai, February 2026.

What This Is

This is a proof of concept — not a production model. It demonstrates that the GPT-OSS Mixture-of-Experts architecture can be given vision capabilities using a novel multi-scale feature injection method we call PseudoDeepStack.

At 22% through training (step 9,000 of 40,461), the model already:

  • Identifies objects, scenes, and spatial relationships in images
  • Generates coherent multi-sentence descriptions
  • Understands food, people, indoor/outdoor scenes

It also hallucinates details and misses fine-grained elements — expected at this training stage.

We need compute to finish training and scale to 120B. See below.

Architecture

Component      | Details
Vision Encoder | SigLIP-SO400M-patch14-384 (frozen)
Feature Method | PseudoDeepStack — multi-scale visual features from multiple encoder depths
Projector      | 2-layer MLP, 18.2M parameters
Language Model | GPT-OSS-20B MoE (4-bit QLoRA, rank 128, alpha 256)
Visual Tokens  | 729 per image (27x27 patches at 384px)
Training Data  | 647K samples — LLaVA-Instruct + Infinity-MM Stage 4
Hardware       | Single NVIDIA DGX Spark GB10 (128 GB unified memory)

PseudoDeepStack

Standard VLMs extract features from only the final vision encoder layer. We extract from multiple depths — capturing low-level edges and textures, mid-level shapes and parts, and high-level semantic features — then concatenate them into enriched visual tokens. This gives the language model a richer visual representation at zero additional inference cost (same 729 tokens).

How it works: SigLIP-SO400M has 27 transformer layers. Instead of using only the final layer's output, we extract hidden states from layers 9, 18, and 27 — representing three levels of visual understanding. Layer 9 captures low-level features like edges and textures. Layer 18 captures mid-level structure like shapes and object parts. Layer 27 captures high-level semantics. These three [729, 1152] feature maps are concatenated along the feature dimension into a single [729, 3456] tensor, then projected down to [729, 2880] by a 2-layer MLP to match the LLM's hidden size. The result: each of the 729 visual tokens carries information from three scales of understanding, at zero additional inference cost compared to standard single-layer extraction.
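The shape arithmetic above can be sketched in a few lines of PyTorch. This is a minimal illustration: random tensors stand in for real SigLIP hidden states, and the Linear/GELU projector mirrors the description but is untrained here.

```python
import torch

# Dimensions from the text: 729 = 27 x 27 patch tokens, SigLIP width 1152,
# GPT-OSS hidden size 2880.
num_tokens, enc_dim, llm_dim = 729, 1152, 2880

# One [729, 1152] feature map per tapped encoder depth (layers 9, 18, 27);
# random stand-ins for the real hidden states.
low, mid, high = (torch.randn(num_tokens, enc_dim) for _ in range(3))

# PseudoDeepStack: concatenate along the feature dimension -> [729, 3456]
stacked = torch.cat([low, mid, high], dim=-1)
assert stacked.shape == (num_tokens, 3 * enc_dim)

# 2-layer MLP projector maps down to the LLM hidden size -> [729, 2880]
projector = torch.nn.Sequential(
    torch.nn.Linear(3 * enc_dim, llm_dim),
    torch.nn.GELU(),
    torch.nn.Linear(llm_dim, llm_dim),
)
visual_tokens = projector(stacked)
assert visual_tokens.shape == (num_tokens, llm_dim)
```

Note that the token count never changes — only the per-token feature width grows before projection, which is why inference cost stays identical to single-layer extraction.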

Inspired by Qwen3-VL's DeepStack, but designed to work with frozen/quantized LLMs without architectural modifications.

Key Finding: MoE Models Need LoRA for Vision

We discovered that projector-only training fails for Mixture-of-Experts architectures. Unlike dense models, where the LLM can sometimes process visual tokens without adaptation, MoE models produce incoherent output when visual tokens hit an expert router that never saw them during pretraining. QLoRA adaptation of the attention and expert-MLP projections lets the router learn to handle the new modality.
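As a concrete sketch, the adapter setup described here could be expressed with peft's LoraConfig. The rank, alpha, and target list come from the training table below; the module names (q_proj, gate_proj, etc.) are an assumption about the checkpoint's layer naming, not confirmed by the source.

```python
from peft import LoraConfig

# Hedged sketch of the QLoRA adapter configuration (module names assumed).
lora_config = LoraConfig(
    r=128,            # LoRA rank, from the training details
    lora_alpha=256,   # scaling alpha
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # expert MLP projections
    ],
    task_type="CAUSAL_LM",
)
```

Targeting the MLP/expert projections alongside attention is what gives the router a path to adapt to visual tokens; a projector-only setup leaves all of these frozen.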

How This Compares to InternVL3.5-GPT-OSS-20B

OpenGVLab (Shanghai AI Laboratory) released InternVL3.5-GPT-OSS-20B-A4B in August 2025 — a team of dozens of researchers with access to large-scale A100 clusters, months of development, and a 4-stage training pipeline including reinforcement learning. Their model is also a Preview.

This project reached a comparable milestone in 7 days, with one person, one DGX Spark, and a multi-scale feature method InternVL does not use. Different project, different approach:

Aspect             | This Project                                        | InternVL3.5-GPT-OSS-20B
Training method    | QLoRA (~2% of parameters)                           | Full model training (4 stages incl. RL)
Training hardware  | Single NVIDIA DGX Spark ($3,999, consumer device)   | Multi-GPU cluster (thousands of dollars)
Vision encoder     | SigLIP-SO400M (frozen, off-the-shelf)               | InternViT-300M (custom, trained)
Feature extraction | PseudoDeepStack (multi-scale, 3 depths)             | Single-layer
Resolution         | Fixed 384px                                         | Dynamic 448px, up to 12 tiles
Video support      | No                                                  | Yes
Reproducibility    | Novel architecture, single-device training pipeline | Standard multi-GPU distributed training

Why this project matters:

  • Efficiency through ingenuity: Achieved vision capability on a single consumer device by designing a novel training pipeline that works within extreme hardware constraints
  • PseudoDeepStack: A new multi-scale feature extraction method that captures richer visual information than single-layer approaches, not used by InternVL
  • MoE routing analysis: First documented explanation of why projector-only training fails on MoE architectures, saving the community from a dead-end approach
  • Parameter-efficient adaptation: Trained only ~2% of model parameters to achieve vision, demonstrating that brute-force full training isn't the only path

Usage

Requirements

pip install torch transformers accelerate pillow

Quick Start

import torch
from PIL import Image
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    SiglipImageProcessor, SiglipVisionModel,
)
from huggingface_hub import hf_hub_download

REPO = "vincentkaufmann/gpt-oss-20b-vision-preview"
DEVICE = "cuda"

# 1. Load vision encoder (SigLIP — runs on CPU to save GPU memory)
processor = SiglipImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")
vision = SiglipVisionModel.from_pretrained(
    "google/siglip-so400m-patch14-384", torch_dtype=torch.float32
).eval()

# 2. Load projector (PseudoDeepStack: multi-scale visual features)
proj_path = hf_hub_download(REPO, "projector-step9000.pt")
proj_ckpt = torch.load(proj_path, map_location="cpu", weights_only=False)
projector = torch.nn.Sequential(
    torch.nn.Linear(3456, 2880), torch.nn.GELU(), torch.nn.Linear(2880, 2880)
)
projector.load_state_dict({
    k.replace("projector.", ""): v
    for k, v in proj_ckpt["state_dict"].items()
})
projector = projector.to(DEVICE).to(torch.bfloat16).eval()

# 3. Load merged LLM (LoRA already merged into weights)
tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    REPO, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model.eval()

# 4. Process image
image = Image.open("your_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    out = vision(**inputs, output_hidden_states=True)
    # PseudoDeepStack: extract features from multiple encoder depths
    features = torch.cat([out.hidden_states[l] for l in [9, 18, 27]], dim=-1)
    visual_tokens = projector(features.to(torch.bfloat16).to(DEVICE))

# 5. Generate
prompt = tokenizer("Describe this image in detail.", return_tensors="pt").to(DEVICE)
embeds = model.get_input_embeddings()(prompt["input_ids"])
input_embeds = torch.cat([visual_tokens, embeds], dim=1)

with torch.no_grad():
    output = model.generate(
        inputs_embeds=input_embeds,
        attention_mask=torch.ones(
            1, input_embeds.shape[1], dtype=torch.long, device=DEVICE
        ),
        max_new_tokens=256,
        do_sample=True,  # required for temperature/top_p to take effect
        temperature=0.7, top_p=0.9, repetition_penalty=1.1,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))

Note: The full BF16 model requires ~40 GB of VRAM. For 4-bit loading, pass a BitsAndBytesConfig with load_in_4bit=True via the quantization_config argument of from_pretrained (requires the bitsandbytes package).
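A minimal sketch of 4-bit loading, matching the NF4 quantization used in training. The bnb_4bit_* settings are reasonable defaults for this setup, not values confirmed by the source; bitsandbytes must be installed and the model download is ~12 GB.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NormalFloat (NF4) loading with bf16 compute, as in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "vincentkaufmann/gpt-oss-20b-vision-preview",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```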

Training Details

Parameter      | Value
Training stage | Visual instruction tuning (single-stage QLoRA)
Dataset        | 647K samples (LLaVA-Instruct-150K + Infinity-MM Stage 4, curated)
Epochs         | 1 (22% complete at this checkpoint)
LoRA rank      | 128
LoRA alpha     | 256
LoRA targets   | Q, K, V, O, gate, up, down projections
Quantization   | 4-bit NormalFloat (QLoRA)
Optimizer      | AdamW
Hardware       | Single NVIDIA DGX Spark GB10 (Blackwell)
Training time  | ~3.5 days to this checkpoint (of ~15 days total)

Limitations

This is an early proof of concept at 22% training:

  • Hallucinations: The model invents objects and details not present in images
  • Fine-grained understanding: Struggles with text in images, counting, and spatial precision
  • Single image only: No multi-image or video support
  • Resolution: Fixed 384x384 input (no dynamic resolution)

These limitations are expected to improve significantly with full training completion and architectural upgrades.

The Story

I'm Vincent — a solo builder who trained a VLM for GPT-OSS on a DGX Spark from my hotel room in Dubai. No lab, no cluster, no team of PhDs. Just a Spark, a laptop, and stubbornness.

While OpenGVLab's InternVL3.5 brought vision to GPT-OSS using their full training pipeline, this project takes a different approach: parameter-efficient QLoRA adaptation with a novel multi-scale feature method, built and trained on a single consumer device.

This proof of concept demonstrates three things:

  1. PseudoDeepStack works — multi-scale visual features enrich the visual representation at no additional inference cost
  2. MoE architectures can see — with the right adaptation approach (QLoRA, not projector-only training)
  3. Hardware constraints drive innovation — the right architecture lets a single DGX Spark do what typically requires a GPU cluster

What's needed to go from proof of concept to production:

  • Complete training (remaining ~31,000 steps) — estimated 11 more days on Spark
  • Scale to GPT-OSS-120B — same projector works due to shared hidden dimensions
  • Benchmark and evaluate against LLaVA-1.5, Qwen3-VL, and other VLMs
  • Dynamic resolution (AnyRes tiling) for higher-quality image understanding

Help Us Ship the Real Thing

This project needs compute. The DGX Spark is powerful for its size, but finishing training and scaling to 120B requires GPU hours I can't self-fund.

What your support enables:

Tier                | Cost    | What It Buys
Finish 20B training | ~$500   | Complete the remaining 78% of training on cloud GPUs
Train 120B version  | ~$2,000 | Full GPT-OSS-120B-Vision with the same architecture
Production quality  | ~$5,000 | Extended training on 3M+ samples, benchmarking, GGUF release

Every dollar goes directly to GPU time. No overhead, no team salaries — just compute.

Contact: vincentkaufmann@protonmail.com

Roadmap

  • PseudoDeepStack architecture design
  • Stage 1: Projector alignment (558K image-caption pairs)
  • Discovery: Projector-only fails for MoE → QLoRA required
  • Stage 2: QLoRA visual instruction tuning (647K samples)
  • Proof of concept checkpoint (this release)
  • Complete full training epoch
  • GPT-OSS-120B-Vision (same projector, larger LLM)
  • GGUF format for llama.cpp / LM Studio compatibility
  • Dynamic resolution (AnyRes tiling)
  • Comprehensive benchmark evaluation

Citation

@misc{kaufmann2026gptossvision,
  title={GPT-OSS-Vision: Efficient Vision-Language Adaptation of Sparse MoE Models via PseudoDeepStack},
  author={Vincent Kaufmann},
  year={2026},
  howpublished={\url{https://huggingface.co/vincentkaufmann/gpt-oss-20b-vision-preview}},
}

License

Apache 2.0 — same as the base GPT-OSS model.
