amitha/molmo-dinov3-b16-1b-olmo3

A Molmo-style vision-language model with three parts: a frozen DINOv3 ViT-B/16 vision encoder (pretrained on the DataComp-1B medium subset (128M, bs2048), from amitha/dinov3-vitb16-datacomp-1b-medium-subset-128M-bs2048), a trained multimodal connector, and the OLMo-3-7B language model (Olmo3ForCausalLM).

The vision encoder was frozen during training; only the connector (a SwiGLU image projector + a CLS projector) and the language model were trained, following the Molmo recipe.

Vision weights are referenced, not stored. This repo ships the connector + LLM weights only. The vision tower is loaded at runtime from amitha/dinov3-vitb16-datacomp-1b-medium-subset-128M-bs2048, so that repo must remain accessible. Loading requires trust_remote_code=True.

Checkpoints

Training ran for 4 epochs. The repo root is the final checkpoint (step14392, 4 epochs). Three earlier checkpoints are available as subfolders:

| Checkpoint | Subfolder | Notes |
|---|---|---|
| step14392 | (root) | final (4 epochs) |
| step7196 | step7196 | 2 epochs |
| step13000 | step13000 | ~3.6 epochs |
| step14000 | step14000 | ~3.9 epochs |
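
The approximate epoch counts above follow from the final checkpoint: 14392 steps over 4 epochs is 3598 steps per epoch. A minimal sketch of that arithmetic, assuming a constant number of steps per epoch (the helper name `epochs_at` is illustrative and not part of the repo):

```python
# Final checkpoint from the table: step 14392 after 4 epochs.
FINAL_STEP = 14392
TOTAL_EPOCHS = 4
STEPS_PER_EPOCH = FINAL_STEP / TOTAL_EPOCHS  # 3598.0

def epochs_at(step: int) -> float:
    """Epochs completed at a given training step (hypothetical helper)."""
    return step / STEPS_PER_EPOCH

for step in (7196, 13000, 14000, 14392):
    print(f"step{step}: {epochs_at(step):.1f} epochs")
# step7196: 2.0 epochs
# step13000: 3.6 epochs
# step14000: 3.9 epochs
# step14392: 4.0 epochs
```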

Load an earlier checkpoint with subfolder=:

from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "amitha/molmo-dinov3-b16-1b-olmo3", subfolder="step13000", trust_remote_code=True)

Usage

import torch, PIL.Image, requests
from transformers import AutoModelForImageTextToText, AutoProcessor

repo = "amitha/molmo-dinov3-b16-1b-olmo3"
# trust_remote_code=True is required; the vision tower is fetched from its own repo at load time.
model = AutoModelForImageTextToText.from_pretrained(
    repo, trust_remote_code=True, dtype=torch.float32).eval()
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)

# Download an example image and force 3-channel RGB.
image = PIL.Image.open(requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
    stream=True).raw).convert("RGB")
inputs = processor(text="Describe this image in detail.", images=[image], return_tensors="pt")

# Greedy decoding; slice off the prompt tokens so only the continuation is printed.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Prompt styles

The model was trained with several caption/QA styles. The processor exposes an optional style argument (default: none) that prepends a "{style}: " prefix matching training:

inputs = processor(text="Describe this image.", images=[image],
                   style="long_caption", return_tensors="pt")

Known styles: long_caption, transcript, user_qa, synthetic_qa.
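
To make the prefix concrete, here is a small sketch of the text that results from the "{style}: " rule described above. `apply_style` is a hypothetical helper written for illustration, not the processor's actual code. Rejecting unknown styles is an assumption of this sketch, and the real processor may apply further templating around the prompt.

```python
from typing import Optional

# Styles listed in this model card.
KNOWN_STYLES = ("long_caption", "transcript", "user_qa", "synthetic_qa")

def apply_style(text: str, style: Optional[str] = None) -> str:
    """Prepend a "{style}: " prefix (hypothetical stand-in for the processor)."""
    if style is None:
        # Default: no prefix, the prompt is passed through unchanged.
        return text
    if style not in KNOWN_STYLES:
        # Assumption of this sketch: unknown styles are treated as an error.
        raise ValueError(f"unknown style {style!r}; expected one of {KNOWN_STYLES}")
    return f"{style}: {text}"

print(apply_style("Describe this image.", style="long_caption"))
# long_caption: Describe this image.
```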

Architecture notes

  • Image tokens: single 224×224 crop, no pooling, CLS token included → 197 image tokens (1 CLS + 196 patches) inserted into the text stream.
  • LLM: native Olmo3ForCausalLM (post-norm, YaRN RoPE), vocabulary padded to 100480; the 128 image-placeholder logits are masked during generation.
  • Image preprocessing: resize so the short side is 224 (bicubic), center-crop 224, normalize with ImageNet statistics.
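
The token count and the preprocessing geometry in the notes above can be checked with a few lines of standard-library Python. This is a sketch under stated assumptions: the function names are illustrative, truncating the long side with `int()` is an assumption (the shipped processor may round differently), and the ImageNet mean and standard deviation are the commonly used values, which this card does not spell out.

```python
PATCH = 16  # ViT-B/16 patch size
CROP = 224  # single square crop fed to the vision encoder

# Commonly used ImageNet statistics (assumed; not listed in the card).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def num_image_tokens(crop=CROP, patch=PATCH, include_cls=True):
    """Patches per crop, plus one CLS token when included."""
    per_side = crop // patch
    return per_side * per_side + (1 if include_cls else 0)

def resized_size(width, height, short=CROP):
    """Scale so the short side equals `short`, keeping the aspect ratio."""
    if width <= height:
        return short, int(short * height / width)
    return int(short * width / height), short

def center_crop_box(width, height, crop=CROP):
    """(left, top, right, bottom) of a centered crop x crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop

def normalize_pixel(rgb):
    """Map an 8-bit RGB triple to ImageNet-normalized floats."""
    return tuple((c / 255.0 - m) / s
                 for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))

print(num_image_tokens())             # 197 = 1 CLS + 14 * 14 patches
w, h = resized_size(640, 480)
print((w, h), center_crop_box(w, h))  # (298, 224) (37, 0, 261, 224)
```

With no pooling, every patch becomes one token, so the per-image cost in the text stream is fixed at 197 positions regardless of the input resolution.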

Provenance

Converted from native Molmo training checkpoints to the HuggingFace format with a converter verified against the original Molmo inference: identical input ids and image token layout, vision features and logits matching up to floating-point ordering noise, and identical greedy generations.
