DistilViT2 Image Captioning Model
This model performs image captioning using a prefix-conditioning architecture with LoRA adapters.
Architecture
Image → SigLIP Vision Encoder → Projection Layer → SmolLM + LoRA → Caption
Components:
- Vision Encoder: SigLIP-base-patch16-224 (frozen during training)
- Projection Layer: Linear layer mapping vision features to text embedding space
- Language Model: SmolLM-135M with LoRA adapters (rank=16)
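The flow above can be sketched in pure Python with scaled-down shapes (the real dimensions are noted in comments; the weights here are dummies, purely to show how the projected image tokens become a prefix for the language model):

```python
# Toy sketch of the prefix-conditioning pipeline (shapes scaled down).
# Real model: 196 patches x 768 (SigLIP) -> linear projection -> 196 x 576
# (SmolLM embedding dim), then prepended to the text token embeddings.

NUM_PATCHES = 4   # real model: 196 (14 x 14 patches of a 224x224 image)
VISION_DIM = 8    # real model: 768
TEXT_DIM = 6      # real model: 576

def project(features, weight, bias):
    """Linear layer: out[i][j] = sum_k features[i][k] * weight[k][j] + bias[j]."""
    cols = list(zip(*weight))
    return [[sum(f * w for f, w in zip(row, col)) + b
             for col, b in zip(cols, bias)] for row in features]

vision_features = [[0.1] * VISION_DIM for _ in range(NUM_PATCHES)]
proj_weight = [[0.01] * TEXT_DIM for _ in range(VISION_DIM)]
proj_bias = [0.0] * TEXT_DIM

prefix = project(vision_features, proj_weight, proj_bias)  # NUM_PATCHES x TEXT_DIM
text_embeddings = [[0.0] * TEXT_DIM for _ in range(3)]     # e.g. "A photo of"

# Prefix conditioning: no cross-attention, just concatenation along the sequence.
lm_input = prefix + text_embeddings
print(len(lm_input), len(lm_input[0]))  # 7 6
```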
Model Contents
The model.safetensors file (984 MB) contains all weights needed for inference:
Complete Model Weights (723 tensors total)
Vision Encoder (~300 MB)
- Complete SigLIP-base-patch16-224 weights
- Keys: `vision_encoder.*`
- Hidden size: 768
- Patches: 14×14 = 196 tokens per image
Projection Layer (~1 MB)
- Linear projection: 768 → 576
- Keys: `projection.*`
- Maps vision features to the language model embedding space
Language Model Base Weights (~660 MB)
- Complete SmolLM-135M base weights
- Keys: `language_model.base_model.model.*`
- 30 layers, hidden size 576, vocab size 49152
LoRA Adapters (~2 MB trainable)
- Separate low-rank matrices (not merged into base)
- Keys: `language_model.*.lora_A.default.weight`, `language_model.*.lora_B.default.weight`
- Applied to: q_proj, k_proj, v_proj, o_proj
- Rank: 16, Alpha: 16, Dropout: 0.1
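Because the adapters are stored unmerged, each adapted projection applies the low-rank update at inference time: y = Wx + (alpha/rank)·B(Ax). A minimal sketch with toy dimensions (not the actual PEFT implementation):

```python
# LoRA forward for one adapted projection (toy dims).
# y = W x + (alpha / rank) * B (A x), with A: rank x in, B: out x rank.

RANK, ALPHA = 2, 2          # real model: rank=16, alpha=16 (scaling = 1.0 either way)
IN_DIM, OUT_DIM = 4, 3

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

W = [[0.5] * IN_DIM for _ in range(OUT_DIM)]   # frozen base weight
A = [[0.1] * IN_DIM for _ in range(RANK)]      # lora_A (trainable)
B = [[0.2] * RANK for _ in range(OUT_DIM)]     # lora_B (trainable)

x = [1.0] * IN_DIM
scaling = ALPHA / RANK
y = [base + scaling * delta
     for base, delta in zip(matvec(W, x), matvec(B, matvec(A, x)))]
print(y)
```

Keeping A and B separate (rather than folding `scaling * B @ A` into W) is what lets the adapters be swapped or retrained without touching the frozen base weights.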
Training Details
- Trainable Parameters: 2.2M / 221M total (1%)
- Frozen: Vision encoder (SigLIP)
- Trainable: Projection layer + LoRA adapters
- Datasets: Flickr30k, COCO
- Architecture: Prefix-conditioning (no cross-attention)
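The trainable-parameter figure can be reproduced from the shapes in this card, assuming SmolLM-135M uses grouped-query attention with 192-dim k_proj/v_proj outputs (3 KV heads × head_dim 64 — a property of the base model, inferred rather than stated here):

```python
# Reproduce the ~2.2M trainable-parameter count from layer shapes.
HIDDEN, RANK, LAYERS = 576, 16, 30
KV_DIM = 3 * 64          # assumed: 3 KV heads x head_dim 64 (GQA in SmolLM-135M)

def lora_params(in_dim, out_dim, rank=RANK):
    # lora_A: rank x in_dim, lora_B: out_dim x rank
    return rank * in_dim + out_dim * rank

per_layer = (lora_params(HIDDEN, HIDDEN)      # q_proj
             + lora_params(HIDDEN, KV_DIM)    # k_proj
             + lora_params(HIDDEN, KV_DIM)    # v_proj
             + lora_params(HIDDEN, HIDDEN))   # o_proj
lora_total = per_layer * LAYERS

projection = 768 * 576 + 576                  # projection weight + bias

print(lora_total, projection, lora_total + projection)
# 1843200 442944 2286144  ->  ~1.8M LoRA + ~443K projection = ~2.2M trainable
```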
Usage
Python CLI (torch vs ONNX)
Run the side-by-side comparison script:
```shell
python compare_inference.py --model-dir . --onnx-dir onnx --image cat.jpg --prompt "A photo of" --max-new-tokens 15
```
The key path inside compare_inference.py:

```python
vision_encoder, projection, language_model, _ = load_models(args.model_dir, device)
pixel_values = preprocess(args.image, processor, device)
torch_caption = torch_generate(
    vision_encoder, projection, language_model, tokenizer, pixel_values, args.prompt, args.max_new_tokens
)
vision_sess, proj_sess, lm_sess = load_onnx_sessions(args.onnx_dir)
onnx_caption = onnx_generate(
    vision_sess, proj_sess, lm_sess, language_model, tokenizer, pixel_values, args.prompt, args.max_new_tokens
)
```
`--image` defaults to `cat.jpg` in the repo if you do not pass one. The script prints both captions so you can verify parity between torch and ONNX.
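Both generate functions boil down to the same greedy loop. A stripped-down sketch, with a hypothetical next_token_logits callback standing in for the model forward pass:

```python
# Minimal greedy decoding loop; next_token_logits is a stand-in for a model call.
def greedy_generate(next_token_logits, prompt_ids, max_new_tokens, eos_id):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if next_id == eos_id:
            break
        ids.append(next_id)
    return ids

# Toy "model": always favors token (last_id + 1) mod 5 in a 5-token vocab.
def toy_logits(ids):
    logits = [0.0] * 5
    logits[(ids[-1] + 1) % 5] = 1.0
    return logits

print(greedy_generate(toy_logits, [0], max_new_tokens=3, eos_id=4))  # [0, 1, 2, 3]
```

Torch and ONNX captions match when both backends produce identical argmax choices at every step, which is what the comparison script checks end to end.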
Browser demo (ort.js)
Run a static server from demo/:
```shell
cd demo && python -m http.server 8000
```
demo/main.js drives the full pipeline on-device:

```javascript
await loadAll(); // downloads tokenizer/processor assets and ONNX models from ./demo/models

const pixelData = await preprocessImage(currentImage);
const visionHidden = await runVision(pixelData);
const projected = await runProjection(visionHidden);

const prompt = ''; // not needed; the caption is generated from the image prefix
const encoded = await tokenizer(prompt);
const initFeeds = {
  prefix_embeddings: projected,
  input_ids: new ort.Tensor(
    'int64',
    BigInt64Array.from(encoded.input_ids.data.map(BigInt)),
    [1, encoded.input_ids.data.length],
  ),
};
const initOutputs = await prefixInitSession.run(initFeeds);

// Then decode step by step with the cached past:
const feeds = buildDecoderInputs([BigInt(nextToken)], attention, position, past);
const outputs = await lmSession.run(feeds);
```
Open http://localhost:8000, drop an image, and click Generate caption to watch the vision → projection → prefix-init → decode flow run in the browser.
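The prefix-init/decode split in main.js follows the usual KV-cache pattern: the first run consumes the whole image-prefix-plus-prompt sequence, and every later step feeds exactly one new token alongside the accumulated past. A pure-Python sketch of that bookkeeping (the decoder here is a stand-in that only tracks sequence lengths; real sessions return logits and cached key/value tensors):

```python
# Sketch of KV-cache bookkeeping during incremental decoding.
def fake_decoder_step(token_count, past_len):
    """Stand-in for one decoder run: consumes token_count new tokens
    given past_len already-cached positions."""
    new_past_len = past_len + token_count
    logits = [0.0]  # placeholder; a real run returns next-token logits
    return logits, new_past_len

prefix_len, prompt_len = 196, 3                            # image tokens + prompt
_, past = fake_decoder_step(prefix_len + prompt_len, 0)    # prefix-init run

steps = []
for _ in range(4):                                         # four decode steps
    _, past = fake_decoder_step(1, past)                   # one new token per step
    steps.append(past)
print(steps)  # [200, 201, 202, 203]
```

Feeding only one token per step is what makes decoding cost linear in the number of generated tokens instead of quadratic.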
Model Specifications
| Component | Size | Parameters | Status |
|---|---|---|---|
| Vision Encoder | 768 hidden | ~87M | Frozen |
| Projection | 768→576 | ~443K | Trainable |
| Language Model | 576 hidden, 30 layers | ~134M | Base frozen |
| LoRA Adapters | rank=16 | ~1.8M | Trainable |
| Total | | ~221M | 2.2M trainable |
Key Features
- Combined weights: All components are merged into one model.safetensors; the ONNX runtime uses three files in onnx/
- LoRA Preserved: LoRA weights stored separately (not merged) for flexibility
- Efficient: Only 1% of parameters trained, 5.7× faster than cross-attention baseline
Technical Notes
- Training scripts: https://github.com/tarekziade/distilvit2
- For a full walkthrough of the architecture and export flow, see the blog post: https://blog.ziade.org/2025/12/16/better-alt-text-part-2/.
License
Model weights inherit licenses from base models:
- SigLIP: Apache 2.0
- SmolLM: Apache 2.0
Browser Demo (ort.js)
An offline, browser-only demo lives in demo/ and runs the ONNX exports via ort.js + transformers.js (vision encoder → projection → prefix_init → decoder). The demo/models directory points to the ONNX files in onnx/, so make sure those exports are present locally.
Run instructions are in the Browser demo section under Usage above; no remote fetches are required.
