DistilViT2 Image Captioning Model

This model performs image captioning using a prefix-conditioning architecture with LoRA adapters.

Browser demo showing on-device captioning

Architecture

Image → SigLIP Vision Encoder → Projection Layer → SmolLM + LoRA → Caption

Components:

  • Vision Encoder: SigLIP-base-patch16-224 (frozen during training)
  • Projection Layer: Linear layer mapping vision features to text embedding space
  • Language Model: SmolLM-135M with LoRA adapters (rank=16)
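
A minimal sketch of how these pieces compose at inference time, assuming the Hugging Face transformers API and the public base checkpoints (google/siglip-base-patch16-224, HuggingFaceTB/SmolLM-135M). The projection layer is freshly initialized and the LoRA adapters are omitted, so this illustrates the data flow rather than this repo's actual loading code:

import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Frozen SigLIP vision tower: one 768-dim embedding per 16x16 patch (14 x 14 = 196 tokens).
vision = AutoModel.from_pretrained("google/siglip-base-patch16-224").vision_model
lm = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M")
projection = torch.nn.Linear(768, 576)  # maps vision features into SmolLM's 576-dim embedding space

@torch.no_grad()
def caption(pixel_values, prompt="A photo of", max_new_tokens=15):
    patches = vision(pixel_values=pixel_values).last_hidden_state   # [1, 196, 768]
    prefix = projection(patches)                                     # [1, 196, 576]
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    text = lm.get_input_embeddings()(ids)                            # [1, T, 576]
    # Prefix-conditioning: projected image tokens are simply prepended to the text embeddings,
    # so the language model needs no cross-attention layers.
    embeds = torch.cat([prefix, text], dim=1)
    out = lm.generate(inputs_embeds=embeds, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)  # only the newly generated tokens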

Model Contents

The model.safetensors file (984 MB) contains all weights needed for inference:

Complete Model Weights (723 tensors total)

  1. Vision Encoder (~300 MB)

    • Complete SigLIP-base-patch16-224 weights
    • Keys: vision_encoder.*
    • Hidden size: 768
    • Patches: 14×14 = 196 tokens per image
  2. Projection Layer (~1 MB)

    • Linear projection: 768 → 576
    • Keys: projection.*
    • Maps vision features to language model embedding space
  3. Language Model Base Weights (~660 MB)

    • Complete SmolLM-135M base weights
    • Keys: language_model.base_model.model.*
    • 30 layers, 576 hidden size, 49152 vocab size
  4. LoRA Adapters (~2 MB trainable)

    • Separate low-rank matrices (not merged into base)
    • Keys: language_model.*.lora_A.default.weight, language_model.*.lora_B.default.weight
    • Applied to: q_proj, k_proj, v_proj, o_proj
    • Rank: 16, Alpha: 16, Dropout: 0.1
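
As a quick sanity check that a local copy of model.safetensors matches this layout, here is a short sketch assuming only the safetensors Python package (with torch installed for framework="pt"); it groups tensors by the key prefixes listed above:

from collections import defaultdict
from safetensors import safe_open

counts, nbytes = defaultdict(int), defaultdict(int)
with safe_open("model.safetensors", framework="pt") as f:
    for key in f.keys():
        # vision_encoder / projection / language_model, with LoRA keys split out separately
        group = "lora" if ".lora_" in key else key.split(".")[0]
        tensor = f.get_tensor(key)
        counts[group] += 1
        nbytes[group] += tensor.numel() * tensor.element_size()

for group in sorted(counts):
    print(f"{group:16s} {counts[group]:4d} tensors  {nbytes[group] / 1e6:8.1f} MB")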

Training Details

  • Trainable Parameters: 2.2M / 221M total (1%)
  • Frozen: Vision encoder (SigLIP)
  • Trainable: Projection layer + LoRA adapters (see the setup sketch after this list)
  • Datasets: Flickr30k, COCO
  • Architecture: Prefix-conditioning (no cross-attention)
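
A minimal sketch of this trainable/frozen split, assuming the peft and transformers libraries; the checkpoint names and the standalone projection layer are illustrative, not this repo's training script:

import torch
from transformers import AutoModel, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

vision = AutoModel.from_pretrained("google/siglip-base-patch16-224").vision_model
lm = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")
projection = torch.nn.Linear(768, 576)

# Freeze the vision encoder entirely.
for p in vision.parameters():
    p.requires_grad = False

# Rank-16 LoRA adapters on the attention projections; only these and the projection layer train.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
lm = get_peft_model(lm, lora_cfg)
lm.print_trainable_parameters()  # ~1.8M LoRA params; the projection layer adds ~443K more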

Usage

Python CLI (torch vs ONNX)

Run the side-by-side comparison script:

python compare_inference.py --model-dir . --onnx-dir onnx --image cat.jpg --prompt "A photo of" --max-new-tokens 15

The key code path inside compare_inference.py:

# Torch path: load the combined safetensors weights and caption with the PyTorch model.
vision_encoder, projection, language_model, _ = load_models(args.model_dir, device)
pixel_values = preprocess(args.image, processor, device)

torch_caption = torch_generate(
    vision_encoder, projection, language_model, tokenizer, pixel_values, args.prompt, args.max_new_tokens
)

# ONNX path: run the exported sessions on the same image so the two captions can be compared.
vision_sess, proj_sess, lm_sess = load_onnx_sessions(args.onnx_dir)
onnx_caption = onnx_generate(
    vision_sess, proj_sess, lm_sess, language_model, tokenizer, pixel_values, args.prompt, args.max_new_tokens
)

--image defaults to cat.jpg in the repo if you do not pass one. The script prints both captions so you can verify parity between torch and ONNX.
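
If you want to wire the ONNX exports into your own pipeline, here is a small discovery sketch assuming only the onnxruntime package; the file and tensor names are whatever the export produced, so nothing is hard-coded:

import glob
import onnxruntime as ort

for path in sorted(glob.glob("onnx/*.onnx")):
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    print(path)
    print("  inputs :", [(i.name, i.shape) for i in sess.get_inputs()])
    print("  outputs:", [(o.name, o.shape) for o in sess.get_outputs()])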

Browser demo (ort.js)

Run a static server from demo/:

cd demo && python -m http.server 8000

demo/main.js runs the full pipeline on-device:

await loadAll(); // downloads tokenizer/processor assets and ONNX models from ./demo/models

const pixelData = await preprocessImage(currentImage);
const visionHidden = await runVision(pixelData);
const projected = await runProjection(visionHidden);

const prompt = '';   // empty prompt: the projected image prefix alone conditions generation
const encoded = await tokenizer(prompt);
const initFeeds = {
  prefix_embeddings: projected,
  input_ids: new ort.Tensor('int64', BigInt64Array.from(encoded.input_ids.data.map(BigInt)), [1, encoded.input_ids.data.length]),
};
const initOutputs = await prefixInitSession.run(initFeeds);
// then decode step-by-step with cached past:
const feeds = buildDecoderInputs([BigInt(nextToken)], attention, position, past);
const outputs = await lmSession.run(feeds);

Open http://localhost:8000, drop an image, and click Generate caption to watch the vision → projection → prefix-init → decode flow run in the browser.

Model Specifications

Component        Size                    Parameters  Status
Vision Encoder   768 hidden              ~87M        Frozen
Projection       768 → 576               ~443K       Trainable
Language Model   576 hidden, 30 layers   ~134M       Base frozen
LoRA Adapters    rank=16                 ~1.8M       Trainable
Total                                    ~221M       2.2M trainable

Key Features

  • Combined weights: All components ship in a single model.safetensors; ONNX Runtime inference uses the three exported files in onnx/
  • LoRA Preserved: LoRA weights are stored separately (not merged into the base) for flexibility; see the merge sketch below
  • Efficient: Only 1% of parameters trained, 5.7× faster than a cross-attention baseline
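
Because the adapters are stored separately, they can still be folded into the base weights later if a merged checkpoint is preferred. Here is a minimal sketch of the standard LoRA merge (with alpha = rank = 16 the scaling factor is 1); the tensors and shapes are illustrative stand-ins for the lora_A/lora_B pairs described above:

import torch

def merge_lora(base_weight, lora_A, lora_B, rank=16, alpha=16):
    # Standard LoRA merge: W' = W + (alpha / rank) * B @ A
    return base_weight + (alpha / rank) * (lora_B @ lora_A)

# Example shapes for a rank-16 adapter on a 576 -> 576 q_proj weight.
base = torch.randn(576, 576)
A = torch.randn(16, 576)    # ...lora_A.default.weight
B = torch.randn(576, 16)    # ...lora_B.default.weight
merged = merge_lora(base, A, B)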

Technical Notes

License

Model weights inherit licenses from base models:

  • SigLIP: Apache 2.0
  • SmolLM: Apache 2.0

Browser Demo (ort.js)

An offline, browser-only demo lives in demo/ and runs the ONNX exports via ort.js + transformers.js (vision encoder → projection → prefix_init → decoder). The demo/models directory points to the ONNX files in onnx/, so make sure those exports are present locally.

Launch it with any static server as shown in the Usage section above; once the page assets are loaded, no remote fetches are required.
