RajGana/mini-vlm-scratch

vision-language

Mini VLM (Built from Scratch)

A small vision-language model (VLM) built from scratch. It combines a frozen CLIP image encoder, a linear projection layer, and a custom Transformer decoder that serves as the language model.

Architecture

Vision: CLIP ViT-B/32 (frozen)
Projection: Linear(512 → 384)
LLM: Custom Transformer (6 layers, 384 dim)
Dataset: COCO Captions (20k samples)
GPU: NVIDIA L4
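
The components listed above can be wired together roughly as follows. This is a minimal PyTorch sketch, not the author's code: the class name `MiniVLMSketch`, the head count, vocabulary size, maximum length, and feed-forward width are placeholders the card does not specify, and the image is assumed to enter the decoder as a single token made from CLIP's pooled 512-dimensional embedding. A random tensor stands in for the frozen CLIP output so the example runs without downloading weights.

```python
import torch
import torch.nn as nn

CLIP_DIM = 512     # CLIP ViT-B/32 image embedding size (from the card)
LLM_DIM = 384      # decoder width (from the card)
NUM_LAYERS = 6     # decoder depth (from the card)
NUM_HEADS = 6      # assumption: not stated in the card
VOCAB_SIZE = 1000  # assumption: placeholder vocabulary size
MAX_LEN = 64       # assumption: placeholder maximum sequence length


class MiniVLMSketch(nn.Module):
    """Hypothetical stand-in for the projection + decoder described above."""

    def __init__(self):
        super().__init__()
        self.projection = nn.Linear(CLIP_DIM, LLM_DIM)
        self.token_emb = nn.Embedding(VOCAB_SIZE, LLM_DIM)
        self.pos_emb = nn.Embedding(MAX_LEN, LLM_DIM)
        layer = nn.TransformerEncoderLayer(
            d_model=LLM_DIM,
            nhead=NUM_HEADS,
            dim_feedforward=4 * LLM_DIM,  # assumption: common 4x expansion
            batch_first=True,
        )
        # Self-attention layers plus a causal mask behave as a decoder-only stack.
        self.blocks = nn.TransformerEncoder(layer, num_layers=NUM_LAYERS)
        self.lm_head = nn.Linear(LLM_DIM, VOCAB_SIZE)

    def forward(self, image_features, token_ids):
        # image_features: (batch, 512); token_ids: (batch, seq_len)
        img_tok = self.projection(image_features).unsqueeze(1)  # (batch, 1, 384)
        txt_tok = self.token_emb(token_ids)                     # (batch, seq_len, 384)
        x = torch.cat([img_tok, txt_tok], dim=1)                # image token comes first
        n = x.size(1)
        x = x + self.pos_emb(torch.arange(n, device=x.device))
        # Boolean mask: True marks positions a token may NOT attend to (the future).
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1)
        x = self.blocks(x, mask=causal)
        return self.lm_head(x)                                  # (batch, 1 + seq_len, vocab)


model = MiniVLMSketch().eval()
image_features = torch.randn(2, CLIP_DIM)            # stand-in for frozen CLIP outputs
token_ids = torch.randint(0, VOCAB_SIZE, (2, 10))    # stand-in caption tokens
with torch.no_grad():
    logits = model(image_features, token_ids)

proj_params = sum(p.numel() for p in model.projection.parameters())
print(tuple(logits.shape))  # (2, 11, 1000)
print(proj_params)          # 196992 = 512 * 384 weights + 384 biases
```

In a real run the random `image_features` would be replaced by embeddings from the frozen CLIP encoder, computed under `torch.no_grad()` so that no gradients flow into it.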

Training

Epochs: 3 | Final loss: 1.17
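The pipeline mirrors LLaVA Stage 1 (feature-alignment pre-training): features from a frozen vision encoder are projected into the language model's embedding space and trained on image-caption pairs.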
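
To give the final loss a more intuitive scale, it can be converted to perplexity. The snippet below assumes the reported 1.17 is a mean per-token cross-entropy measured in nats, which the card does not state; if the loss was computed differently, the conversion does not apply.

```python
import math

# Reported in the card; assumed to be mean token-level cross-entropy in nats.
final_loss = 1.17

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(final_loss)
print(f"{perplexity:.2f}")  # 3.22
```

Under that assumption, the model is on average about as uncertain as if it were choosing uniformly among roughly three tokens at each caption position.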
