Mini VLM (Built from Scratch)
A small vision-language model built from scratch. It combines a frozen CLIP image encoder, a linear projection layer, and a custom Transformer LLM decoder.
Architecture
- Vision: CLIP ViT-B/32 (frozen)
- Projection: Linear(512 → 384)
- LLM: Custom Transformer (6 layers, 384 dim)
- Dataset: COCO Captions (20k samples)
- GPU: NVIDIA L4
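The components listed above can be wired together roughly as follows. This is a minimal sketch, not the model's actual code: it assumes the 512-d pooled CLIP image embedding is projected to a single prefix token that is prepended to the caption tokens. The class name `MiniVLM`, the vocabulary size, the head count, and the maximum length are hypothetical placeholders, and a random tensor stands in for the frozen CLIP output.

```python
import torch
import torch.nn as nn


class MiniVLM(nn.Module):
    """Hypothetical sketch: projected CLIP embedding as a prefix token for a small causal decoder."""

    def __init__(self, vocab_size=1000, clip_dim=512, d_model=384,
                 n_layers=6, n_heads=6, max_len=64):
        super().__init__()
        # Projection: Linear(512 -> 384), maps CLIP space into the decoder's embedding space.
        self.proj = nn.Linear(clip_dim, d_model)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # +1 position for the image prefix token.
        self.pos_emb = nn.Embedding(max_len + 1, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        # Self-attention blocks plus a causal mask behave as a decoder-only LM.
        self.blocks = nn.TransformerEncoder(
            layer, num_layers=n_layers, enable_nested_tensor=False
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, image_emb, token_ids):
        # image_emb: (B, 512) from the frozen CLIP encoder; token_ids: (B, T)
        img = self.proj(image_emb).unsqueeze(1)           # (B, 1, 384)
        tok = self.tok_emb(token_ids)                     # (B, T, 384)
        x = torch.cat([img, tok], dim=1)                  # (B, 1 + T, 384)
        pos = torch.arange(x.size(1), device=x.device)
        x = x + self.pos_emb(pos)
        n = x.size(1)
        # True entries are blocked: each position attends only to itself and earlier ones.
        causal = torch.triu(
            torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1
        )
        x = self.blocks(x, mask=causal)
        return self.lm_head(self.ln_f(x))                 # (B, 1 + T, vocab)


if __name__ == "__main__":
    model = MiniVLM().eval()
    fake_clip = torch.randn(2, 512)                       # stand-in for CLIP image embeddings
    captions = torch.randint(0, 1000, (2, 10))
    with torch.no_grad():
        print(model(fake_clip, captions).shape)
```

In a real run, `fake_clip` would be replaced by embeddings from the frozen CLIP ViT-B/32 encoder, computed with gradients disabled so that only the projection and the decoder receive updates.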
Training
- Epochs: 3 | Final loss: 1.17
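- Same pipeline shape as LLaVA Stage 1: features from a frozen vision encoder are projected into the language model's embedding space and trained on image-caption pairs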
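For reference, a captioning objective of this kind is typically next-token cross-entropy over the caption, conditioned on the image. The sketch below assumes a single image prefix position followed by the caption tokens; the function name `caption_loss` and the `pad_id` value are hypothetical, and the original training script may differ.

```python
import torch
import torch.nn.functional as F


def caption_loss(logits, token_ids, pad_id=0):
    """Mean next-token cross-entropy (in nats) over non-padding caption tokens.

    logits:    (B, 1 + T, V), where position 0 is the image prefix
    token_ids: (B, T) caption tokens, padded with pad_id
    """
    # The output at position i predicts caption token i: the image prefix
    # predicts the first token, and the last output has no target, so drop it.
    pred = logits[:, :-1, :]                              # (B, T, V)
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),
        token_ids.reshape(-1),
        ignore_index=pad_id,                              # padding does not contribute
    )


if __name__ == "__main__":
    logits = torch.randn(2, 5, 7)
    tokens = torch.randint(1, 7, (2, 4))
    print(caption_loss(logits, tokens).item())
```

If the reported final loss of 1.17 is a mean cross-entropy in nats (an assumption, since the card does not say), it corresponds to a per-token perplexity of about exp(1.17) ≈ 3.22.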