---
license: mit
language:
- en
pipeline_tag: zero-shot-image-classification
tags:
- vision
- simple
- small
---

# tinyvvision 🧠✨

**tinyvvision** is a compact, synthetic-curriculum-trained vision-language model designed to demonstrate real zero-shot capability in a minimal setup. Despite its small size (~630k parameters), it aligns images and captions effectively by learning a shared visual-language embedding space.

## What tinyvvision can do:

- Match simple geometric shapes (circles, stars, hearts, triangles, etc.) to descriptive captions (e.g., "a red circle", "a yellow star").
- Perform genuine zero-shot generalization: it can correctly match captions to shapes and colors it never explicitly encountered during training.
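
Conceptually, this kind of zero-shot matching reduces to a nearest-neighbor search in the shared embedding space: embed the image, embed each candidate caption, and pick the caption with the highest cosine similarity. A minimal sketch in plain Python, with toy 4-d vectors standing in for the real 128-dimensional encoder outputs (the vectors below are hypothetical, not actual model embeddings):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def best_caption(image_emb, caption_embs):
    # Pick the caption whose embedding is closest to the image embedding.
    return max(caption_embs, key=lambda c: cosine(image_emb, caption_embs[c]))

# Toy embeddings standing in for real encoder outputs (hypothetical values).
captions = {
    "a red circle":  [0.9, 0.1, 0.0, 0.1],
    "a yellow star": [0.1, 0.9, 0.1, 0.0],
    "a blue square": [0.0, 0.1, 0.9, 0.1],
}
image = [0.8, 0.2, 0.1, 0.0]  # pretend encoder output for a drawn red circle

print(best_caption(image, captions))  # → a red circle
```

Because both encoders map into the same space, captions never seen during training can still land near the right image, which is what makes the zero-shot behavior possible.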

## Model Details:

- **Type**: Contrastive embedding (CLIP-style, zero-shot)
- **Parameters**: ~630,000 (tiny!)
- **Training data**: Fully synthetic: randomly generated shapes, letters, numbers, and symbols paired with descriptive text captions.
- **Architecture**:
  - **Image Encoder**: Simple CNN
  - **Text Encoder**: Small embedding layer + bidirectional GRU
  - **Embedding Dim**: 128-dimensional shared embedding space

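A CLIP-style contrastive objective treats each image's own caption as the positive and every other caption in the batch as a negative, applying a symmetric cross-entropy over the pairwise similarity matrix. A minimal sketch of that objective in plain Python, with toy 2-d embeddings and an assumed temperature of 0.07 (this is an illustration of the technique, not the model's actual training code):

```python
import math

def clip_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE over matched (image, text) embedding pairs.
    Row i of each list is a matched pair; other rows act as negatives."""
    n = len(img_embs)

    def norm(v):
        s = math.sqrt(sum(x * x for x in v))
        return [x / s for x in v]

    I = [norm(v) for v in img_embs]
    T = [norm(v) for v in txt_embs]
    # Pairwise similarity logits, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(I[i], T[j])) / temperature
               for j in range(n)] for i in range(n)]

    def xent_row(row, target):
        # Cross-entropy of softmax(row) against the matched index.
        m = max(row)
        logsumexp = m + math.log(sum(math.exp(x - m) for x in row))
        return logsumexp - row[target]

    loss_i2t = sum(xent_row(logits[i], i) for i in range(n)) / n  # image -> text
    loss_t2i = sum(xent_row([logits[i][j] for i in range(n)], j)
                   for j in range(n)) / n                         # text -> image
    return (loss_i2t + loss_t2i) / 2

# Toy batch: well-aligned pairs should give a near-zero loss.
imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[0.9, 0.1], [0.1, 0.9]]
print(clip_loss(imgs, txts))
```

Minimizing this loss pulls matched image/caption embeddings together and pushes mismatched ones apart, which is what carves out the shared 128-dimensional space described above.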

## Examples of Zero-Shot Matching:

- **Seen during training**: "a red circle" → correctly matches the drawn red circle.
- **Never seen**: "a teal lightning bolt" → correctly matches a hand-drawn lightning bolt, despite never having seen one during training.

## Limitations:

tinyvvision is designed as a demonstration of zero-shot embedding and generalization on synthetic data. It is not trained on real-world images or complex scenes. While robust within its domain (simple geometric shapes and clear captions), results may vary significantly on more complicated or out-of-domain inputs.

## How to Test tinyvvision:

Check out the provided inference script to easily test your own shapes and captions. Feel free to challenge tinyvvision with new, unseen combinations to explore its generalization capability!
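
If you want relative scores rather than a single best match, the usual pattern is to embed everything, take cosine similarities, and softmax them into per-caption confidences. A sketch in plain Python; `embed_image` and `embed_text` below are hypothetical stand-ins for the real encoders in the inference script, and the toy table plays the role of the text encoder:

```python
import math

def embed_image(img):
    # Hypothetical stand-in for tinyvvision's CNN image encoder.
    return img

def embed_text(caption, table):
    # Hypothetical stand-in for the GRU text encoder (toy lookup table).
    return table[caption]

def rank_captions(image, captions, table, temperature=0.07):
    img = embed_image(image)

    def cos(u, v):
        d = sum(a * b for a, b in zip(u, v))
        return d / (math.sqrt(sum(a * a for a in u)) *
                    math.sqrt(sum(b * b for b in v)))

    # Temperature-scaled similarities -> softmax -> relative confidences.
    sims = {c: cos(img, embed_text(c, table)) / temperature for c in captions}
    m = max(sims.values())
    exps = {c: math.exp(s - m) for c, s in sims.items()}
    z = sum(exps.values())
    return sorted(((exps[c] / z, c) for c in captions), reverse=True)

table = {"a red circle": [1.0, 0.0], "a teal lightning bolt": [0.0, 1.0]}
image = [0.2, 0.95]  # toy embedding for a hand-drawn lightning bolt
for p, c in rank_captions(image, table.keys(), table):
    print(f"{p:.3f}  {c}")
```

Swapping in the real encoders for the stubs gives a quick way to see not just which caption wins but by how much.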

✨ **Enjoy experimenting!** ✨