Caplin43 committed on
Commit 713b2f5 · verified · 1 Parent(s): a7d6c86

Create README.md

Files changed (1)
  1. README.md +53 -0
README.md ADDED
@@ -0,0 +1,53 @@
+ ---
+ license: mit
+ language:
+ - en
+ pipeline_tag: image-to-text
+ tags:
+ - vision-language
+ - multimodal
+ - image-captioning
+ - transformer
+ ---
+
+ # 🖼️ Multimodal Vision Language Mini
+
+ A lightweight multimodal transformer model designed to process images and text instructions to generate structured descriptions.
+
+ ---
+
+ ## 🧠 Model Details
+
+ - Architecture: Vision Encoder + Text Decoder
+ - Vision Backbone: ViT-base
+ - Text Decoder: Transformer (12 layers)
+ - Hidden Size: 768
+ - Parameters: ~220M
+ - Training Samples: 500k image-text pairs
+
+ ---
+
+ ## 📥 Input
+
+ Image + Instruction
+ Example:
+ Instruction: "Describe the objects in the image."
+
+ ## 📤 Output
+
+ "Two people sitting at a wooden table with laptops."
+
+ ---
+
+ ## 🎯 Intended Use
+
+ - Image captioning
+ - Visual question answering
+ - Robotics perception modules
+
+ ---
+
+ ## ⚠️ Limitations
+
+ - English only
+ - Not optimized for high-resolution images
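
The `~220M` figure under Model Details can be sanity-checked from the other numbers in the card. The sketch below is a rough estimate, not the author's accounting. The hidden size (768) and decoder depth (12) come from the README. The 4x feed-forward expansion, the cross-attention block in every decoder layer, the 30,522-token vocabulary, and the roughly 86M parameters for ViT-base are all assumptions made for illustration.

```python
# Back-of-the-envelope parameter count for a ViT-base encoder plus a
# 12-layer, 768-wide transformer decoder. Biases and LayerNorm weights are
# ignored because they add well under 1% to the total.

HIDDEN = 768                  # from the model card
DECODER_LAYERS = 12           # from the model card
FFN_MULT = 4                  # assumption: standard 4x feed-forward expansion
VOCAB_SIZE = 30_522           # assumption: BERT-style WordPiece vocabulary
VIT_BASE_PARAMS = 86_000_000  # assumption: commonly cited ViT-base size


def attention_params(hidden: int) -> int:
    """Q, K, V, and output projections: four hidden x hidden matrices."""
    return 4 * hidden * hidden


def ffn_params(hidden: int, mult: int) -> int:
    """Two linear layers: hidden -> mult * hidden -> hidden."""
    return 2 * hidden * (mult * hidden)


def decoder_params(hidden: int, layers: int, mult: int, vocab: int) -> int:
    # Assumption: each layer holds self-attention, cross-attention, and an FFN.
    per_layer = 2 * attention_params(hidden) + ffn_params(hidden, mult)
    # Assumption: token embeddings are tied with the output projection.
    embeddings = vocab * hidden
    return layers * per_layer + embeddings


def estimate_total() -> int:
    """Encoder (assumed size) plus decoder (computed from the card's numbers)."""
    return VIT_BASE_PARAMS + decoder_params(
        HIDDEN, DECODER_LAYERS, FFN_MULT, VOCAB_SIZE
    )


if __name__ == "__main__":
    print(f"Estimated parameters: {estimate_total() / 1e6:.0f}M")
```

Under these assumptions the estimate lands within a few percent of the stated `~220M`, so the listed architecture and parameter count are mutually consistent. A different vocabulary size or a decoder without cross-attention would shift the total noticeably.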