euhidaman committed
Commit a1428dc · verified · 1 Parent(s): 016c4d8

Update model card for best

Files changed (1)
  1. README.md +166 -0
README.md ADDED
---
license: apache-2.0
tags:
- vision-language
- multimodal
- episodic-memory
- fiber-alignment
- qwen2
- deit
- pytorch
library_name: transformers
datasets:
- conceptual-12m
pipeline_tag: image-to-text
---

# MicroVLM-V: Vision-Language Model with FIBER Alignment & Episodic Memory

## 📋 Model Overview

MicroVLM-V is a compact vision-language model (~215 MB) that combines:
- **Vision Encoder**: DeiT-Tiny (5.7M params)
- **Language Model**: Qwen2.5-0.5B (4-bit quantized, 315M params)
- **Alignment**: FIBER fusion at layers [6, 8, 10]
- **Episodic Memory**: Larimar GPM (512 slots, 4.8M params)

**Checkpoint**: `best` (Best alignment model)

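For orientation, the components above map onto a configuration roughly like the sketch below. The key names and the `deit_tiny_patch16_224` / `Qwen/Qwen2.5-0.5B` identifiers are illustrative assumptions; the authoritative values live in `config.json` and the GitHub repository.

```python
# Illustrative configuration only: field names are assumptions, not the repo's actual schema.
microvlm_config = {
    "vision_encoder": {"backbone": "deit_tiny_patch16_224", "params": "5.7M"},
    "language_model": {"name": "Qwen/Qwen2.5-0.5B", "quantization": "4-bit",
                       "params": "315M", "frozen": True},
    "alignment": {"mode": "fiber_fusion_in_backbone", "fusion_layers": [6, 8, 10]},
    "episodic_memory": {"type": "larimar_gpm", "slots": 512, "params": "4.8M"},
}
```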
---

## 📊 Model Architecture

### Parameter Distribution

| Component | Total Parameters | Trainable | Status |
|-----------|-----------------|-----------|--------|
| **Total Model** | **334.5M** | **13.8M** | **4.1% trainable** |
| Vision Encoder | 8.8M | 3.3M | FIBER fusion trainable |
| Language Model | 315.1M | 0 | Frozen (4-bit) |
| Multimodal Adapter | 5.0M | 5.0M | Fully trainable |
| Episodic Memory | 4.8M | 4.8M | Fully trainable |

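The trainable/total split in the table can be checked for any PyTorch module with a few lines; a generic sketch, assuming `model` is the instantiated MicroVLM-V module from the GitHub repository:

```python
def count_parameters(model):
    """Return (total, trainable) parameter counts for a PyTorch module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# total, trainable = count_parameters(model)
# print(f"{trainable / total:.1%} trainable")  # expected ~4.1% for this checkpoint
```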
### Quantization Status

| Component | Quantization |
|-----------|-------------|
| Vision Encoder | FP16 |
| Language Model | 4-bit ✓ |
| Episodic Memory | FP32 |

**Estimated Model Size**: ~214.6 MB

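The language model is stored in 4-bit. If you rebuild the LM yourself rather than loading the packaged checkpoint, a typical way to obtain an equivalent 4-bit Qwen2.5-0.5B with `transformers` + `bitsandbytes` is sketched below; the compute dtype and other settings are assumptions, not necessarily what was used for this checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # assumption: FP16 compute, matching the vision encoder
)
lm = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",
    quantization_config=bnb_config,
    device_map="auto",
)
lm.requires_grad_(False)  # the LM is kept frozen in MicroVLM-V
```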
---

## 🏋️ Training Details

### Configuration
- **Dataset**: CC12M (Conceptual 12M) - 3M training samples
- **Batch Size**: 512
- **Training Time**: ~0.64 hours on 2x A100 80GB
- **Throughput**: ~332 samples/sec
- **Total FLOPs**: 2088 PFLOPs

### FIBER Alignment
- **Mode**: Fusion-in-Backbone (FIBER-style)
- **Fusion Layers**: [6, 8, 10]
- **ITC Weight**: 1.0
- **ITM Weight**: 0.5
- **ITC Queue Size**: 256

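The ITC and ITM weights above determine how the two alignment objectives are combined into a single loss. A minimal sketch of that weighted sum, assuming a standard symmetric InfoNCE ITC term and a cross-entropy ITM term (the temperature value and the repository's actual loss code may differ):

```python
import torch
import torch.nn.functional as F

def alignment_loss(image_emb, text_emb, itm_logits, itm_labels,
                   itc_weight=1.0, itm_weight=0.5, temperature=0.07):
    """Weighted ITC + ITM objective (illustrative, not the repo's exact implementation)."""
    # ITC: contrast each image with all texts in the batch, and vice versa.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    itc = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
    # ITM: binary matched / mismatched classification on fused pairs.
    itm = F.cross_entropy(itm_logits, itm_labels)
    return itc_weight * itc + itm_weight * itm
```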
### Training Metrics (Best Checkpoint)
- **Best Alignment Similarity**: 0.0249 (step 25)
- **Final ITM Loss**: ~0.53
- **Final Token Loss**: ~0.056
- **Training stopped**: Early stopping at step 1500 (alignment plateau)

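The reported wall-clock time is consistent with the early-stopping step count, batch size, and throughput above, as the quick check below shows (assuming the ~0.64 h covers exactly the 1500 steps):

```python
steps, batch_size, throughput = 1500, 512, 332   # values from the sections above
samples_seen = steps * batch_size                # 768,000 samples
hours = samples_seen / throughput / 3600         # ~0.64 hours
print(f"{samples_seen:,} samples in ~{hours:.2f} h at {throughput} samples/sec")
```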
---

## 💻 Usage

### Loading the Model

```python
import torch

# Load checkpoint
checkpoint = torch.load('model.pt', map_location='cpu')

# Access model state dict
model_state = checkpoint['model_state_dict']

# Get training info
print(f"Global step: {checkpoint.get('global_step', 'N/A')}")
print(f"Best alignment: {checkpoint.get('best_correct_sim', 'N/A')}")
```
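Because the model class itself lives in the GitHub repository, a useful next step is to inspect what `model_state_dict` contains before wiring the weights into code. A small sketch, continuing from the loading example above, that tallies parameters per top-level submodule (the prefix names depend on the repository's module layout):

```python
from collections import defaultdict

params_per_component = defaultdict(int)
for name, tensor in model_state.items():
    component = name.split('.')[0]   # top-level prefix, e.g. the vision, LM, adapter, or memory module
    params_per_component[component] += tensor.numel()

for component, n in sorted(params_per_component.items(), key=lambda kv: -kv[1]):
    print(f"{component:30s} {n / 1e6:8.2f}M params")
```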
### Inference Example

```python
import torch
from PIL import Image
import torchvision.transforms as transforms
from transformers import AutoTokenizer

# Prepare image
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

image = Image.open('example.jpg').convert('RGB')
image_tensor = transform(image).unsqueeze(0)

# Prepare text prompt (the tokenizer of the Qwen2.5-0.5B base LM is assumed here)
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')
tokens = tokenizer('a photo of', return_tensors='pt')

# Forward pass (after loading model)
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']
    )
```
---

## 📁 Repository Contents

- `model.pt` - Best alignment checkpoint
- `statistics.json` - Training statistics
- `config.json` - Model configuration
- `README.md` - This model card

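`config.json` and `statistics.json` are plain JSON and can be read directly; the exact field names depend on how the files were written, so treat the sketch below as a starting point:

```python
import json

with open('config.json') as f:
    config = json.load(f)
with open('statistics.json') as f:
    stats = json.load(f)

print(list(config.keys()))  # model configuration fields
print(list(stats.keys()))   # logged training statistics
```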
---

## ⚙️ Requirements

```bash
pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install timm          # For DeiT vision encoder
pip install bitsandbytes  # For 4-bit quantization
```
---

## 📜 License

Apache 2.0 License

---

## 🔗 Links

- **GitHub Repository**: [euhidaman/MicroVLM-V](https://github.com/euhidaman/MicroVLM-V)
- **Branch**: FocusedAttention

---
## ⚠️ Limitations

- This is the **Stage 1 alignment checkpoint**; it focuses on vision-language alignment.
- Best suited for image-text matching and alignment tasks.
- May need further fine-tuning for generation tasks.

---

*Uploaded: 2025-12-08 14:53:01 UTC*