---
license: apache-2.0
tags:
- vision-language
- multimodal
- episodic-memory
- fiber-alignment
- qwen2
- deit
- pytorch
library_name: transformers
datasets:
- conceptual-12m
pipeline_tag: image-to-text
---

# MicroVLM-V: Vision-Language Model with FIBER Alignment & Episodic Memory

## πŸ“‹ Model Overview

MicroVLM-V is a compact vision-language model (~215 MB) that combines:
- **Vision Encoder**: DeiT-Tiny (5.7M params)
- **Language Model**: Qwen2.5-0.5B (4-bit quantized, 315M params)
- **Alignment**: FIBER fusion at layers [6, 8, 10]
- **Episodic Memory**: Larimar GPM (512 slots, 4.8M params)

**Checkpoint**: `best` (Best alignment model)
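
The Larimar-style GPM performs generative (Kanerva-machine-style) reads and writes over its 512 slots; that training code is not reproduced here. As a rough, shape-only sketch, a slot memory of this kind can be read with attention over a learned slot bank (class name and dimensions below are illustrative, not this repo's API):

```python
import torch
import torch.nn as nn

class SlotMemoryRead(nn.Module):
    """Illustrative attention read over a learned slot bank.
    NOT Larimar's actual generative read/write; names are hypothetical."""
    def __init__(self, num_slots=512, dim=192):  # 192 = DeiT-Tiny width, for illustration
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)

    def forward(self, query):                                 # query: (B, dim)
        attn = torch.softmax(query @ self.slots.t(), dim=-1)  # (B, num_slots)
        return attn @ self.slots                              # (B, dim) retrieved memory
```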

---

## πŸ“Š Model Architecture

### Parameter Distribution

| Component | Total Parameters | Trainable | Status |
|-----------|-----------------|-----------|--------|
| **Total Model** | **334.5M** | **13.8M** | **4.1% trainable** |
| Vision Encoder | 8.8M | 3.3M | FIBER fusion trainable |
| Language Model | 315.1M | 0 | Frozen (4-bit) |
| Multimodal Adapter | 5.0M | 5.0M | Fully trainable |
| Episodic Memory | 4.8M | 4.8M | Fully trainable |
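
The breakdown above can be reproduced with a standard parameter count (sketch below; note that `bitsandbytes` stores 4-bit weights packed, so `numel()` on quantized layers may not match the logical parameter count exactly):

```python
def count_params(module):
    """Logical parameter counts; requires_grad marks the trainable subset."""
    total = sum(p.numel() for p in module.parameters())
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    return total, trainable

total, trainable = count_params(model)  # model loaded as in the Usage section
print(f"{trainable / 1e6:.1f}M trainable / {total / 1e6:.1f}M total "
      f"({100 * trainable / total:.1f}%)")
```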

### Quantization Status

| Component | Quantization |
|-----------|-------------|
| Vision Encoder | FP16 |
| Language Model | 4-bit βœ“ |
| Episodic Memory | FP32 |

**Estimated Model Size**: ~214.6 MB
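
This estimate is consistent with simple byte arithmetic, assuming the multimodal adapter is stored in FP32 (its precision is not listed in the table above):

```python
# Bytes per parameter: 0.5 (4-bit), 2 (FP16), 4 (FP32)
size_mb = (315.1e6 * 0.5           # language model, 4-bit
           + 8.8e6 * 2             # vision encoder, FP16
           + (5.0e6 + 4.8e6) * 4   # adapter (assumed FP32) + episodic memory
           ) / 1e6
print(f"~{size_mb:.0f} MB")  # ~214 MB, close to the reported ~214.6 MB
```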

---

## πŸ‹οΈ Training Details

### Configuration
- **Dataset**: CC12M (Conceptual 12M), 3M-sample training subset
- **Batch Size**: 512
- **Training Time**: ~0.64 hours on 2x A100 80GB
- **Throughput**: ~332 samples/sec
- **Total FLOPs**: 2088 PFLOPs

### FIBER Alignment
- **Mode**: Fusion-in-Backbone (FIBER-style)
- **Fusion Layers**: [6, 8, 10]
- **ITC Weight**: 1.0
- **ITM Weight**: 0.5
- **ITC Queue Size**: 256
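
The exact training code is not included here; for reference, a queue-based ITC objective is standard InfoNCE over in-batch texts plus queued negatives. A minimal one-directional sketch (real FIBER-style ITC is symmetric across image→text and text→image):

```python
import torch
import torch.nn.functional as F

def itc_loss(img_emb, txt_emb, txt_queue, temperature=0.07):
    """One direction (image->text) of a queue-based InfoNCE loss.
    img_emb, txt_emb: (B, D) L2-normalized batch embeddings;
    txt_queue: (Q, D) normalized text negatives from earlier batches (Q = 256).
    """
    candidates = torch.cat([txt_emb, txt_queue], dim=0)  # (B + Q, D)
    logits = img_emb @ candidates.t() / temperature      # (B, B + Q)
    targets = torch.arange(img_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)  # positive pair sits on the diagonal
```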

### Training Metrics (Best Checkpoint)
- **Best Alignment Similarity**: 0.0249 (step 25)
- **Final ITM Loss**: ~0.53
- **Final Token Loss**: ~0.056
- **Training stopped**: Early stopping at step 1500 (alignment plateau)

---

## πŸ’» Usage

### Loading the Model

```python
import torch

# Load checkpoint (on PyTorch >= 2.6, pass weights_only=False if the
# checkpoint contains non-tensor Python objects)
checkpoint = torch.load('model.pt', map_location='cpu')

# Access model state dict
model_state = checkpoint['model_state_dict']

# Training metadata stored alongside the weights
print(f"Global step: {checkpoint.get('global_step', 'N/A')}")
print(f"Best alignment: {checkpoint.get('best_correct_sim', 'N/A')}")
```

### Inference Example

```python
from PIL import Image
import torch
import torchvision.transforms as transforms
from transformers import AutoTokenizer

# Prepare image (standard ImageNet preprocessing, as used by DeiT)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

image = Image.open('example.jpg').convert('RGB')
image_tensor = transform(image).unsqueeze(0)  # (1, 3, 224, 224)

# Tokenize the paired text with the Qwen2.5 tokenizer
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')
tokens = tokenizer('a photo of a dog', return_tensors='pt')

# Forward pass (after loading the model as shown above)
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']
    )
```

---

## πŸ“ Repository Contents

- `model.pt` - Best alignment checkpoint
- `statistics.json` - Training statistics
- `config.json` - Model configuration
- `README.md` - This model card
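
`config.json` and `statistics.json` are plain JSON; a quick way to inspect them (top-level key names depend on the training script):

```python
import json

with open('config.json') as f:
    config = json.load(f)
with open('statistics.json') as f:
    stats = json.load(f)

print(sorted(config))  # model configuration keys
print(sorted(stats))   # training statistics keys
```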

---

## βš™οΈ Requirements

```bash
pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install timm          # DeiT vision encoder
pip install bitsandbytes  # 4-bit quantization
```
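
The exact quantization settings used in training are not published here; a typical way to load the Qwen2.5-0.5B backbone in 4-bit with `bitsandbytes` (whose 4-bit kernels need a CUDA GPU) looks like:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # store weights in 4-bit, compute in FP16
)
lm = AutoModelForCausalLM.from_pretrained(
    'Qwen/Qwen2.5-0.5B',
    quantization_config=bnb_config,
    device_map='auto',
)
```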

---

## πŸ“œ License

Apache 2.0 License

---

## πŸ”— Links

- **GitHub Repository**: [euhidaman/MicroVLM-V](https://github.com/euhidaman/MicroVLM-V)
- **Branch**: FocusedAttention

---

## ⚠️ Limitations

- This is the **Stage 1 alignment checkpoint**, which focuses on vision-language alignment rather than generation
- Best for: Image-text matching, alignment tasks
- May need further fine-tuning for generation tasks

---

*Uploaded: 2025-12-08 14:53:01 UTC*