---
license: apache-2.0
tags:
- vision-language
- multimodal
- episodic-memory
- fiber-alignment
- qwen2
- deit
- pytorch
library_name: transformers
datasets:
- conceptual-12m
pipeline_tag: image-to-text
---

# MicroVLM-V: Vision-Language Model with FIBER Alignment & Episodic Memory

## πŸ“‹ Model Overview

MicroVLM-V is a compact vision-language model (~215 MB) that combines:
- **Vision Encoder**: DeiT-Tiny (5.7M params)
- **Language Model**: Qwen2.5-0.5B (4-bit quantized, 315M params)
- **Alignment**: FIBER fusion at layers [6, 8, 10]
- **Episodic Memory**: Larimar GPM (512 slots, 4.8M params)

**Checkpoint**: `best` (Best alignment model)
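
The Larimar-style GPM performs generative (Kanerva-machine-style) reads and writes over its 512 slots; that training code is not reproduced here. As a rough, shape-only sketch, a slot memory of this kind can be read with attention over a learned slot bank (class name and dimensions below are illustrative, not this repo's API):

```python
import torch
import torch.nn as nn

class SlotMemoryRead(nn.Module):
    """Illustrative attention read over a learned slot bank.
    NOT Larimar's actual generative read/write; names are hypothetical."""
    def __init__(self, num_slots=512, dim=192):  # 192 = DeiT-Tiny width, for illustration
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)

    def forward(self, query):                                 # query: (B, dim)
        attn = torch.softmax(query @ self.slots.t(), dim=-1)  # (B, num_slots)
        return attn @ self.slots                              # (B, dim) retrieved memory
```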

---

## πŸ“Š Model Architecture

### Parameter Distribution

| Component | Total Parameters | Trainable | Status |
|-----------|-----------------|-----------|--------|
| **Total Model** | **334.5M** | **13.8M** | **4.1% trainable** |
| Vision Encoder | 8.8M | 3.3M | FIBER fusion trainable |
| Language Model | 315.1M | 0 | Frozen (4-bit) |
| Multimodal Adapter | 5.0M | 5.0M | Fully trainable |
| Episodic Memory | 4.8M | 4.8M | Fully trainable |
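
The breakdown above can be reproduced with a standard parameter count (sketch below; note that `bitsandbytes` stores 4-bit weights packed, so `numel()` on quantized layers may not match the logical parameter count exactly):

```python
def count_params(module):
    """Logical parameter counts; requires_grad marks the trainable subset."""
    total = sum(p.numel() for p in module.parameters())
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    return total, trainable

total, trainable = count_params(model)  # model loaded as in the Usage section
print(f"{trainable / 1e6:.1f}M trainable / {total / 1e6:.1f}M total "
      f"({100 * trainable / total:.1f}%)")
```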

### Quantization Status

| Component | Quantization |
|-----------|-------------|
| Vision Encoder | FP16 |
| Language Model | 4-bit βœ“ |
| Episodic Memory | FP32 |

**Estimated Model Size**: ~214.6 MB
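
This estimate is consistent with simple byte arithmetic, assuming the multimodal adapter is stored in FP32 (its precision is not listed in the table above):

```python
# Bytes per parameter: 0.5 (4-bit), 2 (FP16), 4 (FP32)
size_mb = (315.1e6 * 0.5           # language model, 4-bit
           + 8.8e6 * 2             # vision encoder, FP16
           + (5.0e6 + 4.8e6) * 4   # adapter (assumed FP32) + episodic memory
           ) / 1e6
print(f"~{size_mb:.0f} MB")  # ~214 MB, close to the reported ~214.6 MB
```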

---

## πŸ‹οΈ Training Details

### Configuration
- **Dataset**: CC12M (Conceptual 12M), 3M-sample training subset
- **Batch Size**: 512
- **Training Time**: ~0.64 hours on 2x A100 80GB
- **Throughput**: ~332 samples/sec
- **Total FLOPs**: 2088 PFLOPs

### FIBER Alignment
- **Mode**: Fusion-in-Backbone (FIBER-style)
- **Fusion Layers**: [6, 8, 10]
- **ITC Weight**: 1.0
- **ITM Weight**: 0.5
- **ITC Queue Size**: 256
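
The exact training code is not included here; for reference, a queue-based ITC objective is standard InfoNCE over in-batch texts plus queued negatives. A minimal one-directional sketch (real FIBER-style ITC is symmetric across image→text and text→image):

```python
import torch
import torch.nn.functional as F

def itc_loss(img_emb, txt_emb, txt_queue, temperature=0.07):
    """One direction (image->text) of a queue-based InfoNCE loss.
    img_emb, txt_emb: (B, D) L2-normalized batch embeddings;
    txt_queue: (Q, D) normalized text negatives from earlier batches (Q = 256).
    """
    candidates = torch.cat([txt_emb, txt_queue], dim=0)  # (B + Q, D)
    logits = img_emb @ candidates.t() / temperature      # (B, B + Q)
    targets = torch.arange(img_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)  # positive pair sits on the diagonal
```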

### Training Metrics (Best Checkpoint)
- **Best Alignment Similarity**: 0.0249 (step 25)
- **Final ITM Loss**: ~0.53
- **Final Token Loss**: ~0.056
- **Training stopped**: Early stopping at step 1500 (alignment plateau)

---

## πŸ’» Usage

### Loading the Model

```python
import torch

# Load checkpoint (on PyTorch >= 2.6, pass weights_only=False if the
# checkpoint contains non-tensor Python objects)
checkpoint = torch.load('model.pt', map_location='cpu')

# Access model state dict
model_state = checkpoint['model_state_dict']

# Training metadata stored alongside the weights
print(f"Global step: {checkpoint.get('global_step', 'N/A')}")
print(f"Best alignment: {checkpoint.get('best_correct_sim', 'N/A')}")
```

### Inference Example

```python
from PIL import Image
import torch
import torchvision.transforms as transforms
from transformers import AutoTokenizer

# Prepare image (standard ImageNet preprocessing, as used by DeiT)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

image = Image.open('example.jpg').convert('RGB')
image_tensor = transform(image).unsqueeze(0)  # (1, 3, 224, 224)

# Tokenize the paired text with the Qwen2.5 tokenizer
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')
tokens = tokenizer('a photo of a dog', return_tensors='pt')

# Forward pass (after loading the model as shown above)
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']
    )
```

---

## πŸ“ Repository Contents

- `model.pt` - Best alignment checkpoint
- `statistics.json` - Training statistics
- `config.json` - Model configuration
- `README.md` - This model card
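
`config.json` and `statistics.json` are plain JSON; a quick way to inspect them (top-level key names depend on the training script):

```python
import json

with open('config.json') as f:
    config = json.load(f)
with open('statistics.json') as f:
    stats = json.load(f)

print(sorted(config))  # model configuration keys
print(sorted(stats))   # training statistics keys
```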

---

## βš™οΈ Requirements

```bash
pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install timm          # DeiT vision encoder
pip install bitsandbytes  # 4-bit quantization
```
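
The exact quantization settings used in training are not published here; a typical way to load the Qwen2.5-0.5B backbone in 4-bit with `bitsandbytes` (whose 4-bit kernels need a CUDA GPU) looks like:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # store weights in 4-bit, compute in FP16
)
lm = AutoModelForCausalLM.from_pretrained(
    'Qwen/Qwen2.5-0.5B',
    quantization_config=bnb_config,
    device_map='auto',
)
```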

---

## πŸ“œ License

Apache 2.0 License

---

## πŸ”— Links

- **GitHub Repository**: [euhidaman/MicroVLM-V](https://github.com/euhidaman/MicroVLM-V)
- **Branch**: FocusedAttention

---

## ⚠️ Limitations

- This is the **Stage 1 alignment checkpoint**, which focuses on vision-language alignment rather than generation
- Best for: Image-text matching, alignment tasks
- May need further fine-tuning for generation tasks

---

*Uploaded: 2025-12-08 14:53:01 UTC*