---
license: apache-2.0
tags:
- vision-language
- multimodal
- episodic-memory
- fiber-alignment
- qwen2
- deit
- pytorch
library_name: transformers
datasets:
- conceptual-12m
pipeline_tag: image-to-text
---
# MicroVLM-V: Vision-Language Model with FIBER Alignment & Episodic Memory
## πŸ“‹ Model Overview
MicroVLM-V is a compact vision-language model (~215 MB) that combines:
- **Vision Encoder**: DeiT-Tiny (5.7M params)
- **Language Model**: Qwen2.5-0.5B (4-bit quantized, 315M params)
- **Alignment**: FIBER fusion at layers [6, 8, 10]
- **Episodic Memory**: Larimar GPM (512 slots, 4.8M params)
**Checkpoint**: `best` (checkpoint with the best alignment similarity)
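
The episodic memory follows a Larimar-style generative parametric memory (GPM) with 512 slots; the exact implementation lives in the GitHub repository. As a rough mental model only, a slot memory of this kind can be read by attending over a fixed bank of learned slots. The class and dimensions below are illustrative assumptions, not the actual MicroVLM-V code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotMemoryRead(nn.Module):
    """Illustrative read from a fixed bank of learned memory slots via attention."""

    def __init__(self, num_slots: int = 512, dim: int = 256):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_slots, dim) * 0.02)  # slot bank
        self.query_proj = nn.Linear(dim, dim)

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, dim) -> softmax attention over slots -> weighted read-out
        q = self.query_proj(query)
        attn = F.softmax(q @ self.memory.t() / (q.shape[-1] ** 0.5), dim=-1)
        return attn @ self.memory
```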
---
## πŸ“Š Model Architecture
### Parameter Distribution
| Component | Total Parameters | Trainable | Status |
|-----------|-----------------|-----------|--------|
| **Total Model** | **334.5M** | **13.8M** | **4.1% trainable** |
| Vision Encoder | 8.8M | 3.3M | FIBER fusion trainable |
| Language Model | 315.1M | 0 | Frozen (4-bit) |
| Multimodal Adapter | 5.0M | 5.0M | Fully trainable |
| Episodic Memory | 4.8M | 4.8M | Fully trainable |
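
After instantiating the model and loading the state dict, the trainable/frozen split above can be double-checked with a quick count (a generic PyTorch helper, not specific to this repository):

```python
def count_parameters(model):
    """Report total vs. trainable parameter counts for a PyTorch module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total: {total / 1e6:.1f}M | Trainable: {trainable / 1e6:.1f}M "
          f"({100 * trainable / total:.1f}%)")
    return total, trainable
```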
### Quantization Status
| Component | Quantization |
|-----------|-------------|
| Vision Encoder | FP16 |
| Language Model | 4-bit βœ“ |
| Episodic Memory | FP32 |
**Estimated Model Size**: ~214.6 MB
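
This estimate is roughly consistent with the parameter counts and precisions above, assuming the multimodal adapter is stored in FP32 and ignoring quantization bookkeeping overhead: 315.1M parameters at 4 bits ≈ 157.6 MB, 8.8M at FP16 ≈ 17.6 MB, 5.0M at FP32 ≈ 20.0 MB, and 4.8M at FP32 ≈ 19.2 MB, totaling about 214 MB.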
---
## πŸ‹οΈ Training Details
### Configuration
- **Dataset**: CC12M (Conceptual 12M) - 3M training samples
- **Batch Size**: 512
- **Training Time**: ~0.64 hours on 2x A100 80GB
- **Throughput**: ~332 samples/sec
- **Total FLOPs**: 2088 PFLOPs
### FIBER Alignment
- **Mode**: Fusion-in-Backbone (FIBER-style)
- **Fusion Layers**: [6, 8, 10]
- **ITC Weight**: 1.0
- **ITM Weight**: 0.5
- **ITC Queue Size**: 256
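
For reference, an image-text contrastive (ITC) objective with a negative queue is typically a symmetric InfoNCE loss. The sketch below is illustrative only, not the repository's training code; the tensor names and temperature are assumptions:

```python
import torch
import torch.nn.functional as F

def itc_loss(img_emb, txt_emb, queue_img, queue_txt, temperature=0.07):
    """Illustrative image-text contrastive loss with a negative queue (InfoNCE-style)."""
    img_emb = F.normalize(img_emb, dim=-1)   # (B, D) in-batch image embeddings
    txt_emb = F.normalize(txt_emb, dim=-1)   # (B, D) in-batch text embeddings

    # Candidates = current batch plus queued negatives (e.g. queue size 256).
    img_cand = torch.cat([img_emb, F.normalize(queue_img, dim=-1)], dim=0)
    txt_cand = torch.cat([txt_emb, F.normalize(queue_txt, dim=-1)], dim=0)

    logits_i2t = img_emb @ txt_cand.t() / temperature   # (B, B + Q)
    logits_t2i = txt_emb @ img_cand.t() / temperature   # (B, B + Q)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)

    # Symmetric cross-entropy: matched pairs sit on the diagonal of the in-batch block.
    return 0.5 * (F.cross_entropy(logits_i2t, targets) +
                  F.cross_entropy(logits_t2i, targets))
```

With the listed weights, the overall alignment objective would be roughly `1.0 * itc + 0.5 * itm`, where the ITM term comes from a separate matched/mismatched pair classifier.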
### Training Metrics (Best Checkpoint)
- **Best Alignment Similarity**: 0.0249 (step 25)
- **Final ITM Loss**: ~0.53
- **Final Token Loss**: ~0.056
- **Training stopped**: Early stopping at step 1500 (alignment plateau)
---
## πŸ’» Usage
### Loading the Model
```python
import torch
# Load checkpoint
checkpoint = torch.load('model.pt', map_location='cpu')
# Access model state dict
model_state = checkpoint['model_state_dict']
# Get training info
print(f"Global step: {checkpoint.get('global_step', 'N/A')}")
print(f"Best alignment: {checkpoint.get('best_correct_sim', 'N/A')}")
```
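
The frozen language model is Qwen2.5-0.5B in 4-bit. If you need to instantiate the base LLM yourself (e.g., to rebuild the full model before loading the state dict), a standard bitsandbytes 4-bit load looks like the following; whether MicroVLM-V uses exactly this configuration is an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization config (typical bitsandbytes setup; exact settings are assumed).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
llm = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",
    quantization_config=bnb_config,
    device_map="auto",
)
```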
### Inference Example
```python
import torch
from PIL import Image
import torchvision.transforms as transforms

# Prepare the image (ImageNet normalization, 224x224 as expected by DeiT)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
image = Image.open('example.jpg').convert('RGB')
image_tensor = transform(image).unsqueeze(0)

# Tokenize a text prompt (tokenizer loaded as shown above; the prompt is illustrative)
tokens = tokenizer("A photo of", return_tensors="pt")

# Forward pass (`model` is the instantiated MicroVLM-V with the checkpoint loaded)
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask'],
    )
```
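
The structure of `outputs` depends on the MicroVLM-V implementation in the GitHub repository; for caption generation you would typically decode the generated token IDs with the Qwen2.5 tokenizer (e.g. `tokenizer.decode(...)`).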
---
## πŸ“ Repository Contents
- `model.pt` - Best alignment checkpoint
- `statistics.json` - Training statistics
- `config.json` - Model configuration
- `README.md` - This model card
---
## βš™οΈ Requirements
```bash
pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install timm # For DeiT vision encoder
pip install bitsandbytes # For 4-bit quantization
```
---
## πŸ“œ License
Apache 2.0 License
---
## πŸ”— Links
- **GitHub Repository**: [euhidaman/MicroVLM-V](https://github.com/euhidaman/MicroVLM-V)
- **Branch**: FocusedAttention
---
## ⚠️ Limitations
- This is the **Stage 1 alignment checkpoint**; it focuses on vision-language alignment
- Best for: Image-text matching, alignment tasks
- May need further fine-tuning for generation tasks
---
*Uploaded: 2025-12-08 14:53:01 UTC*