---
license: apache-2.0
tags:
- vision-language
- multimodal
- episodic-memory
- fiber-alignment
- qwen2
- deit
- pytorch
library_name: transformers
datasets:
- conceptual-12m
pipeline_tag: image-to-text
---

# MicroVLM-V: Vision-Language Model with FIBER Alignment & Episodic Memory

## Model Overview

MicroVLM-V is a compact vision-language model (~215 MB) that combines:

- **Vision Encoder**: DeiT-Tiny (5.7M params)
- **Language Model**: Qwen2.5-0.5B (4-bit quantized, 315M params)
- **Alignment**: FIBER fusion at layers [6, 8, 10]
- **Episodic Memory**: Larimar GPM (512 slots, 4.8M params)

**Checkpoint**: `best` (best alignment model)
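
For orientation, the two frozen backbones named above can be instantiated with the libraries listed under Requirements. This is a minimal sketch only: the `deit_tiny_patch16_224` timm checkpoint and the `Qwen/Qwen2.5-0.5B` Hugging Face repo are assumed backbone IDs, and the snippet does not build this repo's fusion, adapter, or memory modules.

```python
import torch
import timm
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Vision backbone: DeiT-Tiny (~5.7M params), kept in FP16 per the quantization table below.
vision_encoder = timm.create_model("deit_tiny_patch16_224", pretrained=True, num_classes=0)
vision_encoder = vision_encoder.half().eval()

# Language backbone: Qwen2.5-0.5B loaded in 4-bit via bitsandbytes (frozen during training).
# Requires a CUDA device for bitsandbytes quantization.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
language_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B", quantization_config=bnb_config, device_map="auto"
)
```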

---

## Model Architecture

### Parameter Distribution

| Component | Total Parameters | Trainable | Status |
|-----------|------------------|-----------|--------|
| **Total Model** | **334.5M** | **13.8M** | **4.1% trainable** |
| Vision Encoder | 8.8M | 3.3M | FIBER fusion trainable |
| Language Model | 315.1M | 0 | Frozen (4-bit) |
| Multimodal Adapter | 5.0M | 5.0M | Fully trainable |
| Episodic Memory | 4.8M | 4.8M | Fully trainable |

### Quantization Status

| Component | Quantization |
|-----------|--------------|
| Vision Encoder | FP16 |
| Language Model | 4-bit |
| Episodic Memory | FP32 |

**Estimated Model Size**: ~214.6 MB
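
The size estimate follows from the parameter counts and precisions above. A rough back-of-the-envelope check (assuming the multimodal adapter is stored in FP32; the exact serialization may differ):

```python
# Rough storage estimate from the tables above.
# Assumes: LM at 4 bits (0.5 bytes/param), vision encoder at FP16 (2 bytes),
# adapter at FP32 (4 bytes, an assumption), episodic memory at FP32 (4 bytes).
components = {
    "language_model":    (315.1e6, 0.5),
    "vision_encoder":    (8.8e6,   2.0),
    "multimodal_adapter": (5.0e6,  4.0),
    "episodic_memory":   (4.8e6,   4.0),
}
total_bytes = sum(n * bytes_per_param for n, bytes_per_param in components.values())
print(f"~{total_bytes / 1e6:.1f} MB")  # ~214.3 MB, close to the ~214.6 MB reported above
```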

---

## Training Details

### Configuration

- **Dataset**: CC12M (Conceptual 12M), 3M training samples
- **Batch Size**: 512
- **Training Time**: ~0.64 hours on 2x A100 80GB
- **Throughput**: ~332 samples/sec
- **Total FLOPs**: 2088 PFLOPs

### FIBER Alignment

- **Mode**: Fusion-in-Backbone (FIBER-style)
- **Fusion Layers**: [6, 8, 10]
- **ITC Weight**: 1.0
- **ITM Weight**: 0.5
- **ITC Queue Size**: 256 (see the loss sketch below)
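
The checkpoint's actual loss code lives in the GitHub repo. As a rough illustration of how the ITC weight, ITM weight, and 256-entry queue listed above typically interact, here is a generic PyTorch sketch; the function name, arguments, and temperature value are illustrative, not taken from the repo.

```python
import torch
import torch.nn.functional as F

def alignment_loss(img_emb, txt_emb, itm_logits, itm_labels,
                   queue_img, queue_txt, temperature=0.07,
                   itc_weight=1.0, itm_weight=0.5):
    """Illustrative ITC + ITM objective with a negative queue (not the repo's exact code)."""
    # Normalize embeddings so dot products are cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)   # (B, D)
    txt_emb = F.normalize(txt_emb, dim=-1)   # (B, D)

    # ITC: each image matches its own caption against in-batch + queued negatives.
    # queue_img / queue_txt hold (Q, D) embeddings from previous batches, Q = 256 here.
    txt_all = torch.cat([txt_emb, F.normalize(queue_txt, dim=-1)], dim=0)  # (B + Q, D)
    img_all = torch.cat([img_emb, F.normalize(queue_img, dim=-1)], dim=0)
    logits_i2t = img_emb @ txt_all.t() / temperature                       # (B, B + Q)
    logits_t2i = txt_emb @ img_all.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    itc_loss = 0.5 * (F.cross_entropy(logits_i2t, targets) +
                      F.cross_entropy(logits_t2i, targets))

    # ITM: binary matched/unmatched classification over fused image-text pairs.
    itm_loss = F.cross_entropy(itm_logits, itm_labels)

    # Weighted sum with the weights listed above (ITC 1.0, ITM 0.5).
    return itc_weight * itc_loss + itm_weight * itm_loss
```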

### Training Metrics (Best Checkpoint)

- **Best Alignment Similarity**: 0.0249 (step 25)
- **Final ITM Loss**: ~0.53
- **Final Token Loss**: ~0.056
- **Training stopped**: early stopping at step 1500 (alignment plateau)
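
As a quick sanity check on these numbers (a worked calculation, not part of the training code): stopping at step 1500 with a batch size of 512 means roughly 768k samples were processed, which at ~332 samples/sec works out to about 0.64 hours, matching the reported training time.

```python
steps, batch_size, throughput = 1500, 512, 332   # from the configuration above
samples_seen = steps * batch_size                # 768,000 samples
hours = samples_seen / throughput / 3600
print(f"{samples_seen:,} samples, ~{hours:.2f} h")  # ~0.64 h
```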

---

## Usage

### Loading the Model

```python
import torch

# Load the checkpoint on CPU
checkpoint = torch.load('model.pt', map_location='cpu')

# Access the model state dict
model_state = checkpoint['model_state_dict']

# Inspect training info stored alongside the weights
print(f"Global step: {checkpoint.get('global_step', 'N/A')}")
print(f"Best alignment: {checkpoint.get('best_correct_sim', 'N/A')}")
```
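
To see how the checkpoint's weights are distributed across components (and compare against the parameter table above), you can group state-dict entries by their top-level prefix. The prefix names depend on this repo's module naming, which is not documented here, so treat the output as exploratory; also note that 4-bit weights stored by bitsandbytes are packed, so their element counts will not match the logical parameter counts.

```python
from collections import Counter

# Group element counts by top-level module name (exploratory; names depend on the repo's code).
param_counts = Counter()
for name, tensor in model_state.items():
    param_counts[name.split('.')[0]] += tensor.numel()

for module, n in param_counts.most_common():
    print(f"{module}: {n / 1e6:.2f}M elements")
```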

### Inference Example

```python
import torch
import torchvision.transforms as transforms
from PIL import Image
from transformers import AutoTokenizer

# Prepare the image (224x224 with ImageNet normalization, as expected by DeiT-Tiny)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

image = Image.open('example.jpg').convert('RGB')
image_tensor = transform(image).unsqueeze(0)

# Tokenize the text prompt. The Qwen2.5-0.5B tokenizer is assumed here, since that is
# the language backbone; adjust if your setup uses a different tokenizer.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
tokens = tokenizer("a photo of", return_tensors="pt")

# Forward pass (assumes `model` is the MicroVLM-V model restored from model.pt)
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']
    )
```

---

## Repository Contents

- `model.pt` - Best alignment checkpoint
- `statistics.json` - Training statistics
- `config.json` - Model configuration
- `README.md` - This model card

---

## Requirements

```bash
pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install timm          # For DeiT vision encoder
pip install bitsandbytes  # For 4-bit quantization
```

---

## License

Apache 2.0 License

---

## Links

- **GitHub Repository**: [euhidaman/MicroVLM-V](https://github.com/euhidaman/MicroVLM-V)
- **Branch**: FocusedAttention

---

## Limitations

- This is the **Stage 1 alignment checkpoint**, which focuses on vision-language alignment
- Best suited for image-text matching and alignment tasks
- May need further fine-tuning for generation tasks

---

*Uploaded: 2025-12-08 14:53:01 UTC*