euhidaman committed
Commit a1428dc · verified · 1 Parent(s): 016c4d8

Update model card for best

Files changed (1)
  1. README.md +166 -0
README.md ADDED
---
license: apache-2.0
tags:
- vision-language
- multimodal
- episodic-memory
- fiber-alignment
- qwen2
- deit
- pytorch
library_name: transformers
datasets:
- conceptual-12m
pipeline_tag: image-to-text
---

# MicroVLM-V: Vision-Language Model with FIBER Alignment & Episodic Memory

## 📋 Model Overview

MicroVLM-V is a compact vision-language model (~215 MB) that combines:
- **Vision Encoder**: DeiT-Tiny (5.7M params)
- **Language Model**: Qwen2.5-0.5B (4-bit quantized, 315M params)
- **Alignment**: FIBER fusion at layers [6, 8, 10]
- **Episodic Memory**: Larimar GPM (512 slots, 4.8M params)

**Checkpoint**: `best` (Best alignment model)

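For orientation, the components above map onto a configuration roughly like the sketch below. The key names and the `deit_tiny_patch16_224` / `Qwen/Qwen2.5-0.5B` identifiers are illustrative assumptions; the authoritative values live in `config.json` and the GitHub repository.

```python
# Illustrative configuration only: field names are assumptions, not the repo's actual schema.
microvlm_config = {
    "vision_encoder": {"backbone": "deit_tiny_patch16_224", "params": "5.7M"},
    "language_model": {"name": "Qwen/Qwen2.5-0.5B", "quantization": "4-bit",
                       "params": "315M", "frozen": True},
    "alignment": {"mode": "fiber_fusion_in_backbone", "fusion_layers": [6, 8, 10]},
    "episodic_memory": {"type": "larimar_gpm", "slots": 512, "params": "4.8M"},
}
```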
---

## 📊 Model Architecture

### Parameter Distribution

| Component | Total Parameters | Trainable | Status |
|-----------|-----------------|-----------|--------|
| **Total Model** | **334.5M** | **13.8M** | **4.1% trainable** |
| Vision Encoder | 8.8M | 3.3M | FIBER fusion trainable |
| Language Model | 315.1M | 0 | Frozen (4-bit) |
| Multimodal Adapter | 5.0M | 5.0M | Fully trainable |
| Episodic Memory | 4.8M | 4.8M | Fully trainable |

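The trainable/total split in the table can be checked for any PyTorch module with a few lines; a generic sketch, assuming `model` is the instantiated MicroVLM-V module from the GitHub repository:

```python
def count_parameters(model):
    """Return (total, trainable) parameter counts for a PyTorch module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# total, trainable = count_parameters(model)
# print(f"{trainable / total:.1%} trainable")  # expected ~4.1% for this checkpoint
```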
### Quantization Status

| Component | Quantization |
|-----------|-------------|
| Vision Encoder | FP16 |
| Language Model | 4-bit ✓ |
| Episodic Memory | FP32 |

**Estimated Model Size**: ~214.6 MB

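The language model is stored in 4-bit. If you rebuild the LM yourself rather than loading the packaged checkpoint, a typical way to obtain an equivalent 4-bit Qwen2.5-0.5B with `transformers` + `bitsandbytes` is sketched below; the compute dtype and other settings are assumptions, not necessarily what was used for this checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # assumption: FP16 compute, matching the vision encoder
)
lm = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",
    quantization_config=bnb_config,
    device_map="auto",
)
lm.requires_grad_(False)  # the LM is kept frozen in MicroVLM-V
```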
---

## 🏋️ Training Details

### Configuration
- **Dataset**: CC12M (Conceptual 12M) - 3M training samples
- **Batch Size**: 512
- **Training Time**: ~0.64 hours on 2x A100 80GB
- **Throughput**: ~332 samples/sec
- **Total FLOPs**: 2088 PFLOPs

### FIBER Alignment
- **Mode**: Fusion-in-Backbone (FIBER-style)
- **Fusion Layers**: [6, 8, 10]
- **ITC Weight**: 1.0
- **ITM Weight**: 0.5
- **ITC Queue Size**: 256

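The ITC and ITM weights above determine how the two alignment objectives are combined into a single loss. A minimal sketch of that weighted sum, assuming a standard symmetric InfoNCE ITC term and a cross-entropy ITM term (the temperature value and the repository's actual loss code may differ):

```python
import torch
import torch.nn.functional as F

def alignment_loss(image_emb, text_emb, itm_logits, itm_labels,
                   itc_weight=1.0, itm_weight=0.5, temperature=0.07):
    """Weighted ITC + ITM objective (illustrative, not the repo's exact implementation)."""
    # ITC: contrast each image with all texts in the batch, and vice versa.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    itc = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
    # ITM: binary matched / mismatched classification on fused pairs.
    itm = F.cross_entropy(itm_logits, itm_labels)
    return itc_weight * itc + itm_weight * itm
```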
### Training Metrics (Best Checkpoint)
- **Best Alignment Similarity**: 0.0249 (step 25)
- **Final ITM Loss**: ~0.53
- **Final Token Loss**: ~0.056
- **Training stopped**: Early stopping at step 1500 (alignment plateau)

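The reported wall-clock time is consistent with the early-stopping step count, batch size, and throughput above, as the quick check below shows (assuming the ~0.64 h covers exactly the 1500 steps):

```python
steps, batch_size, throughput = 1500, 512, 332   # values from the sections above
samples_seen = steps * batch_size                # 768,000 samples
hours = samples_seen / throughput / 3600         # ~0.64 hours
print(f"{samples_seen:,} samples in ~{hours:.2f} h at {throughput} samples/sec")
```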
---

## 💻 Usage

### Loading the Model

```python
import torch

# Load checkpoint
checkpoint = torch.load('model.pt', map_location='cpu')

# Access model state dict
model_state = checkpoint['model_state_dict']

# Get training info
print(f"Global step: {checkpoint.get('global_step', 'N/A')}")
print(f"Best alignment: {checkpoint.get('best_correct_sim', 'N/A')}")
```
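Because the model class itself lives in the GitHub repository, a useful next step is to inspect what `model_state_dict` contains before wiring the weights into code. A small sketch, continuing from the loading example above, that tallies parameters per top-level submodule (the prefix names depend on the repository's module layout):

```python
from collections import defaultdict

params_per_component = defaultdict(int)
for name, tensor in model_state.items():
    component = name.split('.')[0]   # top-level prefix, e.g. the vision, LM, adapter, or memory module
    params_per_component[component] += tensor.numel()

for component, n in sorted(params_per_component.items(), key=lambda kv: -kv[1]):
    print(f"{component:30s} {n / 1e6:8.2f}M params")
```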
### Inference Example

```python
import torch
from PIL import Image
import torchvision.transforms as transforms
from transformers import AutoTokenizer

# Prepare image
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

image = Image.open('example.jpg').convert('RGB')
image_tensor = transform(image).unsqueeze(0)

# Prepare text prompt (the tokenizer of the Qwen2.5-0.5B base LM is assumed here)
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')
tokens = tokenizer('a photo of', return_tensors='pt')

# Forward pass (after loading model)
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']
    )
```
---

## 📁 Repository Contents

- `model.pt` - Best alignment checkpoint
- `statistics.json` - Training statistics
- `config.json` - Model configuration
- `README.md` - This model card

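`config.json` and `statistics.json` are plain JSON and can be read directly; the exact field names depend on how the files were written, so treat the sketch below as a starting point:

```python
import json

with open('config.json') as f:
    config = json.load(f)
with open('statistics.json') as f:
    stats = json.load(f)

print(list(config.keys()))  # model configuration fields
print(list(stats.keys()))   # logged training statistics
```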
---

## ⚙️ Requirements

```bash
pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install timm          # For DeiT vision encoder
pip install bitsandbytes  # For 4-bit quantization
```
---

## 📜 License

Apache 2.0 License

---

## 🔗 Links

- **GitHub Repository**: [euhidaman/MicroVLM-V](https://github.com/euhidaman/MicroVLM-V)
- **Branch**: FocusedAttention

---
## ⚠️ Limitations

- This is the **Stage 1 alignment checkpoint**; it focuses on vision-language alignment.
- Best suited for image-text matching and alignment tasks.
- May need further fine-tuning for generation tasks.

---

*Uploaded: 2025-12-08 14:53:01 UTC*