---
license: apache-2.0
tags:
- vision-language
- multimodal
- episodic-memory
- fiber-alignment
- qwen2
- deit
- pytorch
library_name: transformers
datasets:
- conceptual-12m
pipeline_tag: image-to-text
---

# MicroVLM-V: Vision-Language Model with FIBER Alignment & Episodic Memory

## Model Overview

MicroVLM-V is a compact vision-language model (~215 MB) that combines:

- **Vision Encoder**: DeiT-Tiny (5.7M params)
- **Language Model**: Qwen2.5-0.5B (4-bit quantized, 315M params)
- **Alignment**: FIBER fusion at layers [6, 8, 10]
- **Episodic Memory**: Larimar GPM (512 slots, 4.8M params)

**Checkpoint**: `best` (best alignment model)
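
For orientation, the two frozen backbones named above can be instantiated with the libraries listed under Requirements. This is a minimal sketch only: the `deit_tiny_patch16_224` timm checkpoint and the `Qwen/Qwen2.5-0.5B` Hugging Face repo are assumed backbone IDs, and the snippet does not build this repo's fusion, adapter, or memory modules.

```python
import torch
import timm
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Vision backbone: DeiT-Tiny (~5.7M params), kept in FP16 per the quantization table below.
vision_encoder = timm.create_model("deit_tiny_patch16_224", pretrained=True, num_classes=0)
vision_encoder = vision_encoder.half().eval()

# Language backbone: Qwen2.5-0.5B loaded in 4-bit via bitsandbytes (frozen during training).
# Requires a CUDA device for bitsandbytes quantization.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
language_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B", quantization_config=bnb_config, device_map="auto"
)
```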

---

## Model Architecture

### Parameter Distribution

| Component | Total Parameters | Trainable | Status |
|-----------|------------------|-----------|--------|
| **Total Model** | **334.5M** | **13.8M** | **4.1% trainable** |
| Vision Encoder | 8.8M | 3.3M | FIBER fusion trainable |
| Language Model | 315.1M | 0 | Frozen (4-bit) |
| Multimodal Adapter | 5.0M | 5.0M | Fully trainable |
| Episodic Memory | 4.8M | 4.8M | Fully trainable |

### Quantization Status

| Component | Quantization |
|-----------|--------------|
| Vision Encoder | FP16 |
| Language Model | 4-bit |
| Episodic Memory | FP32 |

**Estimated Model Size**: ~214.6 MB
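
The size estimate follows from the parameter counts and precisions above. A rough back-of-the-envelope check (assuming the multimodal adapter is stored in FP32; the exact serialization may differ):

```python
# Rough storage estimate from the tables above.
# Assumes: LM at 4 bits (0.5 bytes/param), vision encoder at FP16 (2 bytes),
# adapter at FP32 (4 bytes, an assumption), episodic memory at FP32 (4 bytes).
components = {
    "language_model":    (315.1e6, 0.5),
    "vision_encoder":    (8.8e6,   2.0),
    "multimodal_adapter": (5.0e6,  4.0),
    "episodic_memory":   (4.8e6,   4.0),
}
total_bytes = sum(n * bytes_per_param for n, bytes_per_param in components.values())
print(f"~{total_bytes / 1e6:.1f} MB")  # ~214.3 MB, close to the ~214.6 MB reported above
```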

---

## Training Details

### Configuration

- **Dataset**: CC12M (Conceptual 12M), 3M training samples
- **Batch Size**: 512
- **Training Time**: ~0.64 hours on 2x A100 80GB
- **Throughput**: ~332 samples/sec
- **Total FLOPs**: 2088 PFLOPs

### FIBER Alignment

- **Mode**: Fusion-in-Backbone (FIBER-style)
- **Fusion Layers**: [6, 8, 10]
- **ITC Weight**: 1.0
- **ITM Weight**: 0.5
- **ITC Queue Size**: 256 (see the loss sketch below)
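
The checkpoint's actual loss code lives in the GitHub repo. As a rough illustration of how the ITC weight, ITM weight, and 256-entry queue listed above typically interact, here is a generic PyTorch sketch; the function name, arguments, and temperature value are illustrative, not taken from the repo.

```python
import torch
import torch.nn.functional as F

def alignment_loss(img_emb, txt_emb, itm_logits, itm_labels,
                   queue_img, queue_txt, temperature=0.07,
                   itc_weight=1.0, itm_weight=0.5):
    """Illustrative ITC + ITM objective with a negative queue (not the repo's exact code)."""
    # Normalize embeddings so dot products are cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)   # (B, D)
    txt_emb = F.normalize(txt_emb, dim=-1)   # (B, D)

    # ITC: each image matches its own caption against in-batch + queued negatives.
    # queue_img / queue_txt hold (Q, D) embeddings from previous batches, Q = 256 here.
    txt_all = torch.cat([txt_emb, F.normalize(queue_txt, dim=-1)], dim=0)  # (B + Q, D)
    img_all = torch.cat([img_emb, F.normalize(queue_img, dim=-1)], dim=0)
    logits_i2t = img_emb @ txt_all.t() / temperature                       # (B, B + Q)
    logits_t2i = txt_emb @ img_all.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    itc_loss = 0.5 * (F.cross_entropy(logits_i2t, targets) +
                      F.cross_entropy(logits_t2i, targets))

    # ITM: binary matched/unmatched classification over fused image-text pairs.
    itm_loss = F.cross_entropy(itm_logits, itm_labels)

    # Weighted sum with the weights listed above (ITC 1.0, ITM 0.5).
    return itc_weight * itc_loss + itm_weight * itm_loss
```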

### Training Metrics (Best Checkpoint)

- **Best Alignment Similarity**: 0.0249 (step 25)
- **Final ITM Loss**: ~0.53
- **Final Token Loss**: ~0.056
- **Training stopped**: early stopping at step 1500 (alignment plateau)
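
As a quick sanity check on these numbers (a worked calculation, not part of the training code): stopping at step 1500 with a batch size of 512 means roughly 768k samples were processed, which at ~332 samples/sec works out to about 0.64 hours, matching the reported training time.

```python
steps, batch_size, throughput = 1500, 512, 332   # from the configuration above
samples_seen = steps * batch_size                # 768,000 samples
hours = samples_seen / throughput / 3600
print(f"{samples_seen:,} samples, ~{hours:.2f} h")  # ~0.64 h
```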

---

## Usage

### Loading the Model

```python
import torch

# Load the checkpoint on CPU
checkpoint = torch.load('model.pt', map_location='cpu')

# Access the model state dict
model_state = checkpoint['model_state_dict']

# Inspect training info stored alongside the weights
print(f"Global step: {checkpoint.get('global_step', 'N/A')}")
print(f"Best alignment: {checkpoint.get('best_correct_sim', 'N/A')}")
```
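
To see how the checkpoint's weights are distributed across components (and compare against the parameter table above), you can group state-dict entries by their top-level prefix. The prefix names depend on this repo's module naming, which is not documented here, so treat the output as exploratory; also note that 4-bit weights stored by bitsandbytes are packed, so their element counts will not match the logical parameter counts.

```python
from collections import Counter

# Group element counts by top-level module name (exploratory; names depend on the repo's code).
param_counts = Counter()
for name, tensor in model_state.items():
    param_counts[name.split('.')[0]] += tensor.numel()

for module, n in param_counts.most_common():
    print(f"{module}: {n / 1e6:.2f}M elements")
```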

### Inference Example

```python
import torch
import torchvision.transforms as transforms
from PIL import Image
from transformers import AutoTokenizer

# Prepare the image (224x224 with ImageNet normalization, as expected by DeiT-Tiny)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

image = Image.open('example.jpg').convert('RGB')
image_tensor = transform(image).unsqueeze(0)

# Tokenize the text prompt. The Qwen2.5-0.5B tokenizer is assumed here, since that is
# the language backbone; adjust if your setup uses a different tokenizer.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
tokens = tokenizer("a photo of", return_tensors="pt")

# Forward pass (assumes `model` is the MicroVLM-V model restored from model.pt)
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']
    )
```

---

## Repository Contents

- `model.pt` - Best alignment checkpoint
- `statistics.json` - Training statistics
- `config.json` - Model configuration
- `README.md` - This model card

---

## Requirements

```bash
pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install timm          # For DeiT vision encoder
pip install bitsandbytes  # For 4-bit quantization
```

---

## License

Apache 2.0 License

---

## Links

- **GitHub Repository**: [euhidaman/MicroVLM-V](https://github.com/euhidaman/MicroVLM-V)
- **Branch**: FocusedAttention

---

## Limitations

- This is the **Stage 1 alignment checkpoint**, which focuses on vision-language alignment
- Best suited for image-text matching and alignment tasks
- May need further fine-tuning for generation tasks

---

*Uploaded: 2025-12-08 14:53:01 UTC*