|
|
--- |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- vision-language |
|
|
- multimodal |
|
|
- episodic-memory |
|
|
- 1.58-bit |
|
|
- pytorch |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# MicroVLM-V: Vision-Language Model with Episodic Memory |
|
|
|
|
|
## 🚀 Training Progress: stage2
|
|
|
|
|
**Current Status:** Epoch 0/10 |
|
|
|
|
|
> ⚠️ **Note:** This repository contains ONLY the latest checkpoint. Each epoch overwrites previous weights.
|
|
|
|
|
--- |
|
|
|
|
|
## 🏗 Model Architecture
|
|
|
|
|
### Parameter Distribution |
|
|
|
|
|
| Component | Total Parameters | Trainable | % of Total Params |
|-----------|-----------------|-----------|-------------------|
| **Total Model** | **513.77M** | **79.39M** | **100% (15.5% trainable)** |
| Vision Encoder | 8.79M | 8.79M | 1.7% |
| Language Model | 494.03M | 59.65M | 96.2% |
| Multimodal Adapter | 5.04M | 5.04M | 1.0% |
| Episodic Memory | 5.23M | 5.23M | 1.0% |
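
The trainable/total split above can be reproduced from a loaded model. A minimal sketch, assuming a standard `torch.nn.Module`; the `model` object and its component names are illustrative, not the repository's actual API:

```python
import torch.nn as nn

def count_parameters(module: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts for a module."""
    total = sum(p.numel() for p in module.parameters())
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    return total, trainable

# Hypothetical usage once the MicroVLM-V model object exists:
# total, trainable = count_parameters(model)
# print(f"{total / 1e6:.2f}M total, {trainable / 1e6:.2f}M trainable "
#       f"({100 * trainable / total:.1f}%)")
```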
|
|
|
|
|
### Technical Specifications |
|
|
|
|
|
- **Vision Encoder:** DeiT-Tiny (192-dim embeddings) |
|
|
- Quantization: FP16 |
|
|
- Status: Trainable |
|
|
|
|
|
- **Language Model:** Qwen2.5-0.5B (896-dim embeddings) |
|
|
- Quantization: FP16 |
|
|
- Trainable Parameters: 59.65M (last 2 layers)
|
|
|
|
|
- **Multimodal Adapter:** |
|
|
- Architecture: Linear projection + Layer Norm |
|
|
- Mapping: 192-dim (vision) → 896-dim (language)
|
|
- Parameters: 5.04M (a structural sketch follows this list)
|
|
|
|
|
- **Episodic Memory:** |
|
|
- Type: BitLinear 1.58-bit quantized |
|
|
- Quantization: Enabled |
|
|
- Parameters: 5.23M |
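
A minimal sketch of an adapter matching the specification above (linear projection plus layer normalization, 192 → 896). The class and argument names are illustrative, not the repository's actual API; note that a single linear layer of this size has far fewer than the listed 5.04M parameters, so the real adapter likely contains additional layers and this is only a structural illustration:

```python
import torch
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    """Projects DeiT-Tiny vision features (192-dim) into the
    Qwen2.5-0.5B embedding space (896-dim)."""

    def __init__(self, vision_dim: int = 192, language_dim: int = 896):
        super().__init__()
        self.proj = nn.Linear(vision_dim, language_dim)
        self.norm = nn.LayerNorm(language_dim)

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: [B, num_patches, 192] -> [B, num_patches, 896]
        return self.norm(self.proj(vision_features))
```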
|
|
|
|
|
### Model Size |
|
|
|
|
|
- **Estimated Size:** 1026.85 MB |
|
|
- **Memory Footprint:** ~1540 MB (with activations)
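
The size estimate is consistent with storing roughly 513.77M parameters in FP16 (2 bytes each); a quick back-of-the-envelope check:

```python
total_params = 513.77e6      # from the parameter table above
bytes_per_param = 2          # FP16
size_mb = total_params * bytes_per_param / 1e6
print(f"{size_mb:.2f} MB")   # ~1027.5 MB, close to the reported 1026.85 MB
```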
|
|
|
|
|
--- |
|
|
|
|
|
## 🎯 Training Methodology
|
|
|
|
|
### stage2 Configuration |
|
|
|
|
|
**Focus:** Episodic memory integration |
|
|
|
|
|
**Training Strategy:** |
|
|
- Vision encoder: **Frozen** |
|
|
- Language model: **Partially unfrozen** (last 2 layers) |
|
|
- Multimodal adapter: **Trainable** (initialized from Stage 1) |
|
|
- Episodic memory: **Enabled** (1.58-bit quantization; a BitLinear sketch follows this list)
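
The episodic memory uses BitLinear-style 1.58-bit layers. Below is a minimal sketch of ternary (1.58-bit) weight quantization in the style of BitNet b1.58, using absmean scaling and a straight-through estimator; this is a generic illustration, not the repository's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def quantize_weights_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Map weights to {-1, 0, +1} * scale using absmean scaling."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1) * scale
    # Straight-through estimator: quantized values in the forward pass,
    # full-precision gradients in the backward pass.
    return w + (w_q - w).detach()

class BitLinear(nn.Linear):
    """Linear layer whose weights are ternarized on the fly during forward."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, quantize_weights_ternary(self.weight), self.bias)
```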
|
|
|
|
|
**Loss Function:** Alignment + Memory losses |
|
|
- Continues alignment refinement |
|
|
- Adds memory read/write/retrieval objectives |
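
Based on the output names listed in the Usage section below, the Stage 2 objective plausibly combines these terms with weighting coefficients. The dictionary-style access and the coefficient values here are assumptions, shown only to illustrate how the pieces fit together:

```python
# Hypothetical combination of the loss terms; lambda values are placeholders.
lambda_align, lambda_mem = 1.0, 1.0

def total_loss(outputs: dict):
    return (outputs["lm_loss"]
            + lambda_align * outputs["alignment_loss"]
            + lambda_mem * outputs["memory_loss"])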
|
|
|
|
|
**Hyperparameters:** |
|
|
- Learning Rate: 0.0001 |
|
|
- Batch Size: 112 |
|
|
- Warmup Steps: 1000 |
|
|
- Gradient Clipping: 0.5 |
|
|
- Optimizer: AdamW
|
|
- Scheduler: cosine |
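
A sketch of how these hyperparameters translate into a standard PyTorch/Transformers training setup; the `model`, dataloader, and `num_training_steps` objects are assumed to exist and are not part of this repository's documented API:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),  # only the 79.39M trainable params
    lr=1e-4,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,
    num_training_steps=num_training_steps,  # depends on dataset size and batch size 112
)

# Inside the training loop, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```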
|
|
|
|
|
**Hardware:** |
|
|
- GPU: NVIDIA RTX 6000 Ada (48GB) |
|
|
- Precision: Mixed FP16/FP32 |
|
|
- Distributed: Single GPU |
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 Training Statistics (Epoch 0)
|
|
|
|
|
**Latest Metrics:** |
|
|
- Training Loss: 1.7374 |
|
|
- Alignment Loss: 0.0000 |
|
|
- Learning Rate: 9.79e-05 |
|
|
- Gradient Norm: 0.0000 |
|
|
|
|
|
**Timestamp:** 2025-12-19 21:35:56 UTC |
|
|
|
|
|
--- |
|
|
|
|
|
## 💻 Usage
|
|
|
|
|
### Loading the Model |
|
|
|
|
|
```python
import torch

# Instantiate the MicroVLM-V architecture first (the model code lives in the
# GitHub repository, https://github.com/euhidaman/MicroVLM-V); the class name
# below is a placeholder.
# model = MicroVLM(...)

# Load the checkpoint (download model.pt from this repository first)
checkpoint = torch.load('model.pt', map_location='cpu')

# Load model state
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
```
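
To fetch the checkpoint programmatically instead of through the web UI, `huggingface_hub` can be used (install it with `pip install huggingface_hub`). The `repo_id` below is a placeholder; substitute this model's actual Hub repository id:

```python
import torch
from huggingface_hub import hf_hub_download

# repo_id is a placeholder; filename follows the loading example above.
checkpoint_path = hf_hub_download(repo_id="<user>/MicroVLM-V", filename="model.pt")
checkpoint = torch.load(checkpoint_path, map_location='cpu')
```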
|
|
|
|
|
### Inference Example |
|
|
|
|
|
```python
import torch
import torchvision.transforms as transforms
from PIL import Image

# Prepare image
image = Image.open('example.jpg').convert('RGB')
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
image_tensor = transform(image).unsqueeze(0).to(device)

# Prepare text
# `tokenizer` is the text tokenizer matching the Qwen2.5-0.5B language model;
# `device` and `model` come from the loading example above.
text = "A photo of a cat"
tokens = tokenizer(text, return_tensors='pt', padding=True).to(device)

# Forward pass
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']
    )
```
|
|
|
|
|
### Model Input/Output Format |
|
|
|
|
|
**Inputs:** |
|
|
- `images`: Tensor [B, 3, 224, 224] - normalized RGB images
|
|
- `input_ids`: Tensor [B, seq_len] - Tokenized text |
|
|
- `attention_mask`: Tensor [B, seq_len] - Attention mask |
|
|
|
|
|
**Outputs:** |
|
|
- `lm_loss`: Language modeling loss (if labels provided) |
|
|
- `alignment_loss`: Vision-language alignment loss |
|
|
- `memory_loss`: Episodic memory loss (Stage 2/3 only) |
|
|
- `logits`: Next token predictions [B, seq_len, vocab_size] |
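
A small example of consuming the `logits` output for greedy next-token prediction, continuing the inference example above; dictionary-style access to the outputs is an assumption:

```python
# Greedy next-token prediction from the logits at the last position.
next_token_id = outputs['logits'][:, -1, :].argmax(dim=-1)   # shape [B]
next_token = tokenizer.decode(next_token_id[0].item())
print(next_token)
```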
|
|
|
|
|
--- |
|
|
|
|
|
## ⚙️ Requirements
|
|
|
|
|
```bash |
|
|
pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install timm     # For DeiT vision encoder
pip install Pillow   # For image processing
|
|
``` |
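
For reference, the DeiT-Tiny backbone named in the specifications can be instantiated through `timm`. Whether the repository constructs the encoder exactly this way is not documented here, so treat this as an illustration:

```python
import timm

# DeiT-Tiny feature extractor: 192-dim embeddings at 224x224 input,
# with the classification head removed (num_classes=0).
vision_encoder = timm.create_model('deit_tiny_patch16_224', pretrained=True, num_classes=0)
```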
|
|
|
|
|
--- |
|
|
|
|
|
## 📄 License
|
|
|
|
|
Apache 2.0 License |
|
|
|
|
|
--- |
|
|
|
|
|
## 🔗 Links
|
|
|
|
|
- **GitHub Repository:** [euhidaman/MicroVLM-V](https://github.com/euhidaman/MicroVLM-V) |
|
|
- **Paper:** Coming soon |
|
|
- **Demo:** Coming soon |
|
|
|
|
|
--- |
|
|
|
|
|
## ⚠️ Limitations
|
|
|
|
|
- **Training in Progress:** This model is still being actively trained
- **Checkpoint Volatility:** Only the latest epoch's checkpoint is kept; download it promptly if you need a specific version
|
|
- **Stage-Specific:** Capabilities depend on training stage |
|
|
- Stage 1: Alignment only, no generation |
|
|
- Stage 2: Basic generation with memory |
|
|
- Stage 3: Full capabilities |
|
|
|
|
|
--- |
|
|
|
|
|
## 📧 Contact
|
|
|
|
|
For questions or issues, please open an issue on GitHub. |
|
|
|
|
|
--- |
|
|
|
|
|
*Last updated: Epoch 0/10 - 2025-12-19* |
|
|
|