---
license: apache-2.0
tags:
- vision-language
- multimodal
- episodic-memory
- 1.58-bit
- pytorch
library_name: transformers
---
# MicroVLM-V: Vision-Language Model with Episodic Memory
## πŸ”„ Training Progress: stage2
**Current Status:** Epoch 0/10
> ⚠️ **Note:** This repository contains ONLY the latest checkpoint. Each epoch overwrites previous weights.
---
## πŸ“Š Model Architecture
### Parameter Distribution
| Component | Total Parameters | Trainable Parameters | % of Total |
|-----------|-----------------|----------------------|------------|
| **Total Model** | **513.77M** | **79.39M (15.5%)** | **100%** |
| Vision Encoder | 8.79M | 8.79M | 1.7% |
| Language Model | 494.03M | 59.65M | 96.2% |
| Multimodal Adapter | 5.04M | 5.04M | 1.0% |
| Episodic Memory | 5.23M | 5.23M | 1.0% |
### Technical Specifications
- **Vision Encoder:** DeiT-Tiny (192-dim embeddings)
- Quantization: FP16
- Status: Trainable
- **Language Model:** Qwen2.5-0.5B (896-dim embeddings)
- Quantization: FP16
  - Trainable Parameters: 59.65M (last 2 layers)
- **Multimodal Adapter** (see the sketch after this list):
- Architecture: Linear projection + Layer Norm
- Mapping: 192-dim (vision) β†’ 896-dim (language)
- Parameters: 5.04M
- **Episodic Memory:**
  - Type: BitLinear 1.58-bit quantized (sketched under Training Methodology below)
- Quantization: Enabled
- Parameters: 5.23M
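
As a rough illustration of the adapter's stated shape, here is a minimal PyTorch sketch of a Linear + LayerNorm projection from DeiT-Tiny's 192-dim tokens into the 896-dim Qwen2.5-0.5B embedding space. The class name and the use of `timm` here are illustrative assumptions; the released adapter has 5.04M parameters, so the actual module is likely larger than this single projection.

```python
import timm
import torch
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    """Illustrative sketch: linear projection + LayerNorm mapping
    192-dim vision features into the 896-dim language embedding space."""
    def __init__(self, vision_dim: int = 192, lm_dim: int = 896):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)
        self.norm = nn.LayerNorm(lm_dim)

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        return self.norm(self.proj(vision_features))

# DeiT-Tiny emits 192-dim patch tokens
vision_encoder = timm.create_model('deit_tiny_patch16_224', pretrained=True)
adapter = MultimodalAdapter()

images = torch.randn(1, 3, 224, 224)
tokens = vision_encoder.forward_features(images)  # [1, num_tokens, 192]
projected = adapter(tokens)                       # [1, num_tokens, 896]
```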
### Model Size
- **Estimated Size:** 1026.85 MB
- **Memory Footprint:** ~1540 MB (with activations)
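
The size estimate is consistent with storing every parameter in FP16 (2 bytes each):

```python
params = 513.77e6              # total parameter count
print(params * 2 / 1e6, 'MB')  # 2 bytes/param in FP16 -> ~1027.5 MB
```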
---
## 🎯 Training Methodology
### stage2 Configuration
**Focus:** Episodic memory integration
**Training Strategy:**
- Vision encoder: **Frozen**
- Language model: **Partially unfrozen** (last 2 layers)
- Multimodal adapter: **Trainable** (initialized from Stage 1)
- Episodic memory: **Enabled** (1.58-bit quantization)
**Loss Function:** Alignment + Memory losses
- Continues alignment refinement
- Adds memory read/write/retrieval objectives
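
For intuition, a BitLinear-style layer quantizes its weights to the ternary set {-1, 0, +1}, hence ~1.58 bits per weight (log2(3) ≈ 1.58). The sketch below follows the BitNet b1.58 absmean recipe with a straight-through estimator; the repository's exact implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Module):
    """Illustrative 1.58-bit linear layer (BitNet b1.58-style absmean
    quantization); not necessarily the repository's exact implementation."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-5)          # per-tensor absmean scale
        w_q = (w / scale).round().clamp(-1, 1) * scale  # ternary {-1, 0, +1} weights
        w_q = w + (w_q - w).detach()                    # straight-through estimator
        return F.linear(x, w_q)
```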
**Hyperparameters:**
- Learning Rate: 0.0001
- Batch Size: 112
- Warmup Steps: 1000
- Gradient Clipping: 0.5
- Optimizer: AdamW
- Scheduler: Cosine (with warmup)
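
As a rough illustration, these hyperparameters could be wired up in PyTorch as follows. This is a sketch, not the repository's training script; `model` and `total_steps` are assumed to be defined elsewhere.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Optimize only the trainable (unfrozen) parameters at lr=1e-4
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
# Cosine schedule with 1000 warmup steps
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=total_steps
)

# Inside the training step, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```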
**Hardware:**
- GPU: NVIDIA RTX 6000 Ada (48GB)
- Precision: Mixed FP16/FP32
- Distributed: Single GPU
---
## πŸ“ˆ Training Statistics (Epoch 0)
**Latest Metrics:**
- Training Loss: 1.7374
- Alignment Loss: 0.0000
- Learning Rate: 9.79e-05
- Gradient Norm: 0.0000
**Timestamp:** 2025-12-19 21:35:56 UTC
---
## πŸ’» Usage
### Loading the Model
```python
import torch

# Instantiate the model architecture before loading weights; the model
# class lives in the GitHub repository, so the line below is a
# placeholder for the repo's actual constructor.
# model = MicroVLM(...)

# Load the checkpoint
checkpoint = torch.load('model.pt', map_location='cpu')

# Load model state
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
```
### Inference Example
```python
from PIL import Image
from transformers import AutoTokenizer
import torchvision.transforms as transforms

# Tokenizer for the Qwen2.5-0.5B language backbone (an assumption;
# use whichever tokenizer the repository ships with)
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')
# Prepare image
image = Image.open('example.jpg').convert('RGB')
transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
image_tensor = transform(image).unsqueeze(0).to(device)
# Prepare text
text = "A photo of a cat"
tokens = tokenizer(text, return_tensors='pt', padding=True).to(device)
# Forward pass
with torch.no_grad():
outputs = model(
images=image_tensor,
input_ids=tokens['input_ids'],
attention_mask=tokens['attention_mask']
)
```
### Model Input/Output Format
**Inputs:**
- `images`: Tensor [B, 3, 224, 224] - normalized RGB images (ImageNet mean/std)
- `input_ids`: Tensor [B, seq_len] - Tokenized text
- `attention_mask`: Tensor [B, seq_len] - Attention mask
**Outputs:**
- `lm_loss`: Language modeling loss (if labels provided)
- `alignment_loss`: Vision-language alignment loss
- `memory_loss`: Episodic memory loss (Stage 2/3 only)
- `logits`: Next token predictions [B, seq_len, vocab_size]
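
Continuing the inference example above, and assuming dictionary-style access to the documented outputs, a single greedy next-token readout could look like:

```python
# Pick the most likely next token at the last position (greedy, one step)
next_token_logits = outputs['logits'][:, -1, :]   # [B, vocab_size]
next_token_id = next_token_logits.argmax(dim=-1)  # [B]
print(tokenizer.decode(next_token_id.tolist()))
```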
---
## βš™οΈ Requirements
```bash
pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install timm         # DeiT vision encoder
pip install torchvision  # image preprocessing in the examples
pip install Pillow       # image loading
```
---
## πŸ“œ License
This model is released under the Apache License 2.0.
---
## πŸ”— Links
- **GitHub Repository:** [euhidaman/MicroVLM-V](https://github.com/euhidaman/MicroVLM-V)
- **Paper:** Coming soon
- **Demo:** Coming soon
---
## ⚠️ Limitations
- **Training in Progress:** This model is still under active training
- **Checkpoint Volatility:** Only latest epoch is preserved - download if needed
- **Stage-Specific:** Capabilities depend on training stage
- Stage 1: Alignment only, no generation
- Stage 2: Basic generation with memory
- Stage 3: Full capabilities
---
## πŸ“§ Contact
For questions or issues, please open an issue on GitHub.
---
*Last updated: Epoch 0/10 - 2025-12-19*