---
license: apache-2.0
tags:
  - vision-language
  - multimodal
  - episodic-memory
  - 1.58-bit
  - pytorch
library_name: transformers
---

# MicroVLM-V: Vision-Language Model with Episodic Memory

## 🔄 Training Progress: stage2

**Current Status:** Epoch 0/10

> ⚠️ **Note:** This repository contains ONLY the latest checkpoint. Each epoch overwrites the previous weights.

---

## 📊 Model Architecture

### Parameter Distribution

| Component | Total Parameters | Trainable | Percentage |
|-----------|-----------------|-----------|------------|
| **Total Model** | **513.77M** | **79.39M** | **15.5%** |
| Vision Encoder | 8.79M | 8.79M | 1.7% |
| Language Model | 494.03M | 59.65M | 96.2% |
| Multimodal Adapter | 5.04M | 5.04M | 1.0% |
| Episodic Memory | 5.23M | 5.23M | 1.0% |

For the component rows, the percentage is that component's share of total parameters; for the Total Model row, it is the trainable fraction (79.39M / 513.77M ≈ 15.5%).

### Technical Specifications

- **Vision Encoder:** DeiT-Tiny (192-dim embeddings)
  - Quantization: FP16
  - Status: Trainable
- **Language Model:** Qwen2.5-0.5B (896-dim embeddings)
  - Quantization: FP16
  - Trainable Layers: last 2 layers (59.65M params)
- **Multimodal Adapter:**
  - Architecture: Linear projection + Layer Norm
  - Mapping: 192-dim (vision) → 896-dim (language)
  - Parameters: 5.04M (a minimal sketch follows this list)
- **Episodic Memory:**
  - Type: BitLinear 1.58-bit quantized
  - Quantization: Enabled
  - Parameters: 5.23M (see the BitLinear sketch below)
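
The adapter's stated structure (a linear projection followed by layer norm, mapping 192-dim vision features to the 896-dim language space) is simple enough to show in code. Below is a minimal sketch under those stated dimensions; the class name `MultimodalAdapter` is a hypothetical stand-in, and since a single 192 → 896 linear layer plus layer norm accounts for only ~0.17M parameters, the actual 5.04M-parameter module is likely deeper than this.

```python
import torch
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    """Minimal sketch: project DeiT-Tiny vision tokens (192-dim)
    into the Qwen2.5-0.5B embedding space (896-dim)."""

    def __init__(self, vision_dim: int = 192, lm_dim: int = 896):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)  # linear projection
        self.norm = nn.LayerNorm(lm_dim)           # layer norm on the language side

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: [B, num_tokens, 192] -> [B, num_tokens, 896]
        return self.norm(self.proj(vision_tokens))

adapter = MultimodalAdapter()
out = adapter(torch.randn(2, 196, 192))  # e.g. 196 patch tokens -> [2, 196, 896]
```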
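
The episodic memory is built from 1.58-bit quantized BitLinear layers. As a reference for what that quantization means, here is a minimal sketch in the BitNet b1.58 style: weights are mapped on the fly to ternary values {-1, 0, +1} using an absmean scale, with a straight-through estimator so the latent full-precision weights still receive gradients. This illustrates the quantization scheme only; the repository's actual memory read/write/retrieval logic is not shown here, and its BitLinear implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Module):
    """Sketch of a 1.58-bit linear layer (BitNet b1.58 style).
    Activation quantization is omitted for brevity."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-5)  # absmean scaling factor
        w_q = (w / scale).round().clamp(-1, 1)  # ternary weights {-1, 0, +1}
        # Straight-through estimator: quantized weights in the forward pass,
        # full-precision gradients for the latent weights in the backward pass.
        w_eff = w + (w_q * scale - w).detach()
        return F.linear(x, w_eff)
```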

### Model Size

- **Estimated Size:** 1026.85 MB
- **Memory Footprint:** ~1540 MB (with activations)

---

## 🎯 Training Methodology

### stage2 Configuration

**Focus:** Episodic memory integration

**Training Strategy:**

- Vision encoder: **Frozen**
- Language model: **Partially unfrozen** (last 2 layers)
- Multimodal adapter: **Trainable** (initialized from Stage 1)
- Episodic memory: **Enabled** (1.58-bit quantization)

**Loss Function:** Alignment + memory losses

- Continues alignment refinement
- Adds memory read/write/retrieval objectives

**Hyperparameters:**

- Learning Rate: 0.0001
- Batch Size: 112
- Warmup Steps: 1000
- Gradient Clipping: 0.5
- Optimizer: AdamW
- Scheduler: Cosine

**Hardware:**

- GPU: NVIDIA RTX 6000 Ada (48GB)
- Precision: Mixed FP16/FP32
- Distributed: Single GPU

---

## 📈 Training Statistics (Epoch 0)

**Latest Metrics:**

- Training Loss: 1.7374
- Alignment Loss: 0.0000
- Learning Rate: 9.79e-05
- Gradient Norm: 0.0000

**Timestamp:** 2025-12-19 21:35:56 UTC

---

## 💻 Usage

### Loading the Model

```python
import torch

# `model` must be an instance of the MicroVLM-V architecture
# (see the GitHub repository linked below); the checkpoint stores weights only.
checkpoint = torch.load('model.pt', map_location='cpu')

# Load model state
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
```

### Inference Example

```python
from PIL import Image
import torchvision.transforms as transforms
from transformers import AutoTokenizer

# Tokenizer of the base language model
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')

# Prepare image
image = Image.open('example.jpg').convert('RGB')
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
image_tensor = transform(image).unsqueeze(0).to(device)

# Prepare text
text = "A photo of a cat"
tokens = tokenizer(text, return_tensors='pt', padding=True).to(device)

# Forward pass
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']
    )
```

### Model Input/Output Format

**Inputs:**

- `images`: Tensor [B, 3, 224, 224] - normalized RGB images
- `input_ids`: Tensor [B, seq_len] - tokenized text
- `attention_mask`: Tensor [B, seq_len] - attention mask

**Outputs:**

- `lm_loss`: Language modeling loss (if labels provided)
- `alignment_loss`: Vision-language alignment loss
- `memory_loss`: Episodic memory loss (Stage 2/3 only)
- `logits`: Next-token predictions [B, seq_len, vocab_size]

---

## ⚙️ Requirements

```bash
pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install timm    # For the DeiT vision encoder
pip install Pillow  # For image processing
```

---

## 📜 License

Apache 2.0 License

---

## 🔗 Links

- **GitHub Repository:** [euhidaman/MicroVLM-V](https://github.com/euhidaman/MicroVLM-V)
- **Paper:** Coming soon
- **Demo:** Coming soon

---

## ⚠️ Limitations

- **Training in Progress:** This model is still under active training.
- **Checkpoint Volatility:** Only the latest epoch is preserved; download it if needed.
- **Stage-Specific:** Capabilities depend on the training stage:
  - Stage 1: Alignment only, no generation
  - Stage 2: Basic generation with memory
  - Stage 3: Full capabilities

---

## 📧 Contact

For questions or issues, please open an issue on GitHub.

---

*Last updated: Epoch 0/10 - 2025-12-19*