---
license: apache-2.0
tags:
  - vision-language
  - multimodal
  - episodic-memory
  - 1.58-bit
  - pytorch
library_name: transformers
---

# MicroVLM-V: Vision-Language Model with Episodic Memory

## 🔄 Training Progress: Stage 2

**Current Status**: Epoch 0/10

⚠️ Note: This repository contains ONLY the latest checkpoint. Each epoch overwrites previous weights.
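
Because each epoch overwrites `model.pt`, pin a local copy as soon as you need one; for example with `huggingface_hub` (the repo id below is inferred from this page and may differ):

```python
from huggingface_hub import hf_hub_download

# Grab the current weights before the next epoch replaces them.
local_path = hf_hub_download(
    repo_id='euhidaman/MicroVLM-V-stage2',  # inferred repo id; adjust if needed
    filename='model.pt',
)
```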


## 📊 Model Architecture

### Parameter Distribution

| Component          | Total Parameters | Trainable | % of Total |
|--------------------|------------------|-----------|------------|
| Total Model        | 513.77M          | 79.39M    | 15.5%      |
| Vision Encoder     | 8.79M            | 8.79M     | 1.7%       |
| Language Model     | 494.03M          | 59.65M    | 96.2%      |
| Multimodal Adapter | 5.04M            | 5.04M     | 1.0%       |
| Episodic Memory    | 5.23M            | 5.23M     | 1.0%       |

Component rows report each component's share of the model's 513.77M total parameters; the Total Model row reports the trainable fraction (79.39M of 513.77M ≈ 15.5%).
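
These counts can be reproduced with a short PyTorch helper (a minimal sketch; `model` stands for the assembled network):

```python
def param_stats(model):
    """Total vs. trainable parameter counts for a module, plus trainable %."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable, 100.0 * trainable / total
```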

### Technical Specifications

- **Vision Encoder**: DeiT-Tiny (192-dim embeddings)
  - Quantization: FP16
  - Status: Trainable
- **Language Model**: Qwen2.5-0.5B (896-dim embeddings)
  - Quantization: FP16
  - Trainable Parameters: 59.65M (in the unfrozen layers)
- **Multimodal Adapter** (sketched after this list):
  - Architecture: Linear projection + Layer Norm
  - Mapping: 192-dim (vision) → 896-dim (language)
  - Parameters: 5.04M
- **Episodic Memory** (second sketch after this list):
  - Type: BitLinear 1.58-bit quantized
  - Quantization: Enabled
  - Parameters: 5.23M
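
A minimal sketch of the adapter as described (Linear projection + Layer Norm, mapping 192 → 896). The class and argument names are illustrative, and the stated 5.04M parameter count suggests the released adapter is wider or deeper than this single layer:

```python
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    """Projects DeiT-Tiny features (192-dim) into the Qwen2.5 space (896-dim)."""

    def __init__(self, vision_dim: int = 192, lang_dim: int = 896):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lang_dim)  # linear projection
        self.norm = nn.LayerNorm(lang_dim)           # layer norm on the output

    def forward(self, vision_features):
        # vision_features: [B, num_patches, 192] -> [B, num_patches, 896]
        return self.norm(self.proj(vision_features))
```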
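
The memory's 1.58-bit weights follow the BitNet b1.58 idea of ternarizing weights to {-1, 0, +1} with an absmean scale; below is a generic sketch of that quantizer, not the repo's actual BitLinear implementation:

```python
import torch

def weight_quant_1p58(w: torch.Tensor) -> torch.Tensor:
    """BitNet b1.58-style absmean quantization: weights -> scale * {-1, 0, +1}."""
    scale = w.abs().mean().clamp(min=1e-5)
    return (w / scale).round().clamp(-1, 1) * scale

def ste_quant(w: torch.Tensor) -> torch.Tensor:
    # Straight-through estimator: quantized values in the forward pass,
    # full-precision gradients in the backward pass.
    return w + (weight_quant_1p58(w) - w).detach()
```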

### Model Size

- Estimated Size: 1026.85 MB
- Memory Footprint: ~1540 MB (with activations)
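
As a sanity check, the size is consistent with storing the full parameter count in FP16 (a back-of-envelope estimate, assuming 2 bytes per parameter):

```python
params = 513.77e6           # total parameters from the table above
print(params * 2 / 1e6)     # ≈ 1027 MB at FP16, in line with 1026.85 MB
```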

## 🎯 Training Methodology

### Stage 2 Configuration

**Focus**: Episodic memory integration

**Training Strategy**:

- Vision encoder: Frozen
- Language model: Partially unfrozen (last 2 layers)
- Multimodal adapter: Trainable (initialized from Stage 1)
- Episodic memory: Enabled (1.58-bit quantization)
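
In PyTorch, this selective freezing might look as follows; the attribute names (`vision_encoder`, `language_model`, `adapter`, `memory`) are assumptions, not the repo's actual module names:

```python
# Freeze everything, then re-enable the Stage 2 trainable parts.
for p in model.parameters():
    p.requires_grad = False

for layer in model.language_model.layers[-2:]:  # last two LM blocks
    for p in layer.parameters():
        p.requires_grad = True

for module in (model.adapter, model.memory):    # adapter (from Stage 1) + memory
    for p in module.parameters():
        p.requires_grad = True
```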

**Loss Function**: Alignment + Memory losses

- Continues alignment refinement
- Adds memory read/write/retrieval objectives
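
Using the output keys documented under Usage below, the combined objective might be assembled like this (the weights are illustrative; the actual coefficients are not published here):

```python
lambda_align, lambda_mem = 1.0, 1.0  # illustrative weights, not the repo's values

loss = (outputs['lm_loss']
        + lambda_align * outputs['alignment_loss']
        + lambda_mem * outputs['memory_loss'])
```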

**Hyperparameters**:

- Learning Rate: 0.0001
- Batch Size: 112
- Warmup Steps: 1000
- Gradient Clipping: 0.5
- Optimizer: AdamW
- Scheduler: Cosine
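
Wired up in PyTorch, these settings might look like the following sketch (`total_steps` is assumed, since it depends on the dataset size):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

total_steps = 10_000  # assumed; depends on dataset size and epoch count
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=total_steps)

# Per optimization step, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```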

**Hardware**:

- GPU: NVIDIA RTX 6000 Ada (48GB)
- Precision: Mixed FP16/FP32
- Distributed: Single GPU
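
Mixed FP16/FP32 usually means the standard torch AMP loop; a sketch with the 0.5 max-norm clipping from above folded in:

```python
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():                      # FP16 forward pass
    outputs = model(images=images, input_ids=input_ids,
                    attention_mask=attention_mask)
    loss = outputs['lm_loss']

scaler.scale(loss).backward()
scaler.unscale_(optimizer)                           # clip unscaled gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```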

## 📈 Training Statistics (Epoch 0)

**Latest Metrics**:

- Training Loss: 1.7374
- Alignment Loss: 0.0000
- Learning Rate: 9.79e-05
- Gradient Norm: 0.0000

**Timestamp**: 2025-12-19 21:35:56 UTC


## 💻 Usage

### Loading the Model

```python
import torch

# Instantiate the architecture first; the model class ships with this repo's
# training code (placeholder name below).
# model = MicroVLM(...)

# Load the latest checkpoint
checkpoint = torch.load('model.pt', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
```

### Inference Example

```python
from PIL import Image
import torchvision.transforms as transforms
from transformers import AutoTokenizer

# The language model is Qwen2.5-0.5B, so its tokenizer is assumed here
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')

# Prepare image
image = Image.open('example.jpg').convert('RGB')
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
image_tensor = transform(image).unsqueeze(0).to(device)

# Prepare text
text = "A photo of a cat"
tokens = tokenizer(text, return_tensors='pt', padding=True).to(device)

# Forward pass
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']
    )
```

### Model Input/Output Format

**Inputs**:

- `images`: Tensor `[B, 3, 224, 224]` - normalized RGB images
- `input_ids`: Tensor `[B, seq_len]` - tokenized text
- `attention_mask`: Tensor `[B, seq_len]` - attention mask

**Outputs**:

- `lm_loss`: Language modeling loss (if labels are provided)
- `alignment_loss`: Vision-language alignment loss
- `memory_loss`: Episodic memory loss (Stage 2/3 only)
- `logits`: Next-token predictions `[B, seq_len, vocab_size]`
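
For example, a greedy read of the next token from those logits (assuming `outputs` is the dict described above and `tokenizer` is loaded as in the inference example):

```python
# Greedy decoding of the single next token from the documented logits
next_id = outputs['logits'][:, -1, :].argmax(dim=-1)
print(tokenizer.decode(next_id))
```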

## ⚙️ Requirements

```bash
pip install "torch>=2.0.0"        # quoted so the shell doesn't treat >= as redirection
pip install "transformers>=4.30.0"
pip install torchvision           # For the image transforms in the inference example
pip install timm                  # For the DeiT vision encoder
pip install Pillow                # For image processing
```

## 📜 License

Apache 2.0 License


## ⚠️ Limitations

- **Training in Progress**: This model is still under active training
- **Checkpoint Volatility**: Only the latest epoch is preserved; download a copy if you need to keep a specific checkpoint
- **Stage-Specific**: Capabilities depend on the training stage
  - Stage 1: Alignment only, no generation
  - Stage 2: Basic generation with memory
  - Stage 3: Full capabilities

## 📧 Contact

For questions or issues, please open an issue on GitHub.


*Last updated: Epoch 0/10 - 2025-12-19*