---
license: apache-2.0
tags:
  - vision-language
  - multimodal
  - episodic-memory
  - 1.58-bit
  - pytorch
library_name: transformers
---

# MicroVLM-V: Vision-Language Model with Episodic Memory

## 🔄 Training Progress: Stage 2

**Current Status**: Epoch 0/10

⚠️ Note: This repository contains ONLY the latest checkpoint. Each epoch overwrites previous weights.
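
Because each epoch overwrites `model.pt`, pin a local copy as soon as you need one; for example with `huggingface_hub` (the repo id below is inferred from this page and may differ):

```python
from huggingface_hub import hf_hub_download

# Grab the current weights before the next epoch replaces them.
local_path = hf_hub_download(
    repo_id='euhidaman/MicroVLM-V-stage2',  # inferred repo id; adjust if needed
    filename='model.pt',
)
```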


## 📊 Model Architecture

### Parameter Distribution

| Component          | Total Parameters | Trainable | % of Total |
|--------------------|------------------|-----------|------------|
| Total Model        | 513.77M          | 79.39M    | 15.5%      |
| Vision Encoder     | 8.79M            | 8.79M     | 1.7%       |
| Language Model     | 494.03M          | 59.65M    | 96.2%      |
| Multimodal Adapter | 5.04M            | 5.04M     | 1.0%       |
| Episodic Memory    | 5.23M            | 5.23M     | 1.0%       |

Component rows report each component's share of the model's 513.77M total parameters; the Total Model row reports the trainable fraction (79.39M of 513.77M ≈ 15.5%).
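
These counts can be reproduced with a short PyTorch helper (a minimal sketch; `model` stands for the assembled network):

```python
def param_stats(model):
    """Total vs. trainable parameter counts for a module, plus trainable %."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable, 100.0 * trainable / total
```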

### Technical Specifications

- **Vision Encoder**: DeiT-Tiny (192-dim embeddings)
  - Quantization: FP16
  - Status: Trainable
- **Language Model**: Qwen2.5-0.5B (896-dim embeddings)
  - Quantization: FP16
  - Trainable Parameters: 59.65M (in the unfrozen layers)
- **Multimodal Adapter** (sketched after this list):
  - Architecture: Linear projection + Layer Norm
  - Mapping: 192-dim (vision) → 896-dim (language)
  - Parameters: 5.04M
- **Episodic Memory** (second sketch after this list):
  - Type: BitLinear 1.58-bit quantized
  - Quantization: Enabled
  - Parameters: 5.23M
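
A minimal sketch of the adapter as described (Linear projection + Layer Norm, mapping 192 → 896). The class and argument names are illustrative, and the stated 5.04M parameter count suggests the released adapter is wider or deeper than this single layer:

```python
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    """Projects DeiT-Tiny features (192-dim) into the Qwen2.5 space (896-dim)."""

    def __init__(self, vision_dim: int = 192, lang_dim: int = 896):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lang_dim)  # linear projection
        self.norm = nn.LayerNorm(lang_dim)           # layer norm on the output

    def forward(self, vision_features):
        # vision_features: [B, num_patches, 192] -> [B, num_patches, 896]
        return self.norm(self.proj(vision_features))
```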
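
The memory's 1.58-bit weights follow the BitNet b1.58 idea of ternarizing weights to {-1, 0, +1} with an absmean scale; below is a generic sketch of that quantizer, not the repo's actual BitLinear implementation:

```python
import torch

def weight_quant_1p58(w: torch.Tensor) -> torch.Tensor:
    """BitNet b1.58-style absmean quantization: weights -> scale * {-1, 0, +1}."""
    scale = w.abs().mean().clamp(min=1e-5)
    return (w / scale).round().clamp(-1, 1) * scale

def ste_quant(w: torch.Tensor) -> torch.Tensor:
    # Straight-through estimator: quantized values in the forward pass,
    # full-precision gradients in the backward pass.
    return w + (weight_quant_1p58(w) - w).detach()
```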

### Model Size

- Estimated Size: 1026.85 MB
- Memory Footprint: ~1540 MB (with activations)
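
As a sanity check, the size is consistent with storing the full parameter count in FP16 (a back-of-envelope estimate, assuming 2 bytes per parameter):

```python
params = 513.77e6           # total parameters from the table above
print(params * 2 / 1e6)     # ≈ 1027 MB at FP16, in line with 1026.85 MB
```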

## 🎯 Training Methodology

### Stage 2 Configuration

**Focus**: Episodic memory integration

**Training Strategy**:

- Vision encoder: Frozen
- Language model: Partially unfrozen (last 2 layers)
- Multimodal adapter: Trainable (initialized from Stage 1)
- Episodic memory: Enabled (1.58-bit quantization)
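
In PyTorch, this selective freezing might look as follows; the attribute names (`vision_encoder`, `language_model`, `adapter`, `memory`) are assumptions, not the repo's actual module names:

```python
# Freeze everything, then re-enable the Stage 2 trainable parts.
for p in model.parameters():
    p.requires_grad = False

for layer in model.language_model.layers[-2:]:  # last two LM blocks
    for p in layer.parameters():
        p.requires_grad = True

for module in (model.adapter, model.memory):    # adapter (from Stage 1) + memory
    for p in module.parameters():
        p.requires_grad = True
```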

**Loss Function**: Alignment + Memory losses

- Continues alignment refinement
- Adds memory read/write/retrieval objectives
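
Using the output keys documented under Usage below, the combined objective might be assembled like this (the weights are illustrative; the actual coefficients are not published here):

```python
lambda_align, lambda_mem = 1.0, 1.0  # illustrative weights, not the repo's values

loss = (outputs['lm_loss']
        + lambda_align * outputs['alignment_loss']
        + lambda_mem * outputs['memory_loss'])
```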

**Hyperparameters**:

- Learning Rate: 0.0001
- Batch Size: 112
- Warmup Steps: 1000
- Gradient Clipping: 0.5
- Optimizer: AdamW
- Scheduler: Cosine
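
Wired up in PyTorch, these settings might look like the following sketch (`total_steps` is assumed, since it depends on the dataset size):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

total_steps = 10_000  # assumed; depends on dataset size and epoch count
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=total_steps)

# Per optimization step, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```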

**Hardware**:

- GPU: NVIDIA RTX 6000 Ada (48GB)
- Precision: Mixed FP16/FP32
- Distributed: Single GPU
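
Mixed FP16/FP32 usually means the standard torch AMP loop; a sketch with the 0.5 max-norm clipping from above folded in:

```python
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():                      # FP16 forward pass
    outputs = model(images=images, input_ids=input_ids,
                    attention_mask=attention_mask)
    loss = outputs['lm_loss']

scaler.scale(loss).backward()
scaler.unscale_(optimizer)                           # clip unscaled gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```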

## 📈 Training Statistics (Epoch 0)

**Latest Metrics**:

- Training Loss: 1.7374
- Alignment Loss: 0.0000
- Learning Rate: 9.79e-05
- Gradient Norm: 0.0000

**Timestamp**: 2025-12-19 21:35:56 UTC


## 💻 Usage

### Loading the Model

```python
import torch

# Instantiate the architecture first; the model class ships with this repo's
# training code (placeholder name below).
# model = MicroVLM(...)

# Load the latest checkpoint
checkpoint = torch.load('model.pt', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
```

### Inference Example

```python
from PIL import Image
import torchvision.transforms as transforms
from transformers import AutoTokenizer

# The language model is Qwen2.5-0.5B, so its tokenizer is assumed here
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')

# Prepare image
image = Image.open('example.jpg').convert('RGB')
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
image_tensor = transform(image).unsqueeze(0).to(device)

# Prepare text
text = "A photo of a cat"
tokens = tokenizer(text, return_tensors='pt', padding=True).to(device)

# Forward pass
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']
    )
```

### Model Input/Output Format

**Inputs**:

- `images`: Tensor `[B, 3, 224, 224]` - normalized RGB images
- `input_ids`: Tensor `[B, seq_len]` - tokenized text
- `attention_mask`: Tensor `[B, seq_len]` - attention mask

**Outputs**:

- `lm_loss`: Language modeling loss (if labels are provided)
- `alignment_loss`: Vision-language alignment loss
- `memory_loss`: Episodic memory loss (Stage 2/3 only)
- `logits`: Next-token predictions `[B, seq_len, vocab_size]`
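
For example, a greedy read of the next token from those logits (assuming `outputs` is the dict described above and `tokenizer` is loaded as in the inference example):

```python
# Greedy decoding of the single next token from the documented logits
next_id = outputs['logits'][:, -1, :].argmax(dim=-1)
print(tokenizer.decode(next_id))
```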

## ⚙️ Requirements

```bash
pip install "torch>=2.0.0"        # quoted so the shell doesn't treat >= as redirection
pip install "transformers>=4.30.0"
pip install torchvision           # For the image transforms in the inference example
pip install timm                  # For the DeiT vision encoder
pip install Pillow                # For image processing
```

## 📜 License

Apache 2.0 License


## ⚠️ Limitations

- **Training in Progress**: This model is still under active training
- **Checkpoint Volatility**: Only the latest epoch is preserved; download a copy if you need to keep a specific checkpoint
- **Stage-Specific**: Capabilities depend on the training stage
  - Stage 1: Alignment only, no generation
  - Stage 2: Basic generation with memory
  - Stage 3: Full capabilities

## 📧 Contact

For questions or issues, please open an issue on GitHub.


*Last updated: Epoch 0/10 - 2025-12-19*