---
license: apache-2.0
tags:
- vision-language
- multimodal
- episodic-memory
- 1.58-bit
- pytorch
library_name: transformers
---
# MicroVLM-V: Vision-Language Model with Episodic Memory
## 🚀 Training Progress: Stage 2

**Current Status:** Epoch 0/10

> ⚠️ **Note:** This repository contains ONLY the latest checkpoint. Each epoch overwrites the previous weights.
## 🏗️ Model Architecture

### Parameter Distribution

| Component | Total Parameters | Trainable | % of Total Params |
|---|---|---|---|
| **Total Model** | 513.77M | 79.39M | 15.5%* |
| Vision Encoder | 8.79M | 8.79M | 1.7% |
| Language Model | 494.03M | 59.65M | 96.2% |
| Multimodal Adapter | 5.04M | 5.04M | 1.0% |
| Episodic Memory | 5.23M | 5.23M | 1.0% |

\*Component rows show each component's share of total parameters; the Total Model row shows the trainable fraction (79.39M / 513.77M).
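The split above can be reproduced by counting parameters per submodule. A minimal sketch, assuming `model` is an instantiated MicroVLM-V and that the submodule attribute names below (which are hypothetical) match the actual class:

```python
import torch.nn as nn

def count_params(module: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts for a module."""
    total = sum(p.numel() for p in module.parameters())
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    return total, trainable

# Attribute names below are hypothetical; adapt to the actual model class.
components = {
    "vision_encoder": model.vision_encoder,
    "language_model": model.language_model,
    "adapter": model.adapter,
    "episodic_memory": model.episodic_memory,
}
for name, sub in components.items():
    total, trainable = count_params(sub)
    print(f"{name}: {total / 1e6:.2f}M total, {trainable / 1e6:.2f}M trainable")
```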
### Technical Specifications

**Vision Encoder:** DeiT-Tiny (192-dim embeddings)
- Quantization: FP16
- Status: Trainable (frozen during Stage 2; see Training Methodology)

**Language Model:** Qwen2.5-0.5B (896-dim embeddings)
- Quantization: FP16
- Trainable Layers: 59.65M params (last 2 layers)
**Multimodal Adapter:**
- Architecture: Linear projection + LayerNorm
- Mapping: 192-dim (vision) → 896-dim (language)
- Parameters: 5.04M
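A minimal sketch of the described mapping, assuming a token-wise linear projection followed by LayerNorm. Note that a single 192→896 linear layer accounts for only about 0.17M parameters, so the actual 5.04M module presumably contains more than this illustration shows:

```python
import torch
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    """Projects DeiT-Tiny features (192-d) into Qwen2.5-0.5B embedding
    space (896-d). Illustrative sketch, not the repo's implementation."""

    def __init__(self, vision_dim: int = 192, lang_dim: int = 896):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lang_dim)
        self.norm = nn.LayerNorm(lang_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # [B, num_patches, 192] -> [B, num_patches, 896]
        return self.norm(self.proj(vision_feats))
```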
**Episodic Memory:**
- Type: BitLinear 1.58-bit quantized
- Quantization: Enabled
- Parameters: 5.23M
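Here, 1.58 bits per weight refers to ternary weights in {-1, 0, +1} (log2(3) ≈ 1.58). A minimal BitLinear-style sketch in the spirit of BitNet b1.58, using absmean weight quantization with a straight-through estimator; this is an illustration, not the repository's actual layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Linear layer with ternary (1.58-bit) weight quantization."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Absmean scaling, then round weights to {-1, 0, +1}.
        scale = self.weight.abs().mean().clamp(min=1e-5)
        w_q = (self.weight / scale).round().clamp(-1, 1) * scale
        # Straight-through estimator: forward uses w_q, gradients flow
        # to the latent full-precision weights.
        w = self.weight + (w_q - self.weight).detach()
        return F.linear(x, w, self.bias)
```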
### Model Size
- Estimated Size: 1026.85 MB (513.77M params at 2 bytes/param in FP16)
- Memory Footprint: ~1540 MB (with activations)
## 🎯 Training Methodology

### Stage 2 Configuration

**Focus:** Episodic memory integration

**Training Strategy** (see the freezing sketch after this list):
- Vision encoder: Frozen
- Language model: Partially unfrozen (last 2 layers)
- Multimodal adapter: Trainable (initialized from Stage 1)
- Episodic memory: Enabled (1.58-bit quantization)
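A minimal sketch of this freezing scheme in PyTorch; the submodule attribute names are hypothetical, and the actual setup lives in the training code:

```python
# Hypothetical attribute names; adapt to the actual MicroVLM-V class.
for p in model.vision_encoder.parameters():        # vision encoder frozen
    p.requires_grad = False

for p in model.language_model.parameters():        # freeze the LM...
    p.requires_grad = False
for layer in model.language_model.layers[-2:]:     # ...except the last 2 layers
    for p in layer.parameters():
        p.requires_grad = True

for p in model.adapter.parameters():               # trainable, from Stage 1 init
    p.requires_grad = True
for p in model.episodic_memory.parameters():       # memory enabled and trainable
    p.requires_grad = True
```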
**Loss Function:** Alignment + memory losses (combined as sketched after this list)
- Continues alignment refinement
- Adds memory read/write/retrieval objectives
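One way the Stage 2 objectives might be combined; the weights are illustrative assumptions, not values from the training configuration:

```python
# Assumes the forward pass returns the losses described under
# "Model Input/Output Format" below. Weights are assumptions.
ALIGNMENT_WEIGHT = 1.0
MEMORY_WEIGHT = 1.0

loss = (outputs["lm_loss"]
        + ALIGNMENT_WEIGHT * outputs["alignment_loss"]
        + MEMORY_WEIGHT * outputs["memory_loss"])
loss.backward()
```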
**Hyperparameters** (optimizer setup sketched after this list):
- Learning Rate: 0.0001
- Batch Size: 112
- Warmup Steps: 1000
- Gradient Clipping: 0.5
- Optimizer: AdamW
- Scheduler: Cosine
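These settings map directly onto standard PyTorch and `transformers` utilities. A sketch, where `num_training_steps` is an illustrative placeholder that depends on dataset size and epoch count:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(params, lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=100_000  # placeholder
)

# Inside the training loop:
loss.backward()
torch.nn.utils.clip_grad_norm_(params, max_norm=0.5)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```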
**Hardware:**
- GPU: NVIDIA RTX 6000 Ada (48GB)
- Precision: Mixed FP16/FP32
- Distributed: Single GPU
## 📊 Training Statistics (Epoch 0)

**Latest Metrics:**
- Training Loss: 1.7374
- Alignment Loss: 0.0000
- Learning Rate: 9.79e-05
- Gradient Norm: 0.0000

**Timestamp:** 2025-12-19 21:35:56 UTC
## 💻 Usage

### Loading the Model

```python
import torch

# The model class must be instantiated before loading weights. The import
# below is a hypothetical placeholder; see the euhidaman/MicroVLM-V
# repository for the actual module path and constructor arguments.
from microvlm import MicroVLM
model = MicroVLM()

# Load the checkpoint (download model.pt from this repo first)
checkpoint = torch.load('model.pt', map_location='cpu')

# Load model state
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
```
### Inference Example

```python
from PIL import Image
import torchvision.transforms as transforms
from transformers import AutoTokenizer

# The language model is Qwen2.5-0.5B, so its tokenizer is a reasonable
# assumption here; confirm against the repository's training code.
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')

# Prepare image (ImageNet normalization)
image = Image.open('example.jpg').convert('RGB')
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
image_tensor = transform(image).unsqueeze(0).to(device)

# Prepare text
text = "A photo of a cat"
tokens = tokenizer(text, return_tensors='pt', padding=True).to(device)

# Forward pass
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']
    )
```
### Model Input/Output Format

**Inputs:**
- `images`: Tensor `[B, 3, 224, 224]`, normalized RGB images
- `input_ids`: Tensor `[B, seq_len]`, tokenized text
- `attention_mask`: Tensor `[B, seq_len]`, attention mask

**Outputs:**
- `lm_loss`: language modeling loss (if labels provided)
- `alignment_loss`: vision-language alignment loss
- `memory_loss`: episodic memory loss (Stage 2/3 only)
- `logits`: next-token predictions `[B, seq_len, vocab_size]`
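For example, a single greedy decoding step from the returned logits (assuming the outputs are addressable like a dict, as listed above):

```python
# Pick the most likely next token from the last position's logits.
logits = outputs['logits']                       # [B, seq_len, vocab_size]
next_token_id = logits[:, -1, :].argmax(dim=-1)  # [B]
print(tokenizer.decode(next_token_id))           # decode (for B = 1)
```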
## ⚙️ Requirements

```bash
pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install timm    # For DeiT vision encoder
pip install Pillow  # For image processing
```
## 📄 License

Apache 2.0 License
## 🔗 Links

- GitHub Repository: [euhidaman/MicroVLM-V](https://github.com/euhidaman/MicroVLM-V)
- Paper: Coming soon
- Demo: Coming soon
## ⚠️ Limitations

- **Training in Progress:** This model is still under active training.
- **Checkpoint Volatility:** Only the latest epoch is preserved; download a copy if you need a specific checkpoint.
- **Stage-Specific:** Capabilities depend on the training stage:
  - Stage 1: Alignment only, no generation
  - Stage 2: Basic generation with memory
  - Stage 3: Full capabilities
## 📧 Contact

For questions or issues, please open an issue on GitHub.

*Last updated: Epoch 0/10 - 2025-12-19*