---
license: apache-2.0
tags:
  - vision-language
  - multimodal
  - episodic-memory
  - 1.58-bit
  - pytorch
library_name: transformers
---

# MicroVLM-V: Vision-Language Model with Episodic Memory

## 🔄 Training Progress: stage2

**Current Status:** Epoch 0/10

> ⚠️ **Note:** This repository contains ONLY the latest checkpoint. Each epoch overwrites the previous weights.

---

## 📊 Model Architecture

### Parameter Distribution

| Component | Total Parameters | Trainable | Percentage |
|-----------|-----------------|-----------|------------|
| **Total Model** | **513.77M** | **79.39M** | **15.5%** |
| Vision Encoder | 8.79M | 8.79M | 1.7% |
| Language Model | 494.03M | 59.65M | 96.2% |
| Multimodal Adapter | 5.04M | 5.04M | 1.0% |
| Episodic Memory | 5.23M | 5.23M | 1.0% |

For the component rows, the percentage is that component's share of total parameters; for the Total Model row, it is the trainable fraction (79.39M / 513.77M ≈ 15.5%).

### Technical Specifications

- **Vision Encoder:** DeiT-Tiny (192-dim embeddings)
  - Quantization: FP16
  - Status: Trainable
- **Language Model:** Qwen2.5-0.5B (896-dim embeddings)
  - Quantization: FP16
  - Trainable Layers: last 2 layers (59.65M params)
- **Multimodal Adapter:**
  - Architecture: Linear projection + Layer Norm
  - Mapping: 192-dim (vision) → 896-dim (language)
  - Parameters: 5.04M (a minimal sketch follows this list)
- **Episodic Memory:**
  - Type: BitLinear 1.58-bit quantized
  - Quantization: Enabled
  - Parameters: 5.23M (see the BitLinear sketch below)
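
The adapter's stated structure (a linear projection followed by layer norm, mapping 192-dim vision features to the 896-dim language space) is simple enough to show in code. Below is a minimal sketch under those stated dimensions; the class name `MultimodalAdapter` is a hypothetical stand-in, and since a single 192 → 896 linear layer plus layer norm accounts for only ~0.17M parameters, the actual 5.04M-parameter module is likely deeper than this.

```python
import torch
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    """Minimal sketch: project DeiT-Tiny vision tokens (192-dim)
    into the Qwen2.5-0.5B embedding space (896-dim)."""

    def __init__(self, vision_dim: int = 192, lm_dim: int = 896):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)  # linear projection
        self.norm = nn.LayerNorm(lm_dim)           # layer norm on the language side

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: [B, num_tokens, 192] -> [B, num_tokens, 896]
        return self.norm(self.proj(vision_tokens))

adapter = MultimodalAdapter()
out = adapter(torch.randn(2, 196, 192))  # e.g. 196 patch tokens -> [2, 196, 896]
```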
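
The episodic memory is built from 1.58-bit quantized BitLinear layers. As a reference for what that quantization means, here is a minimal sketch in the BitNet b1.58 style: weights are mapped on the fly to ternary values {-1, 0, +1} using an absmean scale, with a straight-through estimator so the latent full-precision weights still receive gradients. This illustrates the quantization scheme only; the repository's actual memory read/write/retrieval logic is not shown here, and its BitLinear implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Module):
    """Sketch of a 1.58-bit linear layer (BitNet b1.58 style).
    Activation quantization is omitted for brevity."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-5)  # absmean scaling factor
        w_q = (w / scale).round().clamp(-1, 1)  # ternary weights {-1, 0, +1}
        # Straight-through estimator: quantized weights in the forward pass,
        # full-precision gradients for the latent weights in the backward pass.
        w_eff = w + (w_q * scale - w).detach()
        return F.linear(x, w_eff)
```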

### Model Size

- **Estimated Size:** 1026.85 MB
- **Memory Footprint:** ~1540 MB (with activations)

---

## 🎯 Training Methodology

### stage2 Configuration

**Focus:** Episodic memory integration

**Training Strategy:**

- Vision encoder: **Frozen**
- Language model: **Partially unfrozen** (last 2 layers)
- Multimodal adapter: **Trainable** (initialized from Stage 1)
- Episodic memory: **Enabled** (1.58-bit quantization)

**Loss Function:** Alignment + memory losses

- Continues alignment refinement
- Adds memory read/write/retrieval objectives

**Hyperparameters:**

- Learning Rate: 0.0001
- Batch Size: 112
- Warmup Steps: 1000
- Gradient Clipping: 0.5
- Optimizer: AdamW
- Scheduler: Cosine

**Hardware:**

- GPU: NVIDIA RTX 6000 Ada (48GB)
- Precision: Mixed FP16/FP32
- Distributed: Single GPU

---

## 📈 Training Statistics (Epoch 0)

**Latest Metrics:**

- Training Loss: 1.7374
- Alignment Loss: 0.0000
- Learning Rate: 9.79e-05
- Gradient Norm: 0.0000

**Timestamp:** 2025-12-19 21:35:56 UTC

---

## 💻 Usage

### Loading the Model

```python
import torch

# `model` must be an instance of the MicroVLM-V architecture
# (see the GitHub repository linked below); the checkpoint stores weights only.
checkpoint = torch.load('model.pt', map_location='cpu')

# Load model state
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
```

### Inference Example

```python
from PIL import Image
import torchvision.transforms as transforms
from transformers import AutoTokenizer

# Tokenizer of the base language model
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')

# Prepare image
image = Image.open('example.jpg').convert('RGB')
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
image_tensor = transform(image).unsqueeze(0).to(device)

# Prepare text
text = "A photo of a cat"
tokens = tokenizer(text, return_tensors='pt', padding=True).to(device)

# Forward pass
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']
    )
```

### Model Input/Output Format

**Inputs:**

- `images`: Tensor [B, 3, 224, 224] - normalized RGB images
- `input_ids`: Tensor [B, seq_len] - tokenized text
- `attention_mask`: Tensor [B, seq_len] - attention mask

**Outputs:**

- `lm_loss`: Language modeling loss (if labels provided)
- `alignment_loss`: Vision-language alignment loss
- `memory_loss`: Episodic memory loss (Stage 2/3 only)
- `logits`: Next-token predictions [B, seq_len, vocab_size]

---

## ⚙️ Requirements

```bash
pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install timm    # For the DeiT vision encoder
pip install Pillow  # For image processing
```

---

## 📜 License

Apache 2.0 License

---

## 🔗 Links

- **GitHub Repository:** [euhidaman/MicroVLM-V](https://github.com/euhidaman/MicroVLM-V)
- **Paper:** Coming soon
- **Demo:** Coming soon

---

## ⚠️ Limitations

- **Training in Progress:** This model is still under active training.
- **Checkpoint Volatility:** Only the latest epoch is preserved; download it if needed.
- **Stage-Specific:** Capabilities depend on the training stage:
  - Stage 1: Alignment only, no generation
  - Stage 2: Basic generation with memory
  - Stage 3: Full capabilities

---

## 📧 Contact

For questions or issues, please open an issue on GitHub.

---

*Last updated: Epoch 0/10 - 2025-12-19*