---
license: apache-2.0
tags:
  - vision-language
  - multimodal
  - episodic-memory
  - fiber-alignment
  - qwen2
  - deit
  - pytorch
library_name: transformers
datasets:
  - conceptual-12m
pipeline_tag: image-to-text
---

# MicroVLM-V: Vision-Language Model with FIBER Alignment & Episodic Memory

## 📋 Model Overview

MicroVLM-V is a compact vision-language model (~215 MB) that combines:

- **Vision Encoder**: DeiT-Tiny (5.7M params)
- **Language Model**: Qwen2.5-0.5B (4-bit quantized, 315M params)
- **Alignment**: FIBER fusion at layers [6, 8, 10]
- **Episodic Memory**: Larimar GPM (512 slots, 4.8M params)

**Checkpoint**: `best` (best alignment model)

---

## 📊 Model Architecture

### Parameter Distribution

| Component | Total Parameters | Trainable | Status |
|-----------|-----------------|-----------|--------|
| **Total Model** | **334.5M** | **13.8M** | **4.1% trainable** |
| Vision Encoder | 8.8M | 3.3M | FIBER fusion trainable |
| Language Model | 315.1M | 0 | Frozen (4-bit) |
| Multimodal Adapter | 5.0M | 5.0M | Fully trainable |
| Episodic Memory | 4.8M | 4.8M | Fully trainable |

### Quantization Status

| Component | Quantization |
|-----------|--------------|
| Vision Encoder | FP16 |
| Language Model | 4-bit ✓ |
| Episodic Memory | FP32 |

**Estimated Model Size**: ~214.6 MB

---

## 🏋️ Training Details

### Configuration

- **Dataset**: CC12M (Conceptual 12M), 3M training samples
- **Batch Size**: 512
- **Training Time**: ~0.64 hours on 2x A100 80GB
- **Throughput**: ~332 samples/sec
- **Total FLOPs**: 2088 PFLOPs

### FIBER Alignment

- **Mode**: Fusion-in-Backbone (FIBER-style)
- **Fusion Layers**: [6, 8, 10]
- **ITC Weight**: 1.0
- **ITM Weight**: 0.5 (combined with ITC as sketched below)
- **ITC Queue Size**: 256
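For orientation, the sketch below shows one common way an image-text contrastive (ITC) loss with a negatives queue and a binary image-text matching (ITM) loss can be combined with the weights listed above. It is a minimal illustrative sketch, not code from the MicroVLM-V repository: the function names, batch size, embedding dimension, and random stand-in tensors are assumptions; only the loss weights (1.0 / 0.5) and the queue size (256) come from this card.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes for illustration: batch of 8, embedding dim 256, queue of 256.
B, D, Q = 8, 256, 256

def itc_loss(img_emb, txt_emb, queue_img, queue_txt, temperature=0.07):
    """InfoNCE-style image-text contrastive (ITC) loss with a negatives queue."""
    txt_candidates = torch.cat([txt_emb, queue_txt], dim=0)   # (B+Q, D)
    img_candidates = torch.cat([img_emb, queue_img], dim=0)   # (B+Q, D)
    logits_i2t = img_emb @ txt_candidates.t() / temperature   # (B, B+Q)
    logits_t2i = txt_emb @ img_candidates.t() / temperature   # (B, B+Q)
    # The positive for row i sits at column i (in-batch pairs precede the queue).
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits_i2t, targets)
                  + F.cross_entropy(logits_t2i, targets))

def itm_loss(match_logits, match_labels):
    """Binary image-text matching (ITM) loss over fused image-text pairs."""
    return F.cross_entropy(match_logits, match_labels)

# Random stand-ins for projected, L2-normalized image/text features.
img_emb = F.normalize(torch.randn(B, D), dim=-1)
txt_emb = F.normalize(torch.randn(B, D), dim=-1)
queue_img = F.normalize(torch.randn(Q, D), dim=-1)
queue_txt = F.normalize(torch.randn(Q, D), dim=-1)

# Random stand-ins for ITM head outputs (label 1 = matched pair, 0 = mismatched).
match_logits = torch.randn(2 * B, 2)
match_labels = torch.randint(0, 2, (2 * B,))

# Weighted combination using the ITC/ITM weights from this card.
ITC_WEIGHT, ITM_WEIGHT = 1.0, 0.5
alignment_loss = (ITC_WEIGHT * itc_loss(img_emb, txt_emb, queue_img, queue_txt)
                  + ITM_WEIGHT * itm_loss(match_logits, match_labels))
print(f"alignment loss: {alignment_loss.item():.4f}")
```

In FIBER-style training, the pairs scored by the ITM head would be fused image-text representations produced by the cross-attention inserted at the backbone layers listed above; the sketch abstracts that fusion away and feeds the head random stand-ins.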
### Training Metrics (Best Checkpoint)

- **Best Alignment Similarity**: 0.0249 (step 25)
- **Final ITM Loss**: ~0.53
- **Final Token Loss**: ~0.056
- **Training stopped**: Early stopping at step 1500 (alignment plateau)

---

## 💻 Usage

### Loading the Model

```python
import torch

# Load checkpoint
checkpoint = torch.load('model.pt', map_location='cpu')

# Access model state dict
model_state = checkpoint['model_state_dict']

# Get training info
print(f"Global step: {checkpoint.get('global_step', 'N/A')}")
print(f"Best alignment: {checkpoint.get('best_correct_sim', 'N/A')}")
```

### Inference Example

```python
import torch
import torchvision.transforms as transforms
from PIL import Image
from transformers import AutoTokenizer

# Prepare image
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

image = Image.open('example.jpg').convert('RGB')
image_tensor = transform(image).unsqueeze(0)

# Tokenize a text prompt (the Qwen2.5-0.5B tokenizer is assumed here,
# matching the frozen language backbone)
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')
tokens = tokenizer('a photo of', return_tensors='pt')

# Forward pass (assumes `model` has been constructed from the MicroVLM-V
# repository and loaded with the checkpoint weights shown above)
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']
    )
```

---

## 📁 Repository Contents

- `model.pt` - Best alignment checkpoint
- `statistics.json` - Training statistics
- `config.json` - Model configuration
- `README.md` - This model card

---

## ⚙️ Requirements

```bash
pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install timm          # For DeiT vision encoder
pip install bitsandbytes  # For 4-bit quantization
```

---

## 📜 License

Apache 2.0 License

---

## 🔗 Links

- **GitHub Repository**: [euhidaman/MicroVLM-V](https://github.com/euhidaman/MicroVLM-V)
- **Branch**: FocusedAttention

---

## ⚠️ Limitations

- This is the **Stage 1 alignment checkpoint**, focused on vision-language alignment
- Best for: image-text matching and alignment tasks
- May need further fine-tuning for generation tasks

---

*Uploaded: 2025-12-08 14:53:01 UTC*