|
|
--- |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- vision-language |
|
|
- multimodal |
|
|
- episodic-memory |
|
|
- 1.58-bit |
|
|
- pytorch |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# MicroVLM-V: Vision-Language Model with Episodic Memory |
|
|
|
|
|
## 🚀 Training Progress: stage2
|
|
|
|
|
**Current Status:** Epoch 0/10 |
|
|
|
|
|
> ⚠️ **Note:** This repository contains ONLY the latest checkpoint. Each epoch overwrites previous weights.
|
|
|
|
|
--- |
|
|
|
|
|
## 🏗 Model Architecture
|
|
|
|
|
### Parameter Distribution |
|
|
|
|
|
| Component | Total Parameters | Trainable | % of Total Params |
|-----------|-----------------|-----------|-------------------|
| **Total Model** | **513.77M** | **79.39M** | **100% (15.5% trainable)** |
| Vision Encoder | 8.79M | 8.79M | 1.7% |
| Language Model | 494.03M | 59.65M | 96.2% |
| Multimodal Adapter | 5.04M | 5.04M | 1.0% |
| Episodic Memory | 5.23M | 5.23M | 1.0% |
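
The trainable/total split above can be reproduced from a loaded model. A minimal sketch, assuming a standard `torch.nn.Module`; the `model` object and its component names are illustrative, not the repository's actual API:

```python
import torch.nn as nn

def count_parameters(module: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts for a module."""
    total = sum(p.numel() for p in module.parameters())
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    return total, trainable

# Hypothetical usage once the MicroVLM-V model object exists:
# total, trainable = count_parameters(model)
# print(f"{total / 1e6:.2f}M total, {trainable / 1e6:.2f}M trainable "
#       f"({100 * trainable / total:.1f}%)")
```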
|
|
|
|
|
### Technical Specifications |
|
|
|
|
|
- **Vision Encoder:** DeiT-Tiny (192-dim embeddings) |
|
|
- Quantization: FP16 |
|
|
- Status: Trainable |
|
|
|
|
|
- **Language Model:** Qwen2.5-0.5B (896-dim embeddings) |
|
|
- Quantization: FP16 |
|
|
- Trainable Parameters: 59.65M (last 2 layers)
|
|
|
|
|
- **Multimodal Adapter:** |
|
|
- Architecture: Linear projection + Layer Norm |
|
|
- Mapping: 192-dim (vision) → 896-dim (language)
|
|
- Parameters: 5.04M (a structural sketch follows this list)
|
|
|
|
|
- **Episodic Memory:** |
|
|
- Type: BitLinear 1.58-bit quantized |
|
|
- Quantization: Enabled |
|
|
- Parameters: 5.23M |
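
A minimal sketch of an adapter matching the specification above (linear projection plus layer normalization, 192 → 896). The class and argument names are illustrative, not the repository's actual API; note that a single linear layer of this size has far fewer than the listed 5.04M parameters, so the real adapter likely contains additional layers and this is only a structural illustration:

```python
import torch
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    """Projects DeiT-Tiny vision features (192-dim) into the
    Qwen2.5-0.5B embedding space (896-dim)."""

    def __init__(self, vision_dim: int = 192, language_dim: int = 896):
        super().__init__()
        self.proj = nn.Linear(vision_dim, language_dim)
        self.norm = nn.LayerNorm(language_dim)

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: [B, num_patches, 192] -> [B, num_patches, 896]
        return self.norm(self.proj(vision_features))
```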
|
|
|
|
|
### Model Size |
|
|
|
|
|
- **Estimated Size:** 1026.85 MB |
|
|
- **Memory Footprint:** ~1540 MB (with activations)
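
The size estimate is consistent with storing roughly 513.77M parameters in FP16 (2 bytes each); a quick back-of-the-envelope check:

```python
total_params = 513.77e6      # from the parameter table above
bytes_per_param = 2          # FP16
size_mb = total_params * bytes_per_param / 1e6
print(f"{size_mb:.2f} MB")   # ~1027.5 MB, close to the reported 1026.85 MB
```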
|
|
|
|
|
--- |
|
|
|
|
|
## 🎯 Training Methodology
|
|
|
|
|
### stage2 Configuration |
|
|
|
|
|
**Focus:** Episodic memory integration |
|
|
|
|
|
**Training Strategy:** |
|
|
- Vision encoder: **Frozen** |
|
|
- Language model: **Partially unfrozen** (last 2 layers) |
|
|
- Multimodal adapter: **Trainable** (initialized from Stage 1) |
|
|
- Episodic memory: **Enabled** (1.58-bit quantization; a BitLinear sketch follows this list)
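
The episodic memory uses BitLinear-style 1.58-bit layers. Below is a minimal sketch of ternary (1.58-bit) weight quantization in the style of BitNet b1.58, using absmean scaling and a straight-through estimator; this is a generic illustration, not the repository's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def quantize_weights_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Map weights to {-1, 0, +1} * scale using absmean scaling."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1) * scale
    # Straight-through estimator: quantized values in the forward pass,
    # full-precision gradients in the backward pass.
    return w + (w_q - w).detach()

class BitLinear(nn.Linear):
    """Linear layer whose weights are ternarized on the fly during forward."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, quantize_weights_ternary(self.weight), self.bias)
```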
|
|
|
|
|
**Loss Function:** Alignment + Memory losses |
|
|
- Continues alignment refinement |
|
|
- Adds memory read/write/retrieval objectives |
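
Based on the output names listed in the Usage section below, the Stage 2 objective plausibly combines these terms with weighting coefficients. The dictionary-style access and the coefficient values here are assumptions, shown only to illustrate how the pieces fit together:

```python
# Hypothetical combination of the loss terms; lambda values are placeholders.
lambda_align, lambda_mem = 1.0, 1.0

def total_loss(outputs: dict):
    return (outputs["lm_loss"]
            + lambda_align * outputs["alignment_loss"]
            + lambda_mem * outputs["memory_loss"])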
|
|
|
|
|
**Hyperparameters:** |
|
|
- Learning Rate: 0.0001 |
|
|
- Batch Size: 112 |
|
|
- Warmup Steps: 1000 |
|
|
- Gradient Clipping: 0.5 |
|
|
- Optimizer: AdamW
|
|
- Scheduler: cosine |
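
A sketch of how these hyperparameters translate into a standard PyTorch/Transformers training setup; the `model`, dataloader, and `num_training_steps` objects are assumed to exist and are not part of this repository's documented API:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),  # only the 79.39M trainable params
    lr=1e-4,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,
    num_training_steps=num_training_steps,  # depends on dataset size and batch size 112
)

# Inside the training loop, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```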
|
|
|
|
|
**Hardware:** |
|
|
- GPU: NVIDIA RTX 6000 Ada (48GB) |
|
|
- Precision: Mixed FP16/FP32 |
|
|
- Distributed: Single GPU |
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 Training Statistics (Epoch 0)
|
|
|
|
|
**Latest Metrics:** |
|
|
- Training Loss: 1.7374 |
|
|
- Alignment Loss: 0.0000 |
|
|
- Learning Rate: 9.79e-05 |
|
|
- Gradient Norm: 0.0000 |
|
|
|
|
|
**Timestamp:** 2025-12-19 21:35:56 UTC |
|
|
|
|
|
--- |
|
|
|
|
|
## 💻 Usage
|
|
|
|
|
### Loading the Model |
|
|
|
|
|
```python
import torch

# Instantiate the MicroVLM-V architecture first (the model code lives in the
# GitHub repository, https://github.com/euhidaman/MicroVLM-V); the class name
# below is a placeholder.
# model = MicroVLM(...)

# Load the checkpoint (download model.pt from this repository first)
checkpoint = torch.load('model.pt', map_location='cpu')

# Load model state
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
```
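
To fetch the checkpoint programmatically instead of through the web UI, `huggingface_hub` can be used (install it with `pip install huggingface_hub`). The `repo_id` below is a placeholder; substitute this model's actual Hub repository id:

```python
import torch
from huggingface_hub import hf_hub_download

# repo_id is a placeholder; filename follows the loading example above.
checkpoint_path = hf_hub_download(repo_id="<user>/MicroVLM-V", filename="model.pt")
checkpoint = torch.load(checkpoint_path, map_location='cpu')
```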
|
|
|
|
|
### Inference Example |
|
|
|
|
|
```python
import torch
import torchvision.transforms as transforms
from PIL import Image

# Prepare image
image = Image.open('example.jpg').convert('RGB')
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
image_tensor = transform(image).unsqueeze(0).to(device)

# Prepare text
# `tokenizer` is the text tokenizer matching the Qwen2.5-0.5B language model;
# `device` and `model` come from the loading example above.
text = "A photo of a cat"
tokens = tokenizer(text, return_tensors='pt', padding=True).to(device)

# Forward pass
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']
    )
```
|
|
|
|
|
### Model Input/Output Format |
|
|
|
|
|
**Inputs:** |
|
|
- `images`: Tensor [B, 3, 224, 224] - normalized RGB images
|
|
- `input_ids`: Tensor [B, seq_len] - Tokenized text |
|
|
- `attention_mask`: Tensor [B, seq_len] - Attention mask |
|
|
|
|
|
**Outputs:** |
|
|
- `lm_loss`: Language modeling loss (if labels provided) |
|
|
- `alignment_loss`: Vision-language alignment loss |
|
|
- `memory_loss`: Episodic memory loss (Stage 2/3 only) |
|
|
- `logits`: Next token predictions [B, seq_len, vocab_size] |
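
A small example of consuming the `logits` output for greedy next-token prediction, continuing the inference example above; dictionary-style access to the outputs is an assumption:

```python
# Greedy next-token prediction from the logits at the last position.
next_token_id = outputs['logits'][:, -1, :].argmax(dim=-1)   # shape [B]
next_token = tokenizer.decode(next_token_id[0].item())
print(next_token)
```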
|
|
|
|
|
--- |
|
|
|
|
|
## ⚙️ Requirements
|
|
|
|
|
```bash |
|
|
pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install timm     # For DeiT vision encoder
pip install Pillow   # For image processing
|
|
``` |
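
For reference, the DeiT-Tiny backbone named in the specifications can be instantiated through `timm`. Whether the repository constructs the encoder exactly this way is not documented here, so treat this as an illustration:

```python
import timm

# DeiT-Tiny feature extractor: 192-dim embeddings at 224x224 input,
# with the classification head removed (num_classes=0).
vision_encoder = timm.create_model('deit_tiny_patch16_224', pretrained=True, num_classes=0)
```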
|
|
|
|
|
--- |
|
|
|
|
|
## 📄 License
|
|
|
|
|
Apache 2.0 License |
|
|
|
|
|
--- |
|
|
|
|
|
## 🔗 Links
|
|
|
|
|
- **GitHub Repository:** [euhidaman/MicroVLM-V](https://github.com/euhidaman/MicroVLM-V) |
|
|
- **Paper:** Coming soon |
|
|
- **Demo:** Coming soon |
|
|
|
|
|
--- |
|
|
|
|
|
## ⚠️ Limitations
|
|
|
|
|
- **Training in Progress:** This model is still being actively trained
- **Checkpoint Volatility:** Only the latest epoch's checkpoint is kept; download it promptly if you need a specific version
|
|
- **Stage-Specific:** Capabilities depend on training stage |
|
|
- Stage 1: Alignment only, no generation |
|
|
- Stage 2: Basic generation with memory |
|
|
- Stage 3: Full capabilities |
|
|
|
|
|
--- |
|
|
|
|
|
## 📧 Contact
|
|
|
|
|
For questions or issues, please open an issue on GitHub. |
|
|
|
|
|
--- |
|
|
|
|
|
*Last updated: Epoch 0/10 - 2025-12-19* |
|
|
|