---
language:
- en
license: apache-2.0
tags:
- vision-language
- multimodal
- robotics
- edge-deployment
- tiny-vlm
- repvit
- tinyllm
- stage2
base_model:
- tinyllm
library_name: transformers
pipeline_tag: image-text-to-text
---
# EmberVLM: Tiny (~35M parameters)
**πŸ”₯ Efficient Vision-Language Model for Edge Deployment & Robotic Applications**
This model is currently in training: **Stage 2 (Epoch 1)**.
## πŸ“Š Current Training Status
- **Stage**: Multimodal Instruction Tuning - Following complex instructions
- **Epoch**: 1
- **Last Updated**: 2026-02-01 16:01:18 UTC
### Latest Metrics
- **instruction_loss**: 0.0000
- **loss**: 5.2714
## πŸ—οΈ Model Architecture
- **Size**: Tiny (~35M parameters)
- **Total Parameters**: 40,196,257
- **Trainable Parameters**: 26,212,929 (65.2%; these counts can be verified as sketched below)
- **Vision Encoder**: RepViT-M0.9 (~5M params)
- **Language Model**: TinyLLM-30M (30M params)
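The total parameter count exceeds the sum of the two named backbones, which suggests additional parameters in the vision-language connector; the exact breakdown is not published in this card. As a minimal sketch (assuming the model is loaded as in the Usage section below), the counts can be checked with standard PyTorch:

```python
# Assumes `model` is an EmberVLM instance loaded as shown in the Usage section.
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters:     {total:,}")
print(f"Trainable parameters: {trainable:,} ({trainable / total:.1%})")
```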
## 🎯 Training Curriculum
EmberVLM follows a 4-stage training curriculum:
1. βœ… **Stage 1: Visual-Language Alignment** - Grounding vision and language
2. πŸ”„ **Stage 2: Multimodal Instruction Tuning** - Following instructions (in progress)
3. ⏳ **Stage 3: Robot Fleet Selection** - Task-robot matching
4. ⏳ **Stage 4: Chain-of-Thought Reasoning** - Reasoning generation
**Current Stage**: Stage 2
## πŸ’» Usage
```python
from transformers import AutoTokenizer
from PIL import Image

from embervlm import EmberVLM

# Load model and tokenizer
model = EmberVLM.from_pretrained("euhidaman/embervlm-tiny")
tokenizer = AutoTokenizer.from_pretrained("euhidaman/embervlm-tiny")

# Load image
image = Image.open("scene.jpg")

# Generate response
prompt = "<image>Describe what you see and select the best robot for this task."
outputs = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
print(outputs)
```
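Note that `EmberVLM` is imported from the project's own `embervlm` package rather than from `transformers` directly, so that package must be installed alongside `transformers` and `Pillow` for the snippet above to run.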
## πŸŽ“ Training Details
- **Vision Backbone**: repvit
- **Language Backbone**: tinyllm
- **Optimization**: AdamW with a cosine learning-rate schedule (sketched below)
- **Mixed Precision**: bfloat16
- **Distributed Training**: Multi-GPU with DDP
- **Class Balancing**: Focal loss for robot selection in Stage 3 (sketched below)
- **Reasoning**: Chain-of-thought with reinforcement learning (Stage 4)
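The training code itself is not shipped with this checkpoint, but the optimizer and precision settings above map onto standard PyTorch primitives. A minimal sketch of that setup, assuming a generic training loop; the learning rate, weight decay, and step count here are illustrative, not the values used in training:

```python
import torch

# Illustrative hyperparameters; the actual values are not published in this card.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)

for batch in dataloader:  # assumed DataLoader yielding model inputs
    optimizer.zero_grad()
    # bfloat16 mixed precision, as listed above
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss  # assumes an HF-style output with a .loss field
    loss.backward()
    optimizer.step()
    scheduler.step()
```

For multi-GPU training, the same loop applies with the model wrapped in `torch.nn.parallel.DistributedDataParallel`.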
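For the Stage 3 class balancing, the card names focal loss but does not show the model's own implementation. A generic sketch of the standard formulation (the `gamma` value is an assumption):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Down-weights easy examples so rare robot classes contribute
    more to the gradient than abundant ones."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)  # probability assigned to the true class
    return ((1.0 - pt) ** gamma * ce).mean()
```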
## 🌍 Environmental Impact
This model is designed for edge deployment to minimize energy consumption.
## 🎯 Intended Use
- **Primary**: Edge deployment on resource-constrained devices (see the quantization sketch after this list)
- **Applications**:
- Robotic vision-language understanding
- Real-time multimodal reasoning
- Robot fleet selection and task planning
- Mobile/embedded AI systems
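For the edge-deployment scenarios above, post-training quantization is a common way to shrink the memory footprint further. It is not part of this release's tooling; the following is a generic sketch using stock PyTorch dynamic quantization, which converts `nn.Linear` weights to int8:

```python
import torch

# Generic PyTorch dynamic quantization; not EmberVLM-specific tooling.
quantized = torch.ao.quantization.quantize_dynamic(
    model,              # the loaded EmberVLM model
    {torch.nn.Linear},  # quantize only linear layers
    dtype=torch.qint8,
)
```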
## ⚠️ Limitations
- Model is still in training - performance will improve as training progresses
- Optimized for efficiency over maximum accuracy
- Best suited for edge/mobile deployment scenarios
- Training focused on robot-centric scenarios
## πŸ“š Citation
```bibtex
@software{embervlm_2026,
  title  = {EmberVLM: Efficient Vision-Language Model for Edge Deployment},
  author = {EmberVLM Team},
  year   = {2026},
  url    = {https://huggingface.co/euhidaman/embervlm-tiny}
}
```
## πŸ“ License
Apache 2.0
---
**Note**: This is a checkpoint from Stage 2 training (epoch 1).
The model will be updated after each epoch with improved performance.