---
language:
- en
license: apache-2.0
tags:
- vision-language
- multimodal
- robotics
- edge-deployment
- tiny-vlm
- repvit
- tinyllm
- stage2
base_model:
- tinyllm
library_name: transformers
pipeline_tag: image-text-to-text
---

# EmberVLM: Tiny Vision-Language Model (~35M parameters)

**Efficient Vision-Language Model for Edge Deployment & Robotic Applications**

This model is currently in training: **Stage 2 (Epoch 1)**.

## Current Training Status

- **Stage**: Multimodal Instruction Tuning (following complex instructions)
- **Epoch**: 1
- **Last Updated**: 2026-02-01 16:01:18 UTC

### Latest Metrics

- **instruction_loss**: 0.0000
- **loss**: 5.2714

## Model Architecture

- **Size**: Tiny (~35M parameters)
- **Total Parameters**: 40,196,257
- **Trainable Parameters**: 26,212,929 (65.2%)
- **Vision Encoder**: RepViT-M0.9 (~5M params)
- **Language Model**: TinyLLM-30M (30M params)
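
As a quick sanity check on the numbers above, the trainable fraction follows directly from the two reported parameter counts:

```python
# Parameter counts reported for this checkpoint.
total_params = 40_196_257
trainable_params = 26_212_929

# Fraction of weights updated in this stage; the remaining
# ~14M parameters are kept frozen.
trainable_fraction = trainable_params / total_params
print(f"{trainable_fraction:.1%}")  # → 65.2%
```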

## Training Curriculum

EmberVLM follows a 4-stage training curriculum:

1. ✅ **Stage 1: Visual-Language Alignment** - Grounding vision and language
2. ✅ **Stage 2: Multimodal Instruction Tuning** - Following instructions
3. ✅ **Stage 3: Robot Fleet Selection** - Task-robot matching
4. ⏳ **Stage 4: Chain-of-Thought Reasoning** - Reasoning generation

**Current Stage**: Stage 2

## Usage

```python
from transformers import AutoTokenizer
from embervlm import EmberVLM
from PIL import Image

# Load model and tokenizer
model = EmberVLM.from_pretrained("euhidaman/embervlm-tiny")
tokenizer = AutoTokenizer.from_pretrained("euhidaman/embervlm-tiny")

# Load image
image = Image.open("scene.jpg")

# Generate response
prompt = "<image>Describe what you see and select the best robot for this task."
outputs = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256,
)

print(outputs)
```

## Training Details

- **Vision Backbone**: RepViT
- **Language Backbone**: TinyLLM
- **Optimization**: AdamW with a cosine learning-rate schedule
- **Mixed Precision**: bfloat16
- **Distributed Training**: Multi-GPU with DDP
- **Class Balancing**: Focal loss for robot selection (Stage 3)
- **Reasoning**: Chain-of-thought with reinforcement learning (Stage 4)
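
The focal-loss term used for class balancing in Stage 3 can be sketched as follows. This is a minimal single-example version for illustration only; the actual training code and its hyperparameters (such as `gamma`) are not published here.

```python
import math

def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one example: down-weights confident (easy)
    predictions so that rare robot classes dominate the gradient.

    probs  -- softmax probabilities over robot classes
    target -- index of the true class
    gamma  -- focusing parameter (2.0 is a common default)
    """
    pt = probs[target]  # probability assigned to the true class
    return -((1.0 - pt) ** gamma) * math.log(pt)

# An easy example (model already confident) is strongly down-weighted,
# while a hard, misclassified example keeps a large loss.
easy = focal_loss([0.05, 0.90, 0.05], target=1)
hard = focal_loss([0.70, 0.20, 0.10], target=1)
print(f"easy: {easy:.4f}, hard: {hard:.4f}")  # → easy: 0.0011, hard: 1.0300
```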

## Environmental Impact

This model is designed for edge deployment, which minimizes energy consumption at inference time.

## Intended Use

- **Primary**: Edge deployment on resource-constrained devices
- **Applications**:
  - Robotic vision-language understanding
  - Real-time multimodal reasoning
  - Robot fleet selection and task planning
  - Mobile/embedded AI systems

## Limitations

- The model is still in training; performance will improve as training progresses
- Optimized for efficiency over maximum accuracy
- Best suited for edge/mobile deployment scenarios
- Training is focused on robot-centric scenarios

## Citation

```bibtex
@software{embervlm_2026,
  title  = {EmberVLM: Efficient Vision-Language Model for Edge Deployment},
  author = {EmberVLM Team},
  year   = {2026},
  url    = {https://huggingface.co/euhidaman/embervlm-tiny}
}
```

## License

Apache 2.0

---

**Note**: This is a checkpoint from Stage 2 training (Epoch 1). The model will be updated after each epoch with improved performance.
|