EmberVLM: Small (~137M parameters)
Efficient Vision-Language Model for Edge Deployment & Robotic Applications
This model is currently in training: Stage 2, epoch 5.
Current Training Status
- Stage: Multimodal Instruction Tuning - Following complex instructions
- Epoch: 5
- Last Updated: 2026-02-07 00:12:34 UTC
Latest Metrics
- instruction_loss: 0.0000
- loss: 2.0480
Model Architecture
- Size: Small (~137M parameters)
- Total Parameters: 206,720,658
- Trainable Parameters: 50,149,074 (24.3%)
- Vision Encoder: dinov2_small
- Language Model: SmolLM-135M (135M params)
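The trainable fraction above reflects which modules are frozen during the current stage. The snippet below is a generic PyTorch way to verify the counts on a loaded checkpoint (the loading call mirrors the Usage section); note that a fresh from_pretrained load may report everything as trainable unless the stage-specific freezing is re-applied.

from embervlm import EmberVLM  # package shown in the Usage section below

# Generic nn.Module parameter accounting; not an EmberVLM-specific API.
model = EmberVLM.from_pretrained("euhidaman/embervlm-small")
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total={total:,} trainable={trainable:,} ({100 * trainable / total:.1f}%)")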
Training Curriculum
EmberVLM follows a 4-stage training curriculum:
- Stage 1: Visual-Language Alignment - Grounding vision and language
- Stage 2: Multimodal Instruction Tuning - Following instructions
- Stage 3: Robot Fleet Selection - Task-robot matching
- Stage 4: Chain-of-Thought Reasoning - Reasoning generation
Current Stage: Stage 2
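For reference, the schedule can be written down as plain data. The snippet below only restates the stage names and objectives from the list above (plus the focal-loss and RL details listed under Training Details); the keys are illustrative and are not the project's actual config schema.

# Illustrative restatement of the curriculum; keys are assumptions, not EmberVLM's config format.
CURRICULUM = [
    {"stage": 1, "name": "Visual-Language Alignment", "objective": "ground vision and language"},
    {"stage": 2, "name": "Multimodal Instruction Tuning", "objective": "follow instructions"},
    {"stage": 3, "name": "Robot Fleet Selection", "objective": "task-robot matching", "loss": "focal"},
    {"stage": 4, "name": "Chain-of-Thought Reasoning", "objective": "reasoning generation", "method": "reinforcement learning"},
]
CURRENT_STAGE = 2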
Usage
from transformers import AutoTokenizer
from embervlm import EmberVLM
from PIL import Image
# Load model and tokenizer
model = EmberVLM.from_pretrained("euhidaman/embervlm-small")
tokenizer = AutoTokenizer.from_pretrained("euhidaman/embervlm-small")
# Load image
image = Image.open("scene.jpg")
# Generate response
prompt = "<image>Describe what you see and select the best robot for this task."
outputs = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
print(outputs)
Training Details
- Vision Backbone: dinov2_small
- Language Backbone: smollm_135m
- Optimization: AdamW with cosine learning rate schedule
- Mixed Precision: bfloat16
- Distributed Training: Multi-GPU with DDP
- Class Balancing: Focal loss for robot selection (Stage 3; a minimal sketch follows this list)
- Reasoning: Chain-of-thought with reinforcement learning (Stage 4)
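Stage 3 balances rare robot classes with a focal loss. A minimal multi-class sketch is shown below; the gamma value and the optional per-class alpha weights are illustrative defaults, not EmberVLM's actual hyperparameters.

import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    # Down-weights well-classified examples so rare robot classes contribute more to the gradient.
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, targets, weight=alpha, reduction="none")
    pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    return ((1.0 - pt) ** gamma * ce).mean()

# Hypothetical usage: robot_logits is [batch, num_robots], robot_labels is [batch].
# loss = focal_loss(robot_logits, robot_labels)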
Environmental Impact
This model is designed for edge deployment to minimize energy consumption.
Intended Use
- Primary: Edge deployment on resource-constrained devices
- Applications:
- Robotic vision-language understanding
- Real-time multimodal reasoning
- Robot fleet selection and task planning
- Mobile/embedded AI systems
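For CPU-bound or embedded targets, a common post-training step is dynamic int8 quantization of the linear layers. The sketch below uses generic PyTorch tooling; it is not an EmberVLM-specific export path, and whether the quantized module still supports the generate call from the Usage section is an assumption.

import torch
from embervlm import EmberVLM

# Generic PyTorch dynamic quantization: nn.Linear weights stored as int8,
# activations quantized on the fly at inference time.
model = EmberVLM.from_pretrained("euhidaman/embervlm-small").eval()
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)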
Limitations
- The model is still in training; performance will improve as training progresses
- Optimized for efficiency over maximum accuracy
- Best suited for edge/mobile deployment scenarios
- Training focused on robot-centric scenarios
Citation
@software{embervlm_2026,
  title  = {EmberVLM: Efficient Vision-Language Model for Edge Deployment},
  author = {EmberVLM Team},
  year   = {2026},
  url    = {https://huggingface.co/euhidaman/embervlm-small}
}
License
Apache 2.0
Note: This is a checkpoint from Stage 2 training (epoch 5). The model will be updated after each epoch with improved performance.