# EmberVLM: Tiny (~35M parameters)

Efficient Vision-Language Model for Edge Deployment & Robotic Applications

This model is currently in training: Stage 2 (epoch 1).
## Current Training Status
- Stage: Multimodal Instruction Tuning - Following complex instructions
- Epoch: 1
- Last Updated: 2026-02-01 16:01:18 UTC
### Latest Metrics
- instruction_loss: 0.0000
- loss: 5.2714
## Model Architecture
- Size: Tiny (~35M parameters)
- Total Parameters: 40,196,257
- Trainable Parameters: 26,212,929 (65.2%)
- Vision Encoder: RepViT-M0.9 (~5M params)
- Language Model: TinyLLM-30M (30M params)
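As a quick sanity check, the trainable fraction reported above follows directly from the two parameter counts:

```python
# Parameter counts taken from the model card above.
total_params = 40_196_257
trainable_params = 26_212_929

# Trainable fraction, rounded to one decimal place.
trainable_pct = round(100 * trainable_params / total_params, 1)
print(f"{trainable_pct}%")  # 65.2%
```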
## Training Curriculum

EmberVLM follows a 4-stage training curriculum:

- Stage 1: Visual-Language Alignment - grounding vision and language (complete)
- Stage 2: Multimodal Instruction Tuning - following instructions (in progress)
- Stage 3: Robot Fleet Selection - task-robot matching (pending)
- Stage 4: Chain-of-Thought Reasoning - reasoning generation (pending)

Current Stage: Stage 2 (Multimodal Instruction Tuning)
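The staged curriculum above can be sketched as an ordered config. The stage names come from this card; the dictionary structure and `stage_name` helper are illustrative only, not EmberVLM's actual training code:

```python
# Hypothetical sketch of the 4-stage curriculum as an ordered config.
CURRICULUM = [
    {"stage": 1, "name": "Visual-Language Alignment"},
    {"stage": 2, "name": "Multimodal Instruction Tuning"},
    {"stage": 3, "name": "Robot Fleet Selection"},
    {"stage": 4, "name": "Chain-of-Thought Reasoning"},
]

def stage_name(stage: int) -> str:
    """Look up a curriculum stage's name by its 1-based index."""
    return next(s["name"] for s in CURRICULUM if s["stage"] == stage)

print(stage_name(2))  # Multimodal Instruction Tuning
```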
## Usage

```python
from PIL import Image
from transformers import AutoTokenizer

from embervlm import EmberVLM

# Load model and tokenizer
model = EmberVLM.from_pretrained("euhidaman/embervlm-tiny")
tokenizer = AutoTokenizer.from_pretrained("euhidaman/embervlm-tiny")

# Load image
image = Image.open("scene.jpg")

# Generate a response; <image> marks where the visual tokens are inserted
prompt = "<image>Describe what you see and select the best robot for this task."
outputs = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
print(outputs)
```
## Training Details

- Vision Backbone: RepViT (RepViT-M0.9)
- Language Backbone: TinyLLM (TinyLLM-30M)
- Optimization: AdamW with cosine learning rate schedule
- Mixed Precision: bfloat16
- Distributed Training: Multi-GPU with DDP
- Class Balancing: Focal loss for robot selection (Stage 3)
- Reasoning: Chain-of-thought with reinforcement learning (Stage 4)
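The focal loss used for class balancing in Stage 3 down-weights well-classified examples so that rare or hard robot classes dominate the gradient. A minimal pure-Python sketch follows; the `gamma` and `alpha` defaults are the values commonly used in the literature, not confirmed EmberVLM settings:

```python
import math

def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for the probability assigned to the true class.

    Reduces to alpha-weighted cross-entropy when gamma == 0; larger gamma
    suppresses the contribution of confident (high p_true) predictions.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

# A confidently correct prediction contributes far less loss than a hard one.
easy = focal_loss(0.95)  # well-classified example
hard = focal_loss(0.30)  # misclassified / hard example
print(easy < hard)  # True
```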
## Environmental Impact
This model is designed for edge deployment to minimize energy consumption.
## Intended Use

- Primary: Edge deployment on resource-constrained devices
- Applications:
  - Robotic vision-language understanding
  - Real-time multimodal reasoning
  - Robot fleet selection and task planning
  - Mobile/embedded AI systems
## Limitations

- The model is still in training; performance will improve as training progresses
- Optimized for efficiency over maximum accuracy
- Best suited for edge/mobile deployment scenarios
- Training focused on robot-centric scenarios
## Citation

```bibtex
@software{embervlm_2026,
  title  = {EmberVLM: Efficient Vision-Language Model for Edge Deployment},
  author = {EmberVLM Team},
  year   = {2026},
  url    = {https://huggingface.co/euhidaman/embervlm-tiny}
}
```
## License
Apache 2.0
Note: This is a checkpoint from Stage 2 training (epoch 1). The model will be updated after each epoch with improved performance.