EmberVLM-Small (~137M parameters)
EmberVLM is an efficient vision-language model optimized for edge deployment and robotic applications.
Model Details
- Model Type: Vision-Language Model (VLM)
- Size: Small (~137M parameters)
- Total Parameters: 164,203,841
- Trainable Parameters: 35,943,041
- Carbon Emissions: 0.0308 kg CO2eq
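The total and trainable counts above correspond to a standard PyTorch parameter tally. The sketch below is illustrative and assumes the model loads as a torch.nn.Module via EmberVLM.from_pretrained, as in the Usage section further down.
from embervlm import EmberVLM

model = EmberVLM.from_pretrained("embervlm-small")
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total:,}")          # expected: 164,203,841
print(f"Trainable parameters: {trainable:,}")  # expected: 35,943,041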
Architecture
- Vision Encoder: dinov2_small
- Language Model: SmolLM-135M (135M params)
- Training Stages: 4-stage curriculum
  1. Visual-Language Alignment
  2. Multimodal Instruction Tuning
  3. Robot Fleet Selection
  4. Chain-of-Thought Reasoning
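A minimal sketch of how these components can compose in a forward pass, assuming a simple linear projector maps dinov2_small patch features (dimension 384) into the SmolLM-135M embedding space (dimension 576). The class and module names are illustrative, not the actual EmberVLM implementation.
import torch
import torch.nn as nn

class TinyVLMSketch(nn.Module):
    # Illustrative composition only: frozen vision encoder -> trainable projector -> language model.
    def __init__(self, vision_encoder, language_model, vision_dim=384, text_dim=576):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. dinov2_small, kept frozen
        self.projector = nn.Linear(vision_dim, text_dim)  # visual-language alignment layer
        self.language_model = language_model              # e.g. SmolLM-135M

    def forward(self, pixel_values, text_embeds):
        # Encode the image into patch features, then project them into the LM embedding space.
        with torch.no_grad():
            patch_feats = self.vision_encoder(pixel_values)  # (B, N, vision_dim)
        visual_tokens = self.projector(patch_feats)          # (B, N, text_dim)
        # Prepend the visual tokens to the text embeddings and run the language model.
        fused = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=fused)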
Usage
from embervlm import EmberVLM
from transformers import AutoTokenizer
from PIL import Image
# Load model and tokenizer
model = EmberVLM.from_pretrained("embervlm-small")
tokenizer = AutoTokenizer.from_pretrained("embervlm-small")
# Prepare input
image = Image.open("robot_scene.jpg")
prompt = "<image>What is happening in this scene?"
# Generate response
outputs = model.generate(image=image, prompt=prompt, tokenizer=tokenizer)
print(outputs)
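Continuing from the snippet above, the same generate interface can in principle be prompted for the fleet-selection and chain-of-thought behaviours trained in stages 3 and 4. The prompt wording below is illustrative, not a documented template.
# Illustrative prompts only; exact prompt templates are not documented in this card.
fleet_prompt = "<image>Which robot in the fleet is best suited for this task, and why?"
cot_prompt = "<image>Reason step by step about the scene before answering: is the path ahead clear?"
for p in (fleet_prompt, cot_prompt):
    print(model.generate(image=image, prompt=p, tokenizer=tokenizer))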
Training Configuration
- Vision Backbone: dinov2_small
- Language Backbone: smollm_135m
- Optimization: AdamW with cosine learning rate schedule
- Mixed Precision: bfloat16
- Stages Completed: 1-4 (Full curriculum)
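A minimal sketch of an equivalent optimization loop in PyTorch, reusing the model loaded in the Usage section. The hyperparameter values and the dataloader are placeholders, not the settings actually used to train EmberVLM.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

total_steps = 10_000  # placeholder step budget
optimizer = AdamW(
    (p for p in model.parameters() if p.requires_grad),  # only the ~36M trainable parameters
    lr=1e-4, weight_decay=0.01,                          # placeholder hyperparameters
)
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)  # cosine learning rate schedule

for batch in dataloader:  # hypothetical multimodal dataloader yielding model inputs
    optimizer.zero_grad()
    # bfloat16 mixed precision for the forward pass
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss  # assumes the model returns an output object with a .loss field
    loss.backward()
    optimizer.step()
    scheduler.step()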
Intended Use
- Edge deployment on resource-constrained devices
- Robotic vision-language understanding
- Real-time multimodal reasoning
- Robot fleet selection and task planning
Limitations
- Optimized for efficiency over maximum accuracy
- Best suited for edge/mobile deployment scenarios
- Training focused on robot-centric scenarios
Citation
@software{embervlm_embervlm_small,
title = {EmberVLM-Small},
author = {EmberVLM Team},
year = {2026},
url = {https://huggingface.co/embervlm-small}
}