EmberVLM-Small (~137M parameters)

EmberVLM is an efficient vision-language model optimized for edge deployment and robotic applications.

Model Details

  • Model Type: Vision-Language Model (VLM)
  • Size: Small (~137M parameters)
  • Total Parameters: 164,203,841
  • Trainable Parameters: 35,943,041
  • Carbon Emissions: 0.0308 kg CO2eq
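Only about 36M of the ~164M parameters are updated during training; the rest are kept frozen. A minimal sketch of how the two counts above are typically obtained from a loaded PyTorch module (the helper name is illustrative, not part of the EmberVLM API):

import torch.nn as nn

def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts for a PyTorch module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Hypothetical usage, assuming EmberVLM subclasses torch.nn.Module:
# total, trainable = count_parameters(model)   # -> 164,203,841 / 35,943,041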

Architecture

  • Vision Encoder: dinov2_small
  • Language Model: SmolLM-135M (135M params)
  • Training Stages: 4-stage curriculum
    1. Visual-Language Alignment
    2. Multimodal Instruction Tuning
    3. Robot Fleet Selection
    4. Chain-of-Thought Reasoning
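At its core the model couples DINOv2-small vision features with the SmolLM-135M token embeddings. The card does not spell out the connector, so the snippet below is only an illustrative sketch of the common VLM pattern (project patch features into the language-model embedding space and prepend them to the text sequence); the class name and dimensions are assumptions, not the exact EmberVLM implementation:

import torch
import torch.nn as nn

class VisionLanguageBridge(nn.Module):
    """Illustrative projector: maps DINOv2-small patch features (384-d)
    into the SmolLM-135M embedding space (576-d) so they can be
    prepended to the text token embeddings. Dimensions are assumptions."""

    def __init__(self, vision_dim: int = 384, lm_dim: int = 576):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, patch_features: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        # text_embeds:    (batch, seq_len, lm_dim)
        visual_tokens = self.proj(patch_features)
        # Prepend visual tokens to the text sequence before the LM forward pass.
        return torch.cat([visual_tokens, text_embeds], dim=1)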

Usage

from embervlm import EmberVLM
from transformers import AutoTokenizer
from PIL import Image

# Load model and tokenizer
model = EmberVLM.from_pretrained("embervlm-small")
tokenizer = AutoTokenizer.from_pretrained("embervlm-small")

# Prepare input
image = Image.open("robot_scene.jpg")
prompt = "<image>What is happening in this scene?"

# Generate response
outputs = model.generate(image=image, prompt=prompt, tokenizer=tokenizer)
print(outputs)
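The example above covers the basic call. Assuming EmberVLM behaves like a standard PyTorch module (not confirmed by this card), the usual device and image handling applies:

import torch

# Run on GPU when available; fall back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Images from disk may be grayscale or RGBA; convert to RGB before
# handing them to the vision encoder.
image = Image.open("robot_scene.jpg").convert("RGB")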

Training Configuration

  • Vision Backbone: dinov2_small
  • Language Backbone: smollm_135m
  • Optimization: AdamW with cosine learning rate schedule
  • Mixed Precision: bfloat16
  • Stages Completed: 1-4 (Full curriculum)
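For reference, the optimizer, schedule, and precision settings above map onto a standard PyTorch loop roughly as follows. This is a sketch only: the learning rate, weight decay, step count, and loss are placeholders rather than EmberVLM's actual hyperparameters, and a tiny linear layer stands in for the model so the snippet runs end to end.

import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Tiny stand-in module; in practice `model` would be the EmberVLM instance.
model = nn.Linear(16, 16)
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)  # values illustrative
scheduler = CosineAnnealingLR(optimizer, T_max=1_000)              # T_max = total steps (assumed)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

for step in range(10):  # stand-in for iterating over the training dataloader
    x = torch.randn(4, 16, device=device)
    optimizer.zero_grad()
    # bfloat16 mixed precision, as listed in the configuration above
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = model(x).float().pow(2).mean()  # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()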

Intended Use

  • Edge deployment on resource-constrained devices
  • Robotic vision-language understanding
  • Real-time multimodal reasoning
  • Robot fleet selection and task planning

Limitations

  • Optimized for efficiency over maximum accuracy
  • Best suited for edge/mobile deployment scenarios
  • Training focused on robot-centric scenarios

Citation

@software{embervlm_embervlm_small,
  title = {EmberVLM-Small},
  author = {EmberVLM Team},
  year = {2026},
  url = {https://huggingface.co/embervlm-small}
}