EmberVLM: Tiny (~35M parameters)

πŸ”₯ Efficient Vision-Language Model for Edge Deployment & Robotic Applications

This model is currently in training: Stage 2 (epoch 1).

πŸ“Š Current Training Status

  • Stage: Multimodal Instruction Tuning - Following complex instructions
  • Epoch: 1
  • Last Updated: 2026-02-01 16:01:18 UTC

Latest Metrics

  • instruction_loss: 0.0000
  • loss: 5.2714

πŸ—οΈ Model Architecture

  • Size: Tiny (~35M parameters)
  • Total Parameters: 40,196,257
  • Trainable Parameters: 26,212,929 (65.2%)
  • Vision Encoder: RepViT-M0.9 (~5M params)
  • Language Model: TinyLLM-30M (30M params)

🎯 Training Curriculum

EmberVLM follows a 4-stage training curriculum:

  1. βœ… Stage 1: Visual-Language Alignment - Grounding vision and language
  2. πŸ”„ Stage 2: Multimodal Instruction Tuning - Following instructions
  3. ⏳ Stage 3: Robot Fleet Selection - Task-robot matching
  4. ⏳ Stage 4: Chain-of-Thought Reasoning - Reasoning generation

Current Stage: Stage 2 (Multimodal Instruction Tuning)
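The staged curriculum can be expressed as a small config. This is a hypothetical sketch (the field names and structure are assumptions, not EmberVLM's actual training config); the stage names mirror the list above:

```python
# Hypothetical curriculum config; stage names mirror the list above.
CURRICULUM = [
    {"stage": 1, "name": "Visual-Language Alignment"},
    {"stage": 2, "name": "Multimodal Instruction Tuning"},
    {"stage": 3, "name": "Robot Fleet Selection"},
    {"stage": 4, "name": "Chain-of-Thought Reasoning"},
]

def stage_info(stage_id: int) -> dict:
    """Look up a curriculum stage by its 1-based id."""
    return next(s for s in CURRICULUM if s["stage"] == stage_id)

print(stage_info(2)["name"])  # Multimodal Instruction Tuning
```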

πŸ’» Usage

from transformers import AutoTokenizer
from embervlm import EmberVLM
from PIL import Image

# Load model and tokenizer
model = EmberVLM.from_pretrained("euhidaman/embervlm-tiny")
tokenizer = AutoTokenizer.from_pretrained("euhidaman/embervlm-tiny")

# Load image
image = Image.open("scene.jpg")

# Generate response
prompt = "<image>Describe what you see and select the best robot for this task."
outputs = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256
)

print(outputs)

πŸŽ“ Training Details

  • Vision Backbone: repvit
  • Language Backbone: tinyllm
  • Optimization: AdamW with cosine learning rate schedule
  • Mixed Precision: bfloat16
  • Distributed Training: Multi-GPU with DDP
  • Class Balancing: Focal loss for robot selection (Stage 3)
  • Reasoning: Chain-of-thought with reinforcement learning (Stage 4)

🌍 Environmental Impact

This model is designed for edge deployment to minimize energy consumption.

🎯 Intended Use

  • Primary: Edge deployment on resource-constrained devices
  • Applications:
    • Robotic vision-language understanding
    • Real-time multimodal reasoning
    • Robot fleet selection and task planning
    • Mobile/embedded AI systems

⚠️ Limitations

  • Model is still in training - performance will improve as training progresses
  • Optimized for efficiency over maximum accuracy
  • Best suited for edge/mobile deployment scenarios
  • Training focused on robot-centric scenarios

πŸ“š Citation

@software{embervlm_2026,
  title = {EmberVLM: Efficient Vision-Language Model for Edge Deployment},
  author = {EmberVLM Team},
  year = {2026},
  url = {https://huggingface.co/euhidaman/embervlm-tiny}
}

πŸ“ License

Apache 2.0


Note: This is a checkpoint from Stage 2 training (epoch 1). The model will be updated after each epoch with improved performance.
