---
language:
  - en
license: apache-2.0
tags:
  - vision-language
  - multimodal
  - robotics
  - edge-deployment
  - tiny-vlm
  - repvit
  - tinyllm
  - stage2
base_model:
  - tinyllm
library_name: transformers
pipeline_tag: image-text-to-text
---

# EmberVLM: Tiny (~35M parameters)

🔥 **Efficient Vision-Language Model for Edge Deployment & Robotic Applications**

This model is currently in training: Stage 2 (epoch 1).

## 📊 Current Training Status

- **Stage:** Multimodal Instruction Tuning (following complex instructions)
- **Epoch:** 1
- **Last Updated:** 2026-02-01 16:01:18 UTC

### Latest Metrics

- `instruction_loss`: 0.0000
- `loss`: 5.2714

πŸ—οΈ Model Architecture

  • Size: Tiny (~35M parameters)
  • Total Parameters: 40,196,257
  • Trainable Parameters: 26,212,929 (65.2%)
  • Vision Encoder: RepViT-M0.9 (~5M params)
  • Language Model: TinyLLM-30M (30M params)
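
The total/trainable split above can be reproduced with a short parameter-count helper. This is a generic PyTorch snippet, not part of the EmberVLM codebase; `model` is assumed to be the object returned by `EmberVLM.from_pretrained` in the Usage section below.

```python
import torch.nn as nn

def count_params(model: nn.Module) -> None:
    # Sum parameter element counts; requires_grad marks the trainable subset.
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"{trainable:,} / {total:,} trainable ({trainable / total:.1%})")

# count_params(model)  # expected: 26,212,929 / 40,196,257 trainable (65.2%)
```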

## 🎯 Training Curriculum

EmberVLM follows a 4-stage training curriculum:

1. ✅ **Stage 1: Visual-Language Alignment** - grounding vision and language
2. ✅ **Stage 2: Multimodal Instruction Tuning** - following instructions
3. ✅ **Stage 3: Robot Fleet Selection** - task-robot matching
4. ⏳ **Stage 4: Chain-of-Thought Reasoning** - reasoning generation

**Current Stage:** Stage 2

## 💻 Usage

```python
from transformers import AutoTokenizer
from embervlm import EmberVLM
from PIL import Image

# Load model and tokenizer
model = EmberVLM.from_pretrained("euhidaman/embervlm-tiny")
tokenizer = AutoTokenizer.from_pretrained("euhidaman/embervlm-tiny")

# Load an input image
image = Image.open("scene.jpg")

# Generate a response; the <image> token marks where the image is injected
prompt = "<image>Describe what you see and select the best robot for this task."
outputs = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256,
)

print(outputs)
```

## 🎓 Training Details

- **Vision Backbone:** `repvit`
- **Language Backbone:** `tinyllm`
- **Optimization:** AdamW with cosine learning rate schedule
- **Mixed Precision:** bfloat16
- **Distributed Training:** multi-GPU with DDP
- **Class Balancing:** focal loss for robot selection (Stage 3), sketched below
- **Reasoning:** chain-of-thought with reinforcement learning (Stage 4)
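
The class-balancing entry above refers to focal loss. The sketch below shows the standard formulation (Lin et al., 2017) for intuition only; the function name and the `gamma`/`alpha` defaults are illustrative assumptions, not values taken from the EmberVLM training code.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    # Per-sample cross-entropy; `alpha` is an optional per-class weight vector.
    ce = F.cross_entropy(logits, targets, weight=alpha, reduction="none")
    p_t = torch.exp(-ce)  # model's probability for the true class
    # Down-weight easy examples by (1 - p_t)^gamma so rare robot classes
    # are not drowned out by frequent ones.
    return ((1.0 - p_t) ** gamma * ce).mean()

# Toy example: 4 samples over 6 hypothetical robot classes.
print(focal_loss(torch.randn(4, 6), torch.tensor([0, 2, 5, 1])))
```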

## 🌍 Environmental Impact

This model is designed for edge deployment to minimize energy consumption.

## 🎯 Intended Use

- **Primary:** edge deployment on resource-constrained devices (see the quantization sketch after this list)
- **Applications:**
  - Robotic vision-language understanding
  - Real-time multimodal reasoning
  - Robot fleet selection and task planning
  - Mobile/embedded AI systems
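
For resource-constrained targets, a common post-training starting point is dynamic int8 quantization of the linear layers. This is a generic PyTorch recipe offered as a sketch, not an official EmberVLM export path; `model` is again the checkpoint loaded as in the Usage section, and accuracy should be re-validated after quantizing.

```python
import torch

# Replace nn.Linear weights with int8 versions; weights are quantized ahead
# of time while activations stay in floating point at inference.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```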

## ⚠️ Limitations

- The model is still in training; performance will improve as training progresses
- Optimized for efficiency over maximum accuracy
- Best suited for edge/mobile deployment scenarios
- Training is focused on robot-centric scenarios

## 📚 Citation

```bibtex
@software{embervlm_2026,
  title  = {EmberVLM: Efficient Vision-Language Model for Edge Deployment},
  author = {EmberVLM Team},
  year   = {2026},
  url    = {https://huggingface.co/euhidaman/embervlm-tiny}
}
```

πŸ“ License

Apache 2.0


*Note: This is a checkpoint from Stage 2 training (epoch 1). The model will be updated after each epoch as training progresses.*