---
language:
- en
license: apache-2.0
tags:
- vision-language
- multimodal
- robotics
- edge-deployment
- tiny-vlm
- repvit
- tinyllm
- stage2
base_model:
- tinyllm
library_name: transformers
pipeline_tag: image-text-to-text
---

# EmberVLM: Tiny (~35M parameters)

**🔥 Efficient Vision-Language Model for Edge Deployment & Robotic Applications**

This model is currently in training: **Stage 2 (Epoch 1)**.

## 📊 Current Training Status

- **Stage**: Multimodal Instruction Tuning - Following complex instructions
- **Epoch**: 1
- **Last Updated**: 2026-02-01 16:01:18 UTC

### Latest Metrics

- **instruction_loss**: 0.0000
- **loss**: 5.2714

## 🏗️ Model Architecture

- **Size**: Tiny (~35M parameters in the two backbones; ~40M total)
- **Total Parameters**: 40,196,257
- **Trainable Parameters**: 26,212,929 (65.2%)
- **Vision Encoder**: RepViT-M0.9 (~5M params)
- **Language Model**: TinyLLM-30M (30M params)

An illustrative sketch of how the vision encoder feeds the language model appears at the end of this card.

## 🎯 Training Curriculum

EmberVLM follows a 4-stage training curriculum:

1. ✅ **Stage 1: Visual-Language Alignment** - Grounding vision and language
2. 🔄 **Stage 2: Multimodal Instruction Tuning** - Following instructions (in progress)
3. ⏳ **Stage 3: Robot Fleet Selection** - Task-robot matching
4. ⏳ **Stage 4: Chain-of-Thought Reasoning** - Reasoning generation

**Current Stage**: Stage 2

## 💻 Usage

Inference requires the `embervlm` package in addition to `transformers`:

```python
from transformers import AutoTokenizer
from embervlm import EmberVLM
from PIL import Image

# Load model and tokenizer
model = EmberVLM.from_pretrained("euhidaman/embervlm-tiny")
tokenizer = AutoTokenizer.from_pretrained("euhidaman/embervlm-tiny")

# Load image
image = Image.open("scene.jpg")

# Generate response
prompt = "Describe what you see and select the best robot for this task."
outputs = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256
)
print(outputs)
```

## 🎓 Training Details

- **Vision Backbone**: RepViT
- **Language Backbone**: TinyLLM
- **Optimization**: AdamW with a cosine learning-rate schedule
- **Mixed Precision**: bfloat16
- **Distributed Training**: Multi-GPU with DDP
- **Class Balancing**: Focal loss for robot selection (Stage 3)
- **Reasoning**: Chain-of-thought with reinforcement learning (Stage 4)

Illustrative sketches of the optimizer setup and the focal loss appear at the end of this card.

## 🌍 Environmental Impact

This model is designed for edge deployment, where its small footprint helps minimize energy consumption.

## 🎯 Intended Use

- **Primary**: Edge deployment on resource-constrained devices
- **Applications**:
  - Robotic vision-language understanding
  - Real-time multimodal reasoning
  - Robot fleet selection and task planning
  - Mobile/embedded AI systems

## ⚠️ Limitations

- The model is still in training; performance will improve as training progresses
- Optimized for efficiency over maximum accuracy
- Best suited for edge/mobile deployment scenarios
- Training is focused on robot-centric scenarios

## 📚 Citation

```bibtex
@software{embervlm_2026,
  title  = {EmberVLM: Efficient Vision-Language Model for Edge Deployment},
  author = {EmberVLM Team},
  year   = {2026},
  url    = {https://huggingface.co/euhidaman/embervlm-tiny}
}
```

## 📝 License

Apache 2.0

---

**Note**: This is a checkpoint from Stage 2 training (epoch 1). The model will be updated after each epoch with improved performance.
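
## 🧪 Implementation Sketches (Illustrative)

The sketches below are not the EmberVLM source code; they are minimal, hedged examples of the techniques named on this card. Any name, dimension, or hyperparameter not stated above is an assumption.

**Vision-language wiring.** A minimal sketch of how a RepViT-style encoder could feed TinyLLM through a learned projector. The class name `TinyVLMSketch`, the 384/512 dimensions, and the prepend-visual-tokens fusion are all assumptions, not the actual EmberVLM design:

```python
import torch
import torch.nn as nn

class TinyVLMSketch(nn.Module):
    """Hypothetical wiring: vision backbone -> linear projector -> LM.

    The card does not document EmberVLM's fusion scheme; prepending
    projected visual tokens to the text embeddings is one common choice.
    """

    def __init__(self, vision_encoder, language_model, vision_dim=384, lm_dim=512):
        super().__init__()
        self.vision_encoder = vision_encoder            # e.g. a RepViT-M0.9 feature extractor
        self.projector = nn.Linear(vision_dim, lm_dim)  # maps visual tokens into LM embedding space
        self.language_model = language_model            # e.g. TinyLLM-30M

    def forward(self, pixel_values, text_embeds):
        vis_tokens = self.vision_encoder(pixel_values)       # (B, N, vision_dim) patch tokens
        vis_tokens = self.projector(vis_tokens)              # (B, N, lm_dim)
        fused = torch.cat([vis_tokens, text_embeds], dim=1)  # visual tokens first, then text
        return self.language_model(inputs_embeds=fused)
```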
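
**Optimizer and schedule.** The card states AdamW with a cosine learning-rate schedule and bfloat16 mixed precision; the learning rate, weight decay, and loop structure below are assumptions:

```python
import torch

def train_epoch(model, dataloader, total_steps, lr=1e-4):
    """One-epoch sketch: AdamW + cosine LR decay + bfloat16 autocast.

    All hyperparameter values here are illustrative, not the real recipe.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    for batch in dataloader:
        optimizer.zero_grad()
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = model(**batch).loss  # assumes a HF-style output with a .loss field
        loss.backward()
        optimizer.step()
        scheduler.step()
```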
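
**Focal loss for robot selection.** Stage 3 uses focal loss for class balancing. This is the standard multi-class form (Lin et al., 2017); the default `gamma=2.0` and the 6-class example are assumptions:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Down-weights well-classified examples so rare robot classes
    contribute more to the gradient.

    alpha: optional per-class weight tensor of shape (num_classes,).
    """
    log_probs = F.log_softmax(logits, dim=-1)                            # (B, C)
    ce = F.nll_loss(log_probs, targets, weight=alpha, reduction="none")  # (B,) per-sample CE
    pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()      # prob. of true class
    return ((1.0 - pt) ** gamma * ce).mean()

# Example: 4 queries over 6 hypothetical robot types
logits = torch.randn(4, 6)
targets = torch.tensor([0, 2, 5, 1])
print(focal_loss(logits, targets))
```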