---
language:
- en
license: apache-2.0
tags:
- vision-language
- multimodal
- robotics
- edge-deployment
- tiny-vlm
- repvit
- tinyllm
- stage2
base_model:
- tinyllm
library_name: transformers
pipeline_tag: image-text-to-text
---
# EmberVLM-Tiny (~35M parameters)
**Efficient Vision-Language Model for Edge Deployment & Robotic Applications**
This model is currently in training: **Stage 2 (Epoch 1)**.
## Current Training Status
- **Stage**: Multimodal Instruction Tuning - Following complex instructions
- **Epoch**: 1
- **Last Updated**: 2026-02-01 16:01:18 UTC
### Latest Metrics
- **instruction_loss**: 0.0000
- **loss**: 5.2714
## Model Architecture
- **Size**: Tiny (~35M parameters)
- **Total Parameters**: 40,196,257
- **Trainable Parameters**: 26,212,929 (65.2%)
- **Vision Encoder**: RepViT-M0.9 (~5M params)
- **Language Model**: TinyLLM-30M (30M params)
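As a quick sanity check, the totals above can be reproduced with a generic PyTorch parameter count (a minimal sketch, assuming the `embervlm` package shown in the Usage section below; the gap between the 40.2M total and the ~35M encoder + language-model sum presumably sits in the multimodal projection layers):
```python
from embervlm import EmberVLM  # package shown in the Usage section below

model = EmberVLM.from_pretrained("euhidaman/embervlm-tiny")

# Count all parameters vs. those left unfrozen for the current training stage.
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

# Expected from this card: total = 40,196,257; trainable = 26,212,929 (65.2%)
print(f"total={total:,} trainable={trainable:,} ({100 * trainable / total:.1f}%)")
```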
## Training Curriculum
EmberVLM follows a 4-stage training curriculum:
1. ✅ **Stage 1: Visual-Language Alignment** - Grounding vision and language
2. ✅ **Stage 2: Multimodal Instruction Tuning** - Following instructions
3. ✅ **Stage 3: Robot Fleet Selection** - Task-robot matching
4. ⏳ **Stage 4: Chain-of-Thought Reasoning** - Reasoning generation
**Current Stage**: Stage 2 (Multimodal Instruction Tuning)
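For readers reproducing a similar setup, a hypothetical sketch of the curriculum as a config object; only the stage names and their order come from this card, the objective descriptions are assumptions based on the Training Details section below:
```python
# Hypothetical curriculum config; stage names/order are from this card,
# the objective strings are assumptions.
CURRICULUM = {
    1: ("Visual-Language Alignment", "align vision and text representations"),
    2: ("Multimodal Instruction Tuning", "next-token loss on instruction data"),
    3: ("Robot Fleet Selection", "classification with focal loss"),
    4: ("Chain-of-Thought Reasoning", "reasoning generation with RL"),
}
CURRENT_STAGE = 2  # this checkpoint
```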
## Usage
```python
from transformers import AutoTokenizer
from embervlm import EmberVLM
from PIL import Image
# Load model and tokenizer
model = EmberVLM.from_pretrained("euhidaman/embervlm-tiny")
tokenizer = AutoTokenizer.from_pretrained("euhidaman/embervlm-tiny")
# Load image
image = Image.open("scene.jpg")
# Generate response
prompt = "<image>Describe what you see and select the best robot for this task."
outputs = model.generate(
image=image,
prompt=prompt,
tokenizer=tokenizer,
max_new_tokens=256
)
print(outputs)
```
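For deployment-style inference it is worth putting the model in eval mode and disabling autograd; a minimal sketch reusing the `generate` call from above (standard PyTorch practice, not an API specific to this model):
```python
import torch

model.eval()  # disable dropout etc. for inference
with torch.inference_mode():  # skip autograd bookkeeping for lower latency/memory
    outputs = model.generate(
        image=image,
        prompt=prompt,
        tokenizer=tokenizer,
        max_new_tokens=256,
    )
print(outputs)
```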
## Training Details
- **Vision Backbone**: RepViT
- **Language Backbone**: TinyLLM
- **Optimization**: AdamW with cosine learning rate schedule
- **Mixed Precision**: bfloat16
- **Distributed Training**: Multi-GPU with DDP
- **Class Balancing**: Focal loss for robot selection (Stage 3)
- **Reasoning**: Chain-of-thought with reinforcement learning (Stage 4)
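A minimal sketch of the optimization setup described above, using standard PyTorch APIs; every hyperparameter value here is an illustrative assumption, not the configuration used for this checkpoint:
```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

num_training_steps = 10_000  # illustrative; the real schedule length is not published

# `model` and `dataloader` are placeholders for the user's own training setup.
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=num_training_steps)

for batch in dataloader:
    optimizer.zero_grad()
    # bfloat16 mixed precision, as listed above (assumes a CUDA device)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss  # assumes an HF-style output with a .loss field
    loss.backward()
    optimizer.step()
    scheduler.step()
```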
## Environmental Impact
This model is designed for edge deployment to minimize energy consumption.
## Intended Use
- **Primary**: Edge deployment on resource-constrained devices
- **Applications**:
- Robotic vision-language understanding
- Real-time multimodal reasoning
- Robot fleet selection and task planning
- Mobile/embedded AI systems
## Limitations
- Model is still in training - performance will improve as training progresses
- Optimized for efficiency over maximum accuracy
- Best suited for edge/mobile deployment scenarios
- Training focused on robot-centric scenarios
## Citation
```bibtex
@software{embervlm_2026,
title = {EmberVLM: Efficient Vision-Language Model for Edge Deployment},
author = {EmberVLM Team},
year = {2026},
url = {https://huggingface.co/euhidaman/embervlm-tiny}
}
```
## License
Apache 2.0
---
**Note**: This is a checkpoint from Stage 2 training (Epoch 1).
The model will be updated after each epoch with improved performance.