---
language:
- en
license: apache-2.0
tags:
- vision-language
- multimodal
- robotics
- edge-deployment
- tiny-vlm
- repvit
- tinyllm
- stage2
base_model:
- tinyllm
library_name: transformers
pipeline_tag: image-text-to-text
---

# EmberVLM-Tiny (~35M parameters)

**πŸ”₯ Efficient Vision-Language Model for Edge Deployment & Robotic Applications**

This model is currently in training: **Stage 2 (Epoch 1)**.

## πŸ“Š Current Training Status

- **Stage**: Multimodal Instruction Tuning - Following complex instructions
- **Epoch**: 1
- **Last Updated**: 2026-02-01 16:01:18 UTC

### Latest Metrics
- **instruction_loss**: 0.0000
- **loss**: 5.2714

## πŸ—οΈ Model Architecture

- **Size**: Tiny (~35M parameters)
- **Total Parameters**: 40,196,257
- **Trainable Parameters**: 26,212,929 (65.2%)
- **Vision Encoder**: RepViT-M0.9 (~5M params)
- **Language Model**: TinyLLM-30M (30M params)
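
The trainable fraction quoted above follows directly from the two parameter counts on this card; a quick arithmetic check:

```python
# Parameter counts as reported on this card.
total_params = 40_196_257
trainable_params = 26_212_929

trainable_pct = 100 * trainable_params / total_params
print(f"{trainable_pct:.1f}% trainable")  # β†’ 65.2% trainable
```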

## 🎯 Training Curriculum

EmberVLM follows a 4-stage training curriculum:

1. βœ… **Stage 1: Visual-Language Alignment** - Grounding vision and language
2. πŸ”„ **Stage 2: Multimodal Instruction Tuning** - Following instructions (in progress)
3. ⏳ **Stage 3: Robot Fleet Selection** - Task-robot matching
4. ⏳ **Stage 4: Chain-of-Thought Reasoning** - Reasoning generation

**Current Stage**: Stage 2

## πŸ’» Usage

```python
from transformers import AutoTokenizer
from embervlm import EmberVLM
from PIL import Image

# Load model and tokenizer
model = EmberVLM.from_pretrained("euhidaman/embervlm-tiny")
tokenizer = AutoTokenizer.from_pretrained("euhidaman/embervlm-tiny")

# Load image
image = Image.open("scene.jpg")

# Generate response
prompt = "<image>Describe what you see and select the best robot for this task."
outputs = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256
)

print(outputs)
```

## πŸŽ“ Training Details

- **Vision Backbone**: RepViT
- **Language Backbone**: TinyLLM
- **Optimization**: AdamW with cosine learning rate schedule
- **Mixed Precision**: bfloat16
- **Distributed Training**: Multi-GPU with DDP
- **Class Balancing**: Focal loss for robot selection (Stage 3)
- **Reasoning**: Chain-of-thought with reinforcement learning (Stage 4)
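
The schedule and loss named above can be sketched in plain Python. This is an illustrative formulation only, not the project's training code; hyperparameters such as `base_lr` and `gamma` are assumptions, not values from the actual run:

```python
import math

def cosine_lr(step, total_steps, base_lr=3e-4, min_lr=0.0, warmup_steps=0):
    """Cosine learning-rate decay with optional linear warmup, as commonly
    paired with AdamW. All hyperparameters here are illustrative."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))

def focal_loss(p_correct, gamma=2.0):
    """Focal loss for one example, given the predicted probability of the
    correct class. High-confidence (easy) examples are down-weighted by
    (1 - p)^gamma, which counteracts class imbalance in robot selection."""
    return -((1.0 - p_correct) ** gamma) * math.log(p_correct)
```

`cosine_lr(0, T)` starts at `base_lr` and decays to `min_lr` at step `T`; `focal_loss` shrinks toward zero for confident correct predictions while keeping hard examples near the ordinary cross-entropy value.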

## 🌍 Environmental Impact

This model is designed for edge deployment to minimize energy consumption.

## 🎯 Intended Use

- **Primary**: Edge deployment on resource-constrained devices
- **Applications**: 
  - Robotic vision-language understanding
  - Real-time multimodal reasoning
  - Robot fleet selection and task planning
  - Mobile/embedded AI systems

## ⚠️ Limitations

- Model is still in training - performance will improve as training progresses
- Optimized for efficiency over maximum accuracy
- Best suited for edge/mobile deployment scenarios
- Training focused on robot-centric scenarios

## πŸ“š Citation

```bibtex
@software{embervlm_2026,
  title = {EmberVLM: Efficient Vision-Language Model for Edge Deployment},
  author = {EmberVLM Team},
  year = {2026},
  url = {https://huggingface.co/euhidaman/embervlm-tiny}
}
```

## πŸ“ License

Apache 2.0

---

**Note**: This is a checkpoint from Stage 2 training (Epoch 1).
The model will be updated after each epoch with improved performance.