Update model - STAGE1 Epoch 1 | Loss: 6.5212

Files changed (10)

README.md +130 -0
added_tokens.json +7 -0
config.json +46 -0
merges.txt +0 -0
pytorch_model.bin +3 -0
special_tokens_map.json +61 -0
tokenizer.json +0 -0
tokenizer_config.json +70 -0
training_info.json +15 -0
vocab.json +0 -0

README.md ADDED

	@@ -0,0 +1,130 @@

+---
+language:
+- en
+license: apache-2.0
+tags:
+- vision-language
+- multimodal
+- robotics
+- edge-deployment
+- tiny-vlm
+- repvit
+- tinyllm
+- stage1
+base_model:
+- tinyllm
+library_name: transformers
+pipeline_tag: image-text-to-text
+---
+# EmberVLM: Tiny (~35M parameters)
+**🔥 Efficient Vision-Language Model for Edge Deployment & Robotic Applications**
+This model is currently in training - **STAGE1 (Epoch 1)**.
+## 📊 Current Training Status
+- **Stage**: Visual-Language Alignment - Learning to ground vision and language
+- **Epoch**: 1
+- **Last Updated**: 2026-01-28 15:03:00 UTC
+### Latest Metrics
+- **captioning_loss**: 8.4406
+- **contrastive_loss**: 4.6019
+- **loss**: 6.5212
+## 🏗️ Model Architecture
+- **Size**: Tiny (~35M nominal, 37.2M actual)
+- **Total Parameters**: 37,237,665
+- **Trainable Parameters**: 23,254,337 (62.4%)
+- **Vision Encoder**: RepViT-M0.9 (~5M params)
+- **Language Model**: TinyLLM-30M (30M params)
+## 🎯 Training Curriculum
+EmberVLM follows a 4-stage training curriculum:
+1. 🔄 **Stage 1: Visual-Language Alignment** - Grounding vision and language (in progress)
+2. ⏳ **Stage 2: Multimodal Instruction Tuning** - Following instructions
+3. ⏳ **Stage 3: Robot Fleet Selection** - Task-robot matching
+4. ⏳ **Stage 4: Chain-of-Thought Reasoning** - Reasoning generation
+**Current Stage**: STAGE1
+## 💻 Usage
+```python
+from transformers import AutoTokenizer
+from embervlm import EmberVLM
+from PIL import Image
+# Load model and tokenizer
+model = EmberVLM.from_pretrained("euhidaman/embervlm-tiny")
+tokenizer = AutoTokenizer.from_pretrained("euhidaman/embervlm-tiny")
+# Load image
+image = Image.open("scene.jpg").convert("RGB")
+# Generate response
+prompt = "<|image|>Describe what you see and select the best robot for this task."
+outputs = model.generate(
+    image=image,
+    prompt=prompt,
+    tokenizer=tokenizer,
+    max_new_tokens=256
+)
+print(outputs)
+```
+## 🎓 Training Details
+- **Vision Backbone**: repvit
+- **Language Backbone**: tinyllm
+- **Optimization**: AdamW with cosine learning rate schedule
+- **Mixed Precision**: bfloat16
+- **Distributed Training**: Multi-GPU with DDP
+- **Class Balancing**: Focal loss for robot selection (Stage 3)
+- **Reasoning**: Chain-of-thought with reinforcement learning (Stage 4)
+## 🌍 Environmental Impact
+This model is designed for edge deployment to minimize energy consumption.
+## 🎯 Intended Use
+- **Primary**: Edge deployment on resource-constrained devices
+- **Applications**:
+  - Robotic vision-language understanding
+  - Real-time multimodal reasoning
+  - Robot fleet selection and task planning
+  - Mobile/embedded AI systems
+## ⚠️ Limitations
+- The model is still in training (Stage 1, epoch 1); performance is expected to improve as training progresses
+- Optimized for efficiency over maximum accuracy
+- Best suited for edge/mobile deployment scenarios
+- Training focused on robot-centric scenarios
+## 📚 Citation
+```bibtex
+@software{embervlm_2026,
+  title = {EmberVLM: Efficient Vision-Language Model for Edge Deployment},
+  author = {EmberVLM Team},
+  year = {2026},
+  url = {https://huggingface.co/euhidaman/embervlm-tiny}
+}
+```
+## 📝 License
+Apache 2.0
+---
+**Note**: This is a checkpoint from Stage 1 training (epoch 1).
+The checkpoint is updated after each epoch.

added_tokens.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "<|action_plan|>": 50260,
+  "<|image|>": 50261,
+  "<|reasoning_end|>": 50258,
+  "<|reasoning_start|>": 50257,
+  "<|robot_selection|>": 50259
+}
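
A quick way to sanity-check the mapping above is to confirm that the five added IDs form one contiguous block directly after the base vocabulary. The sketch below is illustrative only: it assumes the base tokenizer is the standard GPT-2 BPE vocabulary with 50257 entries (IDs 0 to 50256), and `check_added_tokens` is a hypothetical helper, not part of this repository.

```python
# Token-to-ID mapping copied from added_tokens.json in this commit.
added_tokens = {
    "<|action_plan|>": 50260,
    "<|image|>": 50261,
    "<|reasoning_end|>": 50258,
    "<|reasoning_start|>": 50257,
    "<|robot_selection|>": 50259,
}

# Assumption: GPT-2 BPE base vocabulary, IDs 0..50256.
BASE_VOCAB_SIZE = 50257


def check_added_tokens(tokens, base_size):
    """Return (is_contiguous, total_vocab_size) for a token->ID mapping."""
    ids = sorted(tokens.values())
    expected = list(range(base_size, base_size + len(ids)))
    return ids == expected, base_size + len(ids)


ok, vocab_size = check_added_tokens(added_tokens, BASE_VOCAB_SIZE)
print(ok, vocab_size)
```

Under that assumption, the resulting vocabulary size can be compared with `vocab_size` in config.json.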

config.json ADDED

	@@ -0,0 +1,46 @@

+{
+  "vision_backbone": "repvit",
+  "language_backbone": "tinyllm",
+  "vision_model": "repvit_m0_9",
+  "vision_pretrained": true,
+  "freeze_vision": true,
+  "num_visual_tokens": 8,
+  "vision_output_dim": 384,
+  "image_size": 224,
+  "language_hidden_size": 384,
+  "language_num_layers": 6,
+  "language_num_heads": 6,
+  "language_vocab_size": 50262,
+  "language_max_length": 1024,
+  "freeze_language_base": true,
+  "unfreeze_last_layer": true,
+  "use_pretrained_language": true,
+  "pretrained_language_model": "tinyllm/30M-0.4",
+  "fusion_bottleneck_dim": 48,
+  "fusion_dropout": 0.1,
+  "use_qk_norm": true,
+  "reasoning_enabled": true,
+  "reasoning_hidden_dim": 192,
+  "reasoning_num_layers": 2,
+  "reasoning_num_heads": 4,
+  "num_reasoning_steps": 4,
+  "max_plan_steps": 5,
+  "num_robots": 5,
+  "robot_names": [
+    "Drone",
+    "Humanoid",
+    "Wheeled",
+    "Legged",
+    "Underwater"
+  ],
+  "special_tokens": {
+    "reasoning_start": "<|reasoning_start|>",
+    "reasoning_end": "<|reasoning_end|>",
+    "robot_selection": "<|robot_selection|>",
+    "action_plan": "<|action_plan|>",
+    "image_token": "<|image|>"
+  },
+  "dropout": 0.1,
+  "initializer_range": 0.02,
+  "vocab_size": 50262
+}
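
Several fields in this config have to agree with one another (head counts must divide hidden sizes, the robot list must match `num_robots`, the two vocabulary sizes must match). The following is a minimal sketch of such a check, using values copied from the diff above; `validate_config` is a hypothetical helper and the rules it applies are common transformer conventions, not checks confirmed to exist in the EmberVLM code.

```python
# Subset of config.json from this commit (values copied verbatim).
config = {
    "language_hidden_size": 384,
    "language_num_heads": 6,
    "language_vocab_size": 50262,
    "vocab_size": 50262,
    "reasoning_hidden_dim": 192,
    "reasoning_num_heads": 4,
    "num_robots": 5,
    "robot_names": ["Drone", "Humanoid", "Wheeled", "Legged", "Underwater"],
}


def validate_config(cfg):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    # Multi-head attention needs the hidden size to split evenly across heads.
    if cfg["language_hidden_size"] % cfg["language_num_heads"] != 0:
        problems.append("language_hidden_size not divisible by language_num_heads")
    if cfg["reasoning_hidden_dim"] % cfg["reasoning_num_heads"] != 0:
        problems.append("reasoning_hidden_dim not divisible by reasoning_num_heads")
    # The robot-selection head size should match the list of class names.
    if cfg["num_robots"] != len(cfg["robot_names"]):
        problems.append("num_robots does not match len(robot_names)")
    # Both vocabulary fields describe the same embedding table.
    if cfg["language_vocab_size"] != cfg["vocab_size"]:
        problems.append("language_vocab_size differs from vocab_size")
    return problems


problems = validate_config(config)
print(problems or "config is internally consistent")
```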

merges.txt ADDED

The diff for this file is too large to render. See raw diff

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:27a0610a57d6c72d939c944e5b71106e003f0c4a3d6fc9daa5b9ac934e22922e
+size 88817547
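
The three lines above are a Git LFS pointer, not the weights themselves: the repository stores only the spec version, the SHA-256 of the real file, and its size in bytes. A small sketch of reading those fields follows; `parse_lfs_pointer` is a hypothetical helper written for illustration, and it handles only the three keys shown in this diff.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line has the form "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer contents copied from the diff above (leading "+" markers removed).
pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:27a0610a57d6c72d939c944e5b71106e003f0c4a3d6fc9daa5b9ac934e22922e
size 88817547
"""

pointer = parse_lfs_pointer(pointer_text)
size_mib = pointer["size_bytes"] / 2**20
print(f"{pointer['hash_algo']} digest, {size_mib:.1f} MiB")
```

After downloading the actual file, the digest can be compared with `hashlib.sha256` over its bytes to confirm the download is intact.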

special_tokens_map.json ADDED

	@@ -0,0 +1,61 @@

+{
+  "additional_special_tokens": [
+    {
+      "content": "<|reasoning_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|reasoning_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|robot_selection|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|action_plan|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|image|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    }
+  ],
+  "bos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<|endoftext|>",
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,70 @@

+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "50256": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50257": {
+      "content": "<|reasoning_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50258": {
+      "content": "<|reasoning_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50259": {
+      "content": "<|robot_selection|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50260": {
+      "content": "<|action_plan|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50261": {
+      "content": "<|image|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|reasoning_start|>",
+    "<|reasoning_end|>",
+    "<|robot_selection|>",
+    "<|action_plan|>",
+    "<|image|>"
+  ],
+  "bos_token": "<|endoftext|>",
+  "clean_up_tokenization_spaces": true,
+  "eos_token": "<|endoftext|>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "model_max_length": 1024,
+  "pad_token": "<|endoftext|>",
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>"
+}
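
One practical consequence of this tokenizer config is that `pad_token`, `bos_token`, `eos_token` and `unk_token` all map to `<|endoftext|>` (ID 50256). A mask derived from `input_ids != pad_id` would therefore also hide genuine end-of-text tokens. The sketch below shows one common workaround, building the attention mask from sequence lengths instead; `pad_batch` and the example token IDs are hypothetical and are not taken from the EmberVLM training code.

```python
# <|endoftext|>, shared by pad/bos/eos/unk in tokenizer_config.json.
PAD_ID = 50256


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token-ID lists and build the attention mask from lengths."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # Real tokens get 1 (including a genuine EOS); padding gets 0.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical token IDs; the first sequence ends with a genuine EOS.
batch = [[11, 22, PAD_ID], [33, 44, 55, 66, 77]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[0])
print(attention_mask[0])
```

In the first row the genuine end-of-text token stays attended while the two padding positions are masked, even though all three share the same ID.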

training_info.json ADDED

	@@ -0,0 +1,15 @@

+{
+  "stage": "stage1",
+  "epoch": 1,
+  "metrics": {
+    "loss": 6.521240068518597,
+    "contrastive_loss": 4.601919858351998,
+    "captioning_loss": 8.440560257953146
+  },
+  "carbon_emissions_kg": 0.0,
+  "timestamp": "2026-01-28T15:03:00.655056",
+  "vision_backbone": "repvit",
+  "language_backbone": "tinyllm",
+  "total_parameters": 37237665,
+  "trainable_parameters": 23254337
+}
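
The numbers in this file can be cross-checked against the README. The reported `loss` is numerically consistent with an unweighted mean of the contrastive and captioning losses, and the parameter counts reproduce the 62.4% trainable share. The sketch below only verifies that arithmetic; the equal weighting is an observation from these values, not a formula confirmed by the training code.

```python
import math

# Values copied from training_info.json in this commit.
info = {
    "metrics": {
        "loss": 6.521240068518597,
        "contrastive_loss": 4.601919858351998,
        "captioning_loss": 8.440560257953146,
    },
    "total_parameters": 37237665,
    "trainable_parameters": 23254337,
}

metrics = info["metrics"]

# Assumption: total loss = 0.5 * contrastive + 0.5 * captioning.
mean_loss = 0.5 * (metrics["contrastive_loss"] + metrics["captioning_loss"])
matches_mean = math.isclose(mean_loss, metrics["loss"], abs_tol=1e-6)

# Share of parameters that receive gradients in this stage.
trainable_pct = 100.0 * info["trainable_parameters"] / info["total_parameters"]
frozen_params = info["total_parameters"] - info["trainable_parameters"]

print(f"mean of components: {mean_loss:.4f} (matches reported loss: {matches_mean})")
print(f"trainable: {trainable_pct:.1f}%, frozen parameters: {frozen_params:,}")
```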

vocab.json ADDED

The diff for this file is too large to render. See raw diff