euhidaman commited on
Commit
268efd8
·
verified ·
1 Parent(s): 999f174

Update model - STAGE2 Epoch 1 | Loss: 2.5060

Browse files
README.md ADDED
@@ -0,0 +1,129 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language:
3
+ - en
4
+ license: apache-2.0
5
+ tags:
6
+ - vision-language
7
+ - multimodal
8
+ - robotics
9
+ - edge-deployment
10
+ - tiny-vlm
11
+ - mobilevit_xs
12
+ - smollm_135m
13
+ - stage2
14
+ base_model:
15
+ - smollm_135m
16
+ library_name: transformers
17
+ pipeline_tag: image-text-to-text
18
+ ---
19
+
20
+ # EmberVLM: Small (~137M parameters)
21
+
22
+ **🔥 Efficient Vision-Language Model for Edge Deployment & Robotic Applications**
23
+
24
+ This model is currently in training - **STAGE2 (Epoch 1)**.
25
+
26
+ ## 📊 Current Training Status
27
+
28
+ - **Stage**: Multimodal Instruction Tuning - Following complex instructions
29
+ - **Epoch**: 1
30
+ - **Last Updated**: 2026-01-25 08:39:09 UTC
31
+
32
+ ### Latest Metrics
33
+ - **instruction_loss**: 0.0000
34
+ - **loss**: 2.5060
35
+
36
+ ## 🏗️ Model Architecture
37
+
38
+ - **Size**: Small (~137M parameters)
39
+ - **Total Parameters**: 138,908,785
40
+ - **Trainable Parameters**: 34,313,153 (24.7%)
41
+ - **Vision Encoder**: Apple MobileViT-XS (~2.3M params)
42
+ - **Language Model**: SmolLM-135M (135M params)
43
+
44
+ ## 🎯 Training Curriculum
45
+
46
+ EmberVLM follows a 4-stage training curriculum:
47
+
48
+ 1. ✅ **Stage 1: Visual-Language Alignment** - Grounding vision and language
49
+ 2. 🔄 **Stage 2: Multimodal Instruction Tuning** - Following instructions *(in progress)*
50
+ 3. ⏳ **Stage 3: Robot Fleet Selection** - Task-robot matching
51
+ 4. ⏳ **Stage 4: Chain-of-Thought Reasoning** - Reasoning generation
52
+
53
+ **Current Stage**: STAGE2
54
+
55
+ ## 💻 Usage
56
+
57
+ ```python
58
+ from transformers import AutoTokenizer
59
+ from embervlm import EmberVLM
60
+ from PIL import Image
61
+
62
+ # Load model and tokenizer
63
+ model = EmberVLM.from_pretrained("euhidaman/embervlm-small")
64
+ tokenizer = AutoTokenizer.from_pretrained("euhidaman/embervlm-small")
65
+
66
+ # Load image
67
+ image = Image.open("scene.jpg")
68
+
69
+ # Generate response
70
+ prompt = "<|image|>Describe what you see and select the best robot for this task."
71
+ outputs = model.generate(
72
+ image=image,
73
+ prompt=prompt,
74
+ tokenizer=tokenizer,
75
+ max_new_tokens=256
76
+ )
77
+
78
+ print(outputs)
79
+ ```
80
+
81
+ ## 🎓 Training Details
82
+
83
+ - **Vision Backbone**: mobilevit_xs
84
+ - **Language Backbone**: smollm_135m
85
+ - **Optimization**: AdamW with cosine learning rate schedule
86
+ - **Mixed Precision**: bfloat16
87
+ - **Distributed Training**: Multi-GPU with DDP
88
+ - **Class Balancing**: Focal loss for robot selection (Stage 3)
89
+ - **Reasoning**: Chain-of-thought with reinforcement learning (Stage 4)
90
+
91
+ ## 🌍 Environmental Impact
92
+
93
+ This model is designed for edge deployment to minimize energy consumption.
94
+
95
+ ## 🎯 Intended Use
96
+
97
+ - **Primary**: Edge deployment on resource-constrained devices
98
+ - **Applications**:
99
+ - Robotic vision-language understanding
100
+ - Real-time multimodal reasoning
101
+ - Robot fleet selection and task planning
102
+ - Mobile/embedded AI systems
103
+
104
+ ## ⚠️ Limitations
105
+
106
+ - Model is still in training - performance will improve as training progresses
107
+ - Optimized for efficiency over maximum accuracy
108
+ - Best suited for edge/mobile deployment scenarios
109
+ - Training focused on robot-centric scenarios
110
+
111
+ ## 📚 Citation
112
+
113
+ ```bibtex
114
+ @software{embervlm_2026,
115
+ title = {EmberVLM: Efficient Vision-Language Model for Edge Deployment},
116
+ author = {EmberVLM Team},
117
+ year = {2026},
118
+ url = {https://huggingface.co/euhidaman/embervlm-small}
119
+ }
120
+ ```
121
+
122
+ ## 📝 License
123
+
124
+ Apache 2.0
125
+
126
+ ---
127
+
128
+ **Note**: This is a checkpoint from stage2 training (epoch 1).
129
+ The model will be updated after each epoch with improved performance.
added_tokens.json ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ {
2
+ "<|action_plan|>": 49155,
3
+ "<|image|>": 49156,
4
+ "<|reasoning_end|>": 49153,
5
+ "<|reasoning_start|>": 49152,
6
+ "<|robot_selection|>": 49154
7
+ }
config.json ADDED
@@ -0,0 +1,46 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "vision_backbone": "mobilevit_xs",
3
+ "language_backbone": "smollm_135m",
4
+ "vision_model": "apple/mobilevit-x-small",
5
+ "vision_pretrained": true,
6
+ "freeze_vision": true,
7
+ "num_visual_tokens": 8,
8
+ "vision_output_dim": 384,
9
+ "image_size": 224,
10
+ "language_hidden_size": 576,
11
+ "language_num_layers": 30,
12
+ "language_num_heads": 9,
13
+ "language_vocab_size": 49157,
14
+ "language_max_length": 1024,
15
+ "freeze_language_base": true,
16
+ "unfreeze_last_layer": true,
17
+ "use_pretrained_language": true,
18
+ "pretrained_language_model": "HuggingFaceTB/SmolLM-135M",
19
+ "fusion_bottleneck_dim": 48,
20
+ "fusion_dropout": 0.1,
21
+ "use_qk_norm": true,
22
+ "reasoning_enabled": true,
23
+ "reasoning_hidden_dim": 192,
24
+ "reasoning_num_layers": 2,
25
+ "reasoning_num_heads": 4,
26
+ "num_reasoning_steps": 4,
27
+ "max_plan_steps": 5,
28
+ "num_robots": 5,
29
+ "robot_names": [
30
+ "Drone",
31
+ "Humanoid",
32
+ "Wheeled",
33
+ "Legged",
34
+ "Underwater"
35
+ ],
36
+ "special_tokens": {
37
+ "reasoning_start": "<|reasoning_start|>",
38
+ "reasoning_end": "<|reasoning_end|>",
39
+ "robot_selection": "<|robot_selection|>",
40
+ "action_plan": "<|action_plan|>",
41
+ "image_token": "<|image|>"
42
+ },
43
+ "dropout": 0.1,
44
+ "initializer_range": 0.02,
45
+ "vocab_size": 49157
46
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f885364ae88a3ee563b6e89dc792ae6d5e3c64fa7e3e0e01ab932b1d9f485255
3
+ size 286953683
special_tokens_map.json ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "additional_special_tokens": [
3
+ {
4
+ "content": "<|reasoning_start|>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false
9
+ },
10
+ {
11
+ "content": "<|reasoning_end|>",
12
+ "lstrip": false,
13
+ "normalized": false,
14
+ "rstrip": false,
15
+ "single_word": false
16
+ },
17
+ {
18
+ "content": "<|robot_selection|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ {
25
+ "content": "<|action_plan|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ },
31
+ {
32
+ "content": "<|image|>",
33
+ "lstrip": false,
34
+ "normalized": false,
35
+ "rstrip": false,
36
+ "single_word": false
37
+ }
38
+ ],
39
+ "bos_token": {
40
+ "content": "<|endoftext|>",
41
+ "lstrip": false,
42
+ "normalized": false,
43
+ "rstrip": false,
44
+ "single_word": false
45
+ },
46
+ "eos_token": {
47
+ "content": "<|endoftext|>",
48
+ "lstrip": false,
49
+ "normalized": false,
50
+ "rstrip": false,
51
+ "single_word": false
52
+ },
53
+ "pad_token": "<|endoftext|>",
54
+ "unk_token": {
55
+ "content": "<|endoftext|>",
56
+ "lstrip": false,
57
+ "normalized": false,
58
+ "rstrip": false,
59
+ "single_word": false
60
+ }
61
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,197 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "0": {
5
+ "content": "<|endoftext|>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "1": {
13
+ "content": "<|im_start|>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "2": {
21
+ "content": "<|im_end|>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ },
28
+ "3": {
29
+ "content": "<repo_name>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": true
35
+ },
36
+ "4": {
37
+ "content": "<reponame>",
38
+ "lstrip": false,
39
+ "normalized": false,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": true
43
+ },
44
+ "5": {
45
+ "content": "<file_sep>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false,
50
+ "special": true
51
+ },
52
+ "6": {
53
+ "content": "<filename>",
54
+ "lstrip": false,
55
+ "normalized": false,
56
+ "rstrip": false,
57
+ "single_word": false,
58
+ "special": true
59
+ },
60
+ "7": {
61
+ "content": "<gh_stars>",
62
+ "lstrip": false,
63
+ "normalized": false,
64
+ "rstrip": false,
65
+ "single_word": false,
66
+ "special": true
67
+ },
68
+ "8": {
69
+ "content": "<issue_start>",
70
+ "lstrip": false,
71
+ "normalized": false,
72
+ "rstrip": false,
73
+ "single_word": false,
74
+ "special": true
75
+ },
76
+ "9": {
77
+ "content": "<issue_comment>",
78
+ "lstrip": false,
79
+ "normalized": false,
80
+ "rstrip": false,
81
+ "single_word": false,
82
+ "special": true
83
+ },
84
+ "10": {
85
+ "content": "<issue_closed>",
86
+ "lstrip": false,
87
+ "normalized": false,
88
+ "rstrip": false,
89
+ "single_word": false,
90
+ "special": true
91
+ },
92
+ "11": {
93
+ "content": "<jupyter_start>",
94
+ "lstrip": false,
95
+ "normalized": false,
96
+ "rstrip": false,
97
+ "single_word": false,
98
+ "special": true
99
+ },
100
+ "12": {
101
+ "content": "<jupyter_text>",
102
+ "lstrip": false,
103
+ "normalized": false,
104
+ "rstrip": false,
105
+ "single_word": false,
106
+ "special": true
107
+ },
108
+ "13": {
109
+ "content": "<jupyter_code>",
110
+ "lstrip": false,
111
+ "normalized": false,
112
+ "rstrip": false,
113
+ "single_word": false,
114
+ "special": true
115
+ },
116
+ "14": {
117
+ "content": "<jupyter_output>",
118
+ "lstrip": false,
119
+ "normalized": false,
120
+ "rstrip": false,
121
+ "single_word": false,
122
+ "special": true
123
+ },
124
+ "15": {
125
+ "content": "<jupyter_script>",
126
+ "lstrip": false,
127
+ "normalized": false,
128
+ "rstrip": false,
129
+ "single_word": false,
130
+ "special": true
131
+ },
132
+ "16": {
133
+ "content": "<empty_output>",
134
+ "lstrip": false,
135
+ "normalized": false,
136
+ "rstrip": false,
137
+ "single_word": false,
138
+ "special": true
139
+ },
140
+ "49152": {
141
+ "content": "<|reasoning_start|>",
142
+ "lstrip": false,
143
+ "normalized": false,
144
+ "rstrip": false,
145
+ "single_word": false,
146
+ "special": true
147
+ },
148
+ "49153": {
149
+ "content": "<|reasoning_end|>",
150
+ "lstrip": false,
151
+ "normalized": false,
152
+ "rstrip": false,
153
+ "single_word": false,
154
+ "special": true
155
+ },
156
+ "49154": {
157
+ "content": "<|robot_selection|>",
158
+ "lstrip": false,
159
+ "normalized": false,
160
+ "rstrip": false,
161
+ "single_word": false,
162
+ "special": true
163
+ },
164
+ "49155": {
165
+ "content": "<|action_plan|>",
166
+ "lstrip": false,
167
+ "normalized": false,
168
+ "rstrip": false,
169
+ "single_word": false,
170
+ "special": true
171
+ },
172
+ "49156": {
173
+ "content": "<|image|>",
174
+ "lstrip": false,
175
+ "normalized": false,
176
+ "rstrip": false,
177
+ "single_word": false,
178
+ "special": true
179
+ }
180
+ },
181
+ "additional_special_tokens": [
182
+ "<|reasoning_start|>",
183
+ "<|reasoning_end|>",
184
+ "<|robot_selection|>",
185
+ "<|action_plan|>",
186
+ "<|image|>"
187
+ ],
188
+ "bos_token": "<|endoftext|>",
189
+ "clean_up_tokenization_spaces": false,
190
+ "eos_token": "<|endoftext|>",
191
+ "extra_special_tokens": {},
192
+ "model_max_length": 1000000000000000019884624838656,
193
+ "pad_token": "<|endoftext|>",
194
+ "tokenizer_class": "GPT2Tokenizer",
195
+ "unk_token": "<|endoftext|>",
196
+ "vocab_size": 49152
197
+ }
training_info.json ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "stage": "stage2",
3
+ "epoch": 1,
4
+ "metrics": {
5
+ "loss": 2.5060146195547923,
6
+ "instruction_loss": 0.0
7
+ },
8
+ "carbon_emissions_kg": 0.0,
9
+ "timestamp": "2026-01-25T08:39:09.549990",
10
+ "vision_backbone": "mobilevit_xs",
11
+ "language_backbone": "smollm_135m",
12
+ "total_parameters": 138908785,
13
+ "trainable_parameters": 34313153
14
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff