Update EmberNet Stage 2 Epoch 1/5 | loss 4.9617 | step 625
- README.md +173 -0
- config.json +51 -0
- pytorch_model.bin +3 -0
- tokenizer.json +0 -0
- tokenizer_config.json +11 -0
README.md
ADDED
---
language: en
license: mit
tags:
- vision-language-model
- bitnet
- mixture-of-experts
- vlm
- multimodal
- edge-ai
pipeline_tag: image-text-to-text
---

# EmberNet — BitNet b1.58 MoE VLM

> **Status:** Stage 2/2, Epoch 1/5, Loss 4.9617

EmberNet is a tiny but capable Vision-Language Model built for edge deployment
and domain-expert reasoning. It combines a frozen **SigLIP** vision backbone
with a **BitNet b1.58 ternary-quantised Mixture-of-Experts** language decoder,
achieving ~3× memory reduction over a full-precision equivalent while
preserving strong visual understanding across 8 specialised domains.

---

## Model Details

| Property | Value |
|---|---|
| **Model type** | Vision-Language Model (VLM) |
| **Quantisation** | BitNet b1.58 (ternary weights: −1, 0, +1) |
| **Total parameters** | 840.8 M |
| **Trainable parameters** | 723.3 M |
| **Active parameters / forward** | ~235.4 M (top-2 routing) |
| **Carbon footprint** | 0.2091 kg CO₂eq |
| **Training stage** | Stage 2/2 — Expert SFT |
| **Epoch** | 1/5 |
| **Best loss** | 4.9617 |
| **Last updated** | 2026-03-07 22:40 UTC |

---
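
The ternary weight set in the table above can be illustrated with a minimal quantisation sketch. This is not EmberNet's actual code; it assumes the absmean scheme commonly used with BitNet b1.58 (scale by the mean absolute weight, round, clip to {−1, 0, +1}):

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-5):
    """Absmean ternary quantisation (assumed BitNet b1.58-style):
    scale by the mean |w|, round, then clip to {-1, 0, +1}."""
    scale = np.abs(w).mean() + eps
    q = np.clip(np.rint(w / scale), -1.0, 1.0)
    return q, scale  # approximate dequantisation is q * scale

w = np.array([0.9, -0.9, 0.01, 2.7])
q, scale = ternary_quantize(w)
# q holds only values from {-1, 0, +1}
```

Each weight thus needs only ~1.58 bits of information plus one shared scale per tensor, which is where the memory reduction comes from.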

## Architecture

```
EmberNet VLM
├── Vision Encoder (frozen)
│   ├── SigLIP-base-patch16-224   92.9 M params
│   ├── Token Compressor           2.4 M params
│   ├── Spatial Pooler             2.4 M params
│   └── BitLinear Projector       10.1 M params
│
└── BitNet b1.58 MoE Decoder     733.1 M params total
    ├── Layers: 16   Hidden: 768   Heads: 12 (GQA kv=6)
    ├── Experts: 8 domain + 1 shared (always active)
    ├── Routing: Top-2 per token
    └── Quantisation: ternary weights, 4-bit activations
```

| Decoder Component | Parameters |
|---|---|
| Embeddings | 24.6 M |
| Attention (all layers) | 0 |
| Router (all layers) | 98.4 K |
| Shared Expert | 75.6 M |
| Domain Experts (8×) | 604.4 M (75.6 M/expert) |

### Expert Domains

| ID | Expert | Trained on |
|----|--------|-----------|
| 0 | `vision_ocr` | TextVQA, DocVQA, OCR-VQA, InfoVQA |
| 1 | `vision_diagram` | AI2D, InfoVQA diagrams |
| 2 | `code_math_chart` | ChartQA, PlotQA, FigureQA, DVQA |
| 3 | `code_math_formula` | MathVista, math formula datasets |
| 4 | `spatial_scene` | VQAv2, GQA, Visual Genome |
| 5 | `spatial_reasoning` | RefCOCO, GQA spatial splits |
| 6 | `agentic_knowledge` | OK-VQA, A-OKVQA |
| 7 | `agentic_reasoning` | ScienceQA, CLEVR |
| — | `shared` | All domains (always active) |

---
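
The routing described above (top-2 of 8 domain experts per token, plus an always-on shared expert) can be sketched as follows. This is an illustrative NumPy sketch, not EmberNet's implementation; the function names and the renormalisation-over-selected-experts choice are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(h, router_w, experts, shared_expert, top_k=2):
    """Top-k MoE forward: each token goes to its top-k domain experts,
    weighted by renormalised router probabilities; the shared expert
    is applied to every token unconditionally."""
    probs = softmax(h @ router_w)                  # [tokens, num_experts]
    top = np.argsort(-probs, axis=-1)[:, :top_k]   # top-k expert ids per token
    out = shared_expert(h)                         # shared expert: always active
    for t in range(h.shape[0]):
        w = probs[t, top[t]]
        w = w / w.sum()                            # renormalise over chosen experts
        for weight, idx in zip(w, top[t]):
            out[t] += weight * experts[idx](h[t])
    return out
```

Because only 2 of the 8 domain experts fire per token, the active parameter count per forward pass stays far below the 840.8 M total.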

## Training

### Configuration

```yaml
stage_1_projector_alignment:
  epochs: 3
  batch_size: 8          # effective 32 with grad-accum 4
  learning_rate: 1e-4
  trainable: vision projector + compressor + pooler only

stage_2_expert_sft:
  epochs: 10
  batch_size: 4          # effective 16 with grad-accum 4
  learning_rate: 3e-4
  trainable: router + all 8 expert FFNs + shared expert
  expert_supervision_weight: 0.1
```

### Optimiser

- **BitNetStableOptimizer** — custom Adam with FP32 master weights
- Two-phase LR: full LR for the first 60 % of training, then 0.1 × LR
- Warmup: 100 steps
- Weight clamp: [−3, 3] (maps cleanly to −1 / 0 / +1 at inference)

---
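
The schedule in the optimiser bullets can be sketched as a plain function. This is a reconstruction from those bullets, not BitNetStableOptimizer's code; the linear-warmup shape and the step-based phase boundary are assumptions:

```python
def two_phase_lr(step: int, total_steps: int, base_lr: float = 3e-4,
                 warmup_steps: int = 100, full_frac: float = 0.6,
                 decay: float = 0.1) -> float:
    """Linear warmup, then full LR for the first 60% of training,
    then a flat 0.1x LR for the remainder."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps   # linear warmup
    if step < full_frac * total_steps:
        return base_lr                         # phase 1: full LR
    return base_lr * decay                     # phase 2: 0.1x LR
```

The late drop to 0.1 × LR helps the clamped latent weights settle near the ternary levels before inference-time quantisation.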

## Usage

```python
import torch
from PIL import Image
from transformers import AutoTokenizer

# Clone the repo and add it to your Python path, then:
from models import EmberNetVLM
from models.vlm import EmberNetConfig

# Load the checkpoint
config = EmberNetConfig()
model = EmberNetVLM(config)
ckpt = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Inference
image = Image.open("scene.jpg").convert("RGB")
prompt = "<image>\nDescribe what you see."

response = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
print(response)
```

---

## Intended Uses

- **Edge & embedded deployment** — ternary weights run efficiently on CPUs and NPUs
- **Domain-aware visual reasoning** — dedicated experts for OCR, charts, math, spatial, and agentic tasks
- **Robotic / agentic pipelines** — `agentic_knowledge` + `agentic_reasoning` experts support multi-step planning
- **Fine-tuning base** — swap in domain datasets to specialise any of the 8 experts independently

## Limitations

- Optimised for efficiency; maximum single-task accuracy is lower than full-precision models of similar size
- Image resolution is fixed at 224 × 224; very fine-grained OCR may degrade
- Expert routing is learned; novel domains may activate sub-optimal experts until fine-tuned
- Tokeniser vocabulary (32 002) is Phi-2 derived; non-English performance is limited

---

## Citation

```bibtex
@software{embernet_vlm,
  title  = {EmberNet: Tiny BitNet b1.58 MoE Vision-Language Model},
  author = {Aman Euh},
  year   = {2026},
  url    = {https://huggingface.co/euhidaman/EmberNet}
}
```
config.json
ADDED
{
  "model_type": "embernet_vlm",
  "architecture": "BitNet b1.58 MoE VLM",
  "vision_encoder": {
    "model_name": "google/siglip-base-patch16-224",
    "num_image_tokens": 64,
    "freeze_vision": true
  },
  "language_decoder": {
    "vocab_size": 32002,
    "hidden_size": 768,
    "intermediate_size": 2048,
    "num_layers": 16,
    "num_attention_heads": 12,
    "num_kv_heads": 6,
    "max_position_embeddings": 4096,
    "num_experts": 8,
    "num_experts_per_tok": 2,
    "use_shared_expert": true,
    "expert_domains": [
      "vision_ocr",
      "vision_diagram",
      "code_math_chart",
      "code_math_formula",
      "spatial_scene",
      "spatial_reasoning",
      "agentic_knowledge",
      "agentic_reasoning"
    ],
    "quantisation": "BitNet b1.58 (ternary)",
    "activation_bits": 4
  },
  "torch_dtype": "bfloat16",
  "transformers_version": ">=4.36.0",
  "parameter_counts": {
    "vision_encoder": 107748864,
    "vision_encoder_breakdown": {
      "encoder": 92884224,
      "compressor": 2363904,
      "pooler": 2412288,
      "projector": 10088448
    },
    "decoder_total": 733055360,
    "decoder_embeddings": 24577536,
    "decoder_attention": 0,
    "decoder_router": 98432,
    "decoder_shared_expert": 75554816,
    "decoder_domain_experts": 604438528,
    "num_domain_experts": 8
  }
}
pytorch_model.bin
ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:aeb837296425d584e4ee59d0a9cbfc46d3ad4fd0d2a51c4e232b1ee53ba6377c
size 3397346561
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
{
  "add_prefix_space": false,
  "backend": "tokenizers",
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|endoftext|>",
  "is_local": false,
  "model_max_length": 2048,
  "tokenizer_class": "TokenizersBackend",
  "unk_token": "<|endoftext|>"
}