---
language: en
license: mit
tags:
  - vision-language-model
  - bitnet
  - mixture-of-experts
  - vlm
  - multimodal
  - edge-ai
pipeline_tag: image-text-to-text
---

# EmberNet β€” BitNet b1.58 MoE VLM

> **Status:** Stage 2/2, Epoch 3/5, Loss 4.1517

EmberNet is a tiny but capable Vision-Language Model built for edge deployment
and domain-expert reasoning.  It combines a frozen **SigLIP** vision backbone
with a **BitNet b1.58 ternary-quantized Mixture-of-Experts** language decoder,
achieving ~3Γ— memory reduction over a full-precision equivalent while
preserving strong visual understanding across 8 specialised domains.

---

## Model Details

| Property | Value |
|---|---|
| **Model type** | Vision-Language Model (VLM) |
| **Quantisation** | BitNet b1.58 (ternary weights: βˆ’1, 0, +1) |
| **Total parameters** | 840.8 M |
| **Trainable parameters** | 723.3 M |
| **Active parameters / forward** | ~235.4 M (top-2 routing) |
| **Carbon footprint** | 0.6390 kg COβ‚‚eq |
| **Training stage** | Stage 2/2 β€” Expert SFT |
| **Epoch** | 3/5 |
| **Best loss** | 4.1517 |
| **Last updated** | 2026-03-08 06:05 UTC |
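
The b1.58 ternary scheme in the table can be illustrated with a minimal absmean-style quantiser. This is a sketch of the general BitNet b1.58 recipe, not this repository's actual code; the function name is illustrative:

```python
def ternary_quantize(weights, eps=1e-8):
    """Quantise a list of weights to {-1, 0, +1} plus one per-tensor scale.

    Sketch of absmean-style BitNet b1.58 quantisation: scale by the mean
    absolute value, round to the nearest integer, then clamp to the
    ternary range.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale
```

At inference only the ternary codes and a single scale per tensor need to be stored, which is where the ~3× memory reduction over full precision comes from.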

---

## Architecture

```
EmberNet VLM
β”œβ”€β”€ Vision Encoder  (frozen)
β”‚   β”œβ”€β”€ SigLIP-base-patch16-224       92.9 M params
β”‚   β”œβ”€β”€ Token Compressor              2.4 M params
β”‚   β”œβ”€β”€ Spatial Pooler                2.4 M params
β”‚   └── BitLinear Projector           10.1 M params
β”‚
└── BitNet b1.58 MoE Decoder          733.1 M params total
    β”œβ”€β”€ Layers: 16   Hidden: 768   Heads: 12 (GQA kv=6)
    β”œβ”€β”€ Experts: 8 domain + 1 shared (always active)
    β”œβ”€β”€ Routing: Top-2 per token
    └── Quantisation: ternary weights, 4-bit activations
```

| Decoder Component | Parameters |
|---|---|
| Embeddings | 24.6 M |
| Attention (all layers) | 0 |
| Router (all layers) | 98.4 K |
| Shared Expert | 75.6 M |
| Domain Experts (8Γ—) | 604.4 M (75.6 M/expert) |

### Expert Domains

| ID | Expert | Trained on |
|----|--------|-----------|
| 0 | `vision_ocr` | TextVQA, DocVQA, OCR-VQA, InfoVQA |
| 1 | `vision_diagram` | AI2D, InfoVQA diagrams |
| 2 | `code_math_chart` | ChartQA, PlotQA, FigureQA, DVQA |
| 3 | `code_math_formula` | MathVista, math formula datasets |
| 4 | `spatial_scene` | VQAv2, GQA, Visual Genome |
| 5 | `spatial_reasoning` | RefCOCO, GQA spatial splits |
| 6 | `agentic_knowledge` | OK-VQA, A-OKVQA |
| 7 | `agentic_reasoning` | ScienceQA, CLEVR |
| β€” | `shared` | All domains (always active) |
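
The routing rule above (top-2 over the 8 domain experts, with the shared expert always active) can be sketched as follows. Names and signature are illustrative, not this repository's API:

```python
import math

def top2_route(router_logits):
    """Select the two highest-scoring domain experts and softmax-normalise
    their gate weights; the shared expert always participates with
    weight 1.0 regardless of the router scores."""
    top2 = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:2]
    exps = [math.exp(router_logits[i]) for i in top2]
    total = sum(exps)
    gates = {i: e / total for i, e in zip(top2, exps)}
    gates["shared"] = 1.0  # always-active shared expert
    return gates
```

Each token therefore runs through 3 expert FFNs (2 routed + 1 shared), which is why the active parameter count (~235 M) is far below the 840 M total.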

---

## Training

### Configuration

```yaml
stage_1_projector_alignment:
  epochs: 3
  batch_size: 8        # effective 32 with grad-accum 4
  learning_rate: 1e-4
  trainable: vision projector + compressor + pooler only

stage_2_expert_sft:
  epochs: 10
  batch_size: 4        # effective 16 with grad-accum 4
  learning_rate: 3e-4
  trainable: router + all 8 expert FFNs + shared expert
  expert_supervision_weight: 0.1
```
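
The effective batch sizes noted in the config follow directly from gradient accumulation; a one-line check (illustrative helper, not part of the repo):

```python
def effective_batch(per_step_batch, grad_accum_steps):
    """Effective batch size when gradients are accumulated over several
    micro-batches before each optimiser step."""
    return per_step_batch * grad_accum_steps
```

`effective_batch(8, 4)` gives 32 for stage 1 and `effective_batch(4, 4)` gives 16 for stage 2, matching the values in the config.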

### Optimiser

- **BitNetStableOptimizer** β€” custom Adam with FP32 master weights  
- Two-phase LR: full LR for 60 % of training, then 0.1 Γ— LR  
- Warmup: 100 steps  
- Weight clamp: [βˆ’3, 3] (maps cleanly to βˆ’1 / 0 / +1 at inference)
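
The schedule described above (100-step warmup, full LR for the first 60 % of training, then 0.1 Γ— LR) can be written as a simple step function; the function name and signature are illustrative:

```python
def two_phase_lr(step, total_steps, base_lr=3e-4, warmup_steps=100,
                 switch_frac=0.6):
    """Two-phase LR: linear warmup, then full LR until `switch_frac` of
    training, then a 10x decay for the remainder."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # linear warmup
    if step < switch_frac * total_steps:
        return base_lr                              # phase 1: full LR
    return 0.1 * base_lr                            # phase 2: 0.1x LR
```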

---

## Usage

```python
import torch
from PIL import Image
from transformers import AutoTokenizer

# Clone the repo and add it to your Python path, then:
from models import EmberNetVLM
from models.vlm import EmberNetConfig

# Load
config = EmberNetConfig()
model = EmberNetVLM(config)
ckpt = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Inference
image = Image.open("scene.jpg").convert("RGB")
prompt = "<image>\nDescribe what you see."

response = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
print(response)
```

---

## Intended Uses

- **Edge & embedded deployment** β€” ternary weights run efficiently on CPUs and NPUs  
- **Domain-aware visual reasoning** β€” dedicated experts for OCR, charts, math, spatial, and agentic tasks  
- **Robotic / agentic pipelines** β€” `agentic_knowledge` + `agentic_reasoning` experts support multi-step planning  
- **Fine-tuning base** β€” swap in domain datasets to specialise any of the 8 experts independently  

## Limitations

- Optimised for efficiency; maximum single-task accuracy is lower than full-precision models of similar size  
- Image resolution fixed at 224 Γ— 224; very fine-grained OCR may degrade  
- Expert routing is learned; novel domains may activate sub-optimal experts until fine-tuned  
- Tokeniser vocabulary (32 002) is Phi-2 derived; non-English performance is limited  

---

## Citation

```bibtex
@software{embernet_vlm,
  title  = {EmberNet: Tiny BitNet b1.58 MoE Vision-Language Model},
  author = {Aman Euh},
  year   = {2026},
  url    = {https://huggingface.co/euhidaman/EmberNet}
}
```