---
language: en
license: mit
tags:
- vision-language-model
- bitnet
- mixture-of-experts
- vlm
- multimodal
- edge-ai
pipeline_tag: image-text-to-text
---
# EmberNet – BitNet b1.58 MoE VLM
> **Status:** Stage 2/2, Epoch 3/5, Loss 4.1517
EmberNet is a tiny but capable Vision-Language Model built for edge deployment
and domain-expert reasoning. It combines a frozen **SigLIP** vision backbone
with a **BitNet b1.58 ternary-quantized Mixture-of-Experts** language decoder,
achieving ~3× memory reduction over a full-precision equivalent while
preserving strong visual understanding across 8 specialised domains.
---
## Model Details
| Property | Value |
|---|---|
| **Model type** | Vision-Language Model (VLM) |
| **Quantisation** | BitNet b1.58 (ternary weights: −1, 0, +1) |
| **Total parameters** | 840.8 M |
| **Trainable parameters** | 723.3 M |
| **Active parameters / forward** | ~235.4 M (top-2 routing) |
| **Carbon footprint** | 0.6390 kg CO₂eq |
| **Training stage** | Stage 2/2 – Expert SFT |
| **Epoch** | 3/5 |
| **Best loss** | 4.1517 |
| **Last updated** | 2026-03-08 06:05 UTC |
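
The b1.58 scheme constrains every weight to {−1, 0, +1}. As a minimal sketch of how such a ternary projection works (following the published absmean recipe for BitNet b1.58, not this repository's exact code):

```python
import torch

def quantize_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Absmean ternary quantisation: scale by the mean absolute value,
    then round and clip every weight to {-1, 0, +1}."""
    scale = w.abs().mean().clamp(min=eps)
    return (w / scale).round().clamp(-1, 1)

w = torch.tensor([[0.9, -0.05, -1.2], [0.02, 0.4, -0.6]])
q = quantize_ternary(w)
```

At inference the ternary matrix multiplies reduce to additions and sign flips, which is what makes the ~3× memory saving (and CPU/NPU friendliness) possible.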
---
## Architecture
```
EmberNet VLM
├── Vision Encoder (frozen)
│   ├── SigLIP-base-patch16-224    92.9 M params
│   ├── Token Compressor            2.4 M params
│   ├── Spatial Pooler              2.4 M params
│   └── BitLinear Projector        10.1 M params
│
└── BitNet b1.58 MoE Decoder      733.1 M params total
    ├── Layers: 16   Hidden: 768   Heads: 12 (GQA kv=6)
    ├── Experts: 8 domain + 1 shared (always active)
    ├── Routing: Top-2 per token
    └── Quantisation: ternary weights, 4-bit activations
```
| Decoder Component | Parameters |
|---|---|
| Embeddings | 24.6 M |
| Attention (all layers) | 28.4 M |
| Router (all layers) | 98.4 K |
| Shared Expert | 75.6 M |
| Domain Experts (8Γ) | 604.4 M (75.6 M/expert) |
### Expert Domains
| ID | Expert | Trained on |
|----|--------|-----------|
| 0 | `vision_ocr` | TextVQA, DocVQA, OCR-VQA, InfoVQA |
| 1 | `vision_diagram` | AI2D, InfoVQA diagrams |
| 2 | `code_math_chart` | ChartQA, PlotQA, FigureQA, DVQA |
| 3 | `code_math_formula` | MathVista, math formula datasets |
| 4 | `spatial_scene` | VQAv2, GQA, Visual Genome |
| 5 | `spatial_reasoning` | RefCOCO, GQA spatial splits |
| 6 | `agentic_knowledge` | OK-VQA, A-OKVQA |
| 7 | `agentic_reasoning` | ScienceQA, CLEVR |
| – | `shared` | All domains (always active) |
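
The routing described above (top-2 over the 8 domain experts, with the shared expert applied to every token) can be sketched as follows. This is illustrative only; the function name, shapes, and module layout are assumptions, not this repository's actual API:

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router_weight, experts, shared_expert, top_k=2):
    """Top-k token routing with an always-active shared expert.
    x: [tokens, hidden]; router_weight: [hidden, n_experts]."""
    probs = F.softmax(x @ router_weight, dim=-1)     # [tokens, n_experts]
    top_p, top_i = probs.topk(top_k, dim=-1)         # [tokens, top_k]
    top_p = top_p / top_p.sum(dim=-1, keepdim=True)  # renormalise over chosen experts
    out = shared_expert(x)                           # shared expert sees every token
    for k in range(top_k):
        for e, expert in enumerate(experts):
            mask = top_i[:, k] == e                  # tokens routed to expert e in slot k
            if mask.any():
                out[mask] = out[mask] + top_p[mask, k : k + 1] * expert(x[mask])
    return out

torch.manual_seed(0)
x = torch.randn(5, 8)
router_weight = torch.randn(8, 4)
experts = [torch.nn.Linear(8, 8) for _ in range(4)]
shared = torch.nn.Linear(8, 8)
with torch.no_grad():
    y = moe_forward(x, router_weight, experts, shared)
```

Because only 2 of the 8 domain experts fire per token (plus the shared expert), a forward pass touches ~235 M of the 840 M total parameters.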
---
## Training
### Configuration
```yaml
stage_1_projector_alignment:
  epochs: 3
  batch_size: 8            # effective 32 with grad-accum 4
  learning_rate: 1e-4
  trainable: vision projector + compressor + pooler only

stage_2_expert_sft:
  epochs: 10
  batch_size: 4            # effective 16 with grad-accum 4
  learning_rate: 3e-4
  trainable: router + all 8 expert FFNs + shared expert
  expert_supervision_weight: 0.1
```
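
The effective batch sizes above come from gradient accumulation: 4 micro-batches of 4 are summed before each optimiser step in Stage 2. A minimal loop showing the pattern (variable names are illustrative, not from the training code):

```python
import torch

model = torch.nn.Linear(16, 1)
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
accum = 4  # micro-batch 4 x accum 4 -> effective batch 16

data = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(8)]

opt.zero_grad()
for step, (xb, yb) in enumerate(data, start=1):
    loss = torch.nn.functional.mse_loss(model(xb), yb)
    (loss / accum).backward()   # scale so accumulated grads match the large-batch mean
    if step % accum == 0:       # one optimiser update per 4 micro-batches
        opt.step()
        opt.zero_grad()
```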
### Optimiser
- **BitNetStableOptimizer** – custom Adam with FP32 master weights
- Two-phase LR: full LR for 60% of training, then 0.1 × LR
- Warmup: 100 steps
- Weight clamp: [−3, 3] (maps cleanly to −1 / 0 / +1 at inference)
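
The two-phase schedule above can be written as a small function. This is a sketch of the described shape (assuming linear warmup and a hard drop at 60%), not the optimiser's actual implementation:

```python
def lr_at_step(step: int, total_steps: int, base_lr: float = 3e-4,
               warmup: int = 100, drop_frac: float = 0.6) -> float:
    """Linear warmup, then full LR until 60% of training, then 0.1x LR."""
    if step < warmup:
        return base_lr * (step + 1) / warmup     # linear warmup
    if step < drop_frac * total_steps:
        return base_lr                           # phase 1: full LR
    return 0.1 * base_lr                         # phase 2: decayed LR
```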
---
## Usage
```python
import torch
from PIL import Image
from transformers import AutoTokenizer
# Clone the repo and add it to your Python path, then:
from models import EmberNetVLM
from models.vlm import EmberNetConfig
# Load
config = EmberNetConfig()
model = EmberNetVLM(config)
ckpt = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
# Inference
image = Image.open("scene.jpg").convert("RGB")
prompt = "<image>\nDescribe what you see."
response = model.generate(
image=image,
prompt=prompt,
tokenizer=tokenizer,
max_new_tokens=256,
)
print(response)
```
---
## Intended Uses
- **Edge & embedded deployment** – ternary weights run efficiently on CPUs and NPUs
- **Domain-aware visual reasoning** – dedicated experts for OCR, charts, math, spatial, and agentic tasks
- **Robotic / agentic pipelines** – `agentic_knowledge` + `agentic_reasoning` experts support multi-step planning
- **Fine-tuning base** – swap in domain datasets to specialise any of the 8 experts independently
## Limitations
- Optimised for efficiency; maximum single-task accuracy is lower than full-precision models of similar size
- Image resolution fixed at 224 × 224; very fine-grained OCR may degrade
- Expert routing is learned; novel domains may activate sub-optimal experts until fine-tuned
- Tokeniser vocabulary (32,002 tokens) is Phi-2 derived; non-English performance is limited
---
## Citation
```bibtex
@software{embernet_vlm,
title = {EmberNet: Tiny BitNet b1.58 MoE Vision-Language Model},
author = {Aman Euh},
year = {2026},
url = {https://huggingface.co/euhidaman/EmberNet}
}
```