---
language: en
license: mit
tags:
  - vision-language-model
  - bitnet
  - mixture-of-experts
  - vlm
  - multimodal
  - edge-ai
pipeline_tag: image-text-to-text
---

# EmberNet β€” BitNet b1.58 MoE VLM

> **Status:** Stage 2/2, Epoch 3/5, Loss 4.1517

EmberNet is a tiny but capable Vision-Language Model built for edge deployment
and domain-expert reasoning.  It combines a frozen **SigLIP** vision backbone
with a **BitNet b1.58 ternary-quantized Mixture-of-Experts** language decoder,
achieving ~3Γ— memory reduction over a full-precision equivalent while
preserving strong visual understanding across 8 specialised domains.

---

## Model Details

| Property | Value |
|---|---|
| **Model type** | Vision-Language Model (VLM) |
| **Quantisation** | BitNet b1.58 (ternary weights: βˆ’1, 0, +1) |
| **Total parameters** | 840.8 M |
| **Trainable parameters** | 723.3 M |
| **Active parameters / forward** | ~235.4 M (top-2 routing) |
| **Carbon footprint** | 0.6390 kg COβ‚‚eq |
| **Training stage** | Stage 2/2 β€” Expert SFT |
| **Epoch** | 3/5 |
| **Best loss** | 4.1517 |
| **Last updated** | 2026-03-08 06:05 UTC |
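
The b1.58 ternary scheme in the table can be illustrated with a minimal absmean-style quantiser. This is a sketch of the general BitNet b1.58 recipe, not this repository's actual code; the function name is illustrative:

```python
def ternary_quantize(weights, eps=1e-8):
    """Quantise a list of weights to {-1, 0, +1} plus one per-tensor scale.

    Sketch of absmean-style BitNet b1.58 quantisation: scale by the mean
    absolute value, round to the nearest integer, then clamp to the
    ternary range.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale
```

At inference only the ternary codes and a single scale per tensor need to be stored, which is where the ~3× memory reduction over full precision comes from.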

---

## Architecture

```
EmberNet VLM
β”œβ”€β”€ Vision Encoder  (frozen)
β”‚   β”œβ”€β”€ SigLIP-base-patch16-224       92.9 M params
β”‚   β”œβ”€β”€ Token Compressor              2.4 M params
β”‚   β”œβ”€β”€ Spatial Pooler                2.4 M params
β”‚   └── BitLinear Projector           10.1 M params
β”‚
└── BitNet b1.58 MoE Decoder          733.1 M params total
    β”œβ”€β”€ Layers: 16   Hidden: 768   Heads: 12 (GQA kv=6)
    β”œβ”€β”€ Experts: 8 domain + 1 shared (always active)
    β”œβ”€β”€ Routing: Top-2 per token
    └── Quantisation: ternary weights, 4-bit activations
```

| Decoder Component | Parameters |
|---|---|
| Embeddings | 24.6 M |
| Attention (all layers) | 0 |
| Router (all layers) | 98.4 K |
| Shared Expert | 75.6 M |
| Domain Experts (8Γ—) | 604.4 M (75.6 M/expert) |

### Expert Domains

| ID | Expert | Trained on |
|----|--------|-----------|
| 0 | `vision_ocr` | TextVQA, DocVQA, OCR-VQA, InfoVQA |
| 1 | `vision_diagram` | AI2D, InfoVQA diagrams |
| 2 | `code_math_chart` | ChartQA, PlotQA, FigureQA, DVQA |
| 3 | `code_math_formula` | MathVista, math formula datasets |
| 4 | `spatial_scene` | VQAv2, GQA, Visual Genome |
| 5 | `spatial_reasoning` | RefCOCO, GQA spatial splits |
| 6 | `agentic_knowledge` | OK-VQA, A-OKVQA |
| 7 | `agentic_reasoning` | ScienceQA, CLEVR |
| β€” | `shared` | All domains (always active) |
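
The routing rule above (top-2 over the 8 domain experts, with the shared expert always active) can be sketched as follows. Names and signature are illustrative, not this repository's API:

```python
import math

def top2_route(router_logits):
    """Select the two highest-scoring domain experts and softmax-normalise
    their gate weights; the shared expert always participates with
    weight 1.0 regardless of the router scores."""
    top2 = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:2]
    exps = [math.exp(router_logits[i]) for i in top2]
    total = sum(exps)
    gates = {i: e / total for i, e in zip(top2, exps)}
    gates["shared"] = 1.0  # always-active shared expert
    return gates
```

Each token therefore runs through 3 expert FFNs (2 routed + 1 shared), which is why the active parameter count (~235 M) is far below the 840 M total.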

---

## Training

### Configuration

```yaml
stage_1_projector_alignment:
  epochs: 3
  batch_size: 8        # effective 32 with grad-accum 4
  learning_rate: 1e-4
  trainable: vision projector + compressor + pooler only

stage_2_expert_sft:
  epochs: 10
  batch_size: 4        # effective 16 with grad-accum 4
  learning_rate: 3e-4
  trainable: router + all 8 expert FFNs + shared expert
  expert_supervision_weight: 0.1
```
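
The effective batch sizes noted in the config follow directly from gradient accumulation; a one-line check (illustrative helper, not part of the repo):

```python
def effective_batch(per_step_batch, grad_accum_steps):
    """Effective batch size when gradients are accumulated over several
    micro-batches before each optimiser step."""
    return per_step_batch * grad_accum_steps
```

`effective_batch(8, 4)` gives 32 for stage 1 and `effective_batch(4, 4)` gives 16 for stage 2, matching the values in the config.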

### Optimiser

- **BitNetStableOptimizer** β€” custom Adam with FP32 master weights  
- Two-phase LR: full LR for 60 % of training, then 0.1 Γ— LR  
- Warmup: 100 steps  
- Weight clamp: [βˆ’3, 3] (maps cleanly to βˆ’1 / 0 / +1 at inference)
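
The schedule described above (100-step warmup, full LR for the first 60 % of training, then 0.1 Γ— LR) can be written as a simple step function; the function name and signature are illustrative:

```python
def two_phase_lr(step, total_steps, base_lr=3e-4, warmup_steps=100,
                 switch_frac=0.6):
    """Two-phase LR: linear warmup, then full LR until `switch_frac` of
    training, then a 10x decay for the remainder."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # linear warmup
    if step < switch_frac * total_steps:
        return base_lr                              # phase 1: full LR
    return 0.1 * base_lr                            # phase 2: 0.1x LR
```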

---

## Usage

```python
import torch
from PIL import Image
from transformers import AutoTokenizer

# Clone the repo and add it to your Python path, then:
from models import EmberNetVLM
from models.vlm import EmberNetConfig

# Load
config = EmberNetConfig()
model = EmberNetVLM(config)
ckpt = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Inference
image = Image.open("scene.jpg").convert("RGB")
prompt = "<image>\nDescribe what you see."

response = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
print(response)
```

---

## Intended Uses

- **Edge & embedded deployment** β€” ternary weights run efficiently on CPUs and NPUs  
- **Domain-aware visual reasoning** β€” dedicated experts for OCR, charts, math, spatial, and agentic tasks  
- **Robotic / agentic pipelines** β€” `agentic_knowledge` + `agentic_reasoning` experts support multi-step planning  
- **Fine-tuning base** β€” swap in domain datasets to specialise any of the 8 experts independently  

## Limitations

- Optimised for efficiency; maximum single-task accuracy is lower than full-precision models of similar size  
- Image resolution fixed at 224 Γ— 224; very fine-grained OCR may degrade  
- Expert routing is learned; novel domains may activate sub-optimal experts until fine-tuned  
- Tokeniser vocabulary (32 002) is Phi-2 derived; non-English performance is limited  

---

## Citation

```bibtex
@software{embernet_vlm,
  title  = {EmberNet: Tiny BitNet b1.58 MoE Vision-Language Model},
  author = {Aman Euh},
  year   = {2026},
  url    = {https://huggingface.co/euhidaman/EmberNet}
}
```