---
language: en
license: mit
tags:
- vision-language-model
- bitnet
- mixture-of-experts
- vlm
- multimodal
- edge-ai
pipeline_tag: image-text-to-text
---
# EmberNet: BitNet b1.58 MoE VLM
> **Status:** Stage 2/2, Epoch 3/5, Loss 4.1517
EmberNet is a tiny but capable Vision-Language Model built for edge deployment
and domain-expert reasoning. It combines a frozen **SigLIP** vision backbone
with a **BitNet b1.58 ternary-quantized Mixture-of-Experts** language decoder,
achieving roughly 3× memory reduction over a full-precision equivalent while
preserving strong visual understanding across 8 specialised domains.
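The b1.58 scheme replaces full-precision linear weights with ternary codes plus a scale. A minimal sketch of absmean quantisation in the style of the BitNet b1.58 paper (illustrative only; EmberNet's BitLinear kernels may differ, e.g. fused ops or per-group scales):

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Quantise a weight tensor to ternary codes {-1, 0, +1} plus one scale.

    Absmean scheme as in the BitNet b1.58 paper; shown for illustration,
    not taken from EmberNet's source.
    """
    scale = w.abs().mean().clamp(min=eps)        # per-tensor absmean scale
    codes = (w / scale).round().clamp(-1, 1)     # ternary codes
    return codes, scale                          # dequantise as codes * scale

w = torch.randn(768, 768)
codes, scale = absmean_ternary(w)
# every quantised weight is one of -1, 0, +1
assert set(codes.unique().tolist()) <= {-1.0, 0.0, 1.0}
```

Each weight then needs only ~1.58 bits of information, which is where the memory saving comes from.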
---
## Model Details
| Property | Value |
|---|---|
| **Model type** | Vision-Language Model (VLM) |
| **Quantisation** | BitNet b1.58 (ternary weights: −1, 0, +1) |
| **Total parameters** | 840.8 M |
| **Trainable parameters** | 723.3 M |
| **Active parameters / forward** | ~235.4 M (top-2 routing) |
| **Carbon footprint** | 0.6390 kg CO₂eq |
| **Training stage** | Stage 2/2: Expert SFT |
| **Epoch** | 3/5 |
| **Best loss** | 4.1517 |
| **Last updated** | 2026-03-08 06:05 UTC |
---
## Architecture
```
EmberNet VLM
├── Vision Encoder (frozen)
│   ├── SigLIP-base-patch16-224     92.9 M params
│   ├── Token Compressor             2.4 M params
│   ├── Spatial Pooler               2.4 M params
│   └── BitLinear Projector         10.1 M params
│
└── BitNet b1.58 MoE Decoder       733.1 M params total
    ├── Layers: 16   Hidden: 768   Heads: 12 (GQA kv=6)
    ├── Experts: 8 domain + 1 shared (always active)
    ├── Routing: Top-2 per token
    └── Quantisation: ternary weights, 4-bit activations
```
| Decoder Component | Parameters |
|---|---|
| Embeddings | 24.6 M |
| Attention (all layers) | ~28.4 M |
| Router (all layers) | 98.4 K |
| Shared Expert | 75.6 M |
| Domain Experts (8×) | 604.4 M (75.6 M/expert) |
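The routing described above (top-2 over the 8 domain experts, with the shared expert applied to every token) can be sketched as follows. Module names, shapes, and the plain `nn.Linear` experts are illustrative stand-ins, not EmberNet's actual API:

```python
import torch
import torch.nn as nn

hidden, n_experts, top_k = 768, 8, 2
router = nn.Linear(hidden, n_experts)
experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(n_experts))
shared_expert = nn.Linear(hidden, hidden)               # always active

def moe_layer(x: torch.Tensor) -> torch.Tensor:
    """x: [tokens, hidden]. Top-2 routing plus the always-active shared expert."""
    weights, idx = router(x).softmax(-1).topk(top_k, dim=-1)
    weights = weights / weights.sum(-1, keepdim=True)   # renormalise the top-2
    out = shared_expert(x)                              # shared expert sees every token
    for slot in range(top_k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e                    # tokens routed to expert e
            if mask.any():
                out[mask] = out[mask] + weights[mask, slot, None] * expert(x[mask])
    return out

y = moe_layer(torch.randn(10, hidden))                  # y.shape == (10, 768)
```

This is also why only ~235 M parameters are active per forward pass: each token touches the shared expert plus just 2 of the 8 domain experts.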
### Expert Domains
| ID | Expert | Trained on |
|----|--------|-----------|
| 0 | `vision_ocr` | TextVQA, DocVQA, OCR-VQA, InfoVQA |
| 1 | `vision_diagram` | AI2D, InfoVQA diagrams |
| 2 | `code_math_chart` | ChartQA, PlotQA, FigureQA, DVQA |
| 3 | `code_math_formula` | MathVista, math formula datasets |
| 4 | `spatial_scene` | VQAv2, GQA, Visual Genome |
| 5 | `spatial_reasoning` | RefCOCO, GQA spatial splits |
| 6 | `agentic_knowledge` | OK-VQA, A-OKVQA |
| 7 | `agentic_reasoning` | ScienceQA, CLEVR |
| – | `shared` | All domains (always active) |
---
## Training
### Configuration
```yaml
stage_1_projector_alignment:
  epochs: 3
  batch_size: 8          # effective 32 with grad-accum 4
  learning_rate: 1e-4
  trainable: vision projector + compressor + pooler only
stage_2_expert_sft:
  epochs: 10
  batch_size: 4          # effective 16 with grad-accum 4
  learning_rate: 3e-4
  trainable: router + all 8 expert FFNs + shared expert
  expert_supervision_weight: 0.1
```
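The effective batch sizes above come from standard gradient accumulation (grad-accum 4). A minimal sketch with a dummy model standing in for the VLM:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)                       # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
accum = 4                                      # grad-accum 4 => effective batch = 4 x micro-batch

optimizer.zero_grad()
steps = 0
for i in range(8):                             # 8 micro-batches -> 2 optimiser steps
    x, y = torch.randn(4, 16), torch.randn(4, 1)
    loss = nn.functional.mse_loss(model(x), y) / accum   # scale so gradients average
    loss.backward()
    if (i + 1) % accum == 0:                   # step once per `accum` micro-batches
        optimizer.step()
        optimizer.zero_grad()
        steps += 1
```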
### Optimiser
- **BitNetStableOptimizer**: custom Adam with FP32 master weights
- Two-phase LR: full LR for 60 % of training, then 0.1 × LR
- Warmup: 100 steps
- Weight clamp: [−3, 3] (maps cleanly to −1 / 0 / +1 at inference)
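A sketch of that two-phase schedule (linear warmup, full LR until 60 % of total steps, then 0.1 × LR). Parameter names and the warmup shape are assumptions, not read from BitNetStableOptimizer's source:

```python
def two_phase_lr(step: int, total_steps: int, base_lr: float = 3e-4,
                 warmup: int = 100, drop_at: float = 0.6, drop_to: float = 0.1) -> float:
    """Two-phase LR as described above; illustrative only."""
    if step < warmup:
        return base_lr * step / warmup          # linear warmup
    if step < drop_at * total_steps:
        return base_lr                          # phase 1: full LR
    return base_lr * drop_to                    # phase 2: 10% of LR

lr_now = two_phase_lr(step=5_000, total_steps=10_000)   # still in phase 1
```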
---
## Usage
```python
import torch
from PIL import Image
from transformers import AutoTokenizer
# Clone the repo and add it to your Python path, then:
from models import EmberNetVLM
from models.vlm import EmberNetConfig
# Load
config = EmberNetConfig()
model = EmberNetVLM(config)
ckpt = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
# Inference
image = Image.open("scene.jpg").convert("RGB")
prompt = "<image>\nDescribe what you see."
response = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
print(response)
```
---
## Intended Uses
- **Edge & embedded deployment**: ternary weights run efficiently on CPUs and NPUs
- **Domain-aware visual reasoning**: dedicated experts for OCR, charts, math, spatial, and agentic tasks
- **Robotic / agentic pipelines**: `agentic_knowledge` + `agentic_reasoning` experts support multi-step planning
- **Fine-tuning base**: swap in domain datasets to specialise any of the 8 experts independently
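As a sketch of why ternary weights suit edge deployment: {−1, 0, +1} codes pack into 2 bits each, i.e. 4 weights per byte, roughly an 8× reduction versus fp16 before scales and other overhead. A hypothetical packer (not EmberNet's actual on-disk or runtime format):

```python
import numpy as np

def pack_ternary(codes: np.ndarray) -> np.ndarray:
    """Pack a 1-D array of {-1, 0, +1} codes into 2 bits each, 4 per byte.

    Hypothetical edge-deployment layout for illustration; EmberNet's
    real storage format is not documented here.
    """
    u = (codes.astype(np.int8) + 1).astype(np.uint8)   # map {-1,0,+1} -> {0,1,2}
    u = np.pad(u, (0, (-len(u)) % 4))                  # pad to a multiple of 4
    u = u.reshape(-1, 4)
    return (u[:, 0] | (u[:, 1] << 2) | (u[:, 2] << 4) | (u[:, 3] << 6)).astype(np.uint8)

packed = pack_ternary(np.array([-1, 0, 1, 1]))
# packed is a single byte holding four ternary weights
```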
## Limitations
- Optimised for efficiency; maximum single-task accuracy is lower than full-precision models of similar size
- Image resolution fixed at 224 Γ— 224; very fine-grained OCR may degrade
- Expert routing is learned; novel domains may activate sub-optimal experts until fine-tuned
- Tokeniser vocabulary (32 002) is Phi-2 derived; non-English performance is limited
---
## Citation
```bibtex
@software{embernet_vlm,
  title  = {EmberNet: Tiny BitNet b1.58 MoE Vision-Language Model},
  author = {Aman Euh},
  year   = {2026},
  url    = {https://huggingface.co/euhidaman/EmberNet}
}
```