---
language: en
license: mit
tags:
- vision-language-model
- bitnet
- mixture-of-experts
- vlm
- multimodal
- edge-ai
pipeline_tag: image-text-to-text
---

# EmberNet: BitNet b1.58 MoE VLM

> **Status:** Stage 2/2, Epoch 3/5, Loss 4.1517

EmberNet is a tiny but capable Vision-Language Model built for edge deployment
and domain-expert reasoning. It combines a frozen **SigLIP** vision backbone
with a **BitNet b1.58 ternary-quantised Mixture-of-Experts** language decoder,
achieving roughly 3× lower memory use than a full-precision equivalent while
preserving strong visual understanding across 8 specialised domains.

---

## Model Details

| Property | Value |
|---|---|
| **Model type** | Vision-Language Model (VLM) |
| **Quantisation** | BitNet b1.58 (ternary weights: −1, 0, +1) |
| **Total parameters** | 840.8 M |
| **Trainable parameters** | 723.3 M |
| **Active parameters / forward** | ~235.4 M (top-2 routing) |
| **Carbon footprint** | 0.6390 kg CO₂eq |
| **Training stage** | Stage 2/2 (Expert SFT) |
| **Epoch** | 3/5 |
| **Best loss** | 4.1517 |
| **Last updated** | 2026-03-08 06:05 UTC |

---

## Architecture

```
EmberNet VLM
├── Vision Encoder (frozen)
│   ├── SigLIP-base-patch16-224    92.9 M params
│   ├── Token Compressor            2.4 M params
│   ├── Spatial Pooler              2.4 M params
│   └── BitLinear Projector        10.1 M params
│
└── BitNet b1.58 MoE Decoder      733.1 M params total
    ├── Layers: 16   Hidden: 768   Heads: 12 (GQA kv=6)
    ├── Experts: 8 domain + 1 shared (always active)
    ├── Routing: Top-2 per token
    └── Quantisation: ternary weights, 4-bit activations
```
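The BitLinear layers replace full-precision matrix multiplies with ternary weights. As a rough illustration of the standard b1.58 absmean scheme (a sketch only, not EmberNet's actual kernels; `ternary_quantize` is a hypothetical helper):

```python
import torch

def ternary_quantize(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Absmean ternary quantisation: scale by mean |w|, round to {-1, 0, +1}."""
    gamma = w.abs().mean().clamp(min=1e-5)     # per-tensor scale
    w_q = (w / gamma).round().clamp_(-1, 1)    # ternary weights
    return w_q, gamma                          # keep gamma to rescale outputs

w = torch.tensor([[0.9, -0.05, -1.2],
                  [0.4,  0.00,  2.0]])
w_q, gamma = ternary_quantize(w)
print(w_q)   # tensor([[ 1.,  0., -1.],
             #         [ 1.,  0.,  1.]])
```

Because each weight needs only ~1.58 bits, the decoder's memory footprint shrinks accordingly, which is the source of the ~3× reduction claimed above.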

| Decoder Component | Parameters |
|---|---|
| Embeddings | 24.6 M |
| Attention (all layers) | 0 |
| Router (all layers) | 98.4 K |
| Shared Expert | 75.6 M |
| Domain Experts (8×) | 604.4 M (75.6 M/expert) |

### Expert Domains

| ID | Expert | Trained on |
|----|--------|------------|
| 0 | `vision_ocr` | TextVQA, DocVQA, OCR-VQA, InfoVQA |
| 1 | `vision_diagram` | AI2D, InfoVQA diagrams |
| 2 | `code_math_chart` | ChartQA, PlotQA, FigureQA, DVQA |
| 3 | `code_math_formula` | MathVista, math formula datasets |
| 4 | `spatial_scene` | VQAv2, GQA, Visual Genome |
| 5 | `spatial_reasoning` | RefCOCO, GQA spatial splits |
| 6 | `agentic_knowledge` | OK-VQA, A-OKVQA |
| 7 | `agentic_reasoning` | ScienceQA, CLEVR |
| – | `shared` | All domains (always active) |
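A minimal sketch of how top-2 routing with an always-active shared expert can combine expert outputs (illustrative only; `moe_layer` and the renormalisation of the top-2 probabilities are assumptions, not EmberNet's actual code):

```python
import torch
import torch.nn.functional as F

def moe_layer(x, router_logits, experts, shared_expert):
    """Each token is processed by its top-2 domain experts (weighted by the
    renormalised router probabilities) plus the always-active shared expert."""
    probs = F.softmax(router_logits, dim=-1)          # (n_tokens, n_experts)
    top_p, top_i = probs.topk(2, dim=-1)              # top-2 per token
    top_p = top_p / top_p.sum(dim=-1, keepdim=True)   # renormalise the pair
    out = shared_expert(x)                            # shared expert sees every token
    for slot in range(top_i.shape[-1]):
        for e, expert in enumerate(experts):
            mask = top_i[:, slot] == e
            if mask.any():                            # route only matching tokens
                out[mask] += top_p[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out
```

With 2 of 8 domain experts plus the shared expert active per token, only a fraction of the 733.1 M decoder parameters (~235.4 M model-wide) participate in each forward pass.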

---

## Training

### Configuration

```yaml
stage_1_projector_alignment:
  epochs: 3
  batch_size: 8          # effective 32 with grad-accum 4
  learning_rate: 1e-4
  trainable: vision projector + compressor + pooler only

stage_2_expert_sft:
  epochs: 10
  batch_size: 4          # effective 16 with grad-accum 4
  learning_rate: 3e-4
  trainable: router + all 8 expert FFNs + shared expert
  expert_supervision_weight: 0.1
```

### Optimiser

- **BitNetStableOptimizer**: a custom Adam variant with FP32 master weights
- Two-phase LR: full LR for the first 60% of training, then 0.1× LR
- Warmup: 100 steps
- Weight clamp: [−3, 3] (maps cleanly to −1 / 0 / +1 at inference)
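The two-phase schedule above can be sketched as a plain function of the step index (a minimal illustration; `lr_at` is a hypothetical helper and the real trainer's bookkeeping may differ):

```python
def lr_at(step: int, total_steps: int, base_lr: float = 3e-4,
          warmup_steps: int = 100, split: float = 0.6, decay: float = 0.1) -> float:
    """Linear warmup, full LR until `split` of training, then a single drop."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps   # linear warmup
    if step < split * total_steps:
        return base_lr                               # phase 1: full LR
    return base_lr * decay                           # phase 2: 0.1x LR
```

A single hard LR drop (rather than a continuous decay) keeps the quantisation-aware updates large for most of training, then lets the ternary weight assignments settle in the final 40%.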

---

## Usage

```python
import torch
from PIL import Image
from transformers import AutoTokenizer

# Clone the repo and add it to your Python path, then:
from models import EmberNetVLM
from models.vlm import EmberNetConfig

# Load the model and checkpoint
config = EmberNetConfig()
model = EmberNetVLM(config)
# weights_only=False executes pickled code; only load checkpoints you trust
ckpt = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Inference
image = Image.open("scene.jpg").convert("RGB")
prompt = "<image>\nDescribe what you see."

response = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
print(response)
```

---

## Intended Uses

- **Edge & embedded deployment**: ternary weights run efficiently on CPUs and NPUs
- **Domain-aware visual reasoning**: dedicated experts for OCR, charts, math, spatial, and agentic tasks
- **Robotic / agentic pipelines**: the `agentic_knowledge` and `agentic_reasoning` experts support multi-step planning
- **Fine-tuning base**: swap in domain datasets to specialise any of the 8 experts independently

## Limitations

- Optimised for efficiency; peak single-task accuracy is lower than full-precision models of similar size
- Image resolution is fixed at 224 × 224, so very fine-grained OCR may degrade
- Expert routing is learned; novel domains may activate sub-optimal experts until fine-tuned
- The tokeniser vocabulary (32,002 tokens) is Phi-2 derived, so non-English performance is limited

---

## Citation

```bibtex
@software{embernet_vlm,
  title  = {EmberNet: Tiny BitNet b1.58 MoE Vision-Language Model},
  author = {Aman Euh},
  year   = {2026},
  url    = {https://huggingface.co/euhidaman/EmberNet}
}
```