---
language: en
license: mit
tags:
- vision-language-model
- bitnet
- mixture-of-experts
- vlm
- multimodal
- edge-ai
pipeline_tag: image-text-to-text
---

# EmberNet — BitNet b1.58 MoE VLM

> **Status:** Stage 2/2, Epoch 3/5, Loss 4.1517

EmberNet is a tiny but capable Vision-Language Model built for edge deployment and domain-expert reasoning. It combines a frozen **SigLIP** vision backbone with a **BitNet b1.58 ternary-quantised Mixture-of-Experts** language decoder, achieving ~3× memory reduction over a full-precision equivalent while preserving strong visual understanding across 8 specialised domains.

---

## Model Details

| Property | Value |
|---|---|
| **Model type** | Vision-Language Model (VLM) |
| **Quantisation** | BitNet b1.58 (ternary weights: −1, 0, +1) |
| **Total parameters** | 840.8 M |
| **Trainable parameters** | 723.3 M |
| **Active parameters / forward** | ~235.4 M (top-2 routing) |
| **Carbon footprint** | 0.6390 kg CO₂eq |
| **Training stage** | Stage 2/2 — Expert SFT |
| **Epoch** | 3/5 |
| **Best loss** | 4.1517 |
| **Last updated** | 2026-03-08 06:05 UTC |

---

## Architecture

```
EmberNet VLM
├── Vision Encoder (frozen)
│   ├── SigLIP-base-patch16-224   92.9 M params
│   ├── Token Compressor           2.4 M params
│   ├── Spatial Pooler             2.4 M params
│   └── BitLinear Projector       10.1 M params
│
└── BitNet b1.58 MoE Decoder     733.1 M params total
    ├── Layers: 16   Hidden: 768   Heads: 12 (GQA kv=6)
    ├── Experts: 8 domain + 1 shared (always active)
    ├── Routing: Top-2 per token
    └── Quantisation: ternary weights, 4-bit activations
```

| Decoder Component | Parameters |
|---|---|
| Embeddings | 24.6 M |
| Attention (all layers) | 0 |
| Router (all layers) | 98.4 K |
| Shared Expert | 75.6 M |
| Domain Experts (8×) | 604.4 M (75.6 M/expert) |

### Expert Domains

| ID | Expert | Trained on |
|----|--------|-----------|
| 0 | `vision_ocr` | TextVQA, DocVQA, OCR-VQA, InfoVQA |
| 1 | `vision_diagram` | AI2D, InfoVQA diagrams |
| 2 | `code_math_chart` | ChartQA, PlotQA, FigureQA, DVQA |
| 3 | `code_math_formula` | MathVista, math formula datasets |
| 4 | `spatial_scene` | VQAv2, GQA, Visual Genome |
| 5 | `spatial_reasoning` | RefCOCO, GQA spatial splits |
| 6 | `agentic_knowledge` | OK-VQA, A-OKVQA |
| 7 | `agentic_reasoning` | ScienceQA, CLEVR |
| — | `shared` | All domains (always active) |

---

## Training

### Configuration

```yaml
stage_1_projector_alignment:
  epochs: 3
  batch_size: 8        # effective: 32 with grad-accum 4
  learning_rate: 1e-4
  trainable: vision projector + compressor + pooler only

stage_2_expert_sft:
  epochs: 10
  batch_size: 4        # effective: 16 with grad-accum 4
  learning_rate: 3e-4
  trainable: router + all 8 expert FFNs + shared expert
  expert_supervision_weight: 0.1
```

### Optimiser

- **BitNetStableOptimizer** — custom Adam with FP32 master weights
- Two-phase LR: full LR for 60 % of training, then 0.1 × LR
- Warmup: 100 steps
- Weight clamp: [−3, 3] (maps cleanly to −1 / 0 / +1 at inference)

---

## Usage

```python
import torch
from PIL import Image
from transformers import AutoTokenizer

# Clone the repo and add it to your Python path, then:
from models import EmberNetVLM
from models.vlm import EmberNetConfig

# Load
config = EmberNetConfig()
model = EmberNetVLM(config)
ckpt = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Inference
image = Image.open("scene.jpg").convert("RGB")
prompt = "\nDescribe what you see."
response = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
print(response)
```

---

## Intended Uses

- **Edge & embedded deployment** — ternary weights run efficiently on CPUs and NPUs
- **Domain-aware visual reasoning** — dedicated experts for OCR, charts, math, spatial, and agentic tasks
- **Robotic / agentic pipelines** — `agentic_knowledge` + `agentic_reasoning` experts support multi-step planning
- **Fine-tuning base** — swap in domain datasets to specialise any of the 8 experts independently

## Limitations

- Optimised for efficiency; maximum single-task accuracy is lower than full-precision models of similar size
- Image resolution fixed at 224 × 224; very fine-grained OCR may degrade
- Expert routing is learned; novel domains may activate sub-optimal experts until fine-tuned
- Tokeniser vocabulary (32 002 tokens) is Phi-2 derived; non-English performance is limited

---

## Citation

```bibtex
@software{embernet_vlm,
  title  = {EmberNet: Tiny BitNet b1.58 MoE Vision-Language Model},
  author = {Aman Euh},
  year   = {2026},
  url    = {https://huggingface.co/euhidaman/EmberNet}
}
```
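## Appendix: Illustrative Sketches

For readers curious how BitNet b1.58 ternary quantisation behaves, here is a minimal sketch of absmean weight quantisation in the style described in the BitNet b1.58 literature. The function name and code are illustrative only and are not part of this repository:

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Quantise a weight tensor to {-1, 0, +1} with the absmean scheme.

    Each element is divided by the mean absolute value of the tensor,
    rounded, and clamped, so small weights snap to 0 and the rest to ±1.
    """
    scale = w.abs().mean().clamp(min=eps)
    return (w / scale).round().clamp(-1, 1)
```

At inference only the ternary tensor (and its scale) needs to be stored, which is where the ~3× memory reduction over full precision comes from; during training a higher-precision master copy of the weights is kept, as noted in the optimiser section.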
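The top-2 routing used by the decoder can be sketched as follows. This is an assumption-laden outline of standard top-k MoE gating, not the repository's actual router code:

```python
import torch
import torch.nn.functional as F

def top2_gates(router_logits: torch.Tensor):
    """Pick the two highest-scoring experts per token and renormalise.

    router_logits: (num_tokens, num_experts) raw router scores.
    Returns per-token gate weights (summing to 1) and expert indices.
    """
    probs = F.softmax(router_logits, dim=-1)
    gates, experts = probs.topk(2, dim=-1)
    gates = gates / gates.sum(dim=-1, keepdim=True)
    return gates, experts
```

Note that in EmberNet the shared expert runs unconditionally in addition to the two routed domain experts, so each token effectively passes through three expert FFNs.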
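The optimiser's warmup plus two-phase schedule can be expressed as a simple step-to-LR function. This is a sketch under the hyperparameters stated in the Training section (3e-4 base LR for stage 2, 100 warmup steps, drop at 60 % of training), not code from this repository:

```python
def two_phase_lr(step: int, total_steps: int,
                 base_lr: float = 3e-4, warmup: int = 100) -> float:
    """Linear warmup for `warmup` steps, full LR for the first 60% of
    training, then a single drop to 0.1x the base LR."""
    if step < warmup:
        return base_lr * step / warmup
    if step < 0.6 * total_steps:
        return base_lr
    return 0.1 * base_lr
```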