---
language: en
license: mit
tags:
- vision-language-model
- bitnet
- mixture-of-experts
- vlm
- multimodal
- edge-ai
pipeline_tag: image-text-to-text
---

# EmberNet: BitNet b1.58 MoE VLM

> **Status:** Stage 2/2, Epoch 3/5, Loss 4.1517

EmberNet is a tiny but capable Vision-Language Model built for edge deployment
and domain-expert reasoning. It combines a frozen **SigLIP** vision backbone
with a **BitNet b1.58 ternary-quantised Mixture-of-Experts** language decoder,
achieving roughly 3× lower memory use than a full-precision equivalent while
preserving strong visual understanding across 8 specialised domains.

---

## Model Details

| Property | Value |
|---|---|
| **Model type** | Vision-Language Model (VLM) |
| **Quantisation** | BitNet b1.58 (ternary weights: −1, 0, +1) |
| **Total parameters** | 840.8 M |
| **Trainable parameters** | 723.3 M |
| **Active parameters / forward** | ~235.4 M (top-2 routing) |
| **Carbon footprint** | 0.6390 kg CO₂eq |
| **Training stage** | Stage 2/2 (Expert SFT) |
| **Epoch** | 3/5 |
| **Best loss** | 4.1517 |
| **Last updated** | 2026-03-08 06:05 UTC |

---

## Architecture

```
EmberNet VLM
├── Vision Encoder (frozen)
│   ├── SigLIP-base-patch16-224    92.9 M params
│   ├── Token Compressor            2.4 M params
│   ├── Spatial Pooler              2.4 M params
│   └── BitLinear Projector        10.1 M params
│
└── BitNet b1.58 MoE Decoder      733.1 M params total
    ├── Layers: 16   Hidden: 768   Heads: 12 (GQA kv=6)
    ├── Experts: 8 domain + 1 shared (always active)
    ├── Routing: Top-2 per token
    └── Quantisation: ternary weights, 4-bit activations
```
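The BitLinear layers replace full-precision matrix multiplies with ternary weights. As a rough illustration of the standard b1.58 absmean scheme (a sketch only, not EmberNet's actual kernels; `ternary_quantize` is a hypothetical helper):

```python
import torch

def ternary_quantize(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Absmean ternary quantisation: scale by mean |w|, round to {-1, 0, +1}."""
    gamma = w.abs().mean().clamp(min=1e-5)     # per-tensor scale
    w_q = (w / gamma).round().clamp_(-1, 1)    # ternary weights
    return w_q, gamma                          # keep gamma to rescale outputs

w = torch.tensor([[0.9, -0.05, -1.2],
                  [0.4,  0.00,  2.0]])
w_q, gamma = ternary_quantize(w)
print(w_q)   # tensor([[ 1.,  0., -1.],
             #         [ 1.,  0.,  1.]])
```

Because each weight needs only ~1.58 bits, the decoder's memory footprint shrinks accordingly, which is the source of the ~3× reduction claimed above.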

| Decoder Component | Parameters |
|---|---|
| Embeddings | 24.6 M |
| Attention (all layers) | 0 |
| Router (all layers) | 98.4 K |
| Shared Expert | 75.6 M |
| Domain Experts (8×) | 604.4 M (75.6 M/expert) |

### Expert Domains

| ID | Expert | Trained on |
|----|--------|------------|
| 0 | `vision_ocr` | TextVQA, DocVQA, OCR-VQA, InfoVQA |
| 1 | `vision_diagram` | AI2D, InfoVQA diagrams |
| 2 | `code_math_chart` | ChartQA, PlotQA, FigureQA, DVQA |
| 3 | `code_math_formula` | MathVista, math formula datasets |
| 4 | `spatial_scene` | VQAv2, GQA, Visual Genome |
| 5 | `spatial_reasoning` | RefCOCO, GQA spatial splits |
| 6 | `agentic_knowledge` | OK-VQA, A-OKVQA |
| 7 | `agentic_reasoning` | ScienceQA, CLEVR |
| – | `shared` | All domains (always active) |
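A minimal sketch of how top-2 routing with an always-active shared expert can combine expert outputs (illustrative only; `moe_layer` and the renormalisation of the top-2 probabilities are assumptions, not EmberNet's actual code):

```python
import torch
import torch.nn.functional as F

def moe_layer(x, router_logits, experts, shared_expert):
    """Each token is processed by its top-2 domain experts (weighted by the
    renormalised router probabilities) plus the always-active shared expert."""
    probs = F.softmax(router_logits, dim=-1)          # (n_tokens, n_experts)
    top_p, top_i = probs.topk(2, dim=-1)              # top-2 per token
    top_p = top_p / top_p.sum(dim=-1, keepdim=True)   # renormalise the pair
    out = shared_expert(x)                            # shared expert sees every token
    for slot in range(top_i.shape[-1]):
        for e, expert in enumerate(experts):
            mask = top_i[:, slot] == e
            if mask.any():                            # route only matching tokens
                out[mask] += top_p[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out
```

With 2 of 8 domain experts plus the shared expert active per token, only a fraction of the 733.1 M decoder parameters (~235.4 M model-wide) participate in each forward pass.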

---

## Training

### Configuration

```yaml
stage_1_projector_alignment:
  epochs: 3
  batch_size: 8          # effective 32 with grad-accum 4
  learning_rate: 1e-4
  trainable: vision projector + compressor + pooler only

stage_2_expert_sft:
  epochs: 10
  batch_size: 4          # effective 16 with grad-accum 4
  learning_rate: 3e-4
  trainable: router + all 8 expert FFNs + shared expert
  expert_supervision_weight: 0.1
```

### Optimiser

- **BitNetStableOptimizer**: a custom Adam variant with FP32 master weights
- Two-phase LR: full LR for the first 60% of training, then 0.1× LR
- Warmup: 100 steps
- Weight clamp: [−3, 3] (maps cleanly to −1 / 0 / +1 at inference)
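The two-phase schedule above can be sketched as a plain function of the step index (a minimal illustration; `lr_at` is a hypothetical helper and the real trainer's bookkeeping may differ):

```python
def lr_at(step: int, total_steps: int, base_lr: float = 3e-4,
          warmup_steps: int = 100, split: float = 0.6, decay: float = 0.1) -> float:
    """Linear warmup, full LR until `split` of training, then a single drop."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps   # linear warmup
    if step < split * total_steps:
        return base_lr                               # phase 1: full LR
    return base_lr * decay                           # phase 2: 0.1x LR
```

A single hard LR drop (rather than a continuous decay) keeps the quantisation-aware updates large for most of training, then lets the ternary weight assignments settle in the final 40%.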

---

## Usage

```python
import torch
from PIL import Image
from transformers import AutoTokenizer

# Clone the repo and add it to your Python path, then:
from models import EmberNetVLM
from models.vlm import EmberNetConfig

# Load the model and checkpoint
config = EmberNetConfig()
model = EmberNetVLM(config)
# weights_only=False executes pickled code; only load checkpoints you trust
ckpt = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Inference
image = Image.open("scene.jpg").convert("RGB")
prompt = "<image>\nDescribe what you see."

response = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
print(response)
```

---

## Intended Uses

- **Edge & embedded deployment**: ternary weights run efficiently on CPUs and NPUs
- **Domain-aware visual reasoning**: dedicated experts for OCR, charts, math, spatial, and agentic tasks
- **Robotic / agentic pipelines**: the `agentic_knowledge` and `agentic_reasoning` experts support multi-step planning
- **Fine-tuning base**: swap in domain datasets to specialise any of the 8 experts independently

## Limitations

- Optimised for efficiency; peak single-task accuracy is lower than full-precision models of similar size
- Image resolution is fixed at 224 × 224, so very fine-grained OCR may degrade
- Expert routing is learned; novel domains may activate sub-optimal experts until fine-tuned
- The tokeniser vocabulary (32,002 tokens) is Phi-2 derived, so non-English performance is limited

---

## Citation

```bibtex
@software{embernet_vlm,
  title  = {EmberNet: Tiny BitNet b1.58 MoE Vision-Language Model},
  author = {Aman Euh},
  year   = {2026},
  url    = {https://huggingface.co/euhidaman/EmberNet}
}
```