Update EmberNet Stage 2 Epoch 1/5 | loss 4.9617 | step 625
- README.md +173 -0
- config.json +51 -0
- pytorch_model.bin +3 -0
- tokenizer.json +0 -0
- tokenizer_config.json +11 -0
README.md
ADDED
---
language: en
license: mit
tags:
- vision-language-model
- bitnet
- mixture-of-experts
- vlm
- multimodal
- edge-ai
pipeline_tag: image-text-to-text
---

# EmberNet — BitNet b1.58 MoE VLM

> **Status:** Stage 2/2, Epoch 1/5, Loss 4.9617

EmberNet is a tiny but capable Vision-Language Model built for edge deployment
and domain-expert reasoning. It combines a frozen **SigLIP** vision backbone
with a **BitNet b1.58 ternary-quantised Mixture-of-Experts** language decoder,
achieving ~3× memory reduction over a full-precision equivalent while
preserving strong visual understanding across 8 specialised domains.

---

## Model Details

| Property | Value |
|---|---|
| **Model type** | Vision-Language Model (VLM) |
| **Quantisation** | BitNet b1.58 (ternary weights: −1, 0, +1) |
| **Total parameters** | 840.8 M |
| **Trainable parameters** | 723.3 M |
| **Active parameters / forward** | ~235.4 M (top-2 routing) |
| **Carbon footprint** | 0.2091 kg CO₂eq |
| **Training stage** | Stage 2/2 — Expert SFT |
| **Epoch** | 1/5 |
| **Best loss** | 4.9617 |
| **Last updated** | 2026-03-07 22:40 UTC |

---
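
The ternary weight set in the table above can be illustrated with a minimal quantisation sketch. This is not EmberNet's actual code; it assumes the absmean scheme commonly used with BitNet b1.58 (scale by the mean absolute weight, round, clip to {−1, 0, +1}):

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-5):
    """Absmean ternary quantisation (assumed BitNet b1.58-style):
    scale by the mean |w|, round, then clip to {-1, 0, +1}."""
    scale = np.abs(w).mean() + eps
    q = np.clip(np.rint(w / scale), -1.0, 1.0)
    return q, scale  # approximate dequantisation is q * scale

w = np.array([0.9, -0.9, 0.01, 2.7])
q, scale = ternary_quantize(w)
# q holds only values from {-1, 0, +1}
```

Each weight thus needs only ~1.58 bits of information plus one shared scale per tensor, which is where the memory reduction comes from.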

## Architecture

```
EmberNet VLM
├── Vision Encoder (frozen)
│   ├── SigLIP-base-patch16-224   92.9 M params
│   ├── Token Compressor           2.4 M params
│   ├── Spatial Pooler             2.4 M params
│   └── BitLinear Projector       10.1 M params
│
└── BitNet b1.58 MoE Decoder     733.1 M params total
    ├── Layers: 16   Hidden: 768   Heads: 12 (GQA kv=6)
    ├── Experts: 8 domain + 1 shared (always active)
    ├── Routing: Top-2 per token
    └── Quantisation: ternary weights, 4-bit activations
```

| Decoder Component | Parameters |
|---|---|
| Embeddings | 24.6 M |
| Attention (all layers) | 0 |
| Router (all layers) | 98.4 K |
| Shared Expert | 75.6 M |
| Domain Experts (8×) | 604.4 M (75.6 M/expert) |

### Expert Domains

| ID | Expert | Trained on |
|----|--------|-----------|
| 0 | `vision_ocr` | TextVQA, DocVQA, OCR-VQA, InfoVQA |
| 1 | `vision_diagram` | AI2D, InfoVQA diagrams |
| 2 | `code_math_chart` | ChartQA, PlotQA, FigureQA, DVQA |
| 3 | `code_math_formula` | MathVista, math formula datasets |
| 4 | `spatial_scene` | VQAv2, GQA, Visual Genome |
| 5 | `spatial_reasoning` | RefCOCO, GQA spatial splits |
| 6 | `agentic_knowledge` | OK-VQA, A-OKVQA |
| 7 | `agentic_reasoning` | ScienceQA, CLEVR |
| — | `shared` | All domains (always active) |

---
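
The routing described above (top-2 of 8 domain experts per token, plus an always-on shared expert) can be sketched as follows. This is an illustrative NumPy sketch, not EmberNet's implementation; the function names and the renormalisation-over-selected-experts choice are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(h, router_w, experts, shared_expert, top_k=2):
    """Top-k MoE forward: each token goes to its top-k domain experts,
    weighted by renormalised router probabilities; the shared expert
    is applied to every token unconditionally."""
    probs = softmax(h @ router_w)                  # [tokens, num_experts]
    top = np.argsort(-probs, axis=-1)[:, :top_k]   # top-k expert ids per token
    out = shared_expert(h)                         # shared expert: always active
    for t in range(h.shape[0]):
        w = probs[t, top[t]]
        w = w / w.sum()                            # renormalise over chosen experts
        for weight, idx in zip(w, top[t]):
            out[t] += weight * experts[idx](h[t])
    return out
```

Because only 2 of the 8 domain experts fire per token, the active parameter count per forward pass stays far below the 840.8 M total.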

## Training

### Configuration

```yaml
stage_1_projector_alignment:
  epochs: 3
  batch_size: 8          # effective 32 with grad-accum 4
  learning_rate: 1e-4
  trainable: vision projector + compressor + pooler only

stage_2_expert_sft:
  epochs: 10
  batch_size: 4          # effective 16 with grad-accum 4
  learning_rate: 3e-4
  trainable: router + all 8 expert FFNs + shared expert
  expert_supervision_weight: 0.1
```

### Optimiser

- **BitNetStableOptimizer** — custom Adam with FP32 master weights
- Two-phase LR: full LR for the first 60 % of training, then 0.1 × LR
- Warmup: 100 steps
- Weight clamp: [−3, 3] (maps cleanly to −1 / 0 / +1 at inference)

---
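
The schedule in the optimiser bullets can be sketched as a plain function. This is a reconstruction from those bullets, not BitNetStableOptimizer's code; the linear-warmup shape and the step-based phase boundary are assumptions:

```python
def two_phase_lr(step: int, total_steps: int, base_lr: float = 3e-4,
                 warmup_steps: int = 100, full_frac: float = 0.6,
                 decay: float = 0.1) -> float:
    """Linear warmup, then full LR for the first 60% of training,
    then a flat 0.1x LR for the remainder."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps   # linear warmup
    if step < full_frac * total_steps:
        return base_lr                         # phase 1: full LR
    return base_lr * decay                     # phase 2: 0.1x LR
```

The late drop to 0.1 × LR helps the clamped latent weights settle near the ternary levels before inference-time quantisation.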

## Usage

```python
import torch
from PIL import Image
from transformers import AutoTokenizer

# Clone the repo and add it to your Python path, then:
from models import EmberNetVLM
from models.vlm import EmberNetConfig

# Load the checkpoint
config = EmberNetConfig()
model = EmberNetVLM(config)
ckpt = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Inference
image = Image.open("scene.jpg").convert("RGB")
prompt = "<image>\nDescribe what you see."

response = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
print(response)
```

---

## Intended Uses

- **Edge & embedded deployment** — ternary weights run efficiently on CPUs and NPUs
- **Domain-aware visual reasoning** — dedicated experts for OCR, charts, math, spatial, and agentic tasks
- **Robotic / agentic pipelines** — `agentic_knowledge` + `agentic_reasoning` experts support multi-step planning
- **Fine-tuning base** — swap in domain datasets to specialise any of the 8 experts independently

## Limitations

- Optimised for efficiency; maximum single-task accuracy is lower than full-precision models of similar size
- Image resolution is fixed at 224 × 224; very fine-grained OCR may degrade
- Expert routing is learned; novel domains may activate sub-optimal experts until fine-tuned
- Tokeniser vocabulary (32 002) is Phi-2 derived; non-English performance is limited

---

## Citation

```bibtex
@software{embernet_vlm,
  title  = {EmberNet: Tiny BitNet b1.58 MoE Vision-Language Model},
  author = {Aman Euh},
  year   = {2026},
  url    = {https://huggingface.co/euhidaman/EmberNet}
}
```
config.json
ADDED
{
  "model_type": "embernet_vlm",
  "architecture": "BitNet b1.58 MoE VLM",
  "vision_encoder": {
    "model_name": "google/siglip-base-patch16-224",
    "num_image_tokens": 64,
    "freeze_vision": true
  },
  "language_decoder": {
    "vocab_size": 32002,
    "hidden_size": 768,
    "intermediate_size": 2048,
    "num_layers": 16,
    "num_attention_heads": 12,
    "num_kv_heads": 6,
    "max_position_embeddings": 4096,
    "num_experts": 8,
    "num_experts_per_tok": 2,
    "use_shared_expert": true,
    "expert_domains": [
      "vision_ocr",
      "vision_diagram",
      "code_math_chart",
      "code_math_formula",
      "spatial_scene",
      "spatial_reasoning",
      "agentic_knowledge",
      "agentic_reasoning"
    ],
    "quantisation": "BitNet b1.58 (ternary)",
    "activation_bits": 4
  },
  "torch_dtype": "bfloat16",
  "transformers_version": ">=4.36.0",
  "parameter_counts": {
    "vision_encoder": 107748864,
    "vision_encoder_breakdown": {
      "encoder": 92884224,
      "compressor": 2363904,
      "pooler": 2412288,
      "projector": 10088448
    },
    "decoder_total": 733055360,
    "decoder_embeddings": 24577536,
    "decoder_attention": 0,
    "decoder_router": 98432,
    "decoder_shared_expert": 75554816,
    "decoder_domain_experts": 604438528,
    "num_domain_experts": 8
  }
}
pytorch_model.bin
ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:aeb837296425d584e4ee59d0a9cbfc46d3ad4fd0d2a51c4e232b1ee53ba6377c
size 3397346561
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
{
  "add_prefix_space": false,
  "backend": "tokenizers",
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|endoftext|>",
  "is_local": false,
  "model_max_length": 2048,
  "tokenizer_class": "TokenizersBackend",
  "unk_token": "<|endoftext|>"
}