euhidaman commited on
Commit ca284fc (verified) · Parent(s): de24599

Update EmberNet Stage 2 Epoch 1/5 | loss 4.9617 | step 625

Files changed (5)
  1. README.md +173 -0
  2. config.json +51 -0
  3. pytorch_model.bin +3 -0
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +11 -0
README.md ADDED
@@ -0,0 +1,173 @@
---
language: en
license: mit
tags:
- vision-language-model
- bitnet
- mixture-of-experts
- vlm
- multimodal
- edge-ai
pipeline_tag: image-text-to-text
---

# EmberNet — BitNet b1.58 MoE VLM

> **Status:** Stage 2/2, Epoch 1/5, Loss 4.9617

EmberNet is a tiny but capable Vision-Language Model built for edge deployment
and domain-expert reasoning. It combines a frozen **SigLIP** vision backbone
with a **BitNet b1.58 ternary-quantized Mixture-of-Experts** language decoder,
achieving ~3× memory reduction over a full-precision equivalent while
preserving strong visual understanding across 8 specialised domains.

---

## Model Details

| Property | Value |
|---|---|
| **Model type** | Vision-Language Model (VLM) |
| **Quantisation** | BitNet b1.58 (ternary weights: −1, 0, +1) |
| **Total parameters** | 840.8 M |
| **Trainable parameters** | 723.3 M |
| **Active parameters / forward** | ~235.4 M (top-2 routing) |
| **Carbon footprint** | 0.2091 kg CO₂eq |
| **Training stage** | Stage 2/2 — Expert SFT |
| **Epoch** | 1/5 |
| **Best loss** | 4.9617 |
| **Last updated** | 2026-03-07 22:40 UTC |

---

## Architecture

```
EmberNet VLM
├── Vision Encoder (frozen)
│   ├── SigLIP-base-patch16-224    92.9 M params
│   ├── Token Compressor            2.4 M params
│   ├── Spatial Pooler              2.4 M params
│   └── BitLinear Projector        10.1 M params
│
└── BitNet b1.58 MoE Decoder      733.1 M params total
    ├── Layers: 16   Hidden: 768   Heads: 12 (GQA, kv=6)
    ├── Experts: 8 domain + 1 shared (always active)
    ├── Routing: top-2 per token
    └── Quantisation: ternary weights, 4-bit activations
```
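
The decoder's grouped-query attention (12 query heads sharing 6 KV heads) can be sketched as follows. This is a minimal NumPy illustration of the GQA pattern, not EmberNet's actual implementation; the head counts come from the tree above, while tensor shapes and the head-dim are assumed for the example.

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: q has 12 heads, k/v have 6 shared KV heads."""
    group = q.shape[0] // k.shape[0]          # 12 // 6 = 2 query heads per KV head
    k = np.repeat(k, group, axis=0)           # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)     # softmax over key positions
    return w @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(12, 4, 64))              # (query heads, seq len, head dim)
k = rng.normal(size=(6, 4, 64))               # half as many KV heads
v = rng.normal(size=(6, 4, 64))
out = gqa_attention(q, k, v)                  # shape (12, 4, 64)
```

Sharing KV heads halves the KV-cache size relative to full multi-head attention, which matters for the edge-deployment target.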

| Decoder Component | Parameters |
|---|---|
| Embeddings | 24.6 M |
| Attention (all layers) | 0 |
| Router (all layers) | 98.4 K |
| Shared Expert | 75.6 M |
| Domain Experts (8×) | 604.4 M (75.6 M/expert) |

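The ternary quantisation can be illustrated with the absmean scheme popularised by BitNet b1.58: scale by the mean absolute weight, then round into {−1, 0, +1}. This is a sketch only; EmberNet's BitLinear layers may use a different variant.

```python
import numpy as np

def ternarize(W, eps=1e-6):
    """Round a weight matrix to {-1, 0, +1} with a single FP scale (absmean)."""
    scale = np.abs(W).mean()                  # per-tensor absmean scale
    Wq = np.clip(np.round(W / (scale + eps)), -1, 1)
    return Wq, scale                          # W is approximated by scale * Wq

rng = np.random.default_rng(0)
W = rng.normal(size=(768, 2048)).astype(np.float32)
Wq, scale = ternarize(W)
# every entry of Wq is -1, 0, or +1 -> ~1.58 bits of information per weight
```

Because each weight carries log2(3) ≈ 1.58 bits instead of 16, a ternary expert packs into a fraction of the memory of its bf16 counterpart, which is where the ~3× memory reduction claimed above comes from.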
### Expert Domains

| ID | Expert | Trained on |
|----|--------|-----------|
| 0 | `vision_ocr` | TextVQA, DocVQA, OCR-VQA, InfoVQA |
| 1 | `vision_diagram` | AI2D, InfoVQA diagrams |
| 2 | `code_math_chart` | ChartQA, PlotQA, FigureQA, DVQA |
| 3 | `code_math_formula` | MathVista, math formula datasets |
| 4 | `spatial_scene` | VQAv2, GQA, Visual Genome |
| 5 | `spatial_reasoning` | RefCOCO, GQA spatial splits |
| 6 | `agentic_knowledge` | OK-VQA, A-OKVQA |
| 7 | `agentic_reasoning` | ScienceQA, CLEVR |
| — | `shared` | All domains (always active) |

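Per-token dispatch works as described above: the router scores all 8 domain experts, the top-2 are combined with renormalised gate weights, and the shared expert always contributes. A minimal sketch assuming a standard softmax top-k router, with stand-in lambdas where the real expert FFNs would be:

```python
import numpy as np

def moe_layer(x, router_w, experts, shared_expert, k=2):
    """Route one token through top-k of the domain experts plus the shared expert."""
    logits = x @ router_w                          # one score per domain expert
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()                    # softmax over the 8 experts
    top = np.argsort(probs)[-k:]                   # indices of the top-2 experts
    gates = probs[top] / probs[top].sum()          # renormalise the top-2 weights
    out = shared_expert(x)                         # shared expert is always active
    for g, i in zip(gates, top):
        out = out + g * experts[i](x)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=768)                           # one token's hidden state
router_w = rng.normal(size=(768, 8))
experts = [lambda h, s=s: 0.1 * s * h for s in range(8)]   # stand-in expert FFNs
out = moe_layer(x, router_w, experts, shared_expert=lambda h: h)
```

Only the two selected experts (plus the shared one) run per token, which is why the active-parameter count is far below the 840.8 M total.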
---

## Training

### Configuration

```yaml
stage_1_projector_alignment:
  epochs: 3
  batch_size: 8            # effective 32 with grad-accum 4
  learning_rate: 1e-4
  trainable: vision projector + compressor + pooler only

stage_2_expert_sft:
  epochs: 10
  batch_size: 4            # effective 16 with grad-accum 4
  learning_rate: 3e-4
  trainable: router + all 8 expert FFNs + shared expert
  expert_supervision_weight: 0.1
```

### Optimiser

- **BitNetStableOptimizer** — custom Adam with FP32 master weights
- Two-phase LR: full LR for 60 % of training, then 0.1 × LR
- Warmup: 100 steps
- Weight clamp: [−3, 3] (maps cleanly to −1 / 0 / +1 at inference)

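The schedule above amounts to a simple step function. A sketch under the stated hyperparameters (100 warmup steps, full LR for the first 60 % of training, then 0.1×):

```python
def lr_at(step, total_steps, base_lr=3e-4, warmup=100, full_frac=0.6, decay=0.1):
    """Two-phase LR: linear warmup, full LR for 60% of training, then 0.1x."""
    if step < warmup:
        return base_lr * (step + 1) / warmup   # linear warmup to base_lr
    if step < full_frac * total_steps:
        return base_lr                         # phase 1: full learning rate
    return base_lr * decay                     # phase 2: decayed learning rate

# e.g. with 1000 total steps: ramping up at step 50, full 3e-4 at step 300,
# and 3e-5 for the final 40% of training
```

A hard LR drop rather than a smooth cosine decay is a reasonable fit for ternary training, where late-stage weight flips between −1/0/+1 are what the lower LR is meant to suppress.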
---

## Usage

```python
import torch
from PIL import Image
from transformers import AutoTokenizer

# Clone the repo and add it to your Python path, then:
from models import EmberNetVLM
from models.vlm import EmberNetConfig

# Load the checkpoint (weights_only=False: only load checkpoints you trust)
config = EmberNetConfig()
model = EmberNetVLM(config)
ckpt = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Inference
image = Image.open("scene.jpg").convert("RGB")
prompt = "<image>\nDescribe what you see."

response = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
print(response)
```
145
+
146
+ ---
147
+
148
+ ## Intended Uses
149
+
150
+ - **Edge & embedded deployment** — ternary weights run efficiently on CPUs and NPUs
151
+ - **Domain-aware visual reasoning** — dedicated experts for OCR, charts, math, spatial, and agentic tasks
152
+ - **Robotic / agentic pipelines** — `agentic_knowledge` + `agentic_reasoning` experts support multi-step planning
153
+ - **Fine-tuning base** — swap in domain datasets to specialise any of the 8 experts independently
154
+
155
+ ## Limitations
156
+
157
+ - Optimised for efficiency; maximum single-task accuracy is lower than full-precision models of similar size
158
+ - Image resolution fixed at 224 × 224; very fine-grained OCR may degrade
159
+ - Expert routing is learned; novel domains may activate sub-optimal experts until fine-tuned
160
+ - Tokeniser vocabulary (32 002) is Phi-2 derived; non-English performance is limited
161
+
162
+ ---
163
+
164
+ ## Citation
165
+
166
+ ```bibtex
167
+ @software{embernet_vlm,
168
+ title = {EmberNet: Tiny BitNet b1.58 MoE Vision-Language Model},
169
+ author = {Aman Euh},
170
+ year = {2026},
171
+ url = {https://huggingface.co/euhidaman/EmberNet}
172
+ }
173
+ ```
config.json ADDED
@@ -0,0 +1,51 @@
{
  "model_type": "embernet_vlm",
  "architecture": "BitNet b1.58 MoE VLM",
  "vision_encoder": {
    "model_name": "google/siglip-base-patch16-224",
    "num_image_tokens": 64,
    "freeze_vision": true
  },
  "language_decoder": {
    "vocab_size": 32002,
    "hidden_size": 768,
    "intermediate_size": 2048,
    "num_layers": 16,
    "num_attention_heads": 12,
    "num_kv_heads": 6,
    "max_position_embeddings": 4096,
    "num_experts": 8,
    "num_experts_per_tok": 2,
    "use_shared_expert": true,
    "expert_domains": [
      "vision_ocr",
      "vision_diagram",
      "code_math_chart",
      "code_math_formula",
      "spatial_scene",
      "spatial_reasoning",
      "agentic_knowledge",
      "agentic_reasoning"
    ],
    "quantisation": "BitNet b1.58 (ternary)",
    "activation_bits": 4
  },
  "torch_dtype": "bfloat16",
  "transformers_version": ">=4.36.0",
  "parameter_counts": {
    "vision_encoder": 107748864,
    "vision_encoder_breakdown": {
      "encoder": 92884224,
      "compressor": 2363904,
      "pooler": 2412288,
      "projector": 10088448
    },
    "decoder_total": 733055360,
    "decoder_embeddings": 24577536,
    "decoder_attention": 0,
    "decoder_router": 98432,
    "decoder_shared_expert": 75554816,
    "decoder_domain_experts": 604438528,
    "num_domain_experts": 8
  }
}
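The `parameter_counts` block is internally consistent, which a few lines of arithmetic over the values in this file confirm:

```python
# Values copied from parameter_counts in config.json above
counts = {
    "vision_encoder": 107_748_864,
    "encoder": 92_884_224,            # SigLIP backbone
    "compressor": 2_363_904,
    "pooler": 2_412_288,
    "projector": 10_088_448,
    "decoder_total": 733_055_360,
    "decoder_shared_expert": 75_554_816,
    "decoder_domain_experts": 604_438_528,
}

# The vision breakdown sums exactly to the vision-encoder total
vision_sum = sum(counts[k] for k in ("encoder", "compressor", "pooler", "projector"))
assert vision_sum == counts["vision_encoder"]

# Each of the 8 domain experts is the same size as the shared expert
assert counts["decoder_domain_experts"] // 8 == counts["decoder_shared_expert"]

# Vision + decoder gives the 840.8 M total reported in the model card
total = counts["vision_encoder"] + counts["decoder_total"]  # 840_804_224
```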
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:aeb837296425d584e4ee59d0a9cbfc46d3ad4fd0d2a51c4e232b1ee53ba6377c
size 3397346561
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,11 @@
{
  "add_prefix_space": false,
  "backend": "tokenizers",
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|endoftext|>",
  "is_local": false,
  "model_max_length": 2048,
  "tokenizer_class": "TokenizersBackend",
  "unk_token": "<|endoftext|>"
}