rcgalbo committed (verified) · Commit 051c2da · Parent(s): 25ac0e1

Update model card with full architecture and training details

Files changed (1): README.md (+278, −58)
---
language:
- en
- fr
- es
- pt
- it
- ro
- de
- nl
- da
- sv
- "no"
- ru
- uk
- pl
- cs
- sk
- hr
- sr
- sl
- bg
- lv
- lt
- el
- et
- fi
- hu
- eu
- cy
- ga
- ar
- fa
- he
- tr
- hi
- ur
- bn
- mr
- gu
- pa
- ne
- ta
- te
- zh
- ja
- ko
- id
- ms
- tl
- jv
- vi
- km
- th
- lo
- my
- am
- ha
- ig
- sw
- yo
- so
- zu
- xh
- ca
- gl
- mt
license: apache-2.0
library_name: pytorch
pipeline_tag: text-generation
tags:
- mamba
- ssm
- state-space-model
- mixture-of-experts
- moe
- multilingual
- distillation
- knowledge-distillation
- aya
- hybrid-architecture
- wayy-research
model-index:
- name: aetheris
  results: []
---

# Aetheris

> A hybrid Mamba-MoE language model distilled from Aya for efficient multilingual generation across 67 languages.

**Aetheris** is a 536M-parameter hybrid architecture that interleaves State Space Model (Mamba) layers with Sparse Mixture-of-Experts (MoE) layers. It was distilled from [CohereLabs/tiny-aya-global](https://huggingface.co/CohereLabs/tiny-aya-global) (3.35B params) using a 3-stage pipeline: CKA-guided alignment, KL divergence distillation across 67 languages, and supervised fine-tuning on multilingual chat data.

The goal: compress a massively multilingual teacher into a model small enough to run on consumer hardware, without abandoning low-resource languages.

| | |
|---|---|
| **Developer** | [Wayy Research](https://wayyresearch.com), Buffalo NY |
| **Parameters** | 536M (pruned) / 722M (full vocab) |
| **Teacher** | CohereLabs/tiny-aya-global (3.35B) |
| **Compression** | ~4.6x (full vocab) / ~6.3x (pruned) |
| **Languages** | 67 |
| **License** | Apache 2.0 |
| **Demo** | [aetheris-playground](https://huggingface.co/spaces/wayyresearch/aetheris-playground) |

## Architecture

Aetheris uses a hybrid design that alternates between two layer types across 24 total layers:

- **12 SSM (Mamba) layers** (even indices) -- linear-time sequence modeling with selective state spaces
- **12 Sparse MoE layers** (odd indices) -- capacity scaling through top-1 routing over 4 experts

This interleaving gives the model both efficient long-range dependency modeling (SSM) and parameter-efficient capacity (MoE).

### Configuration

| Hyperparameter | Value |
|---|---|
| `d_model` | 1024 |
| `d_ff` | 3072 |
| `d_inner` (SSM) | 2048 |
| `n_layer` | 24 (12 SSM + 12 MoE) |
| `ssm_d_state` | 16 |
| `ssm_expand` | 2 |
| `num_experts` | 4 |
| `top_k` (routing) | 1 |
| `vocab_size` | 261,019 (shared Aya tokenizer) |
| `max_seq_len` | 2048 |
| Weight tying | Embedding + LM head shared |

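The top-1 routing used in the MoE layers can be sketched in a few lines. This is a minimal illustrative PyTorch implementation, not the actual `aetheris.model` code; the class name, expert structure, and gating details are assumptions (only `num_experts=4` and top-1 routing come from the config table).

```python
import torch
import torch.nn as nn


class Top1MoE(nn.Module):
    """Illustrative sparse MoE layer with top-1 routing over N experts.

    Each token is sent to exactly one expert (its argmax gate); the output
    is scaled by the gate probability so the router still receives gradient.
    """

    def __init__(self, d_model=1024, d_ff=3072, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (batch, seq, d_model)
        flat = x.reshape(-1, x.shape[-1])        # route each token independently
        gate = self.router(flat).softmax(dim=-1)  # (tokens, num_experts)
        top1 = gate.argmax(dim=-1)               # top-1 expert index per token
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                out[mask] = expert(flat[mask]) * gate[mask, e].unsqueeze(-1)
        return out.reshape(x.shape)
```

Only the selected expert runs per token, so active parameters per forward pass stay close to a dense FFN of the same width while total capacity scales with the expert count.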
## Training

### 3-Stage Distillation Pipeline

**Stage 1 -- CKA Layer Alignment**
Aligns student hidden representations to teacher layers using Centered Kernel Alignment (10K steps). This gives the student a structural initialization before distillation begins.

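Linear CKA, the similarity measure behind this stage, is simple to compute from two activation matrices. The sketch below follows the standard formulation; how the alignment stage maps student layers to teacher layers is not specified by this card.

```python
import torch


def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between activation matrices.

    X: (n_samples, d_x), Y: (n_samples, d_y). Returns a scalar in [0, 1];
    1 means the two representations are identical up to rotation/scale.
    """
    X = X - X.mean(dim=0, keepdim=True)   # center features
    Y = Y - Y.mean(dim=0, keepdim=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = (Y.T @ X).norm() ** 2
    den = (X.T @ X).norm() * (Y.T @ Y).norm()
    return (num / den).item()


x = torch.randn(64, 32)
assert abs(linear_cka(x, x) - 1.0) < 1e-4   # identical reps score 1
```

A typical alignment recipe scores every (student layer, teacher layer) pair on a batch of hidden states and pairs each student layer with its highest-CKA teacher layer before distillation.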
**Stage 2 -- KL Divergence Distillation**
Full knowledge distillation across 67 languages. 20K training steps. Best validation loss: **2.73**.

Key findings from this stage:
- SSM layers receive ~27x less gradient than MoE layers (gradient imbalance ratio = 0.037)
- A **10x learning rate boost** for SSM layers resolved this, reducing KL by 26% and increasing teacher-student agreement by 12x
- Optimal temperature: T=2.0 with alpha=0.7 and cosine schedule

**Stage 3 -- Supervised Fine-Tuning** *(in progress)*
Fine-tuning on multilingual chat data from CohereForAI/aya_collection and aya_evaluation_suite.

| Parameter | Value |
|---|---|
| Data | 16,907 examples, 10 languages (en, es, hi, zh, ar, sw, tr, ja, id, te) |
| Loss masking | Assistant tokens only |
| Learning rate | 2e-5 |
| Batch size | 4 (x4 gradient accumulation) |
| Steps | 5,000 |
| Max sequence length | 512 |

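The T=2.0 / alpha=0.7 setting is consistent with the standard temperature-scaled distillation objective. The exact aetheris loss is not published, so the following is a sketch of the common formulation (soft KL term weighted by alpha, hard cross-entropy by 1 - alpha), not the project's actual code.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.7):
    """Temperature-scaled knowledge-distillation loss (standard formulation).

    student_logits, teacher_logits: (batch, vocab); labels: (batch,).
    alpha weights the soft (teacher) term; the T*T factor keeps soft-term
    gradient magnitudes comparable across temperatures.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

With alpha=0.7 the student mostly imitates the teacher's softened distribution while the remaining 30% anchors it to the ground-truth next token.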
### Expert Initialization

MoE experts were initialized using SVD decomposition of teacher FFN weights, producing genuinely diverse experts (inter-expert CKA = 0.097) rather than near-identical copies (CKA = 0.88 for naive replication).

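The card does not spell out the decomposition scheme, so here is one plausible version for illustration: give each expert a disjoint slice of the teacher weight's singular spectrum, so experts start diverse by construction. Function and argument names are hypothetical.

```python
import torch


def svd_expert_init(teacher_w, num_experts=4):
    """Illustrative SVD-based expert initialization (scheme assumed, not
    confirmed by the card): split the teacher FFN weight's singular
    components evenly across experts.

    teacher_w: (d_ff, d_model) teacher FFN weight matrix.
    Returns num_experts matrices of the same shape, each a low-rank
    reconstruction from a different band of singular values.
    """
    U, S, Vh = torch.linalg.svd(teacher_w, full_matrices=False)
    r = S.shape[0] // num_experts          # singular components per expert
    experts = []
    for e in range(num_experts):
        sl = slice(e * r, (e + 1) * r)
        experts.append(U[:, sl] @ torch.diag(S[sl]) @ Vh[sl, :])
    return experts
```

Because each expert covers a different band of the spectrum, the slices sum back to the teacher weight (when the rank divides evenly) yet are mutually near-orthogonal, which matches the low inter-expert CKA the card reports aiming for.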
### Vocab Pruning

The original Aya vocabulary (261K tokens) was pruned to 80K tokens, reducing the model from 722M to 536M parameters (25.7% reduction) with less than a 5% increase in fertility across languages. 131,231 dead tokens (never used by any of the 67 target languages) were removed; per-language coverage was preserved via a frequency-based keep-list union, and weight tying (embedding = lm_head) was retained. A `vocab_mapping.json` file maps between original Aya tokenizer IDs and pruned model IDs.

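The keep-list-union step can be sketched as follows; this is illustrative (function and argument names are not from the aetheris codebase), showing how per-language keep-lists union into one kept-token set, the embedding rows are sliced, and an old-to-new ID mapping is produced.

```python
import torch


def prune_vocab(embedding, per_language_keep_lists):
    """Illustrative frequency-based keep-list-union vocabulary pruning.

    embedding: (vocab_size, d_model) embedding matrix.
    per_language_keep_lists: one list of token IDs to keep per language.
    Returns the pruned embedding and an {old_id: new_id} mapping
    (the role played by vocab_mapping.json in the model card).
    """
    keep = sorted(set().union(*map(set, per_language_keep_lists)))
    old_to_new = {old: new for new, old in enumerate(keep)}
    pruned = embedding[torch.tensor(keep)]   # keep only surviving rows
    return pruned, old_to_new


emb = torch.randn(100, 8)
pruned, mapping = prune_vocab(emb, [[0, 5, 7], [5, 9], [1, 7]])
# kept IDs: [0, 1, 5, 7, 9] -> pruned embedding has 5 rows
```

With weight tying, slicing the embedding matrix simultaneously prunes the LM head, which is where the bulk of the 722M-to-536M reduction comes from.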
## Languages

Aetheris supports 67 languages spanning 13 script families:

**Latin**: English, French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Danish, Swedish, Norwegian, Polish, Czech, Slovak, Croatian, Slovenian, Catalan, Galician, Maltese, Basque, Welsh, Irish, Latvian, Lithuanian, Estonian, Finnish, Hungarian, Turkish, Indonesian, Malay, Tagalog, Javanese, Vietnamese, Swahili, Hausa, Igbo, Yoruba, Somali, Zulu, Xhosa

**Cyrillic**: Russian, Ukrainian, Serbian, Bulgarian

**Arabic**: Arabic, Persian, Urdu

**Devanagari**: Hindi, Marathi, Nepali

**CJK**: Chinese, Japanese, Korean

**Other scripts**: Bengali, Gujarati, Punjabi (Gurmukhi), Tamil, Telugu, Hebrew, Greek, Thai, Khmer, Lao, Burmese, Amharic (Ge'ez)

### Equity Findings

Tokenizer analysis revealed a **4.4x fertility ratio** across languages (p=0.002), with script being the strongest predictor of tokenizer efficiency (p=0.047). Eight high-priority languages were identified for equity monitoring, with the hardest being Amharic (KL=1.80), Burmese (1.64), and Lao (1.56).

Cross-lingual representation similarity of **0.88** indicates strong transfer potential across the language set.

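Fertility here means average subword tokens per word; the 4.4x ratio compares the highest- and lowest-fertility languages. The exact metric definition used by the analysis is an assumption, but a standard version is easy to compute for any tokenizer:

```python
def fertility(tokenize, texts):
    """Tokenizer fertility: average subword tokens per whitespace word.

    tokenize: any callable mapping a string to a list of tokens.
    (Assumption: the card does not specify its exact fertility definition;
    tokens-per-word is the usual one.)
    """
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words


# Toy illustration: a character-level "tokenizer" has far higher fertility
# than a word-level one; the 4.4x figure compares real languages/scripts.
word_fert = fertility(str.split, ["hello world"])                          # 1.0
char_fert = fertility(lambda s: list(s.replace(" ", "")), ["hello world"])  # 5.0
```

Higher fertility means more tokens per word, so high-fertility languages pay more compute and context budget per sentence, which is why the pruning kept the mean increase under 5%.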
## Usage

```python
import torch
import sys
from huggingface_hub import snapshot_download

# Download model
local_dir = snapshot_download("wayyresearch/aetheris")
sys.path.insert(0, local_dir)

# Load model
from aetheris.config import AetherisConfig
from aetheris.model import HybridMambaMoE

config = AetherisConfig.from_yaml(f"{local_dir}/config.yaml")
model = HybridMambaMoE(config)

sd = torch.load(
    f"{local_dir}/pytorch_model.pt",
    map_location="cpu",
    weights_only=True,
)
model.load_state_dict(sd)
model.eval()

# Tokenize (uses the Aya tokenizer)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-expanse-8b")

input_ids = tokenizer.encode("Hello, how are you?", return_tensors="pt")
with torch.no_grad():
    output = model(input_ids)
    logits = output["logits"]

# Get next-token prediction
next_token = torch.argmax(logits[:, -1, :], dim=-1)
print(tokenizer.decode(next_token))
```

### Generation Loop

```python
def generate(model, tokenizer, prompt, max_new_tokens=100):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids

    with torch.no_grad():
        for _ in range(max_new_tokens):
            output = model(generated)
            next_token = torch.argmax(output["logits"][:, -1, :], dim=-1, keepdim=True)
            generated = torch.cat([generated, next_token], dim=-1)
            if next_token.item() == tokenizer.eos_token_id:
                break

    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(generate(model, tokenizer, "The capital of France is"))
```

### Multilingual Example

```python
prompts = [
    "The weather today is",          # English
    "El clima de hoy es",            # Spanish
    "La capitale de la France est",  # French
]

for prompt in prompts:
    print(f"{prompt} -> {generate(model, tokenizer, prompt, max_new_tokens=20)}")
```

## Files in This Repository

| File | Description |
|---|---|
| `pytorch_model.pt` | Model weights (state_dict) |
| `config.yaml` | Model configuration (AetherisConfig) |
| `aetheris/` | Model source code (importable Python package) |
| `student_config.yaml` | Student architecture config used during training |
| `training_config.yaml` | Training hyperparameters |
| `stage1_checkpoint.pt` | Stage 1 (CKA alignment) checkpoint |
| `stage2_best.pt` | Stage 2 (KL distillation) best checkpoint |

## Limitations

- **Stage 3 SFT is in progress.** The current weights reflect Stage 2 distillation. Conversational and instruction-following quality will improve after SFT completes.
- **Not a chat model yet.** The model generates continuations, not structured dialogue. SFT will address this.
- **Low-resource language quality varies.** Languages with non-Latin scripts (Amharic, Burmese, Lao) show higher loss. This is an active area of work.
- **No CUDA-optimized SSM kernels.** The current implementation uses a pure-Python SSM fallback. Inference speed will improve with Mamba CUDA kernels.
- **Evaluation benchmarks pending.** Systematic multilingual benchmarks are planned post-SFT.

## Citation

```bibtex
@misc{aetheris2026,
  title={Aetheris: A Hybrid Mamba-MoE Model for Efficient Multilingual Generation},
  author={Wayy Research},
  year={2026},
  url={https://huggingface.co/wayyresearch/aetheris},
}
```

## Acknowledgments

- [CohereForAI](https://cohere.com/research) for the Aya model family and multilingual datasets
- The [Mamba](https://arxiv.org/abs/2312.00752) authors for state space model foundations
- The open-source multilingual NLP community

---

Built with frustration and determination by [Wayy Research](https://wayyresearch.com), Buffalo NY.
*People for research, research for people.*