rcgalbo committed (verified) · Commit 051c2da · Parent(s): 25ac0e1

Update model card with full architecture and training details

Files changed (1): README.md (+278, −58)
---
language:
- en
- fr
- es
- pt
- it
- ro
- de
- nl
- da
- sv
- "no"
- ru
- uk
- pl
- cs
- sk
- hr
- sr
- sl
- bg
- lv
- lt
- el
- et
- fi
- hu
- eu
- cy
- ga
- ar
- fa
- he
- tr
- hi
- ur
- bn
- mr
- gu
- pa
- ne
- ta
- te
- zh
- ja
- ko
- id
- ms
- tl
- jv
- vi
- km
- th
- lo
- my
- am
- ha
- ig
- sw
- yo
- so
- zu
- xh
- ca
- gl
- mt
license: apache-2.0
library_name: pytorch
pipeline_tag: text-generation
tags:
- mamba
- ssm
- state-space-model
- mixture-of-experts
- moe
- multilingual
- distillation
- knowledge-distillation
- aya
- hybrid-architecture
- wayy-research
model-index:
- name: aetheris
  results: []
---

# Aetheris

> A hybrid Mamba-MoE language model distilled from Aya for efficient multilingual generation across 67 languages.

**Aetheris** is a 536M-parameter hybrid architecture that interleaves State Space Model (Mamba) layers with Sparse Mixture-of-Experts (MoE) layers. It was distilled from [CohereLabs/tiny-aya-global](https://huggingface.co/CohereLabs/tiny-aya-global) (3.35B params) using a 3-stage pipeline: CKA-guided alignment, KL divergence distillation across 67 languages, and supervised fine-tuning on multilingual chat data.

The goal: compress a massively multilingual teacher into a model small enough to run on consumer hardware, without abandoning low-resource languages.

| | |
|---|---|
| **Developer** | [Wayy Research](https://wayyresearch.com), Buffalo NY |
| **Parameters** | 536M (pruned) / 722M (full vocab) |
| **Teacher** | CohereLabs/tiny-aya-global (3.35B) |
| **Compression** | ~4.6x (full vocab) / ~6.3x (pruned) |
| **Languages** | 67 |
| **License** | Apache 2.0 |
| **Demo** | [aetheris-playground](https://huggingface.co/spaces/wayyresearch/aetheris-playground) |

## Architecture

Aetheris uses a hybrid design that alternates between two layer types across 24 total layers:

- **12 SSM (Mamba) layers** (even indices) -- linear-time sequence modeling with selective state spaces
- **12 Sparse MoE layers** (odd indices) -- capacity scaling through top-1 routing over 4 experts

This interleaving gives the model both efficient long-range dependency modeling (SSM) and parameter-efficient capacity (MoE).

### Configuration

| Hyperparameter | Value |
|---|---|
| `d_model` | 1024 |
| `d_ff` | 3072 |
| `d_inner` (SSM) | 2048 |
| `n_layer` | 24 (12 SSM + 12 MoE) |
| `ssm_d_state` | 16 |
| `ssm_expand` | 2 |
| `num_experts` | 4 |
| `top_k` (routing) | 1 |
| `vocab_size` | 261,019 (shared Aya tokenizer) |
| `max_seq_len` | 2048 |
| Weight tying | Embedding + LM head shared |

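The top-1 routing used in the MoE layers can be sketched in a few lines. This is a minimal illustrative PyTorch implementation, not the actual `aetheris.model` code; the class name, expert structure, and gating details are assumptions (only `num_experts=4` and top-1 routing come from the config table).

```python
import torch
import torch.nn as nn


class Top1MoE(nn.Module):
    """Illustrative sparse MoE layer with top-1 routing over N experts.

    Each token is sent to exactly one expert (its argmax gate); the output
    is scaled by the gate probability so the router still receives gradient.
    """

    def __init__(self, d_model=1024, d_ff=3072, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (batch, seq, d_model)
        flat = x.reshape(-1, x.shape[-1])        # route each token independently
        gate = self.router(flat).softmax(dim=-1)  # (tokens, num_experts)
        top1 = gate.argmax(dim=-1)               # top-1 expert index per token
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                out[mask] = expert(flat[mask]) * gate[mask, e].unsqueeze(-1)
        return out.reshape(x.shape)
```

Only the selected expert runs per token, so active parameters per forward pass stay close to a dense FFN of the same width while total capacity scales with the expert count.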
## Training

### 3-Stage Distillation Pipeline

**Stage 1 -- CKA Layer Alignment**
Aligns student hidden representations to teacher layers using Centered Kernel Alignment (10K steps). This gives the student a structural initialization before distillation begins.

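Linear CKA, the similarity measure behind this stage, is simple to compute from two activation matrices. The sketch below follows the standard formulation; how the alignment stage maps student layers to teacher layers is not specified by this card.

```python
import torch


def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between activation matrices.

    X: (n_samples, d_x), Y: (n_samples, d_y). Returns a scalar in [0, 1];
    1 means the two representations are identical up to rotation/scale.
    """
    X = X - X.mean(dim=0, keepdim=True)   # center features
    Y = Y - Y.mean(dim=0, keepdim=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = (Y.T @ X).norm() ** 2
    den = (X.T @ X).norm() * (Y.T @ Y).norm()
    return (num / den).item()


x = torch.randn(64, 32)
assert abs(linear_cka(x, x) - 1.0) < 1e-4   # identical reps score 1
```

A typical alignment recipe scores every (student layer, teacher layer) pair on a batch of hidden states and pairs each student layer with its highest-CKA teacher layer before distillation.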
**Stage 2 -- KL Divergence Distillation**
Full knowledge distillation across 67 languages. 20K training steps. Best validation loss: **2.73**.

Key findings from this stage:
- SSM layers receive ~27x less gradient than MoE layers (gradient imbalance ratio = 0.037)
- A **10x learning rate boost** for SSM layers resolved this, reducing KL by 26% and increasing teacher-student agreement by 12x
- Optimal temperature: T=2.0 with alpha=0.7 and cosine schedule

**Stage 3 -- Supervised Fine-Tuning** *(in progress)*
Fine-tuning on multilingual chat data from CohereForAI/aya_collection and aya_evaluation_suite.

| Parameter | Value |
|---|---|
| Data | 16,907 examples, 10 languages (en, es, hi, zh, ar, sw, tr, ja, id, te) |
| Loss masking | Assistant tokens only |
| Learning rate | 2e-5 |
| Batch size | 4 (x4 gradient accumulation) |
| Steps | 5,000 |
| Max sequence length | 512 |

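The T=2.0 / alpha=0.7 setting is consistent with the standard temperature-scaled distillation objective. The exact aetheris loss is not published, so the following is a sketch of the common formulation (soft KL term weighted by alpha, hard cross-entropy by 1 - alpha), not the project's actual code.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.7):
    """Temperature-scaled knowledge-distillation loss (standard formulation).

    student_logits, teacher_logits: (batch, vocab); labels: (batch,).
    alpha weights the soft (teacher) term; the T*T factor keeps soft-term
    gradient magnitudes comparable across temperatures.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

With alpha=0.7 the student mostly imitates the teacher's softened distribution while the remaining 30% anchors it to the ground-truth next token.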
### Expert Initialization

MoE experts were initialized using SVD decomposition of teacher FFN weights, producing genuinely diverse experts (inter-expert CKA = 0.097) rather than near-identical copies (CKA = 0.88 for naive replication).

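The card does not spell out the decomposition scheme, so here is one plausible version for illustration: give each expert a disjoint slice of the teacher weight's singular spectrum, so experts start diverse by construction. Function and argument names are hypothetical.

```python
import torch


def svd_expert_init(teacher_w, num_experts=4):
    """Illustrative SVD-based expert initialization (scheme assumed, not
    confirmed by the card): split the teacher FFN weight's singular
    components evenly across experts.

    teacher_w: (d_ff, d_model) teacher FFN weight matrix.
    Returns num_experts matrices of the same shape, each a low-rank
    reconstruction from a different band of singular values.
    """
    U, S, Vh = torch.linalg.svd(teacher_w, full_matrices=False)
    r = S.shape[0] // num_experts          # singular components per expert
    experts = []
    for e in range(num_experts):
        sl = slice(e * r, (e + 1) * r)
        experts.append(U[:, sl] @ torch.diag(S[sl]) @ Vh[sl, :])
    return experts
```

Because each expert covers a different band of the spectrum, the slices sum back to the teacher weight (when the rank divides evenly) yet are mutually near-orthogonal, which matches the low inter-expert CKA the card reports aiming for.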
### Vocab Pruning

The original Aya vocabulary (261K tokens) was pruned to 80K tokens, reducing the model from 722M to 536M parameters (25.7% reduction) with less than a 5% increase in fertility across languages. 131,231 dead tokens (never used by any of the 67 target languages) were removed; per-language coverage was preserved via a frequency-based keep-list union, and weight tying (embedding = lm_head) was retained. A `vocab_mapping.json` file maps between original Aya tokenizer IDs and pruned model IDs.

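The keep-list-union step can be sketched as follows; this is illustrative (function and argument names are not from the aetheris codebase), showing how per-language keep-lists union into one kept-token set, the embedding rows are sliced, and an old-to-new ID mapping is produced.

```python
import torch


def prune_vocab(embedding, per_language_keep_lists):
    """Illustrative frequency-based keep-list-union vocabulary pruning.

    embedding: (vocab_size, d_model) embedding matrix.
    per_language_keep_lists: one list of token IDs to keep per language.
    Returns the pruned embedding and an {old_id: new_id} mapping
    (the role played by vocab_mapping.json in the model card).
    """
    keep = sorted(set().union(*map(set, per_language_keep_lists)))
    old_to_new = {old: new for new, old in enumerate(keep)}
    pruned = embedding[torch.tensor(keep)]   # keep only surviving rows
    return pruned, old_to_new


emb = torch.randn(100, 8)
pruned, mapping = prune_vocab(emb, [[0, 5, 7], [5, 9], [1, 7]])
# kept IDs: [0, 1, 5, 7, 9] -> pruned embedding has 5 rows
```

With weight tying, slicing the embedding matrix simultaneously prunes the LM head, which is where the bulk of the 722M-to-536M reduction comes from.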
## Languages

Aetheris supports 67 languages spanning 13 script families:

**Latin**: English, French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Danish, Swedish, Norwegian, Polish, Czech, Slovak, Croatian, Slovenian, Catalan, Galician, Maltese, Basque, Welsh, Irish, Latvian, Lithuanian, Estonian, Finnish, Hungarian, Turkish, Indonesian, Malay, Tagalog, Javanese, Vietnamese, Swahili, Hausa, Igbo, Yoruba, Somali, Zulu, Xhosa

**Cyrillic**: Russian, Ukrainian, Serbian, Bulgarian

**Arabic**: Arabic, Persian, Urdu

**Devanagari**: Hindi, Marathi, Nepali

**CJK**: Chinese, Japanese, Korean

**Other scripts**: Bengali, Gujarati, Punjabi (Gurmukhi), Tamil, Telugu, Hebrew, Greek, Thai, Khmer, Lao, Burmese, Amharic (Ge'ez)

### Equity Findings

Tokenizer analysis revealed a **4.4x fertility ratio** across languages (p=0.002), with script being the strongest predictor of tokenizer efficiency (p=0.047). Eight high-priority languages were identified for equity monitoring, with the hardest being Amharic (KL=1.80), Burmese (1.64), and Lao (1.56).

Cross-lingual representation similarity of **0.88** indicates strong transfer potential across the language set.

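Fertility here means average subword tokens per word; the 4.4x ratio compares the highest- and lowest-fertility languages. The exact metric definition used by the analysis is an assumption, but a standard version is easy to compute for any tokenizer:

```python
def fertility(tokenize, texts):
    """Tokenizer fertility: average subword tokens per whitespace word.

    tokenize: any callable mapping a string to a list of tokens.
    (Assumption: the card does not specify its exact fertility definition;
    tokens-per-word is the usual one.)
    """
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words


# Toy illustration: a character-level "tokenizer" has far higher fertility
# than a word-level one; the 4.4x figure compares real languages/scripts.
word_fert = fertility(str.split, ["hello world"])                          # 1.0
char_fert = fertility(lambda s: list(s.replace(" ", "")), ["hello world"])  # 5.0
```

Higher fertility means more tokens per word, so high-fertility languages pay more compute and context budget per sentence, which is why the pruning kept the mean increase under 5%.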
## Usage

```python
import torch
import sys
from huggingface_hub import snapshot_download

# Download model
local_dir = snapshot_download("wayyresearch/aetheris")
sys.path.insert(0, local_dir)

# Load model
from aetheris.config import AetherisConfig
from aetheris.model import HybridMambaMoE

config = AetherisConfig.from_yaml(f"{local_dir}/config.yaml")
model = HybridMambaMoE(config)

sd = torch.load(
    f"{local_dir}/pytorch_model.pt",
    map_location="cpu",
    weights_only=True,
)
model.load_state_dict(sd)
model.eval()

# Tokenize (uses the Aya tokenizer)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-expanse-8b")

input_ids = tokenizer.encode("Hello, how are you?", return_tensors="pt")
with torch.no_grad():
    output = model(input_ids)
    logits = output["logits"]

# Get next-token prediction
next_token = torch.argmax(logits[:, -1, :], dim=-1)
print(tokenizer.decode(next_token))
```

### Generation Loop

```python
def generate(model, tokenizer, prompt, max_new_tokens=100):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids

    with torch.no_grad():
        for _ in range(max_new_tokens):
            output = model(generated)
            next_token = torch.argmax(output["logits"][:, -1, :], dim=-1, keepdim=True)
            generated = torch.cat([generated, next_token], dim=-1)
            if next_token.item() == tokenizer.eos_token_id:
                break

    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(generate(model, tokenizer, "The capital of France is"))
```

### Multilingual Example

```python
prompts = [
    "The weather today is",          # English
    "El clima de hoy es",            # Spanish
    "La capitale de la France est",  # French
]

for prompt in prompts:
    print(f"{prompt} -> {generate(model, tokenizer, prompt, max_new_tokens=20)}")
```

## Files in This Repository

| File | Description |
|---|---|
| `pytorch_model.pt` | Model weights (state_dict) |
| `config.yaml` | Model configuration (AetherisConfig) |
| `aetheris/` | Model source code (importable Python package) |
| `student_config.yaml` | Student architecture config used during training |
| `training_config.yaml` | Training hyperparameters |
| `stage1_checkpoint.pt` | Stage 1 (CKA alignment) checkpoint |
| `stage2_best.pt` | Stage 2 (KL distillation) best checkpoint |

## Limitations

- **Stage 3 SFT is in progress.** The current weights reflect Stage 2 distillation. Conversational and instruction-following quality will improve after SFT completes.
- **Not a chat model yet.** The model generates continuations, not structured dialogue. SFT will address this.
- **Low-resource language quality varies.** Languages with non-Latin scripts (Amharic, Burmese, Lao) show higher loss. This is an active area of work.
- **No CUDA-optimized SSM kernels.** The current implementation uses a pure-Python SSM fallback. Inference speed will improve with Mamba CUDA kernels.
- **Evaluation benchmarks pending.** Systematic multilingual benchmarks are planned post-SFT.

## Citation

```bibtex
@misc{aetheris2026,
  title={Aetheris: A Hybrid Mamba-MoE Model for Efficient Multilingual Generation},
  author={Wayy Research},
  year={2026},
  url={https://huggingface.co/wayyresearch/aetheris},
}
```

## Acknowledgments

- [CohereForAI](https://cohere.com/research) for the Aya model family and multilingual datasets
- The [Mamba](https://arxiv.org/abs/2312.00752) authors for state space model foundations
- The open-source multilingual NLP community

---

Built with frustration and determination by [Wayy Research](https://wayyresearch.com), Buffalo NY.
*People for research, research for people.*