---
language:
- en
- fr
- es
- pt
- it
- ro
- de
- nl
- da
- sv
- "no"
- ru
- uk
- pl
- cs
- sk
- hr
- sr
- sl
- bg
- lv
- lt
- el
- et
- fi
- hu
- eu
- cy
- ga
- ar
- fa
- he
- tr
- hi
- ur
- bn
- mr
- gu
- pa
- ne
- ta
- te
- zh
- ja
- ko
- id
- ms
- tl
- jv
- vi
- km
- th
- lo
- my
- am
- ha
- ig
- sw
- yo
- so
- zu
- xh
- ca
- gl
- mt
license: apache-2.0
library_name: pytorch
pipeline_tag: text-generation
tags:
- mamba
- ssm
- state-space-model
- mixture-of-experts
- moe
- multilingual
- distillation
- knowledge-distillation
- aya
- hybrid-architecture
- wayy-research
model-index:
- name: aetheris
  results: []
---

# Aetheris

> A hybrid Mamba-MoE language model distilled from Aya for efficient multilingual generation across 67 languages.

**Aetheris** is a 536M-parameter hybrid architecture that interleaves State Space Model (Mamba) layers with Sparse Mixture-of-Experts (MoE) layers. It was distilled from [CohereLabs/tiny-aya-global](https://huggingface.co/CohereForAI/aya-expanse-8b) (3.35B params) using a 3-stage pipeline: CKA-guided alignment, KL divergence distillation across 67 languages, and supervised fine-tuning on multilingual chat data.

The goal: compress a massively multilingual teacher into a model small enough to run on consumer hardware, without abandoning low-resource languages.

| | |
|---|---|
| **Developer** | [Wayy Research](https://wayyresearch.com), Buffalo NY |
| **Parameters** | 536M (pruned) / 722M (full vocab) |
| **Teacher** | CohereLabs/tiny-aya-global (3.35B) |
| **Compression** | ~4.6x (base config) |
| **Languages** | 67 |
| **License** | Apache 2.0 |
| **Demo** | [aetheris-playground](https://huggingface.co/spaces/wayyresearch/aetheris-playground) |

## Architecture

Aetheris uses a hybrid design that alternates between two layer types across 24 total layers:

- **12 SSM (Mamba) layers** (even indices) -- linear-time sequence modeling with selective state spaces
- **12 Sparse MoE layers** (odd indices) -- capacity scaling through top-1 routing over 4 experts

This interleaving gives the model both efficient long-range dependency modeling (SSM) and parameter-efficient capacity (MoE).
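
As a rough illustration, the alternation corresponds to a simple even/odd layer schedule (a sketch only; the actual module construction lives in the `aetheris` package):

```python
# Sketch of the 24-layer interleaving: even indices are SSM (Mamba)
# blocks, odd indices are sparse-MoE blocks. Illustrative only -- the
# real construction is in aetheris/model.py.
N_LAYER = 24

def layer_type(i: int) -> str:
    return "ssm" if i % 2 == 0 else "moe"

schedule = [layer_type(i) for i in range(N_LAYER)]
print(schedule[:4])                                  # ['ssm', 'moe', 'ssm', 'moe']
print(schedule.count("ssm"), schedule.count("moe"))  # 12 12
```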

### Configuration

| Hyperparameter | Value |
|---|---|
| `d_model` | 1024 |
| `d_ff` | 3072 |
| `d_inner` (SSM) | 2048 |
| `n_layer` | 24 (12 SSM + 12 MoE) |
| `ssm_d_state` | 16 |
| `ssm_expand` | 2 |
| `num_experts` | 4 |
| `top_k` (routing) | 1 |
| `vocab_size` | 261,019 (shared Aya tokenizer) |
| `max_seq_len` | 2048 |
| Weight tying | Embedding + LM head shared |

## Training

### 3-Stage Distillation Pipeline

**Stage 1 -- CKA Layer Alignment**
Aligns student hidden representations to teacher layers using Centered Kernel Alignment. This gives the student a structural initialization before distillation begins.
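
Linear CKA between two activation matrices can be computed as below (a minimal numpy sketch of the similarity measure; the Stage 1 code may use a different CKA variant):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between activations X (n, d1) and Y (n, d2)
    collected over the same n inputs."""
    X = X - X.mean(axis=0)   # center each feature
    Y = Y - Y.mean(axis=0)
    # HSIC-based form: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
student = rng.standard_normal((256, 64))   # toy student activations
print(linear_cka(student, student))        # 1.0 for identical representations
```

Stage 1 drives this similarity up between chosen student and teacher layers before distillation starts.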

**Stage 2 -- KL Divergence Distillation**
Full knowledge distillation across 67 languages. 20K training steps. Best validation loss: **2.73**.

Key findings from this stage:
- SSM layers received roughly 27x smaller gradients than MoE layers (gradient imbalance ratio = 0.037)
- A **10x learning rate boost** for SSM layers resolved this, reducing KL by 26% and increasing teacher-student agreement by 12x
- Optimal temperature: T=2.0 with alpha=0.7 and a cosine schedule
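
The reported settings correspond to a standard temperature-scaled distillation objective; a numpy sketch (the training code adds the cosine schedule and other details not shown here, and may differ in specifics):

```python
import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.7):
    """alpha-weighted blend of temperature-scaled KL(teacher || student)
    and hard-label cross-entropy. Illustrative sketch only."""
    p_t = softmax(teacher_logits, T)
    log_p_s = np.log(softmax(student_logits, T))
    # The T^2 factor keeps soft-target gradients comparable across temperatures
    kl = (p_t * (np.log(p_t) - log_p_s)).sum(axis=-1).mean() * T ** 2
    rows = np.arange(len(labels))
    ce = -np.log(softmax(student_logits)[rows, labels]).mean()
    return alpha * kl + (1 - alpha) * ce

rng = np.random.default_rng(0)
s = rng.standard_normal((8, 100))   # toy student logits
t = rng.standard_normal((8, 100))   # toy teacher logits
y = rng.integers(0, 100, size=8)    # toy hard labels
print(distill_loss(s, t, y))        # positive; the KL term vanishes when s == t
```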

**Stage 3 -- Supervised Fine-Tuning** *(in progress)*
Fine-tuning on multilingual chat data from CohereForAI/aya_collection and aya_evaluation_suite.

| Parameter | Value |
|---|---|
| Data | 16,907 examples, 10 languages (en, es, hi, zh, ar, sw, tr, ja, id, te) |
| Loss masking | Assistant tokens only |
| Learning rate | 2e-5 |
| Batch size | 4 (x4 gradient accumulation) |
| Steps | 5,000 |
| Max sequence length | 512 |
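
Restricting the loss to assistant tokens is typically done by setting non-assistant label positions to an ignore index; a sketch with made-up token ids (the real SFT code and chat template will differ):

```python
# Labels at user-turn positions are replaced by an ignore index so they
# contribute no gradient; only assistant tokens are learned. Token ids
# and the turn boundary below are invented for illustration.
IGNORE_INDEX = -100   # common convention for cross-entropy ignore_index

def mask_labels(token_ids, assistant_mask):
    """Keep labels only where assistant_mask is True."""
    return [tid if is_asst else IGNORE_INDEX
            for tid, is_asst in zip(token_ids, assistant_mask)]

tokens = [5, 17, 9, 42, 3, 8]                # user prompt + assistant reply
is_assistant = [False, False, False, True, True, True]
print(mask_labels(tokens, is_assistant))     # [-100, -100, -100, 42, 3, 8]
```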

### Expert Initialization

MoE experts were initialized via SVD of teacher FFN weights, producing genuinely diverse experts (inter-expert CKA = 0.097) rather than near-identical copies (CKA = 0.88 for naive replication).
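
One way such an initialization can work is to assign each expert a different slice of the teacher FFN's singular spectrum (a hypothetical sketch; the repo's exact scheme is not shown here):

```python
import numpy as np

# Factor a toy teacher FFN weight with SVD and give each of the 4
# experts a disjoint band of singular components, so experts start
# diverse rather than as identical copies. Illustrative only.
rng = np.random.default_rng(0)
W_teacher = rng.standard_normal((512, 128))         # toy FFN weight
U, S, Vt = np.linalg.svd(W_teacher, full_matrices=False)

num_experts = 4
per_expert = len(S) // num_experts                  # 128 // 4 = 32 components each
experts = []
for e in range(num_experts):
    sl = slice(e * per_expert, (e + 1) * per_expert)
    experts.append((U[:, sl] * S[sl]) @ Vt[sl, :])  # rank-32 reconstruction

# The slices partition the spectrum, so the experts sum back to the teacher
print(np.allclose(sum(experts), W_teacher))         # True
```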

### Vocab Pruning

The original Aya vocabulary (255K tokens) was pruned to 80K tokens, reducing the model from 722M to 536M parameters (25.7% reduction) with less than 5% increase in fertility across languages.
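
With tied embeddings, pruning removes one `d_model`-sized row per dropped token. A quick sanity check of the savings (using the 261,019-token vocab size from the configuration table), plus the row-gather that mechanically implements pruning:

```python
import numpy as np

d_model = 1024
full_vocab, pruned_vocab = 261_019, 80_000   # from the config table / pruning stats

# With tied input embedding and LM head, each dropped token frees one
# d_model-sized row (counted once because of weight tying).
saved = (full_vocab - pruned_vocab) * d_model
print(f"~{saved / 1e6:.0f}M parameters saved")   # ~185M, consistent with 722M -> 536M

# Mechanically, pruning is a row-gather on the embedding matrix
# (toy sizes here; kept ids would come from corpus token frequencies):
emb = np.zeros((1000, 8), dtype=np.float32)
kept_ids = np.array([0, 3, 5, 999])
print(emb[kept_ids].shape)                       # (4, 8)
```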

## Languages

Aetheris supports 67 languages spanning 13 script families:

**Latin**: English, French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Danish, Swedish, Norwegian, Polish, Czech, Slovak, Croatian, Slovenian, Catalan, Galician, Maltese, Basque, Welsh, Irish, Latvian, Lithuanian, Estonian, Finnish, Hungarian, Turkish, Indonesian, Malay, Tagalog, Javanese, Vietnamese, Swahili, Hausa, Igbo, Yoruba, Somali, Zulu, Xhosa

**Cyrillic**: Russian, Ukrainian, Serbian, Bulgarian

**Arabic**: Arabic, Persian, Urdu

**Devanagari**: Hindi, Marathi, Nepali

**CJK**: Chinese, Japanese, Korean

**Other scripts**: Bengali, Gujarati, Punjabi (Gurmukhi), Tamil, Telugu, Hebrew, Greek, Thai, Khmer, Lao, Burmese, Amharic (Ge'ez)

### Equity Findings

Tokenizer analysis revealed a **4.4x fertility ratio** across languages (p=0.002), with script being the strongest predictor of tokenizer efficiency (p=0.047). Eight high-priority languages were identified for equity monitoring; the hardest are Amharic (KL=1.80), Burmese (KL=1.64), and Lao (KL=1.56).
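
Fertility here means subword tokens produced per word, so a 4.4x ratio says the worst-served language needs roughly 4.4x more tokens than the best-served one for comparable text. A toy illustration with invented counts (real numbers come from running the Aya tokenizer on parallel text):

```python
# Fertility = subword tokens per whitespace-delimited word. The counts
# below are invented purely to illustrate the ratio.
def fertility(num_tokens: int, num_words: int) -> float:
    return num_tokens / num_words

best = fertility(num_tokens=130, num_words=100)    # e.g. a well-served Latin-script language
worst = fertility(num_tokens=572, num_words=100)   # e.g. an underserved script
print(f"fertility ratio: {worst / best:.1f}x")     # 4.4x
```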

Cross-lingual representation similarity of **0.88** indicates strong transfer potential across the language set.

## Usage

```python
import torch
import sys
from huggingface_hub import snapshot_download

# Download model
local_dir = snapshot_download("wayyresearch/aetheris")
sys.path.insert(0, local_dir)

# Load model
from aetheris.config import AetherisConfig
from aetheris.model import HybridMambaMoE

config = AetherisConfig.from_yaml(f"{local_dir}/config.yaml")
model = HybridMambaMoE(config)

sd = torch.load(
    f"{local_dir}/pytorch_model.pt",
    map_location="cpu",
    weights_only=True,
)
model.load_state_dict(sd)
model.eval()

# Tokenize (uses the Aya tokenizer)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-expanse-8b")

input_ids = tokenizer.encode("Hello, how are you?", return_tensors="pt")
with torch.no_grad():
    output = model(input_ids)
    logits = output["logits"]

# Get next-token prediction
next_token = torch.argmax(logits[:, -1, :], dim=-1)
print(tokenizer.decode(next_token))
```

### Generation Loop

```python
def generate(model, tokenizer, prompt, max_new_tokens=100):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids

    with torch.no_grad():
        for _ in range(max_new_tokens):
            output = model(generated)
            next_token = torch.argmax(output["logits"][:, -1, :], dim=-1, keepdim=True)
            generated = torch.cat([generated, next_token], dim=-1)
            if next_token.item() == tokenizer.eos_token_id:
                break

    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(generate(model, tokenizer, "The capital of France is"))
```

### Multilingual Example

```python
prompts = [
    "The weather today is",          # English
    "El clima de hoy es",            # Spanish
    "La capitale de la France est",  # French
]

for prompt in prompts:
    print(f"{prompt} -> {generate(model, tokenizer, prompt, max_new_tokens=20)}")
```

## Files in This Repository

| File | Description |
|---|---|
| `pytorch_model.pt` | Model weights (state_dict) |
| `config.yaml` | Model configuration (AetherisConfig) |
| `aetheris/` | Model source code (importable Python package) |
| `student_config.yaml` | Student architecture config used during training |
| `training_config.yaml` | Training hyperparameters |
| `stage1_checkpoint.pt` | Stage 1 (CKA alignment) checkpoint |
| `stage2_best.pt` | Stage 2 (KL distillation) best checkpoint |

## Limitations

- **Stage 3 SFT is in progress.** The current weights reflect Stage 2 distillation. Conversational and instruction-following quality will improve after SFT completes.
- **Not a chat model yet.** The model generates continuations, not structured dialogue. SFT will address this.
- **Low-resource language quality varies.** Languages with non-Latin scripts (Amharic, Burmese, Lao) show higher loss. This is an active area of work.
- **No CUDA-optimized SSM kernels.** The current implementation uses a pure-Python SSM fallback. Inference speed will improve with Mamba CUDA kernels.
- **Evaluation benchmarks pending.** Systematic multilingual benchmarks are planned post-SFT.
## Citation

```bibtex
@misc{aetheris2026,
  title={Aetheris: A Hybrid Mamba-MoE Model for Efficient Multilingual Generation},
  author={Wayy Research},
  year={2026},
  url={https://huggingface.co/wayyresearch/aetheris},
}
```

## Acknowledgments

- [CohereForAI](https://cohere.com/research) for the Aya model family and multilingual datasets
- The [Mamba](https://arxiv.org/abs/2312.00752) authors for state space model foundations
- The open-source multilingual NLP community

---

Built with frustration and determination by [Wayy Research](https://wayyresearch.com), Buffalo NY.
*People for research, research for people.*