---
language:
- en
- fr
- es
- pt
- it
- ro
- de
- nl
- da
- sv
- "no"
- ru
- uk
- pl
- cs
- sk
- hr
- sr
- sl
- bg
- lv
- lt
- el
- et
- fi
- hu
- eu
- cy
- ga
- ar
- fa
- he
- tr
- hi
- ur
- bn
- mr
- gu
- pa
- ne
- ta
- te
- zh
- ja
- ko
- id
- ms
- tl
- jv
- vi
- km
- th
- lo
- my
- am
- ha
- ig
- sw
- yo
- so
- zu
- xh
- ca
- gl
- mt
license: apache-2.0
library_name: pytorch
pipeline_tag: text-generation
tags:
- mamba
- ssm
- state-space-model
- mixture-of-experts
- moe
- multilingual
- distillation
- knowledge-distillation
- aya
- hybrid-architecture
- wayy-research
model-index:
- name: aetheris
  results: []
---

# Aetheris

> A hybrid Mamba-MoE language model distilled from Aya for efficient multilingual generation across 67 languages.

**Aetheris** is a 536M-parameter hybrid architecture that interleaves State Space Model (Mamba) layers with Sparse Mixture-of-Experts (MoE) layers. It was distilled from [CohereLabs/tiny-aya-global](https://huggingface.co/CohereForAI/aya-expanse-8b) (3.35B params) using a 3-stage pipeline: CKA-guided layer alignment, KL-divergence distillation across 67 languages, and supervised fine-tuning on multilingual chat data. The goal: compress a massively multilingual teacher into a model small enough to run on consumer hardware, without abandoning low-resource languages.
| | |
|---|---|
| **Developer** | [Wayy Research](https://wayyresearch.com), Buffalo NY |
| **Parameters** | 536M (pruned) / 722M (full vocab) |
| **Teacher** | CohereLabs/tiny-aya-global (3.35B) |
| **Compression** | ~4.6x (base config) |
| **Languages** | 67 |
| **License** | Apache 2.0 |
| **Demo** | [aetheris-playground](https://huggingface.co/spaces/wayyresearch/aetheris-playground) |

## Architecture

Aetheris uses a hybrid design that alternates between two layer types across 24 total layers:

- **12 SSM (Mamba) layers** (even indices) -- linear-time sequence modeling with selective state spaces
- **12 sparse MoE layers** (odd indices) -- capacity scaling through top-1 routing over 4 experts

This interleaving gives the model both efficient long-range dependency modeling (SSM) and parameter-efficient capacity (MoE).

### Configuration

| Hyperparameter | Value |
|---|---|
| `d_model` | 1024 |
| `d_ff` | 3072 |
| `d_inner` (SSM) | 2048 |
| `n_layer` | 24 (12 SSM + 12 MoE) |
| `ssm_d_state` | 16 |
| `ssm_expand` | 2 |
| `num_experts` | 4 |
| `top_k` (routing) | 1 |
| `vocab_size` | 261,019 (shared Aya tokenizer) |
| `max_seq_len` | 2048 |
| Weight tying | Embedding + LM head shared |

## Training

### 3-Stage Distillation Pipeline

**Stage 1 -- CKA Layer Alignment**

Aligns student hidden representations to teacher layers using Centered Kernel Alignment (CKA). This gives the student a structural initialization before distillation begins.

**Stage 2 -- KL Divergence Distillation**

Full knowledge distillation across 67 languages over 20K training steps. Best validation loss: **2.73**.
Key findings from this stage:

- SSM layers receive ~27x less gradient than MoE layers (gradient imbalance ratio = 0.037)
- A **10x learning-rate boost** for SSM layers resolved the imbalance, reducing KL by 26% and increasing teacher-student agreement 12x
- Optimal temperature: T=2.0 with alpha=0.7 and a cosine schedule

**Stage 3 -- Supervised Fine-Tuning** *(in progress)*

Fine-tuning on multilingual chat data from CohereForAI/aya_collection and aya_evaluation_suite.

| Parameter | Value |
|---|---|
| Data | 16,907 examples, 10 languages (en, es, hi, zh, ar, sw, tr, ja, id, te) |
| Loss masking | Assistant tokens only |
| Learning rate | 2e-5 |
| Batch size | 4 (x4 gradient accumulation) |
| Steps | 5,000 |
| Max sequence length | 512 |

### Expert Initialization

MoE experts were initialized via SVD of the teacher's FFN weights, producing genuinely diverse experts (inter-expert CKA = 0.097) rather than the near-identical copies obtained by naive replication (CKA = 0.88).

### Vocab Pruning

The original Aya vocabulary (255K tokens) was pruned to 80K tokens, reducing the model from 722M to 536M parameters (a 25.7% reduction) with less than a 5% increase in tokenizer fertility across languages.
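The Stage 2 settings above (T=2.0, alpha=0.7) match the standard Hinton-style distillation objective: a temperature-scaled KL term on the teacher's soft distribution mixed with hard-label cross-entropy. A minimal single-token sketch, assuming that exact formulation (the actual training loss may differ in detail):

```python
import math

def softmax(logits, T=1.0):
    # temperature-scaled softmax over a single logit vector
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, target_idx, T=2.0, alpha=0.7):
    """alpha-weighted mix of a temperature-scaled KL term (teacher soft
    targets) and hard-label cross-entropy; the T^2 factor keeps the
    soft-target gradients on the same scale as the T=1 CE term."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    ce = -math.log(softmax(student_logits)[target_idx])
    return alpha * (T ** 2) * kl + (1 - alpha) * ce
```

With alpha=0.7 most of the training signal comes from the teacher's soft distribution rather than the hard labels, which is how the teacher's 67-language coverage is pushed into the student.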
## Languages

Aetheris supports 67 languages spanning 13 script families:

**Latin**: English, French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Danish, Swedish, Norwegian, Polish, Czech, Slovak, Croatian, Slovenian, Catalan, Galician, Maltese, Basque, Welsh, Irish, Latvian, Lithuanian, Estonian, Finnish, Hungarian, Turkish, Indonesian, Malay, Tagalog, Javanese, Vietnamese, Swahili, Hausa, Igbo, Yoruba, Somali, Zulu, Xhosa

**Cyrillic**: Russian, Ukrainian, Serbian, Bulgarian

**Arabic**: Arabic, Persian, Urdu

**Devanagari**: Hindi, Marathi, Nepali

**CJK**: Chinese, Japanese, Korean

**Other scripts**: Bengali, Gujarati, Punjabi (Gurmukhi), Tamil, Telugu, Hebrew, Greek, Thai, Khmer, Lao, Burmese, Amharic (Ge'ez)

### Equity Findings

Tokenizer analysis revealed a **4.4x fertility ratio** between the most and least efficiently tokenized languages (p=0.002), with script family the strongest predictor of tokenizer efficiency (p=0.047). Eight high-priority languages were flagged for equity monitoring; the hardest are Amharic (KL=1.80), Burmese (1.64), and Lao (1.56). A cross-lingual representation similarity of **0.88** indicates strong transfer potential across the language set.
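Fertility here means subword tokens per word, so a 4.4x ratio implies the worst-served language spends roughly 4.4x more tokens per word than the best-served one. A sketch of how such a number can be measured, using a toy chunking tokenizer in place of the real Aya tokenizer (all names here are illustrative):

```python
def fertility(tokenize, texts):
    """Average subword tokens per whitespace-delimited word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# toy "tokenizer": splits every word into 2-character chunks
toy_tokenize = lambda s: [w[i:i + 2] for w in s.split() for i in range(0, len(w), 2)]

print(fertility(toy_tokenize, ["hello world"]))  # 3.0 tokens per word
```

Computing this per language over comparable text samples and dividing the maximum by the minimum yields the kind of cross-language ratio reported above.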
## Usage

```python
import sys

import torch
from huggingface_hub import snapshot_download

# Download the model repository (weights, config, and source code)
local_dir = snapshot_download("wayyresearch/aetheris")
sys.path.insert(0, local_dir)

# Load model
from aetheris.config import AetherisConfig
from aetheris.model import HybridMambaMoE

config = AetherisConfig.from_yaml(f"{local_dir}/config.yaml")
model = HybridMambaMoE(config)
sd = torch.load(
    f"{local_dir}/pytorch_model.pt",
    map_location="cpu",
    weights_only=True,
)
model.load_state_dict(sd)
model.eval()

# Tokenize (uses the Aya tokenizer)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-expanse-8b")
input_ids = tokenizer.encode("Hello, how are you?", return_tensors="pt")

with torch.no_grad():
    output = model(input_ids)
    logits = output["logits"]

# Greedy next-token prediction
next_token = torch.argmax(logits[:, -1, :], dim=-1)
print(tokenizer.decode(next_token))
```

### Generation Loop

```python
def generate(model, tokenizer, prompt, max_new_tokens=100):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            output = model(generated)
            next_token = torch.argmax(
                output["logits"][:, -1, :], dim=-1, keepdim=True
            )
            generated = torch.cat([generated, next_token], dim=-1)
            if next_token.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(generate(model, tokenizer, "The capital of France is"))
```

### Multilingual Example

```python
prompts = [
    "The weather today is",          # English
    "El clima de hoy es",            # Spanish
    "La capitale de la France est",  # French
]
for prompt in prompts:
    print(f"{prompt} -> {generate(model, tokenizer, prompt, max_new_tokens=20)}")
```

## Files in This Repository

| File | Description |
|---|---|
| `pytorch_model.pt` | Model weights (state_dict) |
| `config.yaml` | Model configuration (AetherisConfig) |
| `aetheris/` | Model source code (importable Python package) |
| `student_config.yaml` | Student architecture config used during training |
| `training_config.yaml` | Training hyperparameters |
| `stage1_checkpoint.pt` | Stage 1 (CKA alignment) checkpoint |
| `stage2_best.pt` | Stage 2 (KL distillation) best checkpoint |

## Limitations

- **Stage 3 SFT is in progress.** The current weights reflect Stage 2 distillation. Conversational and instruction-following quality will improve after SFT completes.
- **Not a chat model yet.** The model generates continuations, not structured dialogue. SFT will address this.
- **Low-resource language quality varies.** Languages with non-Latin scripts (Amharic, Burmese, Lao) show higher loss. This is an active area of work.
- **No CUDA-optimized SSM kernels.** The current implementation uses a pure-Python SSM fallback; inference speed will improve with the Mamba CUDA kernels.
- **Evaluation benchmarks pending.** Systematic multilingual benchmarks are planned post-SFT.

## Citation

```bibtex
@misc{aetheris2026,
  title={Aetheris: A Hybrid Mamba-MoE Model for Efficient Multilingual Generation},
  author={Wayy Research},
  year={2026},
  url={https://huggingface.co/wayyresearch/aetheris},
}
```

## Acknowledgments

- [CohereForAI](https://cohere.com/research) for the Aya model family and multilingual datasets
- The [Mamba](https://arxiv.org/abs/2312.00752) authors for state space model foundations
- The open-source multilingual NLP community

---

Built with frustration and determination by [Wayy Research](https://wayyresearch.com), Buffalo NY. *People for research, research for people.*