---
language:
- en
- fr
- es
- pt
- it
- ro
- de
- nl
- da
- sv
- "no"
- ru
- uk
- pl
- cs
- sk
- hr
- sr
- sl
- bg
- lv
- lt
- el
- et
- fi
- hu
- eu
- cy
- ga
- ar
- fa
- he
- tr
- hi
- ur
- bn
- mr
- gu
- pa
- ne
- ta
- te
- zh
- ja
- ko
- id
- ms
- tl
- jv
- vi
- km
- th
- lo
- my
- am
- ha
- ig
- sw
- yo
- so
- zu
- xh
- ca
- gl
- mt
license: apache-2.0
library_name: pytorch
pipeline_tag: text-generation
tags:
- mamba
- ssm
- state-space-model
- mixture-of-experts
- moe
- multilingual
- distillation
- knowledge-distillation
- aya
- hybrid-architecture
- wayy-research
model-index:
- name: aetheris
results: []
---
# Aetheris
> A hybrid Mamba-MoE language model distilled from Aya for efficient multilingual generation across 67 languages.
**Aetheris** is a 536M-parameter hybrid architecture that interleaves State Space Model (Mamba) layers with Sparse Mixture-of-Experts (MoE) layers. It was distilled from [CohereLabs/tiny-aya-global](https://huggingface.co/CohereForAI/aya-expanse-8b) (3.35B params) using a 3-stage pipeline: CKA-guided alignment, KL divergence distillation across 67 languages, and supervised fine-tuning on multilingual chat data.
The goal: compress a massively multilingual teacher into a model small enough to run on consumer hardware, without abandoning low-resource languages.
| | |
|---|---|
| **Developer** | [Wayy Research](https://wayyresearch.com), Buffalo NY |
| **Parameters** | 536M (pruned) / 722M (full vocab) |
| **Teacher** | CohereLabs/tiny-aya-global (3.35B) |
| **Compression** | ~4.6x (base config) |
| **Languages** | 67 |
| **License** | Apache 2.0 |
| **Demo** | [aetheris-playground](https://huggingface.co/spaces/wayyresearch/aetheris-playground) |
## Architecture
Aetheris uses a hybrid design that alternates between two layer types across 24 total layers:
- **12 SSM (Mamba) layers** (even indices) -- linear-time sequence modeling with selective state spaces
- **12 Sparse MoE layers** (odd indices) -- capacity scaling through top-1 routing over 4 experts
This interleaving gives the model both efficient long-range dependency modeling (SSM) and parameter-efficient capacity (MoE).
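The layer plan can be sketched in a few lines (illustrative only; the actual implementation lives in the `aetheris` package in this repo):

```python
def layer_plan(n_layer=24):
    # Even indices are SSM (Mamba) blocks, odd indices are sparse MoE
    # blocks, matching the 12 + 12 split described above.
    return ["ssm" if i % 2 == 0 else "moe" for i in range(n_layer)]

plan = layer_plan()
# plan[:4] == ["ssm", "moe", "ssm", "moe"]; 12 of each across 24 layers
```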
### Configuration
| Hyperparameter | Value |
|---|---|
| `d_model` | 1024 |
| `d_ff` | 3072 |
| `d_inner` (SSM) | 2048 |
| `n_layer` | 24 (12 SSM + 12 MoE) |
| `ssm_d_state` | 16 |
| `ssm_expand` | 2 |
| `num_experts` | 4 |
| `top_k` (routing) | 1 |
| `vocab_size` | 261,019 (shared Aya tokenizer) |
| `max_seq_len` | 2048 |
| Weight tying | Embedding + LM head shared |
## Training
### 3-Stage Distillation Pipeline
**Stage 1 -- CKA Layer Alignment**
Aligns student hidden representations to teacher layers using Centered Kernel Alignment. This gives the student a structural initialization before distillation begins.
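For reference, linear CKA between two sets of hidden states can be computed as below. This is a generic sketch of the metric itself, not the project's alignment code:

```python
import numpy as np

def linear_cka(X, Y):
    # X, Y: (n_samples, n_features) hidden states from two layers.
    # Center along the batch dimension, then compare similarity structure.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
H = rng.normal(size=(64, 32))
linear_cka(H, H)  # a representation is perfectly aligned with itself: 1.0
```

CKA is invariant to orthogonal transformations and isotropic scaling, which makes it a convenient target for matching student layers to teacher layers of different widths.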
**Stage 2 -- KL Divergence Distillation**
Full knowledge distillation across 67 languages. 20K training steps. Best validation loss: **2.73**.
Key findings from this stage:
- SSM layers receive ~27x less gradient than MoE layers (gradient imbalance ratio = 0.037)
- A **10x learning rate boost** for SSM layers resolved this, reducing KL by 26% and increasing teacher-student agreement by 12x
- Optimal temperature: T=2.0 with alpha=0.7 and cosine schedule
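The Stage 2 objective follows standard temperature-scaled distillation. Below is a minimal sketch using the reported settings (T=2.0, alpha=0.7); function and variable names are illustrative, not the project's actual training code:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.7):
    # Soft-target term: KL divergence between temperature-scaled
    # distributions, rescaled by T^2 as in standard knowledge distillation.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy on the ground-truth tokens.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1.0 - alpha) * ce
```

The 10x SSM learning-rate boost maps naturally onto optimizer parameter groups: one group holding the SSM-layer parameters at 10x the base rate, and one group for everything else.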
**Stage 3 -- Supervised Fine-Tuning** *(in progress)*
Fine-tuning on multilingual chat data from CohereForAI/aya_collection and aya_evaluation_suite.
| Parameter | Value |
|---|---|
| Data | 16,907 examples, 10 languages (en, es, hi, zh, ar, sw, tr, ja, id, te) |
| Loss masking | Assistant tokens only |
| Learning rate | 2e-5 |
| Batch size | 4 (x4 gradient accumulation) |
| Steps | 5,000 |
| Max sequence length | 512 |
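"Assistant tokens only" means labels for prompt and system tokens are replaced with the ignore index so they contribute nothing to the loss. A minimal sketch (the actual preprocessing code may differ):

```python
import torch

IGNORE_INDEX = -100  # ignored by PyTorch's cross_entropy by default

def mask_labels(input_ids, assistant_mask):
    # Copy the token ids, then blank out every position that is not part
    # of the assistant's turn so only assistant tokens are supervised.
    labels = input_ids.clone()
    labels[~assistant_mask] = IGNORE_INDEX
    return labels
```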
### Expert Initialization
MoE experts were initialized using SVD decomposition of teacher FFN weights, producing genuinely diverse experts (inter-expert CKA = 0.097) rather than near-identical copies (CKA = 0.88 for naive replication).
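One way to read "SVD decomposition of teacher FFN weights" is to give each expert its own band of singular directions; the sketch below shows that idea, though the exact slicing scheme Aetheris uses is not spelled out in this card:

```python
import torch

def svd_expert_slices(W, num_experts=4):
    # Factor the teacher FFN weight, then assign each expert a distinct
    # contiguous band of singular directions so experts start out diverse.
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    rank = S.numel() // num_experts
    experts = []
    for e in range(num_experts):
        band = slice(e * rank, (e + 1) * rank)
        experts.append(U[:, band] @ torch.diag(S[band]) @ Vh[band, :])
    return experts
```

Because the bands are disjoint, the experts sum back to the original weight while individually capturing different directions, unlike naive replication where all experts start identical.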
### Vocab Pruning
The original Aya vocabulary (255K tokens) was pruned to 80K tokens, reducing the model from 722M to 536M parameters (25.7% reduction) with less than 5% increase in fertility across languages.
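Mechanically, vocab pruning keeps a subset of rows in the (tied) embedding matrix and remaps token ids. A minimal sketch with hypothetical names:

```python
import torch

def prune_embeddings(embedding, keep_ids):
    # Keep only the rows for retained token ids; with tied weights the
    # LM head shrinks by the same amount. Returns the pruned matrix and
    # an old-id -> new-id remapping for the tokenizer side.
    keep = torch.tensor(sorted(keep_ids))
    return embedding[keep], {int(old): new for new, old in enumerate(keep)}
```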
## Languages
Aetheris supports 67 languages spanning 13 script families:
**Latin**: English, French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Danish, Swedish, Norwegian, Polish, Czech, Slovak, Croatian, Slovenian, Catalan, Galician, Maltese, Basque, Welsh, Irish, Latvian, Lithuanian, Estonian, Finnish, Hungarian, Turkish, Indonesian, Malay, Tagalog, Javanese, Vietnamese, Swahili, Hausa, Igbo, Yoruba, Somali, Zulu, Xhosa
**Cyrillic**: Russian, Ukrainian, Serbian, Bulgarian
**Arabic**: Arabic, Persian, Urdu
**Devanagari**: Hindi, Marathi, Nepali
**CJK**: Chinese, Japanese, Korean
**Other scripts**: Bengali, Gujarati, Punjabi (Gurmukhi), Tamil, Telugu, Hebrew, Greek, Thai, Khmer, Lao, Burmese, Amharic (Ge'ez)
### Equity Findings
Tokenizer analysis revealed a **4.4x fertility ratio** across languages (p=0.002), with script being the strongest predictor of tokenizer efficiency (p=0.047). Eight high-priority languages were identified for equity monitoring, with the hardest being Amharic (KL=1.80), Burmese (1.64), and Lao (1.56).
Cross-lingual representation similarity of **0.88** indicates strong transfer potential across the language set.
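Fertility here is the usual tokenizer metric: subword tokens produced per word, so a 4.4x ratio means the worst-served language is split into 4.4x more tokens per word than the best-served one. A generic sketch (whitespace words are a rough proxy and undercount for scripts without word boundaries):

```python
def fertility(encode, texts):
    # encode: any callable mapping a string to a list of token ids/pieces.
    tokens = sum(len(encode(t)) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return tokens / words

# Toy illustration with a character-level "tokenizer" (2 tokens per word):
char_encode = lambda t: [c for c in t if not c.isspace()]
fertility(char_encode, ["ab cd"])  # -> 2.0
```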
## Usage
```python
import torch
import sys
from huggingface_hub import snapshot_download
# Download model
local_dir = snapshot_download("wayyresearch/aetheris")
sys.path.insert(0, local_dir)
# Load model
from aetheris.config import AetherisConfig
from aetheris.model import HybridMambaMoE
config = AetherisConfig.from_yaml(f"{local_dir}/config.yaml")
model = HybridMambaMoE(config)
sd = torch.load(
    f"{local_dir}/pytorch_model.pt",
    map_location="cpu",
    weights_only=True,
)
model.load_state_dict(sd)
model.eval()
# Tokenize (uses the Aya tokenizer)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-expanse-8b")
input_ids = tokenizer.encode("Hello, how are you?", return_tensors="pt")
with torch.no_grad():
    output = model(input_ids)
    logits = output["logits"]
# Get next-token prediction
next_token = torch.argmax(logits[:, -1, :], dim=-1)
print(tokenizer.decode(next_token))
```
### Generation Loop
```python
def generate(model, tokenizer, prompt, max_new_tokens=100):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            output = model(generated)
            next_token = torch.argmax(output["logits"][:, -1, :], dim=-1, keepdim=True)
            generated = torch.cat([generated, next_token], dim=-1)
            if next_token.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(generate(model, tokenizer, "The capital of France is"))
```
### Multilingual Example
```python
prompts = [
    "The weather today is",            # English
    "El clima de hoy es",              # Spanish
    "La capitale de la France est",    # French
]
for prompt in prompts:
    print(f"{prompt} -> {generate(model, tokenizer, prompt, max_new_tokens=20)}")
```
## Files in This Repository
| File | Description |
|---|---|
| `pytorch_model.pt` | Model weights (state_dict) |
| `config.yaml` | Model configuration (AetherisConfig) |
| `aetheris/` | Model source code (importable Python package) |
| `student_config.yaml` | Student architecture config used during training |
| `training_config.yaml` | Training hyperparameters |
| `stage1_checkpoint.pt` | Stage 1 (CKA alignment) checkpoint |
| `stage2_best.pt` | Stage 2 (KL distillation) best checkpoint |
## Limitations
- **Stage 3 SFT is in progress.** The current weights reflect Stage 2 distillation. Conversational and instruction-following quality will improve after SFT completes.
- **Not a chat model yet.** The model generates continuations, not structured dialogue. SFT will address this.
- **Low-resource language quality varies.** Languages with non-Latin scripts (Amharic, Burmese, Lao) show higher loss. This is an active area of work.
- **No CUDA-optimized SSM kernels.** The current implementation uses a pure-Python SSM fallback. Inference speed will improve with Mamba CUDA kernels.
- **Evaluation benchmarks pending.** Systematic multilingual benchmarks are planned post-SFT.
## Citation
```bibtex
@misc{aetheris2026,
  title={Aetheris: A Hybrid Mamba-MoE Model for Efficient Multilingual Generation},
  author={Wayy Research},
  year={2026},
  url={https://huggingface.co/wayyresearch/aetheris},
}
```
## Acknowledgments
- [CohereForAI](https://cohere.com/research) for the Aya model family and multilingual datasets
- The [Mamba](https://arxiv.org/abs/2312.00752) authors for state space model foundations
- The open-source multilingual NLP community
---
Built with frustration and determination by [Wayy Research](https://wayyresearch.com), Buffalo NY.
*People for research, research for people.*