---
language:
- en
- fr
- es
- pt
- it
- ro
- de
- nl
- da
- sv
- "no"
- ru
- uk
- pl
- cs
- sk
- hr
- sr
- sl
- bg
- lv
- lt
- el
- et
- fi
- hu
- eu
- cy
- ga
- ar
- fa
- he
- tr
- hi
- ur
- bn
- mr
- gu
- pa
- ne
- ta
- te
- zh
- ja
- ko
- id
- ms
- tl
- jv
- vi
- km
- th
- lo
- my
- am
- ha
- ig
- sw
- yo
- so
- zu
- xh
- ca
- gl
- mt
license: apache-2.0
library_name: pytorch
pipeline_tag: text-generation
tags:
- mamba
- ssm
- state-space-model
- mixture-of-experts
- moe
- multilingual
- distillation
- knowledge-distillation
- aya
- hybrid-architecture
- wayy-research
model-index:
- name: aetheris
  results: []
---

# Aetheris

> A hybrid Mamba-MoE language model distilled from Aya for efficient multilingual generation across 67 languages.

**Aetheris** is a 536M-parameter hybrid architecture that interleaves State Space Model (Mamba) layers with Sparse Mixture-of-Experts (MoE) layers. It was distilled from [CohereLabs/tiny-aya-global](https://huggingface.co/CohereForAI/aya-expanse-8b) (3.35B params) using a 3-stage pipeline: CKA-guided layer alignment, KL-divergence distillation across 67 languages, and supervised fine-tuning on multilingual chat data. The goal: compress a massively multilingual teacher into a model small enough to run on consumer hardware, without abandoning low-resource languages.
| | |
|---|---|
| **Developer** | [Wayy Research](https://wayyresearch.com), Buffalo NY |
| **Parameters** | 536M (pruned) / 722M (full vocab) |
| **Teacher** | CohereLabs/tiny-aya-global (3.35B) |
| **Compression** | ~4.6x (base config) |
| **Languages** | 67 |
| **License** | Apache 2.0 |
| **Demo** | [aetheris-playground](https://huggingface.co/spaces/wayyresearch/aetheris-playground) |

## Architecture

Aetheris uses a hybrid design that alternates between two layer types across 24 total layers:

- **12 SSM (Mamba) layers** (even indices) -- linear-time sequence modeling with selective state spaces
- **12 sparse MoE layers** (odd indices) -- capacity scaling through top-1 routing over 4 experts

This interleaving gives the model both efficient long-range dependency modeling (SSM) and parameter-efficient capacity (MoE).

### Configuration

| Hyperparameter | Value |
|---|---|
| `d_model` | 1024 |
| `d_ff` | 3072 |
| `d_inner` (SSM) | 2048 |
| `n_layer` | 24 (12 SSM + 12 MoE) |
| `ssm_d_state` | 16 |
| `ssm_expand` | 2 |
| `num_experts` | 4 |
| `top_k` (routing) | 1 |
| `vocab_size` | 261,019 (shared Aya tokenizer) |
| `max_seq_len` | 2048 |
| Weight tying | Embedding + LM head shared |

## Training

### 3-Stage Distillation Pipeline

**Stage 1 -- CKA Layer Alignment**

Aligns student hidden representations to teacher layers using Centered Kernel Alignment (CKA). This gives the student a structural initialization before distillation begins.

**Stage 2 -- KL Divergence Distillation**

Full knowledge distillation across 67 languages over 20K training steps. Best validation loss: **2.73**.
Key findings from this stage:

- SSM layers receive ~27x less gradient than MoE layers (gradient imbalance ratio = 0.037)
- A **10x learning-rate boost** for SSM layers resolved the imbalance, reducing KL by 26% and increasing teacher-student agreement 12x
- Optimal temperature: T=2.0 with alpha=0.7 and a cosine schedule

**Stage 3 -- Supervised Fine-Tuning** *(in progress)*

Fine-tuning on multilingual chat data from CohereForAI/aya_collection and aya_evaluation_suite.

| Parameter | Value |
|---|---|
| Data | 16,907 examples, 10 languages (en, es, hi, zh, ar, sw, tr, ja, id, te) |
| Loss masking | Assistant tokens only |
| Learning rate | 2e-5 |
| Batch size | 4 (x4 gradient accumulation) |
| Steps | 5,000 |
| Max sequence length | 512 |

### Expert Initialization

MoE experts were initialized via SVD of the teacher's FFN weights, producing genuinely diverse experts (inter-expert CKA = 0.097) rather than the near-identical copies obtained by naive replication (CKA = 0.88).

### Vocab Pruning

The original Aya vocabulary (255K tokens) was pruned to 80K tokens, reducing the model from 722M to 536M parameters (a 25.7% reduction) with less than a 5% increase in tokenizer fertility across languages.
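The Stage 2 settings above (T=2.0, alpha=0.7) match the standard Hinton-style distillation objective: a temperature-scaled KL term on the teacher's soft distribution mixed with hard-label cross-entropy. A minimal single-token sketch, assuming that exact formulation (the actual training loss may differ in detail):

```python
import math

def softmax(logits, T=1.0):
    # temperature-scaled softmax over a single logit vector
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, target_idx, T=2.0, alpha=0.7):
    """alpha-weighted mix of a temperature-scaled KL term (teacher soft
    targets) and hard-label cross-entropy; the T^2 factor keeps the
    soft-target gradients on the same scale as the T=1 CE term."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    ce = -math.log(softmax(student_logits)[target_idx])
    return alpha * (T ** 2) * kl + (1 - alpha) * ce
```

With alpha=0.7 most of the training signal comes from the teacher's soft distribution rather than the hard labels, which is how the teacher's 67-language coverage is pushed into the student.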
## Languages

Aetheris supports 67 languages spanning 13 script families:

**Latin**: English, French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Danish, Swedish, Norwegian, Polish, Czech, Slovak, Croatian, Slovenian, Catalan, Galician, Maltese, Basque, Welsh, Irish, Latvian, Lithuanian, Estonian, Finnish, Hungarian, Turkish, Indonesian, Malay, Tagalog, Javanese, Vietnamese, Swahili, Hausa, Igbo, Yoruba, Somali, Zulu, Xhosa

**Cyrillic**: Russian, Ukrainian, Serbian, Bulgarian

**Arabic**: Arabic, Persian, Urdu

**Devanagari**: Hindi, Marathi, Nepali

**CJK**: Chinese, Japanese, Korean

**Other scripts**: Bengali, Gujarati, Punjabi (Gurmukhi), Tamil, Telugu, Hebrew, Greek, Thai, Khmer, Lao, Burmese, Amharic (Ge'ez)

### Equity Findings

Tokenizer analysis revealed a **4.4x fertility ratio** between the most and least efficiently tokenized languages (p=0.002), with script family the strongest predictor of tokenizer efficiency (p=0.047). Eight high-priority languages were flagged for equity monitoring; the hardest are Amharic (KL=1.80), Burmese (1.64), and Lao (1.56). A cross-lingual representation similarity of **0.88** indicates strong transfer potential across the language set.
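Fertility here means subword tokens per word, so a 4.4x ratio implies the worst-served language spends roughly 4.4x more tokens per word than the best-served one. A sketch of how such a number can be measured, using a toy chunking tokenizer in place of the real Aya tokenizer (all names here are illustrative):

```python
def fertility(tokenize, texts):
    """Average subword tokens per whitespace-delimited word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# toy "tokenizer": splits every word into 2-character chunks
toy_tokenize = lambda s: [w[i:i + 2] for w in s.split() for i in range(0, len(w), 2)]

print(fertility(toy_tokenize, ["hello world"]))  # 3.0 tokens per word
```

Computing this per language over comparable text samples and dividing the maximum by the minimum yields the kind of cross-language ratio reported above.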
## Usage

```python
import sys

import torch
from huggingface_hub import snapshot_download

# Download the model repository (weights, config, and source code)
local_dir = snapshot_download("wayyresearch/aetheris")
sys.path.insert(0, local_dir)

# Load model
from aetheris.config import AetherisConfig
from aetheris.model import HybridMambaMoE

config = AetherisConfig.from_yaml(f"{local_dir}/config.yaml")
model = HybridMambaMoE(config)
sd = torch.load(
    f"{local_dir}/pytorch_model.pt",
    map_location="cpu",
    weights_only=True,
)
model.load_state_dict(sd)
model.eval()

# Tokenize (uses the Aya tokenizer)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-expanse-8b")
input_ids = tokenizer.encode("Hello, how are you?", return_tensors="pt")

with torch.no_grad():
    output = model(input_ids)
    logits = output["logits"]

# Greedy next-token prediction
next_token = torch.argmax(logits[:, -1, :], dim=-1)
print(tokenizer.decode(next_token))
```

### Generation Loop

```python
def generate(model, tokenizer, prompt, max_new_tokens=100):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            output = model(generated)
            next_token = torch.argmax(
                output["logits"][:, -1, :], dim=-1, keepdim=True
            )
            generated = torch.cat([generated, next_token], dim=-1)
            if next_token.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(generate(model, tokenizer, "The capital of France is"))
```

### Multilingual Example

```python
prompts = [
    "The weather today is",          # English
    "El clima de hoy es",            # Spanish
    "La capitale de la France est",  # French
]
for prompt in prompts:
    print(f"{prompt} -> {generate(model, tokenizer, prompt, max_new_tokens=20)}")
```

## Files in This Repository

| File | Description |
|---|---|
| `pytorch_model.pt` | Model weights (state_dict) |
| `config.yaml` | Model configuration (AetherisConfig) |
| `aetheris/` | Model source code (importable Python package) |
| `student_config.yaml` | Student architecture config used during training |
| `training_config.yaml` | Training hyperparameters |
| `stage1_checkpoint.pt` | Stage 1 (CKA alignment) checkpoint |
| `stage2_best.pt` | Stage 2 (KL distillation) best checkpoint |

## Limitations

- **Stage 3 SFT is in progress.** The current weights reflect Stage 2 distillation. Conversational and instruction-following quality will improve after SFT completes.
- **Not a chat model yet.** The model generates continuations, not structured dialogue. SFT will address this.
- **Low-resource language quality varies.** Languages with non-Latin scripts (Amharic, Burmese, Lao) show higher loss. This is an active area of work.
- **No CUDA-optimized SSM kernels.** The current implementation uses a pure-Python SSM fallback; inference speed will improve with the Mamba CUDA kernels.
- **Evaluation benchmarks pending.** Systematic multilingual benchmarks are planned post-SFT.

## Citation

```bibtex
@misc{aetheris2026,
  title={Aetheris: A Hybrid Mamba-MoE Model for Efficient Multilingual Generation},
  author={Wayy Research},
  year={2026},
  url={https://huggingface.co/wayyresearch/aetheris},
}
```

## Acknowledgments

- [CohereForAI](https://cohere.com/research) for the Aya model family and multilingual datasets
- The [Mamba](https://arxiv.org/abs/2312.00752) authors for state space model foundations
- The open-source multilingual NLP community

---

Built with frustration and determination by [Wayy Research](https://wayyresearch.com), Buffalo NY. *People for research, research for people.*