# Aetheris β Hybrid Mamba-MoE Multilingual Model
Aetheris is a ~800M-parameter hybrid SSM/MoE language model distilled from CohereLabs/tiny-aya-global (3.35B parameters), built by Wayy Research.
## Architecture
- Type: Hybrid Mamba (SSM) + Mixture of Experts (MoE)
- Layers: 24, interleaved (even-indexed layers are SSM, odd-indexed are MoE)
- Hidden dim: 1024
- Experts: 4 per MoE layer, top-1 routing
- SSM state dim: 16
- Vocab size: 256,000 (shared with tiny-aya-global)
- Parameters: ~800M
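The layer interleaving and top-1 routing described above can be sketched in plain Python. This is a minimal illustration only; the function names (`layer_schedule`, `route_top1`) are hypothetical and not taken from the Aetheris codebase:

```python
import numpy as np

def layer_schedule(n_layers=24):
    """Alternate block types: even-indexed layers are SSM, odd are MoE."""
    return ["SSM" if i % 2 == 0 else "MoE" for i in range(n_layers)]

def route_top1(router_logits):
    """Top-1 routing: each token is sent to its single highest-scoring expert.

    router_logits: (tokens, n_experts) array of gate scores.
    Returns (chosen expert index, softmax gate weight) per token.
    """
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    expert = probs.argmax(axis=-1)                  # chosen expert per token
    weight = probs[np.arange(len(expert)), expert]  # its gate weight
    return expert, weight

schedule = layer_schedule()
print(schedule[:4])  # ['SSM', 'MoE', 'SSM', 'MoE']
expert, weight = route_top1(np.array([[2.0, 0.1, -1.0, 0.5]]))
print(expert[0])     # 0
```

With top-1 routing, only one of the 4 experts runs per token, which keeps the active parameter count well below the full ~800M.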
## Training
3-stage MambaInLlama distillation pipeline:
| Stage | Method | Data | Steps |
|---|---|---|---|
| 1 | CKA-guided Layer Alignment | ClimbMix | 10,000 |
| 2 | KL Distillation (T=2.0, alpha=0.7) | ClimbMix | 20,000 |
| 3 | Supervised Fine-Tuning | aya_collection | 5,000 |
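Stage 2's objective can be sketched as standard temperature-scaled KL distillation blended with hard-label cross-entropy at the listed T=2.0 and alpha=0.7. This is an illustrative sketch, not the actual training code; `distillation_loss` is a hypothetical name:

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.7):
    """alpha-weighted mix of temperature-scaled KL(teacher || student)
    and cross-entropy on the hard labels."""
    s = softmax(student_logits / T)
    t = softmax(teacher_logits / T)
    # T^2 scaling keeps soft-target gradient magnitudes comparable across temperatures
    kl = (t * (np.log(t) - np.log(s))).sum(axis=-1).mean() * T * T
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels]).mean()
    return alpha * kl + (1 - alpha) * ce
```

With alpha=0.7, the soft teacher distribution dominates the gradient signal while the hard labels still anchor the student to the ground truth.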
Key research findings applied:
- 10× learning-rate boost for SSM parameters (compensates for a 27× gradient imbalance)
- SVD-based weight split for MoE expert initialization (expert diversity: CKA = 0.097)
- Per-language KL tracking to monitor multilingual equity
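The SVD-based expert split can be sketched with NumPy. This is an illustrative reconstruction under assumed semantics (not the actual Aetheris initializer): a dense FFN weight is factored via SVD and its singular components are partitioned across experts, so each expert starts with a distinct low-rank slice of the original transformation (hence low pairwise CKA):

```python
import numpy as np

def svd_split_experts(W, n_experts=4):
    """Split a dense weight matrix into n_experts low-rank expert weights
    by partitioning its singular-value spectrum (hypothetical sketch)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    chunks = np.array_split(np.arange(len(S)), n_experts)
    experts = []
    for idx in chunks:
        # Each expert reconstructs only its slice of singular components;
        # the slices are disjoint, so the experts sum back to W exactly.
        experts.append((U[:, idx] * S[idx]) @ Vt[idx, :])
    return experts

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
experts = svd_split_experts(W)
print(np.allclose(sum(experts), W))  # True
```

Because the singular slices are disjoint, the experts begin maximally dissimilar while jointly preserving the teacher layer's function.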
## Current Checkpoint
- Stage: 2 (KL distillation)
- Step: 18,000
- Loss: 3.4199
- Updated: 2026-03-13T01:45:14 UTC
## Languages
Supports 70+ languages inherited from tiny-aya-global. Core evaluation languages: English, Spanish, Hindi, Chinese, Arabic, Swahili, Turkish, Japanese, Indonesian, Telugu.
## Citation
```bibtex
@misc{aetheris2026,
  title={Aetheris: Hybrid Mamba-MoE Multilingual Model via Knowledge Distillation},
  author={Wayy Research},
  year={2026},
  url={https://huggingface.co/wayyresearch/aetheris}
}
```