---
language:
- en
- fr
- es
- pt
- it
- ro
- de
- nl
- da
- sv
- "no"
- ru
- uk
- pl
- cs
- sk
- hr
- sr
- sl
- bg
- lv
- lt
- el
- et
- fi
- hu
- eu
- cy
- ga
- ar
- fa
- he
- tr
- hi
- ur
- bn
- mr
- gu
- pa
- ne
- ta
- te
- zh
- ja
- ko
- id
- ms
- tl
- jv
- vi
- km
- th
- lo
- my
- am
- ha
- ig
- sw
- yo
- so
- zu
- xh
- ca
- gl
- mt
license: apache-2.0
library_name: pytorch
pipeline_tag: text-generation
tags:
- mamba
- ssm
- state-space-model
- mixture-of-experts
- moe
- multilingual
- distillation
- knowledge-distillation
- aya
- hybrid-architecture
- wayy-research
model-index:
- name: aetheris
results: []
---
# Aetheris
> A hybrid Mamba-MoE language model distilled from Aya for efficient multilingual generation across 67 languages.
**Aetheris** is a 536M-parameter hybrid architecture that interleaves State Space Model (Mamba) layers with Sparse Mixture-of-Experts (MoE) layers. It was distilled from [CohereLabs/tiny-aya-global](https://huggingface.co/CohereForAI/aya-expanse-8b) (3.35B params) using a 3-stage pipeline: CKA-guided alignment, KL divergence distillation across 67 languages, and supervised fine-tuning on multilingual chat data.
The goal: compress a massively multilingual teacher into a model small enough to run on consumer hardware, without abandoning low-resource languages.
| | |
|---|---|
| **Developer** | [Wayy Research](https://wayyresearch.com), Buffalo NY |
| **Parameters** | 536M (pruned) / 722M (full vocab) |
| **Teacher** | CohereLabs/tiny-aya-global (3.35B) |
| **Compression** | ~4.6x (base config) |
| **Languages** | 67 |
| **License** | Apache 2.0 |
| **Demo** | [aetheris-playground](https://huggingface.co/spaces/wayyresearch/aetheris-playground) |
## Architecture
Aetheris uses a hybrid design that alternates between two layer types across 24 total layers:
- **12 SSM (Mamba) layers** (even indices) -- linear-time sequence modeling with selective state spaces
- **12 Sparse MoE layers** (odd indices) -- capacity scaling through top-1 routing over 4 experts
This interleaving gives the model both efficient long-range dependency modeling (SSM) and parameter-efficient capacity (MoE).
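The layer plan can be sketched in a few lines (illustrative only; the actual implementation lives in the `aetheris` package in this repo):

```python
def layer_plan(n_layer=24):
    # Even indices are SSM (Mamba) blocks, odd indices are sparse MoE
    # blocks, matching the 12 + 12 split described above.
    return ["ssm" if i % 2 == 0 else "moe" for i in range(n_layer)]

plan = layer_plan()
# plan[:4] == ["ssm", "moe", "ssm", "moe"]; 12 of each across 24 layers
```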
### Configuration
| Hyperparameter | Value |
|---|---|
| `d_model` | 1024 |
| `d_ff` | 3072 |
| `d_inner` (SSM) | 2048 |
| `n_layer` | 24 (12 SSM + 12 MoE) |
| `ssm_d_state` | 16 |
| `ssm_expand` | 2 |
| `num_experts` | 4 |
| `top_k` (routing) | 1 |
| `vocab_size` | 261,019 (shared Aya tokenizer) |
| `max_seq_len` | 2048 |
| Weight tying | Embedding + LM head shared |
## Training
### 3-Stage Distillation Pipeline
**Stage 1 -- CKA Layer Alignment**
Aligns student hidden representations to teacher layers using Centered Kernel Alignment. This gives the student a structural initialization before distillation begins.
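For reference, linear CKA between two sets of hidden states can be computed as below. This is a generic sketch of the metric itself, not the project's alignment code:

```python
import numpy as np

def linear_cka(X, Y):
    # X, Y: (n_samples, n_features) hidden states from two layers.
    # Center along the batch dimension, then compare similarity structure.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
H = rng.normal(size=(64, 32))
linear_cka(H, H)  # a representation is perfectly aligned with itself: 1.0
```

CKA is invariant to orthogonal transformations and isotropic scaling, which makes it a convenient target for matching student layers to teacher layers of different widths.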
**Stage 2 -- KL Divergence Distillation**
Full knowledge distillation across 67 languages. 20K training steps. Best validation loss: **2.73**.
Key findings from this stage:
- SSM layers receive ~27x less gradient than MoE layers (gradient imbalance ratio = 0.037)
- A **10x learning rate boost** for SSM layers resolved this, reducing KL by 26% and increasing teacher-student agreement by 12x
- Optimal temperature: T=2.0 with alpha=0.7 and cosine schedule
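The Stage 2 objective follows standard temperature-scaled distillation. Below is a minimal sketch using the reported settings (T=2.0, alpha=0.7); function and variable names are illustrative, not the project's actual training code:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.7):
    # Soft-target term: KL divergence between temperature-scaled
    # distributions, rescaled by T^2 as in standard knowledge distillation.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy on the ground-truth tokens.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1.0 - alpha) * ce
```

The 10x SSM learning-rate boost maps naturally onto optimizer parameter groups: one group holding the SSM-layer parameters at 10x the base rate, and one group for everything else.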
**Stage 3 -- Supervised Fine-Tuning** *(in progress)*
Fine-tuning on multilingual chat data from CohereForAI/aya_collection and aya_evaluation_suite.
| Parameter | Value |
|---|---|
| Data | 16,907 examples, 10 languages (en, es, hi, zh, ar, sw, tr, ja, id, te) |
| Loss masking | Assistant tokens only |
| Learning rate | 2e-5 |
| Batch size | 4 (x4 gradient accumulation) |
| Steps | 5,000 |
| Max sequence length | 512 |
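"Assistant tokens only" means labels for prompt and system tokens are replaced with the ignore index so they contribute nothing to the loss. A minimal sketch (the actual preprocessing code may differ):

```python
import torch

IGNORE_INDEX = -100  # ignored by PyTorch's cross_entropy by default

def mask_labels(input_ids, assistant_mask):
    # Copy the token ids, then blank out every position that is not part
    # of the assistant's turn so only assistant tokens are supervised.
    labels = input_ids.clone()
    labels[~assistant_mask] = IGNORE_INDEX
    return labels
```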
### Expert Initialization
MoE experts were initialized using SVD decomposition of teacher FFN weights, producing genuinely diverse experts (inter-expert CKA = 0.097) rather than near-identical copies (CKA = 0.88 for naive replication).
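One way to read "SVD decomposition of teacher FFN weights" is to give each expert its own band of singular directions; the sketch below shows that idea, though the exact slicing scheme Aetheris uses is not spelled out in this card:

```python
import torch

def svd_expert_slices(W, num_experts=4):
    # Factor the teacher FFN weight, then assign each expert a distinct
    # contiguous band of singular directions so experts start out diverse.
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    rank = S.numel() // num_experts
    experts = []
    for e in range(num_experts):
        band = slice(e * rank, (e + 1) * rank)
        experts.append(U[:, band] @ torch.diag(S[band]) @ Vh[band, :])
    return experts
```

Because the bands are disjoint, the experts sum back to the original weight while individually capturing different directions, unlike naive replication where all experts start identical.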
### Vocab Pruning
The original Aya vocabulary (255K tokens) was pruned to 80K tokens, reducing the model from 722M to 536M parameters (25.7% reduction) with less than 5% increase in fertility across languages.
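Mechanically, vocab pruning keeps a subset of rows in the (tied) embedding matrix and remaps token ids. A minimal sketch with hypothetical names:

```python
import torch

def prune_embeddings(embedding, keep_ids):
    # Keep only the rows for retained token ids; with tied weights the
    # LM head shrinks by the same amount. Returns the pruned matrix and
    # an old-id -> new-id remapping for the tokenizer side.
    keep = torch.tensor(sorted(keep_ids))
    return embedding[keep], {int(old): new for new, old in enumerate(keep)}
```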
## Languages
Aetheris supports 67 languages spanning 13 script families:
**Latin**: English, French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Danish, Swedish, Norwegian, Polish, Czech, Slovak, Croatian, Slovenian, Catalan, Galician, Maltese, Basque, Welsh, Irish, Latvian, Lithuanian, Estonian, Finnish, Hungarian, Turkish, Indonesian, Malay, Tagalog, Javanese, Vietnamese, Swahili, Hausa, Igbo, Yoruba, Somali, Zulu, Xhosa
**Cyrillic**: Russian, Ukrainian, Serbian, Bulgarian
**Arabic**: Arabic, Persian, Urdu
**Devanagari**: Hindi, Marathi, Nepali
**CJK**: Chinese, Japanese, Korean
**Other scripts**: Bengali, Gujarati, Punjabi (Gurmukhi), Tamil, Telugu, Hebrew, Greek, Thai, Khmer, Lao, Burmese, Amharic (Ge'ez)
### Equity Findings
Tokenizer analysis revealed a **4.4x fertility ratio** across languages (p=0.002), with script being the strongest predictor of tokenizer efficiency (p=0.047). Eight high-priority languages were identified for equity monitoring, with the hardest being Amharic (KL=1.80), Burmese (1.64), and Lao (1.56).
Cross-lingual representation similarity of **0.88** indicates strong transfer potential across the language set.
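Fertility here is the usual tokenizer metric: subword tokens produced per word, so a 4.4x ratio means the worst-served language is split into 4.4x more tokens per word than the best-served one. A generic sketch (whitespace words are a rough proxy and undercount for scripts without word boundaries):

```python
def fertility(encode, texts):
    # encode: any callable mapping a string to a list of token ids/pieces.
    tokens = sum(len(encode(t)) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return tokens / words

# Toy illustration with a character-level "tokenizer" (2 tokens per word):
char_encode = lambda t: [c for c in t if not c.isspace()]
fertility(char_encode, ["ab cd"])  # -> 2.0
```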
## Usage
```python
import torch
import sys
from huggingface_hub import snapshot_download
# Download model
local_dir = snapshot_download("wayyresearch/aetheris")
sys.path.insert(0, local_dir)
# Load model
from aetheris.config import AetherisConfig
from aetheris.model import HybridMambaMoE
config = AetherisConfig.from_yaml(f"{local_dir}/config.yaml")
model = HybridMambaMoE(config)
sd = torch.load(
    f"{local_dir}/pytorch_model.pt",
    map_location="cpu",
    weights_only=True,
)
model.load_state_dict(sd)
model.eval()
# Tokenize (uses the Aya tokenizer)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-expanse-8b")
input_ids = tokenizer.encode("Hello, how are you?", return_tensors="pt")
with torch.no_grad():
    output = model(input_ids)
    logits = output["logits"]
# Get next-token prediction
next_token = torch.argmax(logits[:, -1, :], dim=-1)
print(tokenizer.decode(next_token))
```
### Generation Loop
```python
def generate(model, tokenizer, prompt, max_new_tokens=100):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            output = model(generated)
            next_token = torch.argmax(output["logits"][:, -1, :], dim=-1, keepdim=True)
            generated = torch.cat([generated, next_token], dim=-1)
            if next_token.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(generate(model, tokenizer, "The capital of France is"))
```
### Multilingual Example
```python
prompts = [
    "The weather today is",            # English
    "El clima de hoy es",              # Spanish
    "La capitale de la France est",    # French
]
for prompt in prompts:
    print(f"{prompt} -> {generate(model, tokenizer, prompt, max_new_tokens=20)}")
```
## Files in This Repository
| File | Description |
|---|---|
| `pytorch_model.pt` | Model weights (state_dict) |
| `config.yaml` | Model configuration (AetherisConfig) |
| `aetheris/` | Model source code (importable Python package) |
| `student_config.yaml` | Student architecture config used during training |
| `training_config.yaml` | Training hyperparameters |
| `stage1_checkpoint.pt` | Stage 1 (CKA alignment) checkpoint |
| `stage2_best.pt` | Stage 2 (KL distillation) best checkpoint |
## Limitations
- **Stage 3 SFT is in progress.** The current weights reflect Stage 2 distillation. Conversational and instruction-following quality will improve after SFT completes.
- **Not a chat model yet.** The model generates continuations, not structured dialogue. SFT will address this.
- **Low-resource language quality varies.** Languages with non-Latin scripts (Amharic, Burmese, Lao) show higher loss. This is an active area of work.
- **No CUDA-optimized SSM kernels.** The current implementation uses a pure-Python SSM fallback. Inference speed will improve with Mamba CUDA kernels.
- **Evaluation benchmarks pending.** Systematic multilingual benchmarks are planned post-SFT.
## Citation
```bibtex
@misc{aetheris2026,
  title={Aetheris: A Hybrid Mamba-MoE Model for Efficient Multilingual Generation},
  author={Wayy Research},
  year={2026},
  url={https://huggingface.co/wayyresearch/aetheris},
}
```
## Acknowledgments
- [CohereForAI](https://cohere.com/research) for the Aya model family and multilingual datasets
- The [Mamba](https://arxiv.org/abs/2312.00752) authors for state space model foundations
- The open-source multilingual NLP community
---
Built with frustration and determination by [Wayy Research](https://wayyresearch.com), Buffalo NY.
*People for research, research for people.*