# Avey-D MoE 1B
An attention-free causal language model with Mixture-of-Experts, trained from scratch.
## Model Details
| Property | Value |
|---|---|
| Architecture | Avey-D MoE (attention-free, interleaved Static/Dynamic layers) |
| Total Parameters | 1.01B |
| Active Parameters | 205M (per token) |
| Hidden Dimension | 640 |
| Layers | 20 (10 Static MoE + 10 Dynamic dense) |
| Experts | 32 routed + 1 shared, top-4 gating |
| Context Length | 2048 (chunk_size=256, k=3 retrieved chunks) |
| Vocabulary | 50,368 tokens |
| Training Data | FineWeb 10BT sample (~1.3B tokens seen) |
| Training Hardware | 1x AMD Instinct MI300X |
| Training Time | ~4.4 hours |
| Final Train Loss | 4.17 |
| Best Val Loss | 4.23 |
| MFU | 43.6% |
| Throughput | ~86,500 tok/s |
## Architecture
Avey-D replaces self-attention with two types of interleaved layers:
- **Static Layers (MoE):** learned causal spatial projection for token mixing, plus an MoE Enricher/Fuser (32 routed experts, top-4 gating, 1 shared expert)
- **Dynamic Layers (Dense):** cosine-similarity token mixing, plus a dense Enricher/Fuser
- **CausalRanker:** neural compression that retrieves the k=3 most relevant preceding chunks
Expert dispatch uses batched `torch.bmm` via `StackedExperts` for efficient GPU utilization.
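The routing and dispatch described above can be sketched as follows. This is an illustrative re-implementation, not the repo's `StackedExperts` code: for clarity every token visits every expert via two `torch.bmm` calls over the stacked expert weights (a production kernel would dispatch only the routed tokens), and the weight shapes, ReLU activation, and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router_w, expert_w1, expert_w2, shared_w1, shared_w2, top_k=4):
    """Top-4 gated MoE layer with one shared expert (dense illustrative compute).

    x:         (tokens, d)          input activations
    router_w:  (num_experts, d)     router projection
    expert_w1: (num_experts, d, f)  stacked expert up-projections
    expert_w2: (num_experts, f, d)  stacked expert down-projections
    shared_w1: (d, f), shared_w2: (f, d)  always-on shared expert
    """
    tokens, d = x.shape
    num_experts = router_w.shape[0]

    # Route: pick top-k experts per token, normalize their gate weights.
    logits = x @ router_w.t()                        # (tokens, num_experts)
    gate_vals, gate_idx = torch.topk(logits, top_k, dim=-1)
    gate_vals = F.softmax(gate_vals, dim=-1)         # sums to 1 over chosen experts

    # Batched expert compute via bmm over the stacked expert weights.
    xb = x.unsqueeze(0).expand(num_experts, tokens, d)
    h = torch.bmm(xb, expert_w1).relu()              # (num_experts, tokens, f)
    y = torch.bmm(h, expert_w2)                      # (num_experts, tokens, d)

    # Scatter the sparse gates into a dense (tokens, num_experts) matrix
    # and mix each token's top-k expert outputs.
    gates = torch.zeros(tokens, num_experts)
    gates.scatter_(1, gate_idx, gate_vals)
    routed = torch.einsum("etd,te->td", y, gates)

    # The shared expert is applied to every token unconditionally.
    shared = (x @ shared_w1).relu() @ shared_w2
    return routed + shared
```

Stacking expert weights into a single `(num_experts, d, f)` tensor is what lets one `bmm` replace a Python loop over experts, which is the efficiency point the paragraph above makes.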
## Usage
```python
import torch
from transformers import AutoTokenizer

# Load model (requires the avey_d module)
from avey_d.modeling_moe import AveyDecoderMoEForCausalLM

model = AveyDecoderMoEForCausalLM.from_pretrained("yashmarathe/avey-d-moe-1b")
tokenizer = AutoTokenizer.from_pretrained("yashmarathe/avey-d-moe-1b")
model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

prompt = "The capital of France is"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        do_sample=True,  # required for temperature/top_k to take effect
        temperature=0.8,
        top_k=50,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Training Configuration
```yaml
model:
  d_embed: 640
  n_layers: 20
  num_experts: 32
  top_k: 4
  shared_expert: true
training:
  steps: 5000
  batch_size: 262144  # tokens per step
  seq_length: 2048
optimizer:
  name: AdamW
  max_lr: 3.0e-4
  schedule: cosine
  warmup_steps: 500
dataset: HuggingFaceFW/fineweb (sample-10BT)
```
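The learning-rate schedule above (linear warmup for 500 steps, then cosine decay over the remaining 4,500) can be written out as a small function. The final learning rate of 0 and the exact warmup shape are assumptions; the config only specifies `max_lr`, the warmup length, and a cosine schedule.

```python
import math

def lr_at(step, max_lr=3e-4, warmup=500, total=5000, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay.

    NOTE: min_lr=0.0 and the linear warmup ramp are assumed, not taken
    from the training config.
    """
    if step < warmup:
        # Ramp linearly from max_lr/warmup up to max_lr.
        return max_lr * (step + 1) / warmup
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At step 500 this returns the peak `3e-4`, and by step 5000 it has decayed to `min_lr`; 5,000 steps at 262,144 tokens each is where the ~1.3B tokens seen in the table comes from.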
## Limitations
This is a small-scale experimental model trained on limited data (~1.3B tokens). It is not intended for production use. The model may generate incoherent, incorrect, or biased text.
## Citation
```bibtex
@inproceedings{2026aveyb,
  title     = {Avey-B},
  author    = {Acharya, Devang and Hammoud, Mohammad},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026}
}
```