# Avey-D MoE 1B
An attention-free causal language model with Mixture-of-Experts, trained from scratch.
## Model Details
| Property | Value |
|---|---|
| Architecture | Avey-D MoE (attention-free, interleaved Static/Dynamic layers) |
| Total Parameters | 1.01B |
| Active Parameters | 205M (per token) |
| Hidden Dimension | 640 |
| Layers | 20 (10 Static MoE + 10 Dynamic dense) |
| Experts | 32 routed + 1 shared, top-4 gating |
| Context Length | 2048 (chunk_size=256, k=3 retrieved chunks) |
| Vocabulary | 50,368 tokens |
| Training Data | FineWeb 10BT sample (~1.3B tokens seen) |
| Training Hardware | 1x AMD Instinct MI300X |
| Training Time | ~4.4 hours |
| Final Train Loss | 4.17 |
| Best Val Loss | 4.23 |
| MFU | 43.6% |
| Throughput | ~86,500 tok/s |
## Architecture
Avey-D replaces self-attention with two types of interleaved layers:
- **Static Layers (MoE):** learned causal spatial projection for token mixing, plus an MoE Enricher/Fuser (32 routed experts, top-4 gating, 1 shared expert)
- **Dynamic Layers (Dense):** cosine-similarity token mixing, plus a dense Enricher/Fuser
- **CausalRanker:** neural compression that retrieves the k=3 most relevant preceding chunks
Expert dispatch uses batched `torch.bmm` via `StackedExperts` for efficient GPU utilization.
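The routing and dispatch described above can be sketched as follows. This is an illustrative re-implementation, not the repo's `StackedExperts` code: for clarity every token visits every expert via two `torch.bmm` calls over the stacked expert weights (a production kernel would dispatch only the routed tokens), and the weight shapes, ReLU activation, and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router_w, expert_w1, expert_w2, shared_w1, shared_w2, top_k=4):
    """Top-4 gated MoE layer with one shared expert (dense illustrative compute).

    x:         (tokens, d)          input activations
    router_w:  (num_experts, d)     router projection
    expert_w1: (num_experts, d, f)  stacked expert up-projections
    expert_w2: (num_experts, f, d)  stacked expert down-projections
    shared_w1: (d, f), shared_w2: (f, d)  always-on shared expert
    """
    tokens, d = x.shape
    num_experts = router_w.shape[0]

    # Route: pick top-k experts per token, normalize their gate weights.
    logits = x @ router_w.t()                        # (tokens, num_experts)
    gate_vals, gate_idx = torch.topk(logits, top_k, dim=-1)
    gate_vals = F.softmax(gate_vals, dim=-1)         # sums to 1 over chosen experts

    # Batched expert compute via bmm over the stacked expert weights.
    xb = x.unsqueeze(0).expand(num_experts, tokens, d)
    h = torch.bmm(xb, expert_w1).relu()              # (num_experts, tokens, f)
    y = torch.bmm(h, expert_w2)                      # (num_experts, tokens, d)

    # Scatter the sparse gates into a dense (tokens, num_experts) matrix
    # and mix each token's top-k expert outputs.
    gates = torch.zeros(tokens, num_experts)
    gates.scatter_(1, gate_idx, gate_vals)
    routed = torch.einsum("etd,te->td", y, gates)

    # The shared expert is applied to every token unconditionally.
    shared = (x @ shared_w1).relu() @ shared_w2
    return routed + shared
```

Stacking expert weights into a single `(num_experts, d, f)` tensor is what lets one `bmm` replace a Python loop over experts, which is the efficiency point the paragraph above makes.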
## Usage
```python
import torch
from transformers import AutoTokenizer

# Load model (requires the avey_d module)
from avey_d.modeling_moe import AveyDecoderMoEForCausalLM

model = AveyDecoderMoEForCausalLM.from_pretrained("yashmarathe/avey-d-moe-1b")
tokenizer = AutoTokenizer.from_pretrained("yashmarathe/avey-d-moe-1b")
model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

prompt = "The capital of France is"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        do_sample=True,  # required for temperature/top_k to take effect
        temperature=0.8,
        top_k=50,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Training Configuration
```yaml
model:
  d_embed: 640
  n_layers: 20
  num_experts: 32
  top_k: 4
  shared_expert: true
training:
  steps: 5000
  batch_size: 262144  # tokens per step
  seq_length: 2048
optimizer:
  name: AdamW
  max_lr: 3.0e-4
  schedule: cosine
  warmup_steps: 500
dataset: HuggingFaceFW/fineweb (sample-10BT)
```
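The learning-rate schedule above (linear warmup for 500 steps, then cosine decay over the remaining 4,500) can be written out as a small function. The final learning rate of 0 and the exact warmup shape are assumptions; the config only specifies `max_lr`, the warmup length, and a cosine schedule.

```python
import math

def lr_at(step, max_lr=3e-4, warmup=500, total=5000, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay.

    NOTE: min_lr=0.0 and the linear warmup ramp are assumed, not taken
    from the training config.
    """
    if step < warmup:
        # Ramp linearly from max_lr/warmup up to max_lr.
        return max_lr * (step + 1) / warmup
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At step 500 this returns the peak `3e-4`, and by step 5000 it has decayed to `min_lr`; 5,000 steps at 262,144 tokens each is where the ~1.3B tokens seen in the table comes from.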
## Limitations
This is a small-scale experimental model trained on limited data (~1.3B tokens). It is not intended for production use. The model may generate incoherent, incorrect, or biased text.
## Citation
```bibtex
@inproceedings{2026aveyb,
  title     = {Avey-B},
  author    = {Acharya, Devang and Hammoud, Mohammad},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026}
}
```