Avey-D MoE 1B

An attention-free causal language model with Mixture-of-Experts, trained from scratch.

Model Details

| Property | Value |
|----------|-------|
| Architecture | Avey-D MoE (attention-free, interleaved Static/Dynamic layers) |
| Total Parameters | 1.01B |
| Active Parameters | 205M (per token) |
| Hidden Dimension | 640 |
| Layers | 20 (10 Static MoE + 10 Dynamic dense) |
| Experts | 32 routed + 1 shared, top-4 gating |
| Context Length | 2048 (chunk_size=256, k=3 retrieved chunks) |
| Vocabulary | 50,368 tokens |
| Training Data | FineWeb 10BT sample (~1.3B tokens seen) |
| Training Hardware | 1x AMD Instinct MI300X |
| Training Time | ~4.4 hours |
| Final Train Loss | 4.17 |
| Best Val Loss | 4.23 |
| MFU | 43.6% |
| Throughput | ~86,500 tok/s |

Architecture

Avey-D replaces self-attention with two types of interleaved layers:

  • Static Layers (MoE): Learned causal spatial projection for token mixing + MoE Enricher/Fuser (32 routed experts, top-4 gating, 1 shared expert)
  • Dynamic Layers (Dense): Cosine-similarity token mixing + dense Enricher/Fuser
  • CausalRanker: Neural compression retrieving k=3 most recent preceding chunks

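As a rough illustration of the Dynamic layer's token mixing, the sketch below weights each token's predecessors by cosine similarity under a causal mask. This is a minimal, assumed formulation (the function name, softmax normalization, and self-inclusion are illustrative choices, not the repository's actual implementation):

```python
import torch
import torch.nn.functional as F

def cosine_token_mixing(x: torch.Tensor) -> torch.Tensor:
    """Mix each token with itself and its predecessors, weighted by
    cosine similarity. Illustrative sketch only.

    x: (batch, seq_len, d) token representations.
    """
    b, t, d = x.shape
    xn = F.normalize(x, dim=-1)                    # unit-norm embeddings
    sim = xn @ xn.transpose(-2, -1)                # (b, t, t) cosine similarities
    causal = torch.tril(torch.ones(t, t, dtype=torch.bool, device=x.device))
    sim = sim.masked_fill(~causal, float("-inf"))  # block access to future tokens
    weights = torch.softmax(sim, dim=-1)           # row-normalized mixing weights
    return weights @ x                             # (b, t, d) mixed tokens
```

Because the mask is strictly causal, the first token can only attend to itself, so its output equals its input.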
Expert dispatch uses batched torch.bmm via StackedExperts for efficient GPU utilization.
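A hedged sketch of that dispatch pattern is shown below: expert weights are stacked into one `(E, d_in, d_out)` tensor so a single `torch.bmm` covers all experts, with top-4 gating and an always-active shared expert as the card describes. The class name, router, and the dense compute-then-select step are simplifications for clarity (a real dispatch would gather only the routed tokens per expert), not the repository's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedExpertsSketch(nn.Module):
    """Illustrative batched-expert dispatch using a single torch.bmm."""

    def __init__(self, n_experts: int, d_in: int, d_out: int, top_k: int = 4):
        super().__init__()
        self.top_k = top_k
        # All expert weight matrices stacked into one tensor: (E, d_in, d_out)
        self.w = nn.Parameter(torch.randn(n_experts, d_in, d_out) * 0.02)
        self.router = nn.Linear(d_in, n_experts)
        self.shared = nn.Linear(d_in, d_out)  # shared expert, always active

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens, d = x.shape
        logits = self.router(x)                             # (T, E)
        gate, idx = torch.topk(logits, self.top_k, dim=-1)  # (T, k)
        gate = F.softmax(gate, dim=-1)                      # normalize top-k gates
        # One bmm over all experts: (E, T, d_in) @ (E, d_in, d_out) -> (E, T, d_out).
        # Dense for clarity; a real dispatch gathers only each expert's tokens.
        all_out = torch.bmm(x.unsqueeze(0).expand(self.w.size(0), -1, -1), self.w)
        out = self.shared(x)
        for k in range(self.top_k):
            out = out + gate[:, k, None] * all_out[idx[:, k], torch.arange(tokens)]
        return out
```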

Usage

```python
import torch
from transformers import AutoTokenizer

# Load model (requires the avey_d module)
from avey_d.modeling_moe import AveyDecoderMoEForCausalLM

model = AveyDecoderMoEForCausalLM.from_pretrained("yashmarathe/avey-d-moe-1b")
tokenizer = AutoTokenizer.from_pretrained("yashmarathe/avey-d-moe-1b")

model.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

prompt = "The capital of France is"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        do_sample=True,  # required for temperature/top_k to take effect
        temperature=0.8,
        top_k=50,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Training Configuration

  • Model: d_embed=640, n_layers=20, num_experts=32, top_k=4, shared_expert=true
  • Training: 5,000 steps, 262,144 tokens/step, seq_length=2048
  • Optimizer: AdamW, max_lr=3e-4, cosine schedule, warmup=500 steps
  • Dataset: HuggingFaceFW/fineweb, sample-10BT
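The schedule above (linear warmup to max_lr=3e-4 over 500 steps, then cosine decay across the remaining steps) can be sketched as follows. Note the minimum learning rate is an assumed floor, not stated in this card:

```python
import math

MAX_LR, MIN_LR = 3e-4, 3e-5   # MIN_LR is an assumed floor (not in the card)
WARMUP, TOTAL = 500, 5000

def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR."""
    if step < WARMUP:
        return MAX_LR * (step + 1) / WARMUP          # linear ramp-up
    progress = (step - WARMUP) / (TOTAL - WARMUP)    # 0 -> 1 over decay phase
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```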

Limitations

This is a small-scale experimental model trained on limited data (~1.3B tokens). It is not intended for production use. The model may generate incoherent, incorrect, or biased text.

Citation

@inproceedings{2026aveyb,
  title={Avey-B},
  author={Acharya, Devang and Hammoud, Mohammad},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}