gpt2-4x124M-competitive-moe

A 4-expert Mixture of Experts built from fine-tuned GPT-2 variants, using entropy-based competitive routing: no learned router, no weight blending. At every token, the most confident expert wins.

How it works

Each expert runs a full forward pass independently. The router picks the winner by computing the Shannon entropy of each expert's next-token probability distribution and selecting the one with the lowest entropy (i.e. the most certain). The winning expert's logits are used as the output; all others are discarded.

winner = argmin_i H(P_i(x_{t+1} | x_{≤t}))

This is winner-take-all, not a weighted mixture. It's closer to a selection mechanism than traditional MoE blending.
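As a sketch, the selection rule above is just an argmin over per-expert entropies. The function below is a minimal, self-contained illustration; `entropy_route` and the toy logits are invented here for clarity, not the model's actual API:

```python
import torch

def entropy_route(expert_logits):
    """Winner-take-all selection: return the index and logits of the
    expert whose next-token distribution has the lowest Shannon entropy.

    expert_logits: (n_experts, vocab_size) tensor of next-token logits,
    one row per expert's independent forward pass.
    """
    probs = torch.softmax(expert_logits, dim=-1)
    # H(P_i) = -sum_v P_i(v) log P_i(v), computed per expert
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    winner = int(entropy.argmin())
    return winner, expert_logits[winner]

# Toy logits: expert 1 is sharply peaked, so its entropy is lowest.
logits = torch.tensor([[0.1, 0.2, 0.1, 0.0],   # near-uniform, high entropy
                       [8.0, 0.0, 0.0, 0.0],   # peaked, low entropy
                       [1.0, 1.0, 0.5, 0.5],
                       [0.5, 0.5, 0.5, 0.5]])  # uniform, maximal entropy
winner, out = entropy_route(logits)
print(winner)  # 1
```

Because only the argmin is taken, the losing experts' forward passes are wasted compute; this is the price of skipping a learned router.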

Experts

All experts share the same GPT-2 tokenizer and base architecture. Vocab sizes differ slightly between checkpoints (50257 / 50259 / 50304); logits are truncated to the minimum at inference time.
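A minimal sketch of that truncation step, assuming the extra token IDs in the larger checkpoints are appended after the shared base GPT-2 vocabulary (names and shapes here are illustrative):

```python
import torch

# The checkpoint vocab sizes noted above; the first 50257 entries are
# assumed to line up across all four experts.
vocab_sizes = [50257, 50259, 50257, 50304]
min_vocab = min(vocab_sizes)

# Truncate each expert's logits to the shared prefix so the entropy
# comparison (and the final output) uses comparable distributions.
expert_logits = [torch.randn(v) for v in vocab_sizes]
aligned = torch.stack([l[:min_vocab] for l in expert_logits])
print(aligned.shape)  # torch.Size([4, 50257])
```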

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
 
model = AutoModelForCausalLM.from_pretrained(
    "Fu01978/gpt2-4x124M-competitive-moe",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Fu01978/gpt2-4x124M-competitive-moe")
 
input_ids = tokenizer.encode("Question: What is the speed of light? Answer:", return_tensors="pt")
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=40)
print(tokenizer.decode(out[0]))

Note: trust_remote_code=True is required due to the custom architecture.

Parameters

  Active parameters per token: ~124M
  Total parameters: ~0.5B (4 × 124M experts, F32)
  Routing: entropy-based, winner-take-all
  Vocab size: 50257 (effective)

Observed routing behavior

In practice, experts 2 and 3 dominate routing on Q&A-style prompts, typically splitting somewhere between 30/70 and 60/40 depending on the query. Experts 0 and 1 rarely win: their broader, more uniform distributions carry higher entropy, so they lose the confidence competition on most tokens.

Some example outputs on prompts formatted as "Question: X Answer:":

Question                                   Answer
Largest planet in the solar system?        Jupiter ✓
Who wrote Romeo and Juliet?                William Shakespeare ✓
Boiling point of water?                    100 degrees Celsius ✓
Powerhouse of the cell?                    Mitochondria ✓
Speed of light?                            299,792,458 m/s ✓
Who painted the Mona Lisa?                 Leonardo da Vinci ✓
How many planets in the solar system?      8 ✓
How many sides does a hexagon have?        4 ✗
Chemical symbol for gold?                  Fe ✗

The failures are GPT-2 scale limitations, not routing failures; the experts simply don't have reliable knowledge of everything.

Limitations

  • This is a GPT-2 scale experiment, not a production model. Don't expect reliable factual accuracy.

  • No instruction tuning at the MoE level: the Q&A format works because expert 2 was fine-tuned on it, but there's no guarantee of structured output.

  • Expert 1 effectively never wins routing due to its broad distribution; it contributes nothing at inference time under entropy routing.

  • Routing is greedy and non-differentiable: there's no way to fine-tune the routing behavior without replacing it with a learned router.

Intended use

This is a research project exploring competitive routing as an alternative to learned MoE routers. It's interesting to inspect routing decisions token-by-token to understand how expert distributions differ. It's not intended for any downstream task.
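A self-contained sketch of that kind of token-by-token inspection, using toy stand-in experts rather than the real checkpoints (`make_expert`, the temperatures, and the tiny vocabulary are all invented for illustration):

```python
import torch

torch.manual_seed(0)

# Stand-in experts: each maps a token sequence to next-token logits.
# Real use would run each GPT-2 expert's forward pass instead.
def make_expert(temperature):
    proj = torch.randn(16, 8)
    def expert(ids):
        h = torch.zeros(16)
        h[ids[-1] % 16] = 1.0          # trivial "hidden state"
        return (h @ proj) / temperature  # lower temperature -> sharper
    return expert

experts = [make_expert(t) for t in (2.0, 1.5, 0.3, 0.5)]

ids = [0]
trace = []  # which expert won at each decoding step
for _ in range(5):
    logits = torch.stack([e(ids) for e in experts])
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    winner = int(entropy.argmin())
    trace.append(winner)
    ids.append(int(logits[winner].argmax()))

print(trace)
```

With the real model, logging the analogous trace over a Q&A prompt is what reveals the expert 2 / expert 3 dominance described above.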
