# gpt2-4x124M-competitive-moe

A 4-expert Mixture of Experts built from fine-tuned GPT-2 variants, using entropy-based competitive routing: no learned router, no weight blending. At every token, the most confident expert wins.
## How it works

Each expert runs a full forward pass independently. The router picks the winner by computing the Shannon entropy of each expert's next-token probability distribution and selecting the one with the lowest entropy (i.e. the most certain). The winning expert's logits are used as the output; all others are discarded.
```
winner = argmin_i H(P_i(x_{t+1} | x_{≤t}))
```
This is winner-take-all, not a weighted mixture. It's closer to a selection mechanism than traditional MoE blending.
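The selection rule above can be sketched in a few lines of plain Python (a minimal standalone illustration; `shannon_entropy` and `route_winner_take_all` are hypothetical helper names, not part of the released code):

```python
import math

def shannon_entropy(logits):
    # Softmax the logits, then H(P) = -sum p * log(p), in nats.
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return -sum((e / z) * math.log(e / z) for e in exps)

def route_winner_take_all(expert_logits):
    # Return the index of the expert whose next-token distribution
    # has the lowest entropy, i.e. the most confident expert.
    return min(range(len(expert_logits)),
               key=lambda i: shannon_entropy(expert_logits[i]))
```

A peaked (confident) logit vector always beats a flat one under this rule, regardless of which expert produced it.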
## Experts
| # | Model |
|---|---|
| 0 | openai-community/gpt2 |
| 1 | samkeet/GPT_124M-Instruct |
| 2 | MiniLLM/SFT-gpt2-120M |
| 3 | Arjun-G-Ravi/chat-GPT2 |
All experts share the same GPT-2 tokenizer and base architecture. Vocab sizes differ slightly between checkpoints (50257 / 50259 / 50304); logits are truncated to the minimum at inference time.
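That truncation step can be sketched as follows (a hypothetical helper, assuming each expert yields one logit vector per position; the released code may differ):

```python
def align_vocab(expert_logits):
    # Checkpoints were saved with slightly different vocab sizes
    # (50257 / 50259 / 50304). Keep only the shared prefix so the
    # per-expert distributions are directly comparable.
    min_vocab = min(len(logits) for logits in expert_logits)
    return [logits[:min_vocab] for logits in expert_logits]
```

Truncating to the shared prefix is safe here because the extra entries are padding/special tokens appended past the common GPT-2 vocabulary.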
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Fu01978/gpt2-4x124M-competitive-moe",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Fu01978/gpt2-4x124M-competitive-moe")

input_ids = tokenizer.encode("Question: What is the speed of light? Answer:", return_tensors="pt")
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=40)
print(tokenizer.decode(out[0]))
```
Note: `trust_remote_code=True` is required due to the custom architecture.
## Parameters

| | |
|---|---|
| Active parameters per token | ~124M |
| Routing | Entropy-based, winner-take-all |
| Vocab size | 50257 (effective) |
## Observed routing behavior

In practice, experts 2 and 3 dominate routing on Q&A-style prompts, typically splitting between roughly 30/70 and 60/40 depending on the query. Experts 0 and 1 rarely win: their broader, more uniform distributions have higher entropy, so they lose the confidence competition on most tokens.
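This is easy to verify numerically: the uniform distribution maximizes Shannon entropy, so a broad expert can never beat a sharply peaked one under this routing rule (toy distributions below, not real expert outputs):

```python
import math

def entropy(probs):
    # Shannon entropy in nats: H(P) = -sum p * log(p)
    return -sum(p * math.log(p) for p in probs if p > 0)

broad = [0.25, 0.25, 0.25, 0.25]   # uncertain expert; H = ln(4) ≈ 1.386 nats
peaked = [0.97, 0.01, 0.01, 0.01]  # confident expert; much lower entropy

confident_wins = entropy(peaked) < entropy(broad)
```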
## Some example outputs

On `Question: X Answer:` formatted prompts:
| Question | Answer |
|---|---|
| Largest planet in the solar system? | Jupiter ✓ |
| Who wrote Romeo and Juliet? | William Shakespeare ✓ |
| Boiling point of water? | 100 degrees Celsius ✓ |
| Powerhouse of the cell? | Mitochondria ✓ |
| Speed of light? | 299,792,458 m/s ✓ |
| Who painted the Mona Lisa? | Leonardo da Vinci ✓ |
| How many planets in the solar system? | 8 ✓ |
| How many sides does a hexagon have? | 4 ✗ |
| Chemical symbol for gold? | Fe ✗ |
The misses are GPT-2-scale limitations, not routing failures; the experts simply don't have reliable knowledge on everything.
## Limitations

- This is a GPT-2-scale experiment, not a production model. Don't expect reliable factual accuracy.
- No instruction tuning at the MoE level: the Q&A format works because expert 2 was fine-tuned on it, but there's no guarantee of structured output.
- Expert 1 effectively never wins routing due to its broad distribution; it contributes nothing at inference time under entropy routing.
- Routing is greedy and non-differentiable, so there's no way to fine-tune the routing behavior without replacing it with a learned router.
## Intended use
This is a research project exploring competitive routing as an alternative to learned MoE routers. It's interesting to inspect routing decisions token-by-token to understand how expert distributions differ. It's not intended for any downstream task.