GOBA-OLMoE-Expert-Tuned: Domain-Specialized MoE via Expert Tuning
3 domain-specialized variants | JA / Finance / Code | General performance preserved within ±2pp | GGUF Q4_K_M | Apache 2.0
Domain-adapted variants of OLMoE-1B-7B-0125-Instruct created using Expert Tuning — a novel approach that repurposes low-importance expert slots in Mixture-of-Experts models for domain-specific adaptation, as an alternative to LoRA or full fine-tuning.
Highlights
- 3 domain variants: Japanese, Finance, and Code — each tuned independently
- Minimal general performance impact: MMLU and GSM8K scores remain within ±2pp of the original
- Domain improvements confirmed: JMMLU +4.5pp, HumanEval+ +10pp, EDINET-Bench Fraud Detection +16pp
- Drop-in replacement: works with llama.cpp with no code changes
- Apache 2.0: fully open for commercial use
Included Models
| File | Domain | Size | Description |
|---|---|---|---|
| OLMoE-ja-tuned.gguf | Japanese | 3.9 GB | Japanese language comprehension and generation |
| OLMoE-finance-tuned.gguf | Finance | 3.9 GB | Financial analysis, fraud detection, regulatory knowledge |
| OLMoE-code-tuned.gguf | Code | 3.9 GB | Python code generation and reasoning |
Benchmark Results
General Benchmarks (domain-neutral)
| Benchmark | Original | JA-tuned | Finance-tuned | Code-tuned |
|---|---|---|---|---|
| MMLU (0-shot, 100Q) | 53% | 54% (+1pp) | 51% (-2pp) | 52% (-1pp) |
| GSM8K (0-shot, 50Q) | 66% | 66% (=) | 66% (=) | 66% (=) |
Domain-Specific Benchmarks
| Benchmark | Original | Tuned | Delta | Verdict |
|---|---|---|---|---|
| JMMLU (200Q, stratified from 53 subjects) | 30.0% | 34.5% | +4.5pp | POSITIVE |
| EDINET-Bench (100Q, earnings + fraud) | 45.0% | 46.0% | +1.0pp | NEUTRAL |
| — Fraud detection subset | 34.0% | 50.0% | +16.0pp | POSITIVE |
| — Earnings forecast subset | 56.0% | 42.0% | -14.0pp | Regression |
| HumanEval+ (20Q subset) | 20.0% | 30.0% | +10.0pp | POSITIVE |
Note: OLMoE-1B-7B has 1.3B active parameters, so absolute scores are lower than larger models. The relative improvements from Expert Tuning are the key result.
Model Details
| Property | Value |
|---|---|
| Base model | allenai/OLMoE-1B-7B-0125-Instruct |
| Architecture | Transformer with Sparse MoE (SwiGLU experts) |
| Total / Active parameters | 6.9B / 1.3B |
| MoE layers | 16 |
| Experts per layer | 64 (top-8 routing) |
| Hidden dimension | 2048 |
| Expert FFN dimension | 1024 (SwiGLU: gate + up + down) |
| Context length | 4096 tokens |
| Quantization | Q4_K_M GGUF |
| License | Apache 2.0 |
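As a rough sanity check on the table above, the expert weights alone account for most of the total and active parameter budgets. This is a simplified sketch: exact OLMoE accounting also includes attention, embedding, and router weights, which are not counted here.

```python
# Rough parameter accounting for OLMoE-1B-7B based on the table above.
# Counts only MoE expert weights; attention, embeddings, and router
# parameters make up the remainder of the 6.9B / 1.3B totals.

hidden = 2048    # hidden dimension
ffn = 1024       # expert FFN dimension
layers = 16      # MoE layers
experts = 64     # experts per layer
top_k = 8        # experts activated per token

# SwiGLU expert = gate + up (hidden -> ffn) and down (ffn -> hidden):
# three weight matrices of shape (hidden, ffn) each
params_per_expert = 3 * hidden * ffn               # ~6.3M

total_expert_params = layers * experts * params_per_expert  # ~6.4B
active_expert_params = layers * top_k * params_per_expert   # ~0.8B

print(f"per expert:     {params_per_expert / 1e6:.1f}M")
print(f"all experts:    {total_expert_params / 1e9:.2f}B of 6.9B total")
print(f"active experts: {active_expert_params / 1e9:.2f}B of 1.3B active")
```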
What is Expert Tuning?
Expert Tuning is a novel domain adaptation technique for Mixture-of-Experts (MoE) models. Instead of adding external adapters (LoRA) or fine-tuning all parameters, it identifies low-importance experts within the existing MoE architecture and replaces them with domain-specialized experts trained via knowledge distillation.
Key advantages over LoRA:
- No additional parameters at inference time (experts replace existing slots)
- Native MoE routing automatically directs domain-relevant tokens to specialized experts
- Compatible with quantized GGUF inference — no adapter merging needed
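The routing advantage can be illustrated with a minimal top-k gating sketch. This is not the exact OLMoE router implementation, just the general mechanism: the router scores all 64 experts per token, and the top 8 (including any replaced, domain-specialized slots) receive the token.

```python
import math

def route_top_k(gate_logits, k=8):
    """Pick the k highest-scoring experts and renormalize their weights.

    gate_logits: per-token router scores, one per expert.
    Returns (indices, weights) of the selected experts.
    """
    # Indices of the k largest logits, highest first
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits (max-subtracted for stability)
    mx = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - mx) for i in top]
    s = sum(exps)
    return top, [e / s for e in exps]

# One token, 64 experts as in OLMoE: domain-relevant tokens land on
# whichever experts the router scores highest -- including replaced ones.
idx, w = route_top_k([float(i % 17) for i in range(64)])
print(idx, [round(x, 3) for x in w])
```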
Method overview:
- Compute importance scores for all experts across layers
- Select bottom-k experts as candidates for replacement
- Train domain-specific experts using cross-expert knowledge distillation with domain data
- Insert trained experts back into the GGUF with near-lossless Q4_K/Q6_K quantization
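Steps 1 and 2 above can be sketched as follows. The importance metric here (mean routing probability mass over a calibration set) is an assumption for illustration; the card does not specify the exact score used.

```python
# Sketch of expert importance scoring and bottom-k selection.
# Assumed metric: average routing probability mass per expert.

def select_replacement_experts(routing_probs, k=4):
    """routing_probs: {layer: [[prob per expert] for each token]}.

    Returns {layer: [indices of the k least-used experts]} -- the
    candidate slots to replace with domain-specialized experts.
    """
    candidates = {}
    for layer, per_token in routing_probs.items():
        n_experts = len(per_token[0])
        # importance = average probability mass routed to each expert
        importance = [
            sum(tok[e] for tok in per_token) / len(per_token)
            for e in range(n_experts)
        ]
        ranked = sorted(range(n_experts), key=importance.__getitem__)
        candidates[layer] = ranked[:k]  # bottom-k -> replacement slots
    return candidates

# Toy example: 1 layer, 4 experts, 3 tokens -- experts 2 and 3 receive
# the least probability mass, so they become replacement candidates.
probs = {0: [[0.7, 0.1, 0.1, 0.1],
             [0.6, 0.2, 0.1, 0.1],
             [0.5, 0.3, 0.1, 0.1]]}
print(select_replacement_experts(probs, k=2))
```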
Training Data
| Domain | Dataset | Records | License |
|---|---|---|---|
| Japanese | izumi-lab/llm-japanese-dataset | 50,000 | CC BY-SA 4.0 |
| Finance | ronantakizawa/Finance-Instruct-500k-Japanese + y2lan/japan-law | ~250,000 | Apache 2.0 / Public Domain |
| Code | nvidia/OpenCodeReasoning-2 (Python subset) | 50,000 | CC BY 4.0 |
How to Use
With llama.cpp
# Run with llama-server (e.g., Japanese-tuned variant)
llama-server \
-m OLMoE-ja-tuned.gguf \
--port 8090 \
-ngl 99 \
-c 4096
# Query via OpenAI-compatible API (the Japanese prompt asks for an
# explanation of Japan's monetary policy)
curl http://localhost:8090/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "olmoe",
"messages": [{"role": "user", "content": "日本の金融政策について説明してください"}],
"max_tokens": 512
}'
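The same endpoint can be queried from Python using only the standard library, equivalent to the curl command above. The endpoint URL and model name match the llama-server invocation; adjust them if you run the server differently.

```python
import json
import urllib.request

def chat(prompt,
         url="http://localhost:8090/v1/chat/completions",
         max_tokens=512):
    """Send one chat turn to the llama-server OpenAI-compatible API."""
    payload = {
        "model": "olmoe",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the llama-server above to be running):
# print(chat("Explain Japan's monetary policy."))
```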
With moe-stream
moe-stream OLMoE-ja-tuned.gguf --server --preload-gates --preload-attn
Technical Notes
- Q4_K/Q6_K quantizer: Custom Python implementation matching llama.cpp reference, with correct sub-block min clamping, qs packing order, and symmetric signed encoding
- Insertion CosSim: 0.9974 (quantized vs. original trained weights), confirming near-lossless insertion
- Training CosSim: 0.831 average (teacher-student similarity after KD), indicating meaningful domain adaptation while preserving expert structure
- 64 trained experts per model: 16 layers x 4 experts/layer replaced
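The cosine-similarity figures above can be computed by flattening the expert weight tensors and comparing them directly. A minimal sketch; the numeric values below are illustrative, not the actual model weights.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two flat weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Insertion check: dequantized GGUF weights vs. trained fp32 weights.
# Values are illustrative; in practice each expert tensor is flattened
# and compared after the Q4_K/Q6_K quantization round-trip.
trained  = [0.12, -0.45, 0.80, 0.33]
inserted = [0.119, -0.451, 0.802, 0.329]
print(f"insertion CosSim: {cos_sim(trained, inserted):.4f}")
```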
Limitations
- OLMoE-1B-7B has only 1.3B active parameters, limiting absolute performance on complex tasks
- Domain benchmarks use moderate sample sizes (20-200 questions); larger evaluations may show different effect sizes
- Finance-tuned model shows prediction bias toward fraud detection, with regression on earnings forecasting tasks
- Expert Tuning effectiveness scales with the number of experts per layer; models with fewer experts (e.g., 8-16) have less capacity for domain injection
Citation
@misc{goba2026expert,
title={Expert Tuning: Domain Adaptation via Expert Slot Repurposing in Mixture-of-Experts Models},
author={GOBA AI Labs},
year={2026},
url={https://huggingface.co/goba-ai-labs/GOBA-OLMoE-Expert-Tuned}
}
Related Models
- PrunedHub-GPT-OSS-20B-28x — Lossless expert pruning for GPT-OSS-20B
- PrunedHub-Qwen3-30B-A3B-EN-MxMoE — Mixed-quantization MoE pruning
- PrunedHub-Qwen3-30B-A3B-JA-MxMoE — Language-aware MoE pruning
Built by GOBA AI Labs — Making large MoE models practical on consumer hardware.