GOBA-OLMoE-Expert-Tuned: Domain-Specialized MoE via Expert Tuning

3 domain-specialized variants | JA / Finance / Code | General performance preserved within ±2pp | GGUF Q4_K_M | Apache 2.0

Domain-adapted variants of OLMoE-1B-7B-0125-Instruct created using Expert Tuning — a novel approach that repurposes low-importance expert slots in Mixture-of-Experts models for domain-specific adaptation, as an alternative to LoRA or full fine-tuning.

Highlights

  • 3 domain variants: Japanese, Finance, and Code — each tuned independently
  • Negligible general performance change: MMLU and GSM8K scores stay within ±2pp of the original
  • Domain improvements confirmed: JMMLU +4.5pp, HumanEval+ +10pp, EDINET-Bench Fraud Detection +16pp
  • Drop-in replacement: works with llama.cpp with no code changes
  • Apache 2.0: fully open for commercial use

Included Models

| File | Domain | Size | Description |
|---|---|---|---|
| OLMoE-ja-tuned.gguf | Japanese | 3.9 GB | Japanese language comprehension and generation |
| OLMoE-finance-tuned.gguf | Finance | 3.9 GB | Financial analysis, fraud detection, regulatory knowledge |
| OLMoE-code-tuned.gguf | Code | 3.9 GB | Python code generation and reasoning |

Benchmark Results

General Benchmarks (no domain bias)

| Benchmark | Original | JA-tuned | Finance-tuned | Code-tuned |
|---|---|---|---|---|
| MMLU (0-shot, 100Q) | 53% | 54% (+1pp) | 51% (-2pp) | 52% (-1pp) |
| GSM8K (0-shot, 50Q) | 66% | 66% (=) | 66% (=) | 66% (=) |

Domain-Specific Benchmarks

| Benchmark | Original | Tuned | Delta | Verdict |
|---|---|---|---|---|
| JMMLU (200Q, stratified from 53 subjects) | 30.0% | 34.5% | +4.5pp | POSITIVE |
| EDINET-Bench (100Q, earnings + fraud) | 45.0% | 46.0% | +1.0pp | NEUTRAL |
| — Fraud detection subset | 34.0% | 50.0% | +16.0pp | POSITIVE |
| — Earnings forecast subset | 56.0% | 42.0% | -14.0pp | NEGATIVE |
| HumanEval+ (20Q subset) | 20.0% | 30.0% | +10.0pp | POSITIVE |

Note: OLMoE-1B-7B has 1.3B active parameters, so absolute scores are lower than larger models. The relative improvements from Expert Tuning are the key result.

Model Details

| Property | Value |
|---|---|
| Base model | allenai/OLMoE-1B-7B-0125-Instruct |
| Architecture | Transformer with sparse MoE (SwiGLU experts) |
| Total / active parameters | 6.9B / 1.3B |
| MoE layers | 16 |
| Experts per layer | 64 (top-8 routing) |
| Hidden dimension | 2048 |
| Expert FFN dimension | 1024 (SwiGLU: gate + up + down) |
| Context length | 4096 tokens |
| Quantization | Q4_K_M GGUF |
| License | Apache 2.0 |
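
The total/active split follows from the table. As a back-of-the-envelope check (the vocabulary size and untied input/output embeddings are assumptions, not stated on this card):

```python
# Rough parameter count for OLMoE-1B-7B from the table above.
d_model, d_ff = 2048, 1024          # hidden dim, expert FFN dim
layers, experts, top_k = 16, 64, 8  # MoE layers, experts/layer, routed experts
vocab = 50304                       # assumption: OLMoE tokenizer size

per_expert = 3 * d_model * d_ff      # SwiGLU expert: gate + up + down matrices
attn = layers * 4 * d_model**2       # Q, K, V, O projections per layer
router = layers * d_model * experts  # routing matrices
embed = 2 * vocab * d_model          # assumption: untied in/out embeddings

total = layers * experts * per_expert + attn + router + embed
active = layers * top_k * per_expert + attn + router + embed
print(f"total = {total/1e9:.1f}B, active = {active/1e9:.1f}B")  # -> 6.9B / 1.3B
```

Almost all parameters live in the experts, which is why replacing a few expert slots can carry a whole domain adaptation.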

What is Expert Tuning?

Expert Tuning is a novel domain adaptation technique for Mixture-of-Experts (MoE) models. Instead of adding external adapters (LoRA) or fine-tuning all parameters, it identifies low-importance experts within the existing MoE architecture and replaces them with domain-specialized experts trained via knowledge distillation.

Key advantages over LoRA:

  • No additional parameters at inference time (experts replace existing slots)
  • Native MoE routing automatically directs domain-relevant tokens to specialized experts
  • Compatible with quantized GGUF inference — no adapter merging needed
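
The second advantage rests on standard top-k MoE routing. A minimal sketch (the exact softmax/top-k ordering varies by implementation and is not restated on this card):

```python
import numpy as np

def top_k_route(router_logits: np.ndarray, k: int = 8):
    """Pick the k highest-scoring experts per token and renormalize their weights.

    router_logits: (num_tokens, num_experts) raw router outputs for one MoE layer.
    Returns (indices, weights), each of shape (num_tokens, k).
    """
    idx = np.argsort(router_logits, axis=-1)[:, -k:]       # ids of the k largest logits
    top = np.take_along_axis(router_logits, idx, axis=-1)  # their logit values
    w = np.exp(top - top.max(axis=-1, keepdims=True))      # stable softmax over the k
    w /= w.sum(axis=-1, keepdims=True)
    return idx, w
```

Because the router learns during training which tokens each expert handles, a replaced domain expert is reached through this same mechanism with no extra inference-time machinery.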

Method overview:

  1. Compute importance scores for all experts across layers
  2. Select bottom-k experts as candidates for replacement
  3. Train domain-specific experts using cross-expert knowledge distillation with domain data
  4. Insert trained experts back into the GGUF with near-lossless Q4_K/Q6_K quantization
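
The card does not specify the importance metric. A minimal sketch of steps 1 and 2, assuming importance is each expert's accumulated router probability over a calibration batch:

```python
import numpy as np

def expert_importance(router_logits: np.ndarray) -> np.ndarray:
    """Accumulated routing probability per expert over a calibration batch.

    router_logits: (num_tokens, num_experts) raw router outputs for one MoE layer.
    """
    z = router_logits - router_logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.sum(axis=0)  # (num_experts,)

def select_replacement_slots(router_logits: np.ndarray, k: int = 4) -> np.ndarray:
    """Return indices of the k least-important experts in this layer."""
    importance = expert_importance(router_logits)
    return np.argsort(importance)[:k]
```

With 64 experts per layer, selecting k=4 slots in each of the 16 MoE layers yields the 64 replaced experts these models use.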

Training Data

| Domain | Dataset | Records | License |
|---|---|---|---|
| Japanese | izumi-lab/llm-japanese-dataset | 50,000 | CC BY-SA 4.0 |
| Finance | ronantakizawa/Finance-Instruct-500k-Japanese + y2lan/japan-law | ~250,000 | Apache 2.0 / Public Domain |
| Code | nvidia/OpenCodeReasoning-2 (Python subset) | 50,000 | CC BY 4.0 |

How to Use

With llama.cpp

# Run with llama-server (e.g., Japanese-tuned variant)
llama-server \
  -m OLMoE-ja-tuned.gguf \
  --port 8090 \
  -ngl 99 \
  -c 4096

# Query via OpenAI-compatible API (prompt: "Please explain Japan's monetary policy")
curl http://localhost:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "olmoe",
    "messages": [{"role": "user", "content": "日本の金融政策について説明してください"}],
    "max_tokens": 512
  }'
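
The same endpoint can be called from Python. A sketch using the standard library against the server started above (the port and model name follow the curl example):

```python
import json
import urllib.request

def build_payload(prompt: str, max_tokens: int = 512) -> dict:
    """OpenAI-style chat completion body for the llama-server endpoint."""
    return {
        "model": "olmoe",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8090") -> dict:
    """POST a chat completion request and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)

# resp = chat("Explain the Bank of Japan's monetary policy.")
# print(resp["choices"][0]["message"]["content"])
```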

With moe-stream

moe-stream OLMoE-ja-tuned.gguf --server --preload-gates --preload-attn

Technical Notes

  • Q4_K/Q6_K quantizer: Custom Python implementation matching llama.cpp reference, with correct sub-block min clamping, qs packing order, and symmetric signed encoding
  • Insertion CosSim: 0.9974 (quantized vs. original trained weights), confirming near-lossless insertion
  • Training CosSim: 0.831 average (teacher-student similarity after KD), indicating meaningful domain adaptation while preserving expert structure
  • 64 trained experts per model: 16 layers x 4 experts/layer replaced
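
The CosSim figures above compare weights before and after a round trip through the quantizer. A toy illustration with plain symmetric per-block int4 quantization (a stand-in for, not the actual, Q4_K/Q6_K sub-block scheme):

```python
import numpy as np

def quant_roundtrip_int4(w: np.ndarray, block: int = 32) -> np.ndarray:
    """Symmetric per-block int4 quantize/dequantize (toy stand-in for Q4_K)."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # signed 4-bit range [-7, 7]
    safe = np.where(scale == 0, 1.0, scale)              # guard all-zero blocks
    q = np.clip(np.round(w / safe), -7, 7)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(2048 * 1024).astype(np.float32)  # one expert-sized matrix
w_hat = quant_roundtrip_int4(w)
cos = float(w @ w_hat / (np.linalg.norm(w) * np.linalg.norm(w_hat)))
print(f"CosSim after round-trip: {cos:.4f}")  # close to 1.0
```

The real Q4_K/Q6_K formats add per-sub-block scales and mins, which is what the custom quantizer's clamping and packing details get right.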

Limitations

  • OLMoE-1B-7B has only 1.3B active parameters, limiting absolute performance on complex tasks
  • Domain benchmarks use moderate sample sizes (20-200 questions); larger evaluations may show different effect sizes
  • Finance-tuned model shows prediction bias toward fraud detection, with regression on earnings forecasting tasks
  • Expert Tuning effectiveness scales with the number of experts per layer; models with fewer experts (e.g., 8-16) have less capacity for domain injection

Citation

@misc{goba2026expert,
  title={Expert Tuning: Domain Adaptation via Expert Slot Repurposing in Mixture-of-Experts Models},
  author={GOBA AI Labs},
  year={2026},
  url={https://huggingface.co/goba-ai-labs/GOBA-OLMoE-Expert-Tuned}
}

Built by GOBA AI Labs — Making large MoE models practical on consumer hardware.
