TOk-Atsuru committed (verified) · Commit 189ec99 · 1 parent: b8f3f03

Add model card (README.md, +170 lines)
---
license: apache-2.0
library_name: gguf
base_model: allenai/OLMoE-1B-7B-0125-Instruct
tags:
- moe
- mixture-of-experts
- expert-tuning
- domain-adaptation
- gguf
- q4_k_m
- olmoe
- japanese
- finance
- code
- knowledge-distillation
model_name: GOBA-OLMoE-Expert-Tuned
pipeline_tag: text-generation
language:
- en
- ja
datasets:
- izumi-lab/llm-japanese-dataset
- ronantakizawa/Finance-Instruct-500k-Japanese
- nvidia/OpenCodeReasoning-2
---

# GOBA-OLMoE-Expert-Tuned: Domain-Specialized MoE via Expert Tuning

**3 domain-specialized variants** | JA / Finance / Code | **General performance preserved** | GGUF Q4_K_M | Apache 2.0

Domain-adapted variants of [OLMoE-1B-7B-0125-Instruct](https://huggingface.co/allenai/OLMoE-1B-7B-0125-Instruct) created using **Expert Tuning** — a novel approach that repurposes low-importance expert slots in Mixture-of-Experts models for domain-specific adaptation, as an alternative to LoRA or full fine-tuning.

## Highlights

- **3 domain variants**: Japanese, Finance, and Code — each tuned independently
- **General performance preserved**: MMLU and GSM8K scores remain within ±2pp of the original
- **Confirmed domain improvements**: JMMLU +4.5pp, HumanEval+ +10pp, EDINET-Bench fraud detection +16pp
- **Drop-in replacement**: works with llama.cpp with no code changes
- **Apache 2.0**: fully open for commercial use

## Included Models

| File | Domain | Size | Description |
|------|--------|------|-------------|
| `OLMoE-ja-tuned.gguf` | Japanese | 3.9 GB | Japanese language comprehension and generation |
| `OLMoE-finance-tuned.gguf` | Finance | 3.9 GB | Financial analysis, fraud detection, regulatory knowledge |
| `OLMoE-code-tuned.gguf` | Code | 3.9 GB | Python code generation and reasoning |

## Benchmark Results

### General Benchmarks (no domain bias)

| Benchmark | Original | JA-tuned | Finance-tuned | Code-tuned |
|-----------|----------|----------|---------------|------------|
| **MMLU** (0-shot, 100Q) | 53% | 54% (+1pp) | 51% (-2pp) | 52% (-1pp) |
| **GSM8K** (0-shot, 50Q) | 66% | 66% (=) | 66% (=) | 66% (=) |

### Domain-Specific Benchmarks

| Benchmark | Original | Tuned | Delta | Verdict |
|-----------|----------|-------|-------|---------|
| **JMMLU** (200Q, stratified from 53 subjects) | 30.0% | **34.5%** | **+4.5pp** | POSITIVE |
| **EDINET-Bench** (100Q, earnings + fraud) | 45.0% | 46.0% | +1.0pp | NEUTRAL |
| — Fraud detection subset | 34.0% | **50.0%** | **+16.0pp** | POSITIVE |
| — Earnings forecast subset | 56.0% | 42.0% | -14.0pp | REGRESSION |
| **HumanEval+** (20Q subset) | 20.0% | **30.0%** | **+10.0pp** | POSITIVE |

> **Note**: OLMoE-1B-7B has 1.3B active parameters, so absolute scores are lower than larger models. The relative improvements from Expert Tuning are the key result.

## Model Details

| Property | Value |
|----------|-------|
| Base model | [allenai/OLMoE-1B-7B-0125-Instruct](https://huggingface.co/allenai/OLMoE-1B-7B-0125-Instruct) |
| Architecture | Transformer with sparse MoE (SwiGLU experts) |
| Total / active parameters | 6.9B / 1.3B |
| MoE layers | 16 |
| Experts per layer | 64 (top-8 routing) |
| Hidden dimension | 2048 |
| Expert FFN dimension | 1024 (SwiGLU: gate + up + down) |
| Context length | 4096 tokens |
| Quantization | Q4_K_M GGUF |
| License | Apache 2.0 |

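The expert shape in the table above (2048-d hidden state, 1024-d SwiGLU intermediate) can be sketched in NumPy. This is a generic SwiGLU expert for illustration, not the project's actual code; all names are illustrative.

```python
import numpy as np

HIDDEN, FFN = 2048, 1024  # per the table above

def silu(x):
    # SiLU (swish) activation used inside SwiGLU
    return x / (1.0 + np.exp(-x))

def swiglu_expert(x, w_gate, w_up, w_down):
    """One expert FFN: y = (silu(x @ W_gate) * (x @ W_up)) @ W_down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
w_gate = rng.standard_normal((HIDDEN, FFN)) * 0.02
w_up   = rng.standard_normal((HIDDEN, FFN)) * 0.02
w_down = rng.standard_normal((FFN, HIDDEN)) * 0.02

tokens = rng.standard_normal((4, HIDDEN))  # 4 token embeddings
out = swiglu_expert(tokens, w_gate, w_up, w_down)
print(out.shape)  # (4, 2048): the expert maps hidden dim back to hidden dim
```

At top-8 routing, each token's output is a weighted sum over 8 such experts out of the 64 per layer.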
## What is Expert Tuning?

Expert Tuning is a novel domain-adaptation technique for Mixture-of-Experts (MoE) models. Instead of adding external adapters (LoRA) or fine-tuning all parameters, it identifies **low-importance experts** within the existing MoE architecture and replaces them with **domain-specialized experts** trained via knowledge distillation.

**Key advantages over LoRA:**
- No additional parameters at inference time (experts replace existing slots)
- Native MoE routing automatically directs domain-relevant tokens to the specialized experts
- Compatible with quantized GGUF inference — no adapter merging needed

**Method overview:**
1. Compute importance scores for all experts across layers
2. Select the bottom-k experts as candidates for replacement
3. Train domain-specific experts on domain data using cross-expert knowledge distillation
4. Insert the trained experts back into the GGUF with near-lossless Q4_K/Q6_K quantization

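Steps 1 and 2 can be sketched as follows. This is a minimal illustration assuming importance is estimated from accumulated router probabilities over a calibration set — the card does not specify the exact scoring rule, and the routing statistics here are synthetic; all names are illustrative.

```python
import numpy as np

N_LAYERS, N_EXPERTS, K_REPLACE = 16, 64, 4  # OLMoE: 16 MoE layers, 64 experts; replace 4 per layer

def expert_importance(router_probs):
    """Importance per expert = mean routing probability over calibration tokens.
    router_probs: (tokens, experts) softmax outputs of one layer's router."""
    return router_probs.mean(axis=0)

def select_replaceable(router_probs_per_layer, k=K_REPLACE):
    """Return the k lowest-importance expert ids per layer (slots to repurpose)."""
    plan = {}
    for layer, layer_probs in enumerate(router_probs_per_layer):
        importance = expert_importance(layer_probs)
        plan[layer] = np.argsort(importance)[:k].tolist()
    return plan

# Synthetic routing stats standing in for a real calibration pass
rng = np.random.default_rng(0)
logits = rng.standard_normal((N_LAYERS, 10_000, N_EXPERTS))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
plan = select_replaceable(probs)
print(len(plan), len(plan[0]))  # 16 layers x 4 candidate slots = 64 replaceable experts
```

Selecting 4 slots in each of the 16 layers matches the "64 trained experts per model" figure in the Technical Notes.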
## Training Data

| Domain | Dataset | Records | License |
|--------|---------|---------|---------|
| **Japanese** | [izumi-lab/llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset) | 50,000 | CC BY-SA 4.0 |
| **Finance** | [ronantakizawa/Finance-Instruct-500k-Japanese](https://huggingface.co/datasets/ronantakizawa/Finance-Instruct-500k-Japanese) + [y2lan/japan-law](https://huggingface.co/datasets/y2lan/japan-law) | ~250,000 | Apache 2.0 / Public Domain |
| **Code** | [nvidia/OpenCodeReasoning-2](https://huggingface.co/datasets/nvidia/OpenCodeReasoning-2) (Python subset) | 50,000 | CC BY 4.0 |

## How to Use

### With llama.cpp

```bash
# Start llama-server (e.g., the Japanese-tuned variant)
llama-server \
  -m OLMoE-ja-tuned.gguf \
  --port 8090 \
  -ngl 99 \
  -c 4096

# In a separate terminal: query via the OpenAI-compatible API
# Prompt: "Please explain Japan's monetary policy."
curl http://localhost:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "olmoe",
    "messages": [{"role": "user", "content": "日本の金融政策について説明してください"}],
    "max_tokens": 512
  }'
```

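The same OpenAI-compatible endpoint can be called from Python. A minimal sketch using only the standard library; the port and model name follow the llama-server example above, and `build_chat_request` is an illustrative helper, not part of any library.

```python
import json
import urllib.request

def build_chat_request(prompt, host="http://localhost:8090", model="olmoe", max_tokens=512):
    """Build an OpenAI-compatible chat-completions request for llama-server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Please explain Japan's monetary policy.")
# With a running server, send it and read the reply:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
print(req.full_url)  # http://localhost:8090/v1/chat/completions
```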
### With moe-stream

```bash
moe-stream OLMoE-ja-tuned.gguf --server --preload-gates --preload-attn
```

## Technical Notes

- **Q4_K/Q6_K quantizer**: custom Python implementation matching the llama.cpp reference, with correct sub-block min clamping, qs packing order, and symmetric signed encoding
- **Insertion CosSim**: 0.9974 (quantized vs. original trained weights), indicating near-lossless insertion
- **Training CosSim**: 0.831 average (teacher-student similarity after KD), indicating meaningful domain adaptation while preserving expert structure
- **64 trained experts per model**: 16 layers × 4 experts/layer replaced

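The insertion-fidelity check above can be illustrated with a simplified symmetric 4-bit block quantizer plus the cosine-similarity metric. This is not the actual Q4_K format (which uses super-blocks with 6-bit sub-scales and mins) and the weights are synthetic; all names are illustrative.

```python
import numpy as np

def quantize_dequantize_4bit(w, block=32):
    """Simplified symmetric 4-bit block quantization: one scale per block,
    integer codes clipped to [-8, 7]. Illustrative only -- real Q4_K differs."""
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0                     # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(flat / scale), -8, 7)  # quantize
    return (q * scale).reshape(w.shape)         # dequantize

def cosine_sim(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 2048)) * 0.02  # e.g., one expert's down-projection
sim = cosine_sim(w, quantize_dequantize_4bit(w))
print(round(sim, 4))  # close to 1.0, analogous to the 0.9974 insertion CosSim above
```

Cosine similarity between the original and round-tripped tensor is what the "Insertion CosSim" bullet reports; values this close to 1.0 mean the quantization round-trip barely rotates the weight vector.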
## Limitations

- OLMoE-1B-7B has only 1.3B active parameters, limiting absolute performance on complex tasks
- Domain benchmarks use moderate sample sizes (20-200 questions); larger evaluations may show different effect sizes
- The Finance-tuned model shows a prediction bias toward fraud detection, with a regression on earnings-forecasting tasks
- Expert Tuning's effectiveness scales with the number of experts per layer; models with fewer experts (e.g., 8-16) have less capacity for domain injection

## Citation

```bibtex
@misc{goba2026expert,
  title={Expert Tuning: Domain Adaptation via Expert Slot Repurposing in Mixture-of-Experts Models},
  author={GOBA AI Labs},
  year={2026},
  url={https://huggingface.co/goba-ai-labs/GOBA-OLMoE-Expert-Tuned}
}
```

## Related Models

- [PrunedHub-GPT-OSS-20B-28x](https://huggingface.co/goba-ai-labs/PrunedHub-GPT-OSS-20B-28x) — lossless expert pruning for GPT-OSS-20B
- [PrunedHub-Qwen3-30B-A3B-EN-MxMoE](https://huggingface.co/goba-ai-labs/PrunedHub-Qwen3-30B-A3B-EN-MxMoE) — mixed-quantization MoE pruning
- [PrunedHub-Qwen3-30B-A3B-JA-MxMoE](https://huggingface.co/goba-ai-labs/PrunedHub-Qwen3-30B-A3B-JA-MxMoE) — language-aware MoE pruning

---

*Built by [GOBA AI Labs](https://goba-ai-labs.github.io) — making large MoE models practical on consumer hardware.*