Gemma3 MoE

This repository provides a Mixture-of-Experts (MoE) variant of Gemma-3, where the standard FFN layers are replaced with a top-k gated MoE architecture.

⚠️ Loading this model requires trust_remote_code=True, because it uses a custom configuration and model implementation (gemma3_moe) that is loaded as custom code rather than from the transformers library itself.

This model is intended for research and experimentation, with a focus on:

- Korean reasoning and reading comprehension
- Fact-checking and factual consistency judgment
- Multilingual instruction-following
- Analysis of MoE routing and expert specialization

Model Overview

Base model: Gemma-3
Architecture: Decoder-only Transformer with MoE FFN
Experts per layer: Configurable (e.g. 8 experts)
Routing: Top-k soft routing
Auxiliary loss: Router load-balancing loss
Framework: PyTorch
Model format: safetensors
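
This card does not spell out the exact form of the router load-balancing loss. As a rough illustration only, the sketch below uses one common formulation (the Switch Transformer style loss, N * sum_i f_i * P_i), written in plain Python so it runs without PyTorch. The function name, the toy logits, and the choice of this formulation are assumptions, not the repository's code.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def load_balancing_loss(router_logits, k):
    """Return N * sum_i(f_i * P_i) over a batch of per-token router logits.

    f_i is the fraction of top-k assignments that go to expert i and
    P_i is the mean router probability of expert i (assumed formulation).
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    assignment_fraction = [0.0] * num_experts
    mean_probability = [0.0] * num_experts

    for logits in router_logits:
        ranked = sorted(range(num_experts), key=lambda i: logits[i], reverse=True)
        for i in ranked[:k]:
            assignment_fraction[i] += 1.0 / (num_tokens * k)
        for i, p in enumerate(softmax(logits)):
            mean_probability[i] += p / num_tokens

    return num_experts * sum(
        f * p for f, p in zip(assignment_fraction, mean_probability)
    )

# Toy batches (hypothetical): 8 tokens, 4 experts, top-1 routing.
balanced = [[5.0 if e == t % 4 else 0.0 for e in range(4)] for t in range(8)]
collapsed = [[5.0, 0.0, 0.0, 0.0] for _ in range(8)]

balanced_loss = load_balancing_loss(balanced, k=1)
collapsed_loss = load_balancing_loss(collapsed, k=1)
print(balanced_loss, collapsed_loss)
```

With these toy inputs, the evenly spread router gives a loss of about 1.0, while the router that sends every token to one expert gives a value close to the number of experts (4). Adding a term like this to the training loss penalizes collapsed routing.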

Architecture Details

- Each Transformer FFN block is replaced with an MoE layer
- A lightweight router maps hidden states to expert logits
- Top-k experts are selected per token
- Expert outputs are merged via weighted summation
- Router logits are retained for auxiliary balancing loss computation
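
The routing steps above can be sketched for a single token as follows. This is a minimal plain-Python illustration, not the gemma3_moe implementation: the toy experts, the function names, and the choice to renormalize gate weights with a softmax over only the selected logits are all assumptions.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k):
    """Pick the k highest-scoring experts and return (index, weight) pairs."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Assumption: gate weights are a softmax over the selected logits only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(hidden, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs for one token."""
    output = [0.0] * len(hidden)
    for index, weight in route_token(router_logits, k):
        expert_out = experts[index](hidden)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output

# Toy experts (hypothetical): each one just scales its input.
experts = [lambda h, s=s: [s * x for x in h] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]  # experts 1 and 3 score highest

routes = route_token(router_logits, k=2)
output = moe_forward([1.0, -1.0], router_logits, experts, k=2)
print(routes)  # [(1, 0.5), (3, 0.5)]
print(output)  # [3.0, -3.0]
```

Only the selected experts are evaluated, which is what keeps the per-token compute below that of a dense layer with the same total parameter count.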

This design enables implicit expert specialization across:

- Logical and step-by-step reasoning
- Factual QA and verification
- Creative and conversational responses
- Code-related patterns

Training Data

The model was fine-tuned on a mixed supervised fine-tuning (SFT) dataset, loaded in streaming mode and explicitly balanced across languages and task types.

Language Distribution

Korean: ~55%
English: ~35%
Code: ~10%

Datasets were interleaved with fixed probabilities and shuffled using a streaming buffer.
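
Fixed-probability interleaving can be sketched as below. This is an illustrative stand-in written with the standard library only; the `interleave` helper and the label-only streams are hypothetical, and the actual data pipeline is not shown in this card.

```python
import itertools
import random
from collections import Counter

def interleave(sources, probabilities, seed=0):
    """Yield examples, choosing the source of each one with fixed probabilities.

    Stops as soon as the chosen source runs out of examples.
    """
    rng = random.Random(seed)
    iterators = [iter(source) for source in sources]
    indices = list(range(len(iterators)))
    while True:
        chosen = rng.choices(indices, weights=probabilities, k=1)[0]
        try:
            yield next(iterators[chosen])
        except StopIteration:
            return

# Stand-in streams (hypothetical): each one yields only its own label.
korean = itertools.repeat("ko")
english = itertools.repeat("en")
code = itertools.repeat("code")

mixed = interleave([korean, english, code], probabilities=[0.55, 0.35, 0.10])
sample = list(itertools.islice(mixed, 10_000))
shares = {label: count / len(sample) for label, count in Counter(sample).items()}
print(shares)
```

Over a large sample the observed shares approach the configured 55 / 35 / 10 split. In the Hugging Face datasets library, `interleave_datasets` accepts a `probabilities` argument and offers comparable behavior for streaming datasets; whether this training run used it is not stated here.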

English Instruction Data (35%)

`Open-Orca/OpenOrca`
- Subset: ~50K samples
- General instruction-following
- Multi-step reasoning and explanation tasks

Korean Instruction Data (55%)

Korean data was diversified with a strong emphasis on reasoning accuracy and factual correctness.

| Category | Dataset | Purpose | Weight (within Korean data) |
| --- | --- | --- | --- |
| Logic / Reasoning | `kyujinpy/KOpen-platypus` | Step-by-step reasoning | 20% |
| QA | `kikikara/ko_QA_dataset` | General QA | 20% |
| MRC | Custom iterable | Reading comprehension | 25% |
| Fact-checking | Custom iterable | Factual verification / hallucination reduction | 20% |
| Creative | `beomi/KoAlpaca-v1.1a` | Conversational & creative tasks | 15% |

📌 Fact-checking data was intentionally up-weighted to strengthen factual grounding.
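
The weights in the table sum to 100%, so they read most naturally as shares of the Korean portion rather than of the whole mix. Under that assumption, each category's share of the full training mix follows from a short calculation:

```python
KOREAN_SHARE = 0.55  # Korean portion of the full mix, from the language distribution

# Weight column of the table, read as shares of the Korean portion (assumption).
korean_weights = {
    "Logic / Reasoning": 0.20,
    "QA": 0.20,
    "MRC": 0.25,
    "Fact-checking": 0.20,
    "Creative": 0.15,
}

overall_shares = {name: KOREAN_SHARE * w for name, w in korean_weights.items()}
for name, share in overall_shares.items():
    print(f"{name}: {share:.2%}")
```

On this reading, fact-checking examples make up about 11% of all training examples and MRC about 13.75%.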

Code Instruction Data (10%)

`sahil2801/CodeAlpaca-20k`
- Subset: ~15K samples
- Code generation and explanation tasks

Data Processing

- Unified SFT formatting across all datasets
- Streaming mode to reduce memory usage
- Final shuffle buffer size: 10,000
- Fixed-probability interleaving to preserve domain balance
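
A shuffle buffer approximates a full shuffle on a stream without holding the whole dataset in memory. The sketch below shows the general technique using the standard library; the `buffered_shuffle` helper is hypothetical and uses a small buffer for readability, whereas the training run used a buffer of 10,000.

```python
import random

def buffered_shuffle(stream, buffer_size, seed=0):
    """Yield items from a stream in approximately random order.

    Keeps at most buffer_size items in memory: once the buffer is full,
    each incoming item replaces a randomly chosen buffered item, which is
    emitted in its place.
    """
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # Drain what is left once the stream ends.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(100), buffer_size=10, seed=0))
print(shuffled[:10])
```

A larger buffer gives a more thorough shuffle at the cost of memory. The Hugging Face datasets library exposes the same idea through `IterableDataset.shuffle(seed=..., buffer_size=...)`.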

Training Strategy

This model was trained with parameter-group–specific optimization, explicitly separating shared parameters, expert FFNs, and router parameters.

Different learning rates and regularization settings were applied to encourage stable MoE specialization and balanced routing behavior.

Optimizer Parameter Groups

def get_moe_param_groups(model):
    """Split trainable parameters into shared, expert, and router groups.

    Each group gets its own learning rate and weight decay.
    """
    shared_params = []  # attention, embeddings, layer norms, etc.
    expert_params = []  # expert FFN weights
    router_params = []  # router (gating) weights

    for name, param in model.named_parameters():
        # Frozen parameters are not optimized.
        if not param.requires_grad:
            continue

        # Router parameters
        if "mlp.router" in name:
            router_params.append(param)

        # Expert FFN parameters
        elif "mlp.experts" in name:
            expert_params.append(param)

        # Everything else is shared across tokens
        else:
            shared_params.append(param)

    return [
        {"params": shared_params, "lr": 2e-6, "weight_decay": 0.0},
        {"params": expert_params, "lr": 1e-5, "weight_decay": 0.1},
        {"params": router_params, "lr": 2e-5, "weight_decay": 0.0},
    ]
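
The grouping depends entirely on substring matches against parameter names, so it is worth checking on a few names before training. The sketch below applies the same matching rules to a list of names; the names are hypothetical examples that only mimic the `mlp.router` / `mlp.experts` naming assumed by `get_moe_param_groups`, and `classify_param` is an illustrative helper.

```python
from collections import Counter

def classify_param(name):
    """Return the optimizer group a parameter name would fall into."""
    if "mlp.router" in name:
        return "router"
    if "mlp.experts" in name:
        return "expert"
    return "shared"

# Hypothetical parameter names, for illustration only.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.input_layernorm.weight",
    "model.layers.0.mlp.router.weight",
    "model.layers.0.mlp.experts.0.gate_proj.weight",
    "model.layers.0.mlp.experts.0.down_proj.weight",
]

counts = Counter(classify_param(name) for name in names)
print(dict(counts))
```

If the router or expert count comes out as zero on a real model, the name patterns do not match its module names, and every parameter would silently fall into the shared group with the lowest learning rate.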

Fact-check Test

User Prompt

다음 요약문이 원문에 비추어 사실인지 판단하시오. 사실 여부와 판단 근거를 함께 제시하시오.

[원문] 올해부터 가맹본부 정보공개서 등록이 수개월에서 30일 이내로 단축될 전망이다. 서울시는 올해 1월1일부터 가맹보부 정보공개서 등록업무를 공정거래위원회로부터 이양받아 서울, 인천, 경기 등 3개 지자체가 처리한다고 16일 밝혔다. 가맹정보공개서는 가맹점 창업 희망자가 계약에 앞서 가맹본부의 정보를 확인할 수 있는 문서로, 계약체결 여부를 결정하는 데 큰 역할을 한다. 기존에 가맹정보공개서 등록이 공정위에서만 가능해 길게는 수개월식 소요됐다. 하지만 올해 지자체와의 업무 분담으로 등록기간이 30일 이내로 단축된다. 서울시는 본사의 현황을 파악할 수 있는 재무사항, 투자수익률 등이 포함돼 있는 자료이니만큼 좀 더 꼼꼼하게 체계적으로 심사하고 등록해 가맹점 창업 희망자들이 창업에 대한 합리적이고 신중한 결정을 할 수 있도록 돕겠다고 밝혔다. 서울소재 가맹본부 중 정보공개서 등록을 원하는 업체는 서울시 공정경제담당관 가맹정보팀으로 우편 또는 방문접수(중구 무교로 21 서울시청 무교별관 8층)하거나 공정거래위원회 가맹사업거래 홈페이지로 신청하면 된다. 한편 시는 오는 18일 오후 2시 서울시청 다목적홀에서 가맹본부 대상 정보공개서 등록업무 설명회를 개최한다.

[요약문] 서울시는 올해 1월1일부터 가맹보부 정보공개서 등록업무를 공정거래위원회로부터 이양받아 서울, 인천, 경기 등 3개 지자체가 처리한다고 16일 밝혔으며, 가맹정보공개서는 가맹점 창업 희망자가 계약에 앞서 가맹본부의 정보를 확인할 수 있는 문서로, 계약체결 여부를 결정하는 데 큰 역할을 한다.

Test Result (Model Output)

판단: 사실

오류 유형: 한글 맞춤법, 띄어쓰기 오류

근거: 요약문에는 '1월1일부터'라는 표현이 있으며, 이는 '1월 1일부터'로 수정되어야 한다. 해당 오류는 [1] 한글 맞춤법, 띄어쓰기 오류에 해당한다. 이러한 오류는 의미나 사실 관계에 영향을 주지 않는다.
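
A prompt in the layout used by this test can be assembled with a small helper. The helper below is hypothetical and only mirrors the visible structure of the example: an instruction (asking the model to judge whether the summary is factual with respect to the source text and to give its reasoning), followed by the `[원문]` (source text) and `[요약문]` (summary) parts. Separating the parts with blank lines is an assumption.

```python
def build_fact_check_prompt(source_text, summary):
    """Assemble a fact-check prompt with instruction, [원문] and [요약문] parts."""
    instruction = (
        "다음 요약문이 원문에 비추어 사실인지 판단하시오. "
        "사실 여부와 판단 근거를 함께 제시하시오."
    )
    # Assumption: the three parts are separated by blank lines.
    return f"{instruction}\n\n[원문] {source_text}\n\n[요약문] {summary}"

prompt = build_fact_check_prompt("source article text", "summary to verify")
print(prompt)
```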

QA Test

User Prompt

아르제 압축 프로그램의 복원 기능과 다중볼륨 압축 기능은 무엇인가요?

Test Result (Model Output)

아르제 압축 프로그램은 복원 기능과 다중볼륨 압축 기능을 제공합니다. 복원 기능은 손상된 압축 파일을 복원하여 원래의 압축 파일로 복원할 수 있는 기능을 제공하며, 다중볼륨 압축 기능은 압축 파일을 여러 개의 작은 볼륨으로 나누고, 각 볼륨에 대해 다른 압축 알고리즘을 적용하여 압축률을 높일 수 있는 기능을 제공합니다. 개인적으로, 이러한 기능들은 압축 파일의 손상 복원과 효율적인 압축을 가능하게 해주어 매우 유용하다고 생각합니다.

MRC Test

User Prompt

[뉴스/본문] 한국청소년단체협의회와 여성가족부는 22일부터 28일까지 서울과 충북 괴산에서 '국제청소년포럼'을 연다고 21일 밝혔다. 한국 미국 캐나다 호주 등 전 세계 32개국 75여명의 대학생, 청소년들이 모여 전 세계적 현안문제에 대한 대안과 해결책을 모색하는 자리다. 이번 포럼의 주제는 '청소년과 뉴미디어'다. 스마트폰 SNS 태블릿PC 등 새로운 커뮤니케이션 매체인 '뉴미디어'에 대한 성찰과 문제점에 대해 토론한다. 기조강연을 시작으로 국가별 주제관련 사례발표, 그룹 토론 및 전체총회, '청소년선언문' 작성 및 채택 등 다양한 프로그램을 운영한다. 개회식은 22일 서울 방화동에 있는 국제청소년센터 국제회의장에서 한다. 전 세계 32개국 대학생ㆍ청소년 참가자와 전국의 청소년기관단체장과 청소년지도자 여성가족부 주한외교사절 등 100여명이 참석할 예정이다. 23일에는 유엔미래포럼 박영숙 대표가 '뉴미디어의 균형 있는 발전을 위한 청소년의 역할'에 대해 기조강연을 한다. 뉴미디어의 올바른 활용방안과 청소년문화의 형성에 대해 설명할 계획이다. 27일 폐회식에서는 '청소년선언문'을 채택한다. 선언문에는 전 세계적으로 뉴미디어의 바람직한 발전을 촉구하며 각국 청년들이 함께 실천할 수 있는 내용 등이 담길 예정이다. 한국청소년단체협의회는 포럼이 끝난 뒤 UN 등 국제기구와 참가자 각국 정부 등 국제사회에 선언문을 전달할 예정이다.

[질의] 서울과 충북 괴산에서 '국제청소년포럼'을 여는 곳은?

Test Result (Model Output)

한국청소년단체협의회와 여성가족부

Code Test

User Prompt

Create an array of length 5 which contains all even numbers between 1 and 10.

Test Result (Model Output)

arr = [2, 4, 6, 8, 10]

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "chyungwon/police-report-analysis-model-MoE-3X8"

# trust_remote_code=True is required to load the custom gemma3_moe classes.
# device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)

# Prompt (Korean): "What is an MoE (Mixture-of-Experts) model?"
inputs = tokenizer("MoE(Mixture-of-Experts) 모델이 머야?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)

print('out : ', tokenizer.decode(outputs[0], skip_special_tokens=True))

out : 개인적으로, MoE(Mixture-of-Experts) 모델은 복잡한 문제를 해결하는 데 매우 효과적이라고 생각합니다. 하지만, 모델의 복잡성과 학습 데이터의 양이 많은 단점도 고려해야 합니다.

Model Card Authors

(주)인정보
Homepage: http://www.ijbinfo.com

This work was supported by the National IT Industry Promotion Agency (NIPA).

Model Card Contact

(주)인정보
Address: 서울시 금천구 가산동 60-5 갑을그레이트밸리A동 805호
Contact: TEL: 02-3397-7765, FAX: 02-3397-7769, E-mail: sales@injungbo.co.kr
Contact person: 장형원 (chyungwon@ijbinfo.com)

Safetensors

Model size: 5B params
Tensor type: F32