JGOS-31B-Citizen / README.md
openfree's picture
docs: reposition as administrative/public-sector AI (remove regional specificity)
4cb832d verified
---
language:
- ko
license: gemma
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- darwin
- darwin-v8
- gemma4
- korean
- administrative-ai
- public-sector
- government
- multimodal
- image-text-to-text
- reasoning
- thinking
- conversational
- gpqa
- benchmark
- leaderboard
- k-ai
- k-ai-leaderboard
- vidraft
- jgos
- text-generation
- ffn-transfer
- model-merge
---
# JGOS-31B-Citizen
<p align="center">
<img src="https://huggingface.co/JGOS-Model/JGOS-31B-Citizen/resolve/main/k-ai.png" alt="#1 on the K-AI Leaderboard" width="780"/>
</p>
<p align="center">
🏆 <b>#1 on the K-AI Leaderboard</b> &middot; Korea's national Korean-language AI benchmark (<a href="https://leaderboard.aihub.or.kr/leaderboard">leaderboard.aihub.or.kr</a>)
</p>
**JGOS-31B-Citizen** is a Korean, multimodal large language model **specialized for administrative & public-sector AI services** — civil-complaint response, public-document understanding, and government-domain question answering.
## Overview
JGOS-31B-Citizen is built on VIDRAFT's **Darwin V8** platform.
- **Base + FFN transfer, breeding & evolution (Darwin V8).** Starting from our in-house **gemma4-31b** base, the **feed-forward network (FFN)** blocks of multiple source models are extracted and grafted, then bred (merged) and evolved across **multiple generations** through the Darwin V8 pipeline to accumulate capability.
- **Korean administrative-domain fine-tuning.** The evolved model is further trained on **Korean-specialized datasets** to strengthen Korean comprehension, reasoning, and **administrative/public-sector domain** performance.
> The set of grafted source models, the number of evolution generations, the breeding strategy, dataset composition, and training configuration are proprietary and not disclosed.
## Specifications
| Item | Value |
|------|-------|
| Parameters | ~31B (dense) |
| Modality | Text + Image (multimodal) |
| Context length | up to 256K tokens |
| Base family | gemma4-31b (Gemma-compatible architecture) |
| Focus | Administrative & public-sector AI services |
## Highlights
- 🏆 **#1 on the K-AI Leaderboard** — Korea's national Korean-language AI benchmark (KMMLU-Pro · CLIcK · HLE · MuSR · Com2)
- **GPQA Diamond: 84.34%**
## Evaluation
### GPQA Diamond (198 questions)
| Method (test-time compute) | Score |
|----------------------------|-------|
| maj@8 + tie-retry + DELPHI + near-miss maj@32-64 (weighted vote) | **84.34%** (167/198) |
## Training Datasets
JGOS-31B-Citizen was trained using large-scale Korean corpora sourced from the **Korean AI Hub (AIHub)** — Korea's national AI data repository operated by NIA. The following datasets were used to optimize performance on the **K-AI Leaderboard** benchmarks (KoMMLU-Pro, CLIcK, HLE, MuSR, Com2):
| # | Dataset Name | AIHub Link |
|---|---|---|
| 1 | Medical and Legal Professional Books Corpus | [71487](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=71487) |
| 2 | Financial and Legal Document Machine Reading Comprehension | [71610](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=71610) |
| 3 | Large-scale Web-based Korean Corpus | [624](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=624) |
| 4 | Large-scale Book-based Korean Corpus | [653](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=653) |
| 5 | National Records Large-scale AI Learning Corpus | [71788](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=71788) |
| 6 | Korean Generation-based Common Sense Reasoning Dataset | [459](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=459) |
| 7 | Multi-session Dialogue Corpus | [pkg1](https://aihub.or.kr/aihubdata/data/view.do?currMenu=511&topMenu=100&aihubDataSe=dataPckage&dataPckageSn=1) |
| 8 | Essential Medical Knowledge Data (142GB) | [71875](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=71875) |
| 9 | Specialized Medical Knowledge Data (206GB) | [71874](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=71874) |
| 10 | Korean Dialogue Dataset | [272](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=272) |
> All datasets are publicly available via [AIHub](https://aihub.or.kr) (registration required).
## License
This model is built on a Gemma-family architecture and is distributed under the [**Gemma Terms of Use**](https://ai.google.dev/gemma/terms). By using this model, you agree to the Gemma license terms.