openfree committed · Commit 43e2d55 · verified · 1 parent: 0911fc6

feat: add Training Datasets section (AIHub K-AI optimized)

Files changed (1)
  1. README.md +19 -0
README.md CHANGED
@@ -70,6 +70,25 @@ JGOS-31B-Citizen is built on VIDRAFT's **Darwin V8** platform.
  |----------------------------|-------|
  | maj@8 + tie-retry + DELPHI + near-miss maj@32–64 (weighted vote) | **84.34%** (167/198) |
 
+
+ ## Training Datasets
+
+ JGOS-31B-Citizen was trained on large-scale Korean corpora from the **Korean AI Hub (AIHub)**, Korea's national AI data repository operated by NIA (National Information Society Agency). The following datasets were used to optimize performance on the **K-AI Leaderboard** benchmarks (KoMMLU-Pro, CliCK, HLE, MuSR, Com2):
+
+ | # | Dataset Name | AIHub Link |
+ |---|---|---|
+ | 1 | Medical and Legal Professional Books Corpus | [71487](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=71487) |
+ | 2 | Financial and Legal Document Machine Reading Comprehension | [71610](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=71610) |
+ | 3 | Large-scale Web-based Korean Corpus | [624](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=624) |
+ | 4 | Large-scale Book-based Korean Corpus | [653](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=653) |
+ | 5 | National Records Large-scale AI Learning Corpus | [71788](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=71788) |
+ | 6 | Korean Generation-based Common Sense Reasoning Dataset | [459](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=459) |
+ | 7 | Multi-session Dialogue Corpus | [pkg1](https://aihub.or.kr/aihubdata/data/view.do?currMenu=511&topMenu=100&aihubDataSe=dataPckage&dataPckageSn=1) |
+ | 8 | Essential Medical Knowledge Data (142GB) | [71875](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=71875) |
+ | 9 | Specialized Medical Knowledge Data (206GB) | [71874](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=71874) |
+ | 10 | Korean Dialogue Dataset | [272](https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=272) |
+
+ > All datasets are publicly available via [AIHub](https://aihub.or.kr) (registration required).
  ## License
 
  This model is built on a Gemma-family architecture and is distributed under the [**Gemma Terms of Use**](https://ai.google.dev/gemma/terms). By using this model, you agree to the Gemma license terms.
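
Nine of the ten AIHub links in the Training Datasets table share one URL pattern that differs only in the `dataSetSn` query parameter (row 7 links to a data package instead). As a minimal sketch, the snippet below rebuilds those nine links from their IDs. The helper name `dataset_url` and the constant names are hypothetical and are not part of the README or of any AIHub API; the base URL and IDs are copied from the table.

```python
from urllib.parse import urlencode

# Base URL copied from the links in the Training Datasets table.
AIHUB_VIEW_URL = "https://aihub.or.kr/aihubdata/data/view.do"

# dataSetSn values copied from the table. Row 7 (Multi-session Dialogue Corpus)
# is a data package with a different query string, so it is left out here.
DATASET_IDS = [71487, 71610, 624, 653, 71788, 459, 71875, 71874, 272]


def dataset_url(dataset_sn: int) -> str:
    """Build the AIHub view URL for one dataset ID (hypothetical helper)."""
    return f"{AIHUB_VIEW_URL}?{urlencode({'dataSetSn': dataset_sn})}"


if __name__ == "__main__":
    for sn in DATASET_IDS:
        print(dataset_url(sn))
```

Downloading the data itself still requires an AIHub account, as the README note states; this sketch only reproduces the public landing-page links.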