---
license: gemma
language:
- ko
- en
base_model:
- google/gemma-3-12b-it
pipeline_tag: text-generation
tags:
- korean
- defense
- instruction-tuned
- domain-adaptive
library_name: transformers
---

# KorDef-LLM

**Korean Defense Domain Instruction-Tuned Language Model**

KorDef-LLM is a 12B-parameter language model fine-tuned from `google/gemma-3-12b-it` on a domain-specific instruction corpus drawn from publicly available, unclassified Korean defense administrative-rule (행정규칙) and educational PDFs.

This model accompanies the manuscript **"An Open Pipeline for Domain-Adaptive Instruction Tuning of Korean Defense Large Language Models"** (submitted to PeerJ Computer Science). It is released for **research and educational use** only, with the limitations and out-of-scope uses described below.

## Released Artifacts

| Component | Location |
|---|---|
| Model weights (this page) | [HuggingFace `graphuser/kordef-12b`](https://huggingface.co/graphuser/kordef-12b) |
| Instruction corpus + evaluation set | [Zenodo `10.5281/zenodo.20083055`](https://doi.org/10.5281/zenodo.20083055) |
| Inference and evaluation code | [GitHub `gshwan22/KorDef-LLM`](https://github.com/gshwan22/KorDef-LLM) |

## Model Description

- **Base model**: `google/gemma-3-12b-it` (Gemma-3, 12B parameters, instruction-tuned)
- **Fine-tuning**: Supervised instruction tuning (full SFT, FSDP distributed; not LoRA)
- **Domain**: Korean defense administrative rules, doctrine documents, and educational materials (all publicly available, unclassified)
- **Training corpus**: Combined prompt-generated and document-grounded instruction–response pairs; the prompt-generated subset (235,367 pairs) is publicly released via Zenodo
- **Training steps**: 7,875

## Intended Use

KorDef-LLM is intended for:

- Research on Korean professional-domain language modeling and domain adaptation
- Educational reference-style question answering over Korean defense administrative-rule documents
- Comparison studies and reproducibility evaluations in Korean NLP
- A base model for further research-oriented fine-tuning in related Korean professional domains

The model is **NOT** intended for:

- Autonomous decision-making in military operations, procurement, maintenance, targeting, or any safety-critical procedure
- Generation of classified, sensitive, or operationally restricted content
- Deployment in real-world high-stakes settings without institutional review, retrieval grounding, and human expert oversight
- Any use that violates applicable laws, regulations, or the Gemma Terms of Use

## Evaluation Summary

KorDef-LLM was evaluated on two complementary benchmarks; full details are reported in the accompanying paper.

### KMMLU (general Korean reasoning, 5-shot)

| Model | KMMLU (%) |
|---|---|
| A.X-4.0-Light | 55.7 |
| **KorDef-LLM (ours)** | **48.0** |
| Gemma-3-12B (base) | 46.0 |
| Qwen-2.5-7B-Instruct | 45.8 |
| EXAONE-3.5-7.8B-Instruct | 45.3 |
| Llama-3.1-8B-Instruct | 41.6 |

KorDef-LLM ranks second among six compared models on KMMLU, exceeding the base model and three additional open Korean/multilingual baselines, indicating that domain-adaptive instruction tuning preserves general Korean reasoning ability.

### Source-Grounded Evaluation (N=323, public defense PDFs)

Paired comparison against the base Gemma-3-12B under identical context, prompt, and decoding conditions:

| Metric | Gemma-3-12B | **KorDef-LLM** | Δ | p (Wilcoxon) |
|---|---|---|---|---|
| Token-F1 | 0.398 | **0.428** | +0.030 | < 1e-7 |
| ROUGE-L | 0.380 | **0.402** | +0.022 | < 1e-3 |
| Character 3-gram Jaccard | 0.258 | **0.281** | +0.023 | < 1e-4 |
| Evidence-token recall | 0.534 | 0.549 | +0.015 | 0.108 (n.s.) |
| Mean answer tokens | 45.2 | 41.2 | −4.0 | < 1e-11 |

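KorDef-LLM shows statistically significant improvements over the base model on the three content-overlap metrics (Token-F1, ROUGE-L, and character 3-gram Jaccard) and produces significantly shorter answers. The change in evidence-token recall is not significant, and refusal rate (not tabulated here) likewise shows no significant change.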
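For readers who want to sanity-check numbers of this kind on their own outputs, the following is a minimal sketch of three of the metrics named in the table. It assumes whitespace tokenization and set-based trigram overlap; the function names are hypothetical, and the exact tokenization and normalization used in the paper may differ, so scores will not necessarily match the table.

```python
from collections import Counter


def tokens(text: str) -> list[str]:
    # Assumption: whitespace tokenization; the paper's tokenizer may differ.
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (multiset overlap)."""
    pred, ref = tokens(prediction), tokens(reference)
    if not pred or not ref:
        return float(pred == ref)  # 1.0 only when both are empty
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def char_ngrams(text: str, n: int = 3) -> set[str]:
    compact = "".join(text.split())  # drop whitespace before slicing
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def char_3gram_jaccard(prediction: str, reference: str) -> float:
    """Jaccard similarity between the two sets of character trigrams."""
    a, b = char_ngrams(prediction), char_ngrams(reference)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def evidence_token_recall(prediction: str, evidence: str) -> float:
    """Fraction of distinct evidence tokens that appear in the prediction."""
    ev = set(tokens(evidence))
    if not ev:
        return 0.0
    return len(ev & set(tokens(prediction))) / len(ev)
```

Each function scores one prediction; averaging over the evaluation items for both models gives per-model means, and the per-item differences are what a paired test such as Wilcoxon signed-rank would consume.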

In a cross-model comparison against five baselines (Gemma-3-12B, EXAONE-3.5-7.8B, Qwen-2.5-7B, Llama-3.1-8B, A.X-4.0-Light) on the same evaluation set, **KorDef-LLM achieves the highest mean evidence-token recall**, the metric most directly tied to source faithfulness in source-grounded QA. The train/eval overlap audit confirms zero exact question, zero exact answer, and zero near-question (Jaccard ≥ 0.80) overlap between the training corpus and the evaluation set.
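The overlap audit described above can be approximated with a brute-force check like the one below. This is a sketch under stated assumptions, not the authors' audit script: it assumes Jaccard similarity is computed over whitespace-token sets and that normalization only collapses whitespace; `audit_overlap` and its input format (lists of question/answer tuples) are hypothetical.

```python
def normalize(text: str) -> str:
    # Assumption: normalization only collapses runs of whitespace.
    return " ".join(text.split())


def token_jaccard(a: str, b: str) -> float:
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def audit_overlap(train, evaluation, threshold=0.80):
    """Count eval items that overlap with training items.

    train / evaluation: lists of (question, answer) tuples.
    """
    train_q = {normalize(q) for q, _ in train}
    train_a = {normalize(a) for _, a in train}
    report = {"exact_question": 0, "exact_answer": 0, "near_question": 0}
    for q, a in evaluation:
        nq, na = normalize(q), normalize(a)
        if nq in train_q:
            report["exact_question"] += 1
        elif any(token_jaccard(nq, tq) >= threshold for tq in train_q):
            # Near-duplicate question that is not an exact match.
            report["near_question"] += 1
        if na in train_a:
            report["exact_answer"] += 1
    return report
```

A clean audit in the sense reported above corresponds to all three counts being zero. The near-duplicate scan is quadratic in the two set sizes, which is tolerable for a few hundred evaluation items but would likely need an approximate method (for example MinHash) at larger scale.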

## Known Limitations

1. **Effect sizes are modest.** The improvements over the base model on the source-grounded evaluation are statistically significant but small in absolute magnitude (+0.030 Token-F1, roughly 3 points on a 0-100 scale). The model is not a substitute for retrieval-augmented generation or human expert review.

2. **Evidence recall and refusal rate are not significantly improved.** Although source-grounded inference shows favorable trends on these two source-faithfulness metrics, neither reaches statistical significance against the base model. Source faithfulness in any deployed system should be enforced via retrieval grounding and explicit citation requirements.

3. **The training corpus is only partially released.** Of the full training corpus, only the prompt-generated subset is publicly available via Zenodo, together with the source manifest, segments, and evaluation set. The document-grounded subset is not part of the public release. The model weights are released here.

4. **No human expert evaluation.** Evaluation was conducted using automatic metrics only. Any future deployment in an operational or educational context should be validated by qualified Korean defense doctrine experts.

5. **Defense-domain language specificity.** The model is tuned for Korean defense administrative-rule and educational text style. It may produce overly formal or excessively verbose responses outside this domain.

6. **Hallucination risk.** Like all large language models, KorDef-LLM may generate plausible-sounding but factually incorrect content, especially when asked about topics not covered by its training corpus or when source context is incomplete.

## Safety Considerations

- **Dual-use awareness**: Defense-domain language modeling carries inherent dual-use considerations. The released model and corpus contain only publicly available administrative-rule and educational content, not operational, tactical, or classified information.

- **Recommended deployment pattern**: For any real-world use, we recommend retrieval-augmented generation with explicit source citation, deployment within controlled (e.g., air-gapped) infrastructure, and human expert review of outputs in any consequential workflow.

- **Memorization and data extraction**: The model has been trained on Korean defense administrative-rule text. While the training data is unclassified, users should still exercise caution regarding prompts that attempt to extract training data verbatim.

- **Prompt injection**: As with all instruction-tuned LLMs, the model may be vulnerable to prompt-injection attacks in deployed agentic settings. Defensive measures (input sanitization, instruction layering, output filtering) are recommended.
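As one narrow illustration of the input sanitization mentioned above, the sketch below rewrites the section markers used by the prompt template under How to Use (`[출처]`, `[질문]`, `[답변]`) when they appear inside untrusted text, so a retrieved passage cannot open a fake section. The helper name `neutralize_markers` is hypothetical, and this covers only marker spoofing; it should not be read as a complete defense against prompt injection.

```python
import re

# Section markers used by the source-grounded prompt template in this card.
_MARKERS = re.compile(r"\[\s*(출처|질문|답변)\s*\]")


def neutralize_markers(untrusted: str) -> str:
    """Rewrite template markers inside untrusted text.

    '[질문]' becomes '(질문)', so the text can no longer be mistaken
    for a section boundary of the prompt template.
    """
    return _MARKERS.sub(lambda m: f"({m.group(1)})", untrusted)
```

Applying this to retrieved excerpts before they are placed in the `[출처]` slot keeps the template structure under the caller's control; output filtering and instruction layering would still be separate measures.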

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "graphuser/kordef-12b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="bfloat16",
    device_map={"": 0},  # single GPU; avoids CPU offload
)

# Source-grounded prompting (recommended pattern)
prompt = """다음 [출처]를 참고하여 [질문]에 정확히 답변하시오.

[출처]
(여기에 관련 행정규칙 또는 문서 발췌 삽입)

[질문]
(여기에 질문 작성)

[답변]"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=192,
    do_sample=False,
    repetition_penalty=1.05,
)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
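When many excerpt/question pairs need to be prompted, the template above can be wrapped in a small helper. This is a minimal sketch; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names introduced here, and the template text is copied from the example above.

```python
PROMPT_TEMPLATE = (
    "다음 [출처]를 참고하여 [질문]에 정확히 답변하시오.\n\n"
    "[출처]\n{source}\n\n"
    "[질문]\n{question}\n\n"
    "[답변]"
)


def build_prompt(source: str, question: str) -> str:
    """Fill the source-grounded template with one excerpt and one question."""
    source, question = source.strip(), question.strip()
    if not source or not question:
        # An empty source would silently turn this into ungrounded prompting.
        raise ValueError("both source and question must be non-empty")
    return PROMPT_TEMPLATE.format(source=source, question=question)
```

The returned string can be passed to `tokenizer(...)` and `model.generate(...)` exactly as `prompt` is in the example above.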

## Citation

If you use this model, please cite the paper and the dataset:

```bibtex
@article{gwak2026kordef,
  title   = {An Open Pipeline for Domain-Adaptive Instruction Tuning of Korean Defense Large Language Models},
  author  = {Gwak, Sang-Hwan and Choi, Ji-Young and Jeong, Chang-Hoo and Lee, Gunwoo and Kim, Ina and Lee, Kyung-Ha},
  journal = {PeerJ Computer Science (submitted)},
  year    = {2026}
}

@dataset{kordef_corpus_2026,
  title     = {KorDef-LLM: Korean Defense Domain Instruction Corpus and Source-Grounded Evaluation Set},
  author    = {Gwak, Sang-Hwan and others},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20083055}
}
```

## License

- **Model weights**: Gemma Terms of Use (the model is fine-tuned from `google/gemma-3-12b-it`). Users must comply with the [Gemma Terms](https://ai.google.dev/gemma/terms).
- **Released corpus** (Zenodo): CC-BY-4.0
- **Code** (GitHub): MIT

## Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT and DAPA) (No. RS-2024-00452972).

## Contact

For questions about this model or the accompanying paper, please contact the corresponding author at `kyongha@kisti.re.kr` or open an issue on the [GitHub repository](https://github.com/gshwan22/KorDef-LLM).