gemma4-e2b-mud / DATASET_GUIDE.md
sangwon1472's picture
Add dataset guide and model card links
b7b9673
# Dataset Guide
์ด ๋ฌธ์„œ๋Š” `gemma4-e2b-mud` ์ €์žฅ์†Œ์— ํ•จ๊ป˜ ๋“ค์–ด ์žˆ๋Š” ๋ฐ์ดํ„ฐ์…‹ ๋ฌถ์Œ์„ ์„ค๋ช…ํ•œ๋‹ค.
๋ชฉํ‘œ๋Š” ๋‹จ์ˆœํžˆ โ€œ์–ด๋–ค ํŒŒ์ผ์ด ์žˆ๋‹คโ€๋ฅผ ๋„˜์–ด์„œ, **๊ฐ ํŒŒ์ผ์ด ์–ด๋–ค ์—ญํ• ์„ ํ•˜๋Š”์ง€**, **์–ด๋–ป๊ฒŒ ์ •๋ฆฌ๋˜์—ˆ๋Š”์ง€**, **๋น„์Šทํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒˆ๋กœ ๋งŒ๋“ค๋ ค๋ฉด ๋ฌด์—‡์„ ์ง€์ผœ์•ผ ํ•˜๋Š”์ง€**๋ฅผ ํ•œ ๋ฒˆ์— ์ดํ•ดํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.
์ด ์ €์žฅ์†Œ๋Š” ๋ชจ๋ธ ์ €์žฅ์†Œ์ด๋ฏ€๋กœ, ์—ฌ๊ธฐ์˜ ๋ฐ์ดํ„ฐ์…‹์€ ์—„๋ฐ€ํ•œ ๋…๋ฆฝ dataset repo ๋ผ๊ธฐ๋ณด๋‹ค **์žฌํ˜„๊ณผ ํ™•์žฅ ์‹คํ—˜์„ ์œ„ํ•œ companion dataset bundle** ์— ๊ฐ€๊น๋‹ค.
์ฆ‰, ๋ชจ๋ธ์„ ์ดํ•ดํ•˜๊ณ  ๋‹ค์‹œ ํ•™์Šตํ•˜๊ฑฐ๋‚˜ ๋ณ€ํ˜•ํ•˜๊ณ  ์‹ถ์€ ์‚ฌ๋žŒ์—๊ฒŒ ์‹ค์šฉ์ ์ธ ์ถœ๋ฐœ์ ์„ ์ œ๊ณตํ•˜๋Š” ๋ฐ ์ดˆ์ ์„ ๋‘”๋‹ค.
## ํ•œ๋ˆˆ์— ๋ณด๊ธฐ
ํฌํ•จ๋œ ๋Œ€ํ‘œ ํŒŒ์ผ์€ ์•„๋ž˜ ๋„ค ๊ฐ€์ง€๋‹ค.
- `gemma-mud-colab-starter/dataset/combined_1000.jsonl`
- `gemma-mud-colab-starter/dataset/combined_1000.unsloth_chatml_dedup.jsonl`
- `gemma-mud-colab-starter/dataset/combined_1000.unsloth_gemma4_messages_dedup.jsonl`
- `gemma-mud-colab-starter/dataset/gemma4_mud_alpaca_100.jsonl`
์—ญํ• ์„ ์š”์•ฝํ•˜๋ฉด:
- `combined_1000.jsonl`
์›๋ณธ์— ๊ฐ€๊นŒ์šด 1000ํ–‰ ๋ฉ”์‹œ์ง€ ๋ฐ์ดํ„ฐ
- `combined_1000.unsloth_chatml_dedup.jsonl`
์ •ํ™• ์ค‘๋ณต ์ œ๊ฑฐ ํ›„ ChatML ์Šคํƒ€์ผ๋กœ ์ •๋ฆฌํ•œ 963ํ–‰ ํ•™์Šต๋ณธ
- `combined_1000.unsloth_gemma4_messages_dedup.jsonl`
์œ„์™€ ๊ฐ™์€ ๋‚ด์šฉ์ด์ง€๋งŒ Gemma 4 / Unsloth ์ชฝ์—์„œ ๋” ์•ˆ์ „ํ•˜๊ฒŒ ์“ฐ๊ธฐ ์œ„ํ•œ typed text messages ๋ฒ„์ „
- `gemma4_mud_alpaca_100.jsonl`
๊ฐ€์žฅ ์ž‘์€ ์˜ˆ์‹œ์šฉ 100ํ–‰ Alpaca ๊ณ„์—ด ๋ฐ์ดํ„ฐ
## ์‹ค์ œ ํฌํ•จ ์ˆ˜๋Ÿ‰
ํ˜„์žฌ ์ €์žฅ์†Œ์— ํฌํ•จ๋œ ํŒŒ์ผ ๊ธฐ์ค€:
- `combined_1000.jsonl`: `1000` ํ–‰
- `combined_1000.unsloth_chatml_dedup.jsonl`: `963` ํ–‰
- `combined_1000.unsloth_gemma4_messages_dedup.jsonl`: `963` ํ–‰
- `gemma4_mud_alpaca_100.jsonl`: `100` ํ–‰
์ฆ‰, ์›๋ณธ `1000` ํ–‰์—์„œ **์ •ํ™• ์ค‘๋ณต 37ํ–‰์ด ์ œ๊ฑฐ๋œ ์ •๋ฆฌ๋ณธ**์ด ํ˜„์žฌ ํ•™์Šต์šฉ ๊ธฐ์ค€๋ณธ์œผ๋กœ ์“ฐ์ธ๋‹ค.
## ๋ฐ์ดํ„ฐ ํ˜•์‹
## 1. `messages` ํ˜•์‹ ๋ฐ์ดํ„ฐ
`combined_1000.jsonl` ์™€ dedup ๋ฒ„์ „ 2์ข…์€ ๋ชจ๋‘ ๊ธฐ๋ณธ์ ์œผ๋กœ `messages` ๊ตฌ์กฐ๋ฅผ ๋”ฐ๋ฅธ๋‹ค.
๊ฐœ๋ณ„ ์ƒ˜ํ”Œ์€ ๋ณดํ†ต ์•„๋ž˜ 3ํ„ด์œผ๋กœ ์ด๋ฃจ์–ด์ง„๋‹ค.
1. `system`
2. `user`
3. `assistant`
์˜ˆ:
```json
{
"messages": [
{
"role": "system",
"content": "๋‹น์‹ ์€ ์šฐ์ฃผํ•ญํ–‰ ํ…์ŠคํŠธ MUD์˜ ํ•ญ๋กœ ์•ˆ๋‚ด์ž์ด์ž ์„ธ๊ณ„๊ด€ ํ•ด์„ค์ž๋‹ค. ๋‹ต๋ณ€์€ ํ•œ๊ตญ์–ด๋กœ ํ•˜๋ฉฐ, ์„œ์ •์„ฑ๊ณผ ์‹ค์šฉ์„ฑ์„ ํ•จ๊ป˜ ์ง€๋‹Œ๋‹ค. ํ”Œ๋ ˆ์ด์— ๋„์›€์ด ๋˜๋Š” ์ •๋ณด์™€ ๋ถ„์œ„๊ธฐ ๋ฌ˜์‚ฌ๋ฅผ ํ•จ๊ป˜ ์ค€๋‹ค."
},
{
"role": "user",
"content": "First Fire Horizon์ด ์–ด๋–ค ๊ณณ์ธ์ง€ ์„ค๋ช…ํ•ด์ค˜."
},
{
"role": "assistant",
"content": "First Fire Horizon์€ ์‹œ์ž‘ ๊ฑฐ์ ์œผ๋กœ, ๋ถ‰์€ ์ž”๊ด‘๊ณผ ์˜ค๋ž˜๋œ ๊ธฐ๋™์Œ์ด ๋‚จ์•„ ์žˆ๋Š” ์ถœํ•ญ์˜ ๋ถ„์œ„๊ธฐ๊ฐ€ ๊ฐ๋„๋Š” ์žฅ์†Œ๋‹ค."
}
]
}
```
### Typed text messages ๋ฒ„์ „
`combined_1000.unsloth_gemma4_messages_dedup.jsonl` ์—์„œ๋Š” `content` ๊ฐ€ ๋‹จ์ˆœ ๋ฌธ์ž์—ด์ด ์•„๋‹ˆ๋ผ ์•„๋ž˜์ฒ˜๋Ÿผ ๋“ค์–ด๊ฐ„๋‹ค.
```json
{
"role": "user",
"content": [
{
"type": "text",
"text": "First Fire Horizon์ด ์–ด๋–ค ๊ณณ์ธ์ง€ ์„ค๋ช…ํ•ด์ค˜."
}
]
}
```
์ด ๋ฒ„์ „์€ Gemma 4 ๊ณ„์—ด์˜ ํ…œํ”Œ๋ฆฟ/processor ๊ฒฝ๋กœ์—์„œ `content` ํƒ€์ž… ํ•ด์„์ด ๋” ๋ช…ํ™•ํ•ด์ง€๋„๋ก ์ค€๋น„ํ•œ ํŒŒ์ผ์ด๋‹ค.
## 2. Alpaca ๊ณ„์—ด ์˜ˆ์‹œ ๋ฐ์ดํ„ฐ
`gemma4_mud_alpaca_100.jsonl` ์€ ์•„๋ž˜ ํ•„๋“œ๋ฅผ ๊ฐ€์ง„ ์†Œํ˜• ์˜ˆ์‹œ ๋ฐ์ดํ„ฐ๋‹ค.
- `instruction`
- `input`
- `output`
- `text`
์˜ˆ:
```json
{
"instruction": "look",
"input": "",
"output": "๋‹น์‹ ์€ ์ฒซ ๋ถˆ์˜ ์ง€ํ‰์˜ ๊ด€์ธก ๊ฐ‘ํŒ์— ์„œ ์žˆ๋‹ค. ..."
}
```
์ด ํŒŒ์ผ์€ ๊ตฌ์กฐ๊ฐ€ ๋‹จ์ˆœํ•ด์„œ ์„ค๋ช…๊ณผ ์ฒซ ์‹คํ—˜์—๋Š” ์ข‹์ง€๋งŒ, Gemma 4 / Unsloth ๊ธฐ์ค€์˜ ์‹ค์ œ ์šด์˜ ์‹คํ—˜์—์„œ๋Š” `messages` ํ˜•์‹ ์ชฝ์ด ๋” ์ž์—ฐ์Šค๋Ÿฝ๋‹ค.
## ๋ฐ์ดํ„ฐ์˜ ๊ฒฐ
ํ˜„์žฌ `combined_1000.unsloth_chatml_dedup.jsonl` ๊ธฐ์ค€์˜ ๊ด€์ฐฐ๊ฐ’์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค.
- ์ด ํ–‰ ์ˆ˜: `963`
- ์—ญํ•  ์ˆ˜:
- `system`: `963`
- `user`: `963`
- `assistant`: `963`
- ํ‰๊ท  user ๊ธธ์ด: ์•ฝ `33.0` ๋ฌธ์ž
- ํ‰๊ท  assistant ๊ธธ์ด: ์•ฝ `114.6` ๋ฌธ์ž
- ์ตœ๋Œ€ user ๊ธธ์ด: `60`
- ์ตœ๋Œ€ assistant ๊ธธ์ด: `362`
- ๊ณ ์œ  system prompt ์ˆ˜: `4`
์ฆ‰, ์ด ๋ฐ์ดํ„ฐ์…‹์€ ๊ธธ๊ณ  ๋ณต์žกํ•œ multi-turn ๋Œ€ํ™”๊ฐ€ ์•„๋‹ˆ๋ผ, **์งง์€ ๋‹จ์ผ ์š”์ฒญ -> ๋น„๊ต์  ์งง์€ ๋Œ€๋‹ต** ๊ตฌ์กฐ๋ฅผ ์ค‘์‹ฌ์œผ๋กœ ํ•œ๋‹ค.
## system prompt ๋ถ„ํฌ
์ •๋ฆฌ๋ณธ ๊ธฐ์ค€์œผ๋กœ system prompt ๋Š” 4์ข…์ด๋‹ค.
1. ํ•ญ๋กœ ์•ˆ๋‚ด์ž / ์„ธ๊ณ„๊ด€ ํ•ด์„ค์ž: `250`
2. NPC ๋Œ€์‚ฌ ์ž‘์„ฑ์ž: `250`
3. ํ€˜์ŠคํŠธ ๋ฐ ์‹œ์Šคํ…œ ๋ฉ”์‹œ์ง€ ์ž‘์„ฑ์ž: `250`
4. ํ™˜๊ฒฝ ์„œ์ˆ ๊ฐ€: `213`
์ด ๋ถ„ํฌ๋ฅผ ๋ณด๋ฉด ๋ฐ์ดํ„ฐ์…‹์ด ๋‹จ์ผ ๋ฌธ์ฒด๋งŒ ๋ฐ€์–ด๋ถ™์ด๋Š” ํ˜•ํƒœ๋Š” ์•„๋‹ˆ๋‹ค.
๋Œ€์‹  ์•„๋ž˜ ๋„ค ์ธต์„ ํ•จ๊ป˜ ํ•™์Šตํ•˜๋„๋ก ์„ค๊ณ„๋˜์–ด ์žˆ๋‹ค.
- lore / setting explanation
- NPC voice
- quest / system flavor text
- environment / look / scan narration
## ์‚ฌ์šฉ์ž ์ž…๋ ฅ ๊ฒฝํ–ฅ
์ •๋ฆฌ๋ณธ์—์„œ ์ž์ฃผ ๋ณด์ด๋Š” ์ž…๋ ฅ ํ‚ค์›Œ๋“œ ์˜ˆ์‹œ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.
- `์‹ ํ˜ธ`: `60`
- `ํ€˜์ŠคํŠธ`: `50`
- `์†Œ๋ฌธ`: `50`
- `์—ฐ๋ฃŒ`: `48`
- `ํ•ญ๋กœ`: `33`
- `ํ‘œ์ง€๋“ฑ`: `22`
- `๋„ํ‚น`: `20`
- `ํ‰ํŒ`: `18`
- `์ˆ˜๋ฆฌ`: `14`
- `๊ฑฐ๋ž˜`: `14`
- `์ „ํˆฌ`: `10`
์ด ๋ถ„ํฌ๋Š” ์ด ๋ฐ์ดํ„ฐ์…‹์ด ๋‹จ์ง€ โ€œ๋กœ์–ด ์„ค๋ช…โ€๋งŒ ๋ชจ์•„๋‘” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ์‹ค์ œ ํ…์ŠคํŠธ MUD์—์„œ ์ž์ฃผ ๋‚˜์˜ค๋Š” ์•„๋ž˜ ์ž…๋ ฅ๊ตฐ์„ ํ•จ๊ป˜ ํฌํ•จํ•œ๋‹ค๋Š” ๋œป์ด๋‹ค.
- ์žฅ์†Œ ์„ค๋ช…
- ์‹ ํ˜ธ/์Šค์บ”
- ํ€˜์ŠคํŠธ์™€ ์‹œ์Šคํ…œ ํ…์ŠคํŠธ
- ํ•ญ๋กœ/๋„ํ‚น/์—ฐ๋ฃŒ ๊ฐ™์€ ํ”Œ๋ ˆ์ด ๊ด€๋ จ ํ‘œํ˜„
- ํ‰ํŒ/์†Œ๋ฌธ/๊ฑฐ๋ž˜ ๋“ฑ ์„ธ๊ณ„ ์ƒํ˜ธ์ž‘์šฉ
## ์ด ๋ฐ์ดํ„ฐ์…‹์ด ์ž˜ํ•˜๋Š” ๊ฒƒ
- ์งง๊ณ  ๋ถ„์œ„๊ธฐ ์žˆ๋Š” ์‘๋‹ต
- ํ•œ๊ตญ์–ด ์„ธ๊ณ„๊ด€ ์„ค๋ช…
- MUD ํ†ค์˜ NPC ๋Œ€ํ™”
- room look / signal / rumor ๊ฐ™์€ ์งง์€ ๋ช…๋ น ๋ฐ˜์‘
- ์„ค์ • ๋ฌธ์žฅ๊ณผ ๊ธฐ๋Šฅ ์„ค๋ช…์ด ์„ž์ธ ํ…์ŠคํŠธ
## ์ด ๋ฐ์ดํ„ฐ์…‹๋งŒ์œผ๋กœ๋Š” ๋ถ€์กฑํ•œ ๊ฒƒ
์ด ๋ฐ์ดํ„ฐ์…‹์€ โ€œ์—”์ง„ ๋Œ€์ฒดโ€๊ฐ€ ์•„๋‹ˆ๋ผ โ€œ์„œ์‚ฌ ๋ ˆ์ด์–ด ๋ณด๊ฐ•โ€์„ ๋ชฉํ‘œ๋กœ ํ•˜๋ฏ€๋กœ, ์•„๋ž˜๋Š” ์˜๋„์ ์œผ๋กœ ์•ฝํ•˜๊ฑฐ๋‚˜ ๋น„์›Œ์ ธ ์žˆ๋Š” ์˜์—ญ์ด๋‹ค.
- ์ •๋ฐ€ํ•œ ์ƒํƒœ ์ถ”์ 
- ์ˆ˜์น˜ ๊ธฐ๋ฐ˜ ์ „ํˆฌ ๊ทœ์น™
- ๋ณด์ƒ ๊ณ„์‚ฐ
- ํ€˜์ŠคํŠธ ์™„๋ฃŒ ํŒ์ •
- ์žฅ๊ธฐ multi-turn ๋ฉ”๋ชจ๋ฆฌ
- ์ผ๋ฐ˜ ์ƒ์‹ QA
์ฆ‰, ๋ชจ๋ธ์ด ์•„๋ž˜๋ฅผ ์ž˜ํ•˜๊ฒŒ ํ•˜๊ณ  ์‹ถ๋‹ค๋ฉด ๋ณ„๋„ ๋ฐ์ดํ„ฐ๊ฐ€ ๋” ํ•„์š”ํ•˜๋‹ค.
- ์žฅ๊ธฐ ๋Œ€ํ™”
- ์ƒ์„ธ ์„ธ๊ณ„ ๊ทœ์น™ ์ค€์ˆ˜
- ์—„๊ฒฉํ•œ JSON ์ถœ๋ ฅ
- ํˆด ํ˜ธ์ถœ / ํ•จ์ˆ˜ ํ˜ธ์ถœ
- ์ „ํˆฌ ๋กœ๊ทธ ํ•ด์„ค
## ์™œ dedup ๋ฒ„์ „์ด ํ•„์š”ํ•œ๊ฐ€
์›๋ณธ `1000` ํ–‰์„ ๊ทธ๋Œ€๋กœ ์“ฐ๋ฉด, ์ •ํ™•ํžˆ ๊ฐ™์€ ์ค„์ด ๋ฐ˜๋ณต๋˜๋Š” ์ƒ˜ํ”Œ์ด ๋“ค์–ด๊ฐ€ ์žˆ๋‹ค.
์ ์€ ๊ทœ๋ชจ์˜ ๋ฏธ์„ธ์กฐ์ •์—์„œ๋Š” ์ด๋Ÿฐ ์ค‘๋ณต์ด ํŠน์ • ํ‘œํ˜„์„ ๊ณผํ•˜๊ฒŒ ๊ฐ•ํ™”ํ•ด์„œ, ๋ชจ๋ธ์ด ์‹ค์ œ๋ณด๋‹ค ๋” ์ข์€ ๋ฌธ์ฒด๋ฅผ ๋‚ด๋„๋ก ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค.
๊ทธ๋ž˜์„œ ์ •๋ฆฌ๋ณธ์—์„œ๋Š”:
- exact duplicate ์ œ๊ฑฐ
- ๊ณต๋ฐฑ/์ค„๋ฐ”๊ฟˆ ์ •๋ฆฌ
- role/content ๊ตฌ์กฐ ๊ฒ€์ฆ
- Gemma 4์šฉ typed text messages ๋ฒ„์ „ ์ถ”๊ฐ€
๋ฅผ ๊ฑฐ์นœ `963` ํ–‰์„ ๊ธฐ์ค€๋ณธ์œผ๋กœ ์‚ผ์•˜๋‹ค.
## ์™œ ChatML ๋ฒ„์ „๊ณผ Gemma 4 typed ๋ฒ„์ „์„ ๋‘˜ ๋‹ค ๋‘๋Š”๊ฐ€
๋‘˜์˜ ๋‚ด์šฉ์€ ์‚ฌ์‹ค์ƒ ๊ฐ™์ง€๋งŒ, ์šฉ๋„๊ฐ€ ๋‹ค๋ฅด๋‹ค.
### `combined_1000.unsloth_chatml_dedup.jsonl`
- ์‚ฌ๋žŒ์ด ์ฝ๊ธฐ ํŽธํ•˜๋‹ค
- ์ผ๋ฐ˜์ ์ธ chat fine-tuning ํฌ๋งท๊ณผ ๊ฐ€๊น๋‹ค
- ๋””๋ฒ„๊น…์ด ์‰ฝ๋‹ค
### `combined_1000.unsloth_gemma4_messages_dedup.jsonl`
- Gemma 4 processor/template ๊ฒฝ๋กœ์—์„œ ๋” ์•ˆ์ „ํ•˜๋‹ค
- `content: [{type: "text", text: ...}]` ๊ตฌ์กฐ๋ฅผ ๊ฐ•์ œํ•  ์ˆ˜ ์žˆ๋‹ค
- multimodal ๊ณ„์—ด ํ…œํ”Œ๋ฆฟ๊ณผ์˜ ์ถฉ๋Œ์„ ์ค„์ด๊ธฐ ์‰ฝ๋‹ค
์‹ค์ „์—์„œ๋Š” typed messages ๋ฒ„์ „์„ ๊ธฐ์ค€์œผ๋กœ ๋‘๊ณ , ํ•„์š”ํ•˜๋ฉด ์ตœ์ข… `text` ์ปฌ๋Ÿผ์„ ๋งŒ๋“ค์–ด ํ•™์Šตํ•˜๋Š” ํŽธ์ด ์•ˆ์ •์ ์ด๋‹ค.
## ์ด ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ์ƒˆ ์ƒ˜ํ”Œ์„ ๋งŒ๋“ค ๋•Œ์˜ ๊ทœ์น™
์ƒˆ ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”๊ฐ€ํ•  ๋•Œ๋Š” ์•„๋ž˜ ๊ทœ์น™์„ ์ถ”์ฒœํ•œ๋‹ค.
### 1. ๊ธฐ๋ณธ ๋‹จ์œ„๋Š” 1์ƒ˜ํ”Œ = 1system + 1user + 1assistant
๊ฐ€๋Šฅํ•˜๋ฉด ์•„๋ž˜ 3ํ„ด ๊ตฌ์กฐ๋ฅผ ์œ ์ง€ํ•œ๋‹ค.
- `system`: ์—ญํ• ๊ณผ ๋งํˆฌ
- `user`: ์งง๊ณ  ๋ช…ํ™•ํ•œ ์ž…๋ ฅ
- `assistant`: ์งง๊ณ  ํ†ค์ด ์‚ด์•„ ์žˆ๋Š” ๋‹ต๋ณ€
### 2. user๋Š” ์งง๊ฒŒ, assistant๋Š” ์‘์ถ•๋˜๊ฒŒ
์ด ๋ฐ์ดํ„ฐ์…‹์˜ ๊ฐ•์ ์€ โ€œ์งง์€ ์ž…๋ ฅ์— ์งง๊ณ  ์„ ๋ช…ํ•œ ์ถœ๋ ฅโ€์ด๋‹ค.
๋”ฐ๋ผ์„œ ์ƒˆ ์ƒ˜ํ”Œ๋„ ์•„๋ž˜ ๊ฒฐ์„ ์œ ์ง€ํ•˜๋Š” ํŽธ์ด ์ข‹๋‹ค.
- user: ์งง์€ ๋ช…๋ น, ์งง์€ ์š”์ฒญ, ํ•œ ๋ฌธ์žฅ ์งˆ๋ฌธ
- assistant: ๋‘์„ธ ๋ฌธ๋‹จ ์ดํ•˜, ๊ฐ€๋Šฅํ•˜๋ฉด ์งง์€ ์„œ์ˆ  ์ค‘์‹ฌ
### 3. ์‹œ์Šคํ…œ ์—ญํ• ์„ ์„ž์–ด๋ผ
ํ˜„์žฌ ๋ถ„ํฌ์ฒ˜๋Ÿผ ์•„๋ž˜ ๋„ค ์—ญํ• ์„ ๊ณ ๋ฅด๊ฒŒ ์„ž๋Š” ํŽธ์ด ์ข‹๋‹ค.
- ์„ธ๊ณ„๊ด€ ํ•ด์„ค์ž
- NPC ๋Œ€์‚ฌ ์ž‘์„ฑ์ž
- ํ€˜์ŠคํŠธ/์‹œ์Šคํ…œ ๋ฉ”์‹œ์ง€ ์ž‘์„ฑ์ž
- ํ™˜๊ฒฝ ์„œ์ˆ ๊ฐ€
ํ•œ ์—ญํ• ๋งŒ ๋ชฐ์•„ ๋„ฃ์œผ๋ฉด ๋ฌธ์ฒด๊ฐ€ ํ•œ์ชฝ์œผ๋กœ๋งŒ ์ ๋ฆฌ๊ธฐ ์‰ฝ๋‹ค.
### 4. ์—”์ง„ ํŒ์ •์€ ์ง์ ‘ ์„ ์–ธํ•˜์ง€ ๋ง๋ผ
๋ชจ๋ธ์ด ์•„๋ž˜๋ฅผ ์ง์ ‘ ์„ ์–ธํ•˜๋Š” ์ƒ˜ํ”Œ์€ ๊ฐ€๊ธ‰์  ํ”ผํ•œ๋‹ค.
- ํ€˜์ŠคํŠธ ์™„๋ฃŒ
- ๋ณด์ƒ ํš๋“
- ๋ฌธ ์—ด๋ฆผ
- ์ „ํˆฌ ์Šน๋ฆฌ ํ™•์ •
์ด๋Ÿฐ ๋ฌธ์žฅ์€ ๋‚˜์ค‘์— ๋ชจ๋ธ์ด ๊ฒŒ์ž„ ์—”์ง„ ์—ญํ• ๊นŒ์ง€ ๋„˜๋ณด๊ฒŒ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค.
### 5. ํ˜„์‹ค ์ƒ์‹ ๊ฐ•์˜๋ณด๋‹ค ์„ธ๊ณ„ ์•ˆ์˜ ๋ฐ˜์‘์„ ์šฐ์„ ํ•˜๋ผ
์ด ๋ฐ์ดํ„ฐ์…‹์˜ ๋ชฉ์ ์€ ๋ฒ”์šฉ ๋ฐฑ๊ณผ์‚ฌ์ „ QA๊ฐ€ ์•„๋‹ˆ๋‹ค.
๋”ฐ๋ผ์„œ `Oracle`, `signal`, `route` ๊ฐ™์€ ๋‹จ์–ด๊ฐ€ ๋‚˜์™€๋„ ํ˜„์‹ค ๊ธฐ์ˆ  ๋ฌธ์„œํ’ ์„ค๋ช…์ด ์•„๋‹ˆ๋ผ **์„ธ๊ณ„๊ด€ ๋‚ด๋ถ€ ๋ฐ˜์‘**์ด ๋‚˜์˜ค๋„๋ก ์ž‘์„ฑํ•˜๋Š” ํŽธ์ด ์ข‹๋‹ค.
## ์ถ”์ฒœ ์ž‘์„ฑ ํ…œํ”Œ๋ฆฟ
๊ฐ€์žฅ ๋ฌด๋‚œํ•œ ์ž‘์„ฑ ํ…œํ”Œ๋ฆฟ์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค.
```json
{
"messages": [
{
"role": "system",
"content": "๋‹น์‹ ์€ ์šฐ์ฃผํ•ญํ–‰ ํ…์ŠคํŠธ MUD์˜ ํ™˜๊ฒฝ ์„œ์ˆ ๊ฐ€๋‹ค. ๋‹ต๋ณ€์€ ํ•œ๊ตญ์–ด๋กœ ํ•˜๋ฉฐ, room look, scan, ์ด๋™ ๋ฉ”์‹œ์ง€, ์œ„ํ—˜ ๊ฒฝ๊ณ ๋ฅผ ์„ ๋ช…ํ•˜๊ณ  ํ”Œ๋ ˆ์ด ์นœํ™”์ ์œผ๋กœ ์ž‘์„ฑํ•œ๋‹ค."
},
{
"role": "user",
"content": "scan signal"
},
{
"role": "assistant",
"content": "์ˆ˜์‹ ๊ธฐ์—๋Š” ๋ถ€์„œ์ง„ ๋งฅ๋™ ํ•˜๋‚˜๊ฐ€ ๋‚ฎ๊ฒŒ ๊ฑธ๋ฆฐ๋‹ค. ์‹ ํ˜ธ๋Š” ์˜ค๋ž˜์ „์— ๋Š๊ฒผ์ง€๋งŒ, ๋์ž๋ฝ์—๋Š” ์•„์ง๋„ ๋ˆ„๊ตฐ๊ฐ€์˜ ํ˜ธ์ถœ ์Šต๊ด€ ๊ฐ™์€ ๊ทœ์น™์ด ๋‚จ์•„ ์žˆ๋‹ค."
}
]
}
```
Gemma 4 typed messages ๋ฒ„์ „์œผ๋กœ ๋ฐ”๋กœ ๋งŒ๋“ค๊ณ  ์‹ถ์œผ๋ฉด:
```json
{
"messages": [
{
"role": "system",
"content": [{"type": "text", "text": "๋‹น์‹ ์€ ์šฐ์ฃผํ•ญํ–‰ ํ…์ŠคํŠธ MUD์˜ NPC ๋Œ€์‚ฌ ์ž‘์„ฑ์ž๋‹ค. ๋‹ต๋ณ€์€ ํ•œ๊ตญ์–ด๋กœ ํ•˜๋ฉฐ, ์—ญํ• ์— ๋งž๋Š” ๊ฐœ์„ฑ๊ณผ ๊ฒŒ์ž„ ํ”Œ๋ ˆ์ด ๋‹จ์„œ๋ฅผ ํ•จ๊ป˜ ๋‹ด๋Š”๋‹ค."}]
},
{
"role": "user",
"content": [{"type": "text", "text": "talk oracle"}]
},
{
"role": "assistant",
"content": [{"type": "text", "text": "์˜ˆ์–ธ์ž๋Š” ์ˆจ์„ ๊ณ ๋ฅธ ๋’ค ๋‚ฎ๊ฒŒ ๋งํ•œ๋‹ค. ์ฒซ ๋ถˆ์€ ๊ธธ์„ ๋ฐํžˆ๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ๋Œ์•„์˜ฌ ์ž๋ฆฌ๋ฅผ ๊ธฐ์–ตํ•˜๊ฒŒ ๋งŒ๋“ ๋‹ค."}]
}
]
}
```
## ์ถ”์ฒœ ์ œ์ž‘ ์ˆœ์„œ
์ƒˆ ๋ฐ์ดํ„ฐ์…‹์„ ๋งŒ๋“ค ๋•Œ๋Š” ์•„๋ž˜ ์ˆœ์„œ๋ฅผ ๊ถŒํ•œ๋‹ค.
1. ์—ญํ• ๊ตฐ์„ ๋จผ์ € ๋‚˜๋ˆˆ๋‹ค
- lore / NPC / quest-system / environment
2. user ์ž…๋ ฅ๊ตฐ์„ ๋จผ์ € ์ •ํ•œ๋‹ค
- ๋ช…๋ นํ˜•, ์„ค๋ช…ํ˜•, ํžŒํŠธํ˜•
3. assistant ๊ธธ์ด๋ฅผ ์ œํ•œํ•œ๋‹ค
- ๋„ˆ๋ฌด ๊ธธ์–ด์ง€์ง€ ์•Š๊ฒŒ
4. exact duplicate ๋ฅผ ์ œ๊ฑฐํ•œ๋‹ค
5. ChatML ๋ฒ„์ „๊ณผ Gemma 4 typed ๋ฒ„์ „์„ ๋ชจ๋‘ ๋งŒ๋“ ๋‹ค
6. ์‹ค์ œ ์ถ”๋ก  ํ…Œ์ŠคํŠธ๋ฅผ ๋จผ์ € ํ•ด๋ณธ๋‹ค
- `talk oracle`
- `look`
- `scan signal`
- `First Fire Horizon์ด ์–ด๋–ค ๊ณณ์ธ์ง€ ์„ค๋ช…ํ•ด์ค˜.`
## ํ’ˆ์งˆ ์ ๊ฒ€ ์ฒดํฌ๋ฆฌ์ŠคํŠธ
๋ฐ์ดํ„ฐ์…‹์„ ์ถ”๊ฐ€ํ•˜๊ฑฐ๋‚˜ ์ˆ˜์ •ํ•  ๋•Œ๋Š” ์•„๋ž˜๋ฅผ ๋ณด๋Š” ๊ฒƒ์ด ์ข‹๋‹ค.
- ํ•œ๊ตญ์–ด ํ†ค์ด ์ผ๊ด€์ ์ธ๊ฐ€
- system ์—ญํ• ์ด ๋„ˆ๋ฌด ํ•œ์ชฝ์œผ๋กœ ์ ๋ฆฌ์ง€ ์•Š๋Š”๊ฐ€
- user ์ž…๋ ฅ์ด ์ง€๋‚˜์น˜๊ฒŒ ๊ธธ์ง€ ์•Š์€๊ฐ€
- assistant ๋‹ต๋ณ€์ด ์žฅํ™ฉํ•ด์ง€์ง€ ์•Š๋Š”๊ฐ€
- ํ˜„์‹ค ์ƒ์‹ ๊ฐ•์˜๋กœ ์ƒˆ๋Š” ์ƒ˜ํ”Œ์ด ์„ž์ด์ง€ ์•Š์•˜๋Š”๊ฐ€
- ๊ฒŒ์ž„ ์—”์ง„ ํŒ์ • ๋ฌธ์žฅ์ด ๋“ค์–ด๊ฐ€ ์žˆ์ง€ ์•Š์€๊ฐ€
- exact duplicate ๊ฐ€ ๋‚จ์•„ ์žˆ์ง€ ์•Š์€๊ฐ€
- Gemma 4 ํ…œํ”Œ๋ฆฟ ๊ฒฝ๋กœ์—์„œ `content` ํƒ€์ž… ์ถฉ๋Œ์ด ์—†๋Š”๊ฐ€
## ์ด ์ €์žฅ์†Œ์—์„œ ์‹ค์ œ๋กœ ์“ฐ๋Š” ์ค€๋น„ ์Šคํฌ๋ฆฝํŠธ
starter package ์•ˆ์—๋Š” ๋ฐ์ดํ„ฐ ์ค€๋น„ ์Šคํฌ๋ฆฝํŠธ๋„ ํฌํ•จ๋˜์–ด ์žˆ๋‹ค.
- [`gemma-mud-colab-starter/scripts/prepare_unsloth_gemma4_dataset.py`](./gemma-mud-colab-starter/scripts/prepare_unsloth_gemma4_dataset.py)
์ด ์Šคํฌ๋ฆฝํŠธ๋Š” ๋ฉ”์‹œ์ง€ ์ •๋ฆฌ, dedup, Unsloth/Gemma 4์šฉ ํฌ๋งท ์ƒ์„ฑ ๊ฐ™์€ ๋ฐ˜๋ณต ์ž‘์—…์„ ๋‹ค์‹œ ํ•˜๊ธฐ ์‰ฝ๊ฒŒ ๋‚จ๊ฒจ ๋‘” ๊ฒƒ์ด๋‹ค.
## ์•ž์œผ๋กœ ํ™•์žฅํ•˜๋ ค๋ฉด
์ง€๊ธˆ ๋ฐ์ดํ„ฐ์…‹์€ โ€œ์Šคํƒ€์ผ๊ณผ ๊ตฌ์กฐ๋ฅผ ์ตํžˆ๋Š” ์ฒซ ๋ฌถ์Œโ€์œผ๋กœ๋Š” ์ถฉ๋ถ„ํ•˜์ง€๋งŒ, ๋” ๋†’์€ ํ’ˆ์งˆ์„ ์›ํ•˜๋ฉด ์•„๋ž˜ ๋ฐฉํ–ฅ์ด ์ข‹๋‹ค.
- NPC๋ณ„ ์ „์šฉ ๋ฐ์ดํ„ฐ์…‹ ๋ถ„๋ฆฌ
- ์ง€์—ญ/๊ฑฐ์ ๋ณ„ ์„ธ๊ณ„๊ด€ ์‚ฌ์ „ ํ™•์žฅ
- quest-system ๋ฉ”์‹œ์ง€ ๋ถ„๋ฆฌ
- `look` / `rumor` / `scan` ๊ณ„์—ด์„ ๋” ๋งŽ์ด ํ™•๋ณด
- ์žฅ๊ธฐ ๋Œ€ํ™”ํ˜• ๋ฐ์ดํ„ฐ ์ถ”๊ฐ€
- ์—”์ง„ ํ›„์ฒ˜๋ฆฌ ๊ทœ์น™๊ณผ ํ•จ๊ป˜ ํ‰๊ฐ€
## ๋ชจ๋ธ ์นด๋“œ์—์„œ ์ด ๋ฌธ์„œ๋ฅผ ์™œ ๋”ฐ๋กœ ๋‘๋Š”๊ฐ€
๋ชจ๋ธ ์นด๋“œ๋Š” ๋ณดํ†ต ์•„๋ž˜๋ฅผ ๋น ๋ฅด๊ฒŒ ๋ณด์—ฌ ์ค˜์•ผ ํ•œ๋‹ค.
- ๋ฌด์—‡์„ ํ•˜๋Š” ๋ชจ๋ธ์ธ๊ฐ€
- ์–ด๋–ป๊ฒŒ ์“ฐ๋Š”๊ฐ€
- ์–ด๋–ค ํŒŒ์ผ์ด ์žˆ๋Š”๊ฐ€
๋ฐ˜๋ฉด ๋ฐ์ดํ„ฐ์…‹ ๋ถ„์„์€ ์•„๋ž˜๊ฐ€ ํ•„์š”ํ•˜๋‹ค.
- ์ •์ œ ๊ณผ์ •
- ํฌ๋งท ์ฐจ์ด
- ์ค‘๋ณต ์ œ๊ฑฐ ์ด์œ 
- ์ž‘์„ฑ ๊ทœ์น™
- ํ™•์žฅ ๋ฐฉํ–ฅ
๋‘˜์„ ํ•œ ๋ฌธ์„œ์— ๋ชจ๋‘ ๋ฐ€์–ด ๋„ฃ์œผ๋ฉด ๋ชจ๋ธ ์นด๋“œ๊ฐ€ ์ง€๋‚˜์น˜๊ฒŒ ๊ธธ์–ด์ง€๊ณ , ํ•ต์‹ฌ ์‚ฌ์šฉ์ž๊ฐ€ ๋นจ๋ฆฌ ์ •๋ณด๋ฅผ ์ฐพ๊ธฐ ์–ด๋ ค์›Œ์ง„๋‹ค.
๊ทธ๋ž˜์„œ ์ด ์ €์žฅ์†Œ๋Š” **๋ชจ๋ธ ์นด๋“œ์™€ ๋ฐ์ดํ„ฐ์…‹ ๊ฐ€์ด๋“œ๋ฅผ ๋ถ„๋ฆฌ**ํ•˜๋Š” ์ชฝ์ด ๋” ์ ํ•ฉํ•˜๋‹ค.