# Data Collection Spec
## Overview
This project manages the RAG resources for the Megumin chatbot in two layers.
- Style/persona data
  - Example: `megumin_qa_dataset.json`
- Fact/lore data
  - Example: `namuwiki_qa.json`
The fact/lore data is built by selectively collecting Namuwiki documents and converting them to QA JSON; it is then searched through a FAISS index.
## Collection Targets
Namuwiki documents can be specified by alias; the alias is converted to the canonical document title before the actual lookup.
Examples:
- `카즈마` -> `사토 카즈마`
- `아쿠아` -> `아쿠아(이 멋진 세계에 축복을!)`
- `다크니스` -> `다크니스(이 멋진 세계에 축복을!)`
- `세계관` -> `이 멋진 세계에 축복을!/설정`
- `지역` -> `이 멋진 세계에 축복을!/지역`
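
The alias lookup can be sketched as a plain dictionary mapping. This is only an illustration: the table name `ALIAS_TO_CANONICAL` and the function `resolve_title` are hypothetical, and passing unknown names through unchanged is an assumption rather than documented behavior of the crawler.

```python
# Hypothetical alias table; the entries mirror the alias examples in this section.
ALIAS_TO_CANONICAL = {
    "카즈마": "사토 카즈마",
    "아쿠아": "아쿠아(이 멋진 세계에 축복을!)",
    "다크니스": "다크니스(이 멋진 세계에 축복을!)",
    "세계관": "이 멋진 세계에 축복을!/설정",
    "지역": "이 멋진 세계에 축복을!/지역",
}


def resolve_title(name: str) -> str:
    """Return the canonical document title for an alias.

    Names without an alias entry are returned unchanged (an assumption).
    """
    key = name.strip()
    return ALIAS_TO_CANONICAL.get(key, key)
```
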
When saving, `source.title` is normalized to a short display name.
Examples:
- `사토 카즈마` -> `카즈마`
- `아쿠아(이 멋진 세계에 축복을!)` -> `아쿠아`
- `이 멋진 세계에 축복을!/설정` -> `세계관`
- `이 멋진 세계에 축복을!/지역` -> `세계관`
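
One way to express this normalization is an explicit override table plus a rule that strips the series suffix in parentheses. The names `DISPLAY_OVERRIDES` and `display_title` are hypothetical, and the suffix-stripping rule is an assumption generalized from the examples, not the script's confirmed logic.

```python
SERIES = "이 멋진 세계에 축복을!"

# Explicit overrides taken from the normalization examples in this section.
DISPLAY_OVERRIDES = {
    "사토 카즈마": "카즈마",
    f"{SERIES}/설정": "세계관",
    f"{SERIES}/지역": "세계관",
}


def display_title(canonical: str) -> str:
    """Map a canonical document title to the short name stored in source.title."""
    if canonical in DISPLAY_OVERRIDES:
        return DISPLAY_OVERRIDES[canonical]
    suffix = f"({SERIES})"
    if canonical.endswith(suffix):
        # Assumed rule: drop the "(series name)" disambiguation suffix.
        return canonical[: -len(suffix)].strip()
    return canonical
```

Note that both the lore and the region documents collapse to the same display name, so `source.title` alone cannot tell them apart.
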
## Conversion Rules
- The output format remains QA JSON.
- `question` is not phrased as a user question; it is a summary of the subsection heading, written for retrieval.
- `answer` is body text written as a neutral summary.
- Tables, images, and unnecessary decorative elements are excluded.
- Chunk length is roughly 200 characters.
- Chunk overlap is 1 to 2 sentences.
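
The length and overlap rules can be illustrated with a greedy sentence-based chunker. This is a minimal sketch under stated assumptions: the function names, the punctuation-based sentence splitter, and the default `overlap=1` are all hypothetical, and the actual crawler may segment text differently.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences, max_chars=200, overlap=1):
    """Greedily pack sentences into chunks of about max_chars characters.

    The last `overlap` sentences of each emitted chunk are repeated at the
    start of the next one. A chunk can exceed max_chars when a single
    sentence (plus the carried-over overlap) is longer than the limit.
    """
    chunks = []
    current = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```
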
## Output Files
- `data/processed/namuwiki_qa.json`
  - Consolidated QA data built from Namuwiki
- `data/processed/megumin_qa_dataset.json`
  - Existing Megumin style/persona QA data
- `data/processed/megumin_questions.faiss`
  - Question index for the style data
- `data/processed/megumin_question_answer.faiss`
  - Question+answer index for the style data
- `data/processed/megumin_questions_meta.json`
  - Mapping between the style data index and the original records
- `data/processed/namuwiki_questions.faiss`
  - Question index for the Namuwiki data
- `data/processed/namuwiki_question_answer.faiss`
  - Question+answer index for the Namuwiki data
- `data/processed/namuwiki_questions_meta.json`
  - Mapping between the Namuwiki data index and the original records
## Merge-and-Save Rules
If the output file already exists, `crawl_namuwiki_to_qa.py` merges the existing `items` with the new results.
- Primary duplicate key: `chunk_id`
- Fallback key: `source.url + chunk_index + question`
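
A minimal sketch of the two-key deduplication follows. The item layout (a dict with `chunk_id`, `chunk_index`, `question`, and a nested `source.url`) is inferred from the key names in this section. Treating a match on either key as a duplicate, and keeping the existing item over the new one, are assumptions; `merge_items` is a hypothetical name, not the script's actual function.

```python
def merge_items(existing, new):
    """Merge two item lists, dropping duplicates.

    An item is a duplicate if its chunk_id was already seen, or if its
    (source.url, chunk_index, question) triple was already seen.
    Earlier items win, so existing records take priority (an assumption).
    """
    merged = []
    seen_ids = set()
    seen_fallback = set()
    for item in list(existing) + list(new):
        chunk_id = item.get("chunk_id")
        fallback = (
            item.get("source", {}).get("url"),
            item.get("chunk_index"),
            item.get("question"),
        )
        if chunk_id is not None and chunk_id in seen_ids:
            continue
        if fallback in seen_fallback:
            continue
        if chunk_id is not None:
            seen_ids.add(chunk_id)
        seen_fallback.add(fallback)
        merged.append(item)
    return merged
```
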
After merging, identifiers are normalized according to the following rules.
- `chunk_id`
  - Keeps consecutive numbering per title
  - Examples: `메구밍_0000`, `카즈마_0185`
- `chunk_index`
  - Consecutive numbering across the whole file
  - Examples: `0`, `135`, `320`
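
The renumbering step could look like the sketch below. It assumes that each item carries `source.title`, that per-title counters start at 0, and that the suffix is zero-padded to four digits as the `chunk_id` examples suggest. `renumber` is a hypothetical helper name.

```python
from collections import defaultdict


def renumber(items):
    """Return copies of items with normalized chunk_id and chunk_index.

    chunk_id    -> "<source.title>_<per-title counter, 4 digits>"
    chunk_index -> position in the whole file, starting at 0
    """
    per_title = defaultdict(int)
    result = []
    for index, item in enumerate(items):
        title = item["source"]["title"]
        updated = dict(item)  # shallow copy so the input list is not mutated
        updated["chunk_id"] = f"{title}_{per_title[title]:04d}"
        updated["chunk_index"] = index
        per_title[title] += 1
        result.append(updated)
    return result
```
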
## Related Scripts
- `scripts/crawl_namuwiki_to_qa.py`
  - Converts Namuwiki documents to QA JSON
- `scripts/build_faiss_index.py`
  - Builds a question index and a question+answer index for each of the style and Namuwiki datasets
- `scripts/expand_persona_from_namuwiki.py`
  - Expands fact/lore QA into draft Megumin-style QA
## Operational Notes
- The Namuwiki-based data is for reinforcing facts and lore.
- The existing `megumin_qa_dataset.json` matters more for preserving Megumin's speech style and emotional tone.
- The final Agent searches both data sources together.
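
How the Agent combines the two sources is not specified here. As one possible shape, the sketch below takes scored hits from each source and keeps the top few from each, grouped separately so that style examples and factual context stay distinct in the prompt. The function name, the `(score, record)` hit format, the "higher score is better" convention, and the `k_style`/`k_fact` defaults are all assumptions.

```python
def combine_hits(style_hits, fact_hits, k_style=2, k_fact=3):
    """Pick the best hits from each source and keep them in separate groups.

    Each hit is a (score, record) pair; a higher score is assumed to mean
    a closer match.
    """
    top_style = sorted(style_hits, key=lambda h: h[0], reverse=True)[:k_style]
    top_fact = sorted(fact_hits, key=lambda h: h[0], reverse=True)[:k_fact]
    return {
        "style": [record for _, record in top_style],
        "facts": [record for _, record in top_fact],
    }
```
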