RAG System Testing and Tuning Guide

This document explains how to test and tune the RAG system behind the tax chatbot built for Hugging Face Spaces.

📋 Overview

The following tools are provided for optimizing the performance of the RAG (Retrieval-Augmented Generation) system:

  • test_rag.py: basic RAG tests and performance evaluation
  • tune_rag_config.py: automatic configuration tuning
  • run_rag_tests.py: integrated test runner script

🚀 Quick Start

1. Run the full test suite

python run_rag_tests.py

2. Rebuild the RAG system, then test

python run_rag_tests.py --rebuild

3. Run only the quick test

python run_rag_tests.py --quick

4. Run only configuration tuning

python run_rag_tests.py --tune

5. Run only the performance benchmark

python run_rag_tests.py --benchmark

📁 File Structure

clovax-tax-chatbot/
├── rag_system.py          # Main RAG system module
├── config.py              # Configuration file
├── law_fetcher.py         # Statute data fetcher
├── test_rag.py            # RAG test script
├── tune_rag_config.py     # Configuration tuning script
├── run_rag_tests.py       # Integrated runner script
├── RAG_TESTING.md         # This guide
│
├── vector_db.faiss        # Vector database (generated)
├── documents.pkl          # Document data (generated)
├── metadata.json          # Metadata (generated)
│
├── law_cache/             # Statute data cache directory
│   ├── cache_info.json
│   ├── 지방세법.json           # Local Tax Act
│   └── 지방세법시행령.json      # Enforcement Decree of the Local Tax Act
│
└── test_results_*.json    # Test result files (generated)

🔧 Detailed Usage

test_rag.py - Basic RAG tests

Features

  • Basic question-answering test (12 test queries)
  • Search parameter tuning test
  • Hybrid weight tuning test
  • Performance benchmark
  • Saving of test results

How to run

# Use the existing vector DB
python test_rag.py

# Rebuild the RAG system, then test
python test_rag.py --rebuild

Example test queries

  • "취득세율이 얼마인가요?" (What is the acquisition tax rate?)
  • "주택 취득세 계산 방법을 알려주세요" (Please explain how to calculate the acquisition tax on a house)
  • "1세대 1주택자 취득세 감면 혜택은?" (What acquisition tax reductions apply to a household owning a single home?)
  • "신혼부부 취득세 감면 조건은?" (What are the conditions for the newlywed acquisition tax reduction?)
  • "농지 취득세 감면 규정을 알려주세요" (Please explain the acquisition tax reduction rules for farmland)

tune_rag_config.py - Automatic configuration tuning

Features

  • Embedding model optimization
  • Search parameter optimization (top_k, similarity_threshold)
  • Hybrid weight optimization (vector vs tfidf)
  • Batch size optimization
  • Generation of an optimized configuration file

How to run

python tune_rag_config.py

Evaluation criteria

  • Response quality (40 points): keyword match rate
  • Source document count (20 points): number of relevant documents
  • Response speed (20 points): full marks for responses within 3 seconds
  • Source document quality (20 points): relevance score
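As a rough illustration of how the four criteria could combine into a single 0-100 score, here is a minimal sketch. Only the 40/20/20/20 split and the 3-second cutoff come from this guide; the function name score_response, the scaling of the source-count points against top_k, and the linear decay of the speed points beyond 3 seconds are assumptions, and tune_rag_config.py may compute these differently.

```python
def score_response(keyword_match_rate, source_count, response_time,
                   avg_relevance, top_k=5, time_limit=3.0):
    """Return a 0-100 score using the 40/20/20/20 weighting (hypothetical helper)."""
    # Clamp the fractional inputs into [0, 1].
    match = min(max(keyword_match_rate, 0.0), 1.0)
    relevance = min(max(avg_relevance, 0.0), 1.0)

    # Response quality: up to 40 points from the keyword match rate.
    quality_pts = 40.0 * match

    # Source count: assumed to earn full marks once top_k sources are returned.
    source_pts = 20.0 * min(source_count / top_k, 1.0) if top_k > 0 else 0.0

    # Speed: full marks within time_limit; assumed linear decay to 0 at 2x the limit.
    if response_time <= time_limit:
        speed_pts = 20.0
    else:
        overshoot = (response_time - time_limit) / time_limit
        speed_pts = 20.0 * max(0.0, 1.0 - overshoot)

    # Source quality: up to 20 points from the average relevance score.
    relevance_pts = 20.0 * relevance

    return round(quality_pts + source_pts + speed_pts + relevance_pts, 1)


print(score_response(0.5, 5, 4.5, 0.5))  # prints 60.0
```

A perfect answer (all keywords matched, top_k sources, under 3 seconds, relevance 1.0) scores 100.0 under these assumptions.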

run_rag_tests.py - Integrated runner script

Main options

python run_rag_tests.py [options]

Options:
  --rebuild     Rebuild the RAG system, then test
  --tune        Run configuration tuning only
  --benchmark   Run the performance benchmark only
  --quick       Quick test (3 queries only)
  --info        Show system information only
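The option list above maps naturally onto argparse flags. The sketch below is an assumption about how such a parser could be declared, not the actual contents of run_rag_tests.py; build_parser is a hypothetical name, and whether the real script allows options to be combined is not stated in this guide.

```python
import argparse


def build_parser():
    """Declare the five documented flags as simple on/off switches (hypothetical)."""
    parser = argparse.ArgumentParser(
        prog="run_rag_tests.py",
        description="Run RAG tests, tuning, and benchmarks.",
    )
    parser.add_argument("--rebuild", action="store_true",
                        help="Rebuild the RAG system, then test")
    parser.add_argument("--tune", action="store_true",
                        help="Run configuration tuning only")
    parser.add_argument("--benchmark", action="store_true",
                        help="Run the performance benchmark only")
    parser.add_argument("--quick", action="store_true",
                        help="Quick test (3 queries only)")
    parser.add_argument("--info", action="store_true",
                        help="Show system information only")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--quick"])
print(args.quick, args.rebuild)  # prints: True False
```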

📊 Result Files

Test results file (test_results_YYYYMMDD_HHMMSS.json)

{
  "timestamp": "2024-01-01T12:00:00",
  "config": { ... },
  "test_results": [
    {
      "query": "취득세율이 얼마인가요?",
      "answer": "...",
      "source_count": 3,
      "response_time": 1.23,
      "success": true
    }
  ],
  "summary": {
    "total_tests": 12,
    "successful_tests": 12
  }
}
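A results file with this shape can be summarized in a few lines. The sketch below relies only on the field names shown in the sample JSON above (test_results, success, response_time); the helper names load_report and summarize_results are hypothetical, and the success rate and average response time are derived values that the file itself may not contain.

```python
import json


def load_report(path):
    """Read a test_results_*.json file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_results(report):
    """Compute pass counts and average latency from the test_results list."""
    results = report.get("test_results", [])
    total = len(results)
    successful = sum(1 for r in results if r.get("success"))
    times = [r["response_time"] for r in results if "response_time" in r]
    return {
        "total_tests": total,
        "successful_tests": successful,
        "success_rate": successful / total if total else 0.0,
        "avg_response_time": sum(times) / len(times) if times else None,
    }


# In-memory example using the same fields as the sample file.
sample = {
    "test_results": [
        {"query": "q1", "success": True, "response_time": 1.0},
        {"query": "q2", "success": False, "response_time": 3.0},
    ]
}
print(summarize_results(sample))
```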

Tuning results file (rag_tuning_results_YYYYMMDD_HHMMSS.json)

{
  "timestamp": "2024-01-01T12:00:00",
  "baseline_config": { ... },
  "best_config": { ... },
  "best_score": 85.2,
  "tuning_history": [ ... ]
}

Optimized configuration file (optimized_config.py)

# optimized_config.py - tuned RAG configuration
OPTIMIZED_RAG_CONFIG = {
    "embedding_models": ["jhgan/ko-sroberta-multitask"],
    "batch_size": 32,
    "top_k": 5,
    "similarity_threshold": 0.2,
    "hybrid_weights": {"vector": 0.7, "tfidf": 0.3}
}

⚙️ Configuration Parameters

RAG_CONFIG in config.py

RAG_CONFIG = {
    # Embedding models (tried in priority order)
    'embedding_models': [
        'jhgan/ko-sroberta-multitask',                    # Korean-specific
        'sentence-transformers/paraphrase-multilingual-mpnet-base-v2',
        'paraphrase-multilingual-MiniLM-L12-v2'
    ],
    
    # Batch size (memory usage vs speed)
    'batch_size': 32,
    
    # Number of search results
    'top_k': 5,
    
    # Similarity threshold (lower values include more documents)
    'similarity_threshold': 0.2,
    
    # Hybrid search weights
    'hybrid_weights': {
        'vector': 0.7,  # Semantic similarity
        'tfidf': 0.3    # Keyword matching
    }
}
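The "tried in priority order" behaviour of embedding_models can be sketched as a simple fallback loop. This is an illustration under assumptions, not the code in rag_system.py: load_first_available and the loader argument are hypothetical names, and the example uses a fake loader so it runs without downloading any model.

```python
def load_first_available(model_names, loader):
    """Return (name, model) for the first model that loader can construct.

    loader is any callable that takes a model name and either returns a
    model object or raises an exception.
    """
    errors = {}
    for name in model_names:
        try:
            return name, loader(name)
        except Exception as exc:  # keep going and remember why this one failed
            errors[name] = exc
    raise RuntimeError(f"No embedding model could be loaded: {errors}")


# Fake loader for demonstration: pretends the first model is unavailable.
def fake_loader(name):
    if name == "jhgan/ko-sroberta-multitask":
        raise OSError("model not reachable")
    return f"<model {name}>"


candidates = [
    "jhgan/ko-sroberta-multitask",
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    "paraphrase-multilingual-MiniLM-L12-v2",
]
name, model = load_first_available(candidates, fake_loader)
print(name)  # prints the second model name
```

If the sentence-transformers package is installed, passing its SentenceTransformer class as loader would presumably give the same fallback over real models.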

🎯 Tuning Guidelines

1. Performance first (Hugging Face Spaces)

  • batch_size: 16-32 (memory limit)
  • top_k: 3-5 (response speed)
  • similarity_threshold: 0.15-0.25

2. Accuracy first (local environment)

  • batch_size: 64-128
  • top_k: 7-10
  • similarity_threshold: 0.1-0.2

3. Adjusting the hybrid weights

  • Keyword-search focused: vector: 0.5, tfidf: 0.5
  • Semantic-search focused: vector: 0.8, tfidf: 0.2
  • Balanced: vector: 0.7, tfidf: 0.3 (recommended)
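To make the effect of these weights concrete, the sketch below combines a vector score and a TF-IDF score as a weighted sum, drops results under similarity_threshold, and keeps the top_k best. The weighted-sum form, the choice to apply the threshold to the combined score, and the name hybrid_rank are assumptions; rag_system.py may combine or filter scores differently.

```python
def hybrid_rank(candidates, weights, top_k=5, similarity_threshold=0.2):
    """Rank (doc_id, vector_score, tfidf_score) tuples by weighted hybrid score."""
    scored = []
    for doc_id, vector_score, tfidf_score in candidates:
        hybrid = weights["vector"] * vector_score + weights["tfidf"] * tfidf_score
        # Assumption: the threshold filters on the combined score.
        if hybrid >= similarity_threshold:
            scored.append((doc_id, hybrid))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


candidates = [
    ("semantic_hit", 0.9, 0.1),  # strong embedding match, weak keyword overlap
    ("keyword_hit", 0.2, 0.9),   # weak embedding match, strong keyword overlap
    ("noise", 0.1, 0.1),
]

balanced = hybrid_rank(candidates, {"vector": 0.7, "tfidf": 0.3})
keyword = hybrid_rank(candidates, {"vector": 0.5, "tfidf": 0.5})
print([doc for doc, _ in balanced])  # prints ['semantic_hit', 'keyword_hit']
print([doc for doc, _ in keyword])   # prints ['keyword_hit', 'semantic_hit']
```

With these made-up scores, moving from 0.7/0.3 to 0.5/0.5 is enough to flip the ranking toward the keyword match, while the low-scoring document is filtered out in both cases.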

🔍 Troubleshooting

Q1: "벡터 DB 파일이 없습니다" ("Vector DB file not found") error

# Fix: rebuild the RAG system
python run_rag_tests.py --rebuild

Q2: Out-of-memory error

# Reduce batch_size in config.py
'batch_size': 16  # 32 → 16

Q3: Responses are slow

# Reduce top_k in config.py
'top_k': 3,  # 5 → 3
'similarity_threshold': 0.25  # Raise the threshold

Q4: Retrieval accuracy is low

# Make the settings in config.py more inclusive
'top_k': 7,  # Retrieve more documents
'similarity_threshold': 0.15,  # Lower the threshold
'hybrid_weights': {'vector': 0.6, 'tfidf': 0.4}  # Increase the TF-IDF weight

📈 Performance Monitoring

Key metrics

  • Response time: < 3 seconds (target)
  • Retrieval accuracy: keyword match rate > 80%
  • Memory usage: < 2GB (Hugging Face limit)
  • Source document relevance: > 0.3 (hybrid_score)

Monitoring commands

# Check system information
python run_rag_tests.py --info

# Quick performance check
python run_rag_tests.py --quick

# Detailed benchmark
python run_rag_tests.py --benchmark

🚀 Hugging Face Spaces Optimization

1. Memory optimization

  • Build the vector DB ahead of time and upload it
  • Use the cached statute data
  • Use a small batch size (16-32)

2. Response speed optimization

  • Reuse the RAG instance via the singleton pattern
  • Set an appropriate top_k (3-5)
  • Adjust the threshold (0.2-0.3)
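One lightweight way to get singleton-style reuse in Python is to cache a factory function, so the expensive model and index loading happens once per process. The sketch below is an assumption about how that could look; StandInRAG and get_rag_system are hypothetical names, and the real class in rag_system.py is not shown in this guide.

```python
from functools import lru_cache


class StandInRAG:
    """Hypothetical placeholder for the real RAG class in rag_system.py."""

    instances_created = 0

    def __init__(self):
        # The real system would load the embedding model and vector DB here.
        StandInRAG.instances_created += 1


@lru_cache(maxsize=1)
def get_rag_system():
    """Build the RAG object on the first call and return the cached one afterwards."""
    return StandInRAG()


first = get_rag_system()
second = get_rag_system()
print(first is second, StandInRAG.instances_created)  # prints: True 1
```

Request handlers would then call get_rag_system() instead of constructing the object themselves.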

3. Model optimization

  • Prefer Korean-specific models
  • Choose lightweight embedding models
  • Auto-detect the GPU when one is available

📞 Further Support

If you run into problems or need additional features:

  1. Check the test results files (test_results_*.json)
  2. Analyze the tuning history (rag_tuning_results_*.json)
  3. Check the system information (python run_rag_tests.py --info)

Updated: August 2024, optimized version for Hugging Face Spaces