# FabAgent Experiments
Every key decision is backed by a quantitative comparison experiment, a trade-off table, and a record of trial and error,
so that "why this choice was made" can be explained with measured evidence.
## Experiment list
| ID | Experiment | Compared | Decision | Results |
|---|---|---|---|---|
| **D1** | Tier 1 anomaly detection model | IsoForest / LOF / OC-SVM / baseline | IsolationForest | [tier1_detection/results.md](tier1_detection/results.md) |
| **D2** | Retrieval backend latency | keyword / FAISS / hybrid / +rerank | hybrid_rerank (latency) | [retrieval_compare/results.md](retrieval_compare/results.md) |
| **D5** | Multi-agent vs Single LLM | separated, specialized calls vs one unified call | Multi-Agent | [multi_vs_single/results.md](multi_vs_single/results.md) |
| RAG-eval | RAGAS (hybrid vs hybrid_rerank) | RAGAS comparison of 2 backends | (merged into D6) | [rag_eval/results.md](rag_eval/results.md) |
| **D6** | 5-stage RAG paradigm ablation | No RAG / Naive / FAISS / Hybrid / +Rerank | **Hybrid** | [rag_paradigm/results.md](rag_paradigm/results.md) |
| **D7** | Workflow vs Agentic | single LLM call vs tool-using loop | **Agentic** | [agentic_vs_workflow/results.md](agentic_vs_workflow/results.md) |
| **D8** | CRAG ON vs OFF | self-correction (grader + refinement) | CRAG **kept enabled** (observability value) | [crag_eval/results.md](crag_eval/results.md) |
| **D9** | Korean reranker (Dongjin-kr/ko-reranker) | vs BAAI (English) vs hybrid (12 docs) | Both below hybrid | [reranker_compare/results.md](reranker_compare/results.md) |
| **D10** | D9 follow-up: reranker re-evaluated after expanding the corpus 12→34 | hybrid / BAAI / ko-reranker (34 docs) | **Hypothesis confirmed - effect fully reversed** | [reranker_compare/results.md](reranker_compare/results.md) |
| **D11** | Conductor (Plan-and-Execute) vs Autonomous | 4 LLM calls vs 10 LLM calls | **Conductor adopted** (faster and cheaper, equal quality) | [conductor_vs_autonomous/results.md](conductor_vs_autonomous/results.md) |
## Key decision summary
### D1. Anomaly detection model → **IsolationForest**
| Model | ROC-AUC | PR-AUC |
|---|---|---|
| **IsolationForest** | 0.600 | **0.129** |
| baseline | 0.565 | 0.119 |
| OC-SVM | 0.547 | 0.098 |
| LOF | 0.530 | 0.089 |
- SECOM is a standard benchmark on which unsupervised anomaly detection is hard (ROC-AUC around 0.6 in the literature)
- Trade-off: an Autoencoder or LSTM can capture more complex patterns, but costs much more in training data and time
### D2. Retrieval backend latency → **4 backends verified (plain latency comparison)**
| Backend | Mean latency |
|---|---|
| keyword | 0.5 ms |
| FAISS | ~60 ms |
| **hybrid** | ~54 ms |
| hybrid + rerank | ~326 ms |
- On semantically indirect (paraphrased) queries, FAISS and hybrid clearly beat keyword
- Precision is evaluated separately with RAGAS in D6 (RAG paradigm ablation)
### D5. Multi-agent vs Single LLM → **Multi-Agent**
| Area | Advantage |
|---|---|
| Speed, cost | Single (2.6x faster, 2.2x cheaper) |
| Response depth (number of recommended actions) | Multi (1.6~1.9x more detailed) |
| Modularity, extensibility, self-learning | Multi |
| Schema and citation accuracy | Equal (strict JSON 100% on both) |
- Absolute cost is negligible for both approaches ($0.01~0.02), so there is no need to be cost-aware
- In a production setting (separate Tier owners per business unit, adding new steps), Multi's modularity is decisive
### D6. 5-stage RAG paradigm ablation → **Hybrid (BM25+FAISS+RRF)**
| Paradigm | Faithfulness | Answer Relevancy | Context Precision | Total ms |
|---|---|---|---|---|
| No RAG | 0.321 | 0.297 | 1.000 | 13,084 |
| Naive RAG (keyword) | 0.764 | 0.388 | 1.000 | 15,192 |
| Vector RAG (FAISS) | 0.784 | 0.146 | 1.000 | 12,267 |
| **Hybrid (BM25+FAISS+RRF)** | **0.821** | **0.394** | **1.000** | **10,977** |
| Hybrid + Rerank | 0.819 | 0.167 | 1.000 | 11,306 |
![RAG Paradigm Evolution](rag_paradigm/charts/ragas_comparison.png)
![Quality vs Latency Trade-off](rag_paradigm/charts/tradeoff.png)
**Key insights**:
1. **Adding RAG is the decisive step** - compared with No RAG, every retrieval paradigm raises faithfulness about 2.4~2.6x (0.321 → 0.764~0.821)
2. **Hybrid ranks first on every metric in this corpus** (tied at 1.000 on Context Precision) - ahead on both quality and latency. It is also the standard production pattern (recommended by Microsoft Azure AI Search and LlamaIndex)
3. **Cross-encoder rerank backfires on this corpus**
- faithfulness is equivalent, answer_relevancy drops 0.394 → 0.167
- Suspected causes: ① with ~10 documents in the corpus, Hybrid top-3 is already sufficient ② `BAAI/bge-reranker-base` is trained on English → noisy scores on Korean domain text
- Re-evaluation recommended after expanding the corpus to 100+ documents or with a Korean reranker (`Dongjin-kr/ko-reranker`)
**Trial and error**: rerank was first adopted as the default because "rerank is the production standard", but the RAGAS evaluation pointed the opposite way, so the default was changed (`hybrid_rerank` → `hybrid`). **Without quantitative evaluation, the conventional wisdom would have been carried along unchecked**.
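
The Hybrid backend is described here as BM25+FAISS+RRF. As a rough illustration of the fusion step only, the sketch below applies reciprocal rank fusion to two ranked lists of document IDs. The function name `rrf_fuse`, the constant `k=60` (a commonly used default), the `top_n=3` cut-off, and the document IDs are all assumptions made for this example, not the project's actual code.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=3):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 and top_n=3 are illustrative assumptions.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc ID for determinism.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


# Hypothetical document IDs, for illustration only.
bm25_ranking = ["sop_cmp", "fmea_etch", "inc_photo"]
faiss_ranking = ["inc_photo", "sop_cmp", "pm_history"]
fused = rrf_fuse([bm25_ranking, faiss_ranking])
```

Because RRF uses only ranks, the BM25 and FAISS scores never need to be put on a common scale, which is one reason it is a convenient way to combine a lexical and a vector retriever.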
### D7. Workflow vs Agentic → **Agentic** (tool-using agent)
| Metric | Workflow | Agentic | Ratio |
|---|---|---|---|
| LLM calls / alarm | 3 | 9 | x3.0 |
| Tool calls / alarm | 0 | 13 | - |
| Unique citations / alarm | 4 | 5 | x1.25 |
| Input tokens / alarm | 5,890 | 20,474 | x3.5 |
| Output tokens / alarm | 5,174 | 12,574 | x2.4 |
| Latency (T2~T4) | 83s | 194s | x2.3 |
| Cost / 1000 alarms | $11.82 | $30.27 | x2.6 |
![Workflow vs Agentic - call counts and citation depth](agentic_vs_workflow/charts/calls_citations.png)
![Latency per Tier](agentic_vs_workflow/charts/latency_per_tier.png)
![Cost](agentic_vs_workflow/charts/cost.png)
**Rationale for adoption**:
1. **Tool call log = reasoning trace** - provides an audit trail for "why was this recommendation made" (decisive for fab safety)
2. **Multi-source evidence** - the LLM autonomously combines INC, FMEA, SOP, the incident DB, and PM history
3. **Cost +$0.018/alarm** - under $5/day up to roughly 270 alarms per day, so the business impact is negligible
4. **The 2.3x latency is absorbed by the loading UI** - the 4-Tier cascade reveal UI is already implemented
**Trial and error**: the initial claim was "the 4 Tiers call LLMs, so this is multi-agent", but checking it against the definitions in Anthropic's [Building Effective Agents](https://www.anthropic.com/engineering/building-effective-agents) showed that **it was a workflow** (each Tier did one upfront RAG lookup + one LLM call). After switching to the tool-using pattern, citation depth rose only about +25%, but the decisive gain is that the tool call log doubles as a reasoning trace and an audit trail.
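
The per-alarm cost difference quoted in this section follows directly from the two "cost / 1000 alarms" figures in the table. A small sketch of that arithmetic, with a hypothetical helper `extra_daily_cost` for trying different alarm volumes:

```python
# Cost per 1000 alarms, taken from the D7 table.
WORKFLOW_COST_PER_1000 = 11.82
AGENTIC_COST_PER_1000 = 30.27

# Extra cost of the agentic pattern for a single alarm.
extra_per_alarm = (AGENTIC_COST_PER_1000 - WORKFLOW_COST_PER_1000) / 1000


def extra_daily_cost(alarms_per_day):
    """Extra daily spend of agentic over workflow (hypothetical helper)."""
    return alarms_per_day * extra_per_alarm


# Alarm volume at which the extra spend reaches $5/day.
break_even_alarms = 5 / extra_per_alarm
```

Under these figures the extra spend crosses $5/day at a little over 270 alarms per day, so the "negligible" judgment holds for volumes in the low hundreds.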
### D8. CRAG (Self-correction) → **kept enabled** (observability value)
| Metric | CRAG OFF | CRAG ON | Change |
|---|---|---|---|
| Faithfulness | 0.641 | 0.639 | -0.1 pp (equivalent) |
| Answer Relevancy | 0.283 | 0.250 | -3.3 pp (slight drop) |
| LLM calls / alarm | 3.0 | 3.7 | x1.22 |
| Cost / 1000 alarms | $9.40 | $12.29 | x1.31 |
| Latency (Tier 2) | 61s | 69s | x1.13 |
| Refinement trigger rate | - | **20%** | (1 in 5) |
| Mean relevance_score | - | **0.61** | (exposed on a 0~1 scale when CRAG is ON) |
![CRAG self-correction activity](crag_eval/charts/crag_activity.png)
![CRAG effect - answer quality](crag_eval/charts/quality.png)
**Rationale for adoption (an honest trade-off)**:
1. **Practically no change in quality** - faithfulness -0.1 pp, relevancy -3.3 pp. On this corpus (~10 documents) hybrid already works well
2. **The self-correction mechanism itself is confirmed to work** - smoke test: a gibberish query (`알수없음 xyzzy foobar`, "unknown xyzzy foobar") was graded avg score 0.0, the LLM rewrote it as `CMP 공정 실패 모드 분석...` ("CMP process failure mode analysis..."), and the avg recovered 0.0 → 0.68
3. **Value of visible citation confidence** - every answer exposes a 0~1 relevance_score → the operator can judge at a glance "how strong is the evidence behind this recommendation"
4. **Cost +31% is negligible in absolute terms** - about +$2.90 per 1000 alarms
5. **Partial overlap with the agentic loop** - the agent already performs some self-correction by noticing weak results and calling again with a different query
**Trial and error**: Anthropic and LangChain often mention CRAG as a production pattern. A simple implementation (grader + refiner) worked impressively in the smoke test. The quantitative comparison, however, showed a **negligible change in quality** - the same pattern as Rerank in D6. On a small domain corpus, elaborate self-correction has low ROI. Without quantitative evaluation this would have ended as "we adopted CRAG" marketing. Decision: **keep it enabled, but re-evaluate when the corpus grows** (the side benefit of exposing citation confidence is retained).
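
The grader + refiner loop described in this section can be sketched as below. This is a minimal illustration under stated assumptions, not the project's implementation: the function name `corrective_retrieve`, the `threshold=0.5` cut-off, the single-refinement limit, and the stub retriever, grader, and refiner are all hypothetical. In a real setup `grade` and `refine` would be LLM calls.

```python
from statistics import mean


def corrective_retrieve(query, retrieve, grade, refine,
                        threshold=0.5, max_refinements=1):
    """Retrieve, grade the hits, and rewrite the query if the grades are weak.

    retrieve(query)   -> list of documents
    grade(query, doc) -> relevance score in [0, 1]
    refine(query)     -> rewritten query
    threshold and max_refinements are illustrative assumptions.
    """
    refinements = 0
    while True:
        docs = retrieve(query)
        scores = [grade(query, doc) for doc in docs]
        avg = mean(scores) if scores else 0.0
        # Stop when the evidence is good enough or the refinement budget is spent.
        if avg >= threshold or refinements >= max_refinements:
            return {
                "query": query,
                "docs": docs,
                "relevance_score": avg,
                "refined": refinements > 0,
            }
        query = refine(query)
        refinements += 1


# Stub components standing in for the real retriever and LLM calls.
def fake_retrieve(query):
    return ["doc_a", "doc_b"]


def fake_grade(query, doc):
    return 0.0 if "xyzzy" in query else 0.68


def fake_refine(query):
    return "CMP process failure mode analysis"


result = corrective_retrieve(
    "unknown xyzzy foobar", fake_retrieve, fake_grade, fake_refine
)
```

Returning `relevance_score` alongside the documents is what makes the citation-confidence display possible even when no refinement is triggered.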
### D9. Korean reranker → **adoption deferred** (D6 hypothesis partially reconfirmed)
| Mode | Mean LLM relevance | rerank latency | vs hybrid baseline |
|---|---|---|---|
| **hybrid (no rerank)** | **0.734** | 0 ms | baseline |
| BAAI/bge-reranker-base (English) | 0.714 | 315 ms | -0.020 |
| Dongjin-kr/ko-reranker (Korean) | 0.703 | 826 ms | -0.031 |
![Reranker comparison](reranker_compare/charts/reranker_comparison.png)
**Per-query pattern**:
| Query | hybrid | BAAI | ko | Winner |
|---|---|---|---|---|
| Photo CD, direct | 0.867 | 0.817 | 0.867 | hybrid / ko (tie) |
| CMP, direct | 0.750 | 0.767 | **0.850** | **ko (+0.10)** |
| Etch, direct | 0.750 | 0.650 | 0.567 | hybrid |
| Indirect 1 (lens cleaning) | 0.700 | 0.633 | **0.783** | **ko (+0.083)** |
| Indirect 2 (yield impact) | 0.817 | **0.867** | 0.617 | BAAI |
| Indirect 3 (PM guide) | 0.517 | 0.550 | 0.533 | tie |
**Trial and error (D6 → D9)**: D6 hypothesized that the English reranker's poor showing "would be fixed by a Korean model". D9 tested this - **the Korean reranker is clearly ahead on the CMP and lens-cleaning queries, but loses heavily on the Etch and yield queries**. Averaged over the 6 queries it falls short of the hybrid baseline. **Conclusion: the real issue behind the D6 result is not English vs Korean but corpus size**. This hypothesis is tested quantitatively in D10.
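
The averages in the summary table can be reproduced from the per-query table, which also makes the per-query gains and losses explicit. A short sketch using only the numbers listed in this section (the variable names are illustrative):

```python
from statistics import mean

# Per-query LLM relevance scores copied from the per-query table
# (order: Photo CD, CMP, Etch, indirect 1, indirect 2, indirect 3).
scores = {
    "hybrid": [0.867, 0.750, 0.750, 0.700, 0.817, 0.517],
    "BAAI": [0.817, 0.767, 0.650, 0.633, 0.867, 0.550],
    "ko": [0.867, 0.850, 0.567, 0.783, 0.617, 0.533],
}

# Mean relevance per mode; matches the summary table up to rounding.
averages = {mode: mean(values) for mode, values in scores.items()}

# Per-query gain or loss of each reranker relative to hybrid.
deltas = {
    mode: [round(r - h, 3) for r, h in zip(values, scores["hybrid"])]
    for mode, values in scores.items()
    if mode != "hybrid"
}
```

Looking at `deltas["ko"]` shows why the mean hides the pattern: two clear wins (CMP, lens cleaning) are outweighed by two larger losses (Etch, yield impact).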
### D10. Reranker effect on the expanded corpus (34 docs) → **hypothesis confirmed, effect fully reversed**
| Mode | D9 (12 docs) (vs hybrid) | **D10 (34 docs)** (vs hybrid) | Change |
|---|---|---|---|
| hybrid (no rerank) | 0.734 | **0.592** | -0.142 (noise↑) |
| BAAI/bge-reranker-base | 0.714 (-0.020) | **0.709 (+0.117)** | **reversed** |
| Dongjin-kr/ko-reranker | 0.703 (-0.031) | **0.675 (+0.083)** | **reversed** |
![Reranker comparison (D10, 34 docs)](reranker_compare/charts/reranker_comparison.png)
**Significance of the series (D6 → D9 → D10)**:
- **D6**: found that the production standard backfires on a small corpus (a quantitative rebuttal of the belief that "rerank is always better")
- **D9**: a Korean reranker did not fix it either → proposed the hypothesis that "**corpus size is the real cause**", not English vs Korean
- **D10**: re-ran after expanding the corpus 12 → 34. **hybrid baseline -0.14, reranker effect up to +0.12, a full reversal** → hypothesis confirmed quantitatively
**Key message**: without quantitative evaluation, a wrong piece of conventional wisdom would have been carried along unchecked; with it, the real cause could be isolated and verified. Adopted: `RAG_BACKEND=hybrid_rerank` is recommended for corpora of 30+ documents, while the demo corpus stays on hybrid.
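
One way to encode this recommendation is a small selector that honors an explicit `RAG_BACKEND` setting and otherwise falls back on corpus size. This is a hypothetical helper for illustration: the function `choose_rag_backend`, its `rerank_min_docs` parameter, and the idea of choosing automatically by corpus size are assumptions, not behavior the project is known to have.

```python
import os


def choose_rag_backend(corpus_size, env=None, rerank_min_docs=30):
    """Pick a retrieval backend name (hypothetical helper).

    An explicit RAG_BACKEND value wins; otherwise use hybrid_rerank once
    the corpus reaches rerank_min_docs documents, and hybrid below that.
    """
    env = os.environ if env is None else env
    explicit = env.get("RAG_BACKEND")
    if explicit:
        return explicit
    return "hybrid_rerank" if corpus_size >= rerank_min_docs else "hybrid"


# The two corpus sizes measured in D9 and D10, with no explicit override.
small_corpus_backend = choose_rag_backend(12, env={})
large_corpus_backend = choose_rag_backend(34, env={})
```

The 30-document threshold comes from a single before/after comparison (12 vs 34 docs), so it is better treated as a starting point to re-measure than as a hard rule.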
### D11. Conductor (Plan-and-Execute) vs Autonomous → **Conductor adopted**
| Metric | Autonomous | Conductor | Change |
|---|---|---|---|
| LLM calls / alarm | 10.0 | **4.0** | **-60%** |
| Tool calls / alarm | 13.7 | 16.0 | +17% |
| Unique citations / alarm | 6.0 | **6.0** | **equal** |
| Input tokens | 25,849 | 8,042 | **-69%** |
| Output tokens | 13,385 | 5,895 | **-56%** |
| **Latency / alarm** | **131 s** | **60 s** | **-54%** |
| Cost / 1000 alarms | $33.23 | $13.80 | **-58%** |
![Call comparison](conductor_vs_autonomous/charts/calls_comparison.png)
![Latency comparison](conductor_vs_autonomous/charts/latency_comparison.png)
**Significance (D7 → D11 narrative)**:
- **D7**: "workflow → agentic" gained a reasoning trace and autonomy (each Tier calls tools on its own)
- **D11**: "agentic → conductor" wins back communication efficiency (the Central Planner produces the plan once + Tier executors run it as planned + one LLM synthesis call)
- Both patterns were adopted only after a quantitative comparison. autonomous is preserved behind the environment variable `AGENT_MODE=autonomous` (for complex alarms or when adapting to unexpected context is needed)
- **Recursion and infinite-loop risk removed at the source**: autonomous relies on the `MAX_TOOL_ITERATIONS=4` cap, whereas executing a fixed plan eliminates the problem by construction
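
The plan-once, execute, synthesize-once shape described above can be sketched as follows. This is an illustrative outline only: `run_conductor`, the tool names, and the stub planner and synthesizer are hypothetical, and in a real system the planner and synthesizer would be LLM calls while the tool steps would not.

```python
def run_conductor(alarm, plan_fn, tools, synthesize_fn):
    """Plan once, run every planned tool step exactly once, then synthesize once.

    plan_fn(alarm)                    -> list of (tool_name, argument) steps
    tools                             -> mapping of tool name to callable
    synthesize_fn(alarm, observations) -> final answer
    """
    plan = plan_fn(alarm)
    observations = []
    for tool_name, argument in plan:
        # The loop is bounded by len(plan); nothing can extend it at run time.
        observations.append((tool_name, tools[tool_name](argument)))
    return synthesize_fn(alarm, observations)


# Stub tools and LLM stand-ins, for illustration only.
tools = {
    "search_sop": lambda q: f"SOP hit for {q}",
    "search_fmea": lambda q: f"FMEA hit for {q}",
}

answer = run_conductor(
    "CMP pressure drift",
    plan_fn=lambda alarm: [("search_sop", alarm), ("search_fmea", alarm)],
    tools=tools,
    synthesize_fn=lambda alarm, obs: f"{alarm}: {len(obs)} observations",
)
```

Because the number of tool steps is fixed when the plan is produced, no iteration cap is needed to guarantee termination, which is the structural point made in the last bullet above.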
## How to run
```bash
# Tier 1 model benchmark (D1)
.venv/bin/python -m experiments.tier1_detection.benchmark
# Retrieval latency comparison (D2)
.venv/bin/python -m experiments.retrieval_compare.benchmark
# Multi-agent vs Single LLM (D5)
.venv/bin/python -m experiments.multi_vs_single.benchmark
# RAGAS hybrid vs hybrid_rerank
.venv/bin/python -m experiments.rag_eval.benchmark
# 5-stage RAG paradigm ablation (D6)
.venv/bin/python -m experiments.rag_paradigm.benchmark
# Regenerate charts only (uses the cached CSV):
.venv/bin/python -m experiments.rag_paradigm.benchmark --charts-only
# Workflow vs Agentic (D7)
.venv/bin/python -m experiments.agentic_vs_workflow.benchmark
# CRAG ON vs OFF (D8)
.venv/bin/python -m experiments.crag_eval.benchmark
# Korean reranker (Dongjin-kr/ko-reranker) evaluation (D9, D10)
.venv/bin/python -m experiments.reranker_compare.benchmark
# Conductor vs Autonomous (D11)
.venv/bin/python -m experiments.conductor_vs_autonomous.benchmark
```
Each experiment generates `results.md` and `charts/*.png`.