FabAgent Experiments

For every key decision, this directory records a quantitative comparison experiment, a trade-off table, and a trial-and-error log, so that "why was this choice made?" can be answered with measured evidence.

Experiment list

| ID | Experiment | Compared | Decision | Results doc |
|---|---|---|---|---|
| D1 | Tier 1 anomaly detection model | IsoForest / LOF / OC-SVM / baseline | IsolationForest | tier1_detection/results.md |
| D2 | Retrieval backend latency | keyword / FAISS / hybrid / +rerank | hybrid_rerank (latency) | retrieval_compare/results.md |
| D5 | Multi-agent vs Single LLM | separated, specialized calls vs one unified call | Multi-Agent | multi_vs_single/results.md |
| RAG-eval | RAGAS (hybrid vs hybrid_rerank) | RAGAS comparison of 2 backends | (merged into D6) | rag_eval/results.md |
| D6 | RAG paradigm 5-stage ablation | No RAG / Naive / FAISS / Hybrid / +Rerank | Hybrid | rag_paradigm/results.md |
| D7 | Workflow vs Agentic | single LLM call vs tool-using loop | Agentic | agentic_vs_workflow/results.md |
| D8 | CRAG ON vs OFF | self-correction (grader + refinement) | Keep CRAG enabled (observability value) | crag_eval/results.md |
| D9 | Korean reranker | Dongjin-kr/ko-reranker vs BAAI (English) vs hybrid (12 docs) | Both fall short of hybrid | reranker_compare/results.md |
| D10 | D9 follow-up: re-evaluate rerankers after expanding the corpus 12→34 | hybrid / BAAI / ko-reranker (34 docs) | Hypothesis verified: effect fully reversed | reranker_compare/results.md |
| D11 | Conductor (Plan-and-Execute) vs Autonomous | 4 LLM calls vs 10 LLM calls | Conductor adopted (faster and cheaper, equal quality) | conductor_vs_autonomous/results.md |

Key decision summary

D1. Anomaly detection model → IsolationForest

| Model | ROC-AUC | PR-AUC |
|---|---|---|
| IsolationForest | 0.600 | 0.129 |
| baseline | 0.565 | 0.119 |
| OC-SVM | 0.547 | 0.098 |
| LOF | 0.530 | 0.089 |

  • SECOM is a standard benchmark on which unsupervised anomaly detection is hard (ROC-AUC in the literature is in the ~0.6 range)
  • Trade-off: Autoencoder/LSTM models can capture more complex patterns, but their training-data and training-time costs are high

D2. Retrieval backend latency → 4 backends verified (plain latency comparison)

| Backend | Mean latency |
|---|---|
| keyword | 0.5 ms |
| FAISS | ~60 ms |
| hybrid | ~54 ms |
| hybrid + rerank | ~326 ms |

  • On semantically indirect (paraphrased) queries, FAISS and hybrid clearly beat keyword
  • Precision is evaluated separately with RAGAS in D6 (RAG paradigm ablation)
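A mean-latency comparison like the one above could be measured roughly as follows. This is a minimal sketch, assuming each backend is exposed as a callable `search(query)`; `mean_latency_ms`, `keyword_search`, and the three-document corpus are hypothetical names for illustration, not the repository's benchmark code.

```python
import time
from statistics import mean
from typing import Callable, Iterable


def mean_latency_ms(search: Callable[[str], object],
                    queries: Iterable[str],
                    repeats: int = 3) -> float:
    """Return the mean wall-clock latency of search(query) in milliseconds."""
    samples = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            search(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


# Hypothetical stand-in backend: a naive keyword scan over an in-memory corpus.
corpus = [
    "CMP pad wear procedure",
    "Etch chamber PM guide",
    "Photo CD drift incident",
]


def keyword_search(query: str) -> list:
    terms = query.lower().split()
    return [doc for doc in corpus if any(t in doc.lower() for t in terms)]


latency = mean_latency_ms(keyword_search, ["CMP pad", "etch PM"])
print(f"keyword: {latency:.3f} ms")
```

Running the same helper over each backend with an identical query set keeps the comparison fair; absolute numbers will depend on hardware and corpus size.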

D5. Multi-agent vs Single LLM → Multi-Agent

| Area | Winner |
|---|---|
| Speed and cost | Single (2.6x faster, 2.2x cheaper) |
| Response depth (number of recommended actions) | Multi (1.6~1.9x detailed) |
| Modularity, extensibility, self-learning | Multi |
| Schema and citation accuracy | Tie (both 100% strict JSON) |

  • Absolute cost is negligible for both approaches ($0.01~0.02), so there is no need to be cost-aware
  • In production (separate Tier owners per business unit, adding new steps), Multi's modularity is decisive

D6. RAG paradigm 5-stage ablation → Hybrid (BM25+FAISS+RRF)

| Paradigm | Faithfulness | Answer Relevancy | Context Precision | Total ms |
|---|---|---|---|---|
| No RAG | 0.321 | 0.297 | 1.000 | 13,084 |
| Naive RAG (keyword) | 0.764 | 0.388 | 1.000 | 15,192 |
| Vector RAG (FAISS) | 0.784 | 0.146 | 1.000 | 12,267 |
| Hybrid (BM25+FAISS+RRF) | 0.821 | 0.394 | 1.000 | 10,977 |
| Hybrid + Rerank | 0.819 | 0.167 | 1.000 | 11,306 |

[Figures: RAG Paradigm Evolution; Quality vs Latency Trade-off]

Key insights:

  1. Introducing RAG is the decisive step: compared with No RAG, attaching any paradigm raises faithfulness by roughly 2.4x to 2.6x (0.321 → 0.764~0.821)
  2. Hybrid ranks first on every metric on this corpus, leading on both quality and latency (Context Precision is tied at 1.000 across all paradigms). It is also a standard production pattern (recommended by Microsoft Azure AI Search and LlamaIndex)
  3. Cross-encoder Rerank is counterproductive on this corpus
    • faithfulness is on par, but answer_relevancy drops from 0.394 to 0.167
    • Suspected causes: (1) with a corpus of ~10 documents, Hybrid top-3 is already sufficient; (2) BAAI/bge-reranker-base is trained on English, so its scores on Korean domain text are a noisy signal
    • Re-evaluation is recommended after expanding the corpus to 100+ documents or with a Korean reranker (dongjin-kr/ko-reranker)

Trial and error: rerank was initially adopted as the default "because rerank is the production standard", but the RAGAS evaluation pointed in exactly the opposite direction, so the default was changed (hybrid_rerank → hybrid). Without a quantitative evaluation, the conventional wisdom would have been carried forward unchallenged.
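The Hybrid backend above fuses BM25 and FAISS rankings with Reciprocal Rank Fusion (RRF). The fusion step can be sketched as below, assuming each retriever returns an ordered list of document IDs. The constant `k=60` is the commonly used default rather than a value taken from this repository, and the document IDs are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=3):
    """Fuse several ranked lists of doc IDs; each hit adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n]


# Hypothetical rankings from the two retrievers for one query.
bm25_ranking = ["sop_cmp", "fmea_cmp", "inc_etch"]
faiss_ranking = ["fmea_cmp", "pm_lens", "sop_cmp"]

fused = reciprocal_rank_fusion([bm25_ranking, faiss_ranking])
print(fused)  # ['fmea_cmp', 'sop_cmp', 'pm_lens']
```

Documents that both retrievers rank highly rise to the top, which is why no score normalization between BM25 and vector similarity is needed.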

D7. Workflow vs Agentic → Agentic (tool-using agent)

| Metric | Workflow | Agentic | Ratio |
|---|---|---|---|
| LLM calls / alarm | 3 | 9 | x3.0 |
| Tool calls / alarm | 0 | 13 | - |
| Unique citations / alarm | 4 | 5 | x1.25 |
| Input tokens / alarm | 5,890 | 20,474 | x3.5 |
| Output tokens / alarm | 5,174 | 12,574 | x2.4 |
| Latency (T2~T4) | 83 s | 194 s | x2.3 |
| Cost / 1000 alarms | $11.82 | $30.27 | x2.6 |

[Figures: Workflow vs Agentic - call counts and citation depth; Latency by Tier; Cost]

Rationale for adoption:

  1. Tool call log = reasoning trace: it provides an audit trail for "why was this recommendation produced?" (decisive for fab safety)
  2. Multi-source evidence: the LLM autonomously combines INC, FMEA, SOP, the incident DB, and PM history
  3. Cost +$0.018/alarm: the added cost stays under $5/day up to roughly 270 alarms per day, so the business impact is negligible
  4. The 2.3x latency is absorbed by the loading UI: the UI that shows the 4-Tier cascade appearing is already implemented
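The per-alarm cost figure in the rationale follows directly from the "Cost / 1000 alarms" row of the table. A short worked calculation, using only the two table values; the $5/day budget is the threshold named in the text, and the variable names are illustrative:

```python
# Values from the "Cost / 1000 alarms" row of the D7 table.
workflow_cost_per_1000 = 11.82
agentic_cost_per_1000 = 30.27

# Extra cost of Agentic over Workflow for a single alarm.
extra_per_alarm = (agentic_cost_per_1000 - workflow_cost_per_1000) / 1000


def extra_daily_cost(alarms_per_day: int) -> float:
    """Added daily cost of running Agentic instead of Workflow."""
    return alarms_per_day * extra_per_alarm


# Daily alarm volume at which the added cost reaches $5/day.
break_even_alarms = 5.0 / extra_per_alarm

print(f"extra per alarm: ${extra_per_alarm:.4f}")
print(f"extra at 200 alarms/day: ${extra_daily_cost(200):.2f}")
print(f"alarms/day for $5 extra: {break_even_alarms:.0f}")
```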

Trial and error: the original claim was "the 4 Tiers call an LLM, so this is multi-agent". Checking it against the definitions in Anthropic's Building Effective Agents showed it was actually a workflow (each Tier ran one up-front RAG retrieval plus one LLM call). After switching to the tool-using pattern, citation depth rose by only about 25%, but the decisive gain is that the tool call log serves as both a reasoning trace and an audit trail.

D8. CRAG (Self-correction) → keep enabled (observability value)

| Metric | CRAG OFF | CRAG ON | Change |
|---|---|---|---|
| Faithfulness | 0.641 | 0.639 | -0.1%p (on par) |
| Answer Relevancy | 0.283 | 0.250 | -3.3%p (slight drop) |
| LLM calls / alarm | 3.0 | 3.7 | x1.22 |
| Cost / 1000 alarms | $9.40 | $12.29 | x1.31 |
| Latency (Tier 2) | 61 s | 69 s | x1.13 |
| Refinement trigger rate | - | 20% (1 in 5) | |
| Mean relevance_score | - | 0.61 | (surfaced as 0~1 when CRAG is ON) |

[Figures: CRAG self-correction activity; CRAG effect - answer quality]

Rationale for adoption (an honest trade-off):

  1. Virtually no quality change: faithfulness -0.1%p, relevancy -3.3%p. On this corpus (~10 documents) hybrid already works well
  2. The self-correction mechanism itself is confirmed to work. Smoke test: a gibberish query ("알수없음 xyzzy foobar", i.e. "unknown xyzzy foobar") is graded with an avg score of 0.0, the LLM rewrites it to "CMP 공정 실패 모드 분석..." ("CMP process failure mode analysis..."), and the avg score recovers from 0.0 to 0.68
  3. Value of surfacing citation confidence: every answer exposes a 0~1 relevance_score, so operators can immediately judge "how strong is the evidence behind this recommendation?"
  4. Cost +31% is negligible in absolute terms: about +$2.90 per 1000 alarms
  5. Partial overlap with the agentic loop: the agent already performs some self-correction by calling a tool again with a different query when the results are insufficient

Trial and error: Anthropic and LangChain often mention CRAG as a production pattern. A simple implementation (grader + refiner) worked impressively in the smoke test. The quantitative comparison, however, showed only a marginal quality change, the same pattern as Rerank in D6. On a small domain corpus, elaborate self-correction has low ROI. Without a quantitative evaluation this would have ended as "we adopted CRAG" marketing. Decision: keep it enabled but re-evaluate when the corpus grows (the side benefit of exposing citation confidence is retained).
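The grader + refiner loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: `crag_retrieve`, the 0.5 threshold, the single-refinement limit, and the word-overlap grader are all hypothetical stand-ins (the real grader and refiner are LLM calls).

```python
from statistics import mean


def crag_retrieve(query, retrieve, grade, rewrite,
                  threshold=0.5, max_refinements=1):
    """Retrieve, grade each document, and rewrite the query if relevance is low."""
    refinements = 0
    while True:
        docs = retrieve(query)
        scores = [grade(query, doc) for doc in docs]
        relevance = mean(scores) if scores else 0.0
        if relevance >= threshold or refinements >= max_refinements:
            return {
                "query": query,
                "docs": docs,
                "relevance_score": relevance,  # the 0~1 value shown to operators
                "refinements": refinements,
            }
        query = rewrite(query)  # refinement step
        refinements += 1


# Hypothetical stand-ins so the sketch runs without an LLM.
corpus = ["CMP failure mode analysis for pad wear", "CMP slurry flow SOP"]


def retrieve(query):
    return corpus


def grade(query, doc):
    terms = set(query.lower().split())
    words = set(doc.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def rewrite(query):
    return "CMP failure mode analysis"


result = crag_retrieve("xyzzy foobar", retrieve, grade, rewrite)
print(result["query"], result["relevance_score"], result["refinements"])
```

The gibberish query scores 0.0 on the first pass, triggers one rewrite, and the rewritten query then clears the threshold, mirroring the smoke test described in the rationale.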

D9. Korean reranker → adoption deferred (D6 hypothesis partially reconfirmed)

| Mode | Mean LLM relevance | Rerank latency | vs hybrid baseline |
|---|---|---|---|
| hybrid (no rerank) | 0.734 | 0 ms | baseline |
| BAAI/bge-reranker-base (English) | 0.714 | 315 ms | -0.020 |
| Dongjin-kr/ko-reranker (Korean) | 0.703 | 826 ms | -0.031 |

[Figure: Reranker comparison]

Per-query pattern:

| Query | hybrid | BAAI | ko | Winner |
|---|---|---|---|---|
| Photo CD (direct) | 0.867 | 0.817 | 0.867 | hybrid / ko (tie) |
| CMP (direct) | 0.750 | 0.767 | 0.850 | ko (+0.10) |
| Etch (direct) | 0.750 | 0.650 | 0.567 | hybrid |
| Semantic indirect 1 (lens cleaning) | 0.700 | 0.633 | 0.783 | ko (+0.083) |
| Semantic indirect 2 (yield impact) | 0.817 | 0.867 | 0.617 | BAAI |
| Semantic indirect 3 (PM guide) | 0.517 | 0.550 | 0.533 | tie |

Trial and error (D6 → D9): D6 hypothesized that the English reranker's poor showing "would be solved by a Korean model". The D9 check showed the Korean reranker clearly ahead on the CMP and lens-cleaning queries but far behind on the Etch and yield queries, and its six-query mean falls short of the hybrid baseline. Conclusion: the real issue behind the D6 result is not English vs Korean but corpus size. D10 tests this hypothesis quantitatively.
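The mode-level means in the D9 summary table can be recomputed from the per-query table. A short check using only the six rows above; the dictionary layout and names are illustrative:

```python
from statistics import mean

# Per-query LLM relevance from the D9 table: (hybrid, BAAI, ko).
per_query = {
    "Photo CD (direct)": (0.867, 0.817, 0.867),
    "CMP (direct)": (0.750, 0.767, 0.850),
    "Etch (direct)": (0.750, 0.650, 0.567),
    "Semantic indirect 1": (0.700, 0.633, 0.783),
    "Semantic indirect 2": (0.817, 0.867, 0.617),
    "Semantic indirect 3": (0.517, 0.550, 0.533),
}

modes = ("hybrid", "BAAI", "ko")
means = {
    mode: mean(row[i] for row in per_query.values())
    for i, mode in enumerate(modes)
}
# Difference of each reranker's mean from the hybrid baseline.
deltas = {mode: means[mode] - means["hybrid"] for mode in modes}

for mode in modes:
    print(f"{mode}: mean={means[mode]:.4f} delta={deltas[mode]:+.4f}")
```

Both rerankers land slightly below the hybrid mean even though each wins individual queries, which is the pattern the paragraph above describes.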

D10. Reranker effect on the expanded corpus (34 docs) → hypothesis confirmed, effect fully reversed

| Mode | D9 (12 docs) | D10 (34 docs) | Change |
|---|---|---|---|
| hybrid (no rerank) | 0.734 | 0.592 | -0.142 (noise↑) |
| BAAI/bge-reranker-base | 0.714 (-0.020) | 0.709 (+0.117) | Reversed |
| Dongjin-kr/ko-reranker | 0.703 (-0.031) | 0.675 (+0.083) | Reversed |

[Figure: Reranker comparison (D10, 34 docs)]

Significance of the series (D6 → D9 → D10):

  • D6: found that the production standard is counterproductive on a small corpus (a quantitative rebuttal of the notion that "rerank is always better")
  • D9: a Korean reranker does not fix it either → proposed the hypothesis that the real cause is corpus size, not English vs Korean
  • D10: re-ran after expanding the corpus from 12 to 34 documents. The hybrid baseline fell by 0.14 and the reranker effect flipped to +0.12 → the hypothesis is quantitatively confirmed

Key message: without quantitative evaluation, a mistaken piece of conventional wisdom would have been carried forward; with it, the real cause could be isolated and verified. Adopted: RAG_BACKEND=hybrid_rerank is recommended for corpora of 30+ documents, while the demo corpus stays on hybrid.
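The adoption rule above could be encoded as a small helper. This is a sketch under assumptions: `RAG_BACKEND` is the variable named in the text, but `choose_rag_backend` and the automatic size-based fallback are hypothetical, and the repository may simply read the variable directly.

```python
import os


def choose_rag_backend(corpus_size: int, env=None) -> str:
    """Pick a retrieval backend: explicit RAG_BACKEND wins, else decide by size."""
    env = os.environ if env is None else env
    override = env.get("RAG_BACKEND")
    if override:
        return override
    # Assumption: 30 documents is the cut-off suggested by the D10 result.
    return "hybrid_rerank" if corpus_size >= 30 else "hybrid"


print(choose_rag_backend(12, env={}))  # hybrid
print(choose_rag_backend(34, env={}))  # hybrid_rerank
```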

D11. Conductor (Plan-and-Execute) vs Autonomous → Conductor adopted

| Metric | Autonomous | Conductor | Change |
|---|---|---|---|
| LLM calls / alarm | 10.0 | 4.0 | -60% |
| Tool calls / alarm | 13.7 | 16.0 | +17% |
| Unique citations / alarm | 6.0 | 6.0 | equal |
| Input tokens | 25,849 | 8,042 | -69% |
| Output tokens | 13,385 | 5,895 | -56% |
| Latency / alarm | 131 s | 60 s | -54% |
| Cost / 1000 alarms | $33.23 | $13.80 | -58% |

[Figures: Call comparison; Latency comparison]

Significance (D7 → D11 narrative):

  • D7: "workflow → agentic" gained a reasoning trace and autonomy (each Tier calls tools on its own)
  • D11: "agentic → conductor" recovered communication efficiency (the Central Planner produces the plan once, the Tier executors run it as written, and one LLM call synthesizes the result)
  • Both patterns were adopted only after quantitative comparison. Autonomous mode is preserved behind the environment variable AGENT_MODE=autonomous (for complex alarms or when adapting to unexpected context is required)
  • Recursion and infinite-loop risk is removed at the source: autonomous mode relies on the MAX_TOOL_ITERATIONS=4 cap, whereas fixed-plan execution resolves the problem by construction
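The plan-once, execute, synthesize-once flow described above can be sketched as below. All names (`run_conductor`, `CallCounter`, the two tools) are hypothetical, and the planner and synthesizer are plain functions standing in for LLM calls. The sketch uses two LLM calls for brevity and does not reproduce the 4 calls per alarm reported in the table.

```python
from dataclasses import dataclass


@dataclass
class CallCounter:
    llm_calls: int = 0
    tool_calls: int = 0


def run_conductor(alarm, plan_llm, tools, synthesize_llm, counter):
    """Plan once, run every planned tool step, then synthesize once."""
    counter.llm_calls += 1
    plan = plan_llm(alarm)  # list of (tool_name, argument) steps
    evidence = []
    for tool_name, argument in plan:  # bounded by plan length: no loop cap needed
        counter.tool_calls += 1
        evidence.append(tools[tool_name](argument))
    counter.llm_calls += 1
    return synthesize_llm(alarm, evidence)


# Hypothetical stand-ins for the planner, tools, and synthesizer.
def plan_llm(alarm):
    return [("search_sop", alarm), ("search_fmea", alarm)]


tools = {
    "search_sop": lambda q: f"SOP hit for {q}",
    "search_fmea": lambda q: f"FMEA hit for {q}",
}


def synthesize_llm(alarm, evidence):
    return f"{alarm}: " + "; ".join(evidence)


counter = CallCounter()
report = run_conductor("CMP pressure drift", plan_llm, tools, synthesize_llm, counter)
print(report)
print(counter)
```

Because the executor only walks a finite plan, the number of tool calls is fixed before execution starts, which is the property the last bullet above relies on.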

How to run

```bash
# Tier 1 model benchmark (D1)
.venv/bin/python -m experiments.tier1_detection.benchmark

# Retrieval latency comparison (D2)
.venv/bin/python -m experiments.retrieval_compare.benchmark

# Multi-agent vs Single LLM (D5)
.venv/bin/python -m experiments.multi_vs_single.benchmark

# RAGAS hybrid vs hybrid_rerank
.venv/bin/python -m experiments.rag_eval.benchmark

# RAG paradigm 5-stage ablation (D6)
.venv/bin/python -m experiments.rag_paradigm.benchmark
# Regenerate charts only (uses the CSV cache):
.venv/bin/python -m experiments.rag_paradigm.benchmark --charts-only

# Workflow vs Agentic (D7)
.venv/bin/python -m experiments.agentic_vs_workflow.benchmark

# CRAG ON vs OFF (D8)
.venv/bin/python -m experiments.crag_eval.benchmark

# Korean reranker (Dongjin-kr/ko-reranker) evaluation (D9, D10)
.venv/bin/python -m experiments.reranker_compare.benchmark

# Conductor vs Autonomous (D11)
.venv/bin/python -m experiments.conductor_vs_autonomous.benchmark
```

Each experiment generates results.md and charts/*.png.