Add complete Financial RAG system with Metacognitive Agent
Implemented a comprehensive RAG (Retrieval-Augmented Generation) system for financial/economics research papers with the following features:
Core Components:
- PDF processing and text extraction (PyPDF2, pdfplumber)
- Text chunking with overlap for context preservation
- Vector embeddings (Sentence Transformers, OpenAI, Cohere support)
- ChromaDB vector store for efficient similarity search
- Metacognitive agent with 4-stage process (Planning → Monitoring → Evaluation → Revision)
- FastAPI REST API with comprehensive endpoints
Key Features:
- Supports 2,639+ financial/economics journal articles
- Hallucination detection and prevention
- Iterative answer refinement based on quality evaluation
- Flexible embedding model selection (free and paid options)
- Batch processing for efficient indexing
- Detailed logging and statistics
API Endpoints:
- POST /query: Question answering with RAG
- GET /health: System health check
- GET /stats: Vector store statistics
- Interactive API docs at /docs
Scripts:
- index_pdfs.py: Index PDF files into vector database
- check_vector_db.py: Verify vector database contents
- test_query.py: Test queries against the API
Documentation:
- Comprehensive README with quick start guide
- Detailed USAGE_GUIDE in Korean
- Environment configuration examples
- Troubleshooting section
The system enables researchers to query a large corpus of financial literature with high-quality, source-backed answers while minimizing hallucinations through metacognitive reflection.
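The refinement loop summarized above (plan, generate, evaluate, revise until the answer scores well) can be sketched in a few lines of Python. All function names here are illustrative stand-ins supplied by the caller, not the repository's actual API; the real implementation lives in `app/metacognitive_agent.py`.

```python
# Illustrative sketch of the Planning -> Monitoring -> Evaluation -> Revision
# loop; `retrieve`, `generate`, `evaluate`, and `revise` are caller-supplied
# stand-ins, not functions from this repository.
def metacognitive_answer(question, retrieve, generate, evaluate, revise,
                         max_iterations=2, target_score=8):
    """Iteratively refine a RAG answer until it scores well or iterations run out."""
    context = retrieve(question)           # Planning: gather the needed evidence
    answer = generate(question, context)   # first draft (monitored for hallucinations)
    for _ in range(max_iterations):
        # Evaluation: score the draft (1-10) and collect feedback
        score, feedback = evaluate(question, answer, context)
        if score >= target_score:          # stop early once the answer is good enough
            break
        # Revision: rewrite the draft using the evaluator's feedback
        answer = revise(answer, feedback, context)
    return answer
```

The early-exit threshold and iteration cap mirror the documented behavior (score of 8 or higher stops the loop; at most 2 revision passes).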
- .env.example +26 -0
- .gitignore +51 -0
- README.md +276 -0
- USAGE_GUIDE.md +366 -0
- app/__init__.py +0 -0
- app/api/__init__.py +0 -0
- app/api/models.py +85 -0
- app/api/routes.py +151 -0
- app/main.py +121 -0
- app/metacognitive_agent.py +289 -0
- app/rag_pipeline.py +163 -0
- requirements.txt +36 -0
- scripts/__init__.py +0 -0
- scripts/check_vector_db.py +85 -0
- scripts/index_pdfs.py +153 -0
- scripts/test_query.py +116 -0
- services/__init__.py +0 -0
- services/chunker.py +93 -0
- services/embedder.py +147 -0
- services/pdf_processor.py +111 -0
- services/vector_store.py +178 -0
- utils/__init__.py +0 -0
- utils/config.py +39 -0
`.env.example` (@@ -0,0 +1,26 @@):

```env
# Anthropic API Key
ANTHROPIC_API_KEY=your_anthropic_api_key_here

# OpenAI API Key (for embeddings - optional)
OPENAI_API_KEY=your_openai_api_key_here

# Cohere API Key (alternative for embeddings - optional)
COHERE_API_KEY=your_cohere_api_key_here

# Vector Database Settings
CHROMA_PERSIST_DIRECTORY=./data/chroma_db
COLLECTION_NAME=financial_papers

# PDF Processing Settings
PDF_SOURCE_PATH=/Users/seongjincho/Desktop/HYU-06-๊ณตํ๋ฐ์ฌ ๋์ ๊ธฐ/25.8.15(ํ๋๋ฉํธ DB ๋ ํผ๋ฐ์ค)/data/
CHUNK_SIZE=1000
CHUNK_OVERLAP=200

# Embedding Model
# Options: "openai", "sentence-transformers", "cohere"
EMBEDDING_MODEL=sentence-transformers
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2

# API Settings
API_HOST=0.0.0.0
API_PORT=8000
```
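A minimal sketch of how these variables might be read at startup, using only the standard library. The repository's `utils/config.py` may structure this differently; the defaults below are taken from the `.env.example` values.

```python
import os

# Sketch of a settings loader for the variables above; defaults mirror
# .env.example. Not necessarily identical to the repo's utils/config.py.
def load_settings(env=os.environ):
    return {
        "anthropic_api_key": env.get("ANTHROPIC_API_KEY", ""),
        "chroma_persist_directory": env.get("CHROMA_PERSIST_DIRECTORY", "./data/chroma_db"),
        "collection_name": env.get("COLLECTION_NAME", "financial_papers"),
        "chunk_size": int(env.get("CHUNK_SIZE", "1000")),       # characters per chunk
        "chunk_overlap": int(env.get("CHUNK_OVERLAP", "200")),  # shared characters
        "embedding_model": env.get("EMBEDDING_MODEL", "sentence-transformers"),
        "embedding_model_name": env.get("EMBEDDING_MODEL_NAME", "all-MiniLM-L6-v2"),
        "api_host": env.get("API_HOST", "0.0.0.0"),
        "api_port": int(env.get("API_PORT", "8000")),
    }
```

Passing a plain dict as `env` makes the loader easy to test without touching the process environment.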
`.gitignore` (@@ -0,0 +1,51 @@):

```
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual Environment
venv/
env/
ENV/

# Environment Variables
.env

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# Data
data/chroma_db/
data/*.pdf
*.pdf

# Logs
logs/
*.log

# MacOS
.DS_Store

# Jupyter
.ipynb_checkpoints/
```
`README.md` (@@ -0,0 +1,276 @@, translated from Korean):

# Financial RAG with Metacognitive Agent

A RAG (Retrieval-Augmented Generation) system for financial/economics papers, with a metacognitive agent.

## Key Features

- **2,639 financial/economics journal papers** vector-indexed
- **Metacognitive agent** (Planning → Monitoring → Evaluation → Revision)
- **High-performance vector search** (ChromaDB)
- **Hallucination detection and prevention**
- **FastAPI-based REST API**
- **Selectable embedding models** (Sentence Transformers, OpenAI, Cohere)

## Project Structure

```
Hallucination_and_Deception_for_financial_RAG/
├── app/
│   ├── main.py                  # FastAPI main app
│   ├── metacognitive_agent.py   # Metacognitive agent
│   ├── rag_pipeline.py          # RAG pipeline
│   └── api/
│       ├── routes.py            # API endpoints
│       └── models.py            # Pydantic models
├── services/
│   ├── pdf_processor.py         # PDF processing
│   ├── chunker.py               # Text chunking
│   ├── embedder.py              # Embedding generation
│   └── vector_store.py          # Vector DB
├── utils/
│   └── config.py                # Config management
├── scripts/
│   └── index_pdfs.py            # PDF indexing script
├── data/
│   └── chroma_db/               # Vector DB storage
├── requirements.txt
├── .env.example
└── README.md
```

## Quick Start

### 1. Environment setup

```bash
# Clone the repository
git clone https://github.com/yourusername/Hallucination_and_Deception_for_financial_RAG.git
cd Hallucination_and_Deception_for_financial_RAG

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Environment variables

```bash
# Create the .env file
cp .env.example .env

# Edit the .env file (required!)
nano .env
```

Example `.env` file:
```env
# Anthropic API key (required)
ANTHROPIC_API_KEY=your_api_key_here

# PDF path (local MacBook path)
PDF_SOURCE_PATH=/Users/seongjincho/Desktop/HYU-06-๊ณตํ๋ฐ์ฌ ๋์ ๊ธฐ/25.8.15(ํ๋๋ฉํธ DB ๋ ํผ๋ฐ์ค)/data/

# Embedding model (free: sentence-transformers)
EMBEDDING_MODEL=sentence-transformers
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2
```

### 3. PDF indexing (run on the local machine)

```bash
# Index the PDF files into the vector DB
python scripts/index_pdfs.py
```

This step:
1. Reads the 2,639 PDF files
2. Extracts and chunks the text
3. Generates embeddings (roughly 30-60 minutes with the free model)
4. Stores everything in ChromaDB

**Notes:**
- The first run downloads the Sentence Transformer model (~90MB)
- After indexing, the `data/chroma_db/` folder is created
- Only this folder needs to be uploaded to GitHub (the source PDFs are excluded)

### 4. Run the API server

```bash
# Start the FastAPI server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

Once the server is running:
- API docs: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc

## API Usage

### Health check

```bash
curl http://localhost:8000/health
```

### Ask a question (metacognition enabled)

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the main causes of financial crises?",
    "top_k": 5,
    "enable_metacognition": true
  }'
```

### From Python

```python
import requests

response = requests.post(
    "http://localhost:8000/query",
    json={
        "question": "What are the effects of portfolio diversification?",
        "top_k": 5,
        "enable_metacognition": True
    }
)

result = response.json()
print(f"Answer: {result['answer']}")
print(f"Sources: {len(result['sources'])} documents")
print(f"Iterations: {result['metacognition']['iterations']}")
```

## What is the Metacognitive Agent?

The metacognitive agent produces high-quality answers through four stages:

### 1. Planning
- Analyze the question
- Establish an answering strategy
- Identify the information needed

### 2. Monitoring
- Monitor the answer-generation process
- Detect hallucinations
- Verify logical soundness

### 3. Evaluation
- Assess completeness, accuracy, clarity, and reliability
- Assign a 1-10 score
- Identify the parts that need improvement

### 4. Revision
- Improve the answer based on feedback
- Repeat up to 2 times
- Stop early at a score of 8 or higher

## Advanced Configuration

### Changing the embedding model

#### OpenAI (paid, high quality)
```env
EMBEDDING_MODEL=openai
EMBEDDING_MODEL_NAME=text-embedding-ada-002
OPENAI_API_KEY=your_openai_key
```

#### Cohere (free tier available)
```env
EMBEDDING_MODEL=cohere
EMBEDDING_MODEL_NAME=embed-multilingual-v3.0
COHERE_API_KEY=your_cohere_key
```

### Tuning the chunking parameters

```env
CHUNK_SIZE=1000      # Chunk size (default: 1000 characters)
CHUNK_OVERLAP=200    # Chunk overlap (default: 200 characters)
```

## Performance

### Memory usage
- **During indexing**: ~2-4GB (for 2,639 PDFs)
- **While serving the API**: ~1-2GB

### Response time
- **Metacognition disabled**: ~2-5 seconds
- **Metacognition enabled**: ~10-30 seconds (higher quality)

### Batch processing
Adjust the batch size when generating embeddings:
```python
# In scripts/index_pdfs.py
embeddings = embedder.embed_batch(texts, batch_size=64)  # default: 32
```

## Troubleshooting

### PDF path error
```
FileNotFoundError: Directory not found
```
→ Check `PDF_SOURCE_PATH` in the `.env` file

### API key error
```
AuthenticationError: Invalid API key
```
→ Check `ANTHROPIC_API_KEY` in the `.env` file

### Empty vector DB
```
total_documents: 0
```
→ Run `python scripts/index_pdfs.py` first

### Out of memory
→ Reduce the batch size: `batch_size=16`

## Next Steps

1. **Upload the vector DB**
   ```bash
   git add data/chroma_db/
   git commit -m "Add vector database"
   git push
   ```

2. **Cloud deployment** (optional)
   - AWS EC2 / GCP / Azure
   - Docker containerization
   - API key management (e.g. AWS Secrets Manager)

3. **Frontend** (optional)
   - Streamlit / Gradio
   - React / Vue.js

## Contributing

Issues and PRs welcome!

## License

MIT License

## Author

Seongjin Cho
- PhD program, Hanyang University
- Financial/economics research

---

**Important:**
- Never commit API keys to GitHub!
- The `.env` file is included in `.gitignore`
- The source PDFs are large, so upload only the vector DB
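The CHUNK_SIZE / CHUNK_OVERLAP settings in the README can be illustrated with a minimal character-based chunker. This is a sketch of the technique only; `services/chunker.py` may implement chunking differently (for example, splitting on sentence boundaries).

```python
# Character-based chunking with overlap: consecutive chunks share
# `chunk_overlap` characters so context is preserved across boundaries.
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size chunks where consecutive chunks overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # the last chunk reached the end
            break
    return chunks
```

With the defaults, a 10,000-character document yields chunks starting every 800 characters, each 1,000 characters long.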
`USAGE_GUIDE.md` (@@ -0,0 +1,366 @@, translated from Korean):

# Usage Guide

## Table of Contents
1. [Installation](#1-installation)
2. [PDF Indexing](#2-pdf-indexing)
3. [Running the API Server](#3-running-the-api-server)
4. [Asking Questions](#4-asking-questions)
5. [Advanced Usage](#5-advanced-usage)
6. [Troubleshooting](#6-troubleshooting)

---

## 1. Installation

### 1-1. Check the Python environment
```bash
python --version  # 3.8 or higher required
```

### 1-2. Create a virtual environment
```bash
# Create the virtual environment
python -m venv venv

# Activate (macOS/Linux)
source venv/bin/activate

# Activate (Windows)
venv\Scripts\activate
```

### 1-3. Install dependencies
```bash
pip install -r requirements.txt
```

**Main packages installed:**
- FastAPI (web server)
- Anthropic (Claude API)
- ChromaDB (vector database)
- Sentence Transformers (embeddings)
- PyPDF2, pdfplumber (PDF processing)

---

## 2. PDF Indexing

### 2-1. Set the environment variables

Create the `.env` file:
```bash
cp .env.example .env
```

Edit the `.env` file:
```env
# Required: Anthropic API key
ANTHROPIC_API_KEY=sk-ant-api03-xxx...

# Required: PDF file path (local MacBook)
PDF_SOURCE_PATH=/Users/seongjincho/Desktop/HYU-06-๊ณตํ๋ฐ์ฌ ๋์ ๊ธฐ/25.8.15(ํ๋๋ฉํธ DB ๋ ํผ๋ฐ์ค)/data/

# Optional: embedding model (defaults recommended)
EMBEDDING_MODEL=sentence-transformers
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2
```

### 2-2. Run the indexing

```bash
python scripts/index_pdfs.py
```

**Expected duration:**
- 2,639 PDF files
- About 30-60 minutes (with the free model)
- Plus the model download on the first run (~90MB)

**Progress output:**
```
[1/4] Processing PDF files...
      - Total documents: 2639
      - Total pages: 50000+

[2/4] Chunking text...
      - Total chunks: 30000+

[3/4] Generating embeddings...
      - Embeddings: 30000+
      - Embedding dimension: 384

[4/4] Saving to the vector DB...
      - Done!
```

### 2-3. Verify the index

```bash
python scripts/check_vector_db.py
```

---

## 3. Running the API Server

### 3-1. Start the server

```bash
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

**Or more simply:**
```bash
python app/main.py
```

### 3-2. Verify the server

In a browser:
- API docs: http://localhost:8000/docs
- Alternative docs: http://localhost:8000/redoc

From a terminal:
```bash
curl http://localhost:8000/health
```

---

## 4. Asking Questions

### 4-1. Via the web UI (easiest)

1. Open http://localhost:8000/docs
2. Click `POST /query`
3. Click "Try it out"
4. Enter the request body:
```json
{
  "question": "What are the main causes of financial crises?",
  "top_k": 5,
  "enable_metacognition": true
}
```
5. Click "Execute"

### 4-2. From a terminal

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the effects of portfolio diversification?",
    "top_k": 5,
    "enable_metacognition": true
  }'
```

### 4-3. From a Python script

```bash
python scripts/test_query.py
```

Or in Python code:
```python
import requests

response = requests.post(
    "http://localhost:8000/query",
    json={
        "question": "What are the effects of central bank interest-rate policy?",
        "top_k": 5,
        "enable_metacognition": True
    }
)

result = response.json()
print(result["answer"])
```

---

## 5. Advanced Usage

### 5-1. Disable metacognition (faster responses)

```json
{
  "question": "your question",
  "top_k": 5,
  "enable_metacognition": false
}
```

**Difference:**
- Metacognition enabled: 10-30 seconds, higher quality
- Metacognition disabled: 2-5 seconds, standard quality

### 5-2. Adjust the number of retrieved documents

Increase `top_k` to retrieve more documents:
```json
{
  "question": "your question",
  "top_k": 10,
  "enable_metacognition": true
}
```

### 5-3. Metadata filtering

Search only papers by a specific author:
```json
{
  "question": "your question",
  "top_k": 5,
  "filter_metadata": {
    "author": "John Doe"
  }
}
```

### 5-4. Understanding the response structure

```json
{
  "question": "the original question",
  "answer": "the generated answer",
  "sources": [
    {
      "text": "document content...",
      "source_filename": "paper123.pdf",
      "similarity": 0.89,
      "metadata": {
        "title": "paper title",
        "author": "author"
      }
    }
  ],
  "metacognition": {
    "thinking_history": [...],
    "iterations": 2
  },
  "search_stats": {
    "documents_found": 5,
    "top_similarity": 0.89
  }
}
```

---

## 6. Troubleshooting

### Problem 1: PDF path error
```
FileNotFoundError: Directory not found
```

**Fix:**
```bash
# Check the PDF path in the .env file
nano .env

# Make sure the path is correct
ls "/Users/seongjincho/Desktop/..."
```

### Problem 2: API key error
```
AuthenticationError: Invalid API key
```

**Fix:**
```bash
# Check the .env file
nano .env

# Make sure ANTHROPIC_API_KEY is correct
# The key must start with sk-ant-api03-
```

### Problem 3: Empty vector DB
```
total_documents: 0
```

**Fix:**
```bash
# Run the indexing first
python scripts/index_pdfs.py

# Verify
python scripts/check_vector_db.py
```

### Problem 4: Out of memory
```
MemoryError
```

**Fix:**
```bash
# In scripts/index_pdfs.py
# reduce the batch size
embeddings = embedder.embed_batch(texts, batch_size=16)  # 32 → 16
```

### Problem 5: Server won't start
```
Address already in use
```

**Fix:**
```bash
# Change the port
uvicorn app.main:app --reload --port 8001

# Or in the .env file
API_PORT=8001
```

### Problem 6: Embedding model download fails
```
ConnectionError
```

**Fix:**
```bash
# Download the model manually
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
```

---

## Tips

### Performance
1. **Use an SSD**: storing the vector DB on an SSD is recommended
2. **Memory**: at least 8GB RAM recommended
3. **Batch size**: without a GPU, use batch_size=16-32

### Cost savings
1. **Free embeddings**: use Sentence Transformers
2. **Disable metacognition**: for quick tests

### Quality
1. **Enable metacognition**: when high-quality answers are needed
2. **Increase top_k**: to consult more documents
3. **Tune the chunk size**: when longer context is needed

---

## Support

If a problem isn't resolved:
1. Open a GitHub issue
2. Attach the log file
3. Copy the full error message

**Checking logs:**
```bash
# API server logs print to the terminal
# save to a file if needed
uvicorn app.main:app > server.log 2>&1
```
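The `similarity` scores in the response structure above come from comparing the question embedding against each chunk embedding. ChromaDB computes this internally; purely as an illustration, a dependency-free top-k cosine-similarity search can be sketched as:

```python
import math

# Illustrative only: how top-k retrieval by cosine similarity works.
# The actual system delegates this to ChromaDB (services/vector_store.py).
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=5):
    """Return (index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]
```

With 384-dimensional `all-MiniLM-L6-v2` embeddings, this is exactly the shape of computation that produces values like the `0.89` in the example response.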
- app/__init__.py: file without changes
- app/api/__init__.py: file without changes
`app/api/models.py` (@@ -0,0 +1,85 @@, docstrings translated from Korean):

```python
"""
API request/response model definitions (Pydantic)
"""

from pydantic import BaseModel, Field
from typing import List, Dict, Optional, Any


class QueryRequest(BaseModel):
    """Query request model"""
    question: str = Field(..., description="User question")
    top_k: int = Field(default=5, ge=1, le=20, description="Number of documents to retrieve")
    enable_metacognition: bool = Field(default=True, description="Whether to enable the metacognitive process")
    filter_metadata: Optional[Dict[str, str]] = Field(default=None, description="Metadata filter")

    class Config:
        json_schema_extra = {
            "example": {
                "question": "What are the main causes of financial crises?",
                "top_k": 5,
                "enable_metacognition": True
            }
        }


class SourceDocument(BaseModel):
    """Source document model"""
    text: str = Field(..., description="Document text")
    source_filename: str = Field(..., description="Source filename")
    similarity: float = Field(..., description="Similarity score")
    metadata: Dict[str, Any] = Field(default_factory=dict, description="Document metadata")


class MetaCognitionInfo(BaseModel):
    """Metacognition info model"""
    thinking_history: List[Dict[str, Any]] = Field(..., description="Reasoning history")
    iterations: int = Field(..., description="Number of improvement iterations")


class SearchStats(BaseModel):
    """Search statistics model"""
    documents_found: int = Field(..., description="Number of documents found")
    top_similarity: float = Field(..., description="Highest similarity score")


class QueryResponse(BaseModel):
    """Query response model"""
    question: str = Field(..., description="Original question")
    answer: str = Field(..., description="Generated answer")
    sources: List[SourceDocument] = Field(..., description="Source documents consulted")
    metacognition: Optional[MetaCognitionInfo] = Field(default=None, description="Metacognition info")
    search_stats: SearchStats = Field(..., description="Search statistics")

    class Config:
        json_schema_extra = {
            "example": {
                "question": "What are the main causes of financial crises?",
                "answer": "The main causes of the 2008 financial crisis...",
                "sources": [
                    {
                        "text": "Paper content...",
                        "source_filename": "financial_crisis_2008.pdf",
                        "similarity": 0.89,
                        "metadata": {"author": "John Doe"}
                    }
                ],
                "search_stats": {
                    "documents_found": 5,
                    "top_similarity": 0.89
                }
            }
        }


class HealthResponse(BaseModel):
    """Health check response"""
    status: str = Field(..., description="Server status")
    vector_store: Dict[str, Any] = Field(..., description="Vector store info")
    embedding_model: Dict[str, Any] = Field(..., description="Embedding model info")


class ErrorResponse(BaseModel):
    """Error response"""
    error: str = Field(..., description="Error message")
    detail: Optional[str] = Field(default=None, description="Detailed info")
```
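The constraints these Pydantic models enforce (a required `question`, `top_k` between 1 and 20, defaults for the optional fields) can be mimicked in plain Python. This sketch is for illustration only and runs without pydantic installed; it is not part of the repository.

```python
# Plain-Python illustration of the validation QueryRequest performs;
# pydantic does this declaratively via Field(ge=1, le=20) and defaults.
def validate_query(payload):
    """Validate a /query payload and fill in the documented defaults."""
    question = payload.get("question")
    if not isinstance(question, str) or not question:
        raise ValueError("question is required")
    top_k = payload.get("top_k", 5)
    if not (1 <= top_k <= 20):
        raise ValueError("top_k must be between 1 and 20")
    return {
        "question": question,
        "top_k": top_k,
        "enable_metacognition": payload.get("enable_metacognition", True),
        "filter_metadata": payload.get("filter_metadata"),
    }
```

FastAPI performs the equivalent checks automatically and returns a 422 response when they fail.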
@@ -0,0 +1,151 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""
FastAPI route definitions
"""

from fastapi import APIRouter, HTTPException, status
from loguru import logger

from app.api.models import (
    QueryRequest,
    QueryResponse,
    HealthResponse,
    ErrorResponse
)

# Create the router
router = APIRouter()

# The RAG pipeline is injected from main.py
rag_pipeline = None


def set_rag_pipeline(pipeline):
    """Set the RAG pipeline"""
    global rag_pipeline
    rag_pipeline = pipeline


@router.get("/", tags=["Root"])
async def root():
    """API root endpoint"""
    return {
        "message": "Financial RAG API with Metacognitive Agent",
        "version": "1.0.0",
        "endpoints": {
            "health": "/health",
            "query": "/query",
            "docs": "/docs"
        }
    }


@router.get(
    "/health",
    response_model=HealthResponse,
    tags=["Health"],
    summary="Health check"
)
async def health_check():
    """
    Check system health.

    Returns:
        System statistics and status information
    """
    try:
        if not rag_pipeline:
            raise HTTPException(
                status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
                detail="RAG pipeline not initialized"
            )

        stats = rag_pipeline.get_statistics()

        return HealthResponse(
            status="healthy",
            vector_store=stats["vector_store"],
            embedding_model=stats["embedding_model"]
        )

    except Exception as e:
        logger.error(f"Health check failed: {str(e)}")
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=str(e)
        )


@router.post(
    "/query",
    response_model=QueryResponse,
    tags=["Query"],
    summary="Ask a question",
    description="Generate an answer to a finance/economics question with the RAG system"
)
async def query(request: QueryRequest):
    """
    Generate an answer to a question.

    Args:
        request: Query request (question, top_k, enable_metacognition, etc.)

    Returns:
        Answer, source documents, metacognition info, etc.
    """
    try:
        if not rag_pipeline:
            raise HTTPException(
                status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
                detail="RAG pipeline not initialized"
            )

        logger.info(f"Received query: {request.question}")

        # Process the question through the RAG pipeline
        result = await rag_pipeline.query(
            question=request.question,
            top_k=request.top_k,
            enable_metacognition=request.enable_metacognition,
            filter_metadata=request.filter_metadata
        )

        logger.info("Query processed successfully")

        return QueryResponse(**result)

    except Exception as e:
        logger.error(f"Query failed: {str(e)}")
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Query processing failed: {str(e)}"
        )


@router.get(
    "/stats",
    tags=["Stats"],
    summary="Statistics"
)
async def get_stats():
    """
    RAG system statistics.

    Returns:
        Statistics on the vector store, embedding model, etc.
    """
    try:
        if not rag_pipeline:
            raise HTTPException(
                status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
                detail="RAG pipeline not initialized"
            )

        stats = rag_pipeline.get_statistics()
        return stats

    except Exception as e:
        logger.error(f"Stats retrieval failed: {str(e)}")
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=str(e)
        )
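The routes hold the pipeline in a module-level global that `main.py` fills in at startup via `set_rag_pipeline`. A stdlib-only sketch of that injection pattern, with a lambda standing in for the real pipeline (all names here are illustrative):

```python
# Module-level injection, as used by routes.py: the handler fails fast
# (maps to HTTP 503 in the real route) until the pipeline is injected.

pipeline = None  # stands in for the module-level rag_pipeline


def set_pipeline(p):
    """Inject the pipeline (called once at startup)."""
    global pipeline
    pipeline = p


def handle_query(question):
    """Endpoint-style handler: refuse to run without a pipeline."""
    if pipeline is None:
        raise RuntimeError("pipeline not initialized")
    return pipeline(question)


set_pipeline(lambda q: f"answer to: {q}")
print(handle_query("what is a repo rate?"))
```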
app/main.py
@@ -0,0 +1,121 @@
"""
FastAPI main application

Run with:
    uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
"""

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from loguru import logger
import sys

from app.api import routes
from app.metacognitive_agent import MetaCognitiveAgent
from app.rag_pipeline import RAGPipeline
from services.vector_store import VectorStore
from services.embedder import Embedder
from utils.config import settings

# Logging setup
logger.remove()
logger.add(
    sys.stdout,
    format="<green>{time:YYYY-MM-DD HH:mm:ss}</green> | <level>{level: <8}</level> | <cyan>{name}</cyan>:<cyan>{function}</cyan> - <level>{message}</level>",
    level="INFO"
)

# Create the FastAPI app
app = FastAPI(
    title="Financial RAG API",
    description="RAG system over finance/economics papers with a metacognitive agent",
    version="1.0.0",
    docs_url="/docs",
    redoc_url="/redoc"
)

# CORS setup
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # restrict to specific domains in production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


@app.on_event("startup")
async def startup_event():
    """Initialize on server startup"""
    logger.info("=" * 80)
    logger.info("Starting Financial RAG API...")
    logger.info("=" * 80)

    try:
        # 1. Initialize the vector store
        logger.info("1. Initializing Vector Store...")
        vector_store = VectorStore(
            persist_directory=settings.chroma_persist_directory,
            collection_name=settings.collection_name
        )
        logger.info(f"Vector Store ready ({vector_store.collection.count()} documents)")

        # 2. Initialize the embedder
        logger.info("2. Initializing Embedder...")
        embedder = Embedder(
            model_type=settings.embedding_model,
            model_name=settings.embedding_model_name,
            openai_api_key=settings.openai_api_key,
            cohere_api_key=settings.cohere_api_key
        )
        logger.info(f"Embedder ready ({embedder.get_embedding_dimension()} dimensions)")

        # 3. Initialize the metacognitive agent
        logger.info("3. Initializing Metacognitive Agent...")
        agent = MetaCognitiveAgent(api_key=settings.anthropic_api_key)
        logger.info(f"Agent ready ({agent.model})")

        # 4. Create the RAG pipeline
        logger.info("4. Creating RAG Pipeline...")
        rag_pipeline = RAGPipeline(
            vector_store=vector_store,
            embedder=embedder,
            metacognitive_agent=agent
        )
        logger.info("RAG Pipeline ready")

        # Inject the pipeline into the router module
        routes.set_rag_pipeline(rag_pipeline)

        logger.info("=" * 80)
        logger.info("API server ready!")
        logger.info(f"Vector DB: {vector_store.collection.count()} documents")
        logger.info(f"Model: {agent.model}")
        logger.info(f"API Docs: http://{settings.api_host}:{settings.api_port}/docs")
        logger.info("=" * 80)

    except Exception as e:
        logger.error(f"Initialization failed: {str(e)}")
        raise


@app.on_event("shutdown")
async def shutdown_event():
    """Clean up on server shutdown"""
    logger.info("Shutting down API server...")


# Register the router
app.include_router(routes.router)


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(
        "app.main:app",
        host=settings.api_host,
        port=settings.api_port,
        reload=True,
        log_level="info"
    )
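The pinned FastAPI 0.109 still accepts `@app.on_event("startup"/"shutdown")`, but FastAPI's docs deprecate it in favor of a single `lifespan` context manager that wraps startup and shutdown around the serving phase. The control flow can be sketched with the stdlib alone (FastAPI passes the app instance where `None` is used here):

```python
import asyncio
from contextlib import asynccontextmanager

events = []

@asynccontextmanager
async def lifespan(app):
    events.append("startup")   # build vector store, embedder, agent, pipeline
    yield                      # the server handles requests while suspended here
    events.append("shutdown")  # release resources

async def serve():
    # Stands in for the server's lifetime between startup and shutdown.
    async with lifespan(None):
        events.append("handling requests")

asyncio.run(serve())
print(events)
```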
app/metacognitive_agent.py
@@ -0,0 +1,289 @@
"""
Metacognitive Agent

This agent applies the following metacognitive strategies:
1. Planning: devise an answering strategy
2. Monitoring: monitor the answering process
3. Evaluation: assess answer quality
4. Revision: improve the answer when needed
"""

from typing import List, Dict, Optional
from anthropic import Anthropic
from loguru import logger
import json


class MetaCognitiveAgent:
    """An AI agent with metacognitive capabilities"""

    def __init__(self, api_key: str):
        """
        Args:
            api_key: Anthropic API key
        """
        self.client = Anthropic(api_key=api_key)
        self.thinking_history = []
        self.model = "claude-3-5-sonnet-20241022"

        # Metacognitive prompts
        self.reflection_prompts = {
            "planning": """
You are an expert in finance and economics. Devise a strategy for answering the following question.

Question: {query}

Retrieved relevant documents:
{context}

Build an answering plan that considers:
1. What key information does the question require?
2. Are the provided documents sufficient to answer it?
3. Which information should be used first?
4. What caveats or limitations apply?

Write the plan in JSON format:
{{
    "key_information": "key information the question requires",
    "context_adequacy": "adequacy of the documents (sufficient/insufficient/unclear)",
    "strategy": "answering strategy",
    "limitations": "caveats and limitations"
}}
""",

            "monitoring": """
Review the answer currently being drafted.

Question: {query}
Current answer: {response}

Check the following:
1. Does the answer address the question directly?
2. Does it use the provided documents accurately?
3. Is the reasoning logically sound?
4. Does it contain hallucinations (claims without supporting evidence)?

Write the assessment in JSON format:
{{
    "relevance": "relevance to the question (high/medium/low)",
    "accuracy": "accuracy (high/medium/low)",
    "logic": "logical soundness (sound/fair/problematic)",
    "hallucination_risk": "hallucination risk (low/medium/high)",
    "issues": ["issues found"]
}}
""",

            "evaluation": """
Evaluate the final answer.

Question: {query}
Answer: {response}
Sources used: {sources}

Evaluate against these criteria:
1. Completeness: does it fully answer the question?
2. Accuracy: is the information correct?
3. Clarity: is the answer clear and easy to understand?
4. Reliability: are the sources clear and trustworthy?

Write the evaluation in JSON format:
{{
    "completeness": "completeness score (1-10)",
    "accuracy": "accuracy score (1-10)",
    "clarity": "clarity score (1-10)",
    "reliability": "reliability score (1-10)",
    "overall_score": "overall score (1-10)",
    "feedback": "areas that need improvement"
}}
""",

            "revision": """
Improve the answer.

Original answer: {response}
Evaluation feedback: {feedback}

Revise the answer based on the feedback. In particular:
1. Fix inaccurate information
2. Fill in incomplete parts
3. Clarify unclear wording
4. Remove unsupported claims

Provide only the improved answer.
"""
        }

    async def think_and_reflect(
        self,
        query: str,
        context_documents: List[Dict],
        max_iterations: int = 2
    ) -> Dict:
        """
        Generate an answer through the metacognitive process.

        Args:
            query: User question
            context_documents: Retrieved relevant documents
            max_iterations: Maximum number of revision iterations

        Returns:
            Final answer and metacognitive process information
        """
        self.thinking_history = []

        # Format the context
        context_text = self._format_context(context_documents)

        # Step 1: Planning
        logger.info("1. Planning: devising an answering strategy...")
        plan = await self._plan(query, context_text)
        self.thinking_history.append({"step": "planning", "content": plan})

        # Step 2: Generate the initial response
        logger.info("2. Generating: drafting the initial answer...")
        initial_response = await self._generate_response(query, context_text, plan)
        self.thinking_history.append({"step": "initial_response", "content": initial_response})

        # Step 3: Monitoring
        logger.info("3. Monitoring: reviewing the answer...")
        monitoring_result = await self._monitor(query, initial_response)
        self.thinking_history.append({"step": "monitoring", "content": monitoring_result})

        current_response = initial_response

        # Step 4: Iterative revision
        for iteration in range(max_iterations):
            # Evaluation
            logger.info(f"4. Evaluation [{iteration + 1}/{max_iterations}]: assessing the answer...")
            evaluation = await self._evaluate(
                query,
                current_response,
                [doc.get('source_filename', 'unknown') for doc in context_documents]
            )
            self.thinking_history.append({"step": f"evaluation_{iteration}", "content": evaluation})

            # Stop early once the evaluation score is high enough
            try:
                eval_data = json.loads(evaluation)
                overall_score = float(eval_data.get('overall_score', 0))

                if overall_score >= 8.0:
                    logger.info(f"Sufficient quality reached (score: {overall_score}/10)")
                    break
            except (json.JSONDecodeError, TypeError, ValueError):
                pass

            # Revision
            logger.info(f"5. Revision [{iteration + 1}/{max_iterations}]: improving the answer...")
            current_response = await self._revise(current_response, evaluation)
            self.thinking_history.append({"step": f"revision_{iteration}", "content": current_response})

        return {
            "query": query,
            "final_response": current_response,
            "thinking_history": self.thinking_history,
            "context_documents": context_documents,
            "iterations": len([h for h in self.thinking_history if "revision" in h["step"]])
        }

    async def _plan(self, query: str, context: str) -> str:
        """Devise a plan"""
        prompt = self.reflection_prompts["planning"].format(
            query=query,
            context=context
        )

        message = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    async def _generate_response(self, query: str, context: str, plan: str) -> str:
        """Generate the initial response"""
        prompt = f"""
You are an expert in finance and economics.

Answering plan:
{plan}

Question: {query}

Reference documents:
{context}

Answer the question based on the plan above. You must:
1. Use only information from the provided documents
2. Do not speculate about uncertain information
3. Cite your sources clearly
4. Answer in Korean
"""

        message = self.client.messages.create(
            model=self.model,
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    async def _monitor(self, query: str, response: str) -> str:
        """Monitor the answer"""
        prompt = self.reflection_prompts["monitoring"].format(
            query=query,
            response=response
        )

        message = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    async def _evaluate(self, query: str, response: str, sources: List[str]) -> str:
        """Evaluate the answer"""
        prompt = self.reflection_prompts["evaluation"].format(
            query=query,
            response=response,
            sources=", ".join(sources)
        )

        message = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    async def _revise(self, response: str, feedback: str) -> str:
        """Revise the answer"""
        prompt = self.reflection_prompts["revision"].format(
            response=response,
            feedback=feedback
        )

        message = self.client.messages.create(
            model=self.model,
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    def _format_context(self, documents: List[Dict]) -> str:
        """Format documents into context text"""
        formatted = []
        for i, doc in enumerate(documents, 1):
            text = doc.get('text', doc.get('document', ''))
            metadata = doc.get('metadata', {})
            source = metadata.get('source_filename', 'Unknown')

            formatted.append(f"[Document {i}] {source}\n{text}\n")

        return "\n".join(formatted)
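The evaluate-then-revise loop in `think_and_reflect` reduces to: score the current answer, stop early once the score clears a threshold, otherwise revise and try again. A stdlib-only sketch of that control flow, with stubbed `evaluate`/`revise` callables standing in for the Claude calls:

```python
import json


def refine(answer, evaluate, revise, max_iterations=2, threshold=8.0):
    """Iteratively improve `answer`; stop early once the score clears the threshold."""
    history = []
    for _ in range(max_iterations):
        evaluation = evaluate(answer)
        history.append(("evaluation", evaluation))
        try:
            score = float(json.loads(evaluation).get("overall_score", 0))
            if score >= threshold:
                break
        except (json.JSONDecodeError, TypeError, ValueError):
            pass  # unparseable evaluation: revise anyway, as the agent does
        answer = revise(answer, evaluation)
        history.append(("revision", answer))
    return answer, history


# Stubs: the draft scores 6; once revised it scores 9 and the loop exits early.
evaluate = lambda a: json.dumps({"overall_score": 9 if "revised" in a else 6})
revise = lambda a, fb: a + " (revised)"

final, history = refine("draft answer", evaluate, revise)
print(final, len(history))
```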
app/rag_pipeline.py
@@ -0,0 +1,163 @@
"""
RAG (Retrieval-Augmented Generation) pipeline

A RAG system combining vector search with the metacognitive agent.
"""

from typing import List, Dict, Optional
from loguru import logger

from services.vector_store import VectorStore
from services.embedder import Embedder
from app.metacognitive_agent import MetaCognitiveAgent
from utils.config import settings


class RAGPipeline:
    """RAG pipeline class"""

    def __init__(
        self,
        vector_store: VectorStore,
        embedder: Embedder,
        metacognitive_agent: MetaCognitiveAgent
    ):
        """
        Args:
            vector_store: Vector store
            embedder: Embedding generator
            metacognitive_agent: Metacognitive agent
        """
        self.vector_store = vector_store
        self.embedder = embedder
        self.agent = metacognitive_agent

    async def query(
        self,
        question: str,
        top_k: int = 5,
        enable_metacognition: bool = True,
        filter_metadata: Optional[Dict[str, str]] = None
    ) -> Dict:
        """
        Generate an answer to a question.

        Args:
            question: User question
            top_k: Number of documents to retrieve
            enable_metacognition: Whether to run the metacognitive process
            filter_metadata: Metadata filter

        Returns:
            Answer and related information
        """
        logger.info(f"RAG Query: {question}")

        # 1. Embed the question
        logger.info("1. Embedding the question...")
        query_embedding = self.embedder.embed_text(question)

        # 2. Retrieve relevant documents
        logger.info(f"2. Retrieving relevant documents (top_k={top_k})...")
        search_results = self.vector_store.search(
            query_embedding=query_embedding,
            top_k=top_k,
            filter_metadata=filter_metadata
        )

        # Format the search results
        context_documents = []
        for doc, metadata, distance in zip(
            search_results['documents'],
            search_results['metadatas'],
            search_results['distances']
        ):
            context_documents.append({
                'text': doc,
                'metadata': metadata,
                'similarity': 1 - distance,  # convert distance to similarity
                'source_filename': metadata.get('source_filename', 'unknown')
            })

        logger.info(f"Retrieval complete: {len(context_documents)} documents found")

        # 3. Generate the answer with the metacognitive agent
        if enable_metacognition:
            logger.info("3. Generating the answer with the metacognitive agent...")
            result = await self.agent.think_and_reflect(
                query=question,
                context_documents=context_documents
            )

            return {
                "question": question,
                "answer": result["final_response"],
                "sources": context_documents,
                "metacognition": {
                    "thinking_history": result["thinking_history"],
                    "iterations": result["iterations"]
                },
                "search_stats": {
                    "documents_found": len(context_documents),
                    "top_similarity": context_documents[0]['similarity'] if context_documents else 0
                }
            }
        else:
            # Simple answer without metacognition
            logger.info("3. Generating a simple answer...")
            simple_response = await self._generate_simple_response(question, context_documents)

            return {
                "question": question,
                "answer": simple_response,
                "sources": context_documents,
                "search_stats": {
                    "documents_found": len(context_documents),
                    "top_similarity": context_documents[0]['similarity'] if context_documents else 0
                }
            }

    async def _generate_simple_response(self, question: str, context_documents: List[Dict]) -> str:
        """Generate a simple answer without the metacognitive process"""
        # Format the context
        context_text = "\n\n".join([
            f"[Source: {doc['source_filename']}]\n{doc['text']}"
            for doc in context_documents
        ])

        prompt = f"""
You are an expert in finance and economics.

Question: {question}

Reference documents:
{context_text}

Answer the question using the documents above. You must:
1. Use only information from the provided documents
2. Do not speculate about uncertain information
3. Cite your sources clearly
4. Answer in Korean
"""

        message = self.agent.client.messages.create(
            model=self.agent.model,
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    def get_statistics(self) -> Dict:
        """RAG system statistics"""
        vector_stats = self.vector_store.get_collection_stats()

        return {
            "vector_store": vector_stats,
            "embedding_model": {
                "type": self.embedder.model_type,
                "name": self.embedder.model_name,
                "dimension": self.embedder.get_embedding_dimension()
            },
            "agent_model": self.agent.model
        }
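The pipeline turns Chroma distances into a similarity with `1 - distance`. Note that this only lands in [0, 1] for cosine distance, so the value is best read as a relative ranking score rather than a calibrated probability. A small sketch of the result-formatting step and the `top_similarity` statistic (the input values are made up):

```python
def to_context_documents(documents, metadatas, distances):
    """Mirror the pipeline's result formatting: similarity = 1 - distance."""
    return [
        {
            "text": doc,
            "similarity": 1 - dist,
            "source_filename": meta.get("source_filename", "unknown"),
        }
        for doc, meta, dist in zip(documents, metadatas, distances)
    ]


docs = to_context_documents(
    ["chunk a", "chunk b"],
    [{"source_filename": "paper1.pdf"}, {}],  # second chunk lacks a source
    [0.11, 0.42],
)

top_similarity = docs[0]["similarity"] if docs else 0
print(round(top_similarity, 2), docs[1]["source_filename"])
```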
requirements.txt
@@ -0,0 +1,36 @@
# FastAPI and Web Server
fastapi==0.109.0
uvicorn[standard]==0.27.0
pydantic==2.5.3
pydantic-settings==2.1.0
python-multipart==0.0.6

# Anthropic Claude
anthropic==0.18.1

# PDF Processing
PyPDF2==3.0.1
pdfplumber==0.10.3
pymupdf==1.23.8

# Vector Database
chromadb==0.4.22
sentence-transformers==2.3.1

# Embeddings (multiple options)
openai==1.10.0
cohere==4.47

# Text Processing
langchain==0.1.4
langchain-community==0.0.17
tiktoken==0.5.2

# Utilities
python-dotenv==1.0.0
tqdm==4.66.1
numpy==1.26.3
pandas==2.1.4

# Logging and Monitoring
loguru==0.7.2
scripts/__init__.py: file without changes
scripts/check_vector_db.py
@@ -0,0 +1,85 @@
"""
Vector DB status check script

Inspects the contents of the vector DB after indexing is complete.
"""

import sys
from pathlib import Path

# Add the project root to the Python path
project_root = Path(__file__).parent.parent
sys.path.insert(0, str(project_root))

from dotenv import load_dotenv
from services.vector_store import VectorStore
from utils.config import settings


def main():
    """Check the vector DB status"""
    load_dotenv()

    print("=" * 80)
    print("Vector DB status check")
    print("=" * 80)

    # Initialize the vector store
    vector_store = VectorStore(
        persist_directory=settings.chroma_persist_directory,
        collection_name=settings.collection_name
    )

    # Statistics
    stats = vector_store.get_collection_stats()
    print("\nBasic info:")
    print(f"  Collection name: {stats['collection_name']}")
    print(f"  Persist directory: {stats['persist_directory']}")
    print(f"  Total documents: {stats['total_documents']}")
    print(f"  Has data: {'yes' if stats['has_data'] else 'no'}")

    if not stats['has_data']:
        print("\nWarning: the vector DB is empty!")
        print("  Run python scripts/index_pdfs.py first.")
        return

    # Inspect sample data
    print("\nSample documents:")
    sample = vector_store.collection.peek(limit=3)

    for i, (doc_id, doc, metadata) in enumerate(zip(
        sample['ids'],
        sample['documents'],
        sample['metadatas']
    ), 1):
        print(f"\n[{i}] {doc_id}")
        print(f"  Source: {metadata.get('source_filename', 'unknown')}")
        print(f"  Title: {metadata.get('title', 'N/A')}")
        print(f"  Author: {metadata.get('author', 'N/A')}")
        print(f"  Content: {doc[:150]}...")

    # Simple search test
    print("\nSearch test:")
    test_query = "financial crisis"
    print(f"  Query: '{test_query}'")

    results = vector_store.search_by_text(test_query, top_k=3)

    print(f"  Results: {len(results['documents'])} documents found")
    for i, (doc, metadata, distance) in enumerate(zip(
        results['documents'],
        results['metadatas'],
        results['distances']
    ), 1):
        similarity = 1 - distance
        print(f"\n  [{i}] {metadata.get('source_filename', 'unknown')}")
        print(f"      Similarity: {similarity:.3f}")
        print(f"      Content: {doc[:100]}...")

    print("\n" + "=" * 80)
    print("Vector DB check complete!")
    print("=" * 80)


if __name__ == "__main__":
    main()
|
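The check script reports `similarity = 1 - distance` for each hit. A minimal sketch of that conversion, assuming the collection's distance metric is cosine distance (`distance = 1 - cosine_similarity`); with ChromaDB's default L2 metric the same formula would not yield a bounded score, so the metric is an assumption here:

```python
def distance_to_similarity(distance: float) -> float:
    """Convert a cosine distance into a similarity score.

    Assumes distance = 1 - cosine_similarity, so a distance of 0.0
    means identical direction and 1.0 means orthogonal vectors.
    """
    return 1.0 - distance


# Identical vectors score 1.0; a distance of 0.25 maps to 0.75.
print(distance_to_similarity(0.0))
print(distance_to_similarity(0.25))
```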
@@ -0,0 +1,153 @@
"""
PDF indexing script to run locally.

Running this script on a local machine will:
1. Read every PDF file from the configured path
2. Extract and chunk the text
3. Generate embeddings
4. Store them in ChromaDB

Usage:
    python scripts/index_pdfs.py
"""

import sys
from pathlib import Path

# Add the project root to the Python path
project_root = Path(__file__).parent.parent
sys.path.insert(0, str(project_root))

from dotenv import load_dotenv
from loguru import logger
import time

from services.pdf_processor import PDFProcessor
from services.chunker import TextChunker
from services.embedder import Embedder
from services.vector_store import VectorStore
from utils.config import settings


def main():
    """Main indexing process"""

    # Load environment variables
    load_dotenv()

    logger.info("=" * 80)
    logger.info("Starting PDF indexing")
    logger.info("=" * 80)

    start_time = time.time()

    # 1. Process PDFs
    logger.info(f"\n[1/4] Processing PDF files...")
    logger.info(f"PDF path: {settings.pdf_source_path}")

    pdf_processor = PDFProcessor(settings.pdf_source_path)
    documents = pdf_processor.process_all_pdfs()

    if not documents:
        logger.error("No PDF files to process. Check the configured path.")
        return

    # Print statistics
    stats = pdf_processor.get_statistics()
    logger.info(f"\nProcessing complete:")
    logger.info(f"  - Total documents: {stats['total_documents']}")
    logger.info(f"  - Total pages: {stats['total_pages']}")
    logger.info(f"  - Average pages/document: {stats['avg_pages_per_doc']:.1f}")
    logger.info(f"  - Total characters: {stats['total_characters']:,}")

    # 2. Chunk the text
    logger.info(f"\n[2/4] Chunking text...")
    logger.info(f"Chunk size: {settings.chunk_size}, overlap: {settings.chunk_overlap}")

    chunker = TextChunker(
        chunk_size=settings.chunk_size,
        chunk_overlap=settings.chunk_overlap
    )
    chunks = chunker.chunk_all_documents(documents)

    chunk_stats = chunker.get_chunk_statistics(chunks)
    logger.info(f"\nChunking complete:")
    logger.info(f"  - Total chunks: {chunk_stats['total_chunks']}")
    logger.info(f"  - Average chunk length: {chunk_stats['avg_chunk_length']:.0f} chars")
    logger.info(f"  - Average chunks/document: {chunk_stats['total_chunks'] / len(documents):.1f}")

    # 3. Generate embeddings
    logger.info(f"\n[3/4] Generating embeddings...")
    logger.info(f"Embedding model: {settings.embedding_model} ({settings.embedding_model_name})")

    embedder = Embedder(
        model_type=settings.embedding_model,
        model_name=settings.embedding_model_name,
        openai_api_key=settings.openai_api_key,
        cohere_api_key=settings.cohere_api_key
    )

    texts = [chunk['text'] for chunk in chunks]
    embeddings = embedder.embed_batch(texts, batch_size=32)

    logger.info(f"\nEmbedding complete:")
    logger.info(f"  - Embeddings: {len(embeddings)}")
    logger.info(f"  - Embedding dimension: {len(embeddings[0])}")

    # 4. Store in the vector DB
    logger.info(f"\n[4/4] Saving to the vector DB...")
    logger.info(f"Persist directory: {settings.chroma_persist_directory}")

    vector_store = VectorStore(
        persist_directory=settings.chroma_persist_directory,
        collection_name=settings.collection_name
    )

    # If data already exists, ask the user how to proceed
    current_count = vector_store.collection.count()
    if current_count > 0:
        logger.warning(f"\n{current_count} existing documents found.")
        response = input("Delete the existing data and re-index? (y/N): ")
        if response.lower() == 'y':
            vector_store.reset_collection()
        else:
            logger.info("Appending to the existing data.")

    vector_store.add_documents(chunks, embeddings)

    # Final statistics
    final_stats = vector_store.get_collection_stats()
    logger.info(f"\nSave complete:")
    logger.info(f"  - Collection: {final_stats['collection_name']}")
    logger.info(f"  - Total documents: {final_stats['total_documents']}")
    logger.info(f"  - Persist directory: {final_stats['persist_directory']}")

    # Total elapsed time
    elapsed_time = time.time() - start_time
    logger.info(f"\nTotal time: {elapsed_time:.1f}s ({elapsed_time/60:.1f}min)")

    logger.info("\n" + "=" * 80)
    logger.info("Indexing complete!")
    logger.info("You can now start the FastAPI server to use the RAG system.")
    logger.info("=" * 80)

    # Quick search test
    logger.info("\nWould you like to run a search test?")
    test_query = input("Enter a search term (press Enter to skip): ").strip()

    if test_query:
        logger.info(f"\nSearching for '{test_query}'...")
        results = vector_store.search_by_text(test_query, top_k=3)

        logger.info(f"\nTop {len(results['documents'])} results:")
        for i, (doc, metadata, distance) in enumerate(zip(
            results['documents'],
            results['metadatas'],
            results['distances']
        ), 1):
            logger.info(f"\n[{i}] {metadata['source_filename']} (similarity: {1-distance:.3f})")
            logger.info(f"Content: {doc[:200]}...")


if __name__ == "__main__":
    main()
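The indexing script embeds in batches of 32 and (in the vector store) writes in batches of 100, both via the same slicing idiom. A minimal sketch of that batching loop, using only the standard library:

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield successive slices of at most batch_size items,
    mirroring the range(0, len(items), batch_size) loops used by
    embed_batch() and add_documents()."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


batches: List[Sequence] = list(iter_batches(["a", "b", "c", "d", "e"], 2))
# Three batches; the last one holds the single leftover item.
print(batches)
```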
@@ -0,0 +1,116 @@
"""
RAG system test script.

Use while the API server is running.
"""

import requests
import json
from typing import Dict


def test_query(
    question: str,
    top_k: int = 5,
    enable_metacognition: bool = True,
    api_url: str = "http://localhost:8000"
) -> Dict:
    """
    Send a test question.

    Args:
        question: The question to ask
        top_k: Number of documents to retrieve
        enable_metacognition: Enable the metacognitive agent
        api_url: API URL

    Returns:
        Response data
    """
    print("=" * 80)
    print(f"Question: {question}")
    print("=" * 80)

    # Request
    response = requests.post(
        f"{api_url}/query",
        json={
            "question": question,
            "top_k": top_k,
            "enable_metacognition": enable_metacognition
        }
    )

    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        return {}

    result = response.json()

    # Print the result
    print("\nAnswer:")
    print("-" * 80)
    print(result["answer"])
    print("-" * 80)

    print(f"\nSources: {len(result['sources'])}")
    for i, source in enumerate(result['sources'][:3], 1):
        print(f"\n[{i}] {source['source_filename']}")
        print(f"    Similarity: {source['similarity']:.3f}")
        print(f"    Content: {source['text'][:100]}...")

    if result.get('metacognition'):
        print(f"\nMetacognition info:")
        print(f"  Iterations: {result['metacognition']['iterations']}")
        print(f"  Thinking steps: {len(result['metacognition']['thinking_history'])}")

    print("\n" + "=" * 80)
    return result


def test_health(api_url: str = "http://localhost:8000"):
    """Health check"""
    print("Checking server health...")
    response = requests.get(f"{api_url}/health")

    if response.status_code == 200:
        data = response.json()
        print("Server OK")
        print(f"  Vector Store: {data['vector_store']['total_documents']} documents")
        print(f"  Embedding: {data['embedding_model']['type']} ({data['embedding_model']['dimension']} dims)")
    else:
        print(f"Server error: {response.status_code}")


if __name__ == "__main__":
    # Health check
    test_health()

    print("\n")

    # Sample questions
    questions = [
        "What are the main causes of financial crises?",
        "What are the effects of portfolio diversification?",
        "How does central bank interest rate policy affect markets?",
    ]

    for question in questions:
        try:
            test_query(question, top_k=5, enable_metacognition=True)
            print("\n\n")
        except Exception as e:
            print(f"Error: {str(e)}\n\n")

    # User-entered questions
    print("\nEnter a custom question (press Enter to quit):")
    while True:
        question = input("\nQuestion: ").strip()
        if not question:
            break

        try:
            test_query(question, top_k=5, enable_metacognition=True)
        except Exception as e:
            print(f"Error: {str(e)}")
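The POST /query request body carries three fields. A standalone sketch of that payload built with the standard library only, so its shape can be checked without a running server (the field names are taken from the script above; the question string is illustrative):

```python
import json


def build_query_payload(question: str, top_k: int = 5,
                        enable_metacognition: bool = True) -> str:
    """Serialize the JSON body that test_query() sends to POST /query."""
    return json.dumps({
        "question": question,
        "top_k": top_k,
        "enable_metacognition": enable_metacognition,
    })


payload = build_query_payload("What drives financial crises?", top_k=3)
print(payload)
```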
File without changes
@@ -0,0 +1,93 @@
"""
Split text into appropriately sized chunks (chunking).
"""
from typing import List, Dict, Any
from langchain.text_splitter import RecursiveCharacterTextSplitter
from loguru import logger


class TextChunker:
    """Splits text into meaningful chunks."""

    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
        """
        Args:
            chunk_size: Maximum number of characters per chunk
            chunk_overlap: Number of overlapping characters between chunks (preserves context)
        """
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

        # Use LangChain's RecursiveCharacterTextSplitter:
        # tries to split on paragraphs, then sentences, then words
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            separators=["\n\n", "\n", ". ", " ", ""]
        )

    def chunk_document(self, doc_data: Dict[str, Any]) -> List[Dict[str, Any]]:
        """
        Split a single document into chunks.

        Args:
            doc_data: Document data extracted by the PDF processor

        Returns:
            List of chunks with metadata
        """
        text = doc_data['text']
        chunks = self.text_splitter.split_text(text)

        chunked_docs = []
        for i, chunk in enumerate(chunks):
            chunked_docs.append({
                'text': chunk,
                'chunk_id': i,
                'source_filename': doc_data['filename'],
                'source_filepath': doc_data['filepath'],
                'total_chunks': len(chunks),
                'metadata': doc_data['metadata'],
                'page_count': doc_data['page_count']
            })

        return chunked_docs

    def chunk_all_documents(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Split every document into chunks.

        Args:
            documents: List of documents extracted by the PDF processor

        Returns:
            List of all chunks from all documents
        """
        all_chunks = []

        logger.info(f"Chunking {len(documents)} documents...")

        for doc in documents:
            chunks = self.chunk_document(doc)
            all_chunks.extend(chunks)

        logger.info(f"Created {len(all_chunks)} chunks from {len(documents)} documents")
        logger.info(f"Average {len(all_chunks) / len(documents):.1f} chunks per document")

        return all_chunks

    def get_chunk_statistics(self, chunks: List[Dict[str, Any]]) -> Dict[str, Any]:
        """Chunk statistics"""
        if not chunks:
            return {}

        chunk_lengths = [len(chunk['text']) for chunk in chunks]

        return {
            'total_chunks': len(chunks),
            'avg_chunk_length': sum(chunk_lengths) / len(chunks),
            'min_chunk_length': min(chunk_lengths),
            'max_chunk_length': max(chunk_lengths),
            'total_characters': sum(chunk_lengths),
        }
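To see what `chunk_size` and `chunk_overlap` mean concretely, here is a naive fixed-window chunker in plain Python. It is only an illustration of the overlap semantics; the real TextChunker additionally prefers paragraph, line, and sentence boundaries via its separator list:

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Naive fixed-window chunking: each window is chunk_size characters
    and starts (chunk_size - chunk_overlap) characters after the previous
    one, so consecutive chunks share chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
# Windows start at offsets 0, 2, 4, 6, 8.
print(chunks)
```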
@@ -0,0 +1,147 @@
"""
Convert text into vector embeddings.
"""
from typing import List, Optional
import numpy as np
from loguru import logger
from sentence_transformers import SentenceTransformer
from tqdm import tqdm


class Embedder:
    """Converts text into embedding vectors."""

    def __init__(
        self,
        model_type: str = "sentence-transformers",
        model_name: str = "all-MiniLM-L6-v2",
        openai_api_key: Optional[str] = None,
        cohere_api_key: Optional[str] = None
    ):
        """
        Args:
            model_type: Embedding backend to use (sentence-transformers, openai, cohere)
            model_name: Model name
            openai_api_key: OpenAI API key (when using openai)
            cohere_api_key: Cohere API key (when using cohere)
        """
        self.model_type = model_type
        self.model_name = model_name

        if model_type == "sentence-transformers":
            logger.info(f"Loading Sentence Transformer model: {model_name}")
            self.model = SentenceTransformer(model_name)
            logger.info(f"Model loaded. Embedding dimension: {self.model.get_sentence_embedding_dimension()}")

        elif model_type == "openai":
            if not openai_api_key:
                raise ValueError("OpenAI API key required for openai embeddings")
            import openai
            openai.api_key = openai_api_key
            self.model = None
            logger.info(f"Using OpenAI embeddings: {model_name}")

        elif model_type == "cohere":
            if not cohere_api_key:
                raise ValueError("Cohere API key required for cohere embeddings")
            import cohere
            self.model = cohere.Client(cohere_api_key)
            logger.info(f"Using Cohere embeddings: {model_name}")

        else:
            raise ValueError(f"Unknown model type: {model_type}")

    def embed_text(self, text: str) -> List[float]:
        """
        Embed a single text.

        Args:
            text: Text to embed

        Returns:
            Embedding vector as list of floats
        """
        if self.model_type == "sentence-transformers":
            embedding = self.model.encode(text, convert_to_numpy=True)
            return embedding.tolist()

        elif self.model_type == "openai":
            import openai
            response = openai.embeddings.create(
                input=text,
                model=self.model_name
            )
            return response.data[0].embedding

        elif self.model_type == "cohere":
            response = self.model.embed(
                texts=[text],
                model=self.model_name
            )
            return response.embeddings[0]

    def embed_batch(self, texts: List[str], batch_size: int = 32) -> List[List[float]]:
        """
        Embed multiple texts in batches.

        Args:
            texts: Texts to embed
            batch_size: Batch size

        Returns:
            List of embedding vectors
        """
        logger.info(f"Embedding {len(texts)} texts with batch size {batch_size}")

        if self.model_type == "sentence-transformers":
            # Sentence Transformers handles batching natively
            embeddings = []
            for i in tqdm(range(0, len(texts), batch_size), desc="Embedding batches"):
                batch = texts[i:i + batch_size]
                batch_embeddings = self.model.encode(
                    batch,
                    convert_to_numpy=True,
                    show_progress_bar=False
                )
                embeddings.extend(batch_embeddings.tolist())
            return embeddings

        elif self.model_type == "openai":
            # Mind the OpenAI API rate limits
            import openai
            embeddings = []
            for i in tqdm(range(0, len(texts), batch_size), desc="Embedding batches"):
                batch = texts[i:i + batch_size]
                response = openai.embeddings.create(
                    input=batch,
                    model=self.model_name
                )
                batch_embeddings = [item.embedding for item in response.data]
                embeddings.extend(batch_embeddings)
            return embeddings

        elif self.model_type == "cohere":
            # Cohere batch processing
            embeddings = []
            for i in tqdm(range(0, len(texts), batch_size), desc="Embedding batches"):
                batch = texts[i:i + batch_size]
                response = self.model.embed(
                    texts=batch,
                    model=self.model_name
                )
                embeddings.extend(response.embeddings)
            return embeddings

    def get_embedding_dimension(self) -> int:
        """Return the embedding dimension."""
        if self.model_type == "sentence-transformers":
            return self.model.get_sentence_embedding_dimension()
        elif self.model_type == "openai":
            if "ada-002" in self.model_name:
                return 1536
            return 1536  # default
        elif self.model_type == "cohere":
            if "embed-multilingual" in self.model_name:
                return 768
            return 1024  # default
        return 768
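The vectors produced by the Embedder are compared by angular closeness. A self-contained sketch of cosine similarity, the quantity the vector store's distance scores are derived from, using only the standard library:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors:
    dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Parallel vectors score 1.0, orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```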
@@ -0,0 +1,111 @@
"""
PDF file processing and text extraction.
"""
import os
from pathlib import Path
from typing import List, Dict, Optional, Any
import PyPDF2
import pdfplumber
from loguru import logger
from tqdm import tqdm


class PDFProcessor:
    """Extracts text and metadata from PDF files."""

    def __init__(self, pdf_directory: str):
        """
        Args:
            pdf_directory: Path to the directory containing the PDF files
        """
        self.pdf_directory = Path(pdf_directory)
        self.processed_docs = []

    def get_pdf_files(self) -> List[Path]:
        """Find all PDF files in the directory."""
        if not self.pdf_directory.exists():
            raise FileNotFoundError(f"Directory not found: {self.pdf_directory}")

        pdf_files = list(self.pdf_directory.glob("**/*.pdf"))
        logger.info(f"Found {len(pdf_files)} PDF files in {self.pdf_directory}")
        return pdf_files

    def extract_text_from_pdf(self, pdf_path: Path) -> Optional[Dict[str, Any]]:
        """
        Extract text from a single PDF file.

        Args:
            pdf_path: Path to the PDF file

        Returns:
            Dict with 'text', 'metadata', 'filename', 'page_count'
        """
        try:
            # Extract text with pdfplumber (more accurate)
            with pdfplumber.open(pdf_path) as pdf:
                text = ""
                for page in pdf.pages:
                    page_text = page.extract_text()
                    if page_text:
                        text += page_text + "\n\n"

            # Extract metadata with PyPDF2
            with open(pdf_path, 'rb') as f:
                pdf_reader = PyPDF2.PdfReader(f)
                metadata = pdf_reader.metadata if pdf_reader.metadata else {}
                page_count = len(pdf_reader.pages)

            return {
                'text': text.strip(),
                'metadata': {
                    'title': metadata.get('/Title', ''),
                    'author': metadata.get('/Author', ''),
                    'subject': metadata.get('/Subject', ''),
                    'creator': metadata.get('/Creator', ''),
                },
                'filename': pdf_path.name,
                'filepath': str(pdf_path),
                'page_count': page_count
            }

        except Exception as e:
            logger.error(f"Error processing {pdf_path.name}: {str(e)}")
            return None

    def process_all_pdfs(self) -> List[Dict[str, Any]]:
        """
        Process every PDF file.

        Returns:
            List of dictionaries containing extracted text and metadata
        """
        pdf_files = self.get_pdf_files()
        self.processed_docs = []

        logger.info(f"Processing {len(pdf_files)} PDF files...")

        for pdf_path in tqdm(pdf_files, desc="Processing PDFs"):
            doc_data = self.extract_text_from_pdf(pdf_path)
            if doc_data and doc_data['text']:  # keep only documents with text
                self.processed_docs.append(doc_data)
            else:
                logger.warning(f"No text extracted from {pdf_path.name}")

        logger.info(f"Successfully processed {len(self.processed_docs)} PDFs")
        return self.processed_docs

    def get_statistics(self) -> Dict[str, Any]:
        """Statistics for the processed documents."""
        if not self.processed_docs:
            return {}

        total_pages = sum(doc['page_count'] for doc in self.processed_docs)
        total_chars = sum(len(doc['text']) for doc in self.processed_docs)

        return {
            'total_documents': len(self.processed_docs),
            'total_pages': total_pages,
            'total_characters': total_chars,
            'avg_pages_per_doc': total_pages / len(self.processed_docs),
            'avg_chars_per_doc': total_chars / len(self.processed_docs),
        }
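`get_pdf_files()` uses `Path.glob("**/*.pdf")`, which matches PDFs in the directory and in all of its subdirectories. A quick self-contained demonstration of that pattern against a throwaway directory tree:

```python
import tempfile
from pathlib import Path

# Build a small tree with a PDF at the root, a PDF in a
# subdirectory, and a non-PDF file that should be ignored.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.pdf").touch()
    (root / "sub").mkdir()
    (root / "sub" / "b.pdf").touch()
    (root / "notes.txt").touch()

    # "**" matches zero or more directory levels, so both PDFs are found.
    pdf_files = sorted(p.name for p in root.glob("**/*.pdf"))
    print(pdf_files)
```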
@@ -0,0 +1,178 @@
"""
Vector database integration (ChromaDB).
"""
from typing import List, Dict, Optional, Any
import chromadb
from chromadb.config import Settings
from loguru import logger
from pathlib import Path


class VectorStore:
    """Vector store backed by ChromaDB."""

    def __init__(
        self,
        persist_directory: str = "./data/chroma_db",
        collection_name: str = "financial_papers"
    ):
        """
        Args:
            persist_directory: Path where ChromaDB persists its data
            collection_name: Collection name
        """
        self.persist_directory = Path(persist_directory)
        self.collection_name = collection_name

        # Create the directory
        self.persist_directory.mkdir(parents=True, exist_ok=True)

        # Initialize the ChromaDB client
        logger.info(f"Initializing ChromaDB at {persist_directory}")
        self.client = chromadb.PersistentClient(
            path=str(self.persist_directory)
        )

        # Create or fetch the collection
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"description": "Financial and Economics research papers"}
        )

        logger.info(f"Collection '{collection_name}' ready. Current count: {self.collection.count()}")

    def add_documents(
        self,
        chunks: List[Dict[str, Any]],
        embeddings: List[List[float]]
    ) -> None:
        """
        Add document chunks to the vector DB.

        Args:
            chunks: List of chunk data (including text and metadata)
            embeddings: Embedding vector for each chunk
        """
        if len(chunks) != len(embeddings):
            raise ValueError("Number of chunks and embeddings must match")

        logger.info(f"Adding {len(chunks)} documents to vector store...")

        # Convert to the format ChromaDB expects
        ids = [f"{chunk['source_filename']}_{chunk['chunk_id']}" for chunk in chunks]
        documents = [chunk['text'] for chunk in chunks]
        metadatas = [
            {
                'source_filename': chunk['source_filename'],
                'source_filepath': chunk['source_filepath'],
                'chunk_id': str(chunk['chunk_id']),
                'total_chunks': str(chunk['total_chunks']),
                'title': chunk['metadata'].get('title', ''),
                'author': chunk['metadata'].get('author', ''),
                'page_count': str(chunk['page_count'])
            }
            for chunk in chunks
        ]

        # Add in batches (ChromaDB can only ingest so many records at once)
        batch_size = 100
        for i in range(0, len(chunks), batch_size):
            batch_end = min(i + batch_size, len(chunks))
            self.collection.add(
                ids=ids[i:batch_end],
                embeddings=embeddings[i:batch_end],
                documents=documents[i:batch_end],
                metadatas=metadatas[i:batch_end]
            )
            logger.info(f"Added batch {i // batch_size + 1}/{(len(chunks) + batch_size - 1) // batch_size}")

        logger.info(f"Successfully added {len(chunks)} documents. Total in collection: {self.collection.count()}")

    def search(
        self,
        query_embedding: List[float],
        top_k: int = 5,
        filter_metadata: Optional[Dict[str, str]] = None
    ) -> Dict[str, Any]:
        """
        Run a vector search.

        Args:
            query_embedding: Embedding vector of the query
            top_k: Number of results to return
            filter_metadata: Metadata filter (optional)

        Returns:
            Search results (documents, metadatas, distances)
        """
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k,
            where=filter_metadata
        )

        return {
            'documents': results['documents'][0] if results['documents'] else [],
            'metadatas': results['metadatas'][0] if results['metadatas'] else [],
            'distances': results['distances'][0] if results['distances'] else [],
            'ids': results['ids'][0] if results['ids'] else []
        }

    def search_by_text(
        self,
        query_text: str,
        top_k: int = 5,
        filter_metadata: Optional[Dict[str, str]] = None
    ) -> Dict[str, Any]:
        """
        Search by raw text (ChromaDB embeds the query automatically).

        Args:
            query_text: Search query text
            top_k: Number of results to return
            filter_metadata: Metadata filter

        Returns:
            Search results
        """
        results = self.collection.query(
            query_texts=[query_text],
            n_results=top_k,
            where=filter_metadata
        )

        return {
            'documents': results['documents'][0] if results['documents'] else [],
            'metadatas': results['metadatas'][0] if results['metadatas'] else [],
            'distances': results['distances'][0] if results['distances'] else [],
            'ids': results['ids'][0] if results['ids'] else []
        }

    def get_collection_stats(self) -> Dict[str, Any]:
        """Collection statistics"""
        count = self.collection.count()

        # Fetch a sample record
        sample = self.collection.peek(limit=1)

        return {
            'collection_name': self.collection_name,
            'total_documents': count,
            'persist_directory': str(self.persist_directory),
            'has_data': count > 0
        }

    def delete_collection(self) -> None:
        """Delete the collection (warning: removes all data)"""
        logger.warning(f"Deleting collection '{self.collection_name}'")
        self.client.delete_collection(name=self.collection_name)
|
| 169 |
+
logger.info("Collection deleted")
|
| 170 |
+
|
| 171 |
+
def reset_collection(self) -> None:
|
| 172 |
+
"""์ปฌ๋ ์
์ด๊ธฐํ (์ญ์ ํ ์ฌ์์ฑ)"""
|
| 173 |
+
self.delete_collection()
|
| 174 |
+
self.collection = self.client.get_or_create_collection(
|
| 175 |
+
name=self.collection_name,
|
| 176 |
+
metadata={"description": "Financial and Economics research papers"}
|
| 177 |
+
)
|
| 178 |
+
logger.info("Collection reset")
|
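The batching loop above slices `ids`, `embeddings`, `documents`, and `metadatas` identically before each `collection.add` call. A minimal stdlib sketch of that slicing arithmetic (`batch_slices` is a hypothetical helper for illustration, not part of the module):

```python
def batch_slices(n_items: int, batch_size: int = 100):
    """Yield (start, end) index pairs covering n_items in batches,
    mirroring the range/min logic used in add_documents."""
    for i in range(0, n_items, batch_size):
        yield i, min(i + batch_size, n_items)

# 250 items in batches of 100 -> 3 calls, last one partial
print(list(batch_slices(250, 100)))  # [(0, 100), (100, 200), (200, 250)]
```

The total-batch count logged per iteration, `(len(chunks) + batch_size - 1) // batch_size`, is the usual ceiling-division idiom and agrees with the number of slices yielded here.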

File without changes

@@ -0,0 +1,39 @@
"""
Configuration management using pydantic-settings
"""
from pydantic_settings import BaseSettings
from typing import Optional


class Settings(BaseSettings):
    """Application settings loaded from environment variables"""

    # API Keys
    anthropic_api_key: str
    openai_api_key: Optional[str] = None
    cohere_api_key: Optional[str] = None

    # Vector Database Settings
    chroma_persist_directory: str = "./data/chroma_db"
    collection_name: str = "financial_papers"

    # PDF Processing Settings
    pdf_source_path: str
    chunk_size: int = 1000
    chunk_overlap: int = 200

    # Embedding Model Settings
    embedding_model: str = "sentence-transformers"  # options: openai, sentence-transformers, cohere
    embedding_model_name: str = "all-MiniLM-L6-v2"

    # API Settings
    api_host: str = "0.0.0.0"
    api_port: int = 8000

    class Config:
        env_file = ".env"
        case_sensitive = False


# Global settings instance
settings = Settings()
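Because `env_file = ".env"` and `case_sensitive = False`, each field can be supplied as an upper-case variable in a `.env` file; fields without defaults (`anthropic_api_key`, `pdf_source_path`) must be set or `Settings()` raises a validation error. A hypothetical `.env` sketch (all values are placeholders, not real keys or paths):

```shell
# .env — placeholder values; see .env.example in the repo
ANTHROPIC_API_KEY=sk-ant-xxxxxxxx
PDF_SOURCE_PATH=/data/papers
CHROMA_PERSIST_DIRECTORY=./data/chroma_db
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
EMBEDDING_MODEL=sentence-transformers
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2
```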