Claude committed on
Commit f6b05db · unverified · 0 Parent(s):

Add complete Financial RAG system with Metacognitive Agent


Implemented a comprehensive RAG (Retrieval-Augmented Generation) system for financial/economics research papers with the following features:

Core Components:
- PDF processing and text extraction (PyPDF2, pdfplumber)
- Text chunking with overlap for context preservation
- Vector embeddings (Sentence Transformers, OpenAI, Cohere support)
- ChromaDB vector store for efficient similarity search
- Metacognitive agent with 4-stage process (Planning → Monitoring → Evaluation → Revision)
- FastAPI REST API with comprehensive endpoints

Key Features:
- Supports 2,639+ financial/economics journal articles
- Hallucination detection and prevention
- Iterative answer refinement based on quality evaluation
- Flexible embedding model selection (free and paid options)
- Batch processing for efficient indexing
- Detailed logging and statistics

API Endpoints:
- POST /query: Question answering with RAG
- GET /health: System health check
- GET /stats: Vector store statistics
- Interactive API docs at /docs

Scripts:
- index_pdfs.py: Index PDF files into vector database
- check_vector_db.py: Verify vector database contents
- test_query.py: Test queries against the API

Documentation:
- Comprehensive README with quick start guide
- Detailed USAGE_GUIDE in Korean
- Environment configuration examples
- Troubleshooting section

The system enables researchers to query a large corpus of financial literature with high-quality, source-backed answers while minimizing hallucinations through metacognitive reflection.

.env.example ADDED
@@ -0,0 +1,26 @@
# Anthropic API Key
ANTHROPIC_API_KEY=your_anthropic_api_key_here

# OpenAI API Key (for embeddings - optional)
OPENAI_API_KEY=your_openai_api_key_here

# Cohere API Key (alternative for embeddings - optional)
COHERE_API_KEY=your_cohere_api_key_here

# Vector Database Settings
CHROMA_PERSIST_DIRECTORY=./data/chroma_db
COLLECTION_NAME=financial_papers

# PDF Processing Settings
PDF_SOURCE_PATH=/Users/seongjincho/Desktop/HYU-06-공학박사 도전기/25.8.15(펀더멘털 DB 레퍼런스)/data/
CHUNK_SIZE=1000
CHUNK_OVERLAP=200

# Embedding Model
# Options: "openai", "sentence-transformers", "cohere"
EMBEDDING_MODEL=sentence-transformers
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2

# API Settings
API_HOST=0.0.0.0
API_PORT=8000
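These variables are consumed through `utils/config.py`, which the application imports everywhere as `settings` but which is not part of this commit view. A minimal sketch of what that module might look like, assuming pydantic-settings (pinned in requirements.txt) and the `settings.*` attribute names used throughout `app/` and `scripts/`:

```python
# Hypothetical sketch of utils/config.py (not included in this commit).
# Field names mirror the settings.* attributes used by the app and scripts.
from typing import Optional
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    anthropic_api_key: str
    openai_api_key: Optional[str] = None
    cohere_api_key: Optional[str] = None
    chroma_persist_directory: str = "./data/chroma_db"
    collection_name: str = "financial_papers"
    pdf_source_path: str = ""
    chunk_size: int = 1000
    chunk_overlap: int = 200
    embedding_model: str = "sentence-transformers"
    embedding_model_name: str = "all-MiniLM-L6-v2"
    api_host: str = "0.0.0.0"
    api_port: int = 8000

# Reads .env at import time; environment variables match field names case-insensitively.
settings = Settings()
```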
.gitignore ADDED
@@ -0,0 +1,51 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual Environment
venv/
env/
ENV/

# Environment Variables
.env

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# Data
data/chroma_db/
data/*.pdf
*.pdf

# Logs
logs/
*.log

# macOS
.DS_Store

# Jupyter
.ipynb_checkpoints/
README.md ADDED
@@ -0,0 +1,276 @@
# 📚 Financial RAG with Metacognitive Agent

A RAG (Retrieval-Augmented Generation) system built on finance/economics papers, with a metacognitive agent

## 🎯 Key Features

- ✅ **2,639 finance/economics journal papers** indexed as vectors
- 🧠 **Metacognitive agent** (Planning → Monitoring → Evaluation → Revision)
- 🔍 **High-performance vector search** (ChromaDB)
- 🚫 **Hallucination detection and prevention**
- 🚀 **FastAPI-based REST API**
- 📊 **Selectable embedding models** (Sentence Transformers, OpenAI, Cohere)

## 📁 Project Structure

```
Hallucination_and_Deception_for_financial_RAG/
├── app/
│   ├── main.py                  # FastAPI main app
│   ├── metacognitive_agent.py   # Metacognitive agent
│   ├── rag_pipeline.py          # RAG pipeline
│   └── api/
│       ├── routes.py            # API endpoints
│       └── models.py            # Pydantic models
├── services/
│   ├── pdf_processor.py         # PDF processing
│   ├── chunker.py               # Text chunking
│   ├── embedder.py              # Embedding generation
│   └── vector_store.py          # Vector DB
├── utils/
│   └── config.py                # Configuration management
├── scripts/
│   └── index_pdfs.py            # PDF indexing script
├── data/
│   └── chroma_db/               # Vector DB storage
├── requirements.txt
├── .env.example
└── README.md
```

## 🚀 Quick Start

### 1️⃣ Environment Setup

```bash
# Clone the repository
git clone https://github.com/yourusername/Hallucination_and_Deception_for_financial_RAG.git
cd Hallucination_and_Deception_for_financial_RAG

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 2️⃣ Environment Variables

```bash
# Create the .env file
cp .env.example .env

# Edit the .env file (required!)
nano .env
```

Example `.env` file:
```env
# Anthropic API Key (required)
ANTHROPIC_API_KEY=your_api_key_here

# PDF path (local MacBook path)
PDF_SOURCE_PATH=/Users/seongjincho/Desktop/HYU-06-공학박사 도전기/25.8.15(펀더멘털 DB 레퍼런스)/data/

# Embedding model (free: sentence-transformers)
EMBEDDING_MODEL=sentence-transformers
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2
```

### 3️⃣ Index the PDFs (run on the local MacBook)

```bash
# Index the PDF files into the vector DB
python scripts/index_pdfs.py
```

This step performs the following:
1. Read the 2,639 PDF files
2. Extract and chunk the text
3. Generate embeddings (about 30-60 minutes with the free model)
4. Store them in ChromaDB

**Notes:**
- The first run downloads the Sentence Transformer model (~90MB)
- After indexing, a `data/chroma_db/` folder is created
- Only this folder needs to be uploaded to GitHub (the source PDFs are excluded)

### 4️⃣ Run the API Server

```bash
# Start the FastAPI server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

Once the server starts:
- API Docs: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc

## 📖 API Usage

### Health Check

```bash
curl http://localhost:8000/health
```

### Ask a Question (metacognition enabled)

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the main causes of financial crises?",
    "top_k": 5,
    "enable_metacognition": true
  }'
```

### From Python

```python
import requests

response = requests.post(
    "http://localhost:8000/query",
    json={
        "question": "What are the effects of portfolio diversification?",
        "top_k": 5,
        "enable_metacognition": True
    }
)

result = response.json()
print(f"Answer: {result['answer']}")
print(f"Sources: {len(result['sources'])} documents")
print(f"Iterations: {result['metacognition']['iterations']}")
```

## 🧠 What Is the Metacognitive Agent?

The metacognitive agent produces high-quality answers through the following four stages (condensed into the code sketch after this list):

### 1️⃣ Planning
- Analyze the question
- Formulate an answering strategy
- Identify the information needed

### 2️⃣ Monitoring
- Monitor the answer-generation process
- Detect hallucinations
- Verify logical soundness

### 3️⃣ Evaluation
- Assess completeness, accuracy, clarity, and reliability
- Assign scores from 1 to 10
- Identify the parts that need improvement

### 4️⃣ Revision
- Improve the answer based on the feedback
- At most 2 iterations
- Stop once the score reaches 8 or higher

+ ## ๐Ÿ”ง ๊ณ ๊ธ‰ ์„ค์ •
175
+
176
+ ### ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ ๋ณ€๊ฒฝ
177
+
178
+ #### OpenAI ์‚ฌ์šฉ (์œ ๋ฃŒ, ๊ณ ํ’ˆ์งˆ)
179
+ ```env
180
+ EMBEDDING_MODEL=openai
181
+ EMBEDDING_MODEL_NAME=text-embedding-ada-002
182
+ OPENAI_API_KEY=your_openai_key
183
+ ```
184
+
185
+ #### Cohere ์‚ฌ์šฉ (๋ฌด๋ฃŒ ํ‹ฐ์–ด ์žˆ์Œ)
186
+ ```env
187
+ EMBEDDING_MODEL=cohere
188
+ EMBEDDING_MODEL_NAME=embed-multilingual-v3.0
189
+ COHERE_API_KEY=your_cohere_key
190
+ ```
191
+
192
+ ### ์ฒญํ‚น ํŒŒ๋ผ๋ฏธํ„ฐ ์กฐ์ •
193
+
194
+ ```env
195
+ CHUNK_SIZE=1000 # ์ฒญํฌ ํฌ๊ธฐ (๊ธฐ๋ณธ: 1000์ž)
196
+ CHUNK_OVERLAP=200 # ์ฒญํฌ ๊ฒน์นจ (๊ธฐ๋ณธ: 200์ž)
197
+ ```
198
+
199
+ ## ๐Ÿ“Š ์„ฑ๋Šฅ ์ตœ์ ํ™”
200
+
201
+ ### ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰
202
+ - **์ธ๋ฑ์‹ฑ ์‹œ**: ~2-4GB (2,639๊ฐœ PDF ๊ธฐ์ค€)
203
+ - **API ์‹คํ–‰ ์‹œ**: ~1-2GB
204
+
205
+ ### ์‘๋‹ต ์‹œ๊ฐ„
206
+ - **๋ฉ”ํƒ€์ธ์ง€ ๋น„ํ™œ์„ฑํ™”**: ~2-5์ดˆ
207
+ - **๋ฉ”ํƒ€์ธ์ง€ ํ™œ์„ฑํ™”**: ~10-30์ดˆ (ํ’ˆ์งˆ โฌ†๏ธ)
208
+
209
+ ### ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ
210
+ ์ž„๋ฒ ๋”ฉ ์ƒ์„ฑ ์‹œ ๋ฐฐ์น˜ ํฌ๊ธฐ ์กฐ์ •:
211
+ ```python
212
+ # scripts/index_pdfs.py ์ˆ˜์ •
213
+ embeddings = embedder.embed_batch(texts, batch_size=64) # ๊ธฐ๋ณธ: 32
214
+ ```
215
+
216
+ ## ๐Ÿ› ๋ฌธ์ œ ํ•ด๊ฒฐ
217
+
218
+ ### PDF ๊ฒฝ๋กœ ์˜ค๋ฅ˜
219
+ ```
220
+ FileNotFoundError: Directory not found
221
+ ```
222
+ โ†’ `.env` ํŒŒ์ผ์˜ `PDF_SOURCE_PATH` ํ™•์ธ
223
+
224
+ ### API ํ‚ค ์˜ค๋ฅ˜
225
+ ```
226
+ AuthenticationError: Invalid API key
227
+ ```
228
+ โ†’ `.env` ํŒŒ์ผ์˜ `ANTHROPIC_API_KEY` ํ™•์ธ
229
+
230
+ ### Vector DB๊ฐ€ ๋น„์–ด์žˆ์Œ
231
+ ```
232
+ total_documents: 0
233
+ ```
234
+ โ†’ `python scripts/index_pdfs.py` ๋จผ์ € ์‹คํ–‰
235
+
236
+ ### ๋ฉ”๋ชจ๋ฆฌ ๋ถ€์กฑ
237
+ โ†’ ๋ฐฐ์น˜ ํฌ๊ธฐ ์ค„์ด๊ธฐ: `batch_size=16`
238
+
239
+ ## ๐Ÿ“ˆ ๋‹ค์Œ ๋‹จ๊ณ„
240
+
241
+ 1. **๋ฒกํ„ฐ DB ์—…๋กœ๋“œ**
242
+ ```bash
243
+ git add data/chroma_db/
244
+ git commit -m "Add vector database"
245
+ git push
246
+ ```
247
+
248
+ 2. **ํด๋ผ์šฐ๋“œ ๋ฐฐํฌ** (์„ ํƒ์‚ฌํ•ญ)
249
+ - AWS EC2 / GCP / Azure
250
+ - Docker ์ปจํ…Œ์ด๋„ˆํ™”
251
+ - API ํ‚ค ๊ด€๋ฆฌ (AWS Secrets Manager ๋“ฑ)
252
+
253
+ 3. **ํ”„๋ก ํŠธ์—”๋“œ ๊ตฌ์ถ•** (์„ ํƒ์‚ฌํ•ญ)
254
+ - Streamlit / Gradio
255
+ - React / Vue.js
256
+
257
+ ## ๐Ÿค ๊ธฐ์—ฌ
258
+
259
+ ์ด์Šˆ ๋ฐ PR ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค!
260
+
261
+ ## ๐Ÿ“„ ๋ผ์ด์„ ์Šค
262
+
263
+ MIT License
264
+
265
+ ## ๐Ÿ‘จโ€๐ŸŽ“ ์ž‘์„ฑ์ž
266
+
267
+ ์กฐ์„ฑ์ง„ (Seongjin Cho)
268
+ - ํ•œ์–‘๋Œ€ํ•™๊ต ๊ณตํ•™๋ฐ•์‚ฌ ๊ณผ์ •
269
+ - ๊ธˆ์œต/๊ฒฝ์ œ ์—ฐ๊ตฌ
270
+
271
+ ---
272
+
273
+ **โš ๏ธ ์ค‘์š” ์•Œ๋ฆผ:**
274
+ - API ํ‚ค๋ฅผ ์ ˆ๋Œ€ GitHub์— ์ปค๋ฐ‹ํ•˜์ง€ ๋งˆ์„ธ์š”!
275
+ - `.env` ํŒŒ์ผ์€ `.gitignore`์— ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค
276
+ - PDF ์›๋ณธ์€ ์šฉ๋Ÿ‰์ด ํฌ๋ฏ€๋กœ ๋ฒกํ„ฐ DB๋งŒ ์—…๋กœ๋“œํ•˜์„ธ์š”
USAGE_GUIDE.md ADDED
@@ -0,0 +1,366 @@
# 📖 Usage Guide

## Table of Contents
1. [Installation](#1-installation)
2. [PDF Indexing](#2-pdf-indexing)
3. [Running the API Server](#3-running-the-api-server)
4. [Asking Questions](#4-asking-questions)
5. [Advanced Usage](#5-advanced-usage)
6. [Troubleshooting](#6-troubleshooting)

---

## 1. Installation

### 1-1. Check your Python environment
```bash
python --version  # 3.8 or higher required
```

### 1-2. Create a virtual environment
```bash
# Create the virtual environment
python -m venv venv

# Activate (macOS/Linux)
source venv/bin/activate

# Activate (Windows)
venv\Scripts\activate
```

### 1-3. Install dependencies
```bash
pip install -r requirements.txt
```

**Key packages installed:**
- FastAPI (web server)
- Anthropic (Claude API)
- ChromaDB (vector database)
- Sentence Transformers (embeddings)
- PyPDF2, pdfplumber (PDF processing)

---

## 2. PDF Indexing

### 2-1. Configure environment variables

Create the `.env` file:
```bash
cp .env.example .env
```

Edit the `.env` file:
```env
# Required: Anthropic API key
ANTHROPIC_API_KEY=sk-ant-api03-xxx...

# Required: PDF file path (local MacBook)
PDF_SOURCE_PATH=/Users/seongjincho/Desktop/HYU-06-공학박사 도전기/25.8.15(펀더멘털 DB 레퍼런스)/data/

# Optional: embedding model (defaults recommended)
EMBEDDING_MODEL=sentence-transformers
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2
```

### 2-2. Run the indexing

```bash
python scripts/index_pdfs.py
```

**Expected duration:**
- 2,639 PDF files
- About 30-60 minutes (with the free model)
- Plus a model download on the first run (~90MB)

**Progress output:**
```
[1/4] Processing PDF files...
  - Total documents: 2639
  - Total pages: 50000+

[2/4] Chunking text...
  - Total chunks: 30000+

[3/4] Generating embeddings...
  - Embedding count: 30000+
  - Embedding dimension: 384

[4/4] Saving to the vector DB...
  - Save complete!
```

### 2-3. Verify the indexing

```bash
python scripts/check_vector_db.py
```

---

## 3. Running the API Server

### 3-1. Start the server

```bash
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

**Or simply:**
```bash
python app/main.py
```

### 3-2. Check the server

In a browser:
- API docs: http://localhost:8000/docs
- Alternative docs: http://localhost:8000/redoc

From a terminal:
```bash
curl http://localhost:8000/health
```

---

## 4. Asking Questions

### 4-1. Via the web UI (easiest)

1. Open http://localhost:8000/docs
2. Click `POST /query`
3. Click "Try it out"
4. Enter the request body:
```json
{
  "question": "What are the main causes of financial crises?",
  "top_k": 5,
  "enable_metacognition": true
}
```
5. Click "Execute"

### 4-2. From a terminal

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the effects of portfolio diversification?",
    "top_k": 5,
    "enable_metacognition": true
  }'
```

### 4-3. From a Python script

```bash
python scripts/test_query.py
```

Or in Python code:
```python
import requests

response = requests.post(
    "http://localhost:8000/query",
    json={
        "question": "What are the effects of central bank interest rate policy?",
        "top_k": 5,
        "enable_metacognition": True
    }
)

result = response.json()
print(result["answer"])
```

---

## 5. Advanced Usage

### 5-1. Disabling metacognition (faster responses)

```json
{
  "question": "your question",
  "top_k": 5,
  "enable_metacognition": false
}
```

**Difference:**
- Metacognition enabled: 10-30 seconds, higher quality
- Metacognition disabled: 2-5 seconds, standard quality

### 5-2. Adjusting the number of search results

```json
{
  "question": "your question",
  "top_k": 10,
  "enable_metacognition": true
}
```

A larger `top_k` retrieves more documents for the answer.

### 5-3. Metadata filtering

Search only papers by a specific author:
```json
{
  "question": "your question",
  "top_k": 5,
  "filter_metadata": {
    "author": "John Doe"
  }
}
```

### 5-4. Understanding the response structure

```json
{
  "question": "the original question",
  "answer": "the generated answer",
  "sources": [
    {
      "text": "document content...",
      "source_filename": "paper123.pdf",
      "similarity": 0.89,
      "metadata": {
        "title": "paper title",
        "author": "author"
      }
    }
  ],
  "metacognition": {
    "thinking_history": [...],
    "iterations": 2
  },
  "search_stats": {
    "documents_found": 5,
    "top_similarity": 0.89
  }
}
```

---

## 6. Troubleshooting

### Problem 1: PDF path error
```
FileNotFoundError: Directory not found
```

**Fix:**
```bash
# Check the PDF path in the .env file
nano .env

# Confirm the path is correct
ls "/Users/seongjincho/Desktop/..."
```

### Problem 2: API key error
```
AuthenticationError: Invalid API key
```

**Fix:**
```bash
# Check the .env file
nano .env

# Confirm ANTHROPIC_API_KEY is valid
# The key should start with sk-ant-api03-
```

### Problem 3: Vector DB is empty
```
total_documents: 0
```

**Fix:**
```bash
# Run the indexing first
python scripts/index_pdfs.py

# Verify
python scripts/check_vector_db.py
```

### Problem 4: Out of memory
```
MemoryError
```

**Fix:**
```python
# Edit scripts/index_pdfs.py
# Reduce the batch size
embeddings = embedder.embed_batch(texts, batch_size=16)  # 32 → 16
```

### Problem 5: Server won't start
```
Address already in use
```

**Fix:**
```bash
# Change the port
uvicorn app.main:app --reload --port 8001

# Or in the .env file
API_PORT=8001
```

### Problem 6: Embedding model download fails
```
ConnectionError
```

**Fix:**
```bash
# Download the model manually
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
```

---

## 💡 Tips

### Performance
1. **Use an SSD**: storing the vector DB on an SSD is recommended
2. **Memory**: at least 8GB RAM recommended
3. **Batch size**: without a GPU, use batch_size=16-32

### Cost savings
1. **Free embeddings**: use Sentence Transformers
2. **Disable metacognition**: for quick tests

### Answer quality
1. **Enable metacognition**: when high-quality answers are needed
2. **Increase top_k**: consult more documents
3. **Tune the chunk size**: when longer context is needed

---

## 📞 Support

If your problem isn't resolved:
1. Open a GitHub issue
2. Attach the log file
3. Copy the full error message

**Checking logs:**
```bash
# API server logs print to the terminal
# Save them to a file if needed
uvicorn app.main:app > server.log 2>&1
```
app/__init__.py ADDED
File without changes
app/api/__init__.py ADDED
File without changes
app/api/models.py ADDED
@@ -0,0 +1,85 @@
"""
API request/response model definitions (Pydantic)
"""

from pydantic import BaseModel, Field
from typing import List, Dict, Optional, Any


class QueryRequest(BaseModel):
    """Question request model"""
    question: str = Field(..., description="User question")
    top_k: int = Field(default=5, ge=1, le=20, description="Number of documents to retrieve")
    enable_metacognition: bool = Field(default=True, description="Whether to enable the metacognitive process")
    filter_metadata: Optional[Dict[str, str]] = Field(default=None, description="Metadata filter")

    class Config:
        json_schema_extra = {
            "example": {
                "question": "What are the main causes of financial crises?",
                "top_k": 5,
                "enable_metacognition": True
            }
        }


class SourceDocument(BaseModel):
    """Source document model"""
    text: str = Field(..., description="Document text")
    source_filename: str = Field(..., description="Source file name")
    similarity: float = Field(..., description="Similarity score")
    metadata: Dict[str, Any] = Field(default_factory=dict, description="Document metadata")


class MetaCognitionInfo(BaseModel):
    """Metacognition info model"""
    thinking_history: List[Dict[str, Any]] = Field(..., description="History of the thinking process")
    iterations: int = Field(..., description="Number of revision iterations")


class SearchStats(BaseModel):
    """Search statistics model"""
    documents_found: int = Field(..., description="Number of documents found")
    top_similarity: float = Field(..., description="Highest similarity score")


class QueryResponse(BaseModel):
    """Question response model"""
    question: str = Field(..., description="Original question")
    answer: str = Field(..., description="Generated answer")
    sources: List[SourceDocument] = Field(..., description="Source documents consulted")
    metacognition: Optional[MetaCognitionInfo] = Field(default=None, description="Metacognition info")
    search_stats: SearchStats = Field(..., description="Search statistics")

    class Config:
        json_schema_extra = {
            "example": {
                "question": "What are the main causes of financial crises?",
                "answer": "The main causes of the 2008 financial crisis were...",
                "sources": [
                    {
                        "text": "paper content...",
                        "source_filename": "financial_crisis_2008.pdf",
                        "similarity": 0.89,
                        "metadata": {"author": "John Doe"}
                    }
                ],
                "search_stats": {
                    "documents_found": 5,
                    "top_similarity": 0.89
                }
            }
        }


class HealthResponse(BaseModel):
    """Health check response"""
    status: str = Field(..., description="Server status")
    vector_store: Dict[str, Any] = Field(..., description="Vector store info")
    embedding_model: Dict[str, Any] = Field(..., description="Embedding model info")


class ErrorResponse(BaseModel):
    """Error response"""
    error: str = Field(..., description="Error message")
    detail: Optional[str] = Field(default=None, description="Details")
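As a quick illustration of the validation these models give the API (field names from this file; behavior per Pydantic v2, as pinned in requirements.txt):

```python
# Sketch: exercising QueryRequest's declared constraints (Pydantic v2).
from pydantic import ValidationError
from app.api.models import QueryRequest

req = QueryRequest(question="What drives systemic risk?")
print(req.top_k, req.enable_metacognition)  # 5 True (the declared defaults)

try:
    QueryRequest(question="?", top_k=50)  # rejected: top_k is bounded by le=20
except ValidationError as err:
    print("rejected:", err.error_count(), "error(s)")
```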
app/api/routes.py ADDED
@@ -0,0 +1,151 @@
"""
FastAPI route definitions
"""

from fastapi import APIRouter, HTTPException, status
from loguru import logger

from app.api.models import (
    QueryRequest,
    QueryResponse,
    HealthResponse,
    ErrorResponse
)

# Create the router
router = APIRouter()

# The RAG pipeline is injected from main.py
rag_pipeline = None


def set_rag_pipeline(pipeline):
    """Set the RAG pipeline"""
    global rag_pipeline
    rag_pipeline = pipeline


@router.get("/", tags=["Root"])
async def root():
    """API root endpoint"""
    return {
        "message": "Financial RAG API with Metacognitive Agent",
        "version": "1.0.0",
        "endpoints": {
            "health": "/health",
            "query": "/query",
            "docs": "/docs"
        }
    }


@router.get(
    "/health",
    response_model=HealthResponse,
    tags=["Health"],
    summary="Health check"
)
async def health_check():
    """
    Check system status.

    Returns:
        System statistics and status information
    """
    try:
        if not rag_pipeline:
            raise HTTPException(
                status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
                detail="RAG pipeline not initialized"
            )

        stats = rag_pipeline.get_statistics()

        return HealthResponse(
            status="healthy",
            vector_store=stats["vector_store"],
            embedding_model=stats["embedding_model"]
        )

    except HTTPException:
        raise  # propagate deliberate HTTP errors (e.g., the 503 above) unchanged
    except Exception as e:
        logger.error(f"Health check failed: {str(e)}")
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=str(e)
        )


@router.post(
    "/query",
    response_model=QueryResponse,
    tags=["Query"],
    summary="Ask a question",
    description="Generate an answer to a finance/economics question with the RAG system"
)
async def query(request: QueryRequest):
    """
    Generate an answer to a question.

    Args:
        request: question request (question, top_k, enable_metacognition, etc.)

    Returns:
        The answer, source documents, metacognition info, etc.
    """
    try:
        if not rag_pipeline:
            raise HTTPException(
                status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
                detail="RAG pipeline not initialized"
            )

        logger.info(f"Received query: {request.question}")

        # Process the question through the RAG pipeline
        result = await rag_pipeline.query(
            question=request.question,
            top_k=request.top_k,
            enable_metacognition=request.enable_metacognition,
            filter_metadata=request.filter_metadata
        )

        logger.info("Query processed successfully")

        return QueryResponse(**result)

    except HTTPException:
        raise  # propagate deliberate HTTP errors unchanged
    except Exception as e:
        logger.error(f"Query failed: {str(e)}")
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Query processing failed: {str(e)}"
        )


@router.get(
    "/stats",
    tags=["Stats"],
    summary="Statistics"
)
async def get_stats():
    """
    RAG system statistics.

    Returns:
        Statistics on the vector store, embedding model, etc.
    """
    try:
        if not rag_pipeline:
            raise HTTPException(
                status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
                detail="RAG pipeline not initialized"
            )

        stats = rag_pipeline.get_statistics()
        return stats

    except HTTPException:
        raise  # propagate deliberate HTTP errors unchanged
    except Exception as e:
        logger.error(f"Stats retrieval failed: {str(e)}")
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=str(e)
        )
app/main.py ADDED
@@ -0,0 +1,121 @@
"""
FastAPI main application

Run with:
    uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
"""

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from loguru import logger
import sys

from app.api import routes
from app.metacognitive_agent import MetaCognitiveAgent
from app.rag_pipeline import RAGPipeline
from services.vector_store import VectorStore
from services.embedder import Embedder
from utils.config import settings

# Logging setup
logger.remove()
logger.add(
    sys.stdout,
    format="<green>{time:YYYY-MM-DD HH:mm:ss}</green> | <level>{level: <8}</level> | <cyan>{name}</cyan>:<cyan>{function}</cyan> - <level>{message}</level>",
    level="INFO"
)

# Create the FastAPI app
app = FastAPI(
    title="Financial RAG API",
    description="RAG system over finance/economics papers with a metacognitive agent",
    version="1.0.0",
    docs_url="/docs",
    redoc_url="/redoc"
)

# CORS setup
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # restrict to specific domains in production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


@app.on_event("startup")
async def startup_event():
    """Initialize on server startup"""
    logger.info("=" * 80)
    logger.info("Starting Financial RAG API...")
    logger.info("=" * 80)

    try:
        # 1. Initialize the vector store
        logger.info("1️⃣ Initializing the vector store...")
        vector_store = VectorStore(
            persist_directory=settings.chroma_persist_directory,
            collection_name=settings.collection_name
        )
        logger.info(f"✅ Vector store ready ({vector_store.collection.count()} documents)")

        # 2. Initialize the embedder
        logger.info("2️⃣ Initializing the embedder...")
        embedder = Embedder(
            model_type=settings.embedding_model,
            model_name=settings.embedding_model_name,
            openai_api_key=settings.openai_api_key,
            cohere_api_key=settings.cohere_api_key
        )
        logger.info(f"✅ Embedder ready ({embedder.get_embedding_dimension()} dimensions)")

        # 3. Initialize the metacognitive agent
        logger.info("3️⃣ Initializing the metacognitive agent...")
        agent = MetaCognitiveAgent(api_key=settings.anthropic_api_key)
        logger.info(f"✅ Agent ready ({agent.model})")

        # 4. Create the RAG pipeline
        logger.info("4️⃣ Creating the RAG pipeline...")
        rag_pipeline = RAGPipeline(
            vector_store=vector_store,
            embedder=embedder,
            metacognitive_agent=agent
        )
        logger.info("✅ RAG pipeline created")

        # Hand the pipeline to the router
        routes.set_rag_pipeline(rag_pipeline)

        logger.info("=" * 80)
        logger.info("✨ API server ready!")
        logger.info(f"📚 Vector DB: {vector_store.collection.count()} documents")
        logger.info(f"🤖 Model: {agent.model}")
        logger.info(f"🔗 API Docs: http://{settings.api_host}:{settings.api_port}/docs")
        logger.info("=" * 80)

    except Exception as e:
        logger.error(f"❌ Initialization failed: {str(e)}")
        raise


@app.on_event("shutdown")
async def shutdown_event():
    """Clean up on server shutdown"""
    logger.info("Shutting down the API server...")


# Register the router
app.include_router(routes.router)


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(
        "app.main:app",
        host=settings.api_host,
        port=settings.api_port,
        reload=True,
        log_level="info"
    )
app/metacognitive_agent.py ADDED
@@ -0,0 +1,289 @@
"""
Metacognitive Agent

This agent uses the following metacognitive strategies:
1. Planning: formulate an answering strategy
2. Monitoring: watch the answering process
3. Evaluation: assess answer quality
4. Revision: improve the answer when needed
"""

from typing import List, Dict, Optional
from anthropic import Anthropic
from loguru import logger
import json


class MetaCognitiveAgent:
    """An AI agent with metacognitive abilities"""

    def __init__(self, api_key: str):
        """
        Args:
            api_key: Anthropic API key
        """
        self.client = Anthropic(api_key=api_key)
        self.thinking_history = []
        self.model = "claude-3-5-sonnet-20241022"

        # Metacognition prompts
        self.reflection_prompts = {
            "planning": """
You are an expert in finance/economics. Formulate a strategy for answering the following question.

Question: {query}

Retrieved documents:
{context}

Build an answer plan considering:
1. What key information does the question ask for?
2. Are the provided documents sufficient to answer it?
3. Which information should be used first?
4. What caveats or limitations apply?

Write the plan as JSON:
{{
    "key_information": "the key information the question asks for",
    "context_adequacy": "adequacy of the documents (sufficient/insufficient/unclear)",
    "strategy": "answering strategy",
    "limitations": "caveats and limitations"
}}
""",

            "monitoring": """
Review the answer currently being generated.

Question: {query}
Current answer: {response}

Check the following:
1. Does the answer address the question directly?
2. Does it use the provided documents accurately?
3. Is the reasoning logically sound?
4. Does it contain hallucinations (unsupported claims)?

Write the assessment as JSON:
{{
    "relevance": "relevance to the question (high/medium/low)",
    "accuracy": "accuracy (high/medium/low)",
    "logic": "logical soundness (sound/fair/problematic)",
    "hallucination_risk": "hallucination risk (low/medium/high)",
    "issues": ["problems found"]
}}
""",

            "evaluation": """
Evaluate the final answer.

Question: {query}
Answer: {response}
Sources used: {sources}

Evaluate against these criteria:
1. Completeness: does it fully answer the question?
2. Accuracy: is the information correct?
3. Clarity: is the answer clear and easy to understand?
4. Reliability: are the sources clear and trustworthy?

Write the evaluation as JSON:
{{
    "completeness": "completeness score (1-10)",
    "accuracy": "accuracy score (1-10)",
    "clarity": "clarity score (1-10)",
    "reliability": "reliability score (1-10)",
    "overall_score": "overall score (1-10)",
    "feedback": "what needs improvement"
}}
""",

            "revision": """
Improve the answer.

Original answer: {response}
Evaluation feedback: {feedback}

Improve the answer based on the feedback. In particular:
1. Correct inaccurate information
2. Fill in incomplete parts
3. Clarify unclear wording
4. Remove unsupported claims

Provide only the improved answer.
"""
        }

    async def think_and_reflect(
        self,
        query: str,
        context_documents: List[Dict],
        max_iterations: int = 2
    ) -> Dict:
        """
        Generate an answer through the metacognitive process.

        Args:
            query: user question
            context_documents: retrieved documents
            max_iterations: maximum number of revision iterations

        Returns:
            Final answer and metacognition details
        """
        self.thinking_history = []

        # Format the context
        context_text = self._format_context(context_documents)

        # Stage 1: Planning
        logger.info("1️⃣ Planning: formulating an answer strategy...")
        plan = await self._plan(query, context_text)
        self.thinking_history.append({"step": "planning", "content": plan})

        # Stage 2: Generate the initial answer
        logger.info("2️⃣ Generating: drafting the initial answer...")
        initial_response = await self._generate_response(query, context_text, plan)
        self.thinking_history.append({"step": "initial_response", "content": initial_response})

        # Stage 3: Monitoring
        logger.info("3️⃣ Monitoring: reviewing the answer...")
        monitoring_result = await self._monitor(query, initial_response)
        self.thinking_history.append({"step": "monitoring", "content": monitoring_result})

        current_response = initial_response

        # Stage 4: Iterative improvement
        for iteration in range(max_iterations):
            # Evaluation
            logger.info(f"4️⃣ Evaluation [{iteration + 1}/{max_iterations}]: scoring the answer...")
            evaluation = await self._evaluate(
                query,
                current_response,
                [doc.get('source_filename', 'unknown') for doc in context_documents]
            )
            self.thinking_history.append({"step": f"evaluation_{iteration}", "content": evaluation})

            # Stop once the evaluation score is high enough
            try:
                eval_data = json.loads(evaluation)
                overall_score = float(eval_data.get('overall_score', 0))

                if overall_score >= 8.0:
                    logger.info(f"✅ Quality threshold reached (score: {overall_score}/10)")
                    break
            except (json.JSONDecodeError, ValueError, TypeError):
                # The model may not return parseable JSON; revise anyway
                pass

            # Revision
            logger.info(f"5️⃣ Revision [{iteration + 1}/{max_iterations}]: improving the answer...")
            current_response = await self._revise(current_response, evaluation)
            self.thinking_history.append({"step": f"revision_{iteration}", "content": current_response})

        return {
            "query": query,
            "final_response": current_response,
            "thinking_history": self.thinking_history,
            "context_documents": context_documents,
            "iterations": len([h for h in self.thinking_history if "revision" in h["step"]])
        }

    async def _plan(self, query: str, context: str) -> str:
        """Formulate the plan"""
        prompt = self.reflection_prompts["planning"].format(
            query=query,
            context=context
        )

        message = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    async def _generate_response(self, query: str, context: str, plan: str) -> str:
        """Generate the initial answer"""
        prompt = f"""
You are an expert in finance/economics.

Answer plan:
{plan}

Question: {query}

Reference documents:
{context}

Answer the question following the plan above. You must:
1. Use only information from the provided documents
2. Do not guess when information is uncertain
3. Cite your sources clearly
4. Answer in Korean
"""

        message = self.client.messages.create(
            model=self.model,
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    async def _monitor(self, query: str, response: str) -> str:
        """Monitor the answer"""
        prompt = self.reflection_prompts["monitoring"].format(
            query=query,
            response=response
        )

        message = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    async def _evaluate(self, query: str, response: str, sources: List[str]) -> str:
        """Evaluate the answer"""
        prompt = self.reflection_prompts["evaluation"].format(
            query=query,
            response=response,
            sources=", ".join(sources)
        )

        message = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    async def _revise(self, response: str, feedback: str) -> str:
        """Revise the answer"""
        prompt = self.reflection_prompts["revision"].format(
            response=response,
            feedback=feedback
        )

        message = self.client.messages.create(
            model=self.model,
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    def _format_context(self, documents: List[Dict]) -> str:
        """Format the documents into context text"""
        formatted = []
        for i, doc in enumerate(documents, 1):
            text = doc.get('text', doc.get('document', ''))
            metadata = doc.get('metadata', {})
            source = metadata.get('source_filename', 'Unknown')

            formatted.append(f"[Document {i}] {source}\n{text}\n")

        return "\n".join(formatted)
app/rag_pipeline.py ADDED
@@ -0,0 +1,163 @@
"""
RAG (Retrieval-Augmented Generation) pipeline

A RAG system combining vector search with the metacognitive agent
"""

from typing import List, Dict, Optional
from loguru import logger

from services.vector_store import VectorStore
from services.embedder import Embedder
from app.metacognitive_agent import MetaCognitiveAgent
from utils.config import settings


class RAGPipeline:
    """RAG pipeline class"""

    def __init__(
        self,
        vector_store: VectorStore,
        embedder: Embedder,
        metacognitive_agent: MetaCognitiveAgent
    ):
        """
        Args:
            vector_store: vector store
            embedder: embedding generator
            metacognitive_agent: metacognitive agent
        """
        self.vector_store = vector_store
        self.embedder = embedder
        self.agent = metacognitive_agent

    async def query(
        self,
        question: str,
        top_k: int = 5,
        enable_metacognition: bool = True,
        filter_metadata: Optional[Dict[str, str]] = None
    ) -> Dict:
        """
        Generate an answer to a question.

        Args:
            question: user question
            top_k: number of documents to retrieve
            enable_metacognition: whether to enable the metacognitive process
            filter_metadata: metadata filter

        Returns:
            The answer and related information
        """
        logger.info(f"RAG Query: {question}")

        # 1. Embed the question
        logger.info("1️⃣ Embedding the question...")
        query_embedding = self.embedder.embed_text(question)

        # 2. Retrieve related documents
        logger.info(f"2️⃣ Searching for related documents (top_k={top_k})...")
        search_results = self.vector_store.search(
            query_embedding=query_embedding,
            top_k=top_k,
            filter_metadata=filter_metadata
        )

        # Format the search results
        context_documents = []
        for doc, metadata, distance in zip(
            search_results['documents'],
            search_results['metadatas'],
            search_results['distances']
        ):
            context_documents.append({
                'text': doc,
                'metadata': metadata,
                'similarity': 1 - distance,  # convert distance to similarity (assumes cosine distance)
                'source_filename': metadata.get('source_filename', 'unknown')
            })

        logger.info(f"Search complete: {len(context_documents)} documents found")

        # 3. Generate the answer with the metacognitive agent
        if enable_metacognition:
            logger.info("3️⃣ Generating the answer with the metacognitive agent...")
            result = await self.agent.think_and_reflect(
                query=question,
                context_documents=context_documents
            )

            return {
                "question": question,
                "answer": result["final_response"],
                "sources": context_documents,
                "metacognition": {
                    "thinking_history": result["thinking_history"],
                    "iterations": result["iterations"]
                },
                "search_stats": {
                    "documents_found": len(context_documents),
                    "top_similarity": context_documents[0]['similarity'] if context_documents else 0
                }
            }
        else:
            # Simple answer without metacognition
            logger.info("3️⃣ Generating a simple answer...")
            simple_response = await self._generate_simple_response(question, context_documents)

            return {
                "question": question,
                "answer": simple_response,
                "sources": context_documents,
                "search_stats": {
                    "documents_found": len(context_documents),
                    "top_similarity": context_documents[0]['similarity'] if context_documents else 0
                }
            }

    async def _generate_simple_response(self, question: str, context_documents: List[Dict]) -> str:
        """Generate a simple answer without metacognition"""
        # Format the context
        context_text = "\n\n".join([
            f"[Source: {doc['source_filename']}]\n{doc['text']}"
            for doc in context_documents
        ])

        prompt = f"""
You are an expert in finance/economics.

Question: {question}

Reference documents:
{context_text}

Answer the question using the documents above. You must:
1. Use only information from the provided documents
2. Do not guess when information is uncertain
3. Cite your sources clearly
4. Answer in Korean
"""

        message = self.agent.client.messages.create(
            model=self.agent.model,
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    def get_statistics(self) -> Dict:
        """RAG system statistics"""
        vector_stats = self.vector_store.get_collection_stats()

        return {
            "vector_store": vector_stats,
            "embedding_model": {
                "type": self.embedder.model_type,
                "name": self.embedder.model_name,
                "dimension": self.embedder.get_embedding_dimension()
            },
            "agent_model": self.agent.model
        }
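`services/vector_store.py` is not included in this commit view. From the call sites above (`search`, `get_collection_stats`, `collection.count`), a minimal sketch of the `search` interface, assuming ChromaDB 0.4.x as pinned in requirements.txt, could look like this:

```python
# Hypothetical sketch of services/vector_store.py's search() (not in this
# commit). It assumes chromadb 0.4.x and flattens the per-query nesting that
# collection.query() returns, matching how rag_pipeline.py consumes it.
from typing import Dict, List, Optional
import chromadb

class VectorStore:
    def __init__(self, persist_directory: str, collection_name: str):
        self.client = chromadb.PersistentClient(path=persist_directory)
        self.collection = self.client.get_or_create_collection(collection_name)

    def search(
        self,
        query_embedding: List[float],
        top_k: int = 5,
        filter_metadata: Optional[Dict[str, str]] = None,
    ) -> Dict:
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k,
            where=filter_metadata,  # exact-match metadata filter, e.g. {"author": "John Doe"}
        )
        # ChromaDB returns one list per query; unwrap the single query.
        return {
            "documents": results["documents"][0],
            "metadatas": results["metadatas"][0],
            "distances": results["distances"][0],
        }
```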
requirements.txt ADDED
@@ -0,0 +1,36 @@
# FastAPI and Web Server
fastapi==0.109.0
uvicorn[standard]==0.27.0
pydantic==2.5.3
pydantic-settings==2.1.0
python-multipart==0.0.6

# Anthropic Claude
anthropic==0.18.1

# PDF Processing
PyPDF2==3.0.1
pdfplumber==0.10.3
pymupdf==1.23.8

# Vector Database
chromadb==0.4.22
sentence-transformers==2.3.1

# Embeddings (multiple options)
openai==1.10.0
cohere==4.47

# Text Processing
langchain==0.1.4
langchain-community==0.0.17
tiktoken==0.5.2

# Utilities
python-dotenv==1.0.0
tqdm==4.66.1
numpy==1.26.3
pandas==2.1.4

# Logging and Monitoring
loguru==0.7.2
scripts/__init__.py ADDED
File without changes
scripts/check_vector_db.py ADDED
@@ -0,0 +1,85 @@
"""
Vector DB status check script

Inspects the contents of the vector DB after indexing completes.
"""

import sys
from pathlib import Path

# Add the project root to the Python path
project_root = Path(__file__).parent.parent
sys.path.insert(0, str(project_root))

from dotenv import load_dotenv
from services.vector_store import VectorStore
from utils.config import settings


def main():
    """Check the vector DB status"""
    load_dotenv()

    print("=" * 80)
    print("Vector DB status check")
    print("=" * 80)

    # Initialize the vector store
    vector_store = VectorStore(
        persist_directory=settings.chroma_persist_directory,
        collection_name=settings.collection_name
    )

    # Statistics
    stats = vector_store.get_collection_stats()
    print("\n📊 Basic info:")
    print(f"  Collection name: {stats['collection_name']}")
    print(f"  Persist directory: {stats['persist_directory']}")
    print(f"  Total documents: {stats['total_documents']}")
    print(f"  Has data: {'✅ yes' if stats['has_data'] else '❌ no'}")

    if not stats['has_data']:
        print("\n⚠️ The vector DB is empty!")
        print("  Run python scripts/index_pdfs.py first.")
        return

    # Inspect sample data
    print("\n📚 Sample documents:")
    sample = vector_store.collection.peek(limit=3)

    for i, (doc_id, doc, metadata) in enumerate(zip(
        sample['ids'],
        sample['documents'],
        sample['metadatas']
    ), 1):
        print(f"\n[{i}] {doc_id}")
        print(f"  Source: {metadata.get('source_filename', 'unknown')}")
        print(f"  Title: {metadata.get('title', 'N/A')}")
        print(f"  Author: {metadata.get('author', 'N/A')}")
        print(f"  Content: {doc[:150]}...")

    # Simple search test
    print("\n🔍 Search test:")
    test_query = "financial crisis"
    print(f"  Query: '{test_query}'")

    results = vector_store.search_by_text(test_query, top_k=3)

    print(f"  Results: {len(results['documents'])} documents found")
    for i, (doc, metadata, distance) in enumerate(zip(
        results['documents'],
        results['metadatas'],
        results['distances']
    ), 1):
        similarity = 1 - distance
        print(f"\n  [{i}] {metadata.get('source_filename', 'unknown')}")
        print(f"      Similarity: {similarity:.3f}")
        print(f"      Content: {doc[:100]}...")

    print("\n" + "=" * 80)
    print("✅ Vector DB check complete!")
    print("=" * 80)


if __name__ == "__main__":
    main()
scripts/index_pdfs.py ADDED
@@ -0,0 +1,153 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ ๋กœ์ปฌ์—์„œ ์‹คํ–‰ํ•  PDF ์ธ๋ฑ์‹ฑ ์Šคํฌ๋ฆฝํŠธ
3
+
4
+ ์ด ์Šคํฌ๋ฆฝํŠธ๋ฅผ ๋งฅ๋ถ ๋กœ์ปฌ์—์„œ ์‹คํ–‰ํ•˜๋ฉด:
5
+ 1. ์ง€์ •๋œ ๊ฒฝ๋กœ์˜ ๋ชจ๋“  PDF ํŒŒ์ผ ์ฝ๊ธฐ
6
+ 2. ํ…์ŠคํŠธ ์ถ”์ถœ ๋ฐ ์ฒญํ‚น
7
+ 3. ์ž„๋ฒ ๋”ฉ ์ƒ์„ฑ
8
+ 4. ChromaDB์— ์ €์žฅ
9
+
10
+ ์‚ฌ์šฉ๋ฒ•:
11
+ python scripts/index_pdfs.py
12
+ """
13
+
14
+ import sys
15
+ from pathlib import Path
16
+
17
+ # ํ”„๋กœ์ ํŠธ ๋ฃจํŠธ๋ฅผ Python ๊ฒฝ๋กœ์— ์ถ”๊ฐ€
18
+ project_root = Path(__file__).parent.parent
19
+ sys.path.insert(0, str(project_root))
20
+
21
+ from dotenv import load_dotenv
22
+ from loguru import logger
23
+ import time
24
+
25
+ from services.pdf_processor import PDFProcessor
26
+ from services.chunker import TextChunker
27
+ from services.embedder import Embedder
28
+ from services.vector_store import VectorStore
29
+ from utils.config import settings
30
+
31
+
32
+ def main():
33
+ """๋ฉ”์ธ ์ธ๋ฑ์‹ฑ ํ”„๋กœ์„ธ์Šค"""
34
+
35
+ # ํ™˜๊ฒฝ ๋ณ€์ˆ˜ ๋กœ๋“œ
36
+ load_dotenv()
37
+
38
+ logger.info("=" * 80)
39
+ logger.info("PDF ์ธ๋ฑ์‹ฑ ์‹œ์ž‘")
40
+ logger.info("=" * 80)
41
+
42
+ start_time = time.time()
43
+
44
+ # 1. PDF ์ฒ˜๋ฆฌ
45
+ logger.info(f"\n[1/4] PDF ํŒŒ์ผ ์ฒ˜๋ฆฌ ์ค‘...")
46
+ logger.info(f"PDF ๊ฒฝ๋กœ: {settings.pdf_source_path}")
47
+
48
+ pdf_processor = PDFProcessor(settings.pdf_source_path)
49
+ documents = pdf_processor.process_all_pdfs()
50
+
51
+ if not documents:
52
+ logger.error("์ฒ˜๋ฆฌํ•  PDF ํŒŒ์ผ์ด ์—†์Šต๋‹ˆ๋‹ค. ๊ฒฝ๋กœ๋ฅผ ํ™•์ธํ•˜์„ธ์š”.")
53
+ return
54
+
55
+ # ํ†ต๊ณ„ ์ถœ๋ ฅ
56
+ stats = pdf_processor.get_statistics()
57
+ logger.info(f"\n์ฒ˜๋ฆฌ ์™„๋ฃŒ:")
58
+ logger.info(f" - ์ „์ฒด ๋ฌธ์„œ: {stats['total_documents']}๊ฐœ")
59
+ logger.info(f" - ์ „์ฒด ํŽ˜์ด์ง€: {stats['total_pages']}ํŽ˜์ด์ง€")
60
+ logger.info(f" - ํ‰๊ท  ํŽ˜์ด์ง€/๋ฌธ์„œ: {stats['avg_pages_per_doc']:.1f}ํŽ˜์ด์ง€")
61
+ logger.info(f" - ์ „์ฒด ๋ฌธ์ž ์ˆ˜: {stats['total_characters']:,}์ž")
62
+
63
+ # 2. ํ…์ŠคํŠธ ์ฒญํ‚น
64
+ logger.info(f"\n[2/4] ํ…์ŠคํŠธ ์ฒญํ‚น ์ค‘...")
65
+ logger.info(f"์ฒญํฌ ํฌ๊ธฐ: {settings.chunk_size}, ์˜ค๋ฒ„๋žฉ: {settings.chunk_overlap}")
66
+
67
+ chunker = TextChunker(
68
+ chunk_size=settings.chunk_size,
69
+ chunk_overlap=settings.chunk_overlap
70
+ )
71
+ chunks = chunker.chunk_all_documents(documents)
72
+
73
+ chunk_stats = chunker.get_chunk_statistics(chunks)
74
+ logger.info(f"\n์ฒญํ‚น ์™„๋ฃŒ:")
75
+ logger.info(f" - ์ „์ฒด ์ฒญํฌ: {chunk_stats['total_chunks']}๊ฐœ")
76
+ logger.info(f" - ํ‰๊ท  ์ฒญํฌ ๊ธธ์ด: {chunk_stats['avg_chunk_length']:.0f}์ž")
77
+ logger.info(f" - ๋ฌธ์„œ๋‹น ํ‰๊ท  ์ฒญํฌ: {chunk_stats['total_chunks'] / len(documents):.1f}๊ฐœ")
78
+
79
+ # 3. ์ž„๋ฒ ๋”ฉ ์ƒ์„ฑ
80
+ logger.info(f"\n[3/4] ์ž„๋ฒ ๋”ฉ ์ƒ์„ฑ ์ค‘...")
81
+ logger.info(f"์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ: {settings.embedding_model} ({settings.embedding_model_name})")
82
+
83
+ embedder = Embedder(
84
+ model_type=settings.embedding_model,
85
+ model_name=settings.embedding_model_name,
86
+ openai_api_key=settings.openai_api_key,
87
+ cohere_api_key=settings.cohere_api_key
88
+ )
89
+
90
+ texts = [chunk['text'] for chunk in chunks]
91
+ embeddings = embedder.embed_batch(texts, batch_size=32)
92
+
93
+ logger.info(f"\n์ž„๋ฒ ๋”ฉ ์™„๋ฃŒ:")
94
+ logger.info(f" - ์ž„๋ฒ ๋”ฉ ๊ฐœ์ˆ˜: {len(embeddings)}๊ฐœ")
95
+ logger.info(f" - ์ž„๋ฒ ๋”ฉ ์ฐจ์›: {len(embeddings[0])}์ฐจ์›")
96
+
97
+ # 4. Vector DB์— ์ €์žฅ
98
+ logger.info(f"\n[4/4] Vector DB์— ์ €์žฅ ์ค‘...")
99
+ logger.info(f"์ €์žฅ ๊ฒฝ๋กœ: {settings.chroma_persist_directory}")
100
+
101
+ vector_store = VectorStore(
102
+ persist_directory=settings.chroma_persist_directory,
103
+ collection_name=settings.collection_name
104
+ )
105
+
106
+ # ๊ธฐ์กด ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ์œผ๋ฉด ์‚ฌ์šฉ์ž์—๊ฒŒ ํ™•์ธ
107
+ current_count = vector_store.collection.count()
108
+ if current_count > 0:
109
+ logger.warning(f"\n๊ธฐ์กด ๋ฐ์ดํ„ฐ {current_count}๊ฐœ๊ฐ€ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค.")
110
+ response = input("๊ธฐ์กด ๋ฐ์ดํ„ฐ๋ฅผ ์‚ญ์ œํ•˜๊ณ  ์ƒˆ๋กœ ์ธ๋ฑ์‹ฑํ•˜์‹œ๊ฒ ์Šต๋‹ˆ๊นŒ? (y/N): ")
111
+ if response.lower() == 'y':
112
+ vector_store.reset_collection()
113
+ else:
114
+ logger.info("๊ธฐ์กด ๋ฐ์ดํ„ฐ์— ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.")
115
+
116
+ vector_store.add_documents(chunks, embeddings)
117
+
118
+ # ์ตœ์ข… ํ†ต๊ณ„
119
+ final_stats = vector_store.get_collection_stats()
120
+ logger.info(f"\n์ €์žฅ ์™„๋ฃŒ:")
121
+ logger.info(f" - ์ปฌ๋ ‰์…˜: {final_stats['collection_name']}")
122
+ logger.info(f" - ์ „์ฒด ๋ฌธ์„œ: {final_stats['total_documents']}๊ฐœ")
123
+ logger.info(f" - ์ €์žฅ ๊ฒฝ๋กœ: {final_stats['persist_directory']}")
124
+
125
+ # ์ด ์†Œ์š” ์‹œ๊ฐ„
126
+ elapsed_time = time.time() - start_time
127
+ logger.info(f"\n์ด ์†Œ์š” ์‹œ๊ฐ„: {elapsed_time:.1f}์ดˆ ({elapsed_time/60:.1f}๋ถ„)")
128
+
129
+ logger.info("\n" + "=" * 80)
130
+ logger.info("์ธ๋ฑ์‹ฑ ์™„๋ฃŒ! ๐ŸŽ‰")
131
+ logger.info("์ด์ œ FastAPI ์„œ๋ฒ„๋ฅผ ์‹คํ–‰ํ•˜์—ฌ RAG ์‹œ์Šคํ…œ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.")
132
+ logger.info("=" * 80)
133
+
134
+ # ๊ฐ„๋‹จํ•œ ๊ฒ€์ƒ‰ ํ…Œ์ŠคํŠธ
135
+ logger.info("\n๊ฒ€์ƒ‰ ํ…Œ์ŠคํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜์‹œ๊ฒ ์Šต๋‹ˆ๊นŒ?")
136
+ test_query = input("๊ฒ€์ƒ‰์–ด๋ฅผ ์ž…๋ ฅํ•˜์„ธ์š” (Enter๋ฅผ ๋ˆ„๋ฅด๋ฉด ๊ฑด๋„ˆ๋œ€): ").strip()
137
+
138
+ if test_query:
139
+ logger.info(f"\n'{test_query}' ๊ฒ€์ƒ‰ ์ค‘...")
140
+ results = vector_store.search_by_text(test_query, top_k=3)
141
+
142
+ logger.info(f"\n์ƒ์œ„ {len(results['documents'])}๊ฐœ ๊ฒฐ๊ณผ:")
143
+ for i, (doc, metadata, distance) in enumerate(zip(
144
+ results['documents'],
145
+ results['metadatas'],
146
+ results['distances']
147
+ ), 1):
148
+ logger.info(f"\n[{i}] {metadata['source_filename']} (similarity: {1-distance:.3f})")
149
+ logger.info(f"๋‚ด์šฉ: {doc[:200]}...")
150
+
151
+
152
+ if __name__ == "__main__":
153
+ main()
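A note on the similarity value logged above: the collection is created without an explicit distance metric, and ChromaDB's default space is (squared) L2, so `1 - distance` is a monotonic convenience score rather than a true cosine similarity. A minimal sketch of the actual relationship, assuming unit-normalized embeddings (the all-MiniLM-L6-v2 pipeline normalizes its output):

import numpy as np

# For unit vectors, squared L2 distance d and cosine similarity cos satisfy
# d = 2 - 2*cos, i.e. cos = 1 - d/2 (not 1 - d).
a = np.array([1.0, 0.0])          # a unit vector
b = np.array([0.6, 0.8])          # another unit vector
d = float(np.sum((a - b) ** 2))   # squared L2 distance = 0.8
print(1 - d / 2)                  # 0.6 — the true cosine similarity
print(float(a @ b))               # 0.6 — matches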
scripts/test_query.py ADDED
@@ -0,0 +1,116 @@
+ """
+ RAG system test script.
+ 
+ Use while the API server is running.
+ """
+ 
+ import requests
+ from typing import Dict
+ 
+ 
+ def test_query(
+     question: str,
+     top_k: int = 5,
+     enable_metacognition: bool = True,
+     api_url: str = "http://localhost:8000"
+ ) -> Dict:
+     """
+     Send a test question to the API.
+ 
+     Args:
+         question: Question to ask
+         top_k: Number of documents to retrieve
+         enable_metacognition: Enable the metacognitive agent
+         api_url: Base URL of the API server
+ 
+     Returns:
+         Response data
+     """
+     print("=" * 80)
+     print(f"Question: {question}")
+     print("=" * 80)
+ 
+     # Request
+     response = requests.post(
+         f"{api_url}/query",
+         json={
+             "question": question,
+             "top_k": top_k,
+             "enable_metacognition": enable_metacognition
+         }
+     )
+ 
+     if response.status_code != 200:
+         print(f"❌ Error: {response.status_code}")
+         print(response.text)
+         return {}
+ 
+     result = response.json()
+ 
+     # Print the result
+     print("\n📝 Answer:")
+     print("-" * 80)
+     print(result["answer"])
+     print("-" * 80)
+ 
+     print(f"\n📚 Sources: {len(result['sources'])}")
+     for i, source in enumerate(result['sources'][:3], 1):
+         print(f"\n[{i}] {source['source_filename']}")
+         print(f"    Similarity: {source['similarity']:.3f}")
+         print(f"    Content: {source['text'][:100]}...")
+ 
+     if result.get('metacognition'):
+         print("\n🧠 Metacognition info:")
+         print(f"    Iterations: {result['metacognition']['iterations']}")
+         print(f"    Thinking steps: {len(result['metacognition']['thinking_history'])}")
+ 
+     print("\n" + "=" * 80)
+     return result
+ 
+ 
+ def test_health(api_url: str = "http://localhost:8000"):
+     """Health check"""
+     print("🏥 Checking server health...")
+     response = requests.get(f"{api_url}/health")
+ 
+     if response.status_code == 200:
+         data = response.json()
+         print("✅ Server healthy")
+         print(f"    Vector Store: {data['vector_store']['total_documents']} documents")
+         print(f"    Embedding: {data['embedding_model']['type']} ({data['embedding_model']['dimension']} dims)")
+     else:
+         print(f"❌ Server error: {response.status_code}")
+ 
+ 
+ if __name__ == "__main__":
+     # Health check
+     test_health()
+ 
+     print("\n")
+ 
+     # Sample questions
+     questions = [
+         "What are the main causes of financial crises?",
+         "What are the effects of portfolio diversification?",
+         "How does central bank interest-rate policy affect markets?",
+     ]
+ 
+     for question in questions:
+         try:
+             test_query(question, top_k=5, enable_metacognition=True)
+             print("\n\n")
+         except Exception as e:
+             print(f"❌ Error: {str(e)}\n\n")
+ 
+     # Interactive custom questions
+     print("\nEnter a custom question (press Enter to quit):")
+     while True:
+         question = input("\nQuestion: ").strip()
+         if not question:
+             break
+ 
+         try:
+             test_query(question, top_k=5, enable_metacognition=True)
+         except Exception as e:
+             print(f"❌ Error: {str(e)}")
services/__init__.py ADDED
File without changes
services/chunker.py ADDED
@@ -0,0 +1,93 @@
+ """
+ Split text into appropriately sized chunks.
+ """
+ from typing import Any, Dict, List
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+ from loguru import logger
+ 
+ 
+ class TextChunker:
+     """Splits text into semantically meaningful chunks."""
+ 
+     def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
+         """
+         Args:
+             chunk_size: Maximum number of characters per chunk
+             chunk_overlap: Number of characters shared between adjacent chunks (preserves context)
+         """
+         self.chunk_size = chunk_size
+         self.chunk_overlap = chunk_overlap
+ 
+         # Use LangChain's RecursiveCharacterTextSplitter, which tries to split
+         # on paragraphs first, then sentences, then words.
+         self.text_splitter = RecursiveCharacterTextSplitter(
+             chunk_size=chunk_size,
+             chunk_overlap=chunk_overlap,
+             length_function=len,
+             separators=["\n\n", "\n", ". ", " ", ""]
+         )
+ 
+     def chunk_document(self, doc_data: Dict[str, Any]) -> List[Dict[str, Any]]:
+         """
+         Split a single document into chunks.
+ 
+         Args:
+             doc_data: Document data extracted by the PDF processor
+ 
+         Returns:
+             List of chunks with metadata
+         """
+         text = doc_data['text']
+         chunks = self.text_splitter.split_text(text)
+ 
+         chunked_docs = []
+         for i, chunk in enumerate(chunks):
+             chunked_docs.append({
+                 'text': chunk,
+                 'chunk_id': i,
+                 'source_filename': doc_data['filename'],
+                 'source_filepath': doc_data['filepath'],
+                 'total_chunks': len(chunks),
+                 'metadata': doc_data['metadata'],
+                 'page_count': doc_data['page_count']
+             })
+ 
+         return chunked_docs
+ 
+     def chunk_all_documents(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+         """
+         Split a list of documents into chunks.
+ 
+         Args:
+             documents: List of documents extracted by the PDF processor
+ 
+         Returns:
+             List of all chunks from all documents
+         """
+         all_chunks = []
+ 
+         logger.info(f"Chunking {len(documents)} documents...")
+ 
+         for doc in documents:
+             chunks = self.chunk_document(doc)
+             all_chunks.extend(chunks)
+ 
+         logger.info(f"Created {len(all_chunks)} chunks from {len(documents)} documents")
+         logger.info(f"Average {len(all_chunks) / len(documents):.1f} chunks per document")
+ 
+         return all_chunks
+ 
+     def get_chunk_statistics(self, chunks: List[Dict[str, Any]]) -> Dict[str, Any]:
+         """Summary statistics over a list of chunks."""
+         if not chunks:
+             return {}
+ 
+         chunk_lengths = [len(chunk['text']) for chunk in chunks]
+ 
+         return {
+             'total_chunks': len(chunks),
+             'avg_chunk_length': sum(chunk_lengths) / len(chunks),
+             'min_chunk_length': min(chunk_lengths),
+             'max_chunk_length': max(chunk_lengths),
+             'total_characters': sum(chunk_lengths),
+         }
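To sanity-check the chunker in isolation, a minimal sketch (the doc_data dict mimics the PDF processor's output shape; filename, filepath, and page_count are placeholder values):

from services.chunker import TextChunker

chunker = TextChunker(chunk_size=200, chunk_overlap=50)
doc_data = {
    'text': "A paragraph about monetary policy transmission.\n\n" * 20,
    'filename': 'toy.pdf', 'filepath': '/tmp/toy.pdf',
    'metadata': {}, 'page_count': 1,
}
chunks = chunker.chunk_document(doc_data)
print(len(chunks), "chunks")
print(chunker.get_chunk_statistics(chunks))  # lengths should cluster near 200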
services/embedder.py ADDED
@@ -0,0 +1,147 @@
+ """
+ Convert text into vector embeddings.
+ """
+ from typing import List, Optional
+ from loguru import logger
+ from sentence_transformers import SentenceTransformer
+ from tqdm import tqdm
+ 
+ 
+ class Embedder:
+     """Converts text into embedding vectors."""
+ 
+     def __init__(
+         self,
+         model_type: str = "sentence-transformers",
+         model_name: str = "all-MiniLM-L6-v2",
+         openai_api_key: Optional[str] = None,
+         cohere_api_key: Optional[str] = None
+     ):
+         """
+         Args:
+             model_type: Embedding backend to use (sentence-transformers, openai, cohere)
+             model_name: Model name
+             openai_api_key: OpenAI API key (required when model_type is "openai")
+             cohere_api_key: Cohere API key (required when model_type is "cohere")
+         """
+         self.model_type = model_type
+         self.model_name = model_name
+ 
+         if model_type == "sentence-transformers":
+             logger.info(f"Loading Sentence Transformer model: {model_name}")
+             self.model = SentenceTransformer(model_name)
+             logger.info(f"Model loaded. Embedding dimension: {self.model.get_sentence_embedding_dimension()}")
+ 
+         elif model_type == "openai":
+             if not openai_api_key:
+                 raise ValueError("OpenAI API key required for openai embeddings")
+             import openai
+             openai.api_key = openai_api_key
+             self.model = None
+             logger.info(f"Using OpenAI embeddings: {model_name}")
+ 
+         elif model_type == "cohere":
+             if not cohere_api_key:
+                 raise ValueError("Cohere API key required for cohere embeddings")
+             import cohere
+             self.model = cohere.Client(cohere_api_key)
+             logger.info(f"Using Cohere embeddings: {model_name}")
+ 
+         else:
+             raise ValueError(f"Unknown model type: {model_type}")
+ 
+     def embed_text(self, text: str) -> List[float]:
+         """
+         Embed a single text.
+ 
+         Args:
+             text: Text to embed
+ 
+         Returns:
+             Embedding vector as a list of floats
+         """
+         if self.model_type == "sentence-transformers":
+             embedding = self.model.encode(text, convert_to_numpy=True)
+             return embedding.tolist()
+ 
+         elif self.model_type == "openai":
+             import openai
+             response = openai.embeddings.create(
+                 input=text,
+                 model=self.model_name
+             )
+             return response.data[0].embedding
+ 
+         elif self.model_type == "cohere":
+             response = self.model.embed(
+                 texts=[text],
+                 model=self.model_name
+             )
+             return response.embeddings[0]
+ 
+     def embed_batch(self, texts: List[str], batch_size: int = 32) -> List[List[float]]:
+         """
+         Embed multiple texts in batches.
+ 
+         Args:
+             texts: Texts to embed
+             batch_size: Batch size
+ 
+         Returns:
+             List of embedding vectors
+         """
+         logger.info(f"Embedding {len(texts)} texts with batch size {batch_size}")
+ 
+         if self.model_type == "sentence-transformers":
+             # Sentence Transformers handles batching natively
+             embeddings = []
+             for i in tqdm(range(0, len(texts), batch_size), desc="Embedding batches"):
+                 batch = texts[i:i + batch_size]
+                 batch_embeddings = self.model.encode(
+                     batch,
+                     convert_to_numpy=True,
+                     show_progress_bar=False
+                 )
+                 embeddings.extend(batch_embeddings.tolist())
+             return embeddings
+ 
+         elif self.model_type == "openai":
+             # Batch requests to stay within OpenAI API rate limits
+             import openai
+             embeddings = []
+             for i in tqdm(range(0, len(texts), batch_size), desc="Embedding batches"):
+                 batch = texts[i:i + batch_size]
+                 response = openai.embeddings.create(
+                     input=batch,
+                     model=self.model_name
+                 )
+                 batch_embeddings = [item.embedding for item in response.data]
+                 embeddings.extend(batch_embeddings)
+             return embeddings
+ 
+         elif self.model_type == "cohere":
+             # Cohere batch processing
+             embeddings = []
+             for i in tqdm(range(0, len(texts), batch_size), desc="Embedding batches"):
+                 batch = texts[i:i + batch_size]
+                 response = self.model.embed(
+                     texts=batch,
+                     model=self.model_name
+                 )
+                 embeddings.extend(response.embeddings)
+             return embeddings
+ 
+     def get_embedding_dimension(self) -> int:
+         """Return the embedding dimension."""
+         if self.model_type == "sentence-transformers":
+             return self.model.get_sentence_embedding_dimension()
+         elif self.model_type == "openai":
+             if "ada-002" in self.model_name:
+                 return 1536
+             return 1536  # default
+         elif self.model_type == "cohere":
+             if "embed-multilingual" in self.model_name:
+                 return 768
+             return 1024  # default
+         return 768
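A minimal round trip through the free sentence-transformers path (the first run downloads the all-MiniLM-L6-v2 weights, roughly 80 MB):

from services.embedder import Embedder

embedder = Embedder(model_type="sentence-transformers",
                    model_name="all-MiniLM-L6-v2")
vectors = embedder.embed_batch(["interest rate risk", "credit default swaps"])
print(len(vectors), "vectors of dimension", embedder.get_embedding_dimension())
# expected: 2 vectors of dimension 384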
services/pdf_processor.py ADDED
@@ -0,0 +1,111 @@
+ """
+ PDF file processing and text extraction.
+ """
+ from pathlib import Path
+ from typing import Any, Dict, List, Optional
+ import PyPDF2
+ import pdfplumber
+ from loguru import logger
+ from tqdm import tqdm
+ 
+ 
+ class PDFProcessor:
+     """Extracts text and metadata from PDF files."""
+ 
+     def __init__(self, pdf_directory: str):
+         """
+         Args:
+             pdf_directory: Path to the directory containing the PDF files
+         """
+         self.pdf_directory = Path(pdf_directory)
+         self.processed_docs = []
+ 
+     def get_pdf_files(self) -> List[Path]:
+         """Find all PDF files in the directory (recursively)."""
+         if not self.pdf_directory.exists():
+             raise FileNotFoundError(f"Directory not found: {self.pdf_directory}")
+ 
+         pdf_files = list(self.pdf_directory.glob("**/*.pdf"))
+         logger.info(f"Found {len(pdf_files)} PDF files in {self.pdf_directory}")
+         return pdf_files
+ 
+     def extract_text_from_pdf(self, pdf_path: Path) -> Optional[Dict[str, Any]]:
+         """
+         Extract text from a single PDF file.
+ 
+         Args:
+             pdf_path: Path to the PDF file
+ 
+         Returns:
+             Dict with 'text', 'metadata', 'filename', 'page_count',
+             or None if extraction failed
+         """
+         try:
+             # Extract text with pdfplumber (more accurate than PyPDF2)
+             with pdfplumber.open(pdf_path) as pdf:
+                 text = ""
+                 for page in pdf.pages:
+                     page_text = page.extract_text()
+                     if page_text:
+                         text += page_text + "\n\n"
+ 
+             # Extract metadata with PyPDF2
+             with open(pdf_path, 'rb') as f:
+                 pdf_reader = PyPDF2.PdfReader(f)
+                 metadata = pdf_reader.metadata if pdf_reader.metadata else {}
+                 page_count = len(pdf_reader.pages)
+ 
+             return {
+                 'text': text.strip(),
+                 'metadata': {
+                     'title': metadata.get('/Title', ''),
+                     'author': metadata.get('/Author', ''),
+                     'subject': metadata.get('/Subject', ''),
+                     'creator': metadata.get('/Creator', ''),
+                 },
+                 'filename': pdf_path.name,
+                 'filepath': str(pdf_path),
+                 'page_count': page_count
+             }
+ 
+         except Exception as e:
+             logger.error(f"Error processing {pdf_path.name}: {str(e)}")
+             return None
+ 
+     def process_all_pdfs(self) -> List[Dict[str, Any]]:
+         """
+         Process all PDF files.
+ 
+         Returns:
+             List of dictionaries containing extracted text and metadata
+         """
+         pdf_files = self.get_pdf_files()
+         self.processed_docs = []
+ 
+         logger.info(f"Processing {len(pdf_files)} PDF files...")
+ 
+         for pdf_path in tqdm(pdf_files, desc="Processing PDFs"):
+             doc_data = self.extract_text_from_pdf(pdf_path)
+             if doc_data and doc_data['text']:  # keep only documents with extractable text
+                 self.processed_docs.append(doc_data)
+             else:
+                 logger.warning(f"No text extracted from {pdf_path.name}")
+ 
+         logger.info(f"Successfully processed {len(self.processed_docs)} PDFs")
+         return self.processed_docs
+ 
+     def get_statistics(self) -> Dict[str, Any]:
+         """Summary statistics for the processed documents."""
+         if not self.processed_docs:
+             return {}
+ 
+         total_pages = sum(doc['page_count'] for doc in self.processed_docs)
+         total_chars = sum(len(doc['text']) for doc in self.processed_docs)
+ 
+         return {
+             'total_documents': len(self.processed_docs),
+             'total_pages': total_pages,
+             'total_characters': total_chars,
+             'avg_pages_per_doc': total_pages / len(self.processed_docs),
+             'avg_chars_per_doc': total_chars / len(self.processed_docs),
+         }
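A minimal sketch of running the processor standalone (the directory path is a placeholder; any folder containing *.pdf files works):

from services.pdf_processor import PDFProcessor

processor = PDFProcessor("./data")   # placeholder: directory containing PDFs
docs = processor.process_all_pdfs()  # logs and skips files with no extractable text
if docs:
    print(docs[0]['filename'], "-", docs[0]['page_count'], "pages")
    print(processor.get_statistics())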
services/vector_store.py ADDED
@@ -0,0 +1,178 @@
+ """
+ Vector database integration (ChromaDB).
+ """
+ from pathlib import Path
+ from typing import Any, Dict, List, Optional
+ import chromadb
+ from loguru import logger
+ 
+ 
+ class VectorStore:
+     """Vector store backed by ChromaDB."""
+ 
+     def __init__(
+         self,
+         persist_directory: str = "./data/chroma_db",
+         collection_name: str = "financial_papers"
+     ):
+         """
+         Args:
+             persist_directory: Path where ChromaDB persists its data
+             collection_name: Collection name
+         """
+         self.persist_directory = Path(persist_directory)
+         self.collection_name = collection_name
+ 
+         # Create the directory if needed
+         self.persist_directory.mkdir(parents=True, exist_ok=True)
+ 
+         # Initialize the ChromaDB client
+         logger.info(f"Initializing ChromaDB at {persist_directory}")
+         self.client = chromadb.PersistentClient(
+             path=str(self.persist_directory)
+         )
+ 
+         # Create or fetch the collection
+         self.collection = self.client.get_or_create_collection(
+             name=collection_name,
+             metadata={"description": "Financial and Economics research papers"}
+         )
+ 
+         logger.info(f"Collection '{collection_name}' ready. Current count: {self.collection.count()}")
+ 
+     def add_documents(
+         self,
+         chunks: List[Dict[str, Any]],
+         embeddings: List[List[float]]
+     ) -> None:
+         """
+         Add document chunks to the vector DB.
+ 
+         Args:
+             chunks: Chunk data (including text and metadata)
+             embeddings: Embedding vector for each chunk
+         """
+         if len(chunks) != len(embeddings):
+             raise ValueError("Number of chunks and embeddings must match")
+ 
+         logger.info(f"Adding {len(chunks)} documents to vector store...")
+ 
+         # Convert to the shape ChromaDB expects
+         ids = [f"{chunk['source_filename']}_{chunk['chunk_id']}" for chunk in chunks]
+         documents = [chunk['text'] for chunk in chunks]
+         metadatas = [
+             {
+                 'source_filename': chunk['source_filename'],
+                 'source_filepath': chunk['source_filepath'],
+                 'chunk_id': str(chunk['chunk_id']),
+                 'total_chunks': str(chunk['total_chunks']),
+                 'title': chunk['metadata'].get('title', ''),
+                 'author': chunk['metadata'].get('author', ''),
+                 'page_count': str(chunk['page_count'])
+             }
+             for chunk in chunks
+         ]
+ 
+         # Add in batches (ChromaDB handles large inserts, but batching keeps memory bounded)
+         batch_size = 100
+         for i in range(0, len(chunks), batch_size):
+             batch_end = min(i + batch_size, len(chunks))
+             self.collection.add(
+                 ids=ids[i:batch_end],
+                 embeddings=embeddings[i:batch_end],
+                 documents=documents[i:batch_end],
+                 metadatas=metadatas[i:batch_end]
+             )
+             logger.info(f"Added batch {i // batch_size + 1}/{(len(chunks) + batch_size - 1) // batch_size}")
+ 
+         logger.info(f"Successfully added {len(chunks)} documents. Total in collection: {self.collection.count()}")
+ 
+     def search(
+         self,
+         query_embedding: List[float],
+         top_k: int = 5,
+         filter_metadata: Optional[Dict[str, str]] = None
+     ) -> Dict[str, Any]:
+         """
+         Run a vector search.
+ 
+         Args:
+             query_embedding: Embedding vector of the query
+             top_k: Number of results to return
+             filter_metadata: Optional metadata filter
+ 
+         Returns:
+             Search results (documents, metadatas, distances)
+         """
+         results = self.collection.query(
+             query_embeddings=[query_embedding],
+             n_results=top_k,
+             where=filter_metadata
+         )
+ 
+         return {
+             'documents': results['documents'][0] if results['documents'] else [],
+             'metadatas': results['metadatas'][0] if results['metadatas'] else [],
+             'distances': results['distances'][0] if results['distances'] else [],
+             'ids': results['ids'][0] if results['ids'] else []
+         }
+ 
+     def search_by_text(
+         self,
+         query_text: str,
+         top_k: int = 5,
+         filter_metadata: Optional[Dict[str, str]] = None
+     ) -> Dict[str, Any]:
+         """
+         Search by raw text (ChromaDB embeds the query itself).
+ 
+         Note: this uses the collection's default embedding function
+         (all-MiniLM-L6-v2), so results are only meaningful when the
+         index was built with the same model.
+ 
+         Args:
+             query_text: Query text
+             top_k: Number of results to return
+             filter_metadata: Optional metadata filter
+ 
+         Returns:
+             Search results
+         """
+         results = self.collection.query(
+             query_texts=[query_text],
+             n_results=top_k,
+             where=filter_metadata
+         )
+ 
+         return {
+             'documents': results['documents'][0] if results['documents'] else [],
+             'metadatas': results['metadatas'][0] if results['metadatas'] else [],
+             'distances': results['distances'][0] if results['distances'] else [],
+             'ids': results['ids'][0] if results['ids'] else []
+         }
+ 
+     def get_collection_stats(self) -> Dict[str, Any]:
+         """Collection statistics."""
+         count = self.collection.count()
+ 
+         return {
+             'collection_name': self.collection_name,
+             'total_documents': count,
+             'persist_directory': str(self.persist_directory),
+             'has_data': count > 0
+         }
+ 
+     def delete_collection(self) -> None:
+         """Delete the collection (warning: removes all data)."""
+         logger.warning(f"Deleting collection '{self.collection_name}'")
+         self.client.delete_collection(name=self.collection_name)
+         logger.info("Collection deleted")
+ 
+     def reset_collection(self) -> None:
+         """Reset the collection (delete, then recreate)."""
+         self.delete_collection()
+         self.collection = self.client.get_or_create_collection(
+             name=self.collection_name,
+             metadata={"description": "Financial and Economics research papers"}
+         )
+         logger.info("Collection reset")
utils/__init__.py ADDED
File without changes
utils/config.py ADDED
@@ -0,0 +1,39 @@
+ """
+ Configuration management using pydantic-settings
+ """
+ from pydantic_settings import BaseSettings
+ from typing import Optional
+ 
+ 
+ class Settings(BaseSettings):
+     """Application settings loaded from environment variables"""
+ 
+     # API Keys
+     anthropic_api_key: str
+     openai_api_key: Optional[str] = None
+     cohere_api_key: Optional[str] = None
+ 
+     # Vector Database Settings
+     chroma_persist_directory: str = "./data/chroma_db"
+     collection_name: str = "financial_papers"
+ 
+     # PDF Processing Settings
+     pdf_source_path: str
+     chunk_size: int = 1000
+     chunk_overlap: int = 200
+ 
+     # Embedding Model Settings
+     embedding_model: str = "sentence-transformers"  # options: openai, sentence-transformers, cohere
+     embedding_model_name: str = "all-MiniLM-L6-v2"
+ 
+     # API Settings
+     api_host: str = "0.0.0.0"
+     api_port: int = 8000
+ 
+     class Config:
+         env_file = ".env"
+         case_sensitive = False
+ 
+ 
+ # Global settings instance
+ settings = Settings()
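Because settings is instantiated at import time, the two fields above without defaults (anthropic_api_key and pdf_source_path) must be present in .env or the environment, or importing utils.config raises a validation error. A minimal sketch (placeholder values):

# .env (placeholder values; see .env.example for the full list)
#   ANTHROPIC_API_KEY=sk-ant-...
#   PDF_SOURCE_PATH=./data
from utils.config import settings

print(settings.collection_name)                     # "financial_papers" by default
print(settings.chunk_size, settings.chunk_overlap)  # 1000 / 200 by default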