Claude committed on
Commit f6b05db · unverified · 0 Parent(s):

Add complete Financial RAG system with Metacognitive Agent


Implemented a comprehensive RAG (Retrieval-Augmented Generation) system for financial/economics research papers with the following features:

Core Components:
- PDF processing and text extraction (PyPDF2, pdfplumber)
- Text chunking with overlap for context preservation
- Vector embeddings (Sentence Transformers, OpenAI, Cohere support)
- ChromaDB vector store for efficient similarity search
- Metacognitive agent with 4-stage process (Planning → Monitoring → Evaluation → Revision)
- FastAPI REST API with comprehensive endpoints

Key Features:
- Supports 2,639+ financial/economics journal articles
- Hallucination detection and prevention
- Iterative answer refinement based on quality evaluation
- Flexible embedding model selection (free and paid options)
- Batch processing for efficient indexing
- Detailed logging and statistics

API Endpoints:
- POST /query: Question answering with RAG
- GET /health: System health check
- GET /stats: Vector store statistics
- Interactive API docs at /docs

Scripts:
- index_pdfs.py: Index PDF files into vector database
- check_vector_db.py: Verify vector database contents
- test_query.py: Test queries against the API

Documentation:
- Comprehensive README with quick start guide
- Detailed USAGE_GUIDE in Korean
- Environment configuration examples
- Troubleshooting section

The system enables researchers to query a large corpus of financial literature with high-quality, source-backed answers while minimizing hallucinations through metacognitive reflection.

.env.example ADDED
@@ -0,0 +1,26 @@
# Anthropic API Key
ANTHROPIC_API_KEY=your_anthropic_api_key_here

# OpenAI API Key (for embeddings - optional)
OPENAI_API_KEY=your_openai_api_key_here

# Cohere API Key (alternative for embeddings - optional)
COHERE_API_KEY=your_cohere_api_key_here

# Vector Database Settings
CHROMA_PERSIST_DIRECTORY=./data/chroma_db
COLLECTION_NAME=financial_papers

# PDF Processing Settings
PDF_SOURCE_PATH=/Users/seongjincho/Desktop/HYU-06-공학박사 도전기/25.8.15(펀더멘털 DB 레퍼런스)/data/
CHUNK_SIZE=1000
CHUNK_OVERLAP=200

# Embedding Model
# Options: "openai", "sentence-transformers", "cohere"
EMBEDDING_MODEL=sentence-transformers
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2

# API Settings
API_HOST=0.0.0.0
API_PORT=8000
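These variables are consumed through `utils/config.py`, which the application imports everywhere as `settings` but which is not part of this commit view. A minimal sketch of what that module might look like, assuming pydantic-settings (pinned in requirements.txt) and the `settings.*` attribute names used throughout `app/` and `scripts/`:

```python
# Hypothetical sketch of utils/config.py (not included in this commit).
# Field names mirror the settings.* attributes used by the app and scripts.
from typing import Optional
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    anthropic_api_key: str
    openai_api_key: Optional[str] = None
    cohere_api_key: Optional[str] = None
    chroma_persist_directory: str = "./data/chroma_db"
    collection_name: str = "financial_papers"
    pdf_source_path: str = ""
    chunk_size: int = 1000
    chunk_overlap: int = 200
    embedding_model: str = "sentence-transformers"
    embedding_model_name: str = "all-MiniLM-L6-v2"
    api_host: str = "0.0.0.0"
    api_port: int = 8000

# Reads .env at import time; environment variables match field names case-insensitively.
settings = Settings()
```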
.gitignore ADDED
@@ -0,0 +1,51 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual Environment
venv/
env/
ENV/

# Environment Variables
.env

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# Data
data/chroma_db/
data/*.pdf
*.pdf

# Logs
logs/
*.log

# macOS
.DS_Store

# Jupyter
.ipynb_checkpoints/
README.md ADDED
@@ -0,0 +1,276 @@
# 📚 Financial RAG with Metacognitive Agent

A RAG (Retrieval-Augmented Generation) system built on finance/economics papers, with a metacognitive agent

## 🎯 Key Features

- ✅ **2,639 finance/economics journal papers** indexed as vectors
- 🧠 **Metacognitive agent** (Planning → Monitoring → Evaluation → Revision)
- 🔍 **High-performance vector search** (ChromaDB)
- 🚫 **Hallucination detection and prevention**
- 🚀 **FastAPI-based REST API**
- 📊 **Selectable embedding models** (Sentence Transformers, OpenAI, Cohere)

## 📁 Project Structure

```
Hallucination_and_Deception_for_financial_RAG/
├── app/
│   ├── main.py                  # FastAPI main app
│   ├── metacognitive_agent.py   # Metacognitive agent
│   ├── rag_pipeline.py          # RAG pipeline
│   └── api/
│       ├── routes.py            # API endpoints
│       └── models.py            # Pydantic models
├── services/
│   ├── pdf_processor.py         # PDF processing
│   ├── chunker.py               # Text chunking
│   ├── embedder.py              # Embedding generation
│   └── vector_store.py          # Vector DB
├── utils/
│   └── config.py                # Configuration management
├── scripts/
│   └── index_pdfs.py            # PDF indexing script
├── data/
│   └── chroma_db/               # Vector DB storage
├── requirements.txt
├── .env.example
└── README.md
```

## 🚀 Quick Start

### 1️⃣ Environment Setup

```bash
# Clone the repository
git clone https://github.com/yourusername/Hallucination_and_Deception_for_financial_RAG.git
cd Hallucination_and_Deception_for_financial_RAG

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 2️⃣ Environment Variables

```bash
# Create the .env file
cp .env.example .env

# Edit the .env file (required!)
nano .env
```

Example `.env` file:
```env
# Anthropic API Key (required)
ANTHROPIC_API_KEY=your_api_key_here

# PDF path (local MacBook path)
PDF_SOURCE_PATH=/Users/seongjincho/Desktop/HYU-06-공학박사 도전기/25.8.15(펀더멘털 DB 레퍼런스)/data/

# Embedding model (free: sentence-transformers)
EMBEDDING_MODEL=sentence-transformers
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2
```

### 3️⃣ Index the PDFs (run on the local MacBook)

```bash
# Index the PDF files into the vector DB
python scripts/index_pdfs.py
```

This step performs the following:
1. Read the 2,639 PDF files
2. Extract and chunk the text
3. Generate embeddings (about 30-60 minutes with the free model)
4. Store them in ChromaDB

**Notes:**
- The first run downloads the Sentence Transformer model (~90MB)
- After indexing, a `data/chroma_db/` folder is created
- Only this folder needs to be uploaded to GitHub (the source PDFs are excluded)

### 4️⃣ Run the API Server

```bash
# Start the FastAPI server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

Once the server starts:
- API Docs: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc

## 📖 API Usage

### Health Check

```bash
curl http://localhost:8000/health
```

### Ask a Question (metacognition enabled)

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the main causes of financial crises?",
    "top_k": 5,
    "enable_metacognition": true
  }'
```

### From Python

```python
import requests

response = requests.post(
    "http://localhost:8000/query",
    json={
        "question": "What are the effects of portfolio diversification?",
        "top_k": 5,
        "enable_metacognition": True
    }
)

result = response.json()
print(f"Answer: {result['answer']}")
print(f"Sources: {len(result['sources'])} documents")
print(f"Iterations: {result['metacognition']['iterations']}")
```

## 🧠 What Is the Metacognitive Agent?

The metacognitive agent produces high-quality answers through the following four stages (condensed into the code sketch after this list):

### 1️⃣ Planning
- Analyze the question
- Formulate an answering strategy
- Identify the information needed

### 2️⃣ Monitoring
- Monitor the answer-generation process
- Detect hallucinations
- Verify logical soundness

### 3️⃣ Evaluation
- Assess completeness, accuracy, clarity, and reliability
- Assign scores from 1 to 10
- Identify the parts that need improvement

### 4️⃣ Revision
- Improve the answer based on the feedback
- At most 2 iterations
- Stop once the score reaches 8 or higher

+ ## ๐Ÿ”ง ๊ณ ๊ธ‰ ์„ค์ •
175
+
176
+ ### ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ ๋ณ€๊ฒฝ
177
+
178
+ #### OpenAI ์‚ฌ์šฉ (์œ ๋ฃŒ, ๊ณ ํ’ˆ์งˆ)
179
+ ```env
180
+ EMBEDDING_MODEL=openai
181
+ EMBEDDING_MODEL_NAME=text-embedding-ada-002
182
+ OPENAI_API_KEY=your_openai_key
183
+ ```
184
+
185
+ #### Cohere ์‚ฌ์šฉ (๋ฌด๋ฃŒ ํ‹ฐ์–ด ์žˆ์Œ)
186
+ ```env
187
+ EMBEDDING_MODEL=cohere
188
+ EMBEDDING_MODEL_NAME=embed-multilingual-v3.0
189
+ COHERE_API_KEY=your_cohere_key
190
+ ```
191
+
192
+ ### ์ฒญํ‚น ํŒŒ๋ผ๋ฏธํ„ฐ ์กฐ์ •
193
+
194
+ ```env
195
+ CHUNK_SIZE=1000 # ์ฒญํฌ ํฌ๊ธฐ (๊ธฐ๋ณธ: 1000์ž)
196
+ CHUNK_OVERLAP=200 # ์ฒญํฌ ๊ฒน์นจ (๊ธฐ๋ณธ: 200์ž)
197
+ ```
198
+
199
+ ## ๐Ÿ“Š ์„ฑ๋Šฅ ์ตœ์ ํ™”
200
+
201
+ ### ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰
202
+ - **์ธ๋ฑ์‹ฑ ์‹œ**: ~2-4GB (2,639๊ฐœ PDF ๊ธฐ์ค€)
203
+ - **API ์‹คํ–‰ ์‹œ**: ~1-2GB
204
+
205
+ ### ์‘๋‹ต ์‹œ๊ฐ„
206
+ - **๋ฉ”ํƒ€์ธ์ง€ ๋น„ํ™œ์„ฑํ™”**: ~2-5์ดˆ
207
+ - **๋ฉ”ํƒ€์ธ์ง€ ํ™œ์„ฑํ™”**: ~10-30์ดˆ (ํ’ˆ์งˆ โฌ†๏ธ)
208
+
209
+ ### ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ
210
+ ์ž„๋ฒ ๋”ฉ ์ƒ์„ฑ ์‹œ ๋ฐฐ์น˜ ํฌ๊ธฐ ์กฐ์ •:
211
+ ```python
212
+ # scripts/index_pdfs.py ์ˆ˜์ •
213
+ embeddings = embedder.embed_batch(texts, batch_size=64) # ๊ธฐ๋ณธ: 32
214
+ ```
215
+
216
+ ## ๐Ÿ› ๋ฌธ์ œ ํ•ด๊ฒฐ
217
+
218
+ ### PDF ๊ฒฝ๋กœ ์˜ค๋ฅ˜
219
+ ```
220
+ FileNotFoundError: Directory not found
221
+ ```
222
+ โ†’ `.env` ํŒŒ์ผ์˜ `PDF_SOURCE_PATH` ํ™•์ธ
223
+
224
+ ### API ํ‚ค ์˜ค๋ฅ˜
225
+ ```
226
+ AuthenticationError: Invalid API key
227
+ ```
228
+ โ†’ `.env` ํŒŒ์ผ์˜ `ANTHROPIC_API_KEY` ํ™•์ธ
229
+
230
+ ### Vector DB๊ฐ€ ๋น„์–ด์žˆ์Œ
231
+ ```
232
+ total_documents: 0
233
+ ```
234
+ โ†’ `python scripts/index_pdfs.py` ๋จผ์ € ์‹คํ–‰
235
+
236
+ ### ๋ฉ”๋ชจ๋ฆฌ ๋ถ€์กฑ
237
+ โ†’ ๋ฐฐ์น˜ ํฌ๊ธฐ ์ค„์ด๊ธฐ: `batch_size=16`
238
+
239
+ ## ๐Ÿ“ˆ ๋‹ค์Œ ๋‹จ๊ณ„
240
+
241
+ 1. **๋ฒกํ„ฐ DB ์—…๋กœ๋“œ**
242
+ ```bash
243
+ git add data/chroma_db/
244
+ git commit -m "Add vector database"
245
+ git push
246
+ ```
247
+
248
+ 2. **ํด๋ผ์šฐ๋“œ ๋ฐฐํฌ** (์„ ํƒ์‚ฌํ•ญ)
249
+ - AWS EC2 / GCP / Azure
250
+ - Docker ์ปจํ…Œ์ด๋„ˆํ™”
251
+ - API ํ‚ค ๊ด€๋ฆฌ (AWS Secrets Manager ๋“ฑ)
252
+
253
+ 3. **ํ”„๋ก ํŠธ์—”๋“œ ๊ตฌ์ถ•** (์„ ํƒ์‚ฌํ•ญ)
254
+ - Streamlit / Gradio
255
+ - React / Vue.js
256
+
257
+ ## ๐Ÿค ๊ธฐ์—ฌ
258
+
259
+ ์ด์Šˆ ๋ฐ PR ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค!
260
+
261
+ ## ๐Ÿ“„ ๋ผ์ด์„ ์Šค
262
+
263
+ MIT License
264
+
265
+ ## ๐Ÿ‘จโ€๐ŸŽ“ ์ž‘์„ฑ์ž
266
+
267
+ ์กฐ์„ฑ์ง„ (Seongjin Cho)
268
+ - ํ•œ์–‘๋Œ€ํ•™๊ต ๊ณตํ•™๋ฐ•์‚ฌ ๊ณผ์ •
269
+ - ๊ธˆ์œต/๊ฒฝ์ œ ์—ฐ๊ตฌ
270
+
271
+ ---
272
+
273
+ **โš ๏ธ ์ค‘์š” ์•Œ๋ฆผ:**
274
+ - API ํ‚ค๋ฅผ ์ ˆ๋Œ€ GitHub์— ์ปค๋ฐ‹ํ•˜์ง€ ๋งˆ์„ธ์š”!
275
+ - `.env` ํŒŒ์ผ์€ `.gitignore`์— ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค
276
+ - PDF ์›๋ณธ์€ ์šฉ๋Ÿ‰์ด ํฌ๋ฏ€๋กœ ๋ฒกํ„ฐ DB๋งŒ ์—…๋กœ๋“œํ•˜์„ธ์š”
USAGE_GUIDE.md ADDED
@@ -0,0 +1,366 @@
# 📖 Usage Guide

## Table of Contents
1. [Installation](#1-installation)
2. [PDF Indexing](#2-pdf-indexing)
3. [Running the API Server](#3-running-the-api-server)
4. [Asking Questions](#4-asking-questions)
5. [Advanced Usage](#5-advanced-usage)
6. [Troubleshooting](#6-troubleshooting)

---

## 1. Installation

### 1-1. Check your Python environment
```bash
python --version  # 3.8 or higher required
```

### 1-2. Create a virtual environment
```bash
# Create the virtual environment
python -m venv venv

# Activate (macOS/Linux)
source venv/bin/activate

# Activate (Windows)
venv\Scripts\activate
```

### 1-3. Install dependencies
```bash
pip install -r requirements.txt
```

**Key packages installed:**
- FastAPI (web server)
- Anthropic (Claude API)
- ChromaDB (vector database)
- Sentence Transformers (embeddings)
- PyPDF2, pdfplumber (PDF processing)

---

## 2. PDF Indexing

### 2-1. Configure environment variables

Create the `.env` file:
```bash
cp .env.example .env
```

Edit the `.env` file:
```env
# Required: Anthropic API key
ANTHROPIC_API_KEY=sk-ant-api03-xxx...

# Required: PDF file path (local MacBook)
PDF_SOURCE_PATH=/Users/seongjincho/Desktop/HYU-06-공학박사 도전기/25.8.15(펀더멘털 DB 레퍼런스)/data/

# Optional: embedding model (defaults recommended)
EMBEDDING_MODEL=sentence-transformers
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2
```

### 2-2. Run the indexing

```bash
python scripts/index_pdfs.py
```

**Expected duration:**
- 2,639 PDF files
- About 30-60 minutes (with the free model)
- Plus a model download on the first run (~90MB)

**Progress output:**
```
[1/4] Processing PDF files...
  - Total documents: 2639
  - Total pages: 50000+

[2/4] Chunking text...
  - Total chunks: 30000+

[3/4] Generating embeddings...
  - Embedding count: 30000+
  - Embedding dimension: 384

[4/4] Saving to the vector DB...
  - Save complete!
```

### 2-3. Verify the indexing

```bash
python scripts/check_vector_db.py
```

---

## 3. Running the API Server

### 3-1. Start the server

```bash
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

**Or simply:**
```bash
python app/main.py
```

### 3-2. Check the server

In a browser:
- API docs: http://localhost:8000/docs
- Alternative docs: http://localhost:8000/redoc

From a terminal:
```bash
curl http://localhost:8000/health
```

---

## 4. Asking Questions

### 4-1. Via the web UI (easiest)

1. Open http://localhost:8000/docs
2. Click `POST /query`
3. Click "Try it out"
4. Enter the request body:
```json
{
  "question": "What are the main causes of financial crises?",
  "top_k": 5,
  "enable_metacognition": true
}
```
5. Click "Execute"

### 4-2. From a terminal

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the effects of portfolio diversification?",
    "top_k": 5,
    "enable_metacognition": true
  }'
```

### 4-3. From a Python script

```bash
python scripts/test_query.py
```

Or in Python code:
```python
import requests

response = requests.post(
    "http://localhost:8000/query",
    json={
        "question": "What are the effects of central bank interest rate policy?",
        "top_k": 5,
        "enable_metacognition": True
    }
)

result = response.json()
print(result["answer"])
```

---

## 5. Advanced Usage

### 5-1. Disabling metacognition (faster responses)

```json
{
  "question": "your question",
  "top_k": 5,
  "enable_metacognition": false
}
```

**Difference:**
- Metacognition enabled: 10-30 seconds, higher quality
- Metacognition disabled: 2-5 seconds, standard quality

### 5-2. Adjusting the number of search results

```json
{
  "question": "your question",
  "top_k": 10,
  "enable_metacognition": true
}
```

A larger `top_k` retrieves more documents for the answer.

### 5-3. Metadata filtering

Search only papers by a specific author:
```json
{
  "question": "your question",
  "top_k": 5,
  "filter_metadata": {
    "author": "John Doe"
  }
}
```

### 5-4. Understanding the response structure

```json
{
  "question": "the original question",
  "answer": "the generated answer",
  "sources": [
    {
      "text": "document content...",
      "source_filename": "paper123.pdf",
      "similarity": 0.89,
      "metadata": {
        "title": "paper title",
        "author": "author"
      }
    }
  ],
  "metacognition": {
    "thinking_history": [...],
    "iterations": 2
  },
  "search_stats": {
    "documents_found": 5,
    "top_similarity": 0.89
  }
}
```

---

## 6. Troubleshooting

### Problem 1: PDF path error
```
FileNotFoundError: Directory not found
```

**Fix:**
```bash
# Check the PDF path in the .env file
nano .env

# Confirm the path is correct
ls "/Users/seongjincho/Desktop/..."
```

### Problem 2: API key error
```
AuthenticationError: Invalid API key
```

**Fix:**
```bash
# Check the .env file
nano .env

# Confirm ANTHROPIC_API_KEY is valid
# The key should start with sk-ant-api03-
```

### Problem 3: Vector DB is empty
```
total_documents: 0
```

**Fix:**
```bash
# Run the indexing first
python scripts/index_pdfs.py

# Verify
python scripts/check_vector_db.py
```

### Problem 4: Out of memory
```
MemoryError
```

**Fix:**
```python
# Edit scripts/index_pdfs.py
# Reduce the batch size
embeddings = embedder.embed_batch(texts, batch_size=16)  # 32 → 16
```

### Problem 5: Server won't start
```
Address already in use
```

**Fix:**
```bash
# Change the port
uvicorn app.main:app --reload --port 8001

# Or in the .env file
API_PORT=8001
```

### Problem 6: Embedding model download fails
```
ConnectionError
```

**Fix:**
```bash
# Download the model manually
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
```

---

## 💡 Tips

### Performance
1. **Use an SSD**: storing the vector DB on an SSD is recommended
2. **Memory**: at least 8GB RAM recommended
3. **Batch size**: without a GPU, use batch_size=16-32

### Cost savings
1. **Free embeddings**: use Sentence Transformers
2. **Disable metacognition**: for quick tests

### Answer quality
1. **Enable metacognition**: when high-quality answers are needed
2. **Increase top_k**: consult more documents
3. **Tune the chunk size**: when longer context is needed

---

## 📞 Support

If your problem isn't resolved:
1. Open a GitHub issue
2. Attach the log file
3. Copy the full error message

**Checking logs:**
```bash
# API server logs print to the terminal
# Save them to a file if needed
uvicorn app.main:app > server.log 2>&1
```
app/__init__.py ADDED
File without changes
app/api/__init__.py ADDED
File without changes
app/api/models.py ADDED
@@ -0,0 +1,85 @@
"""
API request/response model definitions (Pydantic)
"""

from pydantic import BaseModel, Field
from typing import List, Dict, Optional, Any


class QueryRequest(BaseModel):
    """Question request model"""
    question: str = Field(..., description="User question")
    top_k: int = Field(default=5, ge=1, le=20, description="Number of documents to retrieve")
    enable_metacognition: bool = Field(default=True, description="Whether to enable the metacognitive process")
    filter_metadata: Optional[Dict[str, str]] = Field(default=None, description="Metadata filter")

    class Config:
        json_schema_extra = {
            "example": {
                "question": "What are the main causes of financial crises?",
                "top_k": 5,
                "enable_metacognition": True
            }
        }


class SourceDocument(BaseModel):
    """Source document model"""
    text: str = Field(..., description="Document text")
    source_filename: str = Field(..., description="Source file name")
    similarity: float = Field(..., description="Similarity score")
    metadata: Dict[str, Any] = Field(default_factory=dict, description="Document metadata")


class MetaCognitionInfo(BaseModel):
    """Metacognition info model"""
    thinking_history: List[Dict[str, Any]] = Field(..., description="History of the thinking process")
    iterations: int = Field(..., description="Number of revision iterations")


class SearchStats(BaseModel):
    """Search statistics model"""
    documents_found: int = Field(..., description="Number of documents found")
    top_similarity: float = Field(..., description="Highest similarity score")


class QueryResponse(BaseModel):
    """Question response model"""
    question: str = Field(..., description="Original question")
    answer: str = Field(..., description="Generated answer")
    sources: List[SourceDocument] = Field(..., description="Source documents consulted")
    metacognition: Optional[MetaCognitionInfo] = Field(default=None, description="Metacognition info")
    search_stats: SearchStats = Field(..., description="Search statistics")

    class Config:
        json_schema_extra = {
            "example": {
                "question": "What are the main causes of financial crises?",
                "answer": "The main causes of the 2008 financial crisis were...",
                "sources": [
                    {
                        "text": "paper content...",
                        "source_filename": "financial_crisis_2008.pdf",
                        "similarity": 0.89,
                        "metadata": {"author": "John Doe"}
                    }
                ],
                "search_stats": {
                    "documents_found": 5,
                    "top_similarity": 0.89
                }
            }
        }


class HealthResponse(BaseModel):
    """Health check response"""
    status: str = Field(..., description="Server status")
    vector_store: Dict[str, Any] = Field(..., description="Vector store info")
    embedding_model: Dict[str, Any] = Field(..., description="Embedding model info")


class ErrorResponse(BaseModel):
    """Error response"""
    error: str = Field(..., description="Error message")
    detail: Optional[str] = Field(default=None, description="Details")
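As a quick illustration of the validation these models give the API (field names from this file; behavior per Pydantic v2, as pinned in requirements.txt):

```python
# Sketch: exercising QueryRequest's declared constraints (Pydantic v2).
from pydantic import ValidationError
from app.api.models import QueryRequest

req = QueryRequest(question="What drives systemic risk?")
print(req.top_k, req.enable_metacognition)  # 5 True (the declared defaults)

try:
    QueryRequest(question="?", top_k=50)  # rejected: top_k is bounded by le=20
except ValidationError as err:
    print("rejected:", err.error_count(), "error(s)")
```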
app/api/routes.py ADDED
@@ -0,0 +1,151 @@
"""
FastAPI route definitions
"""

from fastapi import APIRouter, HTTPException, status
from loguru import logger

from app.api.models import (
    QueryRequest,
    QueryResponse,
    HealthResponse,
    ErrorResponse
)

# Create the router
router = APIRouter()

# The RAG pipeline is injected from main.py
rag_pipeline = None


def set_rag_pipeline(pipeline):
    """Set the RAG pipeline"""
    global rag_pipeline
    rag_pipeline = pipeline


@router.get("/", tags=["Root"])
async def root():
    """API root endpoint"""
    return {
        "message": "Financial RAG API with Metacognitive Agent",
        "version": "1.0.0",
        "endpoints": {
            "health": "/health",
            "query": "/query",
            "docs": "/docs"
        }
    }


@router.get(
    "/health",
    response_model=HealthResponse,
    tags=["Health"],
    summary="Health check"
)
async def health_check():
    """
    Check system status.

    Returns:
        System statistics and status information
    """
    try:
        if not rag_pipeline:
            raise HTTPException(
                status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
                detail="RAG pipeline not initialized"
            )

        stats = rag_pipeline.get_statistics()

        return HealthResponse(
            status="healthy",
            vector_store=stats["vector_store"],
            embedding_model=stats["embedding_model"]
        )

    except HTTPException:
        raise  # propagate deliberate HTTP errors (e.g., the 503 above) unchanged
    except Exception as e:
        logger.error(f"Health check failed: {str(e)}")
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=str(e)
        )


@router.post(
    "/query",
    response_model=QueryResponse,
    tags=["Query"],
    summary="Ask a question",
    description="Generate an answer to a finance/economics question with the RAG system"
)
async def query(request: QueryRequest):
    """
    Generate an answer to a question.

    Args:
        request: question request (question, top_k, enable_metacognition, etc.)

    Returns:
        The answer, source documents, metacognition info, etc.
    """
    try:
        if not rag_pipeline:
            raise HTTPException(
                status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
                detail="RAG pipeline not initialized"
            )

        logger.info(f"Received query: {request.question}")

        # Process the question through the RAG pipeline
        result = await rag_pipeline.query(
            question=request.question,
            top_k=request.top_k,
            enable_metacognition=request.enable_metacognition,
            filter_metadata=request.filter_metadata
        )

        logger.info("Query processed successfully")

        return QueryResponse(**result)

    except HTTPException:
        raise  # propagate deliberate HTTP errors unchanged
    except Exception as e:
        logger.error(f"Query failed: {str(e)}")
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Query processing failed: {str(e)}"
        )


@router.get(
    "/stats",
    tags=["Stats"],
    summary="Statistics"
)
async def get_stats():
    """
    RAG system statistics.

    Returns:
        Statistics on the vector store, embedding model, etc.
    """
    try:
        if not rag_pipeline:
            raise HTTPException(
                status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
                detail="RAG pipeline not initialized"
            )

        stats = rag_pipeline.get_statistics()
        return stats

    except HTTPException:
        raise  # propagate deliberate HTTP errors unchanged
    except Exception as e:
        logger.error(f"Stats retrieval failed: {str(e)}")
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=str(e)
        )
app/main.py ADDED
@@ -0,0 +1,121 @@
"""
FastAPI main application

Run with:
    uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
"""

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from loguru import logger
import sys

from app.api import routes
from app.metacognitive_agent import MetaCognitiveAgent
from app.rag_pipeline import RAGPipeline
from services.vector_store import VectorStore
from services.embedder import Embedder
from utils.config import settings

# Logging setup
logger.remove()
logger.add(
    sys.stdout,
    format="<green>{time:YYYY-MM-DD HH:mm:ss}</green> | <level>{level: <8}</level> | <cyan>{name}</cyan>:<cyan>{function}</cyan> - <level>{message}</level>",
    level="INFO"
)

# Create the FastAPI app
app = FastAPI(
    title="Financial RAG API",
    description="RAG system over finance/economics papers with a metacognitive agent",
    version="1.0.0",
    docs_url="/docs",
    redoc_url="/redoc"
)

# CORS setup
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # restrict to specific domains in production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


@app.on_event("startup")
async def startup_event():
    """Initialize on server startup"""
    logger.info("=" * 80)
    logger.info("Starting Financial RAG API...")
    logger.info("=" * 80)

    try:
        # 1. Initialize the vector store
        logger.info("1️⃣ Initializing the vector store...")
        vector_store = VectorStore(
            persist_directory=settings.chroma_persist_directory,
            collection_name=settings.collection_name
        )
        logger.info(f"✅ Vector store ready ({vector_store.collection.count()} documents)")

        # 2. Initialize the embedder
        logger.info("2️⃣ Initializing the embedder...")
        embedder = Embedder(
            model_type=settings.embedding_model,
            model_name=settings.embedding_model_name,
            openai_api_key=settings.openai_api_key,
            cohere_api_key=settings.cohere_api_key
        )
        logger.info(f"✅ Embedder ready ({embedder.get_embedding_dimension()} dimensions)")

        # 3. Initialize the metacognitive agent
        logger.info("3️⃣ Initializing the metacognitive agent...")
        agent = MetaCognitiveAgent(api_key=settings.anthropic_api_key)
        logger.info(f"✅ Agent ready ({agent.model})")

        # 4. Create the RAG pipeline
        logger.info("4️⃣ Creating the RAG pipeline...")
        rag_pipeline = RAGPipeline(
            vector_store=vector_store,
            embedder=embedder,
            metacognitive_agent=agent
        )
        logger.info("✅ RAG pipeline created")

        # Hand the pipeline to the router
        routes.set_rag_pipeline(rag_pipeline)

        logger.info("=" * 80)
        logger.info("✨ API server ready!")
        logger.info(f"📚 Vector DB: {vector_store.collection.count()} documents")
        logger.info(f"🤖 Model: {agent.model}")
        logger.info(f"🔗 API Docs: http://{settings.api_host}:{settings.api_port}/docs")
        logger.info("=" * 80)

    except Exception as e:
        logger.error(f"❌ Initialization failed: {str(e)}")
        raise


@app.on_event("shutdown")
async def shutdown_event():
    """Clean up on server shutdown"""
    logger.info("Shutting down the API server...")


# Register the router
app.include_router(routes.router)


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(
        "app.main:app",
        host=settings.api_host,
        port=settings.api_port,
        reload=True,
        log_level="info"
    )
app/metacognitive_agent.py ADDED
@@ -0,0 +1,289 @@
"""
Metacognitive Agent

This agent uses the following metacognitive strategies:
1. Planning: formulate an answering strategy
2. Monitoring: watch the answering process
3. Evaluation: assess answer quality
4. Revision: improve the answer when needed
"""

from typing import List, Dict, Optional
from anthropic import Anthropic
from loguru import logger
import json


class MetaCognitiveAgent:
    """An AI agent with metacognitive abilities"""

    def __init__(self, api_key: str):
        """
        Args:
            api_key: Anthropic API key
        """
        self.client = Anthropic(api_key=api_key)
        self.thinking_history = []
        self.model = "claude-3-5-sonnet-20241022"

        # Metacognition prompts
        self.reflection_prompts = {
            "planning": """
You are an expert in finance/economics. Formulate a strategy for answering the following question.

Question: {query}

Retrieved documents:
{context}

Build an answer plan considering:
1. What key information does the question ask for?
2. Are the provided documents sufficient to answer it?
3. Which information should be used first?
4. What caveats or limitations apply?

Write the plan as JSON:
{{
    "key_information": "the key information the question asks for",
    "context_adequacy": "adequacy of the documents (sufficient/insufficient/unclear)",
    "strategy": "answering strategy",
    "limitations": "caveats and limitations"
}}
""",

            "monitoring": """
Review the answer currently being generated.

Question: {query}
Current answer: {response}

Check the following:
1. Does the answer address the question directly?
2. Does it use the provided documents accurately?
3. Is the reasoning logically sound?
4. Does it contain hallucinations (unsupported claims)?

Write the assessment as JSON:
{{
    "relevance": "relevance to the question (high/medium/low)",
    "accuracy": "accuracy (high/medium/low)",
    "logic": "logical soundness (sound/fair/problematic)",
    "hallucination_risk": "hallucination risk (low/medium/high)",
    "issues": ["problems found"]
}}
""",

            "evaluation": """
Evaluate the final answer.

Question: {query}
Answer: {response}
Sources used: {sources}

Evaluate against these criteria:
1. Completeness: does it fully answer the question?
2. Accuracy: is the information correct?
3. Clarity: is the answer clear and easy to understand?
4. Reliability: are the sources clear and trustworthy?

Write the evaluation as JSON:
{{
    "completeness": "completeness score (1-10)",
    "accuracy": "accuracy score (1-10)",
    "clarity": "clarity score (1-10)",
    "reliability": "reliability score (1-10)",
    "overall_score": "overall score (1-10)",
    "feedback": "what needs improvement"
}}
""",

            "revision": """
Improve the answer.

Original answer: {response}
Evaluation feedback: {feedback}

Improve the answer based on the feedback. In particular:
1. Correct inaccurate information
2. Fill in incomplete parts
3. Clarify unclear wording
4. Remove unsupported claims

Provide only the improved answer.
"""
        }

    async def think_and_reflect(
        self,
        query: str,
        context_documents: List[Dict],
        max_iterations: int = 2
    ) -> Dict:
        """
        Generate an answer through the metacognitive process.

        Args:
            query: user question
            context_documents: retrieved documents
            max_iterations: maximum number of revision iterations

        Returns:
            Final answer and metacognition details
        """
        self.thinking_history = []

        # Format the context
        context_text = self._format_context(context_documents)

        # Stage 1: Planning
        logger.info("1️⃣ Planning: formulating an answer strategy...")
        plan = await self._plan(query, context_text)
        self.thinking_history.append({"step": "planning", "content": plan})

        # Stage 2: Generate the initial answer
        logger.info("2️⃣ Generating: drafting the initial answer...")
        initial_response = await self._generate_response(query, context_text, plan)
        self.thinking_history.append({"step": "initial_response", "content": initial_response})

        # Stage 3: Monitoring
        logger.info("3️⃣ Monitoring: reviewing the answer...")
        monitoring_result = await self._monitor(query, initial_response)
        self.thinking_history.append({"step": "monitoring", "content": monitoring_result})

        current_response = initial_response

        # Stage 4: Iterative improvement
        for iteration in range(max_iterations):
            # Evaluation
            logger.info(f"4️⃣ Evaluation [{iteration + 1}/{max_iterations}]: scoring the answer...")
            evaluation = await self._evaluate(
                query,
                current_response,
                [doc.get('source_filename', 'unknown') for doc in context_documents]
            )
            self.thinking_history.append({"step": f"evaluation_{iteration}", "content": evaluation})

            # Stop once the evaluation score is high enough
            try:
                eval_data = json.loads(evaluation)
                overall_score = float(eval_data.get('overall_score', 0))

                if overall_score >= 8.0:
                    logger.info(f"✅ Quality threshold reached (score: {overall_score}/10)")
                    break
            except (json.JSONDecodeError, ValueError, TypeError):
                # The model may not return parseable JSON; revise anyway
                pass

            # Revision
            logger.info(f"5️⃣ Revision [{iteration + 1}/{max_iterations}]: improving the answer...")
            current_response = await self._revise(current_response, evaluation)
            self.thinking_history.append({"step": f"revision_{iteration}", "content": current_response})

        return {
            "query": query,
            "final_response": current_response,
            "thinking_history": self.thinking_history,
            "context_documents": context_documents,
            "iterations": len([h for h in self.thinking_history if "revision" in h["step"]])
        }

    async def _plan(self, query: str, context: str) -> str:
        """Formulate the plan"""
        prompt = self.reflection_prompts["planning"].format(
            query=query,
            context=context
        )

        message = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    async def _generate_response(self, query: str, context: str, plan: str) -> str:
        """Generate the initial answer"""
        prompt = f"""
You are an expert in finance/economics.

Answer plan:
{plan}

Question: {query}

Reference documents:
{context}

Answer the question following the plan above. You must:
1. Use only information from the provided documents
2. Do not guess when information is uncertain
3. Cite your sources clearly
4. Answer in Korean
"""

        message = self.client.messages.create(
            model=self.model,
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    async def _monitor(self, query: str, response: str) -> str:
        """Monitor the answer"""
        prompt = self.reflection_prompts["monitoring"].format(
            query=query,
            response=response
        )

        message = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    async def _evaluate(self, query: str, response: str, sources: List[str]) -> str:
        """Evaluate the answer"""
        prompt = self.reflection_prompts["evaluation"].format(
            query=query,
            response=response,
            sources=", ".join(sources)
        )

        message = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    async def _revise(self, response: str, feedback: str) -> str:
        """Revise the answer"""
        prompt = self.reflection_prompts["revision"].format(
            response=response,
            feedback=feedback
        )

        message = self.client.messages.create(
            model=self.model,
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    def _format_context(self, documents: List[Dict]) -> str:
        """Format the documents into context text"""
        formatted = []
        for i, doc in enumerate(documents, 1):
            text = doc.get('text', doc.get('document', ''))
            metadata = doc.get('metadata', {})
            source = metadata.get('source_filename', 'Unknown')

            formatted.append(f"[Document {i}] {source}\n{text}\n")

        return "\n".join(formatted)
app/rag_pipeline.py ADDED
@@ -0,0 +1,163 @@
"""
RAG (Retrieval-Augmented Generation) pipeline

A RAG system combining vector search with the metacognitive agent
"""

from typing import List, Dict, Optional
from loguru import logger

from services.vector_store import VectorStore
from services.embedder import Embedder
from app.metacognitive_agent import MetaCognitiveAgent
from utils.config import settings


class RAGPipeline:
    """RAG pipeline class"""

    def __init__(
        self,
        vector_store: VectorStore,
        embedder: Embedder,
        metacognitive_agent: MetaCognitiveAgent
    ):
        """
        Args:
            vector_store: vector store
            embedder: embedding generator
            metacognitive_agent: metacognitive agent
        """
        self.vector_store = vector_store
        self.embedder = embedder
        self.agent = metacognitive_agent

    async def query(
        self,
        question: str,
        top_k: int = 5,
        enable_metacognition: bool = True,
        filter_metadata: Optional[Dict[str, str]] = None
    ) -> Dict:
        """
        Generate an answer to a question.

        Args:
            question: user question
            top_k: number of documents to retrieve
            enable_metacognition: whether to enable the metacognitive process
            filter_metadata: metadata filter

        Returns:
            The answer and related information
        """
        logger.info(f"RAG Query: {question}")

        # 1. Embed the question
        logger.info("1️⃣ Embedding the question...")
        query_embedding = self.embedder.embed_text(question)

        # 2. Retrieve related documents
        logger.info(f"2️⃣ Searching for related documents (top_k={top_k})...")
        search_results = self.vector_store.search(
            query_embedding=query_embedding,
            top_k=top_k,
            filter_metadata=filter_metadata
        )

        # Format the search results
        context_documents = []
        for doc, metadata, distance in zip(
            search_results['documents'],
            search_results['metadatas'],
            search_results['distances']
        ):
            context_documents.append({
                'text': doc,
                'metadata': metadata,
                'similarity': 1 - distance,  # convert distance to similarity (assumes cosine distance)
                'source_filename': metadata.get('source_filename', 'unknown')
            })

        logger.info(f"Search complete: {len(context_documents)} documents found")

        # 3. Generate the answer with the metacognitive agent
        if enable_metacognition:
            logger.info("3️⃣ Generating the answer with the metacognitive agent...")
            result = await self.agent.think_and_reflect(
                query=question,
                context_documents=context_documents
            )

            return {
                "question": question,
                "answer": result["final_response"],
                "sources": context_documents,
                "metacognition": {
                    "thinking_history": result["thinking_history"],
                    "iterations": result["iterations"]
                },
                "search_stats": {
                    "documents_found": len(context_documents),
                    "top_similarity": context_documents[0]['similarity'] if context_documents else 0
                }
            }
        else:
            # Simple answer without metacognition
            logger.info("3️⃣ Generating a simple answer...")
            simple_response = await self._generate_simple_response(question, context_documents)

            return {
                "question": question,
                "answer": simple_response,
                "sources": context_documents,
                "search_stats": {
                    "documents_found": len(context_documents),
                    "top_similarity": context_documents[0]['similarity'] if context_documents else 0
                }
            }

    async def _generate_simple_response(self, question: str, context_documents: List[Dict]) -> str:
        """Generate a simple answer without metacognition"""
        # Format the context
        context_text = "\n\n".join([
            f"[Source: {doc['source_filename']}]\n{doc['text']}"
            for doc in context_documents
        ])

        prompt = f"""
You are an expert in finance/economics.

Question: {question}

Reference documents:
{context_text}

Answer the question using the documents above. You must:
1. Use only information from the provided documents
2. Do not guess when information is uncertain
3. Cite your sources clearly
4. Answer in Korean
"""

        message = self.agent.client.messages.create(
            model=self.agent.model,
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text

    def get_statistics(self) -> Dict:
        """RAG system statistics"""
        vector_stats = self.vector_store.get_collection_stats()

        return {
            "vector_store": vector_stats,
            "embedding_model": {
                "type": self.embedder.model_type,
                "name": self.embedder.model_name,
                "dimension": self.embedder.get_embedding_dimension()
            },
            "agent_model": self.agent.model
        }
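`services/vector_store.py` is not included in this commit view. From the call sites above (`search`, `get_collection_stats`, `collection.count`), a minimal sketch of the `search` interface, assuming ChromaDB 0.4.x as pinned in requirements.txt, could look like this:

```python
# Hypothetical sketch of services/vector_store.py's search() (not in this
# commit). It assumes chromadb 0.4.x and flattens the per-query nesting that
# collection.query() returns, matching how rag_pipeline.py consumes it.
from typing import Dict, List, Optional
import chromadb

class VectorStore:
    def __init__(self, persist_directory: str, collection_name: str):
        self.client = chromadb.PersistentClient(path=persist_directory)
        self.collection = self.client.get_or_create_collection(collection_name)

    def search(
        self,
        query_embedding: List[float],
        top_k: int = 5,
        filter_metadata: Optional[Dict[str, str]] = None,
    ) -> Dict:
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k,
            where=filter_metadata,  # exact-match metadata filter, e.g. {"author": "John Doe"}
        )
        # ChromaDB returns one list per query; unwrap the single query.
        return {
            "documents": results["documents"][0],
            "metadatas": results["metadatas"][0],
            "distances": results["distances"][0],
        }
```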
requirements.txt ADDED
@@ -0,0 +1,36 @@
# FastAPI and Web Server
fastapi==0.109.0
uvicorn[standard]==0.27.0
pydantic==2.5.3
pydantic-settings==2.1.0
python-multipart==0.0.6

# Anthropic Claude
anthropic==0.18.1

# PDF Processing
PyPDF2==3.0.1
pdfplumber==0.10.3
pymupdf==1.23.8

# Vector Database
chromadb==0.4.22
sentence-transformers==2.3.1

# Embeddings (multiple options)
openai==1.10.0
cohere==4.47

# Text Processing
langchain==0.1.4
langchain-community==0.0.17
tiktoken==0.5.2

# Utilities
python-dotenv==1.0.0
tqdm==4.66.1
numpy==1.26.3
pandas==2.1.4

# Logging and Monitoring
loguru==0.7.2
scripts/__init__.py ADDED
File without changes
scripts/check_vector_db.py ADDED
@@ -0,0 +1,85 @@
"""
Vector DB status check script

Inspects the contents of the vector DB after indexing completes.
"""

import sys
from pathlib import Path

# Add the project root to the Python path
project_root = Path(__file__).parent.parent
sys.path.insert(0, str(project_root))

from dotenv import load_dotenv
from services.vector_store import VectorStore
from utils.config import settings


def main():
    """Check the vector DB status"""
    load_dotenv()

    print("=" * 80)
    print("Vector DB status check")
    print("=" * 80)

    # Initialize the vector store
    vector_store = VectorStore(
        persist_directory=settings.chroma_persist_directory,
        collection_name=settings.collection_name
    )

    # Statistics
    stats = vector_store.get_collection_stats()
    print("\n📊 Basic info:")
    print(f"  Collection name: {stats['collection_name']}")
    print(f"  Persist directory: {stats['persist_directory']}")
    print(f"  Total documents: {stats['total_documents']}")
    print(f"  Has data: {'✅ yes' if stats['has_data'] else '❌ no'}")

    if not stats['has_data']:
        print("\n⚠️ The vector DB is empty!")
        print("  Run python scripts/index_pdfs.py first.")
        return

    # Inspect sample data
    print("\n📚 Sample documents:")
    sample = vector_store.collection.peek(limit=3)

    for i, (doc_id, doc, metadata) in enumerate(zip(
        sample['ids'],
        sample['documents'],
        sample['metadatas']
    ), 1):
        print(f"\n[{i}] {doc_id}")
        print(f"  Source: {metadata.get('source_filename', 'unknown')}")
        print(f"  Title: {metadata.get('title', 'N/A')}")
        print(f"  Author: {metadata.get('author', 'N/A')}")
        print(f"  Content: {doc[:150]}...")

    # Simple search test
    print("\n🔍 Search test:")
    test_query = "financial crisis"
    print(f"  Query: '{test_query}'")

    results = vector_store.search_by_text(test_query, top_k=3)

    print(f"  Results: {len(results['documents'])} documents found")
    for i, (doc, metadata, distance) in enumerate(zip(
        results['documents'],
        results['metadatas'],
        results['distances']
    ), 1):
        similarity = 1 - distance
        print(f"\n  [{i}] {metadata.get('source_filename', 'unknown')}")
        print(f"      Similarity: {similarity:.3f}")
        print(f"      Content: {doc[:100]}...")

    print("\n" + "=" * 80)
    print("✅ Vector DB check complete!")
    print("=" * 80)


if __name__ == "__main__":
    main()
scripts/index_pdfs.py ADDED
@@ -0,0 +1,153 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ ๋กœ์ปฌ์—์„œ ์‹คํ–‰ํ•  PDF ์ธ๋ฑ์‹ฑ ์Šคํฌ๋ฆฝํŠธ
3
+
4
+ ์ด ์Šคํฌ๋ฆฝํŠธ๋ฅผ ๋งฅ๋ถ ๋กœ์ปฌ์—์„œ ์‹คํ–‰ํ•˜๋ฉด:
5
+ 1. ์ง€์ •๋œ ๊ฒฝ๋กœ์˜ ๋ชจ๋“  PDF ํŒŒ์ผ ์ฝ๊ธฐ
6
+ 2. ํ…์ŠคํŠธ ์ถ”์ถœ ๋ฐ ์ฒญํ‚น
7
+ 3. ์ž„๋ฒ ๋”ฉ ์ƒ์„ฑ
8
+ 4. ChromaDB์— ์ €์žฅ
9
+
10
+ ์‚ฌ์šฉ๋ฒ•:
11
+ python scripts/index_pdfs.py
12
+ """
13
+
14
+ import sys
15
+ from pathlib import Path
16
+
17
+ # ํ”„๋กœ์ ํŠธ ๋ฃจํŠธ๋ฅผ Python ๊ฒฝ๋กœ์— ์ถ”๊ฐ€
18
+ project_root = Path(__file__).parent.parent
19
+ sys.path.insert(0, str(project_root))
20
+
21
+ from dotenv import load_dotenv
22
+ from loguru import logger
23
+ import time
24
+
25
+ from services.pdf_processor import PDFProcessor
26
+ from services.chunker import TextChunker
27
+ from services.embedder import Embedder
28
+ from services.vector_store import VectorStore
29
+ from utils.config import settings
30
+
31
+
32
+ def main():
33
+ """๋ฉ”์ธ ์ธ๋ฑ์‹ฑ ํ”„๋กœ์„ธ์Šค"""
34
+
35
+ # ํ™˜๊ฒฝ ๋ณ€์ˆ˜ ๋กœ๋“œ
36
+ load_dotenv()
37
+
38
+ logger.info("=" * 80)
39
+ logger.info("PDF ์ธ๋ฑ์‹ฑ ์‹œ์ž‘")
40
+ logger.info("=" * 80)
41
+
42
+ start_time = time.time()
43
+
44
+ # 1. PDF ์ฒ˜๋ฆฌ
45
+ logger.info(f"\n[1/4] PDF ํŒŒ์ผ ์ฒ˜๋ฆฌ ์ค‘...")
46
+ logger.info(f"PDF ๊ฒฝ๋กœ: {settings.pdf_source_path}")
47
+
48
+ pdf_processor = PDFProcessor(settings.pdf_source_path)
49
+ documents = pdf_processor.process_all_pdfs()
50
+
51
+ if not documents:
52
+ logger.error("์ฒ˜๋ฆฌํ•  PDF ํŒŒ์ผ์ด ์—†์Šต๋‹ˆ๋‹ค. ๊ฒฝ๋กœ๋ฅผ ํ™•์ธํ•˜์„ธ์š”.")
53
+ return
54
+
55
+ # ํ†ต๊ณ„ ์ถœ๋ ฅ
56
+ stats = pdf_processor.get_statistics()
57
+ logger.info(f"\n์ฒ˜๋ฆฌ ์™„๋ฃŒ:")
58
+ logger.info(f" - ์ „์ฒด ๋ฌธ์„œ: {stats['total_documents']}๊ฐœ")
59
+ logger.info(f" - ์ „์ฒด ํŽ˜์ด์ง€: {stats['total_pages']}ํŽ˜์ด์ง€")
60
+ logger.info(f" - ํ‰๊ท  ํŽ˜์ด์ง€/๋ฌธ์„œ: {stats['avg_pages_per_doc']:.1f}ํŽ˜์ด์ง€")
61
+ logger.info(f" - ์ „์ฒด ๋ฌธ์ž ์ˆ˜: {stats['total_characters']:,}์ž")
62
+
63
+ # 2. ํ…์ŠคํŠธ ์ฒญํ‚น
64
+ logger.info(f"\n[2/4] ํ…์ŠคํŠธ ์ฒญํ‚น ์ค‘...")
65
+ logger.info(f"์ฒญํฌ ํฌ๊ธฐ: {settings.chunk_size}, ์˜ค๋ฒ„๋žฉ: {settings.chunk_overlap}")
66
+
67
+ chunker = TextChunker(
68
+ chunk_size=settings.chunk_size,
69
+ chunk_overlap=settings.chunk_overlap
70
+ )
71
+ chunks = chunker.chunk_all_documents(documents)
72
+
73
+ chunk_stats = chunker.get_chunk_statistics(chunks)
74
+ logger.info(f"\n์ฒญํ‚น ์™„๋ฃŒ:")
75
+ logger.info(f" - ์ „์ฒด ์ฒญํฌ: {chunk_stats['total_chunks']}๊ฐœ")
76
+ logger.info(f" - ํ‰๊ท  ์ฒญํฌ ๊ธธ์ด: {chunk_stats['avg_chunk_length']:.0f}์ž")
77
+ logger.info(f" - ๋ฌธ์„œ๋‹น ํ‰๊ท  ์ฒญํฌ: {chunk_stats['total_chunks'] / len(documents):.1f}๊ฐœ")
78
+
79
+ # 3. ์ž„๋ฒ ๋”ฉ ์ƒ์„ฑ
80
+ logger.info(f"\n[3/4] ์ž„๋ฒ ๋”ฉ ์ƒ์„ฑ ์ค‘...")
81
+ logger.info(f"์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ: {settings.embedding_model} ({settings.embedding_model_name})")
82
+
83
+ embedder = Embedder(
84
+ model_type=settings.embedding_model,
85
+ model_name=settings.embedding_model_name,
86
+ openai_api_key=settings.openai_api_key,
87
+ cohere_api_key=settings.cohere_api_key
88
+ )
89
+
90
+ texts = [chunk['text'] for chunk in chunks]
91
+ embeddings = embedder.embed_batch(texts, batch_size=32)
92
+
93
+ logger.info(f"\n์ž„๋ฒ ๋”ฉ ์™„๋ฃŒ:")
94
+ logger.info(f" - ์ž„๋ฒ ๋”ฉ ๊ฐœ์ˆ˜: {len(embeddings)}๊ฐœ")
95
+ logger.info(f" - ์ž„๋ฒ ๋”ฉ ์ฐจ์›: {len(embeddings[0])}์ฐจ์›")
96
+
97
+ # 4. Vector DB์— ์ €์žฅ
98
+ logger.info(f"\n[4/4] Vector DB์— ์ €์žฅ ์ค‘...")
99
+ logger.info(f"์ €์žฅ ๊ฒฝ๋กœ: {settings.chroma_persist_directory}")
100
+
101
+ vector_store = VectorStore(
102
+ persist_directory=settings.chroma_persist_directory,
103
+ collection_name=settings.collection_name
104
+ )
105
+
106
+ # ๊ธฐ์กด ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ์œผ๋ฉด ์‚ฌ์šฉ์ž์—๊ฒŒ ํ™•์ธ
107
+ current_count = vector_store.collection.count()
108
+ if current_count > 0:
109
+ logger.warning(f"\n๊ธฐ์กด ๋ฐ์ดํ„ฐ {current_count}๊ฐœ๊ฐ€ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค.")
110
+ response = input("๊ธฐ์กด ๋ฐ์ดํ„ฐ๋ฅผ ์‚ญ์ œํ•˜๊ณ  ์ƒˆ๋กœ ์ธ๋ฑ์‹ฑํ•˜์‹œ๊ฒ ์Šต๋‹ˆ๊นŒ? (y/N): ")
111
+ if response.lower() == 'y':
112
+ vector_store.reset_collection()
113
+ else:
114
+ logger.info("๊ธฐ์กด ๋ฐ์ดํ„ฐ์— ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.")
115
+
116
+ vector_store.add_documents(chunks, embeddings)
117
+
118
+ # ์ตœ์ข… ํ†ต๊ณ„
119
+ final_stats = vector_store.get_collection_stats()
120
+ logger.info(f"\n์ €์žฅ ์™„๋ฃŒ:")
121
+ logger.info(f" - ์ปฌ๋ ‰์…˜: {final_stats['collection_name']}")
122
+ logger.info(f" - ์ „์ฒด ๋ฌธ์„œ: {final_stats['total_documents']}๊ฐœ")
123
+ logger.info(f" - ์ €์žฅ ๊ฒฝ๋กœ: {final_stats['persist_directory']}")
124
+
125
+ # ์ด ์†Œ์š” ์‹œ๊ฐ„
126
+ elapsed_time = time.time() - start_time
127
+ logger.info(f"\n์ด ์†Œ์š” ์‹œ๊ฐ„: {elapsed_time:.1f}์ดˆ ({elapsed_time/60:.1f}๋ถ„)")
128
+
129
+ logger.info("\n" + "=" * 80)
130
+ logger.info("์ธ๋ฑ์‹ฑ ์™„๋ฃŒ! ๐ŸŽ‰")
131
+ logger.info("์ด์ œ FastAPI ์„œ๋ฒ„๋ฅผ ์‹คํ–‰ํ•˜์—ฌ RAG ์‹œ์Šคํ…œ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.")
132
+ logger.info("=" * 80)
133
+
134
+ # ๊ฐ„๋‹จํ•œ ๊ฒ€์ƒ‰ ํ…Œ์ŠคํŠธ
135
+ logger.info("\n๊ฒ€์ƒ‰ ํ…Œ์ŠคํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜์‹œ๊ฒ ์Šต๋‹ˆ๊นŒ?")
136
+ test_query = input("๊ฒ€์ƒ‰์–ด๋ฅผ ์ž…๋ ฅํ•˜์„ธ์š” (Enter๋ฅผ ๋ˆ„๋ฅด๋ฉด ๊ฑด๋„ˆ๋œ€): ").strip()
137
+
138
+ if test_query:
139
+ logger.info(f"\n'{test_query}' ๊ฒ€์ƒ‰ ์ค‘...")
140
+ results = vector_store.search_by_text(test_query, top_k=3)
141
+
142
+ logger.info(f"\n์ƒ์œ„ {len(results['documents'])}๊ฐœ ๊ฒฐ๊ณผ:")
143
+ for i, (doc, metadata, distance) in enumerate(zip(
144
+ results['documents'],
145
+ results['metadatas'],
146
+ results['distances']
147
+ ), 1):
148
+ logger.info(f"\n[{i}] {metadata['source_filename']} (similarity: {1-distance:.3f})")
149
+ logger.info(f"๋‚ด์šฉ: {doc[:200]}...")
150
+
151
+
152
+ if __name__ == "__main__":
153
+ main()
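A note on the similarity value logged above: the collection is created without an explicit distance metric, and ChromaDB's default space is (squared) L2, so `1 - distance` is a monotonic convenience score rather than a true cosine similarity. A minimal sketch of the actual relationship, assuming unit-normalized embeddings (the all-MiniLM-L6-v2 pipeline normalizes its output):

import numpy as np

# For unit vectors, squared L2 distance d and cosine similarity cos satisfy
# d = 2 - 2*cos, i.e. cos = 1 - d/2 (not 1 - d).
a = np.array([1.0, 0.0])          # a unit vector
b = np.array([0.6, 0.8])          # another unit vector
d = float(np.sum((a - b) ** 2))   # squared L2 distance = 0.8
print(1 - d / 2)                  # 0.6 — the true cosine similarity
print(float(a @ b))               # 0.6 — matches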
scripts/test_query.py ADDED
@@ -0,0 +1,116 @@
+ """
+ RAG system test script.
+ 
+ Use while the API server is running.
+ """
+ 
+ import requests
+ from typing import Dict
+ 
+ 
+ def test_query(
+     question: str,
+     top_k: int = 5,
+     enable_metacognition: bool = True,
+     api_url: str = "http://localhost:8000"
+ ) -> Dict:
+     """
+     Send a test question to the API.
+ 
+     Args:
+         question: Question to ask
+         top_k: Number of documents to retrieve
+         enable_metacognition: Enable the metacognitive agent
+         api_url: Base URL of the API server
+ 
+     Returns:
+         Response data
+     """
+     print("=" * 80)
+     print(f"Question: {question}")
+     print("=" * 80)
+ 
+     # Request
+     response = requests.post(
+         f"{api_url}/query",
+         json={
+             "question": question,
+             "top_k": top_k,
+             "enable_metacognition": enable_metacognition
+         }
+     )
+ 
+     if response.status_code != 200:
+         print(f"❌ Error: {response.status_code}")
+         print(response.text)
+         return {}
+ 
+     result = response.json()
+ 
+     # Print the result
+     print("\n📝 Answer:")
+     print("-" * 80)
+     print(result["answer"])
+     print("-" * 80)
+ 
+     print(f"\n📚 Sources: {len(result['sources'])}")
+     for i, source in enumerate(result['sources'][:3], 1):
+         print(f"\n[{i}] {source['source_filename']}")
+         print(f"    Similarity: {source['similarity']:.3f}")
+         print(f"    Content: {source['text'][:100]}...")
+ 
+     if result.get('metacognition'):
+         print("\n🧠 Metacognition info:")
+         print(f"    Iterations: {result['metacognition']['iterations']}")
+         print(f"    Thinking steps: {len(result['metacognition']['thinking_history'])}")
+ 
+     print("\n" + "=" * 80)
+     return result
+ 
+ 
+ def test_health(api_url: str = "http://localhost:8000"):
+     """Health check"""
+     print("🏥 Checking server health...")
+     response = requests.get(f"{api_url}/health")
+ 
+     if response.status_code == 200:
+         data = response.json()
+         print("✅ Server healthy")
+         print(f"    Vector Store: {data['vector_store']['total_documents']} documents")
+         print(f"    Embedding: {data['embedding_model']['type']} ({data['embedding_model']['dimension']} dims)")
+     else:
+         print(f"❌ Server error: {response.status_code}")
+ 
+ 
+ if __name__ == "__main__":
+     # Health check
+     test_health()
+ 
+     print("\n")
+ 
+     # Sample questions
+     questions = [
+         "What are the main causes of financial crises?",
+         "What are the effects of portfolio diversification?",
+         "How does central bank interest-rate policy affect markets?",
+     ]
+ 
+     for question in questions:
+         try:
+             test_query(question, top_k=5, enable_metacognition=True)
+             print("\n\n")
+         except Exception as e:
+             print(f"❌ Error: {str(e)}\n\n")
+ 
+     # Interactive custom questions
+     print("\nEnter a custom question (press Enter to quit):")
+     while True:
+         question = input("\nQuestion: ").strip()
+         if not question:
+             break
+ 
+         try:
+             test_query(question, top_k=5, enable_metacognition=True)
+         except Exception as e:
+             print(f"❌ Error: {str(e)}")
services/__init__.py ADDED
File without changes
services/chunker.py ADDED
@@ -0,0 +1,93 @@
+ """
+ Split text into appropriately sized chunks.
+ """
+ from typing import Any, Dict, List
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+ from loguru import logger
+ 
+ 
+ class TextChunker:
+     """Splits text into semantically meaningful chunks."""
+ 
+     def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
+         """
+         Args:
+             chunk_size: Maximum number of characters per chunk
+             chunk_overlap: Number of characters shared between adjacent chunks (preserves context)
+         """
+         self.chunk_size = chunk_size
+         self.chunk_overlap = chunk_overlap
+ 
+         # Use LangChain's RecursiveCharacterTextSplitter, which tries to split
+         # on paragraphs first, then sentences, then words.
+         self.text_splitter = RecursiveCharacterTextSplitter(
+             chunk_size=chunk_size,
+             chunk_overlap=chunk_overlap,
+             length_function=len,
+             separators=["\n\n", "\n", ". ", " ", ""]
+         )
+ 
+     def chunk_document(self, doc_data: Dict[str, Any]) -> List[Dict[str, Any]]:
+         """
+         Split a single document into chunks.
+ 
+         Args:
+             doc_data: Document data extracted by the PDF processor
+ 
+         Returns:
+             List of chunks with metadata
+         """
+         text = doc_data['text']
+         chunks = self.text_splitter.split_text(text)
+ 
+         chunked_docs = []
+         for i, chunk in enumerate(chunks):
+             chunked_docs.append({
+                 'text': chunk,
+                 'chunk_id': i,
+                 'source_filename': doc_data['filename'],
+                 'source_filepath': doc_data['filepath'],
+                 'total_chunks': len(chunks),
+                 'metadata': doc_data['metadata'],
+                 'page_count': doc_data['page_count']
+             })
+ 
+         return chunked_docs
+ 
+     def chunk_all_documents(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+         """
+         Split a list of documents into chunks.
+ 
+         Args:
+             documents: List of documents extracted by the PDF processor
+ 
+         Returns:
+             List of all chunks from all documents
+         """
+         all_chunks = []
+ 
+         logger.info(f"Chunking {len(documents)} documents...")
+ 
+         for doc in documents:
+             chunks = self.chunk_document(doc)
+             all_chunks.extend(chunks)
+ 
+         logger.info(f"Created {len(all_chunks)} chunks from {len(documents)} documents")
+         logger.info(f"Average {len(all_chunks) / len(documents):.1f} chunks per document")
+ 
+         return all_chunks
+ 
+     def get_chunk_statistics(self, chunks: List[Dict[str, Any]]) -> Dict[str, Any]:
+         """Summary statistics over a list of chunks."""
+         if not chunks:
+             return {}
+ 
+         chunk_lengths = [len(chunk['text']) for chunk in chunks]
+ 
+         return {
+             'total_chunks': len(chunks),
+             'avg_chunk_length': sum(chunk_lengths) / len(chunks),
+             'min_chunk_length': min(chunk_lengths),
+             'max_chunk_length': max(chunk_lengths),
+             'total_characters': sum(chunk_lengths),
+         }
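To sanity-check the chunker in isolation, a minimal sketch (the doc_data dict mimics the PDF processor's output shape; filename, filepath, and page_count are placeholder values):

from services.chunker import TextChunker

chunker = TextChunker(chunk_size=200, chunk_overlap=50)
doc_data = {
    'text': "A paragraph about monetary policy transmission.\n\n" * 20,
    'filename': 'toy.pdf', 'filepath': '/tmp/toy.pdf',
    'metadata': {}, 'page_count': 1,
}
chunks = chunker.chunk_document(doc_data)
print(len(chunks), "chunks")
print(chunker.get_chunk_statistics(chunks))  # lengths should cluster near 200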
services/embedder.py ADDED
@@ -0,0 +1,147 @@
+ """
+ Convert text into vector embeddings.
+ """
+ from typing import List, Optional
+ from loguru import logger
+ from sentence_transformers import SentenceTransformer
+ from tqdm import tqdm
+ 
+ 
+ class Embedder:
+     """Converts text into embedding vectors."""
+ 
+     def __init__(
+         self,
+         model_type: str = "sentence-transformers",
+         model_name: str = "all-MiniLM-L6-v2",
+         openai_api_key: Optional[str] = None,
+         cohere_api_key: Optional[str] = None
+     ):
+         """
+         Args:
+             model_type: Embedding backend to use (sentence-transformers, openai, cohere)
+             model_name: Model name
+             openai_api_key: OpenAI API key (required when model_type is "openai")
+             cohere_api_key: Cohere API key (required when model_type is "cohere")
+         """
+         self.model_type = model_type
+         self.model_name = model_name
+ 
+         if model_type == "sentence-transformers":
+             logger.info(f"Loading Sentence Transformer model: {model_name}")
+             self.model = SentenceTransformer(model_name)
+             logger.info(f"Model loaded. Embedding dimension: {self.model.get_sentence_embedding_dimension()}")
+ 
+         elif model_type == "openai":
+             if not openai_api_key:
+                 raise ValueError("OpenAI API key required for openai embeddings")
+             import openai
+             openai.api_key = openai_api_key
+             self.model = None
+             logger.info(f"Using OpenAI embeddings: {model_name}")
+ 
+         elif model_type == "cohere":
+             if not cohere_api_key:
+                 raise ValueError("Cohere API key required for cohere embeddings")
+             import cohere
+             self.model = cohere.Client(cohere_api_key)
+             logger.info(f"Using Cohere embeddings: {model_name}")
+ 
+         else:
+             raise ValueError(f"Unknown model type: {model_type}")
+ 
+     def embed_text(self, text: str) -> List[float]:
+         """
+         Embed a single text.
+ 
+         Args:
+             text: Text to embed
+ 
+         Returns:
+             Embedding vector as a list of floats
+         """
+         if self.model_type == "sentence-transformers":
+             embedding = self.model.encode(text, convert_to_numpy=True)
+             return embedding.tolist()
+ 
+         elif self.model_type == "openai":
+             import openai
+             response = openai.embeddings.create(
+                 input=text,
+                 model=self.model_name
+             )
+             return response.data[0].embedding
+ 
+         elif self.model_type == "cohere":
+             response = self.model.embed(
+                 texts=[text],
+                 model=self.model_name
+             )
+             return response.embeddings[0]
+ 
+     def embed_batch(self, texts: List[str], batch_size: int = 32) -> List[List[float]]:
+         """
+         Embed multiple texts in batches.
+ 
+         Args:
+             texts: Texts to embed
+             batch_size: Batch size
+ 
+         Returns:
+             List of embedding vectors
+         """
+         logger.info(f"Embedding {len(texts)} texts with batch size {batch_size}")
+ 
+         if self.model_type == "sentence-transformers":
+             # Sentence Transformers handles batching natively
+             embeddings = []
+             for i in tqdm(range(0, len(texts), batch_size), desc="Embedding batches"):
+                 batch = texts[i:i + batch_size]
+                 batch_embeddings = self.model.encode(
+                     batch,
+                     convert_to_numpy=True,
+                     show_progress_bar=False
+                 )
+                 embeddings.extend(batch_embeddings.tolist())
+             return embeddings
+ 
+         elif self.model_type == "openai":
+             # Batch requests to stay within OpenAI API rate limits
+             import openai
+             embeddings = []
+             for i in tqdm(range(0, len(texts), batch_size), desc="Embedding batches"):
+                 batch = texts[i:i + batch_size]
+                 response = openai.embeddings.create(
+                     input=batch,
+                     model=self.model_name
+                 )
+                 batch_embeddings = [item.embedding for item in response.data]
+                 embeddings.extend(batch_embeddings)
+             return embeddings
+ 
+         elif self.model_type == "cohere":
+             # Cohere batch processing
+             embeddings = []
+             for i in tqdm(range(0, len(texts), batch_size), desc="Embedding batches"):
+                 batch = texts[i:i + batch_size]
+                 response = self.model.embed(
+                     texts=batch,
+                     model=self.model_name
+                 )
+                 embeddings.extend(response.embeddings)
+             return embeddings
+ 
+     def get_embedding_dimension(self) -> int:
+         """Return the embedding dimension."""
+         if self.model_type == "sentence-transformers":
+             return self.model.get_sentence_embedding_dimension()
+         elif self.model_type == "openai":
+             if "ada-002" in self.model_name:
+                 return 1536
+             return 1536  # default
+         elif self.model_type == "cohere":
+             if "embed-multilingual" in self.model_name:
+                 return 768
+             return 1024  # default
+         return 768
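A minimal round trip through the free sentence-transformers path (the first run downloads the all-MiniLM-L6-v2 weights, roughly 80 MB):

from services.embedder import Embedder

embedder = Embedder(model_type="sentence-transformers",
                    model_name="all-MiniLM-L6-v2")
vectors = embedder.embed_batch(["interest rate risk", "credit default swaps"])
print(len(vectors), "vectors of dimension", embedder.get_embedding_dimension())
# expected: 2 vectors of dimension 384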
services/pdf_processor.py ADDED
@@ -0,0 +1,111 @@
+ """
+ PDF file processing and text extraction.
+ """
+ from pathlib import Path
+ from typing import Any, Dict, List, Optional
+ import PyPDF2
+ import pdfplumber
+ from loguru import logger
+ from tqdm import tqdm
+ 
+ 
+ class PDFProcessor:
+     """Extracts text and metadata from PDF files."""
+ 
+     def __init__(self, pdf_directory: str):
+         """
+         Args:
+             pdf_directory: Path to the directory containing the PDF files
+         """
+         self.pdf_directory = Path(pdf_directory)
+         self.processed_docs = []
+ 
+     def get_pdf_files(self) -> List[Path]:
+         """Find all PDF files in the directory (recursively)."""
+         if not self.pdf_directory.exists():
+             raise FileNotFoundError(f"Directory not found: {self.pdf_directory}")
+ 
+         pdf_files = list(self.pdf_directory.glob("**/*.pdf"))
+         logger.info(f"Found {len(pdf_files)} PDF files in {self.pdf_directory}")
+         return pdf_files
+ 
+     def extract_text_from_pdf(self, pdf_path: Path) -> Optional[Dict[str, Any]]:
+         """
+         Extract text from a single PDF file.
+ 
+         Args:
+             pdf_path: Path to the PDF file
+ 
+         Returns:
+             Dict with 'text', 'metadata', 'filename', 'page_count',
+             or None if extraction failed
+         """
+         try:
+             # Extract text with pdfplumber (more accurate than PyPDF2)
+             with pdfplumber.open(pdf_path) as pdf:
+                 text = ""
+                 for page in pdf.pages:
+                     page_text = page.extract_text()
+                     if page_text:
+                         text += page_text + "\n\n"
+ 
+             # Extract metadata with PyPDF2
+             with open(pdf_path, 'rb') as f:
+                 pdf_reader = PyPDF2.PdfReader(f)
+                 metadata = pdf_reader.metadata if pdf_reader.metadata else {}
+                 page_count = len(pdf_reader.pages)
+ 
+             return {
+                 'text': text.strip(),
+                 'metadata': {
+                     'title': metadata.get('/Title', ''),
+                     'author': metadata.get('/Author', ''),
+                     'subject': metadata.get('/Subject', ''),
+                     'creator': metadata.get('/Creator', ''),
+                 },
+                 'filename': pdf_path.name,
+                 'filepath': str(pdf_path),
+                 'page_count': page_count
+             }
+ 
+         except Exception as e:
+             logger.error(f"Error processing {pdf_path.name}: {str(e)}")
+             return None
+ 
+     def process_all_pdfs(self) -> List[Dict[str, Any]]:
+         """
+         Process all PDF files.
+ 
+         Returns:
+             List of dictionaries containing extracted text and metadata
+         """
+         pdf_files = self.get_pdf_files()
+         self.processed_docs = []
+ 
+         logger.info(f"Processing {len(pdf_files)} PDF files...")
+ 
+         for pdf_path in tqdm(pdf_files, desc="Processing PDFs"):
+             doc_data = self.extract_text_from_pdf(pdf_path)
+             if doc_data and doc_data['text']:  # keep only documents with extractable text
+                 self.processed_docs.append(doc_data)
+             else:
+                 logger.warning(f"No text extracted from {pdf_path.name}")
+ 
+         logger.info(f"Successfully processed {len(self.processed_docs)} PDFs")
+         return self.processed_docs
+ 
+     def get_statistics(self) -> Dict[str, Any]:
+         """Summary statistics for the processed documents."""
+         if not self.processed_docs:
+             return {}
+ 
+         total_pages = sum(doc['page_count'] for doc in self.processed_docs)
+         total_chars = sum(len(doc['text']) for doc in self.processed_docs)
+ 
+         return {
+             'total_documents': len(self.processed_docs),
+             'total_pages': total_pages,
+             'total_characters': total_chars,
+             'avg_pages_per_doc': total_pages / len(self.processed_docs),
+             'avg_chars_per_doc': total_chars / len(self.processed_docs),
+         }
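A minimal sketch of running the processor standalone (the directory path is a placeholder; any folder containing *.pdf files works):

from services.pdf_processor import PDFProcessor

processor = PDFProcessor("./data")   # placeholder: directory containing PDFs
docs = processor.process_all_pdfs()  # logs and skips files with no extractable text
if docs:
    print(docs[0]['filename'], "-", docs[0]['page_count'], "pages")
    print(processor.get_statistics())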
services/vector_store.py ADDED
@@ -0,0 +1,178 @@
+ """
+ Vector database integration (ChromaDB).
+ """
+ from pathlib import Path
+ from typing import Any, Dict, List, Optional
+ import chromadb
+ from loguru import logger
+ 
+ 
+ class VectorStore:
+     """Vector store backed by ChromaDB."""
+ 
+     def __init__(
+         self,
+         persist_directory: str = "./data/chroma_db",
+         collection_name: str = "financial_papers"
+     ):
+         """
+         Args:
+             persist_directory: Path where ChromaDB persists its data
+             collection_name: Collection name
+         """
+         self.persist_directory = Path(persist_directory)
+         self.collection_name = collection_name
+ 
+         # Create the directory if needed
+         self.persist_directory.mkdir(parents=True, exist_ok=True)
+ 
+         # Initialize the ChromaDB client
+         logger.info(f"Initializing ChromaDB at {persist_directory}")
+         self.client = chromadb.PersistentClient(
+             path=str(self.persist_directory)
+         )
+ 
+         # Create or fetch the collection
+         self.collection = self.client.get_or_create_collection(
+             name=collection_name,
+             metadata={"description": "Financial and Economics research papers"}
+         )
+ 
+         logger.info(f"Collection '{collection_name}' ready. Current count: {self.collection.count()}")
+ 
+     def add_documents(
+         self,
+         chunks: List[Dict[str, Any]],
+         embeddings: List[List[float]]
+     ) -> None:
+         """
+         Add document chunks to the vector DB.
+ 
+         Args:
+             chunks: Chunk data (including text and metadata)
+             embeddings: Embedding vector for each chunk
+         """
+         if len(chunks) != len(embeddings):
+             raise ValueError("Number of chunks and embeddings must match")
+ 
+         logger.info(f"Adding {len(chunks)} documents to vector store...")
+ 
+         # Convert to the shape ChromaDB expects
+         ids = [f"{chunk['source_filename']}_{chunk['chunk_id']}" for chunk in chunks]
+         documents = [chunk['text'] for chunk in chunks]
+         metadatas = [
+             {
+                 'source_filename': chunk['source_filename'],
+                 'source_filepath': chunk['source_filepath'],
+                 'chunk_id': str(chunk['chunk_id']),
+                 'total_chunks': str(chunk['total_chunks']),
+                 'title': chunk['metadata'].get('title', ''),
+                 'author': chunk['metadata'].get('author', ''),
+                 'page_count': str(chunk['page_count'])
+             }
+             for chunk in chunks
+         ]
+ 
+         # Add in batches (ChromaDB handles large inserts, but batching keeps memory bounded)
+         batch_size = 100
+         for i in range(0, len(chunks), batch_size):
+             batch_end = min(i + batch_size, len(chunks))
+             self.collection.add(
+                 ids=ids[i:batch_end],
+                 embeddings=embeddings[i:batch_end],
+                 documents=documents[i:batch_end],
+                 metadatas=metadatas[i:batch_end]
+             )
+             logger.info(f"Added batch {i // batch_size + 1}/{(len(chunks) + batch_size - 1) // batch_size}")
+ 
+         logger.info(f"Successfully added {len(chunks)} documents. Total in collection: {self.collection.count()}")
+ 
+     def search(
+         self,
+         query_embedding: List[float],
+         top_k: int = 5,
+         filter_metadata: Optional[Dict[str, str]] = None
+     ) -> Dict[str, Any]:
+         """
+         Run a vector search.
+ 
+         Args:
+             query_embedding: Embedding vector of the query
+             top_k: Number of results to return
+             filter_metadata: Optional metadata filter
+ 
+         Returns:
+             Search results (documents, metadatas, distances)
+         """
+         results = self.collection.query(
+             query_embeddings=[query_embedding],
+             n_results=top_k,
+             where=filter_metadata
+         )
+ 
+         return {
+             'documents': results['documents'][0] if results['documents'] else [],
+             'metadatas': results['metadatas'][0] if results['metadatas'] else [],
+             'distances': results['distances'][0] if results['distances'] else [],
+             'ids': results['ids'][0] if results['ids'] else []
+         }
+ 
+     def search_by_text(
+         self,
+         query_text: str,
+         top_k: int = 5,
+         filter_metadata: Optional[Dict[str, str]] = None
+     ) -> Dict[str, Any]:
+         """
+         Search by raw text (ChromaDB embeds the query itself).
+ 
+         Note: this uses the collection's default embedding function
+         (all-MiniLM-L6-v2), so results are only meaningful when the
+         index was built with the same model.
+ 
+         Args:
+             query_text: Query text
+             top_k: Number of results to return
+             filter_metadata: Optional metadata filter
+ 
+         Returns:
+             Search results
+         """
+         results = self.collection.query(
+             query_texts=[query_text],
+             n_results=top_k,
+             where=filter_metadata
+         )
+ 
+         return {
+             'documents': results['documents'][0] if results['documents'] else [],
+             'metadatas': results['metadatas'][0] if results['metadatas'] else [],
+             'distances': results['distances'][0] if results['distances'] else [],
+             'ids': results['ids'][0] if results['ids'] else []
+         }
+ 
+     def get_collection_stats(self) -> Dict[str, Any]:
+         """Collection statistics."""
+         count = self.collection.count()
+ 
+         return {
+             'collection_name': self.collection_name,
+             'total_documents': count,
+             'persist_directory': str(self.persist_directory),
+             'has_data': count > 0
+         }
+ 
+     def delete_collection(self) -> None:
+         """Delete the collection (warning: removes all data)."""
+         logger.warning(f"Deleting collection '{self.collection_name}'")
+         self.client.delete_collection(name=self.collection_name)
+         logger.info("Collection deleted")
+ 
+     def reset_collection(self) -> None:
+         """Reset the collection (delete, then recreate)."""
+         self.delete_collection()
+         self.collection = self.client.get_or_create_collection(
+             name=self.collection_name,
+             metadata={"description": "Financial and Economics research papers"}
+         )
+         logger.info("Collection reset")
utils/__init__.py ADDED
File without changes
utils/config.py ADDED
@@ -0,0 +1,39 @@
+ """
+ Configuration management using pydantic-settings
+ """
+ from pydantic_settings import BaseSettings
+ from typing import Optional
+ 
+ 
+ class Settings(BaseSettings):
+     """Application settings loaded from environment variables"""
+ 
+     # API Keys
+     anthropic_api_key: str
+     openai_api_key: Optional[str] = None
+     cohere_api_key: Optional[str] = None
+ 
+     # Vector Database Settings
+     chroma_persist_directory: str = "./data/chroma_db"
+     collection_name: str = "financial_papers"
+ 
+     # PDF Processing Settings
+     pdf_source_path: str
+     chunk_size: int = 1000
+     chunk_overlap: int = 200
+ 
+     # Embedding Model Settings
+     embedding_model: str = "sentence-transformers"  # options: openai, sentence-transformers, cohere
+     embedding_model_name: str = "all-MiniLM-L6-v2"
+ 
+     # API Settings
+     api_host: str = "0.0.0.0"
+     api_port: int = 8000
+ 
+     class Config:
+         env_file = ".env"
+         case_sensitive = False
+ 
+ 
+ # Global settings instance
+ settings = Settings()
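Because settings is instantiated at import time, the two fields above without defaults (anthropic_api_key and pdf_source_path) must be present in .env or the environment, or importing utils.config raises a validation error. A minimal sketch (placeholder values):

# .env (placeholder values; see .env.example for the full list)
#   ANTHROPIC_API_KEY=sk-ant-...
#   PDF_SOURCE_PATH=./data
from utils.config import settings

print(settings.collection_name)                     # "financial_papers" by default
print(settings.chunk_size, settings.chunk_overlap)  # 1000 / 200 by default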