Spaces: Build error
Commit 3022fd1 · 1 Parent(s): 6e46f97
Initial commit: Add ParseAI document processor application
- .gitignore +44 -0
- Dockerfile +54 -0
- README.md +88 -8
- app.py +255 -0
- download_nltk_data.py +26 -0
- extractor.py +57 -0
- requirements.txt +21 -0
- space.yml +11 -0
- start.sh +20 -0
- summarizer.py +120 -0
- vector_store.py +162 -0
.gitignore
ADDED
@@ -0,0 +1,44 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Environment variables
+.env
+
+# Local development
+.idea/
+.vscode/
+*.swp
+*.swo
+
+# Data directories
+/data/
+/app/data/
+/app/nltk_data/
+/app/huggingface_cache/
+
+# Logs
+*.log
+
+# OS generated files
+.DS_Store
+.DS_Store?
+._*
+.Spotlight-V100
+.Trashes
+ehthumbs.db
+Thumbs.db
+
+# Local test files
+*.pdf
+*.md
+run_local.sh
+create_test_pdf.py
Dockerfile
ADDED
@@ -0,0 +1,54 @@
+FROM python:3.9-slim
+
+WORKDIR /app
+
+# Install system dependencies
+RUN apt-get update && apt-get install -y \
+    build-essential \
+    curl \
+    libssl-dev \
+    libffi-dev \
+    python3-dev \
+    python3-pip \
+    git \
+    poppler-utils \
+    tesseract-ocr \
+    tesseract-ocr-eng \
+    && rm -rf /var/lib/apt/lists/*
+
+# Set environment variables
+ENV PYTHONUNBUFFERED=1
+ENV NLTK_DATA=/app/nltk_data
+ENV HUGGINGFACE_HUB_CACHE=/app/huggingface_cache
+ENV UPLOAD_FOLDER=/app/data/uploads
+
+# Copy requirements first to leverage Docker cache
+COPY requirements.txt .
+
+# Install Python dependencies
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy application files
+COPY . .
+
+# Make scripts executable
+RUN chmod +x /app/start.sh /app/download_nltk_data.py
+
+# Switch to non-root user
+RUN useradd -m -u 1000 user && \
+    chown -R user:user /app
+
+USER user
+
+# Set working directory
+WORKDIR /app
+
+# Expose the port the app runs on
+EXPOSE 7860
+
+# Health check
+HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
+    CMD curl -f http://localhost:7860/ || exit 1
+
+# Command to run the application
+CMD ["/app/start.sh"]
README.md
CHANGED
@@ -1,11 +1,91 @@
 ---
-title:
-emoji:
-colorFrom:
-colorTo:
-sdk:
-
-
+title: ParseAI Document Processor
+emoji: π
+colorFrom: blue
+colorTo: indigo
+sdk: gradio
+sdk_version: 5.30.0
+app_file: app.py
+pinned: true
 ---
 
-
+# ParseAI - Document Processing and Analysis
+
+ParseAI is a powerful tool for processing and analyzing PDF documents. It extracts text from documents, summarizes it, and finds related documents through vector search.
+
+## Key Features
+
+- PDF document upload and text extraction
+- Document content summarization
+- Vector-based document search
+- A user-friendly web interface built on Gradio
+
+## Tech Stack
+
+- **Backend**: FastAPI
+- **Frontend**: Gradio
+- **NLP**: NLTK, Hugging Face Transformers
+- **Vector Store**: Sentence Transformers
+- **Container**: Docker
+
+## Running Locally
+
+### Prerequisites
+
+- Docker and Docker Compose
+- Python 3.9+
+
+### Setting Environment Variables
+
+Create a `.env` file and set the following variables:
+
+```bash
+# Hugging Face Hub configuration
+HUGGINGFACE_HUB_TOKEN=your_hf_token_here
+
+# Application configuration
+UPLOAD_FOLDER=/app/data/uploads
+NLTK_DATA=/app/nltk_data
+```
+
+### Running with Docker
+
+1. Build the Docker image:
+```bash
+docker build -t parseai .
+```
+
+2. Run the container:
+```bash
+docker run -d -p 7860:7860 --env-file .env parseai
+```
+
+3. Open it in a web browser:
+```
+http://localhost:7860
+```
+
+## Deploying to Hugging Face Spaces
+
+1. Push this repository to Hugging Face Spaces.
+2. Set the following environment variables in the repository settings:
+   - `HUGGINGFACE_HUB_TOKEN`: Hugging Face API token
+   - `UPLOAD_FOLDER`: `/app/data/uploads`
+   - `NLTK_DATA`: `/app/nltk_data`
+
+## Usage
+
+1. Upload a PDF file on the **PDF Upload** tab.
+2. Enter keywords on the **Document Search** tab to search for related documents.
+
+## Health Check
+
+The application status can be checked at the following endpoint:
+
+```
+GET /health
+```
+
+## License
+
+This project is distributed under the [MIT License](LICENSE).
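For integrators who want to call the HTTP API directly rather than use the Gradio UI, the sketch below shows one way to hit the `/health`, `/upload/pdf`, and `/search` endpoints from Python with `httpx` (already pinned in `requirements.txt`). It is illustrative only: it assumes the FastAPI routes are actually being served on `localhost:7860` (see the note after `app.py` below), and `sample.pdf` is a placeholder file name.

```python
# Minimal client sketch; assumes the FastAPI endpoints are served on port 7860.
# "sample.pdf" is a placeholder file name, not part of this repository.
import httpx

BASE_URL = "http://localhost:7860"

# Health check
print(httpx.get(f"{BASE_URL}/health").json())

# Upload a PDF; processing runs as a background task on the server
with open("sample.pdf", "rb") as f:
    resp = httpx.post(
        f"{BASE_URL}/upload/pdf",
        files={"file": ("sample.pdf", f, "application/pdf")},
    )
print(resp.json())  # e.g. {"filename": "sample.pdf", "status": "processing"}

# Search previously indexed documents
resp = httpx.get(f"{BASE_URL}/search", params={"query": "vector search", "top_k": 3})
print(resp.json())
```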
app.py
ADDED
@@ -0,0 +1,255 @@
+import os
+import sys
+import asyncio
+import logging
+from pathlib import Path
+from typing import List, Dict, Optional
+
+import gradio as gr
+from fastapi import FastAPI, HTTPException, status, UploadFile, File, BackgroundTasks
+from fastapi.middleware.cors import CORSMiddleware
+from dotenv import load_dotenv
+
+# Configure logging
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+    handlers=[
+        logging.StreamHandler(sys.stdout)
+    ]
+)
+logger = logging.getLogger(__name__)
+
+# Load environment variables
+load_dotenv()
+
+# Initialize FastAPI
+app = FastAPI(
+    title="ParseAI API",
+    description="API for processing and analyzing PDF documents",
+    version="1.0.0"
+)
+
+# CORS middleware configuration
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],  # In production, replace with specific origins
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+
+# Directory configuration
+UPLOAD_DIR = Path(os.getenv("UPLOAD_FOLDER", "/app/data/uploads"))
+PROCESSED_DIR = Path(os.getenv("PROCESSED_FOLDER", "/app/data/processed"))
+NLTK_DATA_DIR = Path(os.getenv("NLTK_DATA", "/app/nltk_data"))
+
+# Ensure directories exist with proper permissions
+for directory in [UPLOAD_DIR, PROCESSED_DIR, NLTK_DATA_DIR]:
+    try:
+        directory.mkdir(parents=True, exist_ok=True)
+        os.chmod(directory, 0o755)
+        logger.info(f"Ensured directory exists: {directory}")
+    except Exception as e:
+        logger.error(f"Failed to create directory {directory}: {e}")
+        raise
+
+# Import modules after environment is set up
+try:
+    from extractor import pdf_extractor
+    from summarizer import document_summarizer
+    from vector_store import vector_store
+
+    # Initialize NLTK data
+    import nltk
+    nltk.data.path.append(str(NLTK_DATA_DIR))
+
+    # Verify NLTK data is available
+    try:
+        nltk.data.find('tokenizers/punkt')
+        nltk.data.find('corpora/stopwords')
+        nltk.data.find('corpora/wordnet')
+        nltk.data.find('taggers/averaged_perceptron_tagger')
+        logger.info("NLTK data verified successfully")
+    except LookupError as e:
+        logger.warning(f"NLTK data missing: {e}")
+        # Attempt to download missing data
+        try:
+            nltk.download('punkt', download_dir=str(NLTK_DATA_DIR))
+            nltk.download('stopwords', download_dir=str(NLTK_DATA_DIR))
+            nltk.download('wordnet', download_dir=str(NLTK_DATA_DIR))
+            nltk.download('averaged_perceptron_tagger', download_dir=str(NLTK_DATA_DIR))
+            logger.info("Successfully downloaded NLTK data")
+        except Exception as download_error:
+            logger.error(f"Failed to download NLTK data: {download_error}")
+            raise
+
+except ImportError as e:
+    logger.error(f"Failed to import required modules: {e}")
+    raise
+
+# Health check endpoint
+@app.get("/health")
+async def health_check():
+    """Health check endpoint for monitoring"""
+    return {
+        "status": "healthy",
+        "environment": os.getenv("ENV", "development"),
+        "nltk_data": str(NLTK_DATA_DIR),
+        "upload_dir": str(UPLOAD_DIR),
+        "processed_dir": str(PROCESSED_DIR)
+    }
+
+async def process_document(file_path: str):
+    """Process a document as a background task.
+
+    Args:
+        file_path: Path to the uploaded file
+
+    Returns:
+        dict: Processing results including status and metadata
+    """
+    logger.info(f"Starting document processing: {file_path}")
+
+    try:
+        # Verify file exists
+        if not os.path.exists(file_path):
+            error_msg = f"File not found: {file_path}"
+            logger.error(error_msg)
+            raise FileNotFoundError(error_msg)
+
+        # Extract text from PDF
+        logger.info(f"Extracting text from: {file_path}")
+        extracted_data = pdf_extractor.extract_text(file_path)
+
+        if not extracted_data or "text_by_page" not in extracted_data:
+            error_msg = f"Failed to extract text from: {file_path}"
+            logger.error(error_msg)
+            raise ValueError(error_msg)
+
+        # Combine text from all pages
+        full_text = " ".join([page["text"] for page in extracted_data["text_by_page"]])
+
+        if not full_text.strip():
+            error_msg = f"No text content found in: {file_path}"
+            logger.error(error_msg)
+            raise ValueError(error_msg)
+
+        # Generate summary
+        logger.info(f"Generating summary for: {file_path}")
+        try:
+            summary_result = document_summarizer.summarize_text(full_text)
+        except Exception as e:
+            logger.error(f"Error during summarization: {str(e)}")
+            summary_result = {"full_summary": "Summary generation failed", "chunk_summaries": []}
+
+        # Add to vector store
+        logger.info(f"Adding document to vector store: {file_path}")
+        metadata = {
+            "filename": os.path.basename(file_path),
+            "total_pages": extracted_data.get("total_pages", 0),
+            "summary": summary_result.get("full_summary", ""),
+            "timestamp": extracted_data.get("timestamp", ""),
+            "source": "upload"
+        }
+
+        try:
+            vector_store.add_document(full_text, metadata)
+        except Exception as e:
+            logger.error(f"Failed to add document to vector store: {str(e)}")
+            raise
+
+        # Save processed data
+        processed_path = pdf_extractor.save_extracted_text(
+            {
+                **extracted_data,
+                "summary": summary_result["full_summary"],
+                "chunk_summaries": summary_result["chunk_summaries"]
+            },
+            str(PROCESSED_DIR)
+        )
+
+        return {
+            "status": "success",
+            "processed_file": processed_path,
+            "summary": summary_result["full_summary"]
+        }
+
+    except Exception as e:
+        raise Exception(f"Error while processing the document: {str(e)}")
+
+@app.post("/upload/pdf")
+async def upload_pdf(
+    file: UploadFile = File(...),
+    background_tasks: BackgroundTasks = None
+):
+    """PDF file upload API"""
+    if not file.filename.lower().endswith('.pdf'):
+        raise HTTPException(status_code=400, detail="Only PDF files can be uploaded")
+
+    file_path = UPLOAD_DIR / file.filename
+
+    try:
+        # Save the uploaded file
+        with open(file_path, "wb") as buffer:
+            content = await file.read()
+            buffer.write(content)
+
+        # Process the document asynchronously in the background
+        background_tasks.add_task(process_document, str(file_path))
+
+        return {"filename": file.filename, "status": "processing"}
+
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=str(e))
+
+@app.get("/search")
+async def search_documents(query: str, top_k: int = 5):
+    """Document search API"""
+    try:
+        results = vector_store.search(query, top_k)
+        return {"results": results}
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=str(e))
+
+# Build the Gradio interface
+def process_file(file):
+    # Copy the uploaded temp file (provided by Gradio) into the upload directory
+    file_path = UPLOAD_DIR / os.path.basename(file.name)
+    with open(file.name, "rb") as src, open(file_path, "wb") as dst:
+        dst.write(src.read())
+
+    # process_document is a coroutine, so run it to completion here
+    result = asyncio.run(process_document(str(file_path)))
+    return result["summary"]
+
+def search(query):
+    # vector_store.search returns a list of result dicts
+    results = vector_store.search(query)
+    return "\n\n".join([f"{r['filename']} - similarity: {r['similarity']:.2f}" for r in results])
+
+with gr.Blocks() as demo:
+    gr.Markdown("# ParseAI PDF Analysis Service")
+
+    with gr.Tab("PDF Upload"):
+        file_input = gr.File(type="file", file_types=[".pdf"])
+        upload_button = gr.Button("Upload")
+        summary_output = gr.Textbox(label="Summary")
+
+        upload_button.click(
+            process_file,
+            inputs=[file_input],
+            outputs=[summary_output]
+        )
+
+    with gr.Tab("Document Search"):
+        search_input = gr.Textbox(label="Search query")
+        search_button = gr.Button("Search")
+        search_output = gr.Textbox(label="Search results")
+
+        search_button.click(
+            search,
+            inputs=[search_input],
+            outputs=[search_output]
+        )
+
+if __name__ == "__main__":
+    # Bind to all interfaces so the app is reachable from outside the container
+    demo.launch(server_name="0.0.0.0", server_port=7860)
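Note that `app.py` defines FastAPI routes (`/health`, `/upload/pdf`, `/search`) but the `__main__` block only starts the Gradio interface, so `demo.launch()` alone does not serve those routes. One possible way to serve both from a single process is to mount the Blocks app onto the FastAPI instance and run it with `uvicorn` (already pinned in `requirements.txt`). This is a sketch, not part of the commit, and it assumes the installed Gradio version provides `gr.mount_gradio_app`.

```python
# Sketch: serve the FastAPI routes and the Gradio UI together.
# Assumes gr.mount_gradio_app is available in the installed Gradio version.
import gradio as gr
import uvicorn

from app import app, demo  # FastAPI instance and Gradio Blocks defined in app.py

combined = gr.mount_gradio_app(app, demo, path="/")

if __name__ == "__main__":
    uvicorn.run(combined, host="0.0.0.0", port=7860)
```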
download_nltk_data.py
ADDED
@@ -0,0 +1,26 @@
+#!/usr/bin/env python3
+import nltk
+import os
+
+def download_nltk_data():
+    # Ensure the NLTK data directory exists
+    nltk_data_dir = os.getenv('NLTK_DATA', '/app/nltk_data')
+    os.makedirs(nltk_data_dir, exist_ok=True)
+
+    # Set NLTK data path
+    nltk.data.path.append(nltk_data_dir)
+
+    # Download required NLTK data
+    print("Downloading NLTK data...")
+    try:
+        nltk.download('punkt', download_dir=nltk_data_dir)
+        nltk.download('stopwords', download_dir=nltk_data_dir)
+        nltk.download('wordnet', download_dir=nltk_data_dir)
+        nltk.download('averaged_perceptron_tagger', download_dir=nltk_data_dir)
+        print("NLTK data downloaded successfully!")
+    except Exception as e:
+        print(f"Error downloading NLTK data: {e}")
+        raise
+
+if __name__ == "__main__":
+    download_nltk_data()
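In the container this script is invoked by `start.sh`; it can also be run ad hoc during local development. A small sketch, where `./nltk_data` is an arbitrary local target directory (not a path used by the repository):

```python
# Download the required NLTK corpora into a local directory instead of /app/nltk_data.
import os

os.environ["NLTK_DATA"] = "./nltk_data"  # illustrative target directory

from download_nltk_data import download_nltk_data

download_nltk_data()
```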
extractor.py
ADDED
@@ -0,0 +1,57 @@
+from PyPDF2 import PdfReader
+from typing import Dict, List
+import json
+from datetime import datetime
+import os
+from pathlib import Path
+
+class PDFExtractor:
+    def __init__(self):
+        pass
+
+    def extract_text(self, file_path: str) -> Dict:
+        """Extract text from a PDF file"""
+        try:
+            # Read the PDF file
+            pdf_reader = PdfReader(file_path)
+
+            # Check the number of pages
+            total_pages = len(pdf_reader.pages)
+
+            # Extract text page by page
+            text_by_page = []
+            for page_num, page in enumerate(pdf_reader.pages, 1):
+                text = page.extract_text()
+                if text:
+                    text_by_page.append({
+                        "page_number": page_num,
+                        "text": text
+                    })
+
+            # Return the result
+            return {
+                "filename": os.path.basename(file_path),
+                "total_pages": total_pages,
+                "extracted_pages": len(text_by_page),
+                "text_by_page": text_by_page,
+                "timestamp": datetime.now().isoformat()
+            }
+
+        except Exception as e:
+            raise Exception(f"Error while extracting PDF text: {str(e)}")
+
+    def save_extracted_text(self, extracted_data: Dict, output_dir: str) -> str:
+        """Save the extracted text as a JSON file"""
+        output_dir = Path(output_dir)
+        output_dir.mkdir(parents=True, exist_ok=True)
+
+        filename = f"extracted_{extracted_data['filename'].split('.')[0]}.json"
+        output_path = output_dir / filename
+
+        with open(output_path, 'w', encoding='utf-8') as f:
+            json.dump(extracted_data, f, ensure_ascii=False, indent=2)
+
+        return str(output_path)
+
+# Create a singleton instance
+pdf_extractor = PDFExtractor()
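A short usage sketch for the extractor; `sample.pdf` and the `processed` output directory are placeholders, not files in this repository:

```python
# Usage sketch for PDFExtractor.
from extractor import pdf_extractor

data = pdf_extractor.extract_text("sample.pdf")  # placeholder path
print(data["total_pages"], "pages,", data["extracted_pages"], "with text")

# Persist the extraction result as JSON alongside other processed files
out_path = pdf_extractor.save_extracted_text(data, "processed")
print(out_path)  # e.g. processed/extracted_sample.json
```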
requirements.txt
ADDED
@@ -0,0 +1,21 @@
+fastapi==0.68.0
+uvicorn==0.15.0
+python-multipart==0.0.5
+boto3>=1.34.12
+PyPDF2>=2.0.04
+nltk==3.6.3
+scikit-learn==0.24.2
+numpy>=1.21.0
+scipy>=1.10.1
+sentence-transformers==2.2.0
+datasets==2.0.0
+gradio==3.0.0
+pytest==6.2.5
+pytest-cov==2.12.1
+pytest-asyncio==0.15.1
+httpx==0.19.0
+python-dotenv==0.19.0
+huggingface-hub==0.31.4
+torch>=1.9.0
+transformers>=4.11.0
+pandas>=1.3.0
space.yml
ADDED
@@ -0,0 +1,11 @@
+build:
+  dockerfile: Dockerfile
+
+compute:
+  type: cpu-small
+
+ports:
+  - 7860
+
+env:
+  HF_TOKEN: ${{secrets.HF_TOKEN}}
start.sh
ADDED
@@ -0,0 +1,20 @@
+#!/bin/bash
+set -e
+
+# Create necessary directories
+mkdir -p /app/data/uploads
+mkdir -p /app/nltk_data
+mkdir -p /app/huggingface_cache
+
+# Set permissions
+chown -R 1000:1000 /app
+chmod -R 755 /app
+
+# Download NLTK data if not present
+if [ ! -d "/app/nltk_data/tokenizers" ]; then
+    echo "Downloading NLTK data..."
+    python /app/download_nltk_data.py
+fi
+
+# Start the application
+exec python /app/app.py
summarizer.py
ADDED
@@ -0,0 +1,120 @@
+import nltk
+from typing import Dict, List
+import json
+from datetime import datetime
+import heapq
+
+class DocumentSummarizer:
+    def __init__(self):
+        # Download NLTK data
+        try:
+            nltk.download('punkt', download_dir='/app/nltk_data')
+            nltk.download('stopwords', download_dir='/app/nltk_data')
+            nltk.data.path.append('/app/nltk_data')
+        except Exception as e:
+            print(f"Warning: NLTK data download failed: {str(e)}")
+
+        # Text chunk size setting
+        self.chunk_size = 1000  # maximum chunk length in characters
+        try:
+            self.tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
+        except Exception as e:
+            print(f"Warning: Failed to load tokenizer: {str(e)}")
+            self.tokenizer = nltk.tokenize.sent_tokenize
+
+    def summarize_text(self, text: str) -> Dict:
+        """Summarize the text"""
+        try:
+            # Split the text into chunks
+            chunks = self._split_text(text)
+
+            # Generate a summary for each chunk
+            summaries = []
+            for chunk in chunks:
+                summary = self._summarize_chunk(chunk)
+                if summary:
+                    summaries.append(summary)
+
+            return {
+                "timestamp": datetime.now().isoformat(),
+                "full_summary": " ".join(summaries),
+                "chunk_summaries": summaries
+            }
+
+        except Exception as e:
+            raise Exception(f"Error while generating the summary: {str(e)}")
+
+    def _summarize_chunk(self, text: str) -> str:
+        """Summarize an individual text chunk"""
+        try:
+            # Preprocess the text
+            words = nltk.word_tokenize(text.lower())
+            sentences = nltk.sent_tokenize(text)
+
+            # Remove stopwords
+            stop_words = set(nltk.corpus.stopwords.words('english'))
+            words = [word for word in words if word.isalnum() and word not in stop_words]
+
+            # Compute word frequencies
+            word_frequencies = {}
+            for word in words:
+                if word not in word_frequencies:
+                    word_frequencies[word] = 1
+                else:
+                    word_frequencies[word] += 1
+
+            # Find the maximum frequency
+            max_frequency = max(word_frequencies.values())
+
+            # Normalize the frequencies
+            for word in word_frequencies:
+                word_frequencies[word] = word_frequencies[word] / max_frequency
+
+            # Score each sentence
+            sentence_scores = {}
+            for sentence in sentences:
+                for word, freq in word_frequencies.items():
+                    if word in sentence.lower():
+                        if sentence not in sentence_scores:
+                            sentence_scores[sentence] = freq
+                        else:
+                            sentence_scores[sentence] += freq
+
+            # Select the top 30% of sentences
+            summary_sentences = heapq.nlargest(
+                int(len(sentences) * 0.3),
+                sentence_scores,
+                key=sentence_scores.get
+            )
+
+            # Build the summary
+            return " ".join(summary_sentences)
+
+        except Exception as e:
+            print(f"Chunk summarization error: {str(e)}")
+            return ""
+
+    def _split_text(self, text: str) -> List[str]:
+        """Split the text into appropriately sized chunks"""
+        try:
+            sentences = nltk.sent_tokenize(text)
+            chunks = []
+            current_chunk = ""
+
+            for sentence in sentences:
+                if len(current_chunk) + len(sentence) <= self.chunk_size:
+                    current_chunk += " " + sentence
+                else:
+                    chunks.append(current_chunk.strip())
+                    current_chunk = sentence
+
+            if current_chunk:
+                chunks.append(current_chunk.strip())
+
+            return chunks
+
+        except Exception as e:
+            raise Exception(f"Error while splitting the text: {str(e)}")
+
+# Create a singleton instance
+document_summarizer = DocumentSummarizer()
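The summarizer is a frequency-based extractive approach: sentences are scored by the normalized frequencies of the words they contain and the top 30% are kept. A minimal usage sketch, assuming the NLTK `punkt` and `stopwords` data are available locally; the input text is illustrative:

```python
# Usage sketch for DocumentSummarizer with illustrative input text.
from summarizer import document_summarizer

text = (
    "ParseAI processes PDF documents. It extracts text with PyPDF2. "
    "Sentences are scored by normalized word frequency. "
    "The highest-scoring sentences form the extractive summary."
)
result = document_summarizer.summarize_text(text)
print(result["full_summary"])
print(len(result["chunk_summaries"]), "chunk summaries")
```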
vector_store.py
ADDED
@@ -0,0 +1,162 @@
+import json
+from typing import Dict, List
+from pathlib import Path
+import numpy as np
+from datetime import datetime
+from sentence_transformers import SentenceTransformer
+from sklearn.metrics.pairwise import cosine_similarity
+from huggingface_hub import HfApi, hf_hub_download
+import os
+
+class VectorStore:
+    def __init__(self):
+        self.documents = []
+        self.metadata = []  # Stores per-document metadata
+        self.model = SentenceTransformer('all-MiniLM-L6-v2')
+        self.hf_api = HfApi()
+        self.dataset_name = "bluewhale2025/parseai_202506"  # Hugging Face dataset name
+
+        # Create the dataset if it does not exist yet
+        try:
+            self.hf_api.create_repo(
+                repo_id=self.dataset_name,
+                repo_type="dataset",
+                private=True  # Keep the dataset private
+            )
+            print(f"Dataset {self.dataset_name} created")
+        except Exception as e:
+            print(f"Error while creating the dataset: {str(e)}")
+
+    def add_document(self, text: str, metadata: Dict) -> None:
+        """Store a document"""
+        try:
+            # Store the raw document text
+            self.documents.append(text)
+
+            # Store the metadata
+            metadata["timestamp"] = str(datetime.now())
+            self.metadata.append(metadata)
+
+            # Generate the embedding vector
+            vector = self.model.encode(text)
+
+            # Local staging paths for the files to upload
+            vector_path = f"vectors/{len(self.documents)}.npy"
+            metadata_path = f"metadata/{len(self.documents)}.json"
+            os.makedirs("vectors", exist_ok=True)
+            os.makedirs("metadata", exist_ok=True)
+
+            # Save to temporary local files
+            np.save(vector_path, vector)
+            with open(metadata_path, 'w', encoding='utf-8') as f:
+                json.dump(metadata, f)
+
+            # Upload to Hugging Face
+            self.hf_api.upload_file(
+                path_or_fileobj=vector_path,
+                path_in_repo=vector_path,
+                repo_id=self.dataset_name,
+                repo_type="dataset"
+            )
+
+            self.hf_api.upload_file(
+                path_or_fileobj=metadata_path,
+                path_in_repo=metadata_path,
+                repo_id=self.dataset_name,
+                repo_type="dataset"
+            )
+
+            # Remove the temporary files
+            os.remove(vector_path)
+            os.remove(metadata_path)
+
+        except Exception as e:
+            raise Exception(f"Error while storing the document: {str(e)}")
+
+    def search(self, query: str, top_k: int = 5) -> List[Dict]:
+        """Keyword search"""
+        try:
+            # Encode the query into a vector
+            query_vector = self.model.encode(query)
+
+            # Load all vectors and metadata from Hugging Face
+            vectors = []
+            metadata = []
+
+            # List all files in the dataset repo
+            files = self.hf_api.list_repo_files(
+                repo_id=self.dataset_name,
+                repo_type="dataset"
+            )
+
+            # Sort the files (numbering starts at 1)
+            vector_files = sorted([f for f in files if f.startswith("vectors/")])
+            metadata_files = sorted([f for f in files if f.startswith("metadata/")])
+
+            if not vector_files or not metadata_files:
+                return []
+
+            # Download and load each vector/metadata pair
+            for vector_file, metadata_file in zip(vector_files, metadata_files):
+                vector_local = hf_hub_download(
+                    repo_id=self.dataset_name,
+                    filename=vector_file,
+                    repo_type="dataset"
+                )
+                vectors.append(np.load(vector_local))
+
+                metadata_local = hf_hub_download(
+                    repo_id=self.dataset_name,
+                    filename=metadata_file,
+                    repo_type="dataset"
+                )
+                with open(metadata_local, 'r', encoding='utf-8') as f:
+                    metadata.append(json.load(f))
+
+            # Compute cosine similarities
+            similarities = cosine_similarity(vectors, [query_vector]).flatten()
+
+            # Sort by similarity
+            sorted_idx = np.argsort(similarities)[::-1][:top_k]
+
+            # Build the results
+            results = []
+            for idx in sorted_idx:
+                results.append({
+                    "filename": metadata[idx]["filename"],
+                    "total_pages": metadata[idx]["total_pages"],
+                    "summary": metadata[idx]["summary"],
+                    "timestamp": metadata[idx]["timestamp"],
+                    "similarity": float(similarities[idx])
+                })
+
+            return results
+
+        except Exception as e:
+            raise Exception(f"Error during search: {str(e)}")
+
+    def _save_metadata(self) -> None:
+        """Save metadata"""
+        try:
+            Path(self.metadata_path).parent.mkdir(parents=True, exist_ok=True)
+            with open(self.metadata_path, 'w', encoding='utf-8') as f:
+                json.dump({
+                    "documents": self.documents,
+                    "metadata": self.metadata
+                }, f, ensure_ascii=False, indent=2)
+        except Exception as e:
+            raise Exception(f"Error while saving metadata: {str(e)}")
+
+    def _load_metadata(self):
+        """Load metadata"""
+        try:
+            if Path(self.metadata_path).exists():
+                with open(self.metadata_path, 'r', encoding='utf-8') as f:
+                    data = json.load(f)
+                    self.documents = data["documents"]
+                    self.metadata = data["metadata"]
+        except Exception as e:
+            raise Exception(f"Error while loading metadata: {str(e)}")
+
+    def load(self) -> None:
+        """Load stored metadata"""
+        self._load_metadata()
+
+# Create a singleton instance
+vector_store = VectorStore()
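Because the store persists every vector to a Hugging Face dataset repository, using it requires a token with write access to `bluewhale2025/parseai_202506` (for example via `HUGGINGFACE_HUB_TOKEN`). A minimal sketch under that assumption; the document text and metadata below are illustrative:

```python
# Usage sketch for VectorStore; requires a Hugging Face token with write access
# to the dataset repository configured for the store.
from vector_store import vector_store

vector_store.add_document(
    "ParseAI indexes document text as sentence-transformer embeddings.",
    {"filename": "example.txt", "total_pages": 1, "summary": "Example document."},
)

for hit in vector_store.search("document embeddings", top_k=3):
    print(hit["filename"], round(hit["similarity"], 2))
```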