Remote Indexing Guide

How to index PDF files stored on a local MacBook in the cloud

Option 1: Google Drive (recommended cloud option)

1️⃣ Upload the PDFs to Google Drive

# On the local MacBook
# Drag and drop the folder into the Google Drive app

2️⃣ Create a share link

  1. Right-click the folder in Google Drive
  2. "Share" → "Get link"
  3. Select "Anyone with the link"
  4. Copy the link
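
The download script in step 3 needs the folder ID rather than the full link. Below is a minimal sketch of pulling the ID out of a copied link, assuming the link has the usual `https://drive.google.com/drive/folders/<ID>` shape. The helper name `extract_folder_id` and the example link are hypothetical and not part of this repository.

```python
import re
from urllib.parse import urlparse


def extract_folder_id(share_link: str) -> str:
    """Return the folder ID from a Google Drive folder share link."""
    # Only the path is inspected, so query strings such as ?usp=sharing are ignored.
    path = urlparse(share_link).path
    match = re.search(r"/folders/([A-Za-z0-9_-]+)", path)
    if match is None:
        raise ValueError(f"No folder ID found in link: {share_link}")
    return match.group(1)


if __name__ == "__main__":
    # Hypothetical link, for illustration only.
    link = "https://drive.google.com/drive/folders/EXAMPLE_FOLDER_ID?usp=sharing"
    print(extract_folder_id(link))
```

The returned string is what would go into FOLDER_ID in the script.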

3๏ธโƒฃ ํด๋ผ์šฐ๋“œ์—์„œ ๋‹ค์šด๋กœ๋“œ & ์ธ๋ฑ์‹ฑ

# scripts/download_and_index.py
import subprocess
import sys

import gdown

# Google Drive folder ID (taken from the share link)
FOLDER_ID = "YOUR_FOLDER_ID_HERE"

# Download the PDFs
print("Downloading PDFs...")
gdown.download_folder(id=FOLDER_ID, output="data/pdfs", quiet=False)

# Build the index; check=True stops the script if a step fails
print("Starting indexing...")
subprocess.run([sys.executable, "scripts/index_pdfs.py"], check=True)

# Push the vector database to GitHub
subprocess.run(["git", "add", "data/chroma_db/"], check=True)
subprocess.run(["git", "commit", "-m", "Add vector database"], check=True)
subprocess.run(["git", "push"], check=True)

Run:

pip install gdown
python scripts/download_and_index.py

Option 2: Dropbox

1️⃣ Upload to Dropbox

Local MacBook → Dropbox folder

2️⃣ Create a share link

File/folder → Share → Copy link

3️⃣ Download and index

# Change dl=0 to dl=1 in the Dropbox link
wget "https://www.dropbox.com/...?dl=1" -O pdfs.zip

# Unzip
unzip pdfs.zip -d data/pdfs/

# Index
python scripts/index_pdfs.py

# Upload
./upload_to_github.sh
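
The dl=0 to dl=1 change can also be scripted instead of edited by hand. Here is a minimal sketch using only the standard library. The helper name `to_direct_download` is hypothetical, and the sketch assumes the share link carries its download flag in a `dl` query parameter, as in the example above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_direct_download(share_link: str) -> str:
    """Return the share link with its dl query parameter set to 1."""
    parts = urlsplit(share_link)
    # Keep every other query parameter and drop any existing dl value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "dl"
    ]
    query.append(("dl", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The result can be passed to wget in place of the hand-edited link.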

Option 3: AWS S3

1️⃣ Upload to an S3 bucket

# On the local MacBook
aws s3 sync /path/to/pdfs s3://your-bucket/pdfs/

2️⃣ Download in the cloud

# On GitHub Codespaces or EC2
aws s3 sync s3://your-bucket/pdfs/ data/pdfs/

# Index
python scripts/index_pdfs.py

Option 4: Run locally and upload only the vector DB (simplest)

1️⃣ Do all the work on the local MacBook

# Run the whole process locally
./setup.sh
./run_indexing.sh  # 30-60 minutes

2️⃣ Upload only the vector DB to GitHub

# After indexing finishes
./upload_to_github.sh

This is the simplest and safest approach!


Cost and Time Comparison

| Method | Upload time | Download time | Indexing time | Total time | Cost |
|---|---|---|---|---|---|
| Local run ⭐ | - | - | 30-60 min | 30-60 min | Free |
| Google Drive | 10-30 min | 10-30 min | 30-60 min | 50-120 min | Free |
| Dropbox | 10-30 min | 10-30 min | 30-60 min | 50-120 min | Free |
| AWS S3 | 10-30 min | 5-10 min | 30-60 min | 45-100 min | ~$1-2 |
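
The total-time column is the sum of the per-stage ranges, added separately for the lower and upper bounds. A small sketch of that arithmetic, with the figures copied from the table. The names `STAGES` and `total_range` are illustrative only.

```python
# (min, max) minutes per stage: upload, download, indexing.
# Values are copied from the comparison table; a local run has only the indexing stage.
STAGES = {
    "Local run": [(30, 60)],
    "Google Drive": [(10, 30), (10, 30), (30, 60)],
    "Dropbox": [(10, 30), (10, 30), (30, 60)],
    "AWS S3": [(10, 30), (5, 10), (30, 60)],
}


def total_range(stages):
    """Sum the lower bounds and the upper bounds of each stage independently."""
    return (sum(low for low, _ in stages), sum(high for _, high in stages))


for method, stages in STAGES.items():
    low, high = total_range(stages)
    print(f"{method}: {low}-{high} min")
```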

Recommended Approach

🏆 Best: run locally

# Fastest and simplest
./setup.sh
./run_indexing.sh
./upload_to_github.sh

🥈 Second best: Google Drive

Once the PDFs are uploaded, the rest can be automated.


Troubleshooting

Q: My internet connection is slow, so uploads take a long time

A: Index locally, then upload only the vector DB (500 MB to 2 GB)
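
Before pushing a vector DB of that size, it may help to check how large it actually is. GitHub blocks individual files larger than 100 MiB in a regular push, so any such file would need Git LFS or another storage approach. Below is a minimal sketch, assuming the database lives in data/chroma_db as in the Option 1 script. The helper name `scan_directory` is hypothetical.

```python
import os

# GitHub rejects individual files larger than 100 MiB in a regular push.
GITHUB_FILE_LIMIT = 100 * 1024 * 1024


def scan_directory(root: str, limit: int = GITHUB_FILE_LIMIT):
    """Return (total_bytes, oversized_files) for every file under root."""
    total = 0
    oversized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            total += size
            if size > limit:
                oversized.append((path, size))
    return total, oversized


if __name__ == "__main__":
    # Assumed location of the vector DB; adjust if the project stores it elsewhere.
    total, oversized = scan_directory("data/chroma_db")
    print(f"Total size: {total / 1024**2:.1f} MiB")
    for path, size in oversized:
        print(f"Over the per-file limit: {path} ({size / 1024**2:.1f} MiB)")
```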

Q: I'm worried about cloud costs

A: Running locally is free

Q: I want to automate this

A: It can be automated with GitHub Actions (see the separate guide)


Conclusion: in most cases, running locally is the best choice!