Claude committed on
Commit 92d9149 · unverified · 1 Parent(s): f6b05db

Enable vector DB upload to GitHub and add upload guide

Changes:
- Modified .gitignore to allow data/chroma_db/ upload
- Added UPLOAD_VECTOR_DB.md with step-by-step guide
- Includes both regular Git and Git LFS options
- PDF files still excluded from tracking

Files changed (2)
  1. .gitignore +1 -1
  2. UPLOAD_VECTOR_DB.md +163 -0
.gitignore CHANGED
@@ -36,7 +36,7 @@ ENV/
  *~
 
  # Data
- data/chroma_db/
+ # data/chroma_db/ ← the vector DB is uploaded to GitHub
  data/*.pdf
  *.pdf
 
UPLOAD_VECTOR_DB.md ADDED
@@ -0,0 +1,163 @@
+ # Vector DB GitHub Upload Guide
+
+ ## 📦 Current Status
+
+ - ✅ Code: uploaded to GitHub
+ - ❌ Vector DB: not yet generated locally
+
+ ## 🚀 Upload Steps
+
+ ### 1️⃣ Run indexing locally
+
+ ```bash
+ # 1. Clone the repository (on the local MacBook)
+ git clone https://github.com/csjjin2025/Hallucination_and_Deception_for_financial_RAG.git
+ cd Hallucination_and_Deception_for_financial_RAG
+
+ # 2. Set up the environment
+ python -m venv venv
+ source venv/bin/activate
+ pip install -r requirements.txt
+
+ # 3. Create the .env file
+ cp .env.example .env
+ nano .env # enter the API key and the PDF path
+
+ # 4. Index the PDFs (takes 30-60 minutes)
+ python scripts/index_pdfs.py
+ ```
+
+ ### 2️⃣ Verify that indexing finished
+
+ ```bash
+ # Check the vector DB folder
+ ls -la data/chroma_db/
+
+ # Check the contents
+ python scripts/check_vector_db.py
+ ```
+
+ **Expected result:**
+ ```
+ data/
+ └── chroma_db/
+     ├── chroma.sqlite3
+     ├── [UUID]/
+     └── ...
+ ```
+
+ ### 3️⃣ Upload to GitHub
+
+ #### Method A: Regular Git (vector DB < 100MB)
+
+ ```bash
+ # 1. Check the changes
+ git status
+
+ # 2. Add the vector DB
+ git add data/chroma_db/
+
+ # 3. Commit
+ git commit -m "Add indexed vector database (2,639 financial papers)"
+
+ # 4. Push
+ git push origin main
+ ```
+
+ #### Method B: Git LFS (vector DB > 100MB) ⭐ Recommended
+
+ If the vector DB is large, use Git LFS:
+
+ ```bash
+ # 1. Install Git LFS (macOS)
+ brew install git-lfs
+
+ # 2. Initialize Git LFS
+ git lfs install
+
+ # 3. Track the ChromaDB files
+ git lfs track "data/chroma_db/**/*"
+ git lfs track "*.sqlite3"
+
+ # 4. Add the .gitattributes file
+ git add .gitattributes
+
+ # 5. Add the vector DB
+ git add data/chroma_db/
+
+ # 6. Commit and push
+ git commit -m "Add indexed vector database via Git LFS"
+ git push origin main
+ ```
+
+ ### 4️⃣ Verify the upload
+
+ In the GitHub repository:
+ ```
+ data/
+ └── chroma_db/ ← this folder should be visible
+ ```
+
+ ## ⚠️ Cautions
+
+ ### 1. Size limits
+ - **Regular files on GitHub**: 100MB per-file limit
+ - **Git LFS**: 1GB/month on the free plan
+ - **Expected vector DB size**: 500MB ~ 2GB
+
+ ### 2. Check .gitignore
+ `data/chroma_db/` is currently commented out in `.gitignore`, so it can be uploaded
+
+ ### 3. The original PDFs are not uploaded
+ `.gitignore` still ignores `*.pdf` (this is expected)
+
+ ## 🔄 How Others Can Use It
+
+ Once the vector DB is on GitHub:
+
+ ```bash
+ # 1. Clone
+ git clone https://github.com/csjjin2025/Hallucination_and_Deception_for_financial_RAG.git
+ cd Hallucination_and_Deception_for_financial_RAG
+
+ # 2. Git LFS pull (only if LFS was used)
+ git lfs pull
+
+ # 3. Set up the environment
+ pip install -r requirements.txt
+ cp .env.example .env
+ nano .env # enter only the API key
+
+ # 4. Ready to use! (no indexing needed)
+ uvicorn app.main:app --reload
+ ```
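
If the database was pushed through Git LFS and the `git lfs pull` step is skipped, the cloned `chroma.sqlite3` is a small text pointer instead of the real database. The following is a minimal sketch of a check for that case, assuming the `data/chroma_db/chroma.sqlite3` path from the expected-result tree in the guide. `classify_db_file` is a hypothetical helper, not something that exists in the repository; it only inspects the first bytes of the file (the Git LFS pointer header and the SQLite file magic).

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"
# The 16-byte header that starts every SQLite 3 database file.
SQLITE_MAGIC = b"SQLite format 3\x00"


def classify_db_file(path):
    """Return 'lfs-pointer', 'sqlite', or 'unknown' from the file's first bytes."""
    with open(path, "rb") as fh:
        head = fh.read(64)
    if head.startswith(LFS_POINTER_PREFIX):
        return "lfs-pointer"
    if head.startswith(SQLITE_MAGIC):
        return "sqlite"
    return "unknown"


if __name__ == "__main__":
    # Assumed location, taken from the directory tree shown in the guide.
    db_file = Path("data/chroma_db/chroma.sqlite3")
    if not db_file.is_file():
        print(f"{db_file} is missing")
    else:
        kind = classify_db_file(db_file)
        if kind == "lfs-pointer":
            print("found an LFS pointer; run `git lfs pull` to fetch the database")
        elif kind == "sqlite":
            print("database file has a valid SQLite header")
        else:
            print("unrecognized file contents")
```

Running it from the repository root before starting `uvicorn` gives a quick signal on whether the LFS objects were actually downloaded.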
+
+ ## 💡 Alternative: Cloud Storage
+
+ If the vector DB is too large:
+ - **AWS S3**
+ - **Google Cloud Storage**
+ - **Dropbox/Google Drive**
+
+ Add a download link to the README:
+ ```markdown
+ ## Vector Database Download
+
+ Vector DB download: [link](https://drive.google.com/...)
+
+ After downloading, extract it into `data/chroma_db/`
+ ```
+
+ ## 📊 Checking the Size
+
+ ```bash
+ # Check the vector DB size
+ du -sh data/chroma_db/
+
+ # Expected output
+ # 800M data/chroma_db/
+ ```
+
+ ---
+
+ **Next step:** run indexing on the local MacBook, then follow this guide to upload! 🚀
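
The choice between Method A and Method B in the guide depends on file sizes, and GitHub's 100MB limit applies to each file, not to the folder total that `du -sh` reports. The following is a minimal sketch that lists which files would need Git LFS, assuming the `data/chroma_db` path from the guide. `summarize_db` and `GITHUB_FILE_LIMIT` are hypothetical names introduced for illustration and are not part of the repository.

```python
from pathlib import Path

# Per-file limit for regular (non-LFS) pushes to GitHub: 100 MiB.
GITHUB_FILE_LIMIT = 100 * 1024 * 1024


def summarize_db(db_dir, limit=GITHUB_FILE_LIMIT):
    """Return (total_bytes, oversized) for all files under db_dir.

    oversized is a list of (relative_path, size_in_bytes) for files above limit.
    """
    root = Path(db_dir)
    total = 0
    oversized = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            total += size
            if size > limit:
                oversized.append((path.relative_to(root).as_posix(), size))
    return total, oversized


if __name__ == "__main__":
    db_dir = Path("data/chroma_db")  # assumed location, as used in the guide
    if not db_dir.is_dir():
        print(f"{db_dir} not found; run the indexing step first")
    else:
        total, oversized = summarize_db(db_dir)
        print(f"total size: {total / 1024**2:.1f} MiB")
        for name, size in oversized:
            print(f"needs Git LFS: {name} ({size / 1024**2:.1f} MiB)")
        if not oversized:
            print("no single file exceeds the limit; regular Git may be enough")
```

A folder can total far more than 100MB and still push with regular Git if every file is small, although repository size recommendations still apply, so the per-file listing is the more useful signal here.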