Charles Grandjean committed on
Commit 7414a53 · 1 Parent(s): d81d6f6

migrate data

DEPLOYMENT.md ADDED
@@ -0,0 +1,237 @@
+ # Deployment Guide - CyberLegalAI Knowledge Graph on Hugging Face
+
+ ## Overview
+
+ This guide explains how to deploy CyberLegalAI with the knowledge graph stored in a Hugging Face dataset, which frees 842 MB in the main Space repo.
+
+ ## Advantages of this approach
+
+ ✅ **842 MB freed** in the Space repo (stays within the 1 GB limit)
+ ✅ **Cached downloads**: files already in the local cache are not downloaded again
+ ✅ **Fast startup** after the first download
+ ✅ **Multi-jurisdiction** support built in
+ ✅ **Maintainability**: new jurisdictions are easy to add
+ ✅ **Robustness**: the data is stored in a separate dataset
+
+ ## Architecture
+
+ ```
+ Hugging Face Space (main repo)
+ ├── Application code
+ └── Configuration
+
+ Hugging Face Dataset (separate)
+ └── data/rag_storage/
+     ├── romania/ (~267 MB)
+     └── bahrain/ (~575 MB)
+ ```
+
+ ## Deployment steps
+
+ ### 1. Prerequisites
+
+ - Hugging Face account with access to Spaces
+ - Hugging Face access token with read permission on datasets (write permission is also needed to create and upload the dataset in step 2)
+ - Hugging Face CLI installed: `pip install huggingface-hub`
+
+ ### 2. Create the Hugging Face dataset
+
+ ```bash
+ # Create the dataset
+ huggingface-cli repo create Cyberlgl/CyberLegalAI-knowledge-graph --type dataset
+
+ # Upload the knowledge graph data
+ cd /Users/cgrdj/Documents/Code/Cyberlgl/CyberlegalAI
+ huggingface-cli upload Cyberlgl/CyberLegalAI-knowledge-graph data/rag_storage/ --repo-type dataset
+ ```
+
+ Check that the dataset was created: https://huggingface.co/datasets/Cyberlgl/CyberLegalAI-knowledge-graph
+
+ ### 3. Configure the Space
+
+ Add the following environment variables to your Hugging Face Space:
+
+ ```
+ HF_TOKEN=your_hf_token_here
+ JURISDICTIONS=romania,bahrain
+ HF_KNOWLEDGE_GRAPH_DATASET=Cyberlgl/CyberLegalAI-knowledge-graph
+ HF_HOME=/data/.huggingface
+ ```
+
+ ### 4. Clean up the main repo
+
+ Once the data has been transferred to the dataset:
+
+ ```bash
+ # Stop tracking the data in the main repo (the files stay on disk locally)
+ git rm -r --cached data/rag_storage/
+
+ # Add the directory to .gitignore so these files are not re-added
+ echo "data/rag_storage/" >> .gitignore
+
+ # Commit the changes
+ git add .gitignore
+ git commit -m "Exclude knowledge graph from repo - now served from Hugging Face dataset"
+ git push
+ ```
+
+ ### 5. Redeploy the Space
+
+ The Space will automatically:
+ 1. Download the knowledge graph from the dataset
+ 2. Cache it in `/data/.huggingface`
+ 3. Copy the files to `data/rag_storage/`
+ 4. Start the LightRAG servers and the API
+
+ ## Verifying the deployment
+
+ ### Startup logs
+
+ You should see the following in the Space logs:
+
+ ```
+ 📥 Checking for knowledge graph data...
+ 🚀 Knowledge graph not found, downloading from Hugging Face...
+ ================================================================================
+ 🚀 Starting Knowledge Graph Download
+ ================================================================================
+ 📦 Dataset: Cyberlgl/CyberLegalAI-knowledge-graph
+ 🌍 Jurisdictions: romania, bahrain
+ 💾 HF Cache: /data/.huggingface
+ 📁 Target Directory: data/rag_storage
+ ================================================================================
+
+ 📥 Processing jurisdiction: romania
+ ...
+ ✅ romania: 18 files copied (267.0 MB)
+
+ 📥 Processing jurisdiction: bahrain
+ ...
+ ✅ bahrain: 18 files copied (575.0 MB)
+
+ ================================================================================
+ 🎉 Knowledge Graph Download Complete!
+ ================================================================================
+ 📊 romania: 267.0 MB
+ 📊 bahrain: 575.0 MB
+
+ 💾 Total size: 842.0 MB
+ ================================================================================
+ ```
+
+ ### Subsequent restarts
+
+ On later restarts you will see:
+
+ ```
+ 📥 Checking for knowledge graph data...
+ ✅ Knowledge graph data already present, skipping download
+ ```
+
+ ## Maintenance
+
+ ### Updating the knowledge graph
+
+ 1. Update the data locally
+ 2. Upload the changes to the Hugging Face dataset:
+
+ ```bash
+ huggingface-cli upload Cyberlgl/CyberLegalAI-knowledge-graph data/rag_storage/ --repo-type dataset
+ ```
+
+ 3. Restart the Space to apply the changes
+
+ ### Adding a new jurisdiction
+
+ 1. Add the data under `data/rag_storage/nouvelle_juridiction/`
+ 2. Upload it to the Hugging Face dataset, passing the jurisdiction folder as the path in the repo so the files do not land at the dataset root:
+
+ ```bash
+ huggingface-cli upload Cyberlgl/CyberLegalAI-knowledge-graph data/rag_storage/nouvelle_juridiction/ nouvelle_juridiction/ --repo-type dataset
+ ```
+
+ 3. Update the `JURISDICTIONS` variable in the Space:
+ ```
+ JURISDICTIONS=romania,bahrain,nouvelle_juridiction
+ ```
+
+ 4. Restart the Space
+
+ ## Troubleshooting
+
+ ### "Dataset not found" error
+
+ **Symptom:** `Repo not found: Cyberlgl/CyberLegalAI-knowledge-graph`
+
+ **Solution:**
+ - Check that the dataset exists: https://huggingface.co/datasets/Cyberlgl/CyberLegalAI-knowledge-graph
+ - Check that the dataset ID in `HF_KNOWLEDGE_GRAPH_DATASET` is correct
+
+ ### "Invalid token" error
+
+ **Symptom:** `Invalid token passed`
+
+ **Solution:**
+ - Check that `HF_TOKEN` is set correctly in the Space
+ - Create a new token with read permission on datasets: https://huggingface.co/settings/tokens
+
+ ### Slow download
+
+ **Symptom:** The download takes a long time
+
+ **Solution:**
+ - The first download of 842 MB can take several minutes
+ - Later downloads reuse the cache and skip the network transfer
+ - Check that the Space's network permissions are correct
+
+ ### "Permission denied" error during download
+
+ **Symptom:** `PermissionError: [Errno 13] Permission denied: '/data/.huggingface'`
+
+ **Solution:**
+ - The script should create the directory automatically with the right permissions
+ - If the error persists, check the permissions in the Dockerfile
+
+ ## Environment variables
+
+ | Variable | Description | Default |
+ |----------|-------------|---------|
+ | `HF_TOKEN` | Hugging Face access token | (required) |
+ | `JURISDICTIONS` | Comma-separated list of jurisdictions to download | `romania,bahrain` |
+ | `HF_KNOWLEDGE_GRAPH_DATASET` | Hugging Face dataset ID | `Cyberlgl/CyberLegalAI-knowledge-graph` |
+ | `HF_HOME` | Hugging Face cache directory | `/data/.huggingface` |
+
+ ## Included scripts
+
+ ### `scripts/download_knowledge_graph.py`
+
+ Main script that handles the knowledge graph download.
+
+ **Features:**
+ - Automatic download from the Hugging Face dataset
+ - Persistent cache support to avoid repeated downloads
+ - Selective download per jurisdiction
+ - Detailed logs of the download process
+ - Copies the files to the application directory
+
+ For more details, see: `scripts/README.md`
+
+ ## Support
+
+ For questions or problems:
+ 1. See `scripts/README.md` for details on the scripts
+ 2. Check the Space logs for specific errors
+ 3. Open an issue on GitHub: https://github.com/Cgrandjean/CyberLegalAI
+
+ ## Migrating from the previous setup
+
+ If you are migrating from a setup where the data lived in the main repo:
+
+ 1. **Back up** the existing data locally
+ 2. **Create the dataset** (step 2)
+ 3. **Upload the data** (step 2)
+ 4. **Configure the Space** (step 3)
+ 5. **Clean up the repo** (step 4)
+ 6. **Redeploy** (step 5)
+
+ The data remains accessible and the system behaves as before, with the knowledge graph now served from the dataset instead of the Space repo.
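The settings listed in the environment variables table of DEPLOYMENT.md could be validated before any download is attempted, so that a missing token or a malformed dataset ID fails with a clear message. The following is a minimal sketch, not part of this commit: `preflight_check` is a hypothetical helper, and the `owner/name` pattern is an assumption about what a valid dataset ID looks like.

```python
import re

# Hypothetical helper, not part of the commit. The variable names follow the
# environment variables table in DEPLOYMENT.md; the pattern is an assumption.
_REPO_ID = re.compile(r"^[\w.-]+/[\w.-]+$")

DEFAULT_DATASET = "Cyberlgl/CyberLegalAI-knowledge-graph"


def preflight_check(env):
    """Return a list of configuration problems (empty when none are found)."""
    problems = []
    # The guide marks HF_TOKEN as required.
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")
    # The dataset ID falls back to the documented default when unset.
    dataset_id = env.get("HF_KNOWLEDGE_GRAPH_DATASET", DEFAULT_DATASET)
    if not _REPO_ID.match(dataset_id):
        problems.append(
            f"HF_KNOWLEDGE_GRAPH_DATASET is not of the form owner/name: {dataset_id!r}"
        )
    return problems
```

Calling `preflight_check(os.environ)` at the top of the download script and exiting when the list is non-empty would be one way to use it.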
Dockerfile CHANGED
@@ -11,6 +11,8 @@ ENV PYTHONIOENCODING=utf-8
  ENV LIGHTRAG_HOST=127.0.0.1
  ENV LIGHTRAG_PORT=9621
  ENV API_PORT=8000
+ ENV HF_HOME=/data/.huggingface
+ ENV JURISDICTIONS=romania,bahrain

  # Install system dependencies
  RUN apt-get update && apt-get install -y \
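The `JURISDICTIONS` default added here is a comma-separated string that the download script splits at runtime. A value edited by hand in the Space settings can easily pick up stray spaces, a trailing comma, or a duplicate name. The sketch below shows one way to normalize it; `parse_jurisdictions` is a hypothetical helper that does not appear in this commit.

```python
def parse_jurisdictions(raw, default="romania,bahrain"):
    """Split a comma-separated JURISDICTIONS value into clean, unique names.

    Hypothetical helper (not in the commit). Empty entries are dropped and
    the first occurrence of each name wins, so order is preserved.
    """
    names = []
    # Fall back to the documented default when the value is unset or empty.
    for part in (raw or default).split(","):
        name = part.strip()
        if name and name not in names:
            names.append(name)
    return names
```

For example, `parse_jurisdictions("romania, bahrain,,")` yields the two names without the empty trailing entries.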
data/lawyers.json DELETED
@@ -1,202 +0,0 @@
- [
-   {
-     "name": "Nader Bakri",
-     "experience_years": 8,
-     "specialty": "Cyber Law",
-     "presentation": "Experienced lawyer focusing on complex legal matters at the intersection of technology, business, and regulatory compliance. Provides practical and solution-oriented legal advice tailored to modern digital challenges.",
-     "areas_of_practice": [
-       "Criminal Law",
-       "Commercial Law",
-       "Civil Law",
-       "Administrative Law",
-       "Family Law",
-       "Cyber Law",
-       "IT Law",
-       "AI Law",
-       "Data Protection"
-     ]
-   },
-   {
-     "name": "Andrei Popescu",
-     "experience_years": 12,
-     "specialty": "Commercial & Corporate Law",
-     "presentation": "Seasoned legal professional with extensive experience advising companies on corporate governance, contracts, and commercial disputes at both national and international levels.",
-     "areas_of_practice": [
-       "Commercial Law",
-       "Corporate Law",
-       "Civil Law",
-       "Contract Law",
-       "Commercial Litigation",
-       "Arbitration"
-     ]
-   },
-   {
-     "name": "Maria Ionescu",
-     "experience_years": 9,
-     "specialty": "Data Protection & Privacy Law",
-     "presentation": "Specialized in data protection and privacy compliance, assisting organizations in aligning their operations with GDPR and international data protection standards.",
-     "areas_of_practice": [
-       "Data Protection",
-       "GDPR",
-       "Cyber Law",
-       "IT Law",
-       "Civil Law",
-       "Commercial Law",
-       "Compliance"
-     ]
-   },
-   {
-     "name": "Karim Al-Hassan",
-     "experience_years": 15,
-     "specialty": "International Business Law",
-     "presentation": "International lawyer advising multinational clients on cross-border transactions, regulatory frameworks, and international commercial contracts.",
-     "areas_of_practice": [
-       "International Commercial Law",
-       "Civil Law",
-       "Contract Law",
-       "Arbitration",
-       "Customs Law"
-     ]
-   },
-   {
-     "name": "Elena Radu",
-     "experience_years": 7,
-     "specialty": "Civil & Family Law",
-     "presentation": "Dedicated legal professional providing assistance in sensitive civil and family matters, with a strong focus on ethics, discretion, and client trust.",
-     "areas_of_practice": [
-       "Civil Law",
-       "Family Law",
-       "Matrimonial Law",
-       "Inheritance Law",
-       "Litigation"
-     ]
-   },
-   {
-     "name": "Victor Marinescu",
-     "experience_years": 14,
-     "specialty": "Criminal Law",
-     "presentation": "Experienced criminal defense lawyer representing clients in complex investigations and high-stakes criminal proceedings.",
-     "areas_of_practice": [
-       "Criminal Law",
-       "Criminal Procedure",
-       "Related Civil Claims",
-       "Litigation"
-     ]
-   },
-   {
-     "name": "Sophia Klein",
-     "experience_years": 10,
-     "specialty": "IT & Technology Law",
-     "presentation": "Technology-focused legal advisor assisting startups and technology companies with contracts, compliance, and intellectual property matters.",
-     "areas_of_practice": [
-       "IT Law",
-       "Cyber Law",
-       "AI Law",
-       "Intellectual Property Law",
-       "Commercial Law"
-     ]
-   },
-   {
-     "name": "Mihai Dumitrescu",
-     "experience_years": 18,
-     "specialty": "Administrative & Public Law",
-     "presentation": "Legal expert in administrative disputes and public procurement, representing both private entities and public authorities.",
-     "areas_of_practice": [
-       "Administrative Law",
-       "Public Law",
-       "Public Procurement",
-       "Administrative Litigation",
-       "Constitutional Law"
-     ]
-   },
-   {
-     "name": "Laura Petrescu",
-     "experience_years": 6,
-     "specialty": "Employment & Labor Law",
-     "presentation": "Advises employers and employees on labor relations, regulatory compliance, and employment-related dispute resolution.",
-     "areas_of_practice": [
-       "Employment Law",
-       "Labor Law",
-       "Civil Law",
-       "Employment Litigation",
-       "Compliance"
-     ]
-   },
-   {
-     "name": "Omar Khaled",
-     "experience_years": 11,
-     "specialty": "Cybercrime & Digital Evidence",
-     "presentation": "Specialist in cybercrime cases and digital investigations, with strong expertise in electronic evidence and forensic collaboration.",
-     "areas_of_practice": [
-       "Cyber Law",
-       "Criminal Law",
-       "Cybercrime",
-       "IT Law",
-       "Digital Evidence"
-     ]
-   },
-   {
-     "name": "Ana-Maria Stoica",
-     "experience_years": 13,
-     "specialty": "Intellectual Property Law",
-     "presentation": "Provides strategic legal protection for brands, software, and creative works in both domestic and international markets.",
-     "areas_of_practice": [
-       "Intellectual Property Law",
-       "Commercial Law",
-       "IT Law",
-       "Copyright",
-       "Trademarks"
-     ]
-   },
-   {
-     "name": "Daniel Weiss",
-     "experience_years": 16,
-     "specialty": "Arbitration & Litigation",
-     "presentation": "Experienced litigator representing clients in complex commercial disputes before courts and arbitral tribunals.",
-     "areas_of_practice": [
-       "Litigation",
-       "Arbitration",
-       "Commercial Law",
-       "Civil Law",
-       "Private International Law"
-     ]
-   },
-   {
-     "name": "Raluca Neagu",
-     "experience_years": 8,
-     "specialty": "Compliance & Regulatory Law",
-     "presentation": "Advises companies on regulatory compliance, internal governance, and risk management frameworks.",
-     "areas_of_practice": [
-       "Compliance",
-       "Regulatory Law",
-       "Commercial Law",
-       "Administrative Law",
-       "Data Protection"
-     ]
-   },
-   {
-     "name": "Hassan Farouk",
-     "experience_years": 20,
-     "specialty": "Banking & Financial Law",
-     "presentation": "Senior legal advisor with extensive experience in banking regulation, financial transactions, and risk mitigation.",
-     "areas_of_practice": [
-       "Banking Law",
-       "Financial Law",
-       "Compliance",
-       "Commercial Law"
-     ]
-   },
-   {
-     "name": "Ioana Vasilescu",
-     "experience_years": 5,
-     "specialty": "AI & Emerging Technologies Law",
-     "presentation": "Focused on legal challenges related to artificial intelligence, automation, and emerging technologies, supporting innovation-driven organizations.",
-     "areas_of_practice": [
-       "AI Law",
-       "Cyber Law",
-       "IT Law",
-       "Data Protection",
-       "Commercial Law"
-     ]
-   }
- ]
requirements.txt CHANGED
@@ -26,3 +26,4 @@ langchain-tavily>=0.2.16
  resend>=0.8.0
  beautifulsoup4>=4.12.0
  httpx>=0.24.0
+ huggingface-hub>=0.20.0
scripts/download_knowledge_graph.py ADDED
@@ -0,0 +1,118 @@
+ #!/usr/bin/env python3
+ """
+ Download knowledge graph from Hugging Face dataset at container startup
+ """
+ import os
+ import shutil
+ import sys
+ import logging
+ from huggingface_hub import snapshot_download
+ from dotenv import load_dotenv
+
+ # Load environment variables
+ load_dotenv(dotenv_path=".env", override=False)
+
+ # Configure logging
+ logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+ logger = logging.getLogger(__name__)
+
+
+ def download_knowledge_graph():
+     """
+     Download knowledge graph from Hugging Face dataset and copy to application directory
+     """
+     # Configure Hugging Face cache to use persistent storage (honors HF_HOME if already set)
+     hf_home = os.getenv("HF_HOME", "/data/.huggingface")
+     os.environ["HF_HOME"] = hf_home
+     os.makedirs(hf_home, exist_ok=True)
+
+     # Get jurisdictions to download, ignoring empty entries such as a trailing comma
+     jurisdictions_str = os.getenv("JURISDICTIONS", "romania,bahrain")
+     jurisdictions = [j.strip() for j in jurisdictions_str.split(",") if j.strip()]
+
+     # Dataset configuration
+     dataset_id = os.getenv("HF_KNOWLEDGE_GRAPH_DATASET", "Cyberlgl/CyberLegalAI-knowledge-graph")
+     hf_token = os.getenv("HF_TOKEN")
+
+     # Target directory
+     target_base_dir = "data/rag_storage"
+     os.makedirs(target_base_dir, exist_ok=True)
+
+     logger.info("=" * 80)
+     logger.info("🚀 Starting Knowledge Graph Download")
+     logger.info("=" * 80)
+     logger.info(f"📦 Dataset: {dataset_id}")
+     logger.info(f"🌍 Jurisdictions: {', '.join(jurisdictions)}")
+     logger.info(f"💾 HF Cache: {hf_home}")
+     logger.info(f"📁 Target Directory: {target_base_dir}")
+     logger.info("=" * 80)
+
+     try:
+         for jurisdiction in jurisdictions:
+             logger.info(f"\n📥 Processing jurisdiction: {jurisdiction}")
+
+             # Download from dataset with filtering
+             local_path = snapshot_download(
+                 repo_id=dataset_id,
+                 repo_type="dataset",
+                 allow_patterns=[f"{jurisdiction}/*"],
+                 cache_dir=hf_home,
+                 token=hf_token
+             )
+
+             logger.info(f"✅ Downloaded to cache: {local_path}")
+
+             # Copy to application directory
+             dest_dir = os.path.join(target_base_dir, jurisdiction)
+             os.makedirs(dest_dir, exist_ok=True)
+
+             src_dir = os.path.join(local_path, jurisdiction)
+
+             if os.path.exists(src_dir):
+                 files_copied = 0
+                 total_size = 0
+
+                 for file in os.listdir(src_dir):
+                     src_file = os.path.join(src_dir, file)
+                     dest_file = os.path.join(dest_dir, file)
+
+                     # Copy file (assumes the jurisdiction folder is flat, with no subdirectories)
+                     shutil.copy2(src_file, dest_file)
+                     file_size = os.path.getsize(dest_file)
+                     total_size += file_size
+                     files_copied += 1
+
+                     logger.info(f"📄 Copied: {file} ({file_size / (1024*1024):.1f} MB)")
+
+                 logger.info(f"✅ {jurisdiction}: {files_copied} files copied ({total_size / (1024*1024):.1f} MB)")
+             else:
+                 logger.warning(f"⚠️ Jurisdiction directory not found in dataset: {src_dir}")
+
+         logger.info("\n" + "=" * 80)
+         logger.info("🎉 Knowledge Graph Download Complete!")
+         logger.info("=" * 80)
+
+         # Print summary
+         total_size = 0
+         for jurisdiction in jurisdictions:
+             jur_dir = os.path.join(target_base_dir, jurisdiction)
+             if os.path.exists(jur_dir):
+                 jur_size = sum(os.path.getsize(os.path.join(jur_dir, f)) for f in os.listdir(jur_dir))
+                 total_size += jur_size
+                 logger.info(f"📊 {jurisdiction}: {jur_size / (1024*1024):.1f} MB")
+
+         logger.info(f"\n💾 Total size: {total_size / (1024*1024):.1f} MB")
+         logger.info("=" * 80)
+
+         return True
+
+     except Exception as e:
+         logger.error("\n" + "=" * 80)
+         logger.error(f"❌ Error downloading knowledge graph: {e}")
+         logger.error("=" * 80)
+         return False
+
+
+ if __name__ == "__main__":
+     success = download_knowledge_graph()
+     sys.exit(0 if success else 1)
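The copy loop in `download_knowledge_graph.py` calls `shutil.copy2` on every entry of the jurisdiction folder, which works as long as that folder is flat. If a jurisdiction ever contained subdirectories, a recursive copy would be one option. The sketch below assumes Python 3.8 or later (for `dirs_exist_ok`); `copy_tree_with_stats` is a hypothetical name and not part of the commit.

```python
import os
import shutil


def copy_tree_with_stats(src_dir, dest_dir):
    """Recursively copy src_dir into dest_dir and report what dest_dir holds.

    Hypothetical helper (not in the commit). Returns (file_count, total_bytes)
    counted over dest_dir after the copy, so files that were already present
    in dest_dir are included in the totals.
    """
    # dirs_exist_ok lets the copy merge into an existing destination.
    # Symlinks are followed by default, so the real file contents are copied.
    shutil.copytree(src_dir, dest_dir, dirs_exist_ok=True)

    file_count = 0
    total_bytes = 0
    for root, _dirs, files in os.walk(dest_dir):
        for name in files:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(root, name))
    return file_count, total_bytes
```

Walking the destination also gives a recursive size, whereas the summary in the script only sums the top-level entries of each jurisdiction folder.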
startup.sh CHANGED
@@ -1,6 +1,22 @@
  #!/usr/bin/env bash
  set -euo pipefail

+ # Step 1: Download knowledge graph from Hugging Face
+ echo "📥 Checking for knowledge graph data..."
+ if [ ! -d "data/rag_storage/romania" ] || [ ! -d "data/rag_storage/bahrain" ]; then
+     echo "🚀 Knowledge graph not found, downloading from Hugging Face..."
+     # Under `set -e` a failing command exits the script before `$?` can be read, so test it directly
+     if ! python scripts/download_knowledge_graph.py; then
+         echo "❌ Failed to download knowledge graph. Exiting."
+         exit 1
+     fi
+     echo "✅ Knowledge graph download complete"
+ else
+     echo "✅ Knowledge graph data already present, skipping download"
+ fi
+
+ echo ""
+
  HOST="${LIGHTRAG_HOST:-127.0.0.1}"
  ROOT="${LIGHTRAG_STORAGE_ROOT:-data/rag_storage}"
  GRAPHS="${LIGHTRAG_GRAPHS:-romania:9621}"
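The check in `startup.sh` tests for the `romania` and `bahrain` directories by name, so a jurisdiction added through `JURISDICTIONS` would not trigger a download while those two directories are present. One way to make the check follow the configured list is sketched below in Python, the language of the download script. `missing_jurisdictions` is a hypothetical helper that does not exist in this commit, and treating an empty directory as missing is an assumption.

```python
import os


def missing_jurisdictions(storage_root, jurisdictions):
    """Return the jurisdictions with no populated directory under storage_root.

    Hypothetical helper (not in the commit). A jurisdiction counts as missing
    when its directory does not exist or exists but is empty.
    """
    missing = []
    for name in jurisdictions:
        path = os.path.join(storage_root, name)
        if not os.path.isdir(path) or not os.listdir(path):
            missing.append(name)
    return missing
```

The download script could call this with `data/rag_storage` and the parsed `JURISDICTIONS` list, then fetch only the names it returns.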