Andrej Janchevski committed
Commit e99172e · 1 Parent(s): 8d87d9f

Add backend initialization plan

  1. .claude/plans/backend_init.md +283 -0
.claude/plans/backend_init.md ADDED
@@ -0,0 +1,283 @@
+ # Backend Initialization Plan
+
+ ## Context
+
+ Build the Django backend foundation: project scaffolding, dataset/entity/relation loading, health and discovery endpoints. NO inference/generation logic; that comes later. This covers all non-prediction endpoints from `docs/api.yaml`.
+
+ ## Project Structure
+
+ ```
+ src/backend/
+   manage.py
+   requirements.txt
+   research_api/
+     __init__.py
+     settings.py
+     urls.py
+     wsgi.py
+   api/
+     __init__.py
+     apps.py
+     urls.py
+     views/
+       __init__.py
+       health.py
+       coins.py
+       graph_generation.py
+       kg_anomaly.py
+     services/
+       __init__.py
+       registry.py       # Singleton: loads datasets at startup, scans checkpoints
+       coins_loader.py   # Thin search/pagination adapter around the research Loader
+     constants.py        # All static/hardcoded data (models, query structures, etc.)
+     pagination.py       # Custom page/page_size pagination helper
+     exceptions.py       # Standardized error envelope
+ ```
+
+ ## Key Design Decision: Reuse Research Loaders
+
+ Use the prebuilt `Loader` class from `graph_completion/graphs/load_graph.py` for COINs datasets and the equivalent loaders for generation. This ensures data loading matches the research code exactly (same entity/relation mappings, same splits). The Django service layer wraps these loaders with a thin adapter for search/pagination.
+
+ This requires adding `src/research/COINs-KGGeneration` and `src/research/MultiProxAn` to `sys.path` at startup.
+
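The `sys.path` step above could look roughly like the following. This is a minimal sketch: the helper name `add_research_paths` and the idea of passing the research root as an argument are assumptions, and only the two directory names come from the plan.

```python
import sys
from pathlib import Path

# Directory names come from the plan; everything else here is illustrative.
RESEARCH_SUBDIRS = ("COINs-KGGeneration", "MultiProxAn")


def add_research_paths(research_root: Path) -> list[str]:
    """Prepend each research repo to sys.path once and return the paths added."""
    added = []
    for name in RESEARCH_SUBDIRS:
        path = str((research_root / name).resolve())
        if path not in sys.path:
            # Prepending makes the research packages win over same-named modules.
            sys.path.insert(0, path)
            added.append(path)
    return added
```

Calling it a second time adds nothing, so it is safe to run from both `settings.py` and a management command.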
+ ### Dependencies for Loader
+
+ The research `Loader` needs `igraph`, `pandas`, `numpy`, and `torch` (for device handling). These must be in `requirements.txt` even for the init phase.
+
+ ## Model Registry (singleton)
+
+ `registry.py` is loaded once via `AppConfig.ready()`:
+
+ 1. **Downloads checkpoints from Google Drive** if not already present locally
+ 2. **Loads all 3 COINs datasets** using the research `Loader`
+ 3. **Scans checkpoint directories** for `.tar` / `.ckpt` files to determine availability flags (does NOT load model weights)
+ 4. **Generates sample subgraphs** for KG anomaly using the existing research code:
+    - `Loader.get_context_subgraph_dataset(max_graph_size, device)`: DFS-based partitioning of the full graph into sized subgraphs
+    - Internally calls `Sampler.get_context_subgraph_samples_dfs()` in `graph_completion/graphs/preprocess.py` (line 742)
+    - Returns `List[ContextSubgraphData]` (torch_geometric Data objects with `x`, `x_index`, `edge_index`, `edge_attr`, `subgraph_row`, `subgraph_col`)
+    - `context_subgraphs_to_tensors()` (preprocess.py line 809) converts raw DFS samples to tensors
+    - For the API sample-subgraphs endpoint: call this once per dataset at startup, store a subset, convert to the API response format (entity names + relation names resolved from the Loader's name maps)
+
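One possible shape for the singleton is sketched below. The `instance()` accessor and the injected `load_datasets` / `scan_checkpoints` callables are assumptions made so the example runs on its own; the real registry would call the research `Loader` and the checkpoint scan directly. The guard in `initialize()` matters because Django notes that `ready()` can, in some cases, run more than once.

```python
import threading


class ModelRegistry:
    """Process-wide holder for loaded datasets and checkpoint availability flags."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        self.initialized = False
        self.datasets = {}
        self.models_loaded = {}

    @classmethod
    def instance(cls):
        # Create the single shared registry lazily, under a lock.
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def initialize(self, load_datasets, scan_checkpoints):
        """Run the startup steps once; return False if they already ran."""
        with self._lock:
            if self.initialized:
                return False
            self.datasets = load_datasets()
            self.models_loaded = scan_checkpoints()
            self.initialized = True
            return True
```

From `AppConfig.ready()` this would be invoked as `ModelRegistry.instance().initialize(...)`, with views reading from the same instance afterwards.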
+ ### Checkpoint Download (on each boot)
+
+ Per CLAUDE.md: "backend downloads all files and places them in the `checkpoints` folders in `src/research` appropriately on each system boot".
+
+ Source: Google Drive folder `https://drive.google.com/drive/u/0/folders/14Bf8fi4KJn0rDdh9y8EFyA5b8OpQyXWi`
+
+ Download logic in `registry.py`:
+ 1. Use the `gdown` library to list and download files from the shared Google Drive folder
+ 2. For each file, check if it already exists locally (by filename + size) and skip it if present
+ 3. Place `.tar` files into `{COINS_COMPLETION_DIR}/checkpoints/`
+ 4. Place `.ckpt` files into the appropriate dir based on naming convention:
+    - `{dataset}_c.ckpt` / `{dataset}.ckpt` (MultiProxAn) -> `{MULTIPROXAN_DIR}/checkpoints/`
+    - `{dataset}_correct.ckpt` / `{dataset}.ckpt` (DiGress KG) -> `{DIGRESS_KG_DIR}/checkpoints/`
+ 5. Log download progress to console
+ 6. If download fails (no internet, quota exceeded), log a warning and continue; the `models_loaded` flags will be false
+
+ Add `gdown` to `requirements.txt`.
+
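Steps 2 to 4 above can be sketched without touching the network. Because the plain `{dataset}.ckpt` name is listed for both MultiProxAn and DiGress KG, this sketch assumes the dataset id disambiguates it: ids in `KG_DATASET_IDS` (taken from `KG_ANOMALY_DATASET_META`) go to DiGress KG and anything else to MultiProxAn. That rule, and the names `target_dir` and `needs_download`, are assumptions rather than part of the plan.

```python
from pathlib import Path
from typing import Optional

# Assumed to match the checkpoint filename stems for the KG anomaly datasets.
KG_DATASET_IDS = {"freebase", "wordnet", "nell"}


def target_dir(filename: str, coins_dir: Path, multiproxan_dir: Path,
               digress_dir: Path) -> Optional[Path]:
    """Pick the checkpoints folder for a downloaded file, or None if unrecognized."""
    name = Path(filename)
    if name.suffix == ".tar":
        return coins_dir / "checkpoints"
    if name.suffix != ".ckpt":
        return None
    stem = name.stem
    if stem.endswith("_correct"):
        return digress_dir / "checkpoints"
    if stem.endswith("_c"):
        return multiproxan_dir / "checkpoints"
    # Plain `{dataset}.ckpt`: fall back to the dataset id (assumption).
    base = digress_dir if stem in KG_DATASET_IDS else multiproxan_dir
    return base / "checkpoints"


def needs_download(local_path: Path, remote_size: int) -> bool:
    """Skip the download only when a local file with the same size exists."""
    return not (local_path.is_file() and local_path.stat().st_size == remote_size)
```

A size mismatch is treated the same as a missing file, so a partially downloaded checkpoint is fetched again on the next boot.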
+ ### Checkpoint Scanning (after download)
+
+ - COINs (link prediction): glob `{COINS_COMPLETION_DIR}/checkpoints/{dataset}_{algorithm}.tar`
+ - MultiProxAn (graph generation): glob `{MULTIPROXAN_DIR}/checkpoints/{dataset}.ckpt` and `{dataset}_c.ckpt`
+ - DiGress KG (anomaly correction): glob `{DIGRESS_KG_DIR}/checkpoints/{dataset}.ckpt` and `{dataset}_correct.ckpt`
+
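A minimal sketch of the scan follows, assuming algorithm names contain no underscore so that `{dataset}_{algorithm}` can be split on the last `_`. The function names and the `base` / `variant` result keys are hypothetical. No weights are loaded; only file presence is checked.

```python
from pathlib import Path


def scan_coins_checkpoints(checkpoint_dir: Path) -> set[tuple[str, str]]:
    """Return the (dataset, algorithm) pairs found as `{dataset}_{algorithm}.tar`."""
    found = set()
    for path in checkpoint_dir.glob("*_*.tar"):
        # Split on the last underscore (assumes no underscore in algorithm names).
        dataset, algorithm = path.stem.rsplit("_", 1)
        found.add((dataset, algorithm))
    return found


def scan_ckpt_variants(checkpoint_dir: Path, dataset: str, suffix: str) -> dict[str, bool]:
    """Report whether `{dataset}.ckpt` and `{dataset}{suffix}.ckpt` are present."""
    return {
        "base": (checkpoint_dir / f"{dataset}.ckpt").is_file(),
        "variant": (checkpoint_dir / f"{dataset}{suffix}.ckpt").is_file(),
    }
```

For MultiProxAn the suffix would be `_c`, and for DiGress KG it would be `_correct`.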
+ ## Constants (hardcoded from API spec)
+
+ `constants.py` contains all static data:
+ - `METHODS`: 3 research methods (id, name, thesis_section, description)
+ - `COINS_DATASET_META`: dataset id -> {name, description, data_dir}
+ - `COINS_MODELS`: 6 algorithms with supported_query_structures and available_datasets
+   - TransE, DistMult, ComplEx, RotatE, KBGAT: 1p only
+   - Q2B: all query structures (1p, 2p, 3p, 2i, 3i, ip, pi)
+ - `QUERY_STRUCTURES`: 7 query graph templates (nodes + edges arrays)
+ - `GRAPHGEN_DATASETS`: QM9 (molecular) + Community20 (synthetic)
+ - `GRAPHGEN_SAMPLING_MODES`: standard (T, chain_frames) + multiprox (T, m, t, t_prime)
+ - `KG_ANOMALY_DATASET_META`: freebase, wordnet, nell
+
+ ## Endpoints (12 total)
+
+ | Method | Path | View Class | Data Source |
+ |--------|------|-----------|-------------|
+ | `GET` | `/health` | `HealthView` | Registry availability flags |
+ | `GET` | `/methods` | `MethodsView` | `METHODS` constant |
+ | `GET` | `/coins/datasets` | `CoinsDatasetsView` | Registry + `COINS_DATASET_META` |
+ | `GET` | `/coins/datasets/{id}/entities` | `CoinsEntitiesView` | Registry dataset `search_entities()` |
+ | `GET` | `/coins/datasets/{id}/relations` | `CoinsRelationsView` | Registry dataset `search_relations()` |
+ | `GET` | `/coins/datasets/{id}/sample-triples` | `CoinsSampleTriplesView` | Registry dataset `sample_triples()` |
+ | `GET` | `/coins/models` | `CoinsModelsView` | `COINS_MODELS` filtered by available checkpoints |
+ | `GET` | `/coins/query-structures` | `CoinsQueryStructuresView` | `QUERY_STRUCTURES` constant |
+ | `GET` | `/graph-generation/datasets` | `GraphGenDatasetsView` | `GRAPHGEN_DATASETS` filtered by checkpoints |
+ | `GET` | `/graph-generation/sampling-modes` | `GraphGenSamplingModesView` | `GRAPHGEN_SAMPLING_MODES` constant |
+ | `GET` | `/kg-anomaly/datasets` | `KgAnomalyDatasetsView` | `KG_ANOMALY_DATASET_META` |
+ | `GET` | `/kg-anomaly/datasets/{id}/sample-subgraphs` | `KgAnomalySampleSubgraphsView` | Registry pre-computed subgraphs |
+
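The entity and relation endpoints in the table combine a substring search with `page` / `page_size` pagination. The sketch below shows one way the two helpers might fit together. The response keys (`items`, `page`, `page_size`, `total`), the default page size, and the case-insensitive match are assumptions; `docs/api.yaml` remains the authoritative schema.

```python
from typing import Any, Sequence


def search_by_name(names: Sequence[str], query: str = "") -> list[dict[str, Any]]:
    """Return {id, name} records whose name contains the query (case-insensitive)."""
    needle = query.casefold()
    return [
        {"id": index, "name": name}
        for index, name in enumerate(names)
        if needle in name.casefold()
    ]


def paginate_list(items: Sequence[Any], page: int = 1, page_size: int = 20) -> dict[str, Any]:
    """Slice a fully materialized list into one 1-indexed page."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive integers")
    start = (page - 1) * page_size
    return {
        "items": list(items[start:start + page_size]),
        "page": page,
        "page_size": page_size,
        "total": len(items),
    }
```

A view would then return something like `paginate_list(search_by_name(entity_names, q), page, page_size)`, mapping the `ValueError` to the invalid-request error type.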
+ ## Django Settings
+
+ - `DATABASES = {}`: no database
+ - `INSTALLED_APPS`: `rest_framework`, `corsheaders`, `api`
+ - `CORS_ALLOWED_ORIGINS`: `["https://bani57.pythonanywhere.com"]`
+ - `REST_FRAMEWORK`: JSON renderer only, no default pagination, custom exception handler
+ - Custom research path settings:
+   - `COINS_DATA_DIR` -> `src/research/COINs-KGGeneration/data/`
+   - `COINS_COMPLETION_DIR` -> `src/research/COINs-KGGeneration/graph_completion/`
+   - `DIGRESS_KG_DIR` -> `src/research/COINs-KGGeneration/graph_generation/` (DiGress model applied to KG data)
+   - `MULTIPROXAN_DIR` -> `src/research/MultiProxAn/`
+ - Each has its own checkpoints subdirectory:
+   - COINs (link prediction): `{COINS_COMPLETION_DIR}/checkpoints/`
+   - DiGress KG (anomaly correction): `{DIGRESS_KG_DIR}/checkpoints/`
+   - MultiProxAn (graph generation): `{MULTIPROXAN_DIR}/checkpoints/`
+
+ ## Error Handling
+
+ Custom DRF exception handler wrapping all errors in:
+ ```json
+ {"error": {"code": "NOT_FOUND", "message": "...", "details": {}}}
+ ```
+
+ `ApiError` base class with `NotFoundError`, `InvalidRequestError` subclasses.
+
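The exception hierarchy and envelope could be sketched as plain Python, independent of DRF. Only the `NOT_FOUND` code and the envelope shape come from the plan; the `INVALID_REQUEST` and `INTERNAL_ERROR` codes, the HTTP status values, and the `to_envelope` helper are assumptions for illustration.

```python
class ApiError(Exception):
    """Base error carrying a machine-readable code and optional details."""

    code = "INTERNAL_ERROR"  # assumed default code
    status_code = 500

    def __init__(self, message, details=None):
        super().__init__(message)
        self.message = message
        self.details = details or {}


class NotFoundError(ApiError):
    code = "NOT_FOUND"
    status_code = 404


class InvalidRequestError(ApiError):
    code = "INVALID_REQUEST"  # assumed code name
    status_code = 400


def to_envelope(exc: ApiError) -> dict:
    """Build the standardized error body for an ApiError."""
    return {"error": {"code": exc.code, "message": exc.message, "details": exc.details}}
```

A DRF exception handler would presumably check `isinstance(exc, ApiError)` and return a response with `to_envelope(exc)` as the body and `exc.status_code` as the status.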
+ ## Dependencies
+
+ The research repos have **two incompatible Python environments**:
+
+ ### COINs (graph_completion): Python 3.6.13
+ From `src/research/COINs-KGGeneration/requirements.txt`:
+ ```
+ -f https://download.pytorch.org/whl/cu113/torch_stable.html
+ -f https://data.pyg.org/whl/torch-1.10.2+cu113.html
+ torch==1.10.2+cu113
+ torch-geometric==2.0.3
+ torch-cluster==1.5.9
+ torch-scatter==2.0.9
+ torch-sparse==0.6.12
+ torch-spline-conv==1.2.1
+ igraph==0.9.11
+ numpy==1.19.5
+ pandas==1.1.5
+ scipy==1.5.4
+ PyYAML==6.0
+ ```
+ Install: `conda create --name coins python=3.6.13 && conda activate coins && pip install -r requirements.txt`
+
+ ### DiGress KG (graph_generation) + MultiProxAn: Python 3.9
+ From `src/research/COINs-KGGeneration/graph_generation/requirements.txt` and `src/research/MultiProxAn/requirements.txt` (identical):
+ ```
+ torch_geometric==2.3.1
+ pytorch_lightning==2.0.4
+ hydra-core==1.3.2
+ omegaconf==2.3.0
+ numpy==1.23
+ pandas==1.4
+ scipy==1.11.0
+ torchmetrics==0.11.4
+ wandb==0.15.4
+ imageio==2.31.1
+ matplotlib==3.7.1
+ networkx==2.8.7
+ pyemd==1.0.0
+ PyGSP==0.5.1
+ ```
+ Installation requires conda for rdkit and graph-tool, plus a specific torch index URL:
+ ```bash
+ conda create -c conda-forge -n digress rdkit=2023.03.2 python=3.9
+ conda activate digress
+ conda install -c conda-forge graph-tool=2.45
+ conda install -c "nvidia/label/cuda-11.8.0" cuda
+ pip install torch==2.0.1 --index-url https://download.pytorch.org/whl/cu118
+ pip install -r requirements.txt
+ pip install -e .
+ # Compile orca:
+ cd src/analysis/orca && g++ -O2 -std=c++11 -o orca orca.cpp
+ ```
+
+ ### Backend requirements.txt
+
+ The backend must unify both into one Python 3.9 environment. Since the COINs `Loader` is the only part reused for init, we pin to the DiGress/MultiProxAn versions and ensure COINs code runs under them. For PythonAnywhere (CPU only), use CPU torch wheels.
+
+ ```
+ # Django
+ django==4.2.*
+ djangorestframework==3.14.*
+ django-cors-headers==4.*
+
+ # Checkpoint download from Google Drive
+ gdown>=4.7
+
+ # PyTorch (CPU only for PythonAnywhere)
+ -f https://download.pytorch.org/whl/cpu/torch_stable.html
+ torch==2.0.1+cpu
+ torchvision==0.15.2+cpu
+ torchaudio==2.0.2+cpu
+
+ # PyTorch Geometric (must match torch version)
+ torch-geometric==2.3.1
+
+ # Research shared deps
+ pytorch-lightning==2.0.4
+ hydra-core==1.3.2
+ omegaconf==2.3.0
+ numpy==1.23
+ pandas==1.4
+ scipy==1.11.0
+ igraph==0.9.11
+ PyYAML==6.0
+ networkx==2.8.7
+ matplotlib==3.7.1
+ imageio==2.31.1
+ torchmetrics==0.11.4
+ tqdm==4.65.0
+
+ # Conda-only deps (must be pre-installed, not in pip requirements):
+ # rdkit==2023.03.2 (conda create -c conda-forge)
+ # graph-tool==2.45 (conda install -c conda-forge)
+ ```
+
+ **Note**: rdkit and graph-tool are conda-only packages. On PythonAnywhere (pip-only), either they must be available as pre-compiled builds or the code paths that use them must be guarded with try/except imports.
+
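The try/except guard mentioned in the note could be centralized in a small helper like the one below. The helper name `optional_import` and the `HAS_*` flags are hypothetical; `rdkit` and `graph_tool` are the import names of the two conda packages.

```python
import importlib


def optional_import(module_name: str):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Both resolve to None on a pip-only host where the conda packages are missing.
rdkit = optional_import("rdkit")
graph_tool = optional_import("graph_tool")

HAS_RDKIT = rdkit is not None
HAS_GRAPH_TOOL = graph_tool is not None
```

Code paths that need either package can then check the flag and raise a clear error, or skip the feature, instead of failing at import time.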
+ ## Implementation Sequence
+
+ 1. Write `requirements.txt` and install dependencies (django, DRF, cors-headers, torch, igraph, scipy, pandas, numpy)
+ 2. Create Django project scaffolding (`manage.py`, `research_api/` settings/urls/wsgi)
+ 3. Configure settings: CORS, DRF, research paths (`COINS_DATA_DIR`, `COINS_COMPLETION_DIR`, `DIGRESS_KG_DIR`, `MULTIPROXAN_DIR`), add research dirs to `sys.path`
+ 4. Create `api` app skeleton (`apps.py`, `urls.py`)
+ 5. Write `constants.py`: all static data from API spec (methods, models, query structures, dataset metadata, sampling modes)
+ 6. Write `registry.py`: `ModelRegistry` singleton with Google Drive checkpoint download, research `Loader` for COINs datasets, checkpoint scanning across all 3 dirs
+ 7. Write `pagination.py`: `paginate_list()` helper
+ 8. Write `exceptions.py`: custom error envelope formatting
+ 9. Write views: one class per endpoint, using registry + constants
+ 10. Wire URL routing (`research_api/urls.py` -> `api/urls.py`)
+ 11. Hook `ModelRegistry.initialize()` in `AppConfig.ready()`
+
+ ## Verification
+
+ 1. `pip install -r requirements.txt`: all dependencies install cleanly
+ 2. `python manage.py check`: Django system check passes (no DB warnings expected since `DATABASES = {}`)
+ 3. `python manage.py runserver`: registry downloads checkpoints from Google Drive (or skips if present), loads 3 COINs datasets via research `Loader`, scans all 3 checkpoint dirs, prints entity/relation counts
+ 4. `GET /api/v1/health`: returns `{"status": "ok", "models_loaded": {"coins": true/false, "multiproxan": true/false, "kg_anomaly": true/false}}` based on checkpoint scan
+ 5. `GET /api/v1/methods`: returns 3 methods
+ 6. `GET /api/v1/coins/datasets`: returns 3 datasets with correct entity/relation counts (FB15k-237: ~14541 entities, 237 relations; WN18RR: ~40943 entities, 11 relations; NELL-995: from entity2id.txt/relation2id.txt)
+ 7. `GET /api/v1/coins/datasets/freebase/entities?q=location&page=1&page_size=10`: filtered, paginated results
+ 8. `GET /api/v1/coins/datasets/freebase/relations?q=country`: substring search on relation names
+ 9. `GET /api/v1/coins/datasets/freebase/sample-triples?count=5`: 5 random triples with resolved entity/relation names
+ 10. `GET /api/v1/coins/datasets/unknown/entities`: `{"error": {"code": "NOT_FOUND", ...}}`
+ 11. `GET /api/v1/coins/models`: 6 algorithms, filtered by actually available checkpoints
+ 12. `GET /api/v1/coins/query-structures`: 7 templates (1p, 2p, 3p, 2i, 3i, ip, pi) matching API spec
+ 13. `GET /api/v1/graph-generation/datasets`: QM9 + Community20 with model type availability from MultiProxAn checkpoint scan
+ 14. `GET /api/v1/graph-generation/sampling-modes`: standard + multiprox with parameter specs
+ 15. `GET /api/v1/kg-anomaly/datasets`: 3 datasets with availability from DiGress KG checkpoint scan
+ 16. `GET /api/v1/kg-anomaly/datasets/freebase/sample-subgraphs?count=3`: 3 pre-computed subgraphs with entity/relation names
+ 17. Import `docs/postman/collection.json` into Postman, set environment, run all GET requests against running server
+
+ ## Critical Source Files
+
+ | File | Why |
+ |------|-----|
+ | `docs/api.yaml` | Authoritative API contract for all response schemas |
+ | `src/research/COINs-KGGeneration/graph_completion/graphs/load_graph.py` | Research `Loader` class reused for dataset loading |
+ | `src/research/COINs-KGGeneration/graph_data/load_data.py` | Dataset classes (FreeBase, WordNet, NELL) with file formats |
+ | `src/research/COINs-KGGeneration/data/` | Actual data files (train/valid/test.txt, entity2id.txt, relation2id.txt) |
+ | `src/research/MultiProxAn/checkpoints/` | MultiProxAn checkpoint dir (scanned for availability) |
+ | `src/research/COINs-KGGeneration/graph_generation/checkpoints/` | DiGress KG checkpoint dir (scanned for availability) |
+ | `src/research/COINs-KGGeneration/graph_completion/checkpoints/` | COINs checkpoint dir (scanned for availability) |
+ | `CLAUDE.md` | Line length 120, Ruff rules, deployment target |