Andrej Janchevski committed on
Commit
0b4eacc
·
1 Parent(s): e99172e

Implement API skeleton

.claude/plans/backend_init.md CHANGED
@@ -21,17 +21,16 @@ src/backend/
21
  urls.py
22
  views/
23
  __init__.py
24
- health.py
25
  coins.py
26
  graph_generation.py
27
  kg_anomaly.py
28
  services/
29
  __init__.py
30
- registry.py # Singleton: loads datasets at startup, scans checkpoints
31
- coins_loader.py # Lightweight KG dataset reader (NOT reusing research Loader)
32
- constants.py # All static/hardcoded data (models, query structures, etc.)
33
- pagination.py # Custom page/page_size pagination helper
34
- exceptions.py # Standardized error envelope
35
  ```
36
 
37
  ## Key Design Decision: Reuse Research Loaders
@@ -49,8 +48,14 @@ The research `Loader` needs: `igraph`, `pandas`, `numpy`, `torch` (for device ha
49
  `registry.py` — loaded once via `AppConfig.ready()`:
50
 
51
  1. **Downloads checkpoints from Google Drive** if not already present locally
 
 
 
52
  2. **Loads all 3 COINs datasets** using the research `Loader`
53
  3. **Scans checkpoint directories** for `.tar` / `.ckpt` files to determine availability flags (does NOT load model weights)
 
 
 
54
  4. **Generates sample subgraphs** for KG anomaly using the existing research code:
55
  - `Loader.get_context_subgraph_dataset(max_graph_size, device)` — DFS-based partitioning of the full graph into sized subgraphs
56
  - Internally calls `Sampler.get_context_subgraph_samples_dfs()` in `graph_completion/graphs/preprocess.py` (line 742)
@@ -64,17 +69,17 @@ Per CLAUDE.md: "backend downloads all files and places them in the `checkpoints`
64
 
65
  Source: Google Drive folder `https://drive.google.com/drive/u/0/folders/14Bf8fi4KJn0rDdh9y8EFyA5b8OpQyXWi`
66
 
67
- Download logic in `registry.py`:
68
- 1. Use `gdown` library to list and download files from the shared Google Drive folder
69
- 2. For each file, check if it already exists locally (by filename + size) — skip if present
70
- 3. Place `.tar` files into `{COINS_COMPLETION_DIR}/checkpoints/`
71
- 4. Place `.ckpt` files into the appropriate dir based on naming convention:
72
- - `{dataset}_c.ckpt` / `{dataset}.ckpt` (MultiProxAn) -> `{MULTIPROXAN_DIR}/checkpoints/`
73
- - `{dataset}_correct.ckpt` / `{dataset}.ckpt` (DiGress KG) -> `{DIGRESS_KG_DIR}/checkpoints/`
74
- 5. Log download progress to console
75
- 6. If download fails (no internet, quota exceeded), log warning and continue — models_loaded flags will be false
76
 
77
- Add `gdown` to `requirements.txt`.
78
 
79
  ### Checkpoint Scanning (after download)
80
 
@@ -82,23 +87,40 @@ Add `gdown` to `requirements.txt`.
82
  - MultiProxAn (graph generation): glob `{MULTIPROXAN_DIR}/checkpoints/{dataset}.ckpt` and `{dataset}_c.ckpt`
83
  - DiGress KG (anomaly correction): glob `{DIGRESS_KG_DIR}/checkpoints/{dataset}.ckpt` and `{dataset}_correct.ckpt`
84
 
85
- ## Constants (hardcoded from API spec)
86
 
87
- `constants.py` contains all static data:
88
- - `METHODS` — 3 research methods (id, name, thesis_section, description)
89
- `COINS_DATASET_META`: dataset id -> {name, description, data_dir}
90
- `COINS_MODELS`: 6 algorithms with supported_query_structures and available_datasets
91
- - TransE, DistMult, ComplEx, RotatE, KBGAT: 1p only
92
- - Q2B: all query structures (1p, 2p, 3p, 2i, 3i, ip, pi)
93
- `QUERY_STRUCTURES`: 7 query graph templates (nodes + edges arrays)
94
- `GRAPHGEN_DATASETS`: QM9 (molecular) + Community20 (synthetic)
95
- `GRAPHGEN_SAMPLING_MODES`: standard (T, chain_frames) + multiprox (T, m, t, t_prime)
96
- `KG_ANOMALY_DATASET_META`: freebase, wordnet, nell
97
 
98
- ## Endpoints (12 total)
99
 
100
  | Method | Path | View Class | Data Source |
101
  |--------|------|-----------|-------------|
 
102
  | `GET` | `/health` | `HealthView` | Registry availability flags |
103
  | `GET` | `/methods` | `MethodsView` | `METHODS` constant |
104
  | `GET` | `/coins/datasets` | `CoinsDatasetsView` | Registry + `COINS_DATASET_META` |
@@ -112,30 +134,18 @@ Add `gdown` to `requirements.txt`.
112
  | `GET` | `/kg-anomaly/datasets` | `KgAnomalyDatasetsView` | `KG_ANOMALY_DATASET_META` |
113
  | `GET` | `/kg-anomaly/datasets/{id}/sample-subgraphs` | `KgAnomalySampleSubgraphsView` | Registry pre-computed subgraphs |
114
 
115
- ## Django Settings
116
-
117
- - `DATABASES = {}` — no database
118
- - `INSTALLED_APPS`: `rest_framework`, `corsheaders`, `api`
119
- - `CORS_ALLOWED_ORIGINS`: `["https://bani57.pythonanywhere.com"]`
120
- - `REST_FRAMEWORK`: JSON renderer only, no default pagination, custom exception handler
121
- - Custom research path settings:
122
- - `COINS_DATA_DIR` -> `src/research/COINs-KGGeneration/data/`
123
- - `COINS_COMPLETION_DIR` -> `src/research/COINs-KGGeneration/graph_completion/`
124
- - `DIGRESS_KG_DIR` -> `src/research/COINs-KGGeneration/graph_generation/` (DiGress model applied to KG data)
125
- - `MULTIPROXAN_DIR` -> `src/research/MultiProxAn/`
126
- - Each has its own checkpoints subdirectory:
127
- - COINs (link prediction): `{COINS_COMPLETION_DIR}/checkpoints/`
128
- - DiGress KG (anomaly correction): `{DIGRESS_KG_DIR}/checkpoints/`
129
- - MultiProxAn (graph generation): `{MULTIPROXAN_DIR}/checkpoints/`
130
-
131
- ## Error Handling
132
-
133
- Custom DRF exception handler wrapping all errors in:
134
- ```json
135
- {"error": {"code": "NOT_FOUND", "message": "...", "details": {}}}
136
- ```
137
 
138
- `ApiError` base class with `NotFoundError`, `InvalidRequestError` subclasses.
139
 
140
  ## Dependencies
141
 
@@ -191,9 +201,11 @@ pip install -e .
191
  cd src/analysis/orca && g++ -O2 -std=c++11 -o orca orca.cpp
192
  ```
193
 
 
 
194
  ### Backend requirements.txt
195
 
196
- The backend must unify both into one Python 3.9 environment. Since the COINs `Loader` is the only part reused for init, we pin to the DiGress/MultiProxAn versions and ensure COINs code runs under them. For PythonAnywhere (CPU only), use CPU torch wheels.
197
 
198
  ```
199
  # Django
@@ -237,37 +249,40 @@ tqdm==4.65.0
237
 
238
  ## Implementation Sequence
239
 
240
- 1. Write `requirements.txt` and install dependencies (django, DRF, cors-headers, torch, igraph, scipy, pandas, numpy)
241
  2. Create Django project scaffolding (`manage.py`, `research_api/` settings/urls/wsgi)
242
  3. Configure settings — CORS, DRF, research paths (`COINS_DATA_DIR`, `COINS_COMPLETION_DIR`, `DIGRESS_KG_DIR`, `MULTIPROXAN_DIR`), add research dirs to `sys.path`
243
  4. Create `api` app skeleton (`apps.py`, `urls.py`)
244
- 5. Write `constants.py` — all static data from API spec (methods, models, query structures, dataset metadata, sampling modes)
245
- 6. Write `registry.py` — `ModelRegistry` singleton: Google Drive checkpoint download, research `Loader` for COINs datasets, checkpoint scanning across all 3 dirs
246
- 7. Write `pagination.py` — `paginate_list()` helper
247
- 8. Write `exceptions.py` — custom error envelope formatting
248
- 9. Write views: one class per endpoint, using registry + constants
249
- 10. Wire URL routing (`research_api/urls.py` -> `api/urls.py`)
250
- 11. Hook `ModelRegistry.initialize()` in `AppConfig.ready()`
 
 
251
 
252
  ## Verification
253
 
254
- 1. `pip install -r requirements.txt` — all dependencies install cleanly
255
- 2. `python manage.py check` — Django system check passes (no DB warnings expected since `DATABASES = {}`)
256
- 3. `python manage.py runserver` — registry downloads checkpoints from Google Drive (or skips if present), loads 3 COINs datasets via research `Loader`, scans all 3 checkpoint dirs, prints entity/relation counts
257
- 4. `GET /api/v1/health` — returns `{"status": "ok", "models_loaded": {"coins": true/false, "multiproxan": true/false, "kg_anomaly": true/false}}` based on checkpoint scan
258
- 5. `GET /api/v1/methods` — returns 3 methods
259
- 6. `GET /api/v1/coins/datasets` — returns 3 datasets with correct entity/relation counts (FB15k-237: ~14541 entities, 237 relations; WN18RR: ~40943 entities, 11 relations; NELL-995: from entity2id.txt/relation2id.txt)
260
- 7. `GET /api/v1/coins/datasets/freebase/entities?q=location&page=1&page_size=10` — filtered, paginated results
261
- 8. `GET /api/v1/coins/datasets/freebase/relations?q=country` — substring search on relation names
262
- 9. `GET /api/v1/coins/datasets/freebase/sample-triples?count=5` — 5 random triples with resolved entity/relation names
263
- 10. `GET /api/v1/coins/datasets/unknown/entities` — `{"error": {"code": "NOT_FOUND", ...}}`
264
- 11. `GET /api/v1/coins/models` — 6 algorithms, filtered by actually available checkpoints
265
- 12. `GET /api/v1/coins/query-structures` — 7 templates (1p, 2p, 3p, 2i, 3i, ip, pi) matching API spec
266
- 13. `GET /api/v1/graph-generation/datasets` — QM9 + Community20 with model type availability from MultiProxAn checkpoint scan
267
- 14. `GET /api/v1/graph-generation/sampling-modes` — standard + multiprox with parameter specs
268
- 15. `GET /api/v1/kg-anomaly/datasets` — 3 datasets with availability from DiGress KG checkpoint scan
269
- 16. `GET /api/v1/kg-anomaly/datasets/freebase/sample-subgraphs?count=3` — 3 pre-computed subgraphs with entity/relation names
270
- 17. Import `docs/postman/collection.json` into Postman, set environment, run all GET requests against running server
 
271
 
272
  ## Critical Source Files
273
 
@@ -276,6 +291,7 @@ tqdm==4.65.0
276
  | `docs/api.yaml` | Authoritative API contract for all response schemas |
277
  | `src/research/COINs-KGGeneration/graph_completion/graphs/load_graph.py` | Research `Loader` class reused for dataset loading |
278
  | `src/research/COINs-KGGeneration/graph_data/load_data.py` | Dataset classes (FreeBase, WordNet, NELL) with file formats |
 
279
  | `src/research/COINs-KGGeneration/data/` | Actual data files (train/valid/test.txt, entity2id.txt, relation2id.txt) |
280
  | `src/research/MultiProxAn/checkpoints/` | MultiProxAn checkpoint dir (scanned for availability) |
281
  | `src/research/COINs-KGGeneration/graph_generation/checkpoints/` | DiGress KG checkpoint dir (scanned for availability) |
 
21
  urls.py
22
  views/
23
  __init__.py
24
+ health.py # ApiRootView + HealthView
25
  coins.py
26
  graph_generation.py
27
  kg_anomaly.py
28
  services/
29
  __init__.py
30
+ registry.py # Singleton: download + scan checkpoints, load datasets
31
+ constants.py # All static/hardcoded data (models, query structures, etc.)
32
+ pagination.py # Custom page/page_size pagination helper
33
+ exceptions.py # Standardized error envelope
 
34
  ```
35
 
36
  ## Key Design Decision: Reuse Research Loaders
 
48
  `registry.py` — loaded once via `AppConfig.ready()`:
49
 
50
  1. **Downloads checkpoints from Google Drive** if not already present locally
51
+ - Checks `_all_checkpoint_dirs_populated()` first — skips download if all dirs have files
52
+ - Uses `gdown.download_folder()` to a `.checkpoint_staging` dir, then `_distribute_checkpoints()` moves files to correct locations
53
+ - Gracefully handles failures (logs warning, continues with local files)
54
  2. **Loads all 3 COINs datasets** using the research `Loader`
55
  3. **Scans checkpoint directories** for `.tar` / `.ckpt` files to determine availability flags (does NOT load model weights)
56
+ - COINs: parses `{dataset}_{algorithm}.tar` filenames
57
+ - MultiProxAn: `{dataset}.ckpt` = discrete, `{dataset}_c.ckpt` = continuous
58
+ - DiGress KG: `{dataset}.ckpt` = generate, `{dataset}_correct.ckpt` = correct
59
  4. **Generates sample subgraphs** for KG anomaly using the existing research code:
60
  - `Loader.get_context_subgraph_dataset(max_graph_size, device)` — DFS-based partitioning of the full graph into sized subgraphs
61
  - Internally calls `Sampler.get_context_subgraph_samples_dfs()` in `graph_completion/graphs/preprocess.py` (line 742)
 
69
 
70
  Source: Google Drive folder `https://drive.google.com/drive/u/0/folders/14Bf8fi4KJn0rDdh9y8EFyA5b8OpQyXWi`
71
 
72
+ Google Drive folder structure (mapped to local dirs via `GDRIVE_SUBFOLDERS` config in `registry.py`):
73
+ - `COINs-KGGeneration/checkpoints_coins/*.tar` -> `{COINS_COMPLETION_DIR}/checkpoints/`
74
+ - `COINs-KGGeneration/checkpoints_kg_generation/*.ckpt` -> `{DIGRESS_KG_DIR}/checkpoints/`
75
+ - `MultiProxAn/checkpoints/*.ckpt` -> `{MULTIPROXAN_DIR}/checkpoints/`
76
+ - `COINs-KGGeneration/data/*` -> data files (also downloaded alongside checkpoints)
 
 
 
 
77
 
78
+ Download logic in `registry.py`:
79
+ 1. Check if all checkpoint dirs already have files — skip entire download if so
80
+ 2. Use `gdown.download_folder()` to staging dir with `resume=True`
81
+ 3. `_distribute_checkpoints()` moves files, comparing size to skip already-present files
82
+ 4. Log download progress; on failure, log warning and continue
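The move-and-skip part of the download logic (steps 2 and 3) can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: `distribute` and its `mapping` argument are hypothetical stand-ins for `_distribute_checkpoints()` and `GDRIVE_SUBFOLDERS`, and the Google Drive download itself is left out.

```python
from pathlib import Path


def distribute(staging_dir: Path, mapping: dict) -> list:
    """Move staged files into their destinations, skipping same-size copies.

    `mapping` (hypothetical) maps a staging subfolder, given as a relative
    path, to a destination directory. Returns the names of the moved files.
    """
    moved = []
    for subfolder, dest_dir in mapping.items():
        src_dir = staging_dir / subfolder
        if not src_dir.is_dir():
            continue  # subfolder missing from the download; nothing to move
        dest_dir.mkdir(parents=True, exist_ok=True)
        for src_file in sorted(src_dir.iterdir()):
            if not src_file.is_file():
                continue
            dest_file = dest_dir / src_file.name
            # Same name and same size is treated as "already installed".
            if dest_file.exists() and dest_file.stat().st_size == src_file.stat().st_size:
                continue
            src_file.replace(dest_file)
            moved.append(src_file.name)
    return moved
```

Note that a size comparison cannot detect a file whose content changed but whose length did not; a checksum would be needed for that, at the cost of reading every checkpoint.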
83
 
84
  ### Checkpoint Scanning (after download)
85
 
 
87
  - MultiProxAn (graph generation): glob `{MULTIPROXAN_DIR}/checkpoints/{dataset}.ckpt` and `{dataset}_c.ckpt`
88
  - DiGress KG (anomaly correction): glob `{DIGRESS_KG_DIR}/checkpoints/{dataset}.ckpt` and `{dataset}_correct.ckpt`
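The naming conventions used by the scan can be expressed as one pure function. The sketch below is an illustration, not the registry's code: `parse_checkpoint_name`, the `family` labels, and the example filenames in the usage note are assumptions.

```python
from pathlib import Path


def parse_checkpoint_name(filename: str, family: str):
    """Return (dataset_id, variant) for a checkpoint filename, or None.

    family (hypothetical labels):
      "coins"       -> `{dataset}_{algorithm}.tar`
      "multiproxan" -> `{dataset}.ckpt` (discrete) / `{dataset}_c.ckpt` (continuous)
      "digress_kg"  -> `{dataset}.ckpt` (generate) / `{dataset}_correct.ckpt` (correct)
    """
    path = Path(filename)
    stem, suffix = path.stem, path.suffix
    if family == "coins":
        if suffix != ".tar" or "_" not in stem:
            return None
        # Split on the last underscore so dataset ids may contain underscores.
        dataset_id, algorithm = stem.rsplit("_", 1)
        return dataset_id, algorithm
    if suffix != ".ckpt":
        return None
    if family == "multiproxan":
        if stem.endswith("_c"):
            return stem[: -len("_c")], "continuous"
        return stem, "discrete"
    if family == "digress_kg":
        if stem.endswith("_correct"):
            return stem[: -len("_correct")], "correct"
        return stem, "generate"
    return None
```

For example, `parse_checkpoint_name("freebase_transe.tar", "coins")` yields `("freebase", "transe")`. One caveat of suffix-based parsing: a MultiProxAn dataset whose id itself ends in `_c` would be misread as a continuous checkpoint.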
89
 
90
+ ## Django Settings
91
 
92
+ - `DATABASES = {}`: no database
93
+ - `INSTALLED_APPS`: `corsheaders`, `rest_framework`, `api`
94
+ - `MIDDLEWARE`: `CorsMiddleware` + `CommonMiddleware` only
95
+ - `CORS_ALLOWED_ORIGINS`: `["https://bani57.pythonanywhere.com"]` (+ `CORS_ALLOW_ALL_ORIGINS = True` in DEBUG)
96
+ - `REST_FRAMEWORK`: JSON renderer/parser only, no default pagination, custom exception handler (`api.exceptions.api_exception_handler`)
97
+ - `SECRET_KEY` from env var `DJANGO_SECRET_KEY` with dev fallback
98
+ - `DEBUG` from env var `DJANGO_DEBUG`
99
+ - `ALLOWED_HOSTS` from env var `DJANGO_ALLOWED_HOSTS`
100
+ - Custom research path settings (all as `pathlib.Path` resolved from `PROJECT_ROOT`):
101
+ - `COINS_DATA_DIR` -> `src/research/COINs-KGGeneration/data/`
102
+ - `COINS_COMPLETION_DIR` -> `src/research/COINs-KGGeneration/graph_completion/`
103
+ - `DIGRESS_KG_DIR` -> `src/research/COINs-KGGeneration/graph_generation/` (DiGress model applied to KG data)
104
+ - `MULTIPROXAN_DIR` -> `src/research/MultiProxAn/`
105
+ - Each has its own checkpoints subdirectory:
106
+ - COINs (link prediction): `{COINS_COMPLETION_DIR}/checkpoints/`
107
+ - DiGress KG (anomaly correction): `{DIGRESS_KG_DIR}/checkpoints/`
108
+ - MultiProxAn (graph generation): `{MULTIPROXAN_DIR}/checkpoints/`
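The environment-driven settings and path resolution listed above could look roughly like this. It is a hedged sketch: the helper names (`env_bool`, `env_list`, `research_paths`), the accepted truthy strings, and the comma-separated host format are assumptions; the plan only states which env vars and directories are involved.

```python
import os
from pathlib import Path


def env_bool(name: str, default: bool = False, environ=os.environ) -> bool:
    """Read a boolean flag such as DJANGO_DEBUG (accepted spellings are an assumption)."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_list(name: str, default=(), environ=os.environ) -> list:
    """Read a comma-separated list such as DJANGO_ALLOWED_HOSTS (format is an assumption)."""
    raw = environ.get(name, "")
    items = [item.strip() for item in raw.split(",") if item.strip()]
    return items or list(default)


def research_paths(project_root: Path) -> dict:
    """Resolve the four research directories named in the plan from the project root."""
    research = project_root / "src" / "research"
    coins = research / "COINs-KGGeneration"
    return {
        "COINS_DATA_DIR": coins / "data",
        "COINS_COMPLETION_DIR": coins / "graph_completion",
        "DIGRESS_KG_DIR": coins / "graph_generation",
        "MULTIPROXAN_DIR": research / "MultiProxAn",
    }
```

Passing `environ` explicitly keeps the helpers testable without mutating the real process environment.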
109
+
110
+ ## Error Handling
111
+
112
+ Custom DRF exception handler (`api/exceptions.py`) wrapping all errors in:
113
+ ```json
114
+ {"error": {"code": "NOT_FOUND", "message": "...", "details": {}}}
115
+ ```
116
+
117
+ `ApiError` base class with `NotFoundError`, `InvalidRequestError` subclasses. Falls back to standard DRF handler for non-ApiError exceptions, still wrapping in the error envelope.
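The envelope and the status-to-code fallback can be illustrated without Django or DRF. The mapping (404 to `NOT_FOUND`, 400 to `INVALID_REQUEST`, anything else to `UNKNOWN_ERROR`) mirrors the handler added in this commit; the function names here are hypothetical.

```python
def error_code_for_status(status_code: int) -> str:
    """Map an HTTP status to the envelope's error code (fallback path)."""
    if status_code == 404:
        return "NOT_FOUND"
    if status_code == 400:
        return "INVALID_REQUEST"
    return "UNKNOWN_ERROR"


def error_envelope(code: str, message: str, details=None) -> dict:
    """Build the standardized error body; `details` defaults to an empty dict."""
    return {"error": {"code": code, "message": message, "details": details or {}}}
```

With these, `error_envelope(error_code_for_status(404), "Dataset not found")` produces the same shape as the JSON example above.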
118
 
119
+ ## Endpoints (13 total)
120
 
121
  | Method | Path | View Class | Data Source |
122
  |--------|------|-----------|-------------|
123
+ | `GET` | `/` (api root) | `ApiRootView` | API name, version, description, endpoint directory |
124
  | `GET` | `/health` | `HealthView` | Registry availability flags |
125
  | `GET` | `/methods` | `MethodsView` | `METHODS` constant |
126
  | `GET` | `/coins/datasets` | `CoinsDatasetsView` | Registry + `COINS_DATASET_META` |
 
134
  | `GET` | `/kg-anomaly/datasets` | `KgAnomalyDatasetsView` | `KG_ANOMALY_DATASET_META` |
135
  | `GET` | `/kg-anomaly/datasets/{id}/sample-subgraphs` | `KgAnomalySampleSubgraphsView` | Registry pre-computed subgraphs |
136
 
137
+ ## Constants (hardcoded from API spec)
138
 
139
+ `constants.py` contains all static data:
140
+ - `METHODS` — 3 research methods (id, name, thesis_section, description)
141
+ - `COINS_DATASET_META` — dataset id -> {name, description, data_dir}
142
+ - `COINS_MODELS` — 6 algorithms with supported_query_structures and available_datasets
143
+ - TransE, DistMult, ComplEx, RotatE, KBGAT: 1p only
144
+ - Q2B: all query structures (1p, 2p, 3p, 2i, 3i, ip, pi)
145
+ - `QUERY_STRUCTURES` — 7 query graph templates (nodes + edges arrays)
146
+ - `GRAPHGEN_DATASETS` — QM9 (molecular) + Community20 (synthetic)
147
+ - `GRAPHGEN_SAMPLING_MODES` — standard (T, chain_frames) + multiprox (T, m, t, t_prime)
148
+ - `KG_ANOMALY_DATASET_META` — freebase, wordnet, nell
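To make the relationship between `COINS_MODELS` and the checkpoint scan concrete, here is a small sketch. The entry shape, the lowercase ids, and `with_availability` are assumptions for illustration; only the field names `supported_query_structures` and `available_datasets` and the 1p-only versus all-structures split come from the plan.

```python
ALL_QUERY_STRUCTURES = ["1p", "2p", "3p", "2i", "3i", "ip", "pi"]

# Hypothetical shape; the real entries live in constants.py and follow docs/api.yaml.
COINS_MODELS = [
    {"id": "transe", "name": "TransE", "supported_query_structures": ["1p"]},
    {"id": "q2b", "name": "Q2B", "supported_query_structures": ALL_QUERY_STRUCTURES},
]


def with_availability(models, checkpoints_available):
    """Attach `available_datasets` using a {dataset_id: [algorithm, ...]} scan result."""
    result = []
    for model in models:
        datasets = sorted(
            dataset_id
            for dataset_id, algorithms in checkpoints_available.items()
            if model["id"] in algorithms
        )
        # Copy each entry so the module-level constant is never mutated.
        result.append({**model, "available_datasets": datasets})
    return result
```

Keeping the constants static and deriving availability per request means the hardcoded data never drifts from what is actually on disk.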
149
 
150
  ## Dependencies
151
 
 
201
  cd src/analysis/orca && g++ -O2 -std=c++11 -o orca orca.cpp
202
  ```
203
 
204
+ ### Backend environment — `website_c` mamba env (Python 3.9)
205
+
206
  ### Backend requirements.txt
207
 
208
+ The backend unifies both into one Python 3.9 environment. Since the COINs `Loader` is the only part reused for init, we pin to the DiGress/MultiProxAn versions and ensure COINs code runs under them. For PythonAnywhere (CPU only), use CPU torch wheels.
209
 
210
  ```
211
  # Django
 
249
 
250
  ## Implementation Sequence
251
 
252
+ 1. Write `requirements.txt` and install dependencies (django, DRF, cors-headers, torch, igraph, scipy, pandas, numpy, gdown)
253
  2. Create Django project scaffolding (`manage.py`, `research_api/` settings/urls/wsgi)
254
  3. Configure settings — CORS, DRF, research paths (`COINS_DATA_DIR`, `COINS_COMPLETION_DIR`, `DIGRESS_KG_DIR`, `MULTIPROXAN_DIR`), add research dirs to `sys.path`
255
  4. Create `api` app skeleton (`apps.py`, `urls.py`)
256
+ 5. Write `exceptions.py` — custom error envelope formatting (`ApiError`, `NotFoundError`, `InvalidRequestError`)
257
+ 6. Write `registry.py` — `ModelRegistry` singleton: Google Drive checkpoint download (with skip-if-present), checkpoint scanning across all 3 dirs
258
+ 7. Write `constants.py` — all static data from API spec (methods, models, query structures, dataset metadata, sampling modes)
259
+ 8. Write `pagination.py` — `paginate_list()` helper
260
+ 9. Write `health.py`: `ApiRootView` (endpoint directory) + `HealthView` (status + models_loaded flags)
261
+ 10. Write remaining views (`coins.py`, `graph_generation.py`, `kg_anomaly.py`) — one class per endpoint, using registry + constants
262
+ 11. Wire URL routing (`research_api/urls.py` -> `api/urls.py`)
263
+ 12. Hook `ModelRegistry.initialize()` in `AppConfig.ready()`
264
+ 13. Add research `Loader` integration to registry for dataset loading + subgraph generation
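Step 8 names a `paginate_list()` helper without showing it. A minimal sketch of page/page_size pagination over an in-memory list follows; the default sizes, the `max_page_size` cap, and the response field names are assumptions, and `docs/api.yaml` remains the authority on the real schema.

```python
def paginate_list(items, page=1, page_size=20, max_page_size=100):
    """Slice `items` into one page and return it with pagination metadata.

    Field names and limits are illustrative assumptions, not the API contract.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive integers")
    page_size = min(page_size, max_page_size)
    total = len(items)
    total_pages = (total + page_size - 1) // page_size  # ceiling division
    start = (page - 1) * page_size
    return {
        "items": list(items[start:start + page_size]),
        "page": page,
        "page_size": page_size,
        "total": total,
        "total_pages": total_pages,
    }
```

A page past the end returns an empty `items` list rather than raising, which keeps client-side paging loops simple; a view could instead translate that case into an `InvalidRequestError`.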
265
 
266
  ## Verification
267
 
268
+ 1. `pip install -r requirements.txt` — all dependencies install cleanly into `website_c` env
269
+ 2. `python manage.py check` — Django system check passes, 0 issues
270
+ 3. `python manage.py runserver` — registry downloads checkpoints from Google Drive (or skips if present), scans all 3 checkpoint dirs, logs entity/relation counts
271
+ 4. `GET /api/v1/` — returns API name, version, description, full endpoint directory with absolute URLs
272
+ 5. `GET /api/v1/health` — returns `{"status": "ok", "models_loaded": {"coins": bool, "multiproxan": bool, "kg_anomaly": bool}}`
273
+ 6. `GET /api/v1/methods` — returns 3 methods
274
+ 7. `GET /api/v1/coins/datasets` — returns 3 datasets with correct entity/relation counts
275
+ 8. `GET /api/v1/coins/datasets/freebase/entities?q=location&page=1&page_size=10` — filtered, paginated results
276
+ 9. `GET /api/v1/coins/datasets/freebase/relations?q=country` — substring search on relation names
277
+ 10. `GET /api/v1/coins/datasets/freebase/sample-triples?count=5` — 5 random triples with resolved names
278
+ 11. `GET /api/v1/coins/datasets/unknown/entities` — `{"error": {"code": "NOT_FOUND", ...}}`
279
+ 12. `GET /api/v1/coins/models` — 6 algorithms, filtered by actually available checkpoints
280
+ 13. `GET /api/v1/coins/query-structures` — 7 templates (1p, 2p, 3p, 2i, 3i, ip, pi) matching API spec
281
+ 14. `GET /api/v1/graph-generation/datasets` — QM9 + Community20 with model type availability from MultiProxAn checkpoint scan
282
+ 15. `GET /api/v1/graph-generation/sampling-modes` — standard + multiprox with parameter specs
283
+ 16. `GET /api/v1/kg-anomaly/datasets` — 3 datasets with availability from DiGress KG checkpoint scan
284
+ 17. `GET /api/v1/kg-anomaly/datasets/freebase/sample-subgraphs?count=3`: 3 pre-computed subgraphs with entity/relation names
285
+ 18. Import `docs/postman/collection.json` into Postman, set environment, run all GET requests against running server
286
 
287
  ## Critical Source Files
288
 
 
291
  | `docs/api.yaml` | Authoritative API contract for all response schemas |
292
  | `src/research/COINs-KGGeneration/graph_completion/graphs/load_graph.py` | Research `Loader` class reused for dataset loading |
293
  | `src/research/COINs-KGGeneration/graph_data/load_data.py` | Dataset classes (FreeBase, WordNet, NELL) with file formats |
294
+ | `src/research/COINs-KGGeneration/graph_completion/graphs/preprocess.py` | `Sampler.get_context_subgraph_samples_dfs()` for KG anomaly sample subgraphs |
295
  | `src/research/COINs-KGGeneration/data/` | Actual data files (train/valid/test.txt, entity2id.txt, relation2id.txt) |
296
  | `src/research/MultiProxAn/checkpoints/` | MultiProxAn checkpoint dir (scanned for availability) |
297
  | `src/research/COINs-KGGeneration/graph_generation/checkpoints/` | DiGress KG checkpoint dir (scanned for availability) |
.gitignore CHANGED
@@ -4,3 +4,4 @@ __pycache__
4
  src/research/COINs-KGGeneration/graph_completion/checkpoints/
5
  src/research/COINs-KGGeneration/graph_generation/checkpoints/
6
  src/research/MultiProxAn/checkpoints/
 
 
4
  src/research/COINs-KGGeneration/graph_completion/checkpoints/
5
  src/research/COINs-KGGeneration/graph_generation/checkpoints/
6
  src/research/MultiProxAn/checkpoints/
7
+ src/backend/.checkpoint_staging/
docs/postman/collection.json CHANGED
@@ -14,6 +14,19 @@
14
  {
15
  "name": "Health",
16
  "item": [
 
 
 
 
 
 
 
 
 
 
 
 
 
17
  {
18
  "name": "GET /health",
19
  "request": {
 
14
  {
15
  "name": "Health",
16
  "item": [
17
+ {
18
+ "name": "GET / (API root)",
19
+ "request": {
20
+ "method": "GET",
21
+ "header": [],
22
+ "url": {
23
+ "raw": "{{base_url}}/",
24
+ "host": ["{{base_url}}"],
25
+ "path": [""]
26
+ },
27
+ "description": "API root. Returns name, version, description, and full endpoint directory."
28
+ }
29
+ },
30
  {
31
  "name": "GET /health",
32
  "request": {
src/backend/api/__init__.py ADDED
File without changes
src/backend/api/apps.py ADDED
@@ -0,0 +1,11 @@
+ from django.apps import AppConfig
+
+
+ class ApiConfig(AppConfig):
+     name = "api"
+     default_auto_field = "django.db.models.BigAutoField"
+
+     def ready(self):
+         from api.services.registry import ModelRegistry
+
+         ModelRegistry.initialize()
src/backend/api/exceptions.py ADDED
@@ -0,0 +1,38 @@
+ from rest_framework.exceptions import APIException
+ from rest_framework.response import Response
+ from rest_framework.views import exception_handler
+
+
+ class ApiError(APIException):
+     def __init__(self, code, message, status_code=400, details=None):
+         self.code = code
+         self.detail = message
+         self.status_code = status_code
+         self.details = details or {}
+
+
+ class NotFoundError(ApiError):
+     def __init__(self, message, details=None):
+         super().__init__("NOT_FOUND", message, status_code=404, details=details)
+
+
+ class InvalidRequestError(ApiError):
+     def __init__(self, message, details=None):
+         super().__init__("INVALID_REQUEST", message, status_code=400, details=details)
+
+
+ def api_exception_handler(exc, context):
+     if isinstance(exc, ApiError):
+         return Response(
+             {"error": {"code": exc.code, "message": str(exc.detail), "details": exc.details}},
+             status=exc.status_code,
+         )
+     response = exception_handler(exc, context)
+     if response is not None:
+         code = "UNKNOWN_ERROR"
+         if response.status_code == 404:
+             code = "NOT_FOUND"
+         elif response.status_code == 400:
+             code = "INVALID_REQUEST"
+         response.data = {"error": {"code": code, "message": str(response.data), "details": {}}}
+     return response
src/backend/api/services/__init__.py ADDED
File without changes
src/backend/api/services/registry.py ADDED
@@ -0,0 +1,171 @@
+ import logging
+ import os
+ from pathlib import Path
+
+ from django.conf import settings
+
+ logger = logging.getLogger(__name__)
+
+ GDRIVE_FOLDER_ID = "14Bf8fi4KJn0rDdh9y8EFyA5b8OpQyXWi"
+
+ # Subfolder IDs within the Google Drive folder (from gdown listing)
+ GDRIVE_SUBFOLDERS = {
+     "coins": {
+         "folder_name": "COINs-KGGeneration/checkpoints_coins",
+         "local_dir_setting": "COINS_COMPLETION_DIR",
+         "local_subdir": "checkpoints",
+     },
+     "kg_generation": {
+         "folder_name": "COINs-KGGeneration/checkpoints_kg_generation",
+         "local_dir_setting": "DIGRESS_KG_DIR",
+         "local_subdir": "checkpoints",
+     },
+     "multiproxan": {
+         "folder_name": "MultiProxAn/checkpoints",
+         "local_dir_setting": "MULTIPROXAN_DIR",
+         "local_subdir": "checkpoints",
+     },
+ }
+
+
+ class ModelRegistry:
+     _instance = None
+
+     def __init__(self):
+         self.coins_checkpoints_available = {}
+         self.graphgen_checkpoints_available = {}
+         self.kg_anomaly_checkpoints_available = {}
+
+     @classmethod
+     def get(cls):
+         if cls._instance is None:
+             raise RuntimeError("ModelRegistry not initialized. Call initialize() first.")
+         return cls._instance
+
+     @classmethod
+     def initialize(cls):
+         if cls._instance is not None:
+             return
+         instance = cls()
+         instance._download_checkpoints()
+         instance._scan_checkpoints()
+         cls._instance = instance
+         logger.info(
+             "ModelRegistry initialized: coins=%s, multiproxan=%s, kg_anomaly=%s",
+             instance.is_coins_loaded(),
+             instance.is_graphgen_loaded(),
+             instance.is_kg_anomaly_loaded(),
+         )
+
+     def _download_checkpoints(self):
+         """Download checkpoints from Google Drive if not already present locally."""
+         if self._all_checkpoint_dirs_populated():
+             logger.info("All checkpoint directories already populated, skipping Google Drive download")
+             return
+
+         try:
+             import gdown
+         except ImportError:
+             logger.warning("gdown not installed, skipping checkpoint download")
+             return
+
+         gdrive_url = f"https://drive.google.com/drive/folders/{GDRIVE_FOLDER_ID}"
+         logger.info("Downloading checkpoints from Google Drive: %s", gdrive_url)
+
+         try:
+             staging_dir = Path(settings.BASE_DIR) / ".checkpoint_staging"
+             staging_dir.mkdir(exist_ok=True)
+
+             gdown.download_folder(
+                 gdrive_url, output=str(staging_dir), quiet=False, resume=True,
+             )
+
+             self._distribute_checkpoints(staging_dir)
+         except Exception:
+             logger.exception("Failed to download checkpoints from Google Drive, continuing with local files")
+
+     def _all_checkpoint_dirs_populated(self):
+         """Check if all checkpoint dirs already have at least one checkpoint file."""
+         for config in GDRIVE_SUBFOLDERS.values():
+             dest_dir = Path(getattr(settings, config["local_dir_setting"])) / config["local_subdir"]
+             if not dest_dir.exists():
+                 return False
+             ckpt_files = list(dest_dir.glob("*.tar")) + list(dest_dir.glob("*.ckpt"))
+             if not ckpt_files:
+                 return False
+         return True
+
+     def _distribute_checkpoints(self, staging_dir):
+         """Move downloaded files from staging into the correct checkpoint directories."""
+         for group, config in GDRIVE_SUBFOLDERS.items():
+             src_dir = staging_dir / config["folder_name"]
+             if not src_dir.exists():
+                 logger.warning("Expected staging subfolder not found: %s", src_dir)
+                 continue
+
+             dest_dir = Path(getattr(settings, config["local_dir_setting"])) / config["local_subdir"]
+             dest_dir.mkdir(parents=True, exist_ok=True)
+
+             for src_file in src_dir.iterdir():
+                 if not src_file.is_file():
+                     continue
+                 dest_file = dest_dir / src_file.name
+                 if dest_file.exists() and dest_file.stat().st_size == src_file.stat().st_size:
+                     logger.debug("Checkpoint already present, skipping: %s", dest_file.name)
+                     continue
+                 logger.info("Installing checkpoint: %s -> %s", src_file.name, dest_dir)
+                 src_file.replace(dest_file)
+
+     def _scan_checkpoints(self):
+         self._scan_coins_checkpoints()
+         self._scan_graphgen_checkpoints()
+         self._scan_kg_anomaly_checkpoints()
+
+     def _scan_coins_checkpoints(self):
+         ckpt_dir = Path(settings.COINS_COMPLETION_DIR) / "checkpoints"
+         if not ckpt_dir.exists():
+             logger.warning("COINs checkpoint dir not found: %s", ckpt_dir)
+             return
+         for path in ckpt_dir.glob("*.tar"):
+             parts = path.stem.rsplit("_", 1)
+             if len(parts) == 2:
+                 dataset_id, algorithm = parts
+                 self.coins_checkpoints_available.setdefault(dataset_id, []).append(algorithm)
+         logger.info("COINs checkpoints: %s", self.coins_checkpoints_available)
+
+     def _scan_graphgen_checkpoints(self):
+         ckpt_dir = Path(settings.MULTIPROXAN_DIR) / "checkpoints"
+         if not ckpt_dir.exists():
+             logger.warning("MultiProxAn checkpoint dir not found: %s", ckpt_dir)
+             return
+         for path in ckpt_dir.glob("*.ckpt"):
+             name = path.stem
+             if name.endswith("_c"):
+                 dataset_id = name[:-2]
+                 self.graphgen_checkpoints_available.setdefault(dataset_id, []).append("continuous")
+             else:
+                 self.graphgen_checkpoints_available.setdefault(name, []).append("discrete")
+         logger.info("MultiProxAn checkpoints: %s", self.graphgen_checkpoints_available)
+
+     def _scan_kg_anomaly_checkpoints(self):
+         ckpt_dir = Path(settings.DIGRESS_KG_DIR) / "checkpoints"
+         if not ckpt_dir.exists():
+             logger.warning("DiGress KG checkpoint dir not found: %s", ckpt_dir)
+             return
+         for path in ckpt_dir.glob("*.ckpt"):
+             name = path.stem
+             if name.endswith("_correct"):
+                 dataset_id = name[:-8]
+                 self.kg_anomaly_checkpoints_available.setdefault(dataset_id, []).append("correct")
+             else:
+                 self.kg_anomaly_checkpoints_available.setdefault(name, []).append("generate")
+         logger.info("DiGress KG checkpoints: %s", self.kg_anomaly_checkpoints_available)
+
+     def is_coins_loaded(self):
+         return bool(self.coins_checkpoints_available)
+
+     def is_graphgen_loaded(self):
+         return bool(self.graphgen_checkpoints_available)
+
+     def is_kg_anomaly_loaded(self):
+         return bool(self.kg_anomaly_checkpoints_available)
src/backend/api/urls.py ADDED
@@ -0,0 +1,8 @@
+ from django.urls import path
+
+ from api.views.health import ApiRootView, HealthView
+
+ urlpatterns = [
+     path("", ApiRootView.as_view()),
+     path("health", HealthView.as_view()),
+ ]
src/backend/api/views/__init__.py ADDED
File without changes
src/backend/api/views/health.py ADDED
@@ -0,0 +1,52 @@
+ from rest_framework.response import Response
+ from rest_framework.reverse import reverse
+ from rest_framework.views import APIView
+
+ from api.services.registry import ModelRegistry
+
+
+ class ApiRootView(APIView):
+     def get(self, request):
+         return Response({
+             "name": "Scalable Graph ML Research API",
+             "version": "1.0.0",
+             "description": (
+                 "REST API for testing PhD research methods from "
+                 "\"Scalable Methods for Knowledge Graph Reasoning and Generation\" "
+                 "(Andrej Janchevski, EPFL, 2025)"
+             ),
+             "endpoints": {
+                 "health": request.build_absolute_uri("health"),
+                 "methods": request.build_absolute_uri("methods"),
+                 "coins": {
+                     "datasets": request.build_absolute_uri("coins/datasets"),
+                     "models": request.build_absolute_uri("coins/models"),
+                     "query_structures": request.build_absolute_uri("coins/query-structures"),
+                     "predict": request.build_absolute_uri("coins/predict"),
+                 },
+                 "graph_generation": {
+                     "datasets": request.build_absolute_uri("graph-generation/datasets"),
+                     "sampling_modes": request.build_absolute_uri("graph-generation/sampling-modes"),
+                     "generate": request.build_absolute_uri("graph-generation/generate"),
+                     "continue": request.build_absolute_uri("graph-generation/continue"),
+                 },
+                 "kg_anomaly": {
+                     "datasets": request.build_absolute_uri("kg-anomaly/datasets"),
+                     "correct": request.build_absolute_uri("kg-anomaly/correct"),
+                     "continue": request.build_absolute_uri("kg-anomaly/continue"),
+                 },
+             },
+         })
+
+
+ class HealthView(APIView):
+     def get(self, request):
+         registry = ModelRegistry.get()
+         return Response({
+             "status": "ok",
+             "models_loaded": {
+                 "coins": registry.is_coins_loaded(),
+                 "multiproxan": registry.is_graphgen_loaded(),
+                 "kg_anomaly": registry.is_kg_anomaly_loaded(),
+             },
+         })
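`ApiRootView` passes relative locations such as `"health"` to `request.build_absolute_uri`, which resolves them against the URL of the current request. The standard-library `urljoin` follows the same relative-reference rules, so the sketch below uses it to illustrate why the trailing slash on the `api/v1/` prefix matters. The host and port are placeholders, and `resolve` is a hypothetical helper, not part of the commit.

```python
from urllib.parse import urljoin


def resolve(request_url, location):
    """Resolve a relative location against the URL of the current request."""
    return urljoin(request_url, location)


# Placeholder host; the "api/v1/" prefix comes from research_api/urls.py.
with_slash = resolve("http://localhost:8000/api/v1/", "coins/datasets")
without_slash = resolve("http://localhost:8000/api/v1", "coins/datasets")

print(with_slash)
print(without_slash)
```

With the trailing slash the link stays under the versioned prefix. Without it the last path segment is replaced and the `v1` component is lost. The root view is routed only at `api/v1/`, so the links in the response resolve under that prefix.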
src/backend/manage.py ADDED
@@ -0,0 +1,21 @@
+ #!/usr/bin/env python
+ """Django's command-line utility for administrative tasks."""
+ import os
+ import sys
+
+
+ def main():
+     os.environ.setdefault("DJANGO_SETTINGS_MODULE", "research_api.settings")
+     try:
+         from django.core.management import execute_from_command_line
+     except ImportError as exc:
+         raise ImportError(
+             "Couldn't import Django. Are you sure it's installed and "
+             "available on your PYTHONPATH environment variable? Did you "
+             "forget to activate a virtual environment?"
+         ) from exc
+     execute_from_command_line(sys.argv)
+
+
+ if __name__ == "__main__":
+     main()
src/backend/research_api/__init__.py ADDED
File without changes
src/backend/research_api/settings.py ADDED
@@ -0,0 +1,49 @@
+ import os
+ from pathlib import Path
+
+ BASE_DIR = Path(__file__).resolve().parent.parent
+ PROJECT_ROOT = BASE_DIR.parent.parent  # Website root
+
+ SECRET_KEY = os.environ.get("DJANGO_SECRET_KEY", "dev-insecure-key-change-in-production")
+ DEBUG = os.environ.get("DJANGO_DEBUG", "True").lower() in ("true", "1")
+ ALLOWED_HOSTS = os.environ.get("DJANGO_ALLOWED_HOSTS", "localhost,127.0.0.1").split(",")
+
+ INSTALLED_APPS = [
+     "corsheaders",
+     "rest_framework",
+     "api",
+ ]
+
+ MIDDLEWARE = [
+     "corsheaders.middleware.CorsMiddleware",
+     "django.middleware.common.CommonMiddleware",
+ ]
+
+ ROOT_URLCONF = "research_api.urls"
+ WSGI_APPLICATION = "research_api.wsgi.application"
+
+ DATABASES = {}
+
+ REST_FRAMEWORK = {
+     "DEFAULT_RENDERER_CLASSES": ["rest_framework.renderers.JSONRenderer"],
+     "DEFAULT_PARSER_CLASSES": ["rest_framework.parsers.JSONParser"],
+     "DEFAULT_PAGINATION_CLASS": None,
+     "DEFAULT_AUTHENTICATION_CLASSES": [],
+     "DEFAULT_PERMISSION_CLASSES": ["rest_framework.permissions.AllowAny"],
+     "EXCEPTION_HANDLER": "api.exceptions.api_exception_handler",
+     "UNAUTHENTICATED_USER": None,
+ }
+
+ CORS_ALLOWED_ORIGINS = [
+     "https://bani57.pythonanywhere.com",
+ ]
+ if DEBUG:
+     CORS_ALLOW_ALL_ORIGINS = True
+
+ # Research code paths
+ COINS_DATA_DIR = PROJECT_ROOT / "src" / "research" / "COINs-KGGeneration" / "data"
+ COINS_COMPLETION_DIR = PROJECT_ROOT / "src" / "research" / "COINs-KGGeneration" / "graph_completion"
+ DIGRESS_KG_DIR = PROJECT_ROOT / "src" / "research" / "COINs-KGGeneration" / "graph_generation"
+ MULTIPROXAN_DIR = PROJECT_ROOT / "src" / "research" / "MultiProxAn"
+
+ DEFAULT_AUTO_FIELD = "django.db.models.BigAutoField"
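The settings above read `DJANGO_DEBUG` and `DJANGO_ALLOWED_HOSTS` from the environment. A small sketch of the same parsing as pure functions may make the behavior easier to check. `parse_debug` and `parse_allowed_hosts` are hypothetical names, and the whitespace stripping and empty-entry filtering in `parse_allowed_hosts` are additions that the committed `split(",")` does not perform.

```python
def parse_debug(raw, default="True"):
    """Mirror the DEBUG rule: only "true" or "1" (case-insensitive) enable it."""
    value = default if raw is None else raw
    return value.lower() in ("true", "1")


def parse_allowed_hosts(raw, default="localhost,127.0.0.1"):
    """Split a comma-separated host list.

    Unlike the plain split(",") in the diff, this variant also strips
    whitespace and drops empty entries (an assumption, not committed code).
    """
    value = default if raw is None else raw
    return [host.strip() for host in value.split(",") if host.strip()]


# raw=None stands for an unset environment variable.
print(parse_debug(None), parse_debug("0"))
print(parse_allowed_hosts("example.org, api.example.org"))
```

With the committed code, a value such as `"example.org, api.example.org"` would leave a leading space on the second host, which Django would not match against the request's `Host` header.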
src/backend/research_api/urls.py ADDED
@@ -0,0 +1,5 @@
+ from django.urls import include, path
+
+ urlpatterns = [
+     path("api/v1/", include("api.urls")),
+ ]
src/backend/research_api/wsgi.py ADDED
@@ -0,0 +1,6 @@
+ import os
+
+ from django.core.wsgi import get_wsgi_application
+
+ os.environ.setdefault("DJANGO_SETTINGS_MODULE", "research_api.settings")
+ application = get_wsgi_application()