Andrej Janchevski committed
Commit
4f1e196
1 Parent(s): 0b4eacc

Implement backend discovery endpoints

.claude/plans/backend_init.md CHANGED
@@ -10,6 +10,7 @@ Build the Django backend foundation: project scaffolding, dataset/entity/relatio
10
  src/backend/
11
  manage.py
12
  requirements.txt
 
13
  research_api/
14
  __init__.py
15
  settings.py
@@ -51,17 +52,38 @@ The research `Loader` needs: `igraph`, `pandas`, `numpy`, `torch` (for device ha
51
  - Checks `_all_checkpoint_dirs_populated()` first — skips download if all dirs have files
52
  - Uses `gdown.download_folder()` to a `.checkpoint_staging` dir, then `_distribute_checkpoints()` moves files to correct locations
53
  - Gracefully handles failures (logs warning, continues with local files)
54
- 2. **Loads all 3 COINs datasets** using the research `Loader`
55
- 3. **Scans checkpoint directories** for `.tar` / `.ckpt` files to determine availability flags (does NOT load model weights)
56
  - COINs: parses `{dataset}_{algorithm}.tar` filenames
57
  - MultiProxAn: `{dataset}.ckpt` = discrete, `{dataset}_c.ckpt` = continuous
58
  - DiGress KG: `{dataset}.ckpt` = generate, `{dataset}_correct.ckpt` = correct
59
- 4. **Generates sample subgraphs** for KG anomaly using the existing research code:
60
- - `Loader.get_context_subgraph_dataset(max_graph_size, device)` DFS-based partitioning of the full graph into sized subgraphs
61
- - Internally calls `Sampler.get_context_subgraph_samples_dfs()` in `graph_completion/graphs/preprocess.py` (line 742)
62
- - Returns `List[ContextSubgraphData]` (torch_geometric Data objects with `x`, `x_index`, `edge_index`, `edge_attr`, `subgraph_row`, `subgraph_col`)
63
- - `context_subgraphs_to_tensors()` (preprocess.py line 809) converts raw DFS samples to tensors
64
- - For the API sample-subgraphs endpoint: call this once per dataset at startup, store a subset, convert to the API response format (entity names + relation names resolved from the Loader's name maps)
65
 
66
  ### Checkpoint Download (on each boot)
67
 
@@ -87,6 +109,15 @@ Download logic in `registry.py`:
87
  - MultiProxAn (graph generation): glob `{MULTIPROXAN_DIR}/checkpoints/{dataset}.ckpt` and `{dataset}_c.ckpt`
88
  - DiGress KG (anomaly correction): glob `{DIGRESS_KG_DIR}/checkpoints/{dataset}.ckpt` and `{dataset}_correct.ckpt`
89
 
90
  ## Django Settings
91
 
92
  - `DATABASES = {}` — no database
@@ -205,7 +236,7 @@ cd src/analysis/orca && g++ -O2 -std=c++11 -o orca orca.cpp
205
 
206
  ### Backend requirements.txt
207
 
208
- The backend unifies both into one Python 3.9 environment. Since the COINs `Loader` is the only part reused for init, we pin to the DiGress/MultiProxAn versions and ensure COINs code runs under them. For PythonAnywhere (CPU only), use CPU torch wheels.
209
 
210
  ```
211
  # Django
@@ -216,11 +247,11 @@ django-cors-headers==4.*
216
  # Checkpoint download from Google Drive
217
  gdown>=4.7
218
 
219
- # PyTorch (CPU only for PythonAnywhere)
220
- -f https://download.pytorch.org/whl/cpu/torch_stable.html
221
- torch==2.0.1+cpu
222
- torchvision==0.15.2+cpu
223
- torchaudio==2.0.2+cpu
224
 
225
  # PyTorch Geometric (must match torch version)
226
  torch-geometric==2.3.1
@@ -239,6 +270,7 @@ matplotlib==3.7.1
239
  imageio==2.31.1
240
  torchmetrics==0.11.4
241
  tqdm==4.65.0
 
242
 
243
  # Conda-only deps (must be pre-installed, not in pip requirements):
244
  # rdkit==2023.03.2 (conda create -c conda-forge)
@@ -263,25 +295,29 @@ tqdm==4.65.0
263
  12. Hook `ModelRegistry.initialize()` in `AppConfig.ready()`
264
  13. Add research `Loader` integration to registry for dataset loading + subgraph generation
265
 
266
  ## Verification
267
 
268
  1. `pip install -r requirements.txt` — all dependencies install cleanly into `website_c` env
269
  2. `python manage.py check` — Django system check passes, 0 issues
270
- 3. `python manage.py runserver` — registry downloads checkpoints from Google Drive (or skips if present), scans all 3 checkpoint dirs, logs entity/relation counts
271
- 4. `GET /api/v1/` — returns API name, version, description, full endpoint directory with absolute URLs
272
  5. `GET /api/v1/health` — returns `{"status": "ok", "models_loaded": {"coins": bool, "multiproxan": bool, "kg_anomaly": bool}}`
273
  6. `GET /api/v1/methods` — returns 3 methods
274
  7. `GET /api/v1/coins/datasets` — returns 3 datasets with correct entity/relation counts
275
- 8. `GET /api/v1/coins/datasets/freebase/entities?q=location&page=1&page_size=10` — filtered, paginated results
276
- 9. `GET /api/v1/coins/datasets/freebase/relations?q=country` — substring search on relation names
277
- 10. `GET /api/v1/coins/datasets/freebase/sample-triples?count=5` — 5 random triples with resolved names
278
  11. `GET /api/v1/coins/datasets/unknown/entities` — `{"error": {"code": "NOT_FOUND", ...}}`
279
  12. `GET /api/v1/coins/models` — 6 algorithms, filtered by actually available checkpoints
280
  13. `GET /api/v1/coins/query-structures` — 7 templates (1p, 2p, 3p, 2i, 3i, ip, pi) matching API spec
281
- 14. `GET /api/v1/graph-generation/datasets` — QM9 + Community20 with model type availability from MultiProxAn checkpoint scan
282
  15. `GET /api/v1/graph-generation/sampling-modes` — standard + multiprox with parameter specs
283
- 16. `GET /api/v1/kg-anomaly/datasets` — 3 datasets with availability from DiGress KG checkpoint scan
284
- 17. `GET /api/v1/kg-anomaly/datasets/freebase/sample-subgraphs?count=3` — 3 pre-computed subgraphs with entity/relation names
285
  18. Import `docs/postman/collection.json` into Postman, set environment, run all GET requests against running server
286
 
287
  ## Critical Source Files
 
10
  src/backend/
11
  manage.py
12
  requirements.txt
13
+ README.md
14
  research_api/
15
  __init__.py
16
  settings.py
 
52
  - Checks `_all_checkpoint_dirs_populated()` first — skips download if all dirs have files
53
  - Uses `gdown.download_folder()` to a `.checkpoint_staging` dir, then `_distribute_checkpoints()` moves files to correct locations
54
  - Gracefully handles failures (logs warning, continues with local files)
55
+ 2. **Scans checkpoint directories** for `.tar` / `.ckpt` files to determine availability flags (does NOT load model weights) — see the sketch after this list
 
56
  - COINs: parses `{dataset}_{algorithm}.tar` filenames
57
  - MultiProxAn: `{dataset}.ckpt` = discrete, `{dataset}_c.ckpt` = continuous
58
  - DiGress KG: `{dataset}.ckpt` = generate, `{dataset}_correct.ckpt` = correct
59
+ 3. **Loads one lightweight COINs Loader per dataset** using vanilla seeds, then frees heavy arrays
60
+ 4. **Generates sample subgraphs** for KG anomaly using the Loader's DFS-based context subgraph partitioning, capped via `max_samples` to avoid O(subgraphs²) memory
61
+
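To make the scan in step 2 concrete, here is a minimal sketch of the filename-based availability check, assuming the directory layouts listed above; the function names and returned dict shape are illustrative, not the registry's actual API (that lands in `api/services/registry.py` later in this commit).

```python
from pathlib import Path

def scan_coins_checkpoints(checkpoint_dir):
    """Map dataset -> algorithms from `{dataset}_{algorithm}.tar` filenames."""
    available = {}
    for f in Path(checkpoint_dir).glob("*.tar"):
        dataset, _, algorithm = f.stem.rpartition("_")  # "freebase_rotate" -> ("freebase", "rotate")
        if dataset and algorithm:
            available.setdefault(dataset, []).append(algorithm)
    return available

def scan_digress_kg_checkpoints(checkpoint_dir):
    """`{dataset}.ckpt` = generate, `{dataset}_correct.ckpt` = correct."""
    available = {}
    for f in Path(checkpoint_dir).glob("*.ckpt"):
        if f.stem.endswith("_correct"):
            available.setdefault(f.stem[:-len("_correct")], []).append("correct")
        else:
            available.setdefault(f.stem, []).append("generate")
    return available
```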
62
+ ### Memory Management
63
+
64
+ Each Loader's `load_graph()` produces heavy arrays not needed for discovery endpoints:
65
+ - `node_neighbours`: shape `(num_nodes, num_relations, num_neighbours)` — ~275MB for freebase alone
66
+ - `com_neighbours`, `node_adjacency`, `com_adjacency`: large dicts/arrays
67
+ - `graph` (igraph object), degree arrays, importance arrays, machine assignments
68
+
69
+ After init, `_free_heavy_arrays()` sets all these to `None`. Only dataset name maps, train edge data, graph indexes, and community metadata are kept.
70
+
71
+ The DFS subgraph partitioning (`get_context_subgraph_samples_dfs`) allocates an O(subgraphs²) edge matrix. A `max_samples` parameter caps partitioning to only process enough nodes to produce the needed samples. The sampler's `context_subgraphs_nodes`/`context_subgraphs_edges` are also freed after sample generation.
72
+
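As a hypothetical illustration of the cap (the real implementation is `Sampler.get_context_subgraph_samples_dfs` in `preprocess.py`), a DFS partitioner with `max_samples` simply stops emitting groups early instead of covering the whole graph:

```python
def dfs_partition(adjacency, max_graph_size, max_samples=None):
    """Yield node groups of at most max_graph_size; stop after max_samples groups."""
    visited, produced = set(), 0
    for root in adjacency:
        if root in visited:
            continue
        group, stack = [], [root]
        while stack and len(group) < max_graph_size:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            group.append(node)
            stack.extend(n for n in adjacency[node] if n not in visited)
        yield group
        produced += 1
        if max_samples is not None and produced >= max_samples:
            return  # early exit avoids building the O(subgraphs^2) edge matrix for the full graph

# e.g. dfs_partition({0: [1], 1: [0, 2], 2: [1], 3: []}, max_graph_size=2, max_samples=1)
```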
73
+ Django's dev server auto-reloader spawns two processes, both calling `ready()`. The outer (file-watcher) process is skipped via a `RUN_MAIN` env var check in `apps.py` to avoid double memory usage.
74
+
75
+ ### Per-Algorithm Checkpoint Configs
76
+
77
+ Different algorithms were trained with different random seeds, producing different Leiden community structures:
78
+ - **Vanilla group** (transe/distmult/complex/rotate): shared seed per dataset
79
+ - **Q2B, KBGAT, nell_complex**: individual seeds from experiment runs
80
+
81
+ Community structure groups:
82
+ - freebase: transe/distmult/complex/rotate (1092 com) | q2b (1030) | kbgat (1025)
83
+ - wordnet: transe/distmult/complex/rotate (66 com) | q2b (74) | kbgat (88)
84
+ - nell: transe/distmult/rotate (282 com) | complex (188) | q2b (282, diff sizes) | kbgat (275)
85
+
86
+ `_CHECKPOINT_SEEDS` maps `(dataset, algorithm)` → training seed. `get_checkpoint_config(dataset_id, algorithm)` returns the full config. At startup only one Loader per dataset (vanilla seed) loads for discovery. Per-algorithm Loaders for inference load on demand.
87
 
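For illustration, the seed lookup described above behaves like this (the implementation appears later in this commit, in `api/services/registry.py`):

```python
from api.services.registry import VANILLA_SEEDS, get_checkpoint_config

cfg = get_checkpoint_config("nell", "complex")
assert cfg["seed"] == 2409194445                # individual seed from the experiment runs
assert get_checkpoint_config("nell", "rotate")["seed"] == VANILLA_SEEDS["nell"]  # vanilla group
```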
88
  ### Checkpoint Download (on each boot)
89
 
 
109
  - MultiProxAn (graph generation): glob `{MULTIPROXAN_DIR}/checkpoints/{dataset}.ckpt` and `{dataset}_c.ckpt`
110
  - DiGress KG (anomaly correction): glob `{DIGRESS_KG_DIR}/checkpoints/{dataset}.ckpt` and `{dataset}_correct.ckpt`
111
 
112
+ ### Sample Subgraph Generation
113
+
114
+ For KG anomaly `sample-subgraphs` endpoint:
115
+ - `Sampler.get_context_subgraph_samples_dfs()` in `graph_completion/graphs/preprocess.py` — DFS-based partitioning of the graph into sized subgraphs
116
+ - Returns `List[ContextSubgraph]` tuples of `(subgraph_row, subgraph_col, nodes_row, nodes_col, edges)`
117
+ - `max_samples` parameter caps how many subgraphs to partition (avoids full-graph traversal)
118
+ - Called once per dataset at startup, stored as `SubgraphInfo`, converted to API response format with entity/relation names resolved from the Loader's name maps (see the sketch below)
119
+ - Sampler's internal partitioning data freed after generation
120
+
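A condensed sketch of that tuple-to-response conversion (the full version is `ModelRegistry._build_sample_subgraphs` later in this commit; `inv_nodes`/`inv_relations` are the Loader's inverted name maps):

```python
def subgraph_to_response(sample, inv_nodes, inv_relations):
    subgraph_row, subgraph_col, nodes_row, nodes_col, edges = sample
    nodes = nodes_row if subgraph_row == subgraph_col else nodes_row + nodes_col
    index = {n: i for i, n in enumerate(nodes)}
    return {
        "nodes": [{"entity_id": n, "entity_name": str(inv_nodes.get(n, n))} for n in nodes],
        "edges": [
            {"source_idx": index[h], "target_idx": index[t],
             "relation_id": r, "relation_name": str(inv_relations.get(r, r))}
            for h, r, t in edges if h in index and t in index
        ],
    }
```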
121
  ## Django Settings
122
 
123
  - `DATABASES = {}` — no database
 
236
 
237
  ### Backend requirements.txt
238
 
239
+ The backend unifies both into one Python 3.9 environment. Since the COINs `Loader` is the only part reused for init, we pin to the DiGress/MultiProxAn versions and ensure COINs code runs under them. PyTorch CUDA 11.8 for local dev, falls back to CPU at runtime on PythonAnywhere.
240
 
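The runtime CPU fallback amounts to choosing the device once and mapping checkpoint tensors onto it at load time — a minimal sketch (the checkpoint path is illustrative):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
state = torch.load("checkpoints/freebase_rotate.tar", map_location=device)
```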
241
  ```
242
  # Django
 
247
  # Checkpoint download from Google Drive
248
  gdown>=4.7
249
 
250
+ # PyTorch with CUDA 11.8 (falls back to CPU at runtime if no GPU present)
251
+ --extra-index-url https://download.pytorch.org/whl/cu118
252
+ torch==2.0.1+cu118
253
+ torchvision==0.15.2+cu118
254
+ torchaudio==2.0.2+cu118
255
 
256
  # PyTorch Geometric (must match torch version)
257
  torch-geometric==2.3.1
 
270
  imageio==2.31.1
271
  torchmetrics==0.11.4
272
  tqdm==4.65.0
273
+ scikit-learn>=1.0
274
 
275
  # Conda-only deps (must be pre-installed, not in pip requirements):
276
  # rdkit==2023.03.2 (conda create -c conda-forge)
 
295
  12. Hook `ModelRegistry.initialize()` in `AppConfig.ready()`
296
  13. Add research `Loader` integration to registry for dataset loading + subgraph generation
297
 
298
+ ## Known Issues
299
+
300
+ - **Leiden non-reproducibility**: igraph 0.9.11's `community_leiden()` has no seed parameter. Re-running produces different community structures than the original training runs. For inference, the original cached `subgraphing.gz` files from the training server may be needed, or a matching igraph version that reproduces the results.
301
+
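One hedged mitigation to try: python-igraph draws randomness from Python's `random` module by default, so seeding it should make `community_leiden()` deterministic *within one igraph build* — though this still will not recover the original training-time structures produced by a different version:

```python
import random
import igraph

random.seed(4089853924)                        # e.g. the freebase vanilla seed
g = igraph.Graph.Erdos_Renyi(n=100, p=0.05)    # stand-in graph for illustration
clusters = g.community_leiden(objective_function="modularity",
                              resolution_parameter=5.0e-3)  # igraph 0.9.x keyword
```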
302
  ## Verification
303
 
304
  1. `pip install -r requirements.txt` — all dependencies install cleanly into `website_c` env
305
  2. `python manage.py check` — Django system check passes, 0 issues
306
+ 3. `python manage.py runserver` — registry downloads checkpoints (or skips), scans dirs, logs entity/relation counts
307
+ 4. `GET /api/v1/` — returns API name, version, description, full endpoint directory
308
  5. `GET /api/v1/health` — returns `{"status": "ok", "models_loaded": {"coins": bool, "multiproxan": bool, "kg_anomaly": bool}}`
309
  6. `GET /api/v1/methods` — returns 3 methods
310
  7. `GET /api/v1/coins/datasets` — returns 3 datasets with correct entity/relation counts
311
+ 8. `GET /api/v1/coins/datasets/wordnet/entities?q=dog&page=1&page_size=10` — filtered, paginated results
312
+ 9. `GET /api/v1/coins/datasets/wordnet/relations?q=hyper` — substring search on relation names
313
+ 10. `GET /api/v1/coins/datasets/wordnet/sample-triples?count=5` — 5 random triples with resolved names
314
  11. `GET /api/v1/coins/datasets/unknown/entities` — `{"error": {"code": "NOT_FOUND", ...}}`
315
  12. `GET /api/v1/coins/models` — 6 algorithms, filtered by actually available checkpoints
316
  13. `GET /api/v1/coins/query-structures` — 7 templates (1p, 2p, 3p, 2i, 3i, ip, pi) matching API spec
317
+ 14. `GET /api/v1/graph-generation/datasets` — QM9 + Community20 with model type availability
318
  15. `GET /api/v1/graph-generation/sampling-modes` — standard + multiprox with parameter specs
319
+ 16. `GET /api/v1/kg-anomaly/datasets` — 3 datasets with availability from checkpoint scan
320
+ 17. `GET /api/v1/kg-anomaly/datasets/wordnet/sample-subgraphs?count=3` — 3 pre-computed subgraphs with entity/relation names
321
  18. Import `docs/postman/collection.json` into Postman, set environment, run all GET requests against running server
322
 
323
  ## Critical Source Files
CLAUDE.md CHANGED
@@ -35,3 +35,9 @@ PythonAnywhere and hosted by them. API requests only from the front-end are allo
35
  system boot
36
  - `src/backend`: Django backend and endpoint files location
37
  - `src/frontend`: Vue.js and Semantic UI frontend files location
35
  system boot
36
  - `src/backend`: Django backend and endpoint files location
37
  - `src/frontend`: Vue.js and Semantic UI frontend files location
38
+
39
+ ## Backend Documentation
40
+
41
+ When making any backend change (new endpoint, config change, dependency, startup behavior), update `src/backend/README.md`
42
+ to reflect the change. Keep the endpoint table, project structure, and startup sequence sections current.
43
+
docs/postman/collection.json CHANGED
@@ -77,13 +77,13 @@
77
  "method": "GET",
78
  "header": [],
79
  "url": {
80
- "raw": "{{base_url}}/coins/datasets/freebase/entities?page=1&page_size=10&q=location",
81
  "host": ["{{base_url}}"],
82
- "path": ["coins", "datasets", "freebase", "entities"],
83
  "query": [
84
  { "key": "page", "value": "1" },
85
  { "key": "page_size", "value": "10" },
86
- { "key": "q", "value": "location", "description": "Substring search filter" }
87
  ]
88
  },
89
  "description": "Paginated, searchable entity list."
@@ -95,13 +95,13 @@
95
  "method": "GET",
96
  "header": [],
97
  "url": {
98
- "raw": "{{base_url}}/coins/datasets/freebase/relations?page=1&page_size=10&q=country",
99
  "host": ["{{base_url}}"],
100
- "path": ["coins", "datasets", "freebase", "relations"],
101
  "query": [
102
  { "key": "page", "value": "1" },
103
  { "key": "page_size", "value": "10" },
104
- { "key": "q", "value": "country", "description": "Substring search filter" }
105
  ]
106
  },
107
  "description": "Paginated, searchable relation list."
@@ -113,9 +113,9 @@
113
  "method": "GET",
114
  "header": [],
115
  "url": {
116
- "raw": "{{base_url}}/coins/datasets/freebase/sample-triples?count=5",
117
  "host": ["{{base_url}}"],
118
- "path": ["coins", "datasets", "freebase", "sample-triples"],
119
  "query": [
120
  { "key": "count", "value": "5" }
121
  ]
@@ -163,7 +163,7 @@
163
  ],
164
  "body": {
165
  "mode": "raw",
166
- "raw": "{\n \"dataset_id\": \"freebase\",\n \"algorithm\": \"rotate\",\n \"query_structure\": \"1p\",\n \"anchors\": {\"a\": 123},\n \"relations\": {\"r1\": 45},\n \"top_k\": 10\n}"
167
  },
168
  "url": {
169
  "raw": "{{base_url}}/coins/predict",
@@ -182,7 +182,7 @@
182
  ],
183
  "body": {
184
  "mode": "raw",
185
- "raw": "{\n \"dataset_id\": \"freebase\",\n \"algorithm\": \"q2b\",\n \"query_structure\": \"2i\",\n \"anchors\": {\"a1\": 123, \"a2\": 456},\n \"relations\": {\"r1\": 45, \"r2\": 78},\n \"top_k\": 10\n}"
186
  },
187
  "url": {
188
  "raw": "{{base_url}}/coins/predict",
@@ -201,7 +201,7 @@
201
  ],
202
  "body": {
203
  "mode": "raw",
204
- "raw": "{\n \"dataset_id\": \"freebase\",\n \"algorithm\": \"q2b\",\n \"query_structure\": \"ip\",\n \"anchors\": {\"a1\": 123, \"a2\": 456},\n \"relations\": {\"r1\": 45, \"r2\": 78, \"r3\": 12},\n \"top_k\": 10\n}"
205
  },
206
  "url": {
207
  "raw": "{{base_url}}/coins/predict",
@@ -328,9 +328,9 @@
328
  "method": "GET",
329
  "header": [],
330
  "url": {
331
- "raw": "{{base_url}}/kg-anomaly/datasets/freebase/sample-subgraphs?count=3",
332
  "host": ["{{base_url}}"],
333
- "path": ["kg-anomaly", "datasets", "freebase", "sample-subgraphs"],
334
  "query": [
335
  { "key": "count", "value": "3" }
336
  ]
@@ -352,7 +352,7 @@
352
  ],
353
  "body": {
354
  "mode": "raw",
355
- "raw": "{\n \"dataset_id\": \"freebase\",\n \"sampling_mode\": \"standard\",\n \"task\": \"correct\",\n \"subgraph\": {\n \"nodes\": [\n {\"entity_id\": 123, \"type_id\": 2},\n {\"entity_id\": 456, \"type_id\": 1},\n {\"entity_id\": 789, \"type_id\": 0}\n ],\n \"edges\": [\n {\"source_idx\": 0, \"target_idx\": 1, \"relation_id\": 45},\n {\"source_idx\": 1, \"target_idx\": 2, \"relation_id\": 12}\n ]\n },\n \"diffusion_steps\": 500,\n \"chain_frames\": 20\n}"
356
  },
357
  "url": {
358
  "raw": "{{base_url}}/kg-anomaly/correct",
@@ -371,7 +371,7 @@
371
  ],
372
  "body": {
373
  "mode": "raw",
374
- "raw": "{\n \"dataset_id\": \"freebase\",\n \"sampling_mode\": \"standard\",\n \"task\": \"generate\",\n \"subgraph\": {\n \"nodes\": [\n {\"entity_id\": 123, \"type_id\": 2},\n {\"entity_id\": 456, \"type_id\": 1}\n ],\n \"edges\": [\n {\"source_idx\": 0, \"target_idx\": 1, \"relation_id\": 45}\n ]\n },\n \"diffusion_steps\": 500,\n \"chain_frames\": 20\n}"
375
  },
376
  "url": {
377
  "raw": "{{base_url}}/kg-anomaly/correct",
@@ -390,7 +390,7 @@
390
  ],
391
  "body": {
392
  "mode": "raw",
393
- "raw": "{\n \"dataset_id\": \"freebase\",\n \"sampling_mode\": \"multiprox\",\n \"task\": \"correct\",\n \"subgraph\": {\n \"nodes\": [\n {\"entity_id\": 123, \"type_id\": 2},\n {\"entity_id\": 456, \"type_id\": 1}\n ],\n \"edges\": [\n {\"source_idx\": 0, \"target_idx\": 1, \"relation_id\": 45}\n ]\n },\n \"diffusion_steps\": 500,\n \"multiprox_params\": {\n \"m\": 10,\n \"t\": 0.5,\n \"t_prime\": 0.1\n }\n}"
394
  },
395
  "url": {
396
  "raw": "{{base_url}}/kg-anomaly/correct",
 
77
  "method": "GET",
78
  "header": [],
79
  "url": {
80
+ "raw": "{{base_url}}/coins/datasets/wordnet/entities?page=1&page_size=10&q=dog",
81
  "host": ["{{base_url}}"],
82
+ "path": ["coins", "datasets", "wordnet", "entities"],
83
  "query": [
84
  { "key": "page", "value": "1" },
85
  { "key": "page_size", "value": "10" },
86
+ { "key": "q", "value": "dog", "description": "Substring search filter" }
87
  ]
88
  },
89
  "description": "Paginated, searchable entity list."
 
95
  "method": "GET",
96
  "header": [],
97
  "url": {
98
+ "raw": "{{base_url}}/coins/datasets/wordnet/relations?page=1&page_size=10&q=hyper",
99
  "host": ["{{base_url}}"],
100
+ "path": ["coins", "datasets", "wordnet", "relations"],
101
  "query": [
102
  { "key": "page", "value": "1" },
103
  { "key": "page_size", "value": "10" },
104
+ { "key": "q", "value": "hyper", "description": "Substring search filter" }
105
  ]
106
  },
107
  "description": "Paginated, searchable relation list."
 
113
  "method": "GET",
114
  "header": [],
115
  "url": {
116
+ "raw": "{{base_url}}/coins/datasets/wordnet/sample-triples?count=5",
117
  "host": ["{{base_url}}"],
118
+ "path": ["coins", "datasets", "wordnet", "sample-triples"],
119
  "query": [
120
  { "key": "count", "value": "5" }
121
  ]
 
163
  ],
164
  "body": {
165
  "mode": "raw",
166
+ "raw": "{\n \"dataset_id\": \"wordnet\",\n \"algorithm\": \"rotate\",\n \"query_structure\": \"1p\",\n \"anchors\": {\"a\": 11754},\n \"relations\": {\"r1\": 3},\n \"top_k\": 10\n}"
167
  },
168
  "url": {
169
  "raw": "{{base_url}}/coins/predict",
 
182
  ],
183
  "body": {
184
  "mode": "raw",
185
+ "raw": "{\n \"dataset_id\": \"wordnet\",\n \"algorithm\": \"q2b\",\n \"query_structure\": \"2i\",\n \"anchors\": {\"a1\": 11754, \"a2\": 5142},\n \"relations\": {\"r1\": 3, \"r2\": 1},\n \"top_k\": 10\n}"
186
  },
187
  "url": {
188
  "raw": "{{base_url}}/coins/predict",
 
201
  ],
202
  "body": {
203
  "mode": "raw",
204
+ "raw": "{\n \"dataset_id\": \"wordnet\",\n \"algorithm\": \"q2b\",\n \"query_structure\": \"ip\",\n \"anchors\": {\"a1\": 11754, \"a2\": 5142},\n \"relations\": {\"r1\": 3, \"r2\": 1, \"r3\": 2},\n \"top_k\": 10\n}"
205
  },
206
  "url": {
207
  "raw": "{{base_url}}/coins/predict",
 
328
  "method": "GET",
329
  "header": [],
330
  "url": {
331
+ "raw": "{{base_url}}/kg-anomaly/datasets/wordnet/sample-subgraphs?count=3",
332
  "host": ["{{base_url}}"],
333
+ "path": ["kg-anomaly", "datasets", "wordnet", "sample-subgraphs"],
334
  "query": [
335
  { "key": "count", "value": "3" }
336
  ]
 
352
  ],
353
  "body": {
354
  "mode": "raw",
355
+ "raw": "{\n \"dataset_id\": \"wordnet\",\n \"sampling_mode\": \"standard\",\n \"task\": \"correct\",\n \"subgraph\": {\n \"nodes\": [\n {\"entity_id\": 11754, \"type_id\": 3},\n {\"entity_id\": 5142, \"type_id\": 3},\n {\"entity_id\": 8142, \"type_id\": 3}\n ],\n \"edges\": [\n {\"source_idx\": 0, \"target_idx\": 1, \"relation_id\": 3},\n {\"source_idx\": 1, \"target_idx\": 2, \"relation_id\": 1}\n ]\n },\n \"diffusion_steps\": 500,\n \"chain_frames\": 20\n}"
356
  },
357
  "url": {
358
  "raw": "{{base_url}}/kg-anomaly/correct",
 
371
  ],
372
  "body": {
373
  "mode": "raw",
374
+ "raw": "{\n \"dataset_id\": \"wordnet\",\n \"sampling_mode\": \"standard\",\n \"task\": \"generate\",\n \"subgraph\": {\n \"nodes\": [\n {\"entity_id\": 11754, \"type_id\": 3},\n {\"entity_id\": 5142, \"type_id\": 3}\n ],\n \"edges\": [\n {\"source_idx\": 0, \"target_idx\": 1, \"relation_id\": 3}\n ]\n },\n \"diffusion_steps\": 500,\n \"chain_frames\": 20\n}"
375
  },
376
  "url": {
377
  "raw": "{{base_url}}/kg-anomaly/correct",
 
390
  ],
391
  "body": {
392
  "mode": "raw",
393
+ "raw": "{\n \"dataset_id\": \"wordnet\",\n \"sampling_mode\": \"multiprox\",\n \"task\": \"correct\",\n \"subgraph\": {\n \"nodes\": [\n {\"entity_id\": 11754, \"type_id\": 3},\n {\"entity_id\": 5142, \"type_id\": 3}\n ],\n \"edges\": [\n {\"source_idx\": 0, \"target_idx\": 1, \"relation_id\": 3}\n ]\n },\n \"diffusion_steps\": 500,\n \"multiprox_params\": {\n \"m\": 10,\n \"t\": 0.5,\n \"t_prime\": 0.1\n }\n}"
394
  },
395
  "url": {
396
  "raw": "{{base_url}}/kg-anomaly/correct",
src/backend/README.md ADDED
@@ -0,0 +1,127 @@
1
+ # Backend — Django REST API
2
+
3
+ Stateless REST API serving the PhD research models. No database — dataset metadata is loaded into memory at startup, model checkpoints are loaded on first use, and each request is answered independently.
4
+
5
+ ## Prerequisites
6
+
7
+ 1. **Conda environment** with pre-installed system deps:
8
+ ```bash
9
+ conda create -n website python=3.9
10
+ conda activate website
11
+ conda install -c conda-forge rdkit=2023.03.2 graph-tool=2.45
12
+ ```
13
+
14
+ 2. **Pip dependencies**:
15
+ ```bash
16
+ pip install -r requirements.txt
17
+ ```
18
+
19
+ 3. **Model checkpoints** — downloaded automatically from Google Drive on first boot (via `gdown`). Alternatively, manually place `.tar` / `.ckpt` files in:
20
+ - `src/research/COINs-KGGeneration/graph_completion/checkpoints/` (COINs: `{dataset}_{algorithm}.tar`)
21
+ - `src/research/COINs-KGGeneration/graph_generation/checkpoints/` (KG anomaly: `{dataset}.ckpt`, `{dataset}_correct.ckpt`)
22
+ - `src/research/MultiProxAn/checkpoints/` (graph generation: `{dataset}.ckpt`, `{dataset}_c.ckpt`)
23
+
24
+ 4. **Dataset files** — the raw KG data files must be present under `src/research/COINs-KGGeneration/data/` (FB15k-237, WN18RR, NELL-995).
25
+
26
+ ## Running
27
+
28
+ From `src/backend/`:
29
+
30
+ ```bash
31
+ # Development server
32
+ python manage.py runserver 8000
33
+
34
+ # With custom settings
35
+ DJANGO_DEBUG=True DJANGO_SECRET_KEY=my-secret python manage.py runserver
36
+ ```
37
+
38
+ The API is served at `http://localhost:8000/api/v1/`.
39
+
40
+ ## Environment Variables
41
+
42
+ | Variable | Default | Description |
43
+ |---|---|---|
44
+ | `DJANGO_SECRET_KEY` | `dev-insecure-key-change-in-production` | Django secret key. **Set in production.** |
45
+ | `DJANGO_DEBUG` | `True` | Enable debug mode. Set to `False` in production. |
46
+ | `DJANGO_ALLOWED_HOSTS` | `localhost,127.0.0.1` | Comma-separated allowed hosts. |
47
+
48
+ ## Startup Sequence
49
+
50
+ On boot (`ApiConfig.ready()`), the `ModelRegistry` initializes:
51
+
52
+ 1. **Download checkpoints** from Google Drive if not already present locally
53
+ 2. **Scan checkpoint directories** to detect available models per method
54
+ 3. **Load lightweight COINs Loaders** — one per dataset (freebase, wordnet, nell), loading graph data, name maps, and train/val/test splits. Heavy arrays (node neighbours ~275MB each, community neighbours, adjacency dicts) are freed after initialization to keep memory low.
55
+ 4. **Generate sample subgraphs** for KG anomaly using the COINs Loaders
56
+
57
+ All model weights (COINs inference, graph generation, KG anomaly) are loaded lazily at first inference request.
58
+
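That lazy loading could follow a plain cache-on-first-use pattern — a sketch, not the registry's actual code (`load_fn` is a hypothetical checkpoint-loading callable):

```python
class LazyWeights:
    def __init__(self):
        self._models = {}                        # (dataset_id, algorithm) -> loaded model

    def get_model(self, dataset_id, algorithm, load_fn):
        key = (dataset_id, algorithm)
        if key not in self._models:
            self._models[key] = load_fn(dataset_id, algorithm)  # first request pays the load cost
        return self._models[key]
```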
59
+ ## API Endpoints
60
+
61
+ All endpoints are prefixed with `/api/v1/`.
62
+
63
+ ### Health & Discovery
64
+
65
+ | Method | Path | Description |
66
+ |---|---|---|
67
+ | `GET` | `/health` | Service health + model availability |
68
+ | `GET` | `/methods` | List the 3 research methods |
69
+
70
+ ### COINs — KG Reasoning
71
+
72
+ | Method | Path | Description |
73
+ |---|---|---|
74
+ | `GET` | `/coins/datasets` | List datasets with entity/relation counts |
75
+ | `GET` | `/coins/datasets/{id}/entities` | Paginated entity search (`?q=&page=&page_size=`) |
76
+ | `GET` | `/coins/datasets/{id}/relations` | Paginated relation search (`?q=&page=&page_size=`) |
77
+ | `GET` | `/coins/datasets/{id}/sample-triples` | Random training triples (`?count=10`) |
78
+ | `GET` | `/coins/models` | Available algorithms + supported query structures |
79
+ | `GET` | `/coins/query-structures` | Query graph templates for frontend rendering |
80
+ | `POST` | `/coins/predict` | Run inference (not yet implemented) |
81
+
82
+ ### Graph Generation — MultiProxAn
83
+
84
+ | Method | Path | Description |
85
+ |---|---|---|
86
+ | `GET` | `/graph-generation/datasets` | List graph types with node/edge types |
87
+ | `GET` | `/graph-generation/sampling-modes` | Sampling strategies with parameter specs |
88
+ | `POST` | `/graph-generation/generate` | Generate a graph (not yet implemented) |
89
+ | `POST` | `/graph-generation/continue` | Continue MultiProx generation (not yet implemented) |
90
+
91
+ ### KG Anomaly Correction
92
+
93
+ | Method | Path | Description |
94
+ |---|---|---|
95
+ | `GET` | `/kg-anomaly/datasets` | List datasets with correction models |
96
+ | `GET` | `/kg-anomaly/datasets/{id}/sample-subgraphs` | Pre-computed example subgraphs (`?count=5`) |
97
+ | `POST` | `/kg-anomaly/correct` | Run correction (not yet implemented) |
98
+ | `POST` | `/kg-anomaly/continue` | Continue MultiProx correction (not yet implemented) |
99
+
100
+ ## Project Structure
101
+
102
+ ```
103
+ src/backend/
104
+ manage.py
105
+ requirements.txt
106
+ research_api/ # Django project settings
107
+ settings.py
108
+ urls.py
109
+ wsgi.py
110
+ api/ # Django app
111
+ apps.py # Triggers ModelRegistry.initialize() on startup
112
+ urls.py # Route definitions
113
+ pagination.py # Shared pagination helper
114
+ exceptions.py # Custom error envelope
115
+ services/
116
+ constants.py # Dataset metadata, model configs, query structures
117
+ registry.py # ModelRegistry — checkpoint download, scanning, Loader init
118
+ views/
119
+ health.py # /health, /methods
120
+ coins.py # /coins/* endpoints
121
+ graph_generation.py # /graph-generation/* endpoints
122
+ kg_anomaly.py # /kg-anomaly/* endpoints
123
+ ```
124
+
125
+ ## Testing with Postman
126
+
127
+ Import the collection and environment from `docs/postman/` to test all discovery endpoints.
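For a scripted alternative to Postman, a stdlib-only smoke test against a running dev server (assumes `localhost:8000`) might look like:

```python
import json
import urllib.request

BASE = "http://localhost:8000/api/v1"
for path in ("/health", "/methods", "/coins/datasets"):
    with urllib.request.urlopen(BASE + path) as resp:
        print(path, json.load(resp))
```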
src/backend/api/apps.py CHANGED
@@ -6,6 +6,16 @@ class ApiConfig(AppConfig):
6
  default_auto_field = "django.db.models.BigAutoField"
7
 
8
  def ready(self):
9
  from api.services.registry import ModelRegistry
10
 
11
  ModelRegistry.initialize()
 
6
  default_auto_field = "django.db.models.BigAutoField"
7
 
8
  def ready(self):
9
+ import os
10
+ import sys
11
+
12
+ # Django's runserver auto-reloader spawns two processes, both calling ready().
13
+ # The inner (serving) process has RUN_MAIN="true"; the outer (watcher) does not.
14
+ # Skip initialization in the outer process to avoid double memory usage.
15
+ uses_reloader = "runserver" in sys.argv and "--noreload" not in sys.argv
16
+ if uses_reloader and os.environ.get("RUN_MAIN") != "true":
17
+ return
18
+
19
  from api.services.registry import ModelRegistry
20
 
21
  ModelRegistry.initialize()
src/backend/api/pagination.py ADDED
@@ -0,0 +1,8 @@
1
+ def paginate_list(items, page=1, page_size=50):
2
+ """Paginate a list with 1-indexed pages. Returns (page_items, total)."""
3
+ page = max(1, page)
4
+ page_size = max(1, min(200, page_size))
5
+ total = len(items)
6
+ start = (page - 1) * page_size
7
+ end = start + page_size
8
+ return items[start:end], total
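Usage of the helper above, including its clamping behaviour:

```python
items = list(range(120))
page_items, total = paginate_list(items, page=3, page_size=50)
assert (page_items, total) == (list(range(100, 120)), 120)
page_items, _ = paginate_list(items, page=0, page_size=500)  # clamped to page=1, page_size=200
assert page_items == items
```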
src/backend/api/services/constants.py ADDED
@@ -0,0 +1,297 @@
1
+ METHODS = [
2
+ {
3
+ "id": "coins",
4
+ "name": "COINs - Knowledge Graph Reasoning",
5
+ "thesis_section": "3.1",
6
+ "description": (
7
+ "Community-Informed Graph Embeddings (COINs) for scalable knowledge graph link prediction "
8
+ "and complex query answering. Uses community detection to localize embedding computation, "
9
+ "achieving significant speedups over full-graph methods."
10
+ ),
11
+ },
12
+ {
13
+ "id": "multiproxan",
14
+ "name": "MultiProxAn - Graph Generation",
15
+ "thesis_section": "4.3",
16
+ "description": (
17
+ "Discrete denoising diffusion model for graph generation with MultiProx sampling. "
18
+ "Generates molecular graphs (QM9) and synthetic community graphs using iterative "
19
+ "multi-measurement Gibbs sampling for improved sample quality."
20
+ ),
21
+ },
22
+ {
23
+ "id": "kg_anomaly",
24
+ "name": "KG Anomaly Correction",
25
+ "thesis_section": "4.4",
26
+ "description": (
27
+ "Diffusion-based knowledge graph subgraph correction. Applies the DiGress denoising "
28
+ "diffusion model to knowledge graph subgraphs to detect and correct anomalous edges."
29
+ ),
30
+ },
31
+ ]
32
+
33
+ COINS_DATASET_META = {
34
+ "freebase": {
35
+ "name": "FB15k-237",
36
+ "description": "Subset of Freebase knowledge base with 237 relation types",
37
+ "data_dir": "FB15k-237",
38
+ },
39
+ "wordnet": {
40
+ "name": "WN18RR",
41
+ "description": "Subset of WordNet lexical database with 11 relation types",
42
+ "data_dir": "WN18RR",
43
+ },
44
+ "nell": {
45
+ "name": "NELL-995",
46
+ "description": "Never-Ending Language Learner knowledge base with 200 relation types",
47
+ "data_dir": "NELL-995",
48
+ },
49
+ }
50
+
51
+ COINS_MODELS = [
52
+ {
53
+ "algorithm": "transe",
54
+ "name": "TransE",
55
+ "description": "Translation-based embedding model",
56
+ "supported_query_structures": ["1p"],
57
+ },
58
+ {
59
+ "algorithm": "distmult",
60
+ "name": "DistMult",
61
+ "description": "Bilinear diagonal embedding model",
62
+ "supported_query_structures": ["1p"],
63
+ },
64
+ {
65
+ "algorithm": "complex",
66
+ "name": "ComplEx",
67
+ "description": "Complex-valued embedding model",
68
+ "supported_query_structures": ["1p"],
69
+ },
70
+ {
71
+ "algorithm": "rotate",
72
+ "name": "RotatE",
73
+ "description": "Rotation-based embedding model in complex space",
74
+ "supported_query_structures": ["1p"],
75
+ },
76
+ {
77
+ "algorithm": "q2b",
78
+ "name": "Query2Box",
79
+ "description": "Box embedding model for complex logical queries",
80
+ "supported_query_structures": ["1p", "2p", "3p", "2i", "3i", "ip", "pi"],
81
+ },
82
+ {
83
+ "algorithm": "kbgat",
84
+ "name": "KBGAT",
85
+ "description": "Knowledge base graph attention network",
86
+ "supported_query_structures": ["1p"],
87
+ },
88
+ ]
89
+
90
+ QUERY_STRUCTURES = [
91
+ {
92
+ "id": "1p",
93
+ "name": "Single Hop",
94
+ "description": "Direct link prediction: who/what is connected to the anchor via this relation?",
95
+ "nodes": [
96
+ {"id": "a", "type": "anchor", "label": "Anchor"},
97
+ {"id": "t", "type": "target", "label": "?"},
98
+ ],
99
+ "edges": [
100
+ {"id": "r1", "source": "a", "target": "t", "label": "Relation"},
101
+ ],
102
+ },
103
+ {
104
+ "id": "2p",
105
+ "name": "Two Hop",
106
+ "description": "Two-step chain: anchor -> variable -> target",
107
+ "nodes": [
108
+ {"id": "a", "type": "anchor", "label": "Anchor"},
109
+ {"id": "v1", "type": "variable", "label": "Variable"},
110
+ {"id": "t", "type": "target", "label": "?"},
111
+ ],
112
+ "edges": [
113
+ {"id": "r1", "source": "a", "target": "v1", "label": "Relation 1"},
114
+ {"id": "r2", "source": "v1", "target": "t", "label": "Relation 2"},
115
+ ],
116
+ },
117
+ {
118
+ "id": "3p",
119
+ "name": "Three Hop",
120
+ "description": "Three-step chain: anchor -> v1 -> v2 -> target",
121
+ "nodes": [
122
+ {"id": "a", "type": "anchor", "label": "Anchor"},
123
+ {"id": "v1", "type": "variable", "label": "Variable 1"},
124
+ {"id": "v2", "type": "variable", "label": "Variable 2"},
125
+ {"id": "t", "type": "target", "label": "?"},
126
+ ],
127
+ "edges": [
128
+ {"id": "r1", "source": "a", "target": "v1", "label": "Relation 1"},
129
+ {"id": "r2", "source": "v1", "target": "v2", "label": "Relation 2"},
130
+ {"id": "r3", "source": "v2", "target": "t", "label": "Relation 3"},
131
+ ],
132
+ },
133
+ {
134
+ "id": "2i",
135
+ "name": "Two Intersection",
136
+ "description": "Intersection of two single-hop queries sharing the same target",
137
+ "nodes": [
138
+ {"id": "a1", "type": "anchor", "label": "Anchor 1"},
139
+ {"id": "a2", "type": "anchor", "label": "Anchor 2"},
140
+ {"id": "t", "type": "target", "label": "?"},
141
+ ],
142
+ "edges": [
143
+ {"id": "r1", "source": "a1", "target": "t", "label": "Relation 1"},
144
+ {"id": "r2", "source": "a2", "target": "t", "label": "Relation 2"},
145
+ ],
146
+ },
147
+ {
148
+ "id": "3i",
149
+ "name": "Three Intersection",
150
+ "description": "Intersection of three single-hop queries sharing the same target",
151
+ "nodes": [
152
+ {"id": "a1", "type": "anchor", "label": "Anchor 1"},
153
+ {"id": "a2", "type": "anchor", "label": "Anchor 2"},
154
+ {"id": "a3", "type": "anchor", "label": "Anchor 3"},
155
+ {"id": "t", "type": "target", "label": "?"},
156
+ ],
157
+ "edges": [
158
+ {"id": "r1", "source": "a1", "target": "t", "label": "Relation 1"},
159
+ {"id": "r2", "source": "a2", "target": "t", "label": "Relation 2"},
160
+ {"id": "r3", "source": "a3", "target": "t", "label": "Relation 3"},
161
+ ],
162
+ },
163
+ {
164
+ "id": "ip",
165
+ "name": "Intersection then Projection",
166
+ "description": "Two anchors intersect, then the result projects via a third relation to the target",
167
+ "nodes": [
168
+ {"id": "a1", "type": "anchor", "label": "Anchor 1"},
169
+ {"id": "a2", "type": "anchor", "label": "Anchor 2"},
170
+ {"id": "v1", "type": "variable", "label": "Variable"},
171
+ {"id": "t", "type": "target", "label": "?"},
172
+ ],
173
+ "edges": [
174
+ {"id": "r1", "source": "a1", "target": "v1", "label": "Relation 1"},
175
+ {"id": "r2", "source": "a2", "target": "v1", "label": "Relation 2"},
176
+ {"id": "r3", "source": "v1", "target": "t", "label": "Relation 3"},
177
+ ],
178
+ },
179
+ {
180
+ "id": "pi",
181
+ "name": "Projection then Intersection",
182
+ "description": "One anchor projects then intersects with a direct connection from a second anchor",
183
+ "nodes": [
184
+ {"id": "a1", "type": "anchor", "label": "Anchor 1"},
185
+ {"id": "v1", "type": "variable", "label": "Variable"},
186
+ {"id": "a2", "type": "anchor", "label": "Anchor 2"},
187
+ {"id": "t", "type": "target", "label": "?"},
188
+ ],
189
+ "edges": [
190
+ {"id": "r1", "source": "a1", "target": "v1", "label": "Relation 1"},
191
+ {"id": "r2", "source": "v1", "target": "t", "label": "Relation 2"},
192
+ {"id": "r3", "source": "a2", "target": "t", "label": "Relation 3"},
193
+ ],
194
+ },
195
+ ]
196
+
197
+ GRAPHGEN_DATASETS = {
198
+ "qm9": {
199
+ "name": "QM9",
200
+ "type": "molecular",
201
+ "description": "Small organic molecules with up to 9 heavy atoms (C, N, O, F)",
202
+ "node_types": ["C", "N", "O", "F"],
203
+ "edge_types": ["none", "single", "double", "triple", "aromatic"],
204
+ "max_nodes": 9,
205
+ },
206
+ "comm20": {
207
+ "name": "Community20",
208
+ "type": "synthetic",
209
+ "description": "Synthetic community-structured graphs with 12-20 nodes",
210
+ "node_types": ["node"],
211
+ "edge_types": ["none", "edge"],
212
+ "max_nodes": 20,
213
+ },
214
+ }
215
+
216
+ GRAPHGEN_SAMPLING_MODES = [
217
+ {
218
+ "id": "standard",
219
+ "name": "Standard Denoising",
220
+ "description": "Iterative denoising from T to 0. Full quality, slower.",
221
+ "parameters": [
222
+ {
223
+ "name": "diffusion_steps",
224
+ "type": "integer",
225
+ "description": "Number of diffusion steps T",
226
+ "default": 500,
227
+ "min": 50,
228
+ "max": 1000,
229
+ },
230
+ {
231
+ "name": "chain_frames",
232
+ "type": "integer",
233
+ "description": "Number of denoising snapshots in the GIF",
234
+ "default": 20,
235
+ "min": 10,
236
+ "max": 30,
237
+ },
238
+ ],
239
+ },
240
+ {
241
+ "id": "multiprox",
242
+ "name": "MultiProx Sampling",
243
+ "description": (
244
+ "Multi-measurement Gibbs sampling with proximal steps. "
245
+ "Step-by-step generation with controllable noise levels."
246
+ ),
247
+ "parameters": [
248
+ {
249
+ "name": "diffusion_steps",
250
+ "type": "integer",
251
+ "description": "Number of diffusion steps T",
252
+ "default": 500,
253
+ "min": 50,
254
+ "max": 1000,
255
+ },
256
+ {
257
+ "name": "m",
258
+ "type": "integer",
259
+ "description": "Number of parallel samples per multi-measurement step",
260
+ "default": 10,
261
+ "min": 2,
262
+ "max": 100,
263
+ },
264
+ {
265
+ "name": "t",
266
+ "type": "float",
267
+ "description": "First noise level (normalized, 0-1)",
268
+ "default": 0.5,
269
+ "min": 0.0,
270
+ "max": 1.0,
271
+ },
272
+ {
273
+ "name": "t_prime",
274
+ "type": "float",
275
+ "description": "Second noise level (normalized, 0-1). Must satisfy t_prime <= t.",
276
+ "default": 0.1,
277
+ "min": 0.0,
278
+ "max": 1.0,
279
+ },
280
+ ],
281
+ },
282
+ ]
283
+
284
+ KG_ANOMALY_DATASET_META = {
285
+ "freebase": {
286
+ "name": "FB15k-237",
287
+ "description": "Diffusion model trained on Freebase subgraphs",
288
+ },
289
+ "wordnet": {
290
+ "name": "WN18RR",
291
+ "description": "Diffusion model trained on WordNet subgraphs",
292
+ },
293
+ "nell": {
294
+ "name": "NELL-995",
295
+ "description": "Diffusion model trained on NELL subgraphs",
296
+ },
297
+ }
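As a hedged illustration of how the `QUERY_STRUCTURES` templates above can drive request validation (the helper below is not part of this commit's views), a `/coins/predict` payload must supply exactly the anchors and relations its structure declares — compare the Postman bodies earlier in this commit:

```python
def validate_query_payload(payload, structures=QUERY_STRUCTURES):
    structure = next(s for s in structures if s["id"] == payload["query_structure"])
    anchor_ids = {n["id"] for n in structure["nodes"] if n["type"] == "anchor"}
    relation_ids = {e["id"] for e in structure["edges"]}
    if set(payload["anchors"]) != anchor_ids:
        raise ValueError(f"expected anchors {sorted(anchor_ids)}")
    if set(payload["relations"]) != relation_ids:
        raise ValueError(f"expected relations {sorted(relation_ids)}")

validate_query_payload({
    "query_structure": "2i",
    "anchors": {"a1": 11754, "a2": 5142},
    "relations": {"r1": 3, "r2": 1},
})
```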
src/backend/api/services/registry.py CHANGED
@@ -1,14 +1,16 @@
1
  import logging
2
  import os
 
3
  from pathlib import Path
4
 
5
  from django.conf import settings
6
 
 
 
7
  logger = logging.getLogger(__name__)
8
 
9
  GDRIVE_FOLDER_ID = "14Bf8fi4KJn0rDdh9y8EFyA5b8OpQyXWi"
10
 
11
- # Subfolder IDs within the Google Drive folder (from gdown listing)
12
  GDRIVE_SUBFOLDERS = {
13
  "coins": {
14
  "folder_name": "COINs-KGGeneration/checkpoints_coins",
@@ -27,6 +29,104 @@ GDRIVE_SUBFOLDERS = {
27
  },
28
  }
29
 
30
 
31
  class ModelRegistry:
32
  _instance = None
@@ -35,6 +135,8 @@ class ModelRegistry:
35
  self.coins_checkpoints_available = {}
36
  self.graphgen_checkpoints_available = {}
37
  self.kg_anomaly_checkpoints_available = {}
 
 
38
 
39
  @classmethod
40
  def get(cls):
@@ -49,14 +151,19 @@ class ModelRegistry:
49
  instance = cls()
50
  instance._download_checkpoints()
51
  instance._scan_checkpoints()
 
 
52
  cls._instance = instance
53
  logger.info(
54
- "ModelRegistry initialized: coins=%s, multiproxan=%s, kg_anomaly=%s",
55
  instance.is_coins_loaded(),
56
  instance.is_graphgen_loaded(),
57
  instance.is_kg_anomaly_loaded(),
 
58
  )
59
 
 
 
60
  def _download_checkpoints(self):
61
  """Download checkpoints from Google Drive if not already present locally."""
62
  if self._all_checkpoint_dirs_populated():
@@ -116,6 +223,8 @@ class ModelRegistry:
116
  logger.info("Installing checkpoint: %s -> %s", src_file.name, dest_dir)
117
  src_file.replace(dest_file)
118
 
 
 
119
  def _scan_checkpoints(self):
120
  self._scan_coins_checkpoints()
121
  self._scan_graphgen_checkpoints()
@@ -161,6 +270,199 @@ class ModelRegistry:
161
  self.kg_anomaly_checkpoints_available.setdefault(name, []).append("generate")
162
  logger.info("DiGress KG checkpoints: %s", self.kg_anomaly_checkpoints_available)
163
 
164
  def is_coins_loaded(self):
165
  return bool(self.coins_checkpoints_available)
166
 
 
1
  import logging
2
  import os
3
+ import random
4
  from pathlib import Path
5
 
6
  from django.conf import settings
7
 
8
+ from api.services.constants import COINS_DATASET_META
9
+
10
  logger = logging.getLogger(__name__)
11
 
12
  GDRIVE_FOLDER_ID = "14Bf8fi4KJn0rDdh9y8EFyA5b8OpQyXWi"
13
 
 
14
  GDRIVE_SUBFOLDERS = {
15
  "coins": {
16
  "folder_name": "COINs-KGGeneration/checkpoints_coins",
 
29
  },
30
  }
31
 
32
+ # Shared sampler hyperparameters used across all COINs experiments
33
+ _SAMPLER_HPARS = {
34
+ "query_structure": ["1p"],
35
+ "num_negative_samples": 128,
36
+ "num_neighbours": 10,
37
+ "random_walk_length": 10,
38
+ "context_radius": 2,
39
+ "pagerank_importances": True,
40
+ "walks_relation_specific": True,
41
+ }
42
+
43
+ # Per-dataset base config (leiden resolution, loader hyperparams)
44
+ _DATASET_BASE = {
45
+ "freebase": {
46
+ "leiden_resolution": 5.0e-3,
47
+ "loader_hpars": {
48
+ "dataset_name": "freebase", "simulated": False,
49
+ "sample_source": "smore", "sampler_hpars": _SAMPLER_HPARS,
50
+ },
51
+ },
52
+ "wordnet": {
53
+ "leiden_resolution": None, # computed as 1/num_nodes
54
+ "loader_hpars": {
55
+ "dataset_name": "wordnet", "simulated": False,
56
+ "sample_source": "smore", "sampler_hpars": _SAMPLER_HPARS,
57
+ },
58
+ },
59
+ "nell": {
60
+ "leiden_resolution": 2.0e-5,
61
+ "loader_hpars": {
62
+ "dataset_name": "nell", "simulated": False,
63
+ "sample_source": "smore", "sampler_hpars": _SAMPLER_HPARS,
64
+ },
65
+ },
66
+ }
67
+
68
+ # Training seed per (dataset, algorithm), from PhD Thesis/Code experiment runs.
69
+ # Verified by matching num_communities in test_log.txt against checkpoint embedding shapes.
70
+ # Groups sharing the same community structure:
71
+ # freebase: transe/distmult/complex/rotate (1092 com) | q2b (1030) | kbgat (1025)
72
+ # wordnet: transe/distmult/complex/rotate (66 com) | q2b (74) | kbgat (88)
73
+ # nell: transe/distmult/rotate (282 com) | complex (188) | q2b (282, diff sizes) | kbgat (275)
74
+ _CHECKPOINT_SEEDS = {
75
+ ("freebase", "transe"): 4089853924, ("freebase", "distmult"): 4089853924,
76
+ ("freebase", "complex"): 4089853924, ("freebase", "rotate"): 4089853924,
77
+ ("freebase", "q2b"): 1503136574, ("freebase", "kbgat"): 123456789,
78
+ ("wordnet", "transe"): 1919180054, ("wordnet", "distmult"): 1919180054,
79
+ ("wordnet", "complex"): 1919180054, ("wordnet", "rotate"): 1919180054,
80
+ ("wordnet", "q2b"): 3312854056, ("wordnet", "kbgat"): 123456789,
81
+ ("nell", "transe"): 3192206669, ("nell", "distmult"): 3192206669,
82
+ ("nell", "rotate"): 3192206669,
83
+ ("nell", "complex"): 2409194445, ("nell", "q2b"): 3793326028, ("nell", "kbgat"): 123456789,
84
+ }
85
+
86
+ # Vanilla seeds per dataset — used for metadata Loaders (entity/relation search, sample triples).
87
+ VANILLA_SEEDS = {
88
+ "freebase": 4089853924,
89
+ "wordnet": 1919180054,
90
+ "nell": 3192206669,
91
+ }
92
+
93
+
94
+ def get_checkpoint_config(dataset_id, algorithm):
95
+ """Return the full config for a specific (dataset, algorithm) checkpoint."""
96
+ base = _DATASET_BASE[dataset_id]
97
+ seed = _CHECKPOINT_SEEDS.get((dataset_id, algorithm))
98
+ if seed is None:
99
+ seed = VANILLA_SEEDS[dataset_id]
100
+ return {"seed": seed, **base}
101
+
102
+
103
+ def _free_heavy_arrays(loader):
104
+ """Free memory-intensive arrays from a Loader that aren't needed for discovery endpoints."""
105
+ loader.node_neighbours = None
106
+ loader.com_neighbours = None
107
+ loader.node_adjacency = None
108
+ loader.com_adjacency = None
109
+ loader.label_community_edge_freqs = None
110
+ loader.label_community_edge_freqs_index = None
111
+ loader.machines = None
112
+ loader.graph = None
113
+ loader.node_importances = None
114
+ loader.neighbour_importances = None
115
+ loader.out_degrees = None
116
+ loader.in_degrees = None
117
+ loader.degrees = None
118
+ loader.node_degree_type_freqs = None
119
+ loader.relation_freqs = None
120
+
121
+
122
+ class SubgraphInfo:
123
+ """Holds pre-computed sample subgraphs for a KG anomaly dataset."""
124
+
125
+ __slots__ = ("subgraphs",)
126
+
127
+ def __init__(self, subgraphs):
128
+ self.subgraphs = subgraphs
129
+
130
 
131
  class ModelRegistry:
132
  _instance = None
 
135
  self.coins_checkpoints_available = {}
136
  self.graphgen_checkpoints_available = {}
137
  self.kg_anomaly_checkpoints_available = {}
138
+ self.loaders = {} # dataset_id -> lightweight Loader for discovery
139
+ self.kg_anomaly_subgraphs = {} # dataset_id -> SubgraphInfo
140
 
141
  @classmethod
142
  def get(cls):
 
151
  instance = cls()
152
  instance._download_checkpoints()
153
  instance._scan_checkpoints()
154
+ instance._load_all_loaders()
155
+ instance._generate_sample_subgraphs()
156
  cls._instance = instance
157
  logger.info(
158
+ "ModelRegistry initialized: coins=%s, multiproxan=%s, kg_anomaly=%s, loaders=%s",
159
  instance.is_coins_loaded(),
160
  instance.is_graphgen_loaded(),
161
  instance.is_kg_anomaly_loaded(),
162
+ list(instance.loaders.keys()),
163
  )
164
 
165
+ # ---- Checkpoint download -------------------------------------------
166
+
167
  def _download_checkpoints(self):
168
  """Download checkpoints from Google Drive if not already present locally."""
169
  if self._all_checkpoint_dirs_populated():
 
223
  logger.info("Installing checkpoint: %s -> %s", src_file.name, dest_dir)
224
  src_file.replace(dest_file)
225
 
226
+ # ---- Checkpoint scanning -------------------------------------------
227
+
228
  def _scan_checkpoints(self):
229
  self._scan_coins_checkpoints()
230
  self._scan_graphgen_checkpoints()
 
270
  self.kg_anomaly_checkpoints_available.setdefault(name, []).append("generate")
271
  logger.info("DiGress KG checkpoints: %s", self.kg_anomaly_checkpoints_available)
272
 
273
+ # ---- Loader initialization -----------------------------------------
274
+
275
+ def _load_all_loaders(self):
276
+ """Initialize one lightweight Loader per dataset for discovery endpoints.
277
+
278
+ Loads dataset, name maps, train/val/test split, and graph indexes.
279
+ Heavy arrays (node_neighbours, com_neighbours, adjacency dicts) are freed
280
+ after startup to save memory. Full Loaders for inference are loaded on demand.
281
+ """
282
+ coins_root = str(Path(settings.COINS_DATA_DIR).parent)
283
+ original_cwd = os.getcwd()
284
+ try:
285
+ os.chdir(coins_root)
286
+ from graph_completion.graphs.load_graph import Loader, LoaderHpars
287
+
288
+ for dataset_id in _DATASET_BASE:
289
+ seed = VANILLA_SEEDS[dataset_id]
290
+ config = get_checkpoint_config(dataset_id, "transe")
291
+ try:
292
+ logger.info("Initializing Loader for %s (seed=%d)...", dataset_id, seed)
293
+ loader = LoaderHpars.from_dict(config["loader_hpars"]).make()
294
+
295
+ leiden_resolution = config["leiden_resolution"]
296
+ if leiden_resolution is None:
297
+ dataset_obj = Loader.datasets[dataset_id]
298
+ dataset_obj.load_from_disk()
299
+ leiden_resolution = 1.0 / len(dataset_obj.node_data)
300
+ dataset_obj.unload_from_memory()
301
+
302
+ loader.load_graph(
303
+ seed=seed, device="cpu", val_size=0.01, test_size=0.02,
304
+ community_method="leiden", leiden_resolution=leiden_resolution,
305
+ )
306
+ # Free heavy arrays not needed for discovery endpoints
307
+ _free_heavy_arrays(loader)
308
+ self.loaders[dataset_id] = loader
309
+ logger.info(
310
+ "Loader ready for %s: %d entities, %d relations, %d train triples",
311
+ dataset_id, loader.num_nodes, loader.num_relations, len(loader.train_edge_data),
312
+ )
313
+ except Exception:
314
+ logger.exception("Failed to initialize Loader for %s", dataset_id)
315
+ finally:
316
+ os.chdir(original_cwd)
317
+
318
+ # ---- Loader accessor helpers ---------------------------------------
319
+
320
+ def get_loader(self, dataset_id):
321
+ """Return the metadata Loader for a dataset, or None."""
322
+ return self.loaders.get(dataset_id)
323
+
324
+ def get_entity_count(self, dataset_id):
325
+ loader = self.loaders.get(dataset_id)
326
+ return loader.num_nodes if loader else 0
327
+
328
+ def get_relation_count(self, dataset_id):
329
+ loader = self.loaders.get(dataset_id)
330
+ return loader.num_relations if loader else 0
331
+
332
+ def get_inverted_name_maps(self, dataset_id):
333
+ """Return (inv_node_names, inv_node_types, inv_relation_names) Series for a dataset."""
334
+ loader = self.loaders.get(dataset_id)
335
+ if loader is None:
336
+ return None, None, None
337
+ return loader.dataset.get_inverted_name_maps()
338
+
339
+ def search_entities(self, dataset_id, query=None, page=1, page_size=50):
340
+ """Search entities by substring, return paginated (id, name) list and total."""
341
+ loader = self.loaders.get(dataset_id)
342
+ if loader is None:
343
+ return [], 0
344
+ inv_nodes, _, _ = loader.dataset.get_inverted_name_maps()
345
+ items = [(int(idx), str(name)) for idx, name in inv_nodes.items()]
346
+ if query:
347
+ q = query.lower()
348
+ items = [(eid, name) for eid, name in items if q in name.lower()]
349
+ total = len(items)
350
+ start = (max(1, page) - 1) * page_size
351
+ return items[start:start + page_size], total
352
+
353
+ def search_relations(self, dataset_id, query=None, page=1, page_size=50):
354
+ """Search relations by substring, return paginated (id, name) list and total."""
355
+ loader = self.loaders.get(dataset_id)
356
+ if loader is None:
357
+ return [], 0
358
+ _, _, inv_relations = loader.dataset.get_inverted_name_maps()
359
+ items = [(int(idx), str(name)) for idx, name in inv_relations.items()]
360
+ if query:
361
+ q = query.lower()
362
+ items = [(rid, name) for rid, name in items if q in name.lower()]
363
+ total = len(items)
364
+ start = (max(1, page) - 1) * page_size
365
+ return items[start:start + page_size], total
366
+
367
+ def sample_triples(self, dataset_id, count=10):
368
+ """Return random triples with resolved entity/relation names."""
369
+ loader = self.loaders.get(dataset_id)
370
+ if loader is None:
371
+ return []
372
+ inv_nodes, _, inv_relations = loader.dataset.get_inverted_name_maps()
373
+ edge_data = loader.train_edge_data
374
+ count = min(count, len(edge_data))
375
+
376
+ indices = random.sample(range(len(edge_data)), count)
377
+
378
+ result = []
379
+ for i in indices:
380
+ row = edge_data.iloc[i]
381
+ h, r, t = int(row.s), int(row.r), int(row.t)
382
+ result.append({
383
+ "head": {"id": h, "name": str(inv_nodes.get(h, h))},
384
+ "relation": {"id": r, "name": str(inv_relations.get(r, r))},
385
+ "tail": {"id": t, "name": str(inv_nodes.get(t, t))},
386
+ })
387
+ return result
388
+
389
+ # ---- Sample subgraph generation ------------------------------------
390
+
391
+ def _generate_sample_subgraphs(self):
392
+ """Generate sample subgraphs for KG anomaly using the Loader's context subgraph DFS."""
393
+ for dataset_id in COINS_DATASET_META:
394
+ loader = self.loaders.get(dataset_id)
395
+ if loader is None:
396
+ continue
397
+ try:
398
+ subgraphs = self._build_sample_subgraphs(dataset_id, loader)
399
+ self.kg_anomaly_subgraphs[dataset_id] = SubgraphInfo(subgraphs)
400
+ logger.info("Generated %d sample subgraphs for %s", len(subgraphs), dataset_id)
401
+ except Exception:
402
+ logger.exception("Failed to generate sample subgraphs for %s", dataset_id)
403
+
404
+ def _build_sample_subgraphs(self, dataset_id, loader, num_subgraphs=20, max_graph_size=10):
405
+ """Build sample subgraphs using the Sampler's DFS-based context subgraph partitioning."""
406
+ inv_nodes, _, inv_relations = loader.dataset.get_inverted_name_maps()
407
+ node_types = loader.dataset.node_data.type.values
408
+
409
+ # Use the Sampler's DFS partitioning to get context subgraphs
410
+ samples = loader.sampler.get_context_subgraph_samples_dfs(
411
+ max_graph_size, loader.graph_indexes, loader.num_nodes,
412
+ max_samples=num_subgraphs * 5, disable_tqdm=True,
413
+ )
414
+
415
+ subgraphs = []
416
+ for subgraph_row, subgraph_col, nodes_row, nodes_col, edges in samples:
417
+ if len(subgraphs) >= num_subgraphs:
418
+ break
419
+ if len(edges) < 3:
420
+ continue
421
+
422
+ if subgraph_row == subgraph_col:
423
+ sg_nodes = nodes_row
424
+ else:
425
+ sg_nodes = nodes_row + nodes_col
426
+
427
+ node_idx = {n: i for i, n in enumerate(sg_nodes)}
428
+
429
+ nodes = []
430
+ for n in sg_nodes:
431
+ type_id = int(node_types[n]) if n < len(node_types) else 0
432
+ nodes.append({
433
+ "entity_id": n,
434
+ "entity_name": str(inv_nodes.get(n, n)),
435
+ "type_id": type_id,
436
+ })
437
+
438
+ edge_list = []
439
+ for h, r, t in edges:
440
+ if h in node_idx and t in node_idx:
441
+ edge_list.append({
442
+ "source_idx": node_idx[h],
443
+ "target_idx": node_idx[t],
444
+ "relation_id": r,
445
+ "relation_name": str(inv_relations.get(r, r)),
446
+ "entity_name_source": str(inv_nodes.get(h, h)),
447
+ "entity_name_target": str(inv_nodes.get(t, t)),
448
+ })
449
+
450
+ subgraphs.append({
451
+ "id": f"sample_{len(subgraphs) + 1}",
452
+ "num_nodes": len(nodes),
453
+ "num_edges": len(edge_list),
454
+ "nodes": nodes,
455
+ "edges": edge_list,
456
+ })
457
+
458
+ # Free the partitioning data stored on the sampler
459
+ loader.sampler.context_subgraphs_nodes = None
460
+ loader.sampler.context_subgraphs_edges = None
461
+
462
+ return subgraphs
463
+
464
+ # ---- Status --------------------------------------------------------
465
+
466
  def is_coins_loaded(self):
467
  return bool(self.coins_checkpoints_available)
468
 
src/backend/api/urls.py CHANGED
@@ -1,8 +1,33 @@
1
  from django.urls import path
2
 
3
- from api.views.health import ApiRootView, HealthView
4
 
5
  urlpatterns = [
 
6
  path("", ApiRootView.as_view()),
7
  path("health", HealthView.as_view()),
8
  ]
 
1
  from django.urls import path
2
 
3
+ from api.views.coins import (
4
+ CoinsDatasetsView,
5
+ CoinsEntitiesView,
6
+ CoinsModelsView,
7
+ CoinsQueryStructuresView,
8
+ CoinsRelationsView,
9
+ CoinsSampleTriplesView,
10
+ )
11
+ from api.views.graph_generation import GraphGenDatasetsView, GraphGenSamplingModesView
12
+ from api.views.health import ApiRootView, HealthView, MethodsView
13
+ from api.views.kg_anomaly import KgAnomalyDatasetsView, KgAnomalySampleSubgraphsView
14
 
15
  urlpatterns = [
16
+ # Health & discovery
17
  path("", ApiRootView.as_view()),
18
  path("health", HealthView.as_view()),
19
+ path("methods", MethodsView.as_view()),
20
+ # COINs
21
+ path("coins/datasets", CoinsDatasetsView.as_view()),
22
+ path("coins/datasets/<str:dataset_id>/entities", CoinsEntitiesView.as_view()),
23
+ path("coins/datasets/<str:dataset_id>/relations", CoinsRelationsView.as_view()),
24
+ path("coins/datasets/<str:dataset_id>/sample-triples", CoinsSampleTriplesView.as_view()),
25
+ path("coins/models", CoinsModelsView.as_view()),
26
+ path("coins/query-structures", CoinsQueryStructuresView.as_view()),
27
+ # Graph generation
28
+ path("graph-generation/datasets", GraphGenDatasetsView.as_view()),
29
+ path("graph-generation/sampling-modes", GraphGenSamplingModesView.as_view()),
30
+ # KG anomaly
31
+ path("kg-anomaly/datasets", KgAnomalyDatasetsView.as_view()),
32
+ path("kg-anomaly/datasets/<str:dataset_id>/sample-subgraphs", KgAnomalySampleSubgraphsView.as_view()),
33
  ]
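A quick smoke test for the new routes, as a sketch: it assumes `api.urls` is mounted at the URL root (prepend the prefix, e.g. `/api`, if it is not) and that `research_api.settings` is importable.

```
# Hypothetical smoke test for the discovery endpoints using Django's test client.
import os

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "research_api.settings")

import django

django.setup()

from django.test import Client

client = Client()
for route in (
    "/health",
    "/methods",
    "/coins/datasets",
    "/coins/models",
    "/coins/query-structures",
    "/graph-generation/datasets",
    "/graph-generation/sampling-modes",
    "/kg-anomaly/datasets",
):
    response = client.get(route)
    assert response.status_code == 200, (route, response.status_code)
```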
src/backend/api/views/coins.py ADDED
@@ -0,0 +1,106 @@
1
+ from rest_framework.response import Response
2
+ from rest_framework.views import APIView
3
+
4
+ from api.exceptions import NotFoundError
5
+ from api.services.constants import COINS_DATASET_META, COINS_MODELS, QUERY_STRUCTURES
6
+ from api.services.registry import ModelRegistry
7
+
8
+
9
+ def _require_loader(dataset_id):
10
+ """Validate dataset_id and ensure its Loader is available."""
11
+ if dataset_id not in COINS_DATASET_META:
12
+ raise NotFoundError(f"Dataset '{dataset_id}' not found")
13
+ registry = ModelRegistry.get()
14
+ if registry.get_loader(dataset_id) is None:
15
+ raise NotFoundError(f"Dataset '{dataset_id}' data not loaded")
16
+ return registry
17
+
18
+
19
+ class CoinsDatasetsView(APIView):
20
+ def get(self, request):
21
+ registry = ModelRegistry.get()
22
+ datasets = []
23
+ for dataset_id, meta in COINS_DATASET_META.items():
24
+ datasets.append({
25
+ "id": dataset_id,
26
+ "name": meta["name"],
27
+ "num_entities": registry.get_entity_count(dataset_id),
28
+ "num_relations": registry.get_relation_count(dataset_id),
29
+ "description": meta["description"],
30
+ })
31
+ return Response({"datasets": datasets})
32
+
33
+
34
+ class CoinsEntitiesView(APIView):
35
+ def get(self, request, dataset_id):
36
+ registry = _require_loader(dataset_id)
37
+ q = request.query_params.get("q", None)
38
+ page = max(1, int(request.query_params.get("page", 1)))
39
+ page_size = int(request.query_params.get("page_size", 50))
40
+ page_size = max(1, min(200, page_size))
41
+
42
+ page_items, total = registry.search_entities(dataset_id, q, page, page_size)
43
+
44
+ return Response({
45
+ "dataset_id": dataset_id,
46
+ "total": total,
47
+ "page": page,
48
+ "page_size": page_size,
49
+ "entities": [{"id": eid, "name": name} for eid, name in page_items],
50
+ })
51
+
52
+
53
+ class CoinsRelationsView(APIView):
54
+ def get(self, request, dataset_id):
55
+ registry = _require_loader(dataset_id)
56
+ q = request.query_params.get("q", None)
57
+ page = max(1, int(request.query_params.get("page", 1)))
58
+ page_size = int(request.query_params.get("page_size", 50))
59
+ page_size = max(1, min(200, page_size))
60
+
61
+ page_items, total = registry.search_relations(dataset_id, q, page, page_size)
62
+
63
+ return Response({
64
+ "dataset_id": dataset_id,
65
+ "total": total,
66
+ "page": page,
67
+ "page_size": page_size,
68
+ "relations": [{"id": rid, "name": name} for rid, name in page_items],
69
+ })
70
+
71
+
72
+ class CoinsSampleTriplesView(APIView):
73
+ def get(self, request, dataset_id):
74
+ registry = _require_loader(dataset_id)
75
+ count = int(request.query_params.get("count", 10))
76
+ count = max(1, min(50, count))
77
+
78
+ return Response({
79
+ "dataset_id": dataset_id,
80
+ "triples": registry.sample_triples(dataset_id, count),
81
+ })
82
+
83
+
84
+ class CoinsModelsView(APIView):
85
+ def get(self, request):
86
+ registry = ModelRegistry.get()
87
+ models = []
88
+ for model in COINS_MODELS:
89
+ available_datasets = []
90
+ for dataset_id in COINS_DATASET_META:
91
+ algos = registry.coins_checkpoints_available.get(dataset_id, [])
92
+ if model["algorithm"] in algos:
93
+ available_datasets.append(dataset_id)
94
+ models.append({
95
+ "algorithm": model["algorithm"],
96
+ "name": model["name"],
97
+ "description": model["description"],
98
+ "supported_query_structures": model["supported_query_structures"],
99
+ "available_datasets": available_datasets,
100
+ })
101
+ return Response({"models": models})
102
+
103
+
104
+ class CoinsQueryStructuresView(APIView):
105
+ def get(self, request):
106
+ return Response({"query_structures": QUERY_STRUCTURES})
src/backend/api/views/graph_generation.py ADDED
@@ -0,0 +1,29 @@
1
+ from rest_framework.response import Response
2
+ from rest_framework.views import APIView
3
+
4
+ from api.services.constants import GRAPHGEN_DATASETS, GRAPHGEN_SAMPLING_MODES
5
+ from api.services.registry import ModelRegistry
6
+
7
+
8
+ class GraphGenDatasetsView(APIView):
9
+ def get(self, request):
10
+ registry = ModelRegistry.get()
11
+ datasets = []
12
+ for dataset_id, meta in GRAPHGEN_DATASETS.items():
13
+ available_types = registry.graphgen_checkpoints_available.get(dataset_id, [])
14
+ datasets.append({
15
+ "id": dataset_id,
16
+ "name": meta["name"],
17
+ "type": meta["type"],
18
+ "description": meta["description"],
19
+ "node_types": meta["node_types"],
20
+ "edge_types": meta["edge_types"],
21
+ "max_nodes": meta["max_nodes"],
22
+ "available_model_types": available_types,
23
+ })
24
+ return Response({"datasets": datasets})
25
+
26
+
27
+ class GraphGenSamplingModesView(APIView):
28
+ def get(self, request):
29
+ return Response({"sampling_modes": GRAPHGEN_SAMPLING_MODES})
src/backend/api/views/health.py CHANGED
@@ -1,7 +1,7 @@
1
  from rest_framework.response import Response
2
- from rest_framework.reverse import reverse
3
  from rest_framework.views import APIView
4
 
 
5
  from api.services.registry import ModelRegistry
6
 
7
 
@@ -50,3 +50,8 @@ class HealthView(APIView):
50
  "kg_anomaly": registry.is_kg_anomaly_loaded(),
51
  },
52
  })
 
1
  from rest_framework.response import Response
 
2
  from rest_framework.views import APIView
3
 
4
+ from api.services.constants import METHODS
5
  from api.services.registry import ModelRegistry
6
 
7
 
 
50
  "kg_anomaly": registry.is_kg_anomaly_loaded(),
51
  },
52
  })
53
+
54
+
55
+ class MethodsView(APIView):
56
+ def get(self, request):
57
+ return Response({"methods": METHODS})
src/backend/api/views/kg_anomaly.py ADDED
@@ -0,0 +1,42 @@
1
+ from rest_framework.response import Response
2
+ from rest_framework.views import APIView
3
+
4
+ from api.exceptions import NotFoundError
5
+ from api.services.constants import KG_ANOMALY_DATASET_META
6
+ from api.services.registry import ModelRegistry
7
+
8
+
9
+ class KgAnomalyDatasetsView(APIView):
10
+ def get(self, request):
11
+ registry = ModelRegistry.get()
12
+ datasets = []
13
+ for dataset_id, meta in KG_ANOMALY_DATASET_META.items():
14
+ available = registry.kg_anomaly_checkpoints_available.get(dataset_id, [])
15
+ datasets.append({
16
+ "id": dataset_id,
17
+ "name": meta["name"],
18
+ "description": meta["description"],
19
+ "available_tasks": available,
20
+ })
21
+ return Response({"datasets": datasets})
22
+
23
+
24
+ class KgAnomalySampleSubgraphsView(APIView):
25
+ def get(self, request, dataset_id):
26
+ if dataset_id not in KG_ANOMALY_DATASET_META:
27
+ raise NotFoundError(f"Dataset '{dataset_id}' not found")
28
+
29
+ registry = ModelRegistry.get()
30
+ sg_info = registry.kg_anomaly_subgraphs.get(dataset_id)
31
+ if sg_info is None:
32
+ raise NotFoundError(f"No sample subgraphs available for dataset '{dataset_id}'")
33
+
34
+ count = int(request.query_params.get("count", 5))
35
+ count = max(1, min(10, count))
36
+
37
+ subgraphs = sg_info.subgraphs[:count]
38
+
39
+ return Response({
40
+ "dataset_id": dataset_id,
41
+ "subgraphs": subgraphs,
42
+ })
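End to end, a request such as `GET /kg-anomaly/datasets/<dataset_id>/sample-subgraphs?count=1` would return a body shaped like this (dataset id and counts are illustrative; the subgraph dicts are the ones built in `registry._build_sample_subgraphs`):

```
# Illustrative response body; "fb15k-237" is a placeholder dataset id.
response_body = {
    "dataset_id": "fb15k-237",
    "subgraphs": [
        {
            "id": "sample_1",
            "num_nodes": 5,
            "num_edges": 4,
            "nodes": [],  # entity dicts as built in _build_sample_subgraphs
            "edges": [],  # edge dicts with resolved entity/relation names
        },
    ],
}
```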
src/backend/requirements.txt ADDED
@@ -0,0 +1,36 @@
1
+ # Django
2
+ django==4.2.*
3
+ djangorestframework==3.14.*
4
+ django-cors-headers==4.*
5
+
6
+ # Checkpoint download from Google Drive
7
+ gdown>=4.7
8
+
9
+ # PyTorch with CUDA 11.8 (falls back to CPU at runtime if no GPU present)
10
+ --extra-index-url https://download.pytorch.org/whl/cu118
11
+ torch==2.0.1+cu118
12
+ torchvision==0.15.2+cu118
13
+ torchaudio==2.0.2+cu118
14
+
15
+ # PyTorch Geometric (must match torch version)
16
+ torch-geometric==2.3.1
17
+
18
+ # Research shared deps
19
+ pytorch-lightning==2.0.4
20
+ hydra-core==1.3.2
21
+ omegaconf==2.3.0
22
+ numpy==1.23
23
+ pandas==1.4
24
+ scipy==1.11.0
25
+ igraph==0.9.11
26
+ PyYAML==6.0
27
+ networkx==2.8.7
28
+ matplotlib==3.7.1
29
+ imageio==2.31.1
30
+ torchmetrics==0.11.4
31
+ tqdm==4.65.0
32
+ scikit-learn>=1.0
33
+
34
+ # Conda-only deps (must be pre-installed, not in pip requirements):
35
+ # rdkit==2023.03.2 (conda create -c conda-forge)
36
+ # graph-tool==2.45 (conda install -c conda-forge)
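The cu118 wheels do run on CPU-only hosts; a one-liner to confirm the fallback after installing:

```
# Verify the CUDA build degrades to CPU when no GPU is present.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(torch.__version__, device)  # e.g. "2.0.1+cu118 cpu" on a CPU-only host
```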
src/backend/research_api/settings.py CHANGED
@@ -1,9 +1,17 @@
1
  import os
 
2
  from pathlib import Path
3
 
4
  BASE_DIR = Path(__file__).resolve().parent.parent
5
  PROJECT_ROOT = BASE_DIR.parent.parent # Website root
6
 
7
  SECRET_KEY = os.environ.get("DJANGO_SECRET_KEY", "dev-insecure-key-change-in-production")
8
  DEBUG = os.environ.get("DJANGO_DEBUG", "True").lower() in ("true", "1")
9
  ALLOWED_HOSTS = os.environ.get("DJANGO_ALLOWED_HOSTS", "localhost,127.0.0.1").split(",")
 
1
  import os
2
+ import sys
3
  from pathlib import Path
4
 
5
  BASE_DIR = Path(__file__).resolve().parent.parent
6
  PROJECT_ROOT = BASE_DIR.parent.parent # Website root
7
 
8
+ # Add research repos to sys.path so their modules can be imported
9
+ _COINS_KG_ROOT = str(PROJECT_ROOT / "src" / "research" / "COINs-KGGeneration")
10
+ _MULTIPROXAN_ROOT = str(PROJECT_ROOT / "src" / "research" / "MultiProxAn")
11
+ for _path in (_COINS_KG_ROOT, _MULTIPROXAN_ROOT):
12
+ if _path not in sys.path:
13
+ sys.path.insert(0, _path)
14
+
15
  SECRET_KEY = os.environ.get("DJANGO_SECRET_KEY", "dev-insecure-key-change-in-production")
16
  DEBUG = os.environ.get("DJANGO_DEBUG", "True").lower() in ("true", "1")
17
  ALLOWED_HOSTS = os.environ.get("DJANGO_ALLOWED_HOSTS", "localhost,127.0.0.1").split(",")
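To confirm the `sys.path` wiring, a sketch (run it from the directory the research code expects as its working directory, since the repos use relative `data/` paths):

```
# Sanity check: importing settings should make the research packages importable.
import os

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "research_api.settings")

import django

django.setup()  # loads settings, which prepends the research repos to sys.path

from graph_completion.graphs.load_graph import Loader  # from COINs-KGGeneration

print(sorted(Loader.datasets))
```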
src/research/COINs-KGGeneration/graph_completion/graphs/load_graph.py CHANGED
@@ -23,14 +23,14 @@ from graph_completion.graphs.queries import balanced_partition, balanced_partiti
23
  from graph_completion.utils import AbstractConf
24
  from graph_data.load_data import CoDExL, Dataset, FreeBase, HighEnergyPhysicsCitations, NELL, \
25
  OGBBioKG, OGBCitation2, OGBLSCWikiKG90Mv2, \
26
- PatentCitations, RedditHyperlinks, Swisscom, SwisscomBig, Transport, WordNet, YAGO310
27
  from graph_data.serialization import load_object, save_object
28
 
29
 
30
  class Loader:
31
  datasets = {"reddit": RedditHyperlinks(), "hep": HighEnergyPhysicsCitations(), "patent": PatentCitations(),
32
  "ogbl-citation2": OGBCitation2(), "ogbl-biokg": OGBBioKG(), "ogb-lsc-wikikg90m2": OGBLSCWikiKG90Mv2(),
33
- "swisscom": SwisscomBig(), "freebase": FreeBase(), "wordnet": WordNet(), "nell": NELL(),
34
  "codex-l": CoDExL(), "yago3-10": YAGO310()}
35
  transport_cities = [path.split(os_path_sep)[-1] for path in glob("data/transport/*") if "." not in path[2:]]
36
  for city in transport_cities:
@@ -74,7 +74,7 @@ class Loader:
74
  self.inter_community_map: np.ndarray = None
75
  self.com_neighbours: Optional[np.ndarray] = None
76
  self.node_neighbours: Optional[np.ndarray] = None
77
- self.num_machines = min(num_machines, device_count())
78
  self.machines: np.ndarray = None
79
 
80
  self.graph_indexes: Iterable[AdjacencyIndex] = None
@@ -96,37 +96,8 @@ class Loader:
96
  self.leiden_resolution = leiden_resolution
97
 
98
  self.dataset = Loader.datasets[self.dataset_name]
99
-
100
- if self.dataset_name == "swisscom" and glob("data/swisscom/relations_orgs*"):
101
- self.dataset.node_names_map = pd.read_csv(
102
- f"data/swisscom/node_names_orgs.csv",
103
- header=0, index_col=0, squeeze=True, encoding="utf-8"
104
- )
105
- self.dataset.node_data = pd.read_csv(
106
- f"data/swisscom/nodes_orgs.csv",
107
- header=0, index_col=None, parse_dates=["time", ], encoding="utf-8"
108
- )
109
- self.dataset.edge_data = pd.read_csv(
110
- f"data/swisscom/relations_orgs.csv",
111
- header=0, index_col=None, parse_dates=["time", ], encoding="utf-8"
112
- )
113
- elif self.dataset_name == "swisscom":
114
- entrypoints = pd.concat([
115
- pd.read_csv(f"{customer_orgs_file}", header=0, squeeze=True, encoding="utf-8")
116
- for customer_orgs_file in glob("data/swisscom/customer_orgs_*")
117
- ], ignore_index=True).drop_duplicates().sort_values(ignore_index=True)
118
- swisscom_subgraphs = [Swisscom(entrypoint) for entrypoint in entrypoints]
119
- self.dataset.load_from_disk(swisscom_subgraphs)
120
- self.dataset.node_names_map.to_csv(
121
- f"data/swisscom/node_names_orgs.csv",
122
- header=True, index=True, encoding="utf-8")
123
- self.dataset.node_data.to_csv(f"data/swisscom/nodes_orgs.csv",
124
- header=True, index=False, encoding="utf-8")
125
- self.dataset.edge_data.to_csv(f"data/swisscom/relations_orgs.csv",
126
- header=True, index=False, encoding="utf-8")
127
- else:
128
- self.dataset.load_from_disk()
129
- self.dataset.time_sort_and_numerize()
130
  self.graph = Graph(self.dataset_name)
131
  self.graph.update_graph(self.dataset.node_data, self.dataset.edge_data)
132
  print("Computing required metrics...")
 
23
  from graph_completion.utils import AbstractConf
24
  from graph_data.load_data import CoDExL, Dataset, FreeBase, HighEnergyPhysicsCitations, NELL, \
25
  OGBBioKG, OGBCitation2, OGBLSCWikiKG90Mv2, \
26
+ PatentCitations, RedditHyperlinks, Transport, WordNet, YAGO310
27
  from graph_data.serialization import load_object, save_object
28
 
29
 
30
  class Loader:
31
  datasets = {"reddit": RedditHyperlinks(), "hep": HighEnergyPhysicsCitations(), "patent": PatentCitations(),
32
  "ogbl-citation2": OGBCitation2(), "ogbl-biokg": OGBBioKG(), "ogb-lsc-wikikg90m2": OGBLSCWikiKG90Mv2(),
33
+ "freebase": FreeBase(), "wordnet": WordNet(), "nell": NELL(),
34
  "codex-l": CoDExL(), "yago3-10": YAGO310()}
35
  transport_cities = [path.split(os_path_sep)[-1] for path in glob("data/transport/*") if "." not in path[2:]]
36
  for city in transport_cities:
 
74
  self.inter_community_map: np.ndarray = None
75
  self.com_neighbours: Optional[np.ndarray] = None
76
  self.node_neighbours: Optional[np.ndarray] = None
77
+ self.num_machines = max(1, min(num_machines, device_count() or 1))
78
  self.machines: np.ndarray = None
79
 
80
  self.graph_indexes: Iterable[AdjacencyIndex] = None
 
96
  self.leiden_resolution = leiden_resolution
97
 
98
  self.dataset = Loader.datasets[self.dataset_name]
99
+ self.dataset.load_from_disk()
100
+ self.dataset.time_sort_and_numerize()
 
101
  self.graph = Graph(self.dataset_name)
102
  self.graph.update_graph(self.dataset.node_data, self.dataset.edge_data)
103
  print("Computing required metrics...")
src/research/COINs-KGGeneration/graph_completion/graphs/preprocess.py CHANGED
@@ -741,15 +741,21 @@ class Sampler:
741
 
742
  def get_context_subgraph_samples_dfs(self, max_graph_size: int, graph_indexes: Iterable[AdjacencyIndex],
743
  num_nodes: int, allow_disc: bool = False,
 
744
  disable_tqdm: bool = False) -> Iterable[ContextSubgraph]:
745
  _, adj_s_to_t, adj_t_to_s, _, _ = graph_indexes
746
  assignment = -np.ones(num_nodes, dtype=int)
747
  subgraph = 0
748
  subgraph_size = 0
 
 
 
749
  progress_bar = tqdm(desc="Assigning nodes to context subgraphs", total=num_nodes, leave=False,
750
  disable=disable_tqdm)
751
 
752
  for i in range(num_nodes):
 
 
753
  if assignment[i] >= 0:
754
  continue
755
  stack = [i, ]
@@ -759,6 +765,7 @@ class Sampler:
759
  continue
760
  assignment[v] = subgraph
761
  subgraph_size += 1
 
762
  if subgraph_size == max_graph_size:
763
  subgraph_size = 0
764
  subgraph += 1
@@ -781,11 +788,16 @@ class Sampler:
781
  progress_bar.close()
782
  subgraphs_nodes = [[] for _ in range(subgraph)]
783
  subgraphs_edges = [[[] for _ in range(subgraph)] for _ in range(subgraph)]
784
- for v in tqdm(range(num_nodes), desc="Assigning nodes to subgraphs", leave=False, disable=disable_tqdm):
 
785
  subgraphs_nodes[assignment[v]].append(v)
786
  for s, n_s in tqdm(adj_s_to_t.items(), desc="Assigning edges to subgraphs", leave=False, disable=disable_tqdm):
 
 
787
  for r, n_s_r in n_s.items():
788
  for t in n_s_r:
 
 
789
  subgraphs_edges[assignment[s]][assignment[t]].append((s, r, t))
790
  samples = [(k, l, subgraphs_nodes[k], subgraphs_nodes[l], subgraphs_edges[k][l])
791
  for k in range(subgraph) for l in range(subgraph) if len(subgraphs_edges[k][l]) >= 5]
 
741
 
742
  def get_context_subgraph_samples_dfs(self, max_graph_size: int, graph_indexes: Iterable[AdjacencyIndex],
743
  num_nodes: int, allow_disc: bool = False,
744
+ max_samples: int = 0,
745
  disable_tqdm: bool = False) -> Iterable[ContextSubgraph]:
746
  _, adj_s_to_t, adj_t_to_s, _, _ = graph_indexes
747
  assignment = -np.ones(num_nodes, dtype=int)
748
  subgraph = 0
749
  subgraph_size = 0
750
+ nodes_assigned = 0
751
+ # If max_samples is set, only partition enough nodes to produce that many subgraphs
752
+ max_subgraphs = max_samples * 2 if max_samples > 0 else 0
753
  progress_bar = tqdm(desc="Assigning nodes to context subgraphs", total=num_nodes, leave=False,
754
  disable=disable_tqdm)
755
 
756
  for i in range(num_nodes):
757
+ if max_subgraphs > 0 and subgraph >= max_subgraphs:
758
+ break
759
  if assignment[i] >= 0:
760
  continue
761
  stack = [i, ]
 
765
  continue
766
  assignment[v] = subgraph
767
  subgraph_size += 1
768
+ nodes_assigned += 1
769
  if subgraph_size == max_graph_size:
770
  subgraph_size = 0
771
  subgraph += 1
 
788
  progress_bar.close()
789
  subgraphs_nodes = [[] for _ in range(subgraph)]
790
  subgraphs_edges = [[[] for _ in range(subgraph)] for _ in range(subgraph)]
791
+ assigned_nodes = np.flatnonzero(assignment >= 0)  # ascending order, matching the original full range scan
792
+ for v in tqdm(assigned_nodes, desc="Assigning nodes to subgraphs", leave=False, disable=disable_tqdm):
793
  subgraphs_nodes[assignment[v]].append(v)
794
  for s, n_s in tqdm(adj_s_to_t.items(), desc="Assigning edges to subgraphs", leave=False, disable=disable_tqdm):
795
+ if assignment[s] < 0:
796
+ continue
797
  for r, n_s_r in n_s.items():
798
  for t in n_s_r:
799
+ if assignment[t] < 0:
800
+ continue
801
  subgraphs_edges[assignment[s]][assignment[t]].append((s, r, t))
802
  samples = [(k, l, subgraphs_nodes[k], subgraphs_nodes[l], subgraphs_edges[k][l])
803
  for k in range(subgraph) for l in range(subgraph) if len(subgraphs_edges[k][l]) >= 5]
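Typical use of the new `max_samples` cap, assuming a `loader` initialised as in the registry startup code (so `sampler`, `graph_indexes` and `num_nodes` are populated):

```
# Partition only enough of the graph for ~100 samples instead of all nodes.
samples = loader.sampler.get_context_subgraph_samples_dfs(
    max_graph_size=10,
    graph_indexes=loader.graph_indexes,
    num_nodes=loader.num_nodes,
    max_samples=100,  # caps the DFS at 2 * 100 subgraphs
    disable_tqdm=True,
)
for subgraph_row, subgraph_col, nodes_row, nodes_col, edges in samples[:3]:
    print(subgraph_row, subgraph_col, len(nodes_row), len(edges))
```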
src/research/COINs-KGGeneration/graph_data/load_data.py CHANGED
@@ -327,7 +327,7 @@ class Transport(Dataset):
327
 
328
  class FreeBase(Dataset):
329
  def __init__(self):
330
- super().__init__("fb15k-237", "data/academic/FB15k-237")
331
 
332
  def load_from_disk(self, *args, **kwargs):
333
  train_edges = pd.read_csv(f"{self.file_location}/train.txt",
@@ -366,7 +366,7 @@ class FreeBase(Dataset):
366
 
367
  class WordNet(Dataset):
368
  def __init__(self):
369
- super().__init__("wn18rr", "data/academic/WN18RR")
370
 
371
  def load_from_disk(self, *args, **kwargs):
372
  train_edges = pd.read_csv(f"{self.file_location}/train.txt",
@@ -405,7 +405,7 @@ class WordNet(Dataset):
405
 
406
  class NELL(Dataset):
407
  def __init__(self):
408
- super().__init__("nell-995", "data/academic/NELL-995")
409
 
410
  def load_from_disk(self, *args, **kwargs):
411
  edge_data = pd.read_csv(f"{self.file_location}/raw.kb",
@@ -525,115 +525,6 @@ class YAGO310(Dataset):
525
  return self.clean_split(train_edges), self.clean_split(valid_edges), self.clean_split(test_edges)
526
 
527
 
528
- swisscom_node_types_map = pd.read_csv(f"data/swisscom_node_types.csv",
529
- header=None, index_col=None, encoding="utf-8",
530
- squeeze=True).reset_index(name="type").set_index("type")["index"]
531
- swisscom_relation_types_map = pd.read_csv(f"data/swisscom_relation_types.csv",
532
- header=None, index_col=None, encoding="utf-8",
533
- squeeze=True).reset_index(name="r").set_index("r")["index"]
534
-
535
-
536
- class Swisscom(Dataset):
537
- def __init__(self, customer_org: str):
538
- super().__init__(f"swisscom-{customer_org}", f"data/swisscom/{customer_org}")
539
- self.customer_org = customer_org
540
- self.node_types_map = swisscom_node_types_map
541
- self.relation_names_map = swisscom_relation_types_map
542
-
543
- def load_from_disk(self):
544
- def compute_final_node_label(row):
545
- if "Alert" in str(row["labels"]):
546
- return "Alert"
547
- elif str(row["labels"]) in ["Base", "-", ""] or pd.isna(row["n.inventory_type"]):
548
- return "-"
549
- else:
550
- return str(row["n.inventory_type"]).capitalize()
551
-
552
- node_data = pd.read_csv(glob(f"{self.file_location}/nodes_*")[-1],
553
- sep=",", header=0, index_col=None, encoding="utf-8",
554
- usecols=["n.id", "labels", "n.inventory_type"])
555
- node_data = node_data.assign(type=node_data.apply(compute_final_node_label, axis=1), time=pd.NaT)
556
- self.node_data = node_data.drop(columns=["labels", "n.inventory_type"]).rename(columns={"n.id": "n"})
557
- del node_data
558
-
559
- edge_data = pd.read_csv(glob(f"{self.file_location}/relations_*")[-1],
560
- sep=",", header=0, index_col=None, encoding="utf-8",
561
- usecols=["r.source_item_id", "relation_type_enriched", "r.target_item_id"])
562
- self.edge_data = edge_data.rename(columns={"r.source_item_id": "s",
563
- "relation_type_enriched": "r",
564
- "r.target_item_id": "t"}).assign(time=pd.NaT).dropna(how="all")
565
- del edge_data
566
-
567
-
568
- class SwisscomBig(Dataset):
569
- def __init__(self):
570
- super().__init__(f"swisscom-big", f"data/swisscom/swisscom-big")
571
- self.node_types_map = swisscom_node_types_map
572
- self.relation_names_map = swisscom_relation_types_map
573
-
574
- def load_from_disk(self, subgraphs: List[Swisscom]):
575
- self.node_data = pd.DataFrame(columns=["n", "type", "time"]).astype(
576
- {"n": "int", "type": "int", "time": "datetime64[ns]"}
577
- )
578
- self.edge_data = pd.DataFrame(columns=["s", "r", "t", "time"]).astype(
579
- {"s": "int", "r": "int", "t": "int", "time": "datetime64[ns]"}
580
- )
581
- self.node_names_map = pd.Series(name="index")
582
- self.node_names_map.index.name = "n"
583
-
584
- for dataset in tqdm(subgraphs, desc="Merging subgraphs into the big graph", leave=False):
585
- dataset.load_from_disk()
586
- is_new_node = ~dataset.node_data.n.isin(self.node_names_map.index)
587
- new_nodes = dataset.node_data.loc[is_new_node]
588
- self.node_names_map = self.node_names_map.append(
589
- len(self.node_names_map)
590
- + new_nodes.n.reset_index(name="n", drop=True).reset_index().set_index("n")["index"]
591
- )
592
- dataset.node_data.loc[is_new_node, "n"] = new_nodes.n.map(self.node_names_map)
593
- dataset.node_data.loc[is_new_node, "type"] = new_nodes.type.map(self.node_types_map)
594
- self.node_data = pd.concat((self.node_data, dataset.node_data.loc[is_new_node]), ignore_index=True)
595
- new_edges = dataset.edge_data
596
- new_edges.s = new_edges.s.map(self.node_names_map.rename("s"))
597
- new_edges.r = new_edges.r.map(self.relation_names_map)
598
- new_edges.t = new_edges.t.map(self.node_names_map.rename("t"))
599
- if len(self.edge_data) > 0:
600
- new_edges = new_edges.loc[~new_edges.s.isin(self.edge_data.s)
601
- | ~new_edges.t.isin(self.edge_data.t)
602
- | ~new_edges.r.isin(self.edge_data.r)]
603
- self.edge_data = pd.concat((self.edge_data, new_edges), ignore_index=True)
604
- dataset.unload_from_memory()
605
-
606
-
607
- class SwisscomCommunity(Dataset):
608
- def __init__(self, community_id: int):
609
- self.community_id = community_id
610
- super().__init__(f"swisscom-community-{community_id}", f"data/swisscom/community-{community_id}")
611
- self.node_types_map = swisscom_node_types_map
612
- self.relation_names_map = swisscom_relation_types_map
613
-
614
- def load_from_disk(self):
615
- return
616
-
617
-
618
- class SwisscomSubgraphOverlap(Dataset):
619
- def __init__(self, customer_org: str, customer_org_2: str):
620
- self.customer_org = customer_org
621
- self.customer_org_2 = customer_org_2
622
- super().__init__(f"swisscom-subgraph-overlap-{customer_org}-{customer_org_2}",
623
- f"data/swisscom/subgraph-overlap-{customer_org}-{customer_org_2}")
624
- self.node_types_map = swisscom_node_types_map
625
- self.relation_names_map = swisscom_relation_types_map
626
-
627
- def load_from_disk(self, subgraph_1: Swisscom, subgraph_2: Swisscom):
628
- subgraph_1.load_from_disk()
629
- subgraph_2.load_from_disk()
630
- self.node_data = subgraph_1.node_data.merge(subgraph_2.node_data, how="inner", on=["n", "type"])
631
- self.node_data.drop(columns="time_y", inplace=True)
632
- self.node_data.rename(columns={"time_x": "time"}, inplace=True)
633
-
634
- self.edge_data = subgraph_1.edge_data.merge(subgraph_2.edge_data, how="inner", on=["s", "r", "t"])
635
- self.edge_data.drop(columns="time_y", inplace=True)
636
- self.edge_data.rename(columns={"time_x": "time"}, inplace=True)
637
  subgraph_1.unload_from_memory()
638
  subgraph_2.unload_from_memory()
639
 
 
327
 
328
  class FreeBase(Dataset):
329
  def __init__(self):
330
+ super().__init__("fb15k-237", "data/FB15k-237")
331
 
332
  def load_from_disk(self, *args, **kwargs):
333
  train_edges = pd.read_csv(f"{self.file_location}/train.txt",
 
366
 
367
  class WordNet(Dataset):
368
  def __init__(self):
369
+ super().__init__("wn18rr", "data/WN18RR")
370
 
371
  def load_from_disk(self, *args, **kwargs):
372
  train_edges = pd.read_csv(f"{self.file_location}/train.txt",
 
405
 
406
  class NELL(Dataset):
407
  def __init__(self):
408
+ super().__init__("nell-995", "data/NELL-995")
409
 
410
  def load_from_disk(self, *args, **kwargs):
411
  edge_data = pd.read_csv(f"{self.file_location}/raw.kb",
 
525
  return self.clean_split(train_edges), self.clean_split(valid_edges), self.clean_split(test_edges)
526
 
527
 
528
  subgraph_1.unload_from_memory()
529
  subgraph_2.unload_from_memory()
530
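With the `data/academic/...` prefixes dropped, the academic KG datasets now resolve directly under `data/`; assuming `Dataset` stores its second constructor argument as `file_location` (as the read paths in `load_from_disk` suggest):

```
# After this change the datasets point at data/<name> rather than data/academic/<name>.
from graph_data.load_data import FreeBase, NELL, WordNet

for dataset in (FreeBase(), WordNet(), NELL()):
    print(dataset.file_location)  # data/FB15k-237, data/WN18RR, data/NELL-995
```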