Andrej Janchevski committed on
Commit
5ed6f37
·
1 Parent(s): 5375c2e

docs(deploy): refresh for the post-launch deployment iteration


Captures the architecture and operational decisions made during the
local Docker bring-up: the unified RESEARCH_ROOT / CHECKPOINTS_ROOT
layout, the single preloaded gunicorn worker, the AppConfig skip-list
for management commands, the share_memory and TransE-init dim-
expansion monkey-patches around experiment.prepare(), the MultiProx
symmetry safeguard, the bundled Loader caches that keep cold-start
RAM under control, and the WSL2 / shm_size requirements for running
the container locally.

Also corrects every HF Hub reference to the case-sensitive Bani57
namespace (account is registered as "Bani57", not "bani57"), and
clarifies the HF_TOKEN env var as recommended (read-scope token
lifts anonymous rate limits and roughly triples cold-start download
throughput; required only for private repos).

.claude/plans/deploy_huggingface_spaces.md CHANGED
@@ -12,8 +12,8 @@ We pivot to **Hugging Face Spaces with the Docker SDK**:
12
  - Free, no credit card. 16 GB RAM, 2 vCPU, 50 GB ephemeral disk, 7860 default port.
13
  - Native Docker - we provide a `Dockerfile`, HF builds and runs it.
14
  - Purpose-built for ML demos (the very thing this site is).
15
- - Checkpoints move from Google Drive to a free HF Hub model repo (`bani57/checkpoints`), served by HF's CDN - much faster than `gdown`, no quota games, resumable uploads/downloads.
16
- - Space is `bani57/website` → public URL `https://bani57-website.hf.space`. (HF Spaces URLs are always `<owner>-<spacename>.hf.space`; a bare `bani57.hf.space` isn't possible.)
17
  - A single container serves both the API (Django/Gunicorn) and the built Vue SPA (via WhiteNoise) on the same origin → no CORS, one URL.
18
 
19
  Trade-off accepted: free Spaces sleep after 48 h idle (first hit wakes it, ~2 min cold start because checkpoints are re-pulled from HF Hub on a fresh ephemeral disk). \$5/mo Persistent Storage upgrade eliminates this if it ever bothers us.
@@ -22,7 +22,7 @@ Trade-off accepted: free Spaces sleep after 48 h idle (first hit wakes it, ~2 mi
22
 
23
  ```
24
  ┌────────────────────────────────────┐
25
- │ HF Space: bani57/website │
26
  Browser ─HTTPS─▶│ https://bani57-website.hf.space │
27
  │ (Docker container, port 7860) │
28
  │ │
@@ -38,7 +38,7 @@ Trade-off accepted: free Spaces sleep after 48 h idle (first hit wakes it, ~2 mi
38
  │ ▲ snapshot_download() at boot │
39
  └───┼────────────────────────────────┘
40
  │
41
- HF Hub model repo: bani57/checkpoints
42
  (free, unlimited, CDN-backed, resumable)
43
  ```
44
 
@@ -50,11 +50,11 @@ Trade-off accepted: free Spaces sleep after 48 h idle (first hit wakes it, ~2 mi
50
  - `docker-compose.yml` (repo root) - local dev convenience only; HF Spaces ignores it and consumes only the Dockerfile.
51
  - `.dockerignore` - keep `node_modules`, `.git`, `src/research/**/checkpoints`, `data.zip`, `docs/`, screenshots, `*.npz`, `*.gz` out of the build context.
52
  - `entrypoint.sh` - runs `huggingface_hub.snapshot_download` (idempotent), then `exec gunicorn …` on `0.0.0.0:7860`.
53
- - `scripts/upload_checkpoints.py` (NEW) - one-shot script that walks the local research checkpoint dirs and pushes everything to `bani57/checkpoints` on HF Hub, preserving directory structure and resumable on partial failure.
54
  - Top-level `README.md` for the Space (a thin file with YAML front-matter HF requires; rendered as the Space card). Either commit it at repo root and let the Space repo serve as the deploy target, or maintain a small Space-only fork - see Deployment Steps for the chosen flow.
55
 
56
  **Modified:**
57
- - `src/backend/api/services/registry.py` - replace the `gdown` Google-Drive call with `huggingface_hub.snapshot_download(repo_id="bani57/checkpoints", local_dir=…)`. Keep the registry's lazy-load pattern intact. Mirror directory layout exactly so existing scan logic still finds files.
58
  - `src/backend/api/apps.py` - *remove* the eager checkpoint download from `AppConfig.ready()`. Downloads happen in `entrypoint.sh` *before* gunicorn starts, so workers never block.
59
  - `src/backend/research_api/settings.py` - add WhiteNoise to `MIDDLEWARE` (right after `SecurityMiddleware`); set `STATIC_ROOT = BASE_DIR / "staticfiles"`, `STATICFILES_STORAGE = "whitenoise.storage.CompressedManifestStaticFilesStorage"`; environment-drive `ALLOWED_HOSTS` and `CORS_ALLOWED_ORIGINS` (replace hardcoded `bani57.pythonanywhere.com` with `bani57-website.hf.space`); read `CHECKPOINTS_DIR` from env (default `/tmp/checkpoints` in container, repo path in dev).
60
  - `src/backend/research_api/urls.py` - add SPA shell view: `re_path(r"^(?!api/|static/).*$", spa_index_view)` returning `dist/index.html` for any non-API, non-static path. **This does not replace the Lara Croft 404** - Vue Router's existing `/:pathMatch(.*)*` route still catches unknown paths and renders `NotFoundView.vue` (Lara) client-side. Django's job ends at handing the SPA shell to the browser.
@@ -131,7 +131,7 @@ ENTRYPOINT ["/usr/local/bin/_entrypoint.sh", "/entrypoint.sh"]
131
  #!/bin/sh
132
  set -e
133
  python -c "from huggingface_hub import snapshot_download; \
134
- snapshot_download(repo_id='bani57/checkpoints', \
135
  local_dir='/tmp/checkpoints', \
136
  local_dir_use_symlinks=False, \
137
  max_workers=4)"
@@ -183,25 +183,25 @@ huggingface-cli login # paste the hf_… token; it's stored in ~/.cache/huggi
183
  **3. Create the repo (one-time, public).**
184
  ```bash
185
  huggingface-cli repo create checkpoints --type model --yes
186
- # Repo URL: https://huggingface.co/bani57/checkpoints
187
  ```
188
 
189
  **4. Upload, preserving directory layout.** The remote layout matches the on-disk one exactly so the Django registry's existing scan logic finds files unchanged.
190
  ```bash
191
  # COINs graph completion (~2.4 GB)
192
- huggingface-cli upload bani57/checkpoints \
193
  src/research/COINs-KGGeneration/graph_completion/checkpoints \
194
  COINs-KGGeneration/graph_completion/checkpoints \
195
  --repo-type model
196
 
197
  # COINs graph generation / DIGRESS (~2.6 GB)
198
- huggingface-cli upload bani57/checkpoints \
199
  src/research/COINs-KGGeneration/graph_generation/checkpoints \
200
  COINs-KGGeneration/graph_generation/checkpoints \
201
  --repo-type model
202
 
203
  # MultiProxAn (~364 MB)
204
- huggingface-cli upload bani57/checkpoints \
205
  src/research/MultiProxAn/checkpoints \
206
  MultiProxAn/checkpoints \
207
  --repo-type model
@@ -209,10 +209,10 @@ huggingface-cli upload bani57/checkpoints \
209
 
210
  `huggingface-cli upload` chunks files and is resumable on retry. Total upload at typical home upstream (~10 MB/s) ≈ 15–20 min.
211
 
212
- **5. Verify.** Open <https://huggingface.co/bani57/checkpoints/tree/main>, confirm the three top-level folders. Then locally:
213
  ```bash
214
  python -c "from huggingface_hub import snapshot_download; \
215
- snapshot_download(repo_id='bani57/checkpoints', local_dir='/tmp/ck-test')"
216
  ls -lah /tmp/ck-test
217
  ```
218
  Expect ~5.4 GB total. This is the same call the container makes at boot.
@@ -224,10 +224,10 @@ A `scripts/upload_checkpoints.py` wrapper (idempotent β€” uses `HfApi.upload_fol
224
  1. **Upload checkpoints to HF Hub** - as above.
225
  2. **Land the code changes** above on `master` (one PR is fine - single deployment unit).
226
  3. **Local smoke test**: `docker compose up --build`, hit `http://localhost:7860`, exercise each demo (KG completion, KG anomaly, MultiProxAn graph generation), verify SSE streaming and the Postman collection in `docs/postman/`. Confirm the Lara Croft page renders at `http://localhost:7860/this/is/a/wrong/path`.
227
- 4. **Create the Space**: on huggingface.co/new-space, name `bani57/website`, SDK = Docker, hardware = CPU basic (free), visibility = public.
228
  5. **Push to the Space**:
229
  ```bash
230
- git remote add hf https://huggingface.co/spaces/bani57/website
231
  git push hf master:main
232
  ```
233
  HF Spaces uses git on a `main` branch. First build takes ~20–30 min (mamba env solve + GPU torch + apt deps + frontend build + first checkpoint download).
@@ -235,7 +235,7 @@ A `scripts/upload_checkpoints.py` wrapper (idempotent β€” uses `HfApi.upload_fol
235
  - `DJANGO_SECRET_KEY` - generate fresh (`python -c "import secrets; print(secrets.token_urlsafe(50))"`).
236
  - `DJANGO_DEBUG=False`.
237
  - `DJANGO_ALLOWED_HOSTS=bani57-website.hf.space`.
238
- - `HF_TOKEN` - only if `bani57/checkpoints` is private; for a public model repo no token is needed at runtime.
239
  7. **Verification on the live Space**:
240
  - Open `https://bani57-website.hf.space` - landing page renders, navigation works, hard-refresh on a deep route still resolves.
241
  - Visit `https://bani57-website.hf.space/foo/bar/quux` - Lara 404 page appears (confirms catch-all + Vue Router still cooperate in prod).
 
12
  - Free, no credit card. 16 GB RAM, 2 vCPU, 50 GB ephemeral disk, 7860 default port.
13
  - Native Docker - we provide a `Dockerfile`, HF builds and runs it.
14
  - Purpose-built for ML demos (the very thing this site is).
15
+ - Checkpoints move from Google Drive to a free HF Hub model repo (`Bani57/checkpoints`), served by HF's CDN - much faster than `gdown`, no quota games, resumable uploads/downloads.
16
+ - Space is `Bani57/website` → public URL `https://bani57-website.hf.space`. (HF Spaces URLs are always `<owner>-<spacename>.hf.space`; a bare `bani57.hf.space` isn't possible.)
17
  - A single container serves both the API (Django/Gunicorn) and the built Vue SPA (via WhiteNoise) on the same origin → no CORS, one URL.
18
 
19
  Trade-off accepted: free Spaces sleep after 48 h idle (first hit wakes it, ~2 min cold start because checkpoints are re-pulled from HF Hub on a fresh ephemeral disk). \$5/mo Persistent Storage upgrade eliminates this if it ever bothers us.
 
22
 
23
  ```
24
  ┌────────────────────────────────────┐
25
+ │ HF Space: Bani57/website │
26
  Browser ─HTTPS─▶│ https://bani57-website.hf.space │
27
  │ (Docker container, port 7860) │
28
  │ │
 
38
  │ ▲ snapshot_download() at boot │
39
  └───┼────────────────────────────────┘
40
  │
41
+ HF Hub model repo: Bani57/checkpoints
42
  (free, unlimited, CDN-backed, resumable)
43
  ```
44
 
 
50
  - `docker-compose.yml` (repo root) - local dev convenience only; HF Spaces ignores it and consumes only the Dockerfile.
51
  - `.dockerignore` - keep `node_modules`, `.git`, `src/research/**/checkpoints`, `data.zip`, `docs/`, screenshots, `*.npz`, `*.gz` out of the build context.
52
  - `entrypoint.sh` - runs `huggingface_hub.snapshot_download` (idempotent), then `exec gunicorn …` on `0.0.0.0:7860`.
53
+ - `scripts/upload_checkpoints.py` (NEW) - one-shot script that walks the local research checkpoint dirs and pushes everything to `Bani57/checkpoints` on HF Hub, preserving directory structure and resumable on partial failure.
54
  - Top-level `README.md` for the Space (a thin file with YAML front-matter HF requires; rendered as the Space card). Either commit it at repo root and let the Space repo serve as the deploy target, or maintain a small Space-only fork - see Deployment Steps for the chosen flow.
55
 
56
  **Modified:**
57
+ - `src/backend/api/services/registry.py` - replace the `gdown` Google-Drive call with `huggingface_hub.snapshot_download(repo_id="Bani57/checkpoints", local_dir=…)`. Keep the registry's lazy-load pattern intact. Mirror directory layout exactly so existing scan logic still finds files.
58
  - `src/backend/api/apps.py` - *remove* the eager checkpoint download from `AppConfig.ready()`. Downloads happen in `entrypoint.sh` *before* gunicorn starts, so workers never block.
59
  - `src/backend/research_api/settings.py` - add WhiteNoise to `MIDDLEWARE` (right after `SecurityMiddleware`); set `STATIC_ROOT = BASE_DIR / "staticfiles"`, `STATICFILES_STORAGE = "whitenoise.storage.CompressedManifestStaticFilesStorage"`; environment-drive `ALLOWED_HOSTS` and `CORS_ALLOWED_ORIGINS` (replace hardcoded `bani57.pythonanywhere.com` with `bani57-website.hf.space`); read `CHECKPOINTS_DIR` from env (default `/tmp/checkpoints` in container, repo path in dev).
60
  - `src/backend/research_api/urls.py` - add SPA shell view: `re_path(r"^(?!api/|static/).*$", spa_index_view)` returning `dist/index.html` for any non-API, non-static path. **This does not replace the Lara Croft 404** - Vue Router's existing `/:pathMatch(.*)*` route still catches unknown paths and renders `NotFoundView.vue` (Lara) client-side. Django's job ends at handing the SPA shell to the browser.
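
To see which paths the catch-all pattern in the `urls.py` item lets through, the negative lookahead can be exercised with the standard `re` module. This is only a sketch: `is_spa_path` is a hypothetical helper, the sample paths are made up, and it relies on the fact that Django matches URL patterns against the request path with the leading slash removed.

```python
import re

# Same pattern as in the re_path() call quoted above.
SPA_PATTERN = re.compile(r"^(?!api/|static/).*$")


def is_spa_path(path: str) -> bool:
    """Return True if the SPA shell would be served for this path (hypothetical helper)."""
    # Django strips the leading slash before matching, so do the same here.
    return SPA_PATTERN.match(path.lstrip("/")) is not None


if __name__ == "__main__":
    # Sample paths are illustrative only.
    for candidate in ["/", "/cv", "/this/is/a/wrong/path", "/api/v1/health", "/static/app.js"]:
        print(candidate, is_spa_path(candidate))
```

Unknown paths such as `/this/is/a/wrong/path` still match, which is what lets Vue Router render the 404 page client-side.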
 
131
  #!/bin/sh
132
  set -e
133
  python -c "from huggingface_hub import snapshot_download; \
134
+ snapshot_download(repo_id='Bani57/checkpoints', \
135
  local_dir='/tmp/checkpoints', \
136
  local_dir_use_symlinks=False, \
137
  max_workers=4)"
 
183
  **3. Create the repo (one-time, public).**
184
  ```bash
185
  huggingface-cli repo create checkpoints --type model --yes
186
+ # Repo URL: https://huggingface.co/Bani57/checkpoints
187
  ```
188
 
189
  **4. Upload, preserving directory layout.** The remote layout matches the on-disk one exactly so the Django registry's existing scan logic finds files unchanged.
190
  ```bash
191
  # COINs graph completion (~2.4 GB)
192
+ huggingface-cli upload Bani57/checkpoints \
193
  src/research/COINs-KGGeneration/graph_completion/checkpoints \
194
  COINs-KGGeneration/graph_completion/checkpoints \
195
  --repo-type model
196
 
197
  # COINs graph generation / DIGRESS (~2.6 GB)
198
+ huggingface-cli upload Bani57/checkpoints \
199
  src/research/COINs-KGGeneration/graph_generation/checkpoints \
200
  COINs-KGGeneration/graph_generation/checkpoints \
201
  --repo-type model
202
 
203
  # MultiProxAn (~364 MB)
204
+ huggingface-cli upload Bani57/checkpoints \
205
  src/research/MultiProxAn/checkpoints \
206
  MultiProxAn/checkpoints \
207
  --repo-type model
 
209
 
210
  `huggingface-cli upload` chunks files and is resumable on retry. Total upload at typical home upstream (~10 MB/s) ≈ 15–20 min.
211
 
212
+ **5. Verify.** Open <https://huggingface.co/Bani57/checkpoints/tree/main>, confirm the three top-level folders. Then locally:
213
  ```bash
214
  python -c "from huggingface_hub import snapshot_download; \
215
+ snapshot_download(repo_id='Bani57/checkpoints', local_dir='/tmp/ck-test')"
216
  ls -lah /tmp/ck-test
217
  ```
218
  Expect ~5.4 GB total. This is the same call the container makes at boot.
 
224
  1. **Upload checkpoints to HF Hub** - as above.
225
  2. **Land the code changes** above on `master` (one PR is fine - single deployment unit).
226
  3. **Local smoke test**: `docker compose up --build`, hit `http://localhost:7860`, exercise each demo (KG completion, KG anomaly, MultiProxAn graph generation), verify SSE streaming and the Postman collection in `docs/postman/`. Confirm the Lara Croft page renders at `http://localhost:7860/this/is/a/wrong/path`.
227
+ 4. **Create the Space**: on huggingface.co/new-space, name `Bani57/website`, SDK = Docker, hardware = CPU basic (free), visibility = public.
228
  5. **Push to the Space**:
229
  ```bash
230
+ git remote add hf https://huggingface.co/spaces/Bani57/website
231
  git push hf master:main
232
  ```
233
  HF Spaces uses git on a `main` branch. First build takes ~20–30 min (mamba env solve + GPU torch + apt deps + frontend build + first checkpoint download).
 
235
  - `DJANGO_SECRET_KEY` - generate fresh (`python -c "import secrets; print(secrets.token_urlsafe(50))"`).
236
  - `DJANGO_DEBUG=False`.
237
  - `DJANGO_ALLOWED_HOSTS=bani57-website.hf.space`.
238
+ - `HF_TOKEN` - only if `Bani57/checkpoints` is private; for a public model repo no token is needed at runtime.
239
  7. **Verification on the live Space**:
240
  - Open `https://bani57-website.hf.space` - landing page renders, navigation works, hard-refresh on a deep route still resolves.
241
  - Visit `https://bani57-website.hf.space/foo/bar/quux` - Lara 404 page appears (confirms catch-all + Vue Router still cooperate in prod).
CLAUDE.md CHANGED
@@ -21,7 +21,7 @@ allowed (HTTPS and other security).
21
 
22
  ## Architecture
23
 
24
- - `https://bani57-website.hf.space`: Live website URL (HF Space `bani57/website`, Docker SDK)
25
  - `docs/cvAndrejJanchevski.pdf`: My CV, perfectly mapped to a subpage `/cv`, titled "CV".
26
  - `docs/janchevski_scalable_2025.pdf`: My PhD thesis, not served on website (only the
27
  link https://infoscience.epfl.ch/entities/publication/87acf391-feef-43a0-b665-7f2f0bc70b2c given), but used as
@@ -31,12 +31,12 @@ allowed (HTTPS and other security).
31
  - `src/research/COINs-KGGeneration`: repo dump for the COINs KG reasoning and KG generation experiments (3.1 and 4.4 of
32
  PhD thesis)
33
  - `src/research/MultiProxAn`: repo dump for the MultiProxAn graph generation experiments (4.3 of PhD thesis)
34
- - `https://huggingface.co/bani57/checkpoints`: HF Hub model repo holding all PyTorch checkpoints. The container's
35
  entrypoint pre-warms a `snapshot_download` into `/app/checkpoints` (mirrors the on-disk `src/research/...`
36
  layout) before gunicorn starts, and `ModelRegistry._download_checkpoints` is idempotent on warm starts.
37
  - `Dockerfile`, `environment.yml`, `entrypoint.sh`, `docker-compose.yml`, `.dockerignore`: deployment assets at
38
  the repo root. `docker compose up --build` reproduces the production container locally on `:7860`.
39
- - `scripts/upload_checkpoints.py`: one-shot helper to (re-)publish local checkpoints to `bani57/checkpoints`.
40
  - `src/backend`: Django backend and endpoint files location
41
  - `src/frontend`: Vue.js and Semantic UI frontend files location
42
 
 
21
 
22
  ## Architecture
23
 
24
+ - `https://bani57-website.hf.space`: Live website URL (HF Space `Bani57/website`, Docker SDK)
25
  - `docs/cvAndrejJanchevski.pdf`: My CV, perfectly mapped to a subpage `/cv`, titled "CV".
26
  - `docs/janchevski_scalable_2025.pdf`: My PhD thesis, not served on website (only the
27
  link https://infoscience.epfl.ch/entities/publication/87acf391-feef-43a0-b665-7f2f0bc70b2c given), but used as
 
31
  - `src/research/COINs-KGGeneration`: repo dump for the COINs KG reasoning and KG generation experiments (3.1 and 4.4 of
32
  PhD thesis)
33
  - `src/research/MultiProxAn`: repo dump for the MultiProxAn graph generation experiments (4.3 of PhD thesis)
34
+ - `https://huggingface.co/Bani57/checkpoints`: HF Hub model repo holding all PyTorch checkpoints. The container's
35
  entrypoint pre-warms a `snapshot_download` into `/app/checkpoints` (mirrors the on-disk `src/research/...`
36
  layout) before gunicorn starts, and `ModelRegistry._download_checkpoints` is idempotent on warm starts.
37
  - `Dockerfile`, `environment.yml`, `entrypoint.sh`, `docker-compose.yml`, `.dockerignore`: deployment assets at
38
  the repo root. `docker compose up --build` reproduces the production container locally on `:7860`.
39
+ - `scripts/upload_checkpoints.py`: one-shot helper to (re-)publish local checkpoints to `Bani57/checkpoints`.
40
  - `src/backend`: Django backend and endpoint files location
41
  - `src/frontend`: Vue.js and Semantic UI frontend files location
42
 
README.md CHANGED
@@ -13,7 +13,7 @@ license: mit
13
 
14
  Live demos for the PhD thesis _Scalable Methods for Knowledge Graph Reasoning and Generation_ (Andrej Janchevski, EPFL, 2025). The thesis is mirrored at [infoscience.epfl.ch](https://infoscience.epfl.ch/entities/publication/87acf391-feef-43a0-b665-7f2f0bc70b2c); a CV is rendered at `/cv` from `docs/cvAndrejJanchevski.pdf`.
15
 
16
- The site is a single Django + Vue Docker container. The Django backend serves a stateless REST API at `/api/v1/*` and WhiteNoise serves the built Vue SPA on every other path - same origin, no CORS in production. Deployment target: Hugging Face [Space](docs/glossary.md#hf-space) `bani57/website` → <https://bani57-website.hf.space>.
17
 
18
  ## What's in the demos
19
 
 
13
 
14
  Live demos for the PhD thesis _Scalable Methods for Knowledge Graph Reasoning and Generation_ (Andrej Janchevski, EPFL, 2025). The thesis is mirrored at [infoscience.epfl.ch](https://infoscience.epfl.ch/entities/publication/87acf391-feef-43a0-b665-7f2f0bc70b2c); a CV is rendered at `/cv` from `docs/cvAndrejJanchevski.pdf`.
15
 
16
+ The site is a single Django + Vue Docker container. The Django backend serves a stateless REST API at `/api/v1/*` and WhiteNoise serves the built Vue SPA on every other path - same origin, no CORS in production. Deployment target: Hugging Face [Space](docs/glossary.md#hf-space) `Bani57/website` → <https://bani57-website.hf.space>.
17
 
18
  ## What's in the demos
19
 
docs/README.md CHANGED
@@ -1,6 +1,6 @@
1
  # Documentation
2
 
3
- Technical documentation for the PhD-research demo website. The site is a single Django + Vue Docker container deployed to a Hugging Face [Space](glossary.md#hf-space) (`bani57/website` → <https://bani57-website.hf.space>) and showcases three research methods from the thesis _Scalable Methods for Knowledge Graph Reasoning and Generation_ (Andrej Janchevski, EPFL, 2025): [COINs](glossary.md#coins), [MultiProxAn](glossary.md#multiproxan) and [KG anomaly correction](glossary.md#task-kg-anomaly).
4
 
5
  This document is a routing layer. Pick the document type that matches your goal - definitions, understanding, lookup, or task.
6
 
 
1
  # Documentation
2
 
3
+ Technical documentation for the PhD-research demo website. The site is a single Django + Vue Docker container deployed to a Hugging Face [Space](glossary.md#hf-space) (`Bani57/website` → <https://bani57-website.hf.space>) and showcases three research methods from the thesis _Scalable Methods for Knowledge Graph Reasoning and Generation_ (Andrej Janchevski, EPFL, 2025): [COINs](glossary.md#coins), [MultiProxAn](glossary.md#multiproxan) and [KG anomaly correction](glossary.md#task-kg-anomaly).
4
 
5
  This document is a routing layer. Pick the document type that matches your goal - definitions, understanding, lookup, or task.
6
 
docs/explanation/architecture.md CHANGED
@@ -9,15 +9,14 @@ The site runs as a single Docker container on a Hugging Face [Space](../glossary
9
  flowchart LR
10
  Browser([Browser])
11
  subgraph Container["HF Space container :7860"]
12
- Gunicorn[gunicorn<br/>2 workers]
13
  WN[WhiteNoise<br/>middleware]
14
  DRF[DRF views<br/>api/views/]
15
  SPA[(SPA dist/<br/>index.html and assets)]
16
  Stat[(staticfiles/<br/>collectstatic)]
17
- CK[(/app/checkpoints<br/>PyTorch weights)]
18
- Research[(/app/research<br/>research code)]
19
  end
20
- Hub[(HF Hub<br/>bani57/checkpoints)]
21
 
22
  Browser -->|HTTPS| Gunicorn
23
  Gunicorn --> WN
@@ -26,9 +25,8 @@ flowchart LR
26
  WN -->|miss| URLs{Django<br/>URL router}
27
  URLs -->|/api/v1/*| DRF
28
  URLs -->|non-API path| SPA
29
- DRF -.->|imports| Research
30
- DRF -.->|loads weights| CK
31
- CK -.->|cold-start populate| Hub
32
  ```
33
 
34
  Routing precedence inside the container:
@@ -53,7 +51,13 @@ Long-running inference is exposed as [SSE](../glossary.md#sse-server-sent-events
53
 
54
  ## Concurrency
55
 
56
- Free-tier HF Spaces run on 2 vCPU with no GPU. The container starts gunicorn with two workers and two threads, and `ModelRegistry._inference_lock` serializes inference globally. A second concurrent inference request gets HTTP 429 with `INFERENCE_BUSY`. Discovery endpoints (datasets, entities, methods, health) acquire no lock and run freely in parallel.
 
 
 
 
 
 
57
 
58
  ## Container lifecycle
59
 
@@ -65,9 +69,9 @@ sequenceDiagram
65
  participant G as gunicorn
66
  participant D as Django (AppConfig.ready)
67
  HF->>E: container start
68
- E->>Hub: snapshot_download(bani57/checkpoints)
69
- Hub-->>E: 5.4 GB into /app/checkpoints
70
- E->>G: exec gunicorn :7860
71
  G->>D: import research_api.wsgi
72
  D->>D: ModelRegistry.initialize()
73
  D->>D: scan checkpoint dirs
@@ -84,8 +88,8 @@ Detail on each step lives in [inference-lifecycle.md](inference-lifecycle.md).
84
  .
85
  ├── Dockerfile # multi-stage: Node SPA build → micromamba runtime
86
  ├── environment.yml # conda half (rdkit 2023.03.2, boost, cairo, …)
87
- ├── docker-compose.yml # local-dev parity, named volume for checkpoints
88
- ├── entrypoint.sh # snapshot_download then exec gunicorn
89
  ├── README.md # HF Space landing card (YAML front-matter)
90
  ├── src/
91
  │ ├── backend/
@@ -96,7 +100,7 @@ Detail on each step lives in [inference-lifecycle.md](inference-lifecycle.md).
96
  │ ├── frontend/ # Vue 3 SPA (built into dist/ at image time)
97
  │ └── research/ # research code, imported via sys.path tweaks
98
  └── scripts/
99
- └── upload_checkpoints.py # one-shot publisher to bani57/checkpoints
100
  ```
101
 
102
  ## Cross-references
 
9
  flowchart LR
10
  Browser([Browser])
11
  subgraph Container["HF Space container :7860"]
12
+ Gunicorn[gunicorn master<br/>1 worker, 4 threads<br/>--preload]
13
  WN[WhiteNoise<br/>middleware]
14
  DRF[DRF views<br/>api/views/]
15
  SPA[(SPA dist/<br/>index.html and assets)]
16
  Stat[(staticfiles/<br/>collectstatic)]
17
+ Research[(/app/research<br/>code, configs,<br/>Loader caches,<br/>HF-Hub weights)]
 
18
  end
19
+ Hub[(HF Hub<br/>Bani57/checkpoints)]
20
 
21
  Browser -->|HTTPS| Gunicorn
22
  Gunicorn --> WN
 
25
  WN -->|miss| URLs{Django<br/>URL router}
26
  URLs -->|/api/v1/*| DRF
27
  URLs -->|non-API path| SPA
28
+ DRF -.->|imports + loads| Research
29
+ Research -.->|cold-start populate| Hub
 
30
  ```
31
 
32
  Routing precedence inside the container:
 
51
 
52
  ## Concurrency
53
 
54
+ Free-tier HF Spaces run on 2 vCPU with no GPU. The container starts gunicorn with **one preloaded worker** and four threads. The single-worker choice is deliberate:
55
+
56
+ - The `ModelRegistry` holds several GB of COINs Loaders and lazily loaded model weights in process RAM. A second worker would duplicate every byte.
57
+ - `ModelRegistry._inference_lock` serializes inference globally. A second worker would only ever queue on the same lock - no throughput gain.
58
+ - `--preload` runs Django setup (and `ModelRegistry.initialize`) once in the master before forking, so the worker inherits a ready state via copy-on-write and gunicorn's worker timeout, which kills a worker that stays silent for too long, doesn't fire mid-init.
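
The single-worker setup described in the list above could be written as a `gunicorn.conf.py` roughly like this. It is a sketch, not the project's actual configuration: the file name and the `timeout` value are assumptions, and the real entrypoint may pass the same settings as command-line flags instead.

```python
# gunicorn.conf.py (hypothetical file; gunicorn reads these module-level names)
bind = "0.0.0.0:7860"  # port the HF Space container listens on
workers = 1            # one process, so the registry's RAM is not duplicated
threads = 4            # lock-free discovery endpoints still run in parallel
preload_app = True     # config-file equivalent of --preload: import the app before forking
timeout = 300          # assumed value, generous so a slow first model load is not killed
```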
59
+
60
+ A second concurrent inference request gets HTTP 429 with `INFERENCE_BUSY`. Discovery endpoints (datasets, entities, methods, health) acquire no lock and run freely in parallel via the worker's threads.
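
The fail-fast behaviour behind the HTTP 429 can be sketched with a non-blocking `threading.Lock` acquire. Everything named here is illustrative: `InferenceBusy` is a stand-in for the project's exception, and `inference_slot` and `run_inference` are hypothetical helpers, not the registry's real API.

```python
import threading
from contextlib import contextmanager


class InferenceBusy(Exception):
    """Stand-in for the exception the API maps to HTTP 429 / INFERENCE_BUSY."""

    status_code = 429


_inference_lock = threading.Lock()


@contextmanager
def inference_slot():
    """Hold the global lock for one inference, or fail fast if it is taken."""
    if not _inference_lock.acquire(blocking=False):
        raise InferenceBusy("INFERENCE_BUSY")
    try:
        yield
    finally:
        # Always release, even if inference raises.
        _inference_lock.release()


def run_inference(payload):
    # Hypothetical endpoint body: only one caller at a time gets past this line.
    with inference_slot():
        return {"echo": payload}
```

Endpoints that never enter `inference_slot()` are unaffected, which matches the note that discovery endpoints acquire no lock.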
61
 
62
  ## Container lifecycle
63
 
 
69
  participant G as gunicorn
70
  participant D as Django (AppConfig.ready)
71
  HF->>E: container start
72
+ E->>Hub: snapshot_download(Bani57/checkpoints)
73
+ Hub-->>E: 5.4 GB into /app/research
74
+ E->>G: exec gunicorn --preload :7860
75
  G->>D: import research_api.wsgi
76
  D->>D: ModelRegistry.initialize()
77
  D->>D: scan checkpoint dirs
 
88
  .
89
  ├── Dockerfile # multi-stage: Node SPA build → micromamba runtime
90
  ├── environment.yml # conda half (rdkit 2023.03.2, boost, cairo, …)
91
+ ├── docker-compose.yml # local-dev parity (no volume - checkpoints re-pull on each up)
92
+ ├── entrypoint.sh # snapshot_download then exec gunicorn --preload
93
  ├── README.md # HF Space landing card (YAML front-matter)
94
  ├── src/
95
  │ ├── backend/
 
100
  │ ├── frontend/ # Vue 3 SPA (built into dist/ at image time)
101
  │ └── research/ # research code, imported via sys.path tweaks
102
  └── scripts/
103
+ └── upload_checkpoints.py # one-shot publisher to Bani57/checkpoints
104
  ```
105
 
106
  ## Cross-references
docs/explanation/inference-lifecycle.md CHANGED
@@ -4,22 +4,26 @@ How models get from disk to RAM to a response. The site optimizes for slow cold
4
 
5
  ## Boot sequence
6
 
7
- `api/apps.ApiConfig.ready()` runs once when Django imports the app - before gunicorn accepts traffic. It calls `ModelRegistry.initialize()`, which executes four steps in order:
8
 
9
  ```mermaid
10
  flowchart TB
11
- Start([AppConfig.ready]) --> Skip{outer auto-reloader?}
12
- Skip -->|yes| Done([return])
 
 
13
  Skip -->|no| Init[ModelRegistry.initialize]
14
  Init --> DL[_download_checkpoints<br/>HF Hub snapshot_download]
15
  DL --> Scan[_scan_checkpoints<br/>list available files]
16
  Scan --> Loaders[_load_all_loaders<br/>3 lightweight Loaders]
17
  Loaders --> Sub[_generate_sample_subgraphs<br/>per-dataset DFS partitions]
18
- Sub --> Ready([gunicorn accepts traffic])
19
  ```
20
 
 
 
21
  ### 1. Download checkpoints (idempotent)
22
- `_download_checkpoints` checks every expected subdir under `CHECKPOINTS_ROOT`; if any is empty it calls `huggingface_hub.snapshot_download(repo_id="bani57/checkpoints")` to pull the missing files. The repo's directory layout mirrors the on-disk one, so files land in their final location and the scan code doesn't need a redistribution step.
23
 
24
  In production the `entrypoint.sh` script also calls `snapshot_download` *before* gunicorn starts. That makes Django's call a no-op on a normal cold start; it only does real work when run outside the container or when the entrypoint was bypassed.
25
 
@@ -52,6 +56,13 @@ First request for a `(dataset, algorithm)` or `(dataset, model_type)` combinatio
52
 
53
  The COINs registry also reuses Loaders across algorithms via `_coins_loaders[(dataset_id, seed, leiden_resolution)]`: all four `transe / distmult / complex / rotate` checkpoints on a dataset share the same seed and Leiden resolution, so they share one Loader and don't reload the graph four times.
54
 
 
 
 
 
 
 
 
55
  ## Concurrency and the inference lock
56
 
57
  `ModelRegistry._inference_lock` is a single `threading.Lock`. Every endpoint that runs PyTorch inference acquires it non-blocking; if it can't, it raises `InferenceBusy` (HTTP 429). The lock is released in a `finally` after the response is fully streamed:
@@ -75,6 +86,23 @@ The free-tier HF Space gives 16 GB RAM. Approximate usage:
75
 
76
  Worst case (everything ever requested in one Space lifetime): well under 16 GB. The current code does not evict - caches grow monotonically until the container restarts.
77
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
78
  ## Cross-references
79
 
80
  - [explanation/architecture.md](architecture.md) - where this lifecycle sits in the request flow.
 
4
 
5
  ## Boot sequence
6
 
7
+ `api/apps.ApiConfig.ready()` runs once when Django imports the app - typically inside the gunicorn master under `--preload`, before any worker forks. It calls `ModelRegistry.initialize()`, which executes four steps in order:
8
 
9
  ```mermaid
10
  flowchart TB
11
+ Start([AppConfig.ready]) --> Mgmt{argv in<br/>SKIP_REGISTRY_INIT?}
12
+ Mgmt -->|yes: collectstatic,<br/>migrate, check, …| Done([return])
13
+ Mgmt -->|no| Skip{outer auto-reloader?}
14
+ Skip -->|yes| Done
15
  Skip -->|no| Init[ModelRegistry.initialize]
16
  Init --> DL[_download_checkpoints<br/>HF Hub snapshot_download]
17
  DL --> Scan[_scan_checkpoints<br/>list available files]
18
  Scan --> Loaders[_load_all_loaders<br/>3 lightweight Loaders]
19
  Loaders --> Sub[_generate_sample_subgraphs<br/>per-dataset DFS partitions]
20
+ Sub --> Ready([gunicorn worker forks via copy-on-write])
21
  ```
22
 
23
+ The `SKIP_REGISTRY_INIT` set covers `collectstatic`, `migrate`, `makemigrations`, `check`, `shell`, `test`, etc. Without that guard, `python manage.py collectstatic --noinput` at image build time would trigger the full ~6 GB checkpoint download into a throwaway layer.
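
A guard of this kind can be sketched as a small check on `sys.argv`. The set contents below are only the commands named in the text (the real list is longer), and `should_skip_registry_init` is a hypothetical helper name.

```python
import sys

# Assumed contents: only the commands the text names; the real set ends in "etc."
SKIP_REGISTRY_INIT = {"collectstatic", "migrate", "makemigrations", "check", "shell", "test"}


def should_skip_registry_init(argv=None) -> bool:
    """Return True when the process is a management command that needs no models."""
    argv = sys.argv if argv is None else argv
    # argv looks like ["manage.py", "collectstatic", "--noinput"]. Under gunicorn
    # the second element is not a management command, so init proceeds.
    return len(argv) > 1 and argv[1] in SKIP_REGISTRY_INIT
```

`AppConfig.ready()` would return early when this is true, so `collectstatic` at image build time never touches the Hub.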
24
+
25
  ### 1. Download checkpoints (idempotent)
26
+ `_download_checkpoints` checks every expected subdir under `CHECKPOINTS_ROOT`; if any is empty it calls `huggingface_hub.snapshot_download(repo_id="Bani57/checkpoints")` to pull the missing files. The repo's directory layout mirrors the on-disk one, so files land in their final location and the scan code doesn't need a redistribution step.
27
 
28
  In production the `entrypoint.sh` script also calls `snapshot_download` *before* gunicorn starts. That makes Django's call a no-op on a normal cold start; it only does real work when run outside the container or when the entrypoint was bypassed.
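The idempotent check can be sketched roughly as follows. The helper names (`missing_subdirs`, `ensure_checkpoints`, `hub_download`) and the injectable `download` callable are assumptions made for illustration, not the registry's actual code; only `huggingface_hub.snapshot_download` and the repo id come from the text above.

```python
from pathlib import Path


def missing_subdirs(checkpoints_root, expected_subdirs):
    """Return the expected subdirectories that are absent or empty."""
    root = Path(checkpoints_root)
    missing = []
    for rel in expected_subdirs:
        path = root / rel
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


def ensure_checkpoints(checkpoints_root, expected_subdirs, download):
    """Call `download` only when something is missing; return True if it ran."""
    if not missing_subdirs(checkpoints_root, expected_subdirs):
        return False  # already populated, e.g. by entrypoint.sh
    download(checkpoints_root)
    return True


def hub_download(checkpoints_root):
    """Pull the snapshot into place (requires the huggingface_hub package)."""
    # Imported lazily so the helpers above work without huggingface_hub.
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id="Bani57/checkpoints", local_dir=str(checkpoints_root))
```

Calling `ensure_checkpoints(root, subdirs, hub_download)` twice in a row performs at most one download, which is what makes Django's call a no-op after the entrypoint has run.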
29
 
 
56
 
57
  The COINs registry also reuses Loaders across algorithms via `_coins_loaders[(dataset_id, seed, leiden_resolution)]`: all four `transe / distmult / complex / rotate` checkpoints on a dataset share the same seed and Leiden resolution, so they share one Loader and don't reload the graph four times.
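A small sketch of that tuple-keyed reuse, assuming a hypothetical `build_loader` factory in place of the real Loader construction:

```python
class LoaderCache:
    """Memoise expensive Loader construction by (dataset, seed, resolution)."""

    def __init__(self, build_loader):
        self._build_loader = build_loader  # hypothetical factory callable
        self._coins_loaders = {}

    def get(self, dataset_id, seed, leiden_resolution):
        key = (dataset_id, seed, leiden_resolution)
        # Build once per key; later algorithms on the same dataset hit the cache.
        if key not in self._coins_loaders:
            self._coins_loaders[key] = self._build_loader(*key)
        return self._coins_loaders[key]
```

With this shape, requesting the Loader for `transe`, `distmult`, `complex` and `rotate` on one dataset invokes the factory a single time.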
58
 
59
+ ### Monkey-patches around `experiment.prepare()`
60
+
61
+ Two patches wrap each call to `experiment.prepare()` in `_load_coins_experiment`, restored in a `finally`:
62
+
63
+ - **`Module.share_memory` → no-op.** The research code's `prepare()` calls `embedder.share_memory()` to share weights across multi-process training workers. Inference is single-process; the call is gratuitous, and on Linux containers with a small `/dev/shm` (Docker default 64 MB, free HF Spaces tmpfs similar) it raises a `Bus error` mid-prepare. The no-op makes `prepare()` return cleanly.
64
+ - **`torch.load` → TransE-init dim expansion.** `prepare()` loads `transe_model.tar` to seed the embedder's `entity_embeddings_initial` buffers. The KBGAT embedder's `__init__` then assigns `weight.data = init`, which silently re-shapes the YAML-declared embedding layer to the init's shape. For wordnet KBGAT this is fatal: the trained checkpoint was 200d but the wordnet TransE init is 100d, so the embedder ends up at 100d and the trained `load_state_dict` blows up on the dim mismatch. The patch detects TransE state dicts being loaded and repeats them along the embedding axis (e.g. 100d → 200d via `cat([init, init])`) when the YAML's `embedding_dim` is an integer multiple of the init's dim. This is the same trick `_adapt_kbgat_state_dict` already uses for the GATConv multi-head expansion.
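The patch-and-restore pattern and the repeat-along-the-embedding-axis idea can be illustrated without PyTorch. Everything below is a stand-in: `patched`, `expand_rows` and `FakeModule` are hypothetical names, nested lists replace tensors, and list repetition plays the role of `cat([init, init])`.

```python
from contextlib import contextmanager


@contextmanager
def patched(obj, name, replacement):
    """Temporarily replace `obj.name`, restoring the original in a finally."""
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield
    finally:
        setattr(obj, name, original)


def expand_rows(rows, target_dim):
    """Repeat each embedding row until it has `target_dim` columns."""
    source_dim = len(rows[0])
    if target_dim % source_dim != 0:
        return rows  # not an integer multiple: leave untouched
    factor = target_dim // source_dim
    return [row * factor for row in rows]


class FakeModule:
    """Stand-in for torch.nn.Module, used only to keep the sketch torch-free."""

    def share_memory(self):
        raise RuntimeError("Bus error (simulated small /dev/shm)")


module = FakeModule()
with patched(FakeModule, "share_memory", lambda self: self):
    module.share_memory()  # no-op while the patch is active
```

Outside the `with` block the original method is back in place, even if the wrapped call raised.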
65
+
66
  ## Concurrency and the inference lock
67
 
68
  `ModelRegistry._inference_lock` is a single `threading.Lock`. Every endpoint that runs PyTorch inference acquires it non-blocking; if it can't, it raises `InferenceBusy` (HTTP 429). The lock is released in a `finally` after the response is fully streamed:
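As a hedged illustration of the non-blocking acquire and the release-after-streaming behaviour, here is a self-contained sketch. `stream_with_lock` is a hypothetical helper, `InferenceBusy` is redefined locally, and the module-level lock stands in for `ModelRegistry._inference_lock`; the project's own views may be structured differently.

```python
import threading


class InferenceBusy(Exception):
    """Stand-in for the API's busy error (mapped to HTTP 429)."""


_inference_lock = threading.Lock()


def stream_with_lock(produce_chunks):
    """Return a generator that yields chunks while holding the lock."""
    # Non-blocking: fail fast instead of queueing behind a running inference.
    if not _inference_lock.acquire(blocking=False):
        raise InferenceBusy("another inference is in progress")

    def generator():
        try:
            yield from produce_chunks()
        finally:
            # Runs once the stream is exhausted, closed early, or fails.
            _inference_lock.release()

    return generator()
```

One caveat of this shape: the lock is released only once the returned generator has been started, so a caller that never iterates it would leave the lock held.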
 
86
 
87
  Worst case (everything ever requested in one Space lifetime): well under 16 GB. The current code does not evict; caches grow monotonically until the container restarts.
88
 
89
+ ### Local-dev memory floor
90
+
91
+ Loading all three Loaders and computing graph metrics for NELL peaks at ~5–6 GB of transient RAM unless the bundled Loader caches (`results/<dataset>/*.npz` and `*.gz`, ~10 MB total in the image) are present; those caches let the boot read precomputed arrays instead of recomputing them. WSL2 should be configured for at least 12 GB to give Docker enough headroom; the recommended `.wslconfig` is:
92
+
93
+ ```ini
94
+ [wsl2]
95
+ memory=12GB
96
+ processors=4
97
+ swap=4GB
98
+ ```
99
+
100
+ `docker-compose.yml` also sets `shm_size: "2gb"` to avoid `Bus error` from PyTorch's shared-memory paths under Docker's 64 MB `/dev/shm` default.
101
+
102
+ ## MultiProx symmetry safeguard
103
+
104
+ `graphgen_inference._collapse_final` symmetrises the edge tensor before calling `model.sample_discrete_graph_given_z0`. The model has a strict `assert (pred_E == pred_E.T).all()`; the MultiProx Gibbs aggregation (mean / median over multiple chains) can introduce ULP-level asymmetry that survives into `pred_E` and trips the assert on some BLAS / vectorization stacks (notably the Linux `+cu118` torch wheel inside the deployment container, while the same code runs fine on the Windows wheel in dev). `E = (E + E.T) / 2` is a no-op on already-symmetric input and a one-line invariant fix when it isn't.
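Why the averaging step is sufficient can be seen in a torch-free sketch. Nested lists stand in for the edge tensor, and `symmetrise` / `is_symmetric` are illustrative names, not functions from `graphgen_inference`. Because floating-point addition is commutative, `(a + b) / 2` and `(b + a) / 2` are bit-identical, so the result passes an exact equality check.

```python
def symmetrise(matrix):
    """Return (E + E^T) / 2 for a square matrix given as nested lists."""
    n = len(matrix)
    return [[(matrix[i][j] + matrix[j][i]) / 2 for j in range(n)] for i in range(n)]


def is_symmetric(matrix):
    """Exact (bitwise) symmetry check, like the model's strict assert."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


# 0.1 + 0.2 differs from 0.3 in the last bit, mimicking ULP-level drift.
drifted = [[0.0, 0.1 + 0.2], [0.3, 0.0]]
fixed = symmetrise(drifted)
```

Here `is_symmetric(drifted)` is false while `is_symmetric(fixed)` is true, and an already-symmetric input comes back unchanged.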
105
+
106
  ## Cross-references
107
 
108
  - [explanation/architecture.md](architecture.md): where this lifecycle sits in the request flow.
docs/glossary.md CHANGED
@@ -64,10 +64,10 @@ Boot-time: pre-warm checkpoints from HF Hub, scan checkpoint dirs, load lightwei
64
  ## Deployment
65
 
66
  ### HF Space
67
- A Hugging Face Spaces application running this repo's `Dockerfile`. The deployed URL is `https://bani57-website.hf.space`. The Space repo is `bani57/website`.
68
 
69
  ### HF Hub model repo
70
- `bani57/checkpoints`: holds all PyTorch weights. Mirrors the on-disk layout under `CHECKPOINTS_ROOT` so `huggingface_hub.snapshot_download` populates files in their expected paths and the registry's scan logic finds them unchanged.
71
 
72
  ### Persistent storage (HF Spaces)
73
  A paid `/data` volume that survives Space restarts. Free Spaces have 50 GB ephemeral disk that resets on restart. Without persistent storage, every cold start re-downloads checkpoints from HF Hub.
 
64
  ## Deployment
65
 
66
  ### HF Space
67
+ A Hugging Face Spaces application running this repo's `Dockerfile`. The deployed URL is `https://bani57-website.hf.space`. The Space repo is `Bani57/website`.
68
 
69
  ### HF Hub model repo
70
+ `Bani57/checkpoints`: holds all PyTorch weights. Mirrors the on-disk layout under `CHECKPOINTS_ROOT` so `huggingface_hub.snapshot_download` populates files in their expected paths and the registry's scan logic finds them unchanged.
71
 
72
  ### Persistent storage (HF Spaces)
73
  A paid `/data` volume that survives Space restarts. Free Spaces have 50 GB ephemeral disk that resets on restart. Without persistent storage, every cold start re-downloads checkpoints from HF Hub.
docs/guides/deploy.md CHANGED
@@ -1,12 +1,12 @@
1
  # How to deploy to Hugging Face Spaces
2
 
3
- Push a new version of the site to the HF Space `bani57/website`. Everything is git-based; there is no build button. The Space rebuilds its Docker image from whatever is on the `main` branch of its repo. For the rationale and full design see [`plans/deploy_huggingface_spaces.md`](../../plans/deploy_huggingface_spaces.md).
4
 
5
  ## Prerequisites
6
 
7
- - The Space `bani57/website` exists with **SDK = Docker**, **Hardware = CPU basic (free)**.
8
  - Space secrets are set: `DJANGO_SECRET_KEY`, `DJANGO_DEBUG=False`, `DJANGO_ALLOWED_HOSTS=bani57-website.hf.space`.
9
- - The HF Hub model repo `bani57/checkpoints` exists and contains the current weights. See [Refreshing checkpoints](#refreshing-checkpoints) below.
10
  - Local working tree is on the deployment branch (typically `master`) and tests pass.
11
 
12
  ## 1. Local container smoke test
@@ -36,7 +36,7 @@ If anything fails, fix it locally; never push a broken image to the Space.
36
  Add the HF git remote (one-time):
37
 
38
  ```bash
39
- git remote add hf https://huggingface.co/spaces/bani57/website
40
  ```
41
 
42
  Then for each release:
@@ -75,7 +75,7 @@ Force-push triggers an image rebuild from the rolled-back commit. Checkpoints in
75
 
76
  ## Refreshing checkpoints
77
 
78
- Checkpoints live in `bani57/checkpoints` on HF Hub, separate from the code repo. To publish new weights:
79
 
80
  ```bash
81
  huggingface-cli login # one-time, paste a write token
@@ -92,7 +92,7 @@ In the Space Settings → Variables and secrets:
92
 
93
  - **Secrets** (encrypted, not exposed in logs):
94
  - `DJANGO_SECRET_KEY`: required. Generate with `python -c "import secrets; print(secrets.token_urlsafe(50))"`.
95
- - `HF_TOKEN`: only if `bani57/checkpoints` is private.
96
  - **Variables** (visible in logs and to viewers of the Space metadata):
97
  - `DJANGO_DEBUG=False`.
98
  - `DJANGO_ALLOWED_HOSTS=bani57-website.hf.space`.
@@ -103,7 +103,7 @@ Changing a variable or secret triggers an automatic Space restart.
103
 
104
  ## Optional: persistent storage
105
 
106
- Free Spaces have 50 GB ephemeral disk that resets on restart, so every cold start re-downloads ~5.4 GB of checkpoints. The first request after a restart waits for `snapshot_download` to finish. For ~$5/month, the Space's **Persistent Storage** tier puts `/data` on a permanent volume; mount the checkpoints there (set `CHECKPOINTS_ROOT=/data/checkpoints`) and cold starts become instant. Worth doing only when traffic justifies it.
107
 
108
  ## See also
109
 
 
1
  # How to deploy to Hugging Face Spaces
2
 
3
+ Push a new version of the site to the HF Space `Bani57/website`. Everything is git-based; there is no build button. The Space rebuilds its Docker image from whatever is on the `main` branch of its repo. For the rationale and full design see [`plans/deploy_huggingface_spaces.md`](../../plans/deploy_huggingface_spaces.md).
4
 
5
  ## Prerequisites
6
 
7
+ - The Space `Bani57/website` exists with **SDK = Docker**, **Hardware = CPU basic (free)**.
8
  - Space secrets are set: `DJANGO_SECRET_KEY`, `DJANGO_DEBUG=False`, `DJANGO_ALLOWED_HOSTS=bani57-website.hf.space`.
9
+ - The HF Hub model repo `Bani57/checkpoints` exists and contains the current weights. See [Refreshing checkpoints](#refreshing-checkpoints) below.
10
  - Local working tree is on the deployment branch (typically `master`) and tests pass.
11
 
12
  ## 1. Local container smoke test
 
36
  Add the HF git remote (one-time):
37
 
38
  ```bash
39
+ git remote add hf https://huggingface.co/spaces/Bani57/website
40
  ```
41
 
42
  Then for each release:
 
75
 
76
  ## Refreshing checkpoints
77
 
78
+ Checkpoints live in `Bani57/checkpoints` on HF Hub, separate from the code repo. To publish new weights:
79
 
80
  ```bash
81
  huggingface-cli login # one-time, paste a write token
 
92
 
93
  - **Secrets** (encrypted, not exposed in logs):
94
  - `DJANGO_SECRET_KEY`: required. Generate with `python -c "import secrets; print(secrets.token_urlsafe(50))"`.
95
+ - `HF_TOKEN`: strongly recommended. A read-scope token (huggingface.co/settings/tokens → New token → Read) lifts anonymous rate limits and roughly triples checkpoint download throughput on cold starts. Required only if the checkpoint repo is private.
96
  - **Variables** (visible in logs and to viewers of the Space metadata):
97
  - `DJANGO_DEBUG=False`.
98
  - `DJANGO_ALLOWED_HOSTS=bani57-website.hf.space`.
 
103
 
104
  ## Optional: persistent storage
105
 
106
+ Free Spaces have 50 GB ephemeral disk that resets on restart, so every cold start re-downloads ~5.4 GB of checkpoints. The first request after a restart waits for `snapshot_download` to finish. For ~$5/month, the Space's **Persistent Storage** tier puts `/data` on a permanent volume. To use it, set the variable `CHECKPOINTS_ROOT=/data/checkpoints` in the Space settings. The container's `entrypoint.sh` will write the snapshot there; subsequent restarts find it already populated. The default unifies `CHECKPOINTS_ROOT` with `RESEARCH_ROOT=/app/research`, so checkpoints land alongside the bundled research code on free tier: clean for one-shot deploys, costly to re-pull on every restart.
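The fallback from `CHECKPOINTS_ROOT` to `RESEARCH_ROOT` could look roughly like this in settings code. The function name `resolve_roots` is hypothetical, and treating an empty variable the same as an unset one is an assumption of this sketch rather than documented behaviour.

```python
import os


def resolve_roots(environ, default_research_root="/app/research"):
    """Return (research_root, checkpoints_root) from environment variables."""
    research_root = environ.get("RESEARCH_ROOT") or default_research_root
    # Assumption: unset or empty CHECKPOINTS_ROOT falls back to the research root.
    checkpoints_root = environ.get("CHECKPOINTS_ROOT") or research_root
    return research_root, checkpoints_root


if __name__ == "__main__":
    print(resolve_roots(os.environ))
```

On the free tier both values resolve to the same directory; with persistent storage only the second one moves to `/data/checkpoints`.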
107
 
108
  ## See also
109
 
docs/guides/local-development.md CHANGED
@@ -8,6 +8,14 @@ End-to-end development setup: backend Python env, frontend dev server, checkpoin
8
  - Node 20+ and npm.
9
  - ~6 GB of free disk for the checkpoints if you intend to run inference locally.
10
  - Optional: a Hugging Face account if you want to publish or update checkpoints.
 
 
 
 
 
 
 
 
11
 
12
  ## 1. Create the Python environment
13
 
@@ -23,12 +31,12 @@ This mirrors what the deployment image installs. The CUDA 11.8 wheels work on CP
23
 
24
  ## 2. Pull checkpoints
25
 
26
- If you have a Hugging Face account and the `bani57/checkpoints` repo is accessible to you:
27
 
28
  ```bash
29
  huggingface-cli login # paste a read token
30
  python -c "from huggingface_hub import snapshot_download; \
31
- snapshot_download(repo_id='bani57/checkpoints', \
32
  local_dir='src/research', local_dir_use_symlinks=False)"
33
  ```
34
 
 
8
  - Node 20+ and npm.
9
  - ~6 GB of free disk for the checkpoints if you intend to run inference locally.
10
  - Optional: a Hugging Face account if you want to publish or update checkpoints.
11
+ - For Docker (`docker compose up`): WSL2 with at least **12 GB** of memory. Bumping `/dev/shm` is handled by `docker-compose.yml` (`shm_size: "2gb"`). Edit `%UserProfile%\.wslconfig`:
12
+ ```ini
13
+ [wsl2]
14
+ memory=12GB
15
+ processors=4
16
+ swap=4GB
17
+ ```
18
+ then `wsl --shutdown` and restart Docker Desktop. The peak boot RAM is ~5–6 GB while the three COINs Loaders compute graph metrics.
19
 
20
  ## 1. Create the Python environment
21
 
 
31
 
32
  ## 2. Pull checkpoints
33
 
34
+ If you have a Hugging Face account and the `Bani57/checkpoints` repo is accessible to you:
35
 
36
  ```bash
37
  huggingface-cli login # paste a read token
38
  python -c "from huggingface_hub import snapshot_download; \
39
+ snapshot_download(repo_id='Bani57/checkpoints', \
40
  local_dir='src/research', local_dir_use_symlinks=False)"
41
  ```
42
 
docs/reference/backend-services.md CHANGED
@@ -13,7 +13,10 @@ Module-by-module reference for `src/backend/api/`. The Django app is named `api`
13
  ## `api/` - Django app
14
 
15
  ### `apps.py`
16
- `ApiConfig.ready()` runs once at boot. Skips initialization in the outer `runserver` reloader process (avoids double-loading models in dev) and calls `ModelRegistry.initialize()` in the inner one.
 
 
 
17
 
18
  ### `urls.py`
19
  Maps every endpoint listed in [reference/api.md](api.md) to the matching view class.
@@ -103,6 +106,11 @@ Checkpoint loading helpers live in the same module:
103
  - `_adapt_shape_mismatches`, `_adapt_mlp_bn_keys`, `_adapt_kbgat_state_dict`: torch-geometric 2.0.x → 2.3.x weight-format compatibility shims.
104
  - `_free_heavy_arrays`: discards memory-intensive Loader fields after init.
105
 
 
 
 
 
 
106
  ### `coins_inference.py`
107
  `coins_predict_inner(experiment, dataset_id, algorithm, query_structure_id, anchors, variables, relations_map, top_k)`: runs a single COINs prediction. Validates the query, builds the embedding query, scores candidate tails, returns the top-k with cleaned names and the community-rank info.
108
 
@@ -113,6 +121,7 @@ The MultiProxAn / DiGress sampling loop.
113
  - `run_multiprox_init(model, num_nodes, n, m, t, t_prime, gibbs_chain_freq, dataset_id)`: initial denoise to step `t_prime`. Returns the partial state for a `/continue` follow-up.
114
  - `run_multiprox_step(model, state, dataset_id)`: one Gibbs round.
115
  - `encode_state_blob` / `decode_state_blob`: base64 round-trip for the [continuation token](../glossary.md#continuation-token--state-blob).
 
116
 
117
  ### `kg_anomaly_inference.py`
118
  The KG-subgraph correction loop. Mirrors `graphgen_inference.py` but operates on knowledge-graph subgraphs and computes the KG log-likelihood metric per frame using the frozen COINs link ranker.
 
13
  ## `api/` - Django app
14
 
15
  ### `apps.py`
16
+ `ApiConfig.ready()` runs once at boot. Two skip-checks before calling `ModelRegistry.initialize()`:
17
+
18
+ - `sys.argv[1]` against `_SKIP_REGISTRY_INIT` (`collectstatic`, `migrate`, `makemigrations`, `check`, `shell`, `showmigrations`, `diffsettings`, `test`, `compilemessages`, `makemessages`). Stops `python manage.py collectstatic --noinput` from triggering a multi-GB checkpoint download into a throwaway image layer.
19
+ - The outer `runserver` reloader process (`RUN_MAIN != "true"`). Stops dev mode from doing the heavy boot twice.
20
 
21
  ### `urls.py`
22
  Maps every endpoint listed in [reference/api.md](api.md) to the matching view class.
 
106
  - `_adapt_shape_mismatches`, `_adapt_mlp_bn_keys`, `_adapt_kbgat_state_dict`: torch-geometric 2.0.x → 2.3.x weight-format compatibility shims.
107
  - `_free_heavy_arrays`: discards memory-intensive Loader fields after init.
108
 
109
+ `_load_coins_experiment` wraps each `experiment.prepare()` call in two monkey-patches (restored in a `finally`); see [explanation/inference-lifecycle.md](../explanation/inference-lifecycle.md#monkey-patches-around-experimentprepare) for the rationale:
110
+
111
+ - `Module.share_memory` → no-op (avoids `Bus error` from PyTorch shared-memory paths under tight `/dev/shm`).
112
+ - `torch.load` → TransE-init dim expansion (repeats `transe_model.tar` weights along the embedding axis when YAML's `embedding_dim` is an integer multiple of the init's dim, so KBGAT's `weight.data = init` doesn't clobber the model's declared dim).
113
+
114
  ### `coins_inference.py`
115
  `coins_predict_inner(experiment, dataset_id, algorithm, query_structure_id, anchors, variables, relations_map, top_k)`: runs a single COINs prediction. Validates the query, builds the embedding query, scores candidate tails, returns the top-k with cleaned names and the community-rank info.
116
 
 
121
  - `run_multiprox_init(model, num_nodes, n, m, t, t_prime, gibbs_chain_freq, dataset_id)`: initial denoise to step `t_prime`. Returns the partial state for a `/continue` follow-up.
122
  - `run_multiprox_step(model, state, dataset_id)`: one Gibbs round.
123
  - `encode_state_blob` / `decode_state_blob`: base64 round-trip for the [continuation token](../glossary.md#continuation-token--state-blob).
124
+ - `_collapse_final` symmetrises `E` (`E = (E + E.T) / 2`) before calling `model.sample_discrete_graph_given_z0`. The model has a strict symmetry assert that's tripped by ULP-level drift from the MultiProx aggregation on some BLAS stacks. See the [MultiProx symmetry safeguard](../explanation/inference-lifecycle.md#multiprox-symmetry-safeguard) note.
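For readers unfamiliar with the continuation-token idea, a base64 round-trip can be sketched as below. The payload format used by `encode_state_blob` / `decode_state_blob` is not shown in this reference, so JSON is purely an assumption here, and the function names carry a `_sketch` suffix to mark them as illustrative.

```python
import base64
import json


def encode_blob_sketch(state):
    """Serialise a JSON-compatible dict into a URL-safe base64 string."""
    raw = json.dumps(state, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_blob_sketch(blob):
    """Inverse of encode_blob_sketch."""
    raw = base64.urlsafe_b64decode(blob.encode("ascii"))
    return json.loads(raw.decode("utf-8"))
```

The client treats the string as opaque and sends it back unchanged on the `/continue` request.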
125
 
126
  ### `kg_anomaly_inference.py`
127
  The KG-subgraph correction loop. Mirrors `graphgen_inference.py` but operates on knowledge-graph subgraphs and computes the KG log-likelihood metric per frame using the frozen COINs link ranker.
src/backend/README.md CHANGED
@@ -28,7 +28,7 @@ This README covers the practical surface: running the backend, where things live
28
  ```
29
 
30
  3. **Model checkpoints**: downloaded automatically from the Hugging Face Hub model repo
31
- `bani57/checkpoints` on first boot. The remote layout mirrors the on-disk one, so
32
  `huggingface_hub.snapshot_download(local_dir=CHECKPOINTS_ROOT)` drops files directly
33
  into the expected paths:
34
  - `src/research/COINs-KGGeneration/graph_completion/checkpoints/` (COINs: `{dataset}_{algorithm}.tar`)
@@ -67,10 +67,11 @@ The API is served at `http://localhost:8000/api/v1/`.
67
  | `DJANGO_ALLOWED_HOSTS` | `localhost,127.0.0.1` | Comma-separated allowed hosts. |
68
  | `CORS_ALLOWED_ORIGINS` | `https://bani57-website.hf.space` | Comma-separated allowed CORS origins. |
69
  | `TORCH_DEVICE` | Auto (`cuda:0` if available, else `cpu`) | PyTorch device for model inference. |
70
- | `RESEARCH_ROOT` | `<repo>/src/research` | Where the research-code modules live. |
71
- | `CHECKPOINTS_ROOT` | Same as `RESEARCH_ROOT` | Where `huggingface_hub` deposits weights. In the container this is `/app/checkpoints` on a writable volume. |
72
- | `HF_CHECKPOINTS_REPO` | `bani57/checkpoints` | HF Hub model repo holding all weights. |
73
- | `HF_TOKEN` | unset | Only needed if the checkpoint repo is private. |
 
74
  | `SPA_DIST_DIR` | `<backend>/dist` | Folder containing `index.html` from `npm run build`. WhiteNoise serves assets from here. |
75
 
76
  ## Startup Sequence
@@ -79,7 +80,7 @@ In the deployment container the entrypoint script pre-warms the checkpoint downl
79
  from the Hugging Face Hub *before* gunicorn starts, so workers never block on the
80
  network. Then on Django boot (`ApiConfig.ready()`), the `ModelRegistry` initializes:
81
 
82
- 1. **Verify / download checkpoints** from `bani57/checkpoints` on HF Hub if any expected
83
  subdir is missing. Idempotent: a no-op when the entrypoint already populated the tree
84
  or when running locally with weights on disk.
85
  2. **Scan checkpoint directories** to detect available models per method
@@ -91,7 +92,7 @@ All model weights (COINs inference, graph generation, KG anomaly) are loaded laz
91
  ## Deployment
92
 
93
  The site is packaged as a single Docker image and deployed to a Hugging Face Space
94
- (`bani57/website` -> <https://bani57-website.hf.space>). The image:
95
 
96
  - builds the Vue SPA with `npm run build` in a Node 20 stage,
97
  - assembles a `mambaorg/micromamba` runtime mirroring the local `website_c` env from
@@ -99,7 +100,7 @@ The site is packaged as a single Docker image and deployed to a Hugging Face Spa
99
  - copies the SPA `dist/` next to Django so WhiteNoise serves it on the same origin as
100
  `/api/v1/`,
101
  - runs `entrypoint.sh`, which `snapshot_download`s checkpoints from
102
- `bani57/checkpoints` on HF Hub into `/app/checkpoints` and execs `gunicorn` on `0.0.0.0:7860`.
103
 
104
  Local reproduction:
105
  ```bash
@@ -109,7 +110,7 @@ docker compose up --build
109
 
110
  Push to the Space (one-time remote setup):
111
  ```bash
112
- git remote add hf https://huggingface.co/spaces/bani57/website
113
  git push hf master:main
114
  ```
115
 
 
28
  ```
29
 
30
  3. **Model checkpoints**: downloaded automatically from the Hugging Face Hub model repo
31
+ `Bani57/checkpoints` on first boot. The remote layout mirrors the on-disk one, so
32
  `huggingface_hub.snapshot_download(local_dir=CHECKPOINTS_ROOT)` drops files directly
33
  into the expected paths:
34
  - `src/research/COINs-KGGeneration/graph_completion/checkpoints/` (COINs: `{dataset}_{algorithm}.tar`)
 
67
  | `DJANGO_ALLOWED_HOSTS` | `localhost,127.0.0.1` | Comma-separated allowed hosts. |
68
  | `CORS_ALLOWED_ORIGINS` | `https://bani57-website.hf.space` | Comma-separated allowed CORS origins. |
69
  | `TORCH_DEVICE` | Auto (`cuda:0` if available, else `cpu`) | PyTorch device for model inference. |
70
+ | `RESEARCH_ROOT` | `<repo>/src/research` (dev), `/app/research` (image) | Where the research-code modules live. |
71
+ | `CHECKPOINTS_ROOT` | Same as `RESEARCH_ROOT` | Where `huggingface_hub` deposits weights. Override to e.g. `/data/checkpoints` on a paid HF Space with persistent storage. |
72
+ | `HF_CHECKPOINTS_REPO` | `Bani57/checkpoints` | HF Hub model repo holding all weights. |
73
+ | `HF_TOKEN` | unset | Recommended. Read-scope token lifts anonymous rate limits and roughly triples cold-start download throughput. Required if the repo is private. Empty values are unset by `entrypoint.sh` to avoid a malformed `Bearer ` header. |
74
+ | `HF_HUB_ENABLE_HF_TRANSFER` | `1` (image), unset (dev) | Enables the Rust-accelerated `hf_transfer` backend for `snapshot_download`. |
75
  | `SPA_DIST_DIR` | `<backend>/dist` | Folder containing `index.html` from `npm run build`. WhiteNoise serves assets from here. |
76
 
77
  ## Startup Sequence
 
80
  from the Hugging Face Hub *before* gunicorn starts, so workers never block on the
81
  network. Then on Django boot (`ApiConfig.ready()`), the `ModelRegistry` initializes:
82
 
83
+ 1. **Verify / download checkpoints** from `Bani57/checkpoints` on HF Hub if any expected
84
  subdir is missing. Idempotent: a no-op when the entrypoint already populated the tree
85
  or when running locally with weights on disk.
86
  2. **Scan checkpoint directories** to detect available models per method
 
92
  ## Deployment
93
 
94
  The site is packaged as a single Docker image and deployed to a Hugging Face Space
95
+ (`Bani57/website` -> <https://bani57-website.hf.space>). The image:
96
 
97
  - builds the Vue SPA with `npm run build` in a Node 20 stage,
98
  - assembles a `mambaorg/micromamba` runtime mirroring the local `website_c` env from
 
100
  - copies the SPA `dist/` next to Django so WhiteNoise serves it on the same origin as
101
  `/api/v1/`,
102
  - runs `entrypoint.sh`, which `snapshot_download`s checkpoints from
103
+ `Bani57/checkpoints` on HF Hub into `/app/checkpoints` and execs `gunicorn` on `0.0.0.0:7860`.
104
 
105
  Local reproduction:
106
  ```bash
 
110
 
111
  Push to the Space (one-time remote setup):
112
  ```bash
113
+ git remote add hf https://huggingface.co/spaces/Bani57/website
114
  git push hf master:main
115
  ```
116