# Inference lifecycle

How models get from disk to RAM to a response. The site accepts slow cold starts in exchange for fast warm requests: heavy work happens once at boot, individual model weights load on first inference, and a single [inference lock](../glossary.md#inference-lock) keeps expensive jobs from running concurrently on the 2 vCPU box.

## Boot sequence

`api/apps.ApiConfig.ready()` runs once when Django imports the app — typically inside the gunicorn master under `--preload`, before any worker forks. It calls `ModelRegistry.initialize()`, which executes four steps in order:

```mermaid
flowchart TB
    Start([AppConfig.ready]) --> Mgmt{argv in<br/>SKIP_REGISTRY_INIT?}
    Mgmt -->|yes: collectstatic,<br/>migrate, check, …| Done([return])
    Mgmt -->|no| Skip{outer auto-reloader?}
    Skip -->|yes| Done
    Skip -->|no| Init[ModelRegistry.initialize]
    Init --> DL[_download_checkpoints<br/>HF Hub snapshot_download]
    DL --> Scan[_scan_checkpoints<br/>list available files]
    Scan --> Loaders[_load_all_loaders<br/>3 lightweight Loaders]
    Loaders --> Sub[_generate_sample_subgraphs<br/>per-dataset DFS partitions]
    Sub --> Ready([gunicorn worker forks via copy-on-write])
```

The `SKIP_REGISTRY_INIT` set covers `collectstatic`, `migrate`, `makemigrations`, `check`, `shell`, `test`, etc. Without that guard, `python manage.py collectstatic --noinput` at image build time would trigger the full ~6 GB checkpoint download into a throwaway layer.
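
The two guards in the flowchart could look roughly like the sketch below. The helper name `should_skip_registry_init` and the exact contents of the set are assumptions for illustration, not the project's code. The reloader check relies on Django's `runserver` auto-reloader setting `RUN_MAIN=true` only in the inner process that actually serves requests.

```python
# Assumed contents; the real set is defined by the project and may differ.
SKIP_REGISTRY_INIT = {
    "collectstatic", "migrate", "makemigrations", "check", "shell", "test",
}


def should_skip_registry_init(argv, environ):
    """Return True when ModelRegistry.initialize() should not run (hypothetical helper)."""
    # Management commands that never serve inference requests.
    if len(argv) > 1 and argv[1] in SKIP_REGISTRY_INIT:
        return True
    # With the auto-reloader, `runserver` starts an outer watcher process and
    # an inner serving process; only the inner one has RUN_MAIN=true.
    if "runserver" in argv[1:] and "--noreload" not in argv[1:]:
        return environ.get("RUN_MAIN") != "true"
    return False
```

In `ready()` this would be called with `sys.argv` and `os.environ`; passing them as arguments keeps the check easy to test.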

### 1. Download checkpoints (idempotent)
`_download_checkpoints` checks every expected subdir under `CHECKPOINTS_ROOT`; if any is empty it calls `huggingface_hub.snapshot_download(repo_id="Bani57/checkpoints")` to pull the missing files. The repo's directory layout mirrors the on-disk one, so files land in their final location and the scan code doesn't need a redistribution step.

In production the `entrypoint.sh` script also calls `snapshot_download` *before* gunicorn starts. That makes Django's call a no-op on a normal cold start; it only does real work when run outside the container or when the entrypoint was bypassed.
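
A minimal sketch of the idempotent check, assuming the expected subdirectories match the checkpoint directories scanned in the next step and that the download targets `CHECKPOINTS_ROOT` through `local_dir`. `EXPECTED_SUBDIRS`, `missing_subdirs` and `download_checkpoints` are hypothetical names; only `snapshot_download` and its `repo_id` / `local_dir` parameters come from `huggingface_hub`.

```python
from pathlib import Path

# Hypothetical list; the real one is defined by the project.
EXPECTED_SUBDIRS = (
    "COINs-KGGeneration/graph_completion/checkpoints",
    "MultiProxAn/checkpoints",
    "COINs-KGGeneration/graph_generation/checkpoints",
)


def missing_subdirs(root, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectories that are absent or empty."""
    root = Path(root)
    return [
        sub for sub in expected
        if not (root / sub).is_dir() or not any((root / sub).iterdir())
    ]


def download_checkpoints(root, download=None):
    """Fetch checkpoints only if something is missing; return True if a download ran."""
    if not missing_subdirs(root):
        return False  # warm path: every subdirectory already has files
    if download is None:
        # Imported lazily so the warm path needs neither the package nor network.
        from huggingface_hub import snapshot_download as download
    # The repo mirrors the on-disk layout, so files land in their final place.
    download(repo_id="Bani57/checkpoints", local_dir=str(root))
    return True
```

The injectable `download` argument is only there so the check can be exercised without network access.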

### 2. Scan checkpoint directories
Three globs:

- `COINs-KGGeneration/graph_completion/checkpoints/*.tar` → COINs algorithms per dataset.
- `MultiProxAn/checkpoints/*.ckpt` → discrete or continuous (`_c` suffix) per dataset.
- `COINs-KGGeneration/graph_generation/checkpoints/*.ckpt` → KG anomaly `generate` or `correct` (`_correct` suffix) per dataset.

The result populates `coins_checkpoints_available`, `graphgen_checkpoints_available`, and `kg_anomaly_checkpoints_available` on the registry. Discovery endpoints (`/coins/datasets`, `/graph-generation/datasets`, `/kg-anomaly/datasets`) read these dictionaries.
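
As an illustration of the scan, the sketch below handles only the MultiProxAn glob. It assumes checkpoint stems look like `<dataset>` or `<dataset>_c`; the real naming scheme may carry more fields, and both function names are hypothetical.

```python
from pathlib import Path


def classify_graphgen(stem):
    """Split a checkpoint stem into (dataset, model_type) using the `_c` suffix."""
    if stem.endswith("_c"):
        return stem[:-2], "continuous"
    return stem, "discrete"


def scan_graphgen_checkpoints(root):
    """Map dataset -> {model_type: path} for each *.ckpt under MultiProxAn/checkpoints."""
    available = {}
    for path in sorted(Path(root, "MultiProxAn", "checkpoints").glob("*.ckpt")):
        dataset, model_type = classify_graphgen(path.stem)
        available.setdefault(dataset, {})[model_type] = path
    return available
```

A discovery endpoint can then answer from the resulting dictionary without touching the checkpoint files themselves.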

### 3. Load lightweight Loaders
One `Loader` per [COINs](../glossary.md#coins) dataset (freebase, wordnet, nell). Each Loader wraps the dataset, name maps (entity-id → string, relation-id → string), and graph indices.

After construction, the registry frees the heavy arrays (`node_neighbours`, `community_neighbours`, adjacency dicts, machines vector) — they can take ~275 MB each per dataset and aren't needed for the discovery endpoints. They get rebuilt on demand when a full COINs experiment is loaded for inference.
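
Freeing the arrays amounts to dropping the references and letting the garbage collector reclaim the memory. The sketch below is a stand-in for the registry's own routine: it covers only the two attribute names given above, because the attribute names for the adjacency dicts and the machines vector are not stated here.

```python
import gc

# Only the attribute names mentioned in the text; others are omitted on purpose.
HEAVY_ATTRS = ("node_neighbours", "community_neighbours")


def free_heavy_arrays(loader, attrs=HEAVY_ATTRS):
    """Set large attributes to None and return the names that were freed."""
    freed = []
    for name in attrs:
        if getattr(loader, name, None) is not None:
            setattr(loader, name, None)
            freed.append(name)
    gc.collect()  # also reclaims anything kept alive only by reference cycles
    return freed
```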

### 4. Pre-compute sample subgraphs
For each COINs dataset, `_build_sample_subgraphs` walks the DFS context-subgraph partitioner and stores the first ~40 well-formed subgraphs (alternating bipartite and unipartite). The KG-anomaly demo serves these from `/kg-anomaly/datasets/{id}/sample-subgraphs` so visitors can click an example instead of constructing a graph by hand.

## Lazy weight loading

The Loaders, samples and metadata are loaded eagerly. The actual inference weights are not. Each inference method has a per-instance cache:

- `_coins_experiments[(dataset_id, algorithm)]` — full COINs `Experiment` (embedder + link ranker), built lazily by `_load_coins_experiment`.
- `_graphgen_models[(dataset_id, model_type)]` — DiGress / lifted diffusion model, built lazily by `_load_graphgen_model`.
- `_kg_anomaly_models[(dataset_id, task)]` — KG anomaly model, built lazily by `_load_kg_anomaly_model`. Reuses the matching COINs experiment for its `kg_experiment` field.

The first request for a given cache key, whether `(dataset, algorithm)`, `(dataset, model_type)` or `(dataset, task)`, loads the checkpoint into RAM and keeps it there. Subsequent requests skip the load.

The COINs registry also reuses Loaders across algorithms via `_coins_loaders[(dataset_id, seed, leiden_resolution)]`: all four `transe / distmult / complex / rotate` checkpoints on a dataset share the same seed and Leiden resolution, so they share one Loader and don't reload the graph four times.
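
The caching pattern described above can be sketched with a small keyed cache. `LazyCache`, the builder functions, and the seed and resolution values are illustrative assumptions; dictionaries stand in for the real `Loader` and `Experiment` objects.

```python
class LazyCache:
    """Build a value on first access for a key, then keep it."""

    def __init__(self, build):
        self._build = build  # callable(*key) -> loaded object
        self._items = {}     # key tuple -> loaded object

    def get(self, *key):
        if key not in self._items:
            self._items[key] = self._build(*key)  # cold path, runs once per key
        return self._items[key]


loader_builds = []  # records each expensive Loader construction


def build_loader(dataset_id, seed, leiden_resolution):
    loader_builds.append((dataset_id, seed, leiden_resolution))
    return {"dataset": dataset_id}  # stand-in for a real Loader


loaders = LazyCache(build_loader)


def build_experiment(dataset_id, algorithm):
    # Hypothetical seed and resolution; in the project they would come from
    # each checkpoint's configuration.
    return {"algorithm": algorithm, "loader": loaders.get(dataset_id, 0, 1.0)}


experiments = LazyCache(build_experiment)

for algorithm in ("transe", "distmult", "complex", "rotate"):
    experiments.get("wordnet", algorithm)
```

After the loop, four experiments are cached but `build_loader` has run only once, because all four share the same `(dataset_id, seed, leiden_resolution)` key. The sketch is not thread-safe as written.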

### Monkey-patches around `experiment.prepare()`

Two patches wrap each call to `experiment.prepare()` in `_load_coins_experiment`, restored in a `finally`:

- **`Module.share_memory` → no-op.** The research code's `prepare()` calls `embedder.share_memory()` to share weights across multi-process training workers. Inference is single-process; the call is gratuitous, and on Linux containers with a small `/dev/shm` (Docker default 64 MB, free HF Spaces tmpfs similar) it raises a `Bus error` mid-prepare. The no-op makes `prepare()` return cleanly.
- **`torch.load` → TransE-init dim expansion.** `prepare()` loads `transe_model.tar` to seed the embedder's `entity_embeddings_initial` buffers. The KBGAT embedder's `__init__` then assigns `weight.data = init`, which silently reshapes the YAML-declared embedding layer to the init's shape. For wordnet KBGAT this is fatal: the trained checkpoint is 200d but the wordnet TransE init is 100d, so the embedder ends up at 100d and the later `load_state_dict` of the trained weights fails on the dimension mismatch. The patch detects TransE state dicts being loaded and, when the YAML's `embedding_dim` is an integer multiple of the init's dim, repeats them along the embedding axis (e.g. 100d → 200d via `cat([init, init])`). `_adapt_kbgat_state_dict` already uses the same trick for the GATConv multi-head expansion.
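
Both patches follow the same temporary-replacement pattern, sketched below without a torch dependency. `patched_attr` and `expand_init` are hypothetical helpers, the `Module` class and `prepare` function are stand-ins for `torch.nn.Module` and `experiment.prepare()`, and plain lists stand in for tensors.

```python
from contextlib import contextmanager


@contextmanager
def patched_attr(owner, name, replacement):
    """Temporarily replace `owner.name`, restoring the original in a `finally`."""
    original = getattr(owner, name)
    setattr(owner, name, replacement)
    try:
        yield
    finally:
        setattr(owner, name, original)


def expand_init(rows, target_dim):
    """Repeat each embedding row up to target_dim when it divides evenly."""
    init_dim = len(rows[0])
    if init_dim == target_dim or target_dim % init_dim != 0:
        return rows  # already the right size, or not an integer multiple
    factor = target_dim // init_dim
    # List repetition plays the role of concatenating along the embedding axis.
    return [row * factor for row in rows]


class Module:  # stand-in for torch.nn.Module
    def share_memory(self):
        raise RuntimeError("Bus error")  # simulates a too-small /dev/shm


def prepare(module):  # stand-in for experiment.prepare()
    module.share_memory()
    return "prepared"


with patched_attr(Module, "share_memory", lambda self: self):
    result = prepare(Module())
```

Inside the `with` block `prepare()` completes; once the block exits, `Module.share_memory` is the original method again, so code outside the loader sees no change.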

## Concurrency and the inference lock

`ModelRegistry._inference_lock` is a single `threading.Lock`. Every endpoint that runs PyTorch inference tries to acquire it without blocking; if the lock is already held, the endpoint raises `InferenceBusy` (HTTP 429). The lock is released in a `finally` after the response is fully streamed:

- COINs `/coins/predict` — synchronous; releases on return.
- `/graph-generation/generate`, `/continue` — generator functions release the lock in their `finally` block.
- `/kg-anomaly/correct`, `/continue` — same pattern.

The `_inference_lock_owner` string records who currently holds the lock. The health endpoint `/api/v1/health` reports it, so a visitor can see what's running.

If a client disconnects mid-stream and Python's generator cleanup doesn't fire (rare but possible behind some proxies), the lock can stick. `/api/v1/debug/force-unlock` releases it, but only with `DEBUG=True` — not exposed in production.
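
The acquire-or-fail pattern and the generator-side release can be sketched as follows. `InferenceGate` and `start_stream` are hypothetical names, and this `InferenceBusy` class is a local stand-in for the project's exception.

```python
import threading


class InferenceBusy(Exception):
    """Raised when another inference job holds the lock (mapped to HTTP 429)."""


class InferenceGate:
    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None  # what a health endpoint could report

    def acquire(self, owner):
        # Non-blocking: fail fast instead of queueing behind a long job.
        if not self._lock.acquire(blocking=False):
            raise InferenceBusy(f"busy: {self.owner}")
        self.owner = owner

    def release(self):
        self.owner = None
        self._lock.release()


def start_stream(gate, owner, steps):
    """Acquire eagerly so a busy server is reported before streaming starts."""
    gate.acquire(owner)

    def body():
        try:
            for step in steps:
                yield step
        finally:
            # Runs on exhaustion, on an exception, and on generator.close().
            gate.release()

    return body()
```

One general Python caveat applies to this shape: closing a generator that was never iterated does not execute its body, so its `finally` never runs. That is one way a lock acquired outside the generator can be left held.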

## Memory budget

The free-tier HF Space gives 16 GB RAM. Approximate usage:

- Three lightweight COINs Loaders after `_free_heavy_arrays`: ~150 MB total.
- Sample subgraph caches: <10 MB.
- One COINs experiment fully loaded (embedder + link ranker for one algorithm/dataset): a few hundred MB.
- One MultiProxAn / KG anomaly diffusion model: hundreds of MB to ~1 GB depending on dataset.

Worst case (everything ever requested in one Space lifetime): well under 16 GB. The current code does not evict — caches grow monotonically until the container restarts.

### Local-dev memory floor

Without the bundled Loader caches (`results/<dataset>/*.npz` and `*.gz`, ~10 MB total in the image), loading all three Loaders and computing graph metrics for NELL peaks at ~5–6 GB of transient RAM. With the caches present, boot reads the precomputed arrays instead of recomputing them. WSL2 should be configured with at least 12 GB so that Docker has enough headroom; the recommended `.wslconfig` is:

```ini
[wsl2]
memory=12GB
processors=4
swap=4GB
```

`docker-compose.yml` also sets `shm_size: "2gb"` to avoid `Bus error` from PyTorch's shared-memory paths under Docker's 64 MB `/dev/shm` default.

## MultiProx symmetry safeguard

`graphgen_inference._collapse_final` symmetrises the edge tensor before calling `model.sample_discrete_graph_given_z0`. The model has a strict `assert (pred_E == pred_E.T).all()`; the MultiProx Gibbs aggregation (mean / median over multiple chains) can introduce ULP-level asymmetry that survives into `pred_E` and trips the assert on some BLAS / vectorization stacks (notably the Linux `+cu118` torch wheel inside the deployment container, while the same code runs fine on the Windows wheel in dev). `E = (E + E.T) / 2` is a no-op on already-symmetric input and a one-line invariant fix when it isn't.
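
The effect of the symmetrisation can be seen on a tiny example. The sketch uses nested Python lists in place of the torch edge tensor, and `symmetrise` and `is_symmetric` are illustrative helpers rather than project functions.

```python
def symmetrise(E):
    """Return (E + E^T) / 2 for a square matrix given as nested lists."""
    n = len(E)
    return [[(E[i][j] + E[j][i]) / 2 for j in range(n)] for i in range(n)]


def is_symmetric(E):
    """Exact equality check, mirroring the model's strict assert."""
    n = len(E)
    return all(E[i][j] == E[j][i] for i in range(n) for j in range(n))


# 0.1 + 0.2 differs from 0.3 in the last bit, like a ULP-level aggregation error.
E = [[0.0, 0.1 + 0.2],
     [0.3, 0.0]]
fixed = symmetrise(E)
```

Here `is_symmetric(E)` is false while `is_symmetric(fixed)` is true, because floating-point addition is commutative and both off-diagonal entries are computed from the same pair of values.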

## Cross-references

- [explanation/architecture.md](architecture.md) — where this lifecycle sits in the request flow.
- [reference/backend-services.md](../reference/backend-services.md) — module-by-module reference for `ModelRegistry` and the inference helpers.
- [reference/sse-protocol.md](../reference/sse-protocol.md) — the streaming envelope each diffusion run produces.