Inference lifecycle
How models get from disk to RAM to a response. The site accepts slow cold starts in exchange for fast warm requests: heavy work happens once at boot, individual model weights load on first inference, and a single inference lock keeps expensive jobs from running concurrently on a 2 vCPU box.
Boot sequence
api/apps.ApiConfig.ready() runs once when Django imports the app, typically inside the gunicorn master under --preload, before any worker forks. It calls ModelRegistry.initialize(), which executes four steps in order:
flowchart TB
Start([AppConfig.ready]) --> Mgmt{argv in<br/>SKIP_REGISTRY_INIT?}
Mgmt -->|yes: collectstatic,<br/>migrate, check, …| Done([return])
Mgmt -->|no| Skip{outer auto-reloader?}
Skip -->|yes| Done
Skip -->|no| Init[ModelRegistry.initialize]
Init --> DL[_download_checkpoints<br/>HF Hub snapshot_download]
DL --> Scan[_scan_checkpoints<br/>list available files]
Scan --> Loaders[_load_all_loaders<br/>3 lightweight Loaders]
Loaders --> Sub[_generate_sample_subgraphs<br/>per-dataset DFS partitions]
Sub --> Ready([gunicorn worker forks via copy-on-write])
The SKIP_REGISTRY_INIT set covers collectstatic, migrate, makemigrations, check, shell, test, etc. Without that guard, python manage.py collectstatic --noinput at image build time would trigger the full ~6 GB checkpoint download into a throwaway layer.
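A minimal sketch of the guard described above, assuming the check is a plain membership test on the command-line arguments. The set contents are only the commands named in the text (the real set is longer), and should_skip_registry_init is a hypothetical helper name, not the project's actual function.

```python
# Commands named in the text; the real set contains more entries.
SKIP_REGISTRY_INIT = {
    "collectstatic", "migrate", "makemigrations", "check", "shell", "test",
}


def should_skip_registry_init(argv):
    """Return True when argv names a management command that must not load models."""
    # argv[0] is the program name (manage.py, gunicorn, ...), so skip it.
    return any(arg in SKIP_REGISTRY_INIT for arg in argv[1:])


build_time = should_skip_registry_init(["manage.py", "collectstatic", "--noinput"])
serving = should_skip_registry_init(["gunicorn", "--preload", "project.wsgi"])
```

With a check like this, ready() can return early before touching the registry, so build-time commands never start the checkpoint download.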
1. Download checkpoints (idempotent)
_download_checkpoints checks every expected subdir under CHECKPOINTS_ROOT; if any is empty it calls huggingface_hub.snapshot_download(repo_id="Bani57/checkpoints") to pull the missing files. The repo's directory layout mirrors the on-disk one, so files land in their final location and the scan code doesn't need a redistribution step.
In production the entrypoint.sh script also calls snapshot_download before gunicorn starts. That makes Django's call a no-op on a normal cold start; it only does real work when run outside the container or when the entrypoint was bypassed.
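The idempotent check can be sketched as follows. This is an illustration under assumptions, not the project's code: missing_subdirs, download_checkpoints and hub_download are hypothetical names, the downloader is passed in as a parameter so the logic can be exercised without network access, and the local_dir argument is an assumed way of making files land under the checkpoints root.

```python
from pathlib import Path


def missing_subdirs(root, expected):
    """Return the expected subdirectory names that are absent or empty under root."""
    missing = []
    for name in expected:
        path = Path(root) / name
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


def download_checkpoints(root, expected, download):
    """Call download(root) only when at least one expected subdir is empty."""
    if not missing_subdirs(root, expected):
        return False  # everything is already on disk: no-op
    download(root)
    return True


def hub_download(root):
    """Assumed real downloader; imported lazily so the sketch runs without the package."""
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id="Bani57/checkpoints", local_dir=str(root))
```

Calling download_checkpoints(CHECKPOINTS_ROOT, expected, hub_download) twice in a row would download at most once, which is why the second call made by Django after entrypoint.sh costs almost nothing.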
2. Scan checkpoint directories
Three globs:
- COINs-KGGeneration/graph_completion/checkpoints/*.tar: COINs algorithms per dataset.
- MultiProxAn/checkpoints/*.ckpt: discrete or continuous (_c suffix) per dataset.
- COINs-KGGeneration/graph_generation/checkpoints/*.ckpt: KG anomaly generate or correct (_correct suffix) per dataset.
The result populates coins_checkpoints_available, graphgen_checkpoints_available, and kg_anomaly_checkpoints_available on the registry. Discovery endpoints (/coins/datasets, /graph-generation/datasets, /kg-anomaly/datasets) read these dictionaries.
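As a rough sketch, the three patterns above can be scanned with pathlib. The PATTERNS keys and scan_checkpoints are hypothetical names, the patterns are assumed to be relative to the checkpoints root, and the flat file-name lists are a simplification: the registry stores per-dataset dictionaries whose exact shape is not shown here.

```python
from pathlib import Path

# Glob patterns from the list above, assumed relative to the checkpoints root.
PATTERNS = {
    "coins": "COINs-KGGeneration/graph_completion/checkpoints/*.tar",
    "graphgen": "MultiProxAn/checkpoints/*.ckpt",
    "kg_anomaly": "COINs-KGGeneration/graph_generation/checkpoints/*.ckpt",
}


def scan_checkpoints(root):
    """Return {method: sorted checkpoint file names} for each pattern."""
    root = Path(root)
    return {
        method: sorted(path.name for path in root.glob(pattern))
        for method, pattern in PATTERNS.items()
    }
```

A method whose directory is missing or empty simply maps to an empty list, so a discovery endpoint can report it as unavailable without special-casing.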
3. Load lightweight Loaders
One Loader per COINs dataset (freebase, wordnet, nell). Each Loader wraps the dataset, name maps (entity-id → string, relation-id → string), and graph indices.
After construction, the registry frees the heavy arrays (node_neighbours, community_neighbours, adjacency dicts, machines vector). They can take ~275 MB each per dataset and aren't needed for the discovery endpoints. They get rebuilt on demand when a full COINs experiment is loaded for inference.
4. Pre-compute sample subgraphs
For each COINs dataset, _build_sample_subgraphs walks the DFS context-subgraph partitioner and stores the first ~40 well-formed subgraphs (alternating bipartite and unipartite). The KG-anomaly demo serves these from /kg-anomaly/datasets/{id}/sample-subgraphs so visitors can click an example instead of constructing a graph by hand.
Lazy weight loading
The Loaders, samples and metadata are eager. The actual inference weights are not. Each method has a per-instance cache:
- _coins_experiments[(dataset_id, algorithm)]: full COINs Experiment (embedder + link ranker), built lazily by _load_coins_experiment.
- _graphgen_models[(dataset_id, model_type)]: DiGress / lifted diffusion model, built lazily by _load_graphgen_model.
- _kg_anomaly_models[(dataset_id, task)]: KG anomaly model, built lazily by _load_kg_anomaly_model. Reuses the matching COINs experiment for its kg_experiment field.
First request for a (dataset, algorithm) or (dataset, model_type) combination loads the checkpoint into RAM and then keeps it. Subsequent requests skip the load.
The COINs registry also reuses Loaders across algorithms via _coins_loaders[(dataset_id, seed, leiden_resolution)]: all four transe / distmult / complex / rotate checkpoints on a dataset share the same seed and Leiden resolution, so they share one Loader and don't reload the graph four times.
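The load-once behaviour of these caches can be sketched with a dictionary keyed by tuple. LazyCache and fake_load are hypothetical names used only for illustration; the real registry keeps separate dictionaries and loader methods as listed above.

```python
class LazyCache:
    """Load a value on first request for a key, then keep it for the process lifetime."""

    def __init__(self, loader):
        self._loader = loader
        self._items = {}

    def get(self, *key):
        # The positional arguments form the tuple key, e.g. (dataset_id, algorithm).
        if key not in self._items:
            self._items[key] = self._loader(*key)
        return self._items[key]


loads = []


def fake_load(dataset_id, algorithm):
    """Stand-in for an expensive checkpoint load; records each call."""
    loads.append((dataset_id, algorithm))
    return {"dataset": dataset_id, "algorithm": algorithm}


experiments = LazyCache(fake_load)
first = experiments.get("wordnet", "transe")
second = experiments.get("wordnet", "transe")  # served from the cache, no second load
```

Nothing in this sketch evicts entries, which matches the monotonic cache growth described under the memory budget.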
Monkey-patches around experiment.prepare()
Two patches wrap each call to experiment.prepare() in _load_coins_experiment, restored in a finally:
- Module.share_memory: no-op. The research code's prepare() calls embedder.share_memory() to share weights across multi-process training workers. Inference is single-process; the call is gratuitous, and on Linux containers with a small /dev/shm (Docker default 64 MB, free HF Spaces tmpfs similar) it raises a Bus error mid-prepare. The no-op makes prepare() return cleanly.
- torch.load: TransE-init dim expansion. prepare() loads transe_model.tar to seed the embedder's entity_embeddings_initial buffers. The KBGAT embedder's __init__ then assigns weight.data = init, which silently re-shapes the YAML-declared embedding layer to the init's shape. For wordnet KBGAT this is fatal: the trained checkpoint was 200d but the wordnet TransE init is 100d, so the embedder ends up at 100d and load_state_dict on the trained checkpoint fails on the dim mismatch. The patch detects TransE state dicts being loaded and repeats them along the embedding axis (e.g. 100d → 200d via cat([init, init])) when the YAML's embedding_dim is an integer multiple of the init's dim. This is the same trick _adapt_kbgat_state_dict already uses for the GATConv multi-head expansion.
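The patch-and-restore pattern and the dimension expansion can be sketched without PyTorch. Everything here is a stand-in: patched, expand_embedding and FakeModule are hypothetical names, plain lists replace tensors, and list repetition plays the role of cat([init, init]) along the embedding axis.

```python
from contextlib import contextmanager


@contextmanager
def patched(target, name, replacement):
    """Temporarily replace target.<name>, restoring the original in a finally."""
    original = getattr(target, name)
    setattr(target, name, replacement)
    try:
        yield
    finally:
        setattr(target, name, original)


def expand_embedding(rows, target_dim):
    """Repeat each row along the embedding axis when target_dim is an integer multiple."""
    init_dim = len(rows[0])
    if target_dim % init_dim != 0:
        return rows  # not an integer multiple: leave the init untouched
    return [row * (target_dim // init_dim) for row in rows]


class FakeModule:
    """Stand-in for a module whose share_memory fails on a small /dev/shm."""

    def share_memory(self):
        raise RuntimeError("stand-in for the Bus error")


module = FakeModule()
with patched(FakeModule, "share_memory", lambda self: self):
    result = module.share_memory()  # no-op while the patch is active

expanded = expand_embedding([[1.0, 2.0]], 4)
```

Restoring in a finally matters because prepare() can raise; without it a failed load would leave the no-op installed for the rest of the process.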
Concurrency and the inference lock
ModelRegistry._inference_lock is a single threading.Lock. Every endpoint that runs PyTorch inference acquires it non-blocking; if it can't, it raises InferenceBusy (HTTP 429). The lock is released in a finally after the response is fully streamed:
- COINs /coins/predict: synchronous; releases on return.
- /graph-generation/generate, /continue: generator functions release the lock in their finally block.
- /kg-anomaly/correct, /continue: same pattern.
The _inference_lock_owner string records who currently holds the lock. The health endpoint /api/v1/health reports it, so a visitor can see what's running.
If a client disconnects mid-stream and Python's generator cleanup doesn't fire (rare but possible behind some proxies), the lock can stick. /api/v1/debug/force-unlock releases it, but only with DEBUG=True, so it is not exposed in production.
Memory budget
The free-tier HF Space gives 16 GB RAM. Approximate usage:
- Three lightweight COINs Loaders after _free_heavy_arrays: ~150 MB total.
- Sample subgraph caches: <10 MB.
- One COINs experiment fully loaded (embedder + link ranker for one algorithm/dataset): a few hundred MB.
- One MultiProxAn / KG anomaly diffusion model: hundreds of MB to ~1 GB depending on dataset.
Worst case (everything ever requested in one Space lifetime): well under 16 GB. The current code does not evict: caches grow monotonically until the container restarts.
Local-dev memory floor
Loading all three Loaders + computing graph metrics for NELL peaks at ~5–6 GB transient RAM unless the bundled Loader caches (results/<dataset>/*.npz and *.gz, ~10 MB total in the image) are present, which let the boot read precomputed arrays instead of recomputing them. WSL2 should be configured for at least 12 GB to give Docker enough headroom; the recommended .wslconfig is:
[wsl2]
memory=12GB
processors=4
swap=4GB
docker-compose.yml also sets shm_size: "2gb" to avoid Bus error from PyTorch's shared-memory paths under Docker's 64 MB /dev/shm default.
MultiProx symmetry safeguard
graphgen_inference._collapse_final symmetrises the edge tensor before calling model.sample_discrete_graph_given_z0. The model has a strict assert (pred_E == pred_E.T).all(); the MultiProx Gibbs aggregation (mean / median over multiple chains) can introduce ULP-level asymmetry that survives into pred_E and trips the assert on some BLAS / vectorization stacks (notably the Linux +cu118 torch wheel inside the deployment container, while the same code runs fine on the Windows wheel in dev). E = (E + E.T) / 2 is a no-op on already-symmetric input and a one-line invariant fix when it isn't.
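The symmetrisation can be illustrated in pure Python so it runs without PyTorch. Nested lists stand in for the edge tensor, and symmetrise is a hypothetical name for the E = (E + E.T) / 2 step. Because floating-point addition is commutative, entries (i, j) and (j, i) are computed from the same sum and come out bit-identical.

```python
def symmetrise(E):
    """Return (E + E^T) / 2 for a square matrix given as nested lists."""
    n = len(E)
    return [[(E[i][j] + E[j][i]) / 2 for j in range(n)] for i in range(n)]


# 0.1 + 0.2 differs from 0.3 in the last bit, mimicking ULP-level asymmetry.
E = [
    [0.0, 0.1 + 0.2],
    [0.3, 0.0],
]
S = symmetrise(E)
is_symmetric = all(S[i][j] == S[j][i] for i in range(2) for j in range(2))
```

On input that is already symmetric each entry becomes (x + x) / 2, which equals x exactly for ordinary finite values, so the step leaves well-formed tensors unchanged.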
Cross-references
- explanation/architecture.md: where this lifecycle sits in the request flow.
- reference/backend-services.md: module-by-module reference for ModelRegistry and the inference helpers.
- reference/sse-protocol.md: the streaming envelope each diffusion run produces.