
Andrej Janchevski

docs(deploy): refresh for the post-launch deployment iteration

5ed6f37 27 days ago


Inference lifecycle

How models get from disk to RAM to a response. The site trades slow cold starts for fast warm requests: heavy work happens once at boot, individual model weights load on first inference, and a single inference lock keeps expensive jobs from running concurrently on a 2 vCPU box.

Boot sequence

api/apps.ApiConfig.ready() runs once when Django imports the app — typically inside the gunicorn master under --preload, before any worker forks. It calls ModelRegistry.initialize(), which executes four steps in order:

```mermaid
flowchart TB
    Start([AppConfig.ready]) --> Mgmt{argv in<br/>SKIP_REGISTRY_INIT?}
    Mgmt -->|yes: collectstatic,<br/>migrate, check, …| Done([return])
    Mgmt -->|no| Skip{outer auto-reloader?}
    Skip -->|yes| Done
    Skip -->|no| Init[ModelRegistry.initialize]
    Init --> DL[_download_checkpoints<br/>HF Hub snapshot_download]
    DL --> Scan[_scan_checkpoints<br/>list available files]
    Scan --> Loaders[_load_all_loaders<br/>3 lightweight Loaders]
    Loaders --> Sub[_generate_sample_subgraphs<br/>per-dataset DFS partitions]
    Sub --> Ready([gunicorn worker forks via copy-on-write])
```

The SKIP_REGISTRY_INIT set covers collectstatic, migrate, makemigrations, check, shell, test, etc. Without that guard, python manage.py collectstatic --noinput at image build time would trigger the full ~6 GB checkpoint download into a throwaway layer.
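
The guard can be sketched roughly as follows. This is a minimal illustration, not the site's code: the helper name `should_initialize_registry`, the exact contents of the set, and the use of Django's `RUN_MAIN` environment variable to detect the outer auto-reloader process are all assumptions.

```python
import os
import sys

# Assumed contents: the commands named above (the real set has more entries).
SKIP_REGISTRY_INIT = {"collectstatic", "migrate", "makemigrations", "check", "shell", "test"}


def should_initialize_registry(argv=None, environ=None):
    """Return True only when this process should download and load models."""
    argv = sys.argv if argv is None else argv
    environ = os.environ if environ is None else environ
    command = argv[1] if len(argv) > 1 else ""
    if command in SKIP_REGISTRY_INIT:
        return False
    # Under `manage.py runserver`, Django's auto-reloader starts a child
    # process with RUN_MAIN="true"; the outer watcher process does not have it.
    if command == "runserver" and "--noreload" not in argv and environ.get("RUN_MAIN") != "true":
        return False
    return True
```

Under this sketch, a hypothetical `ready()` would call `ModelRegistry.initialize()` only when the helper returns True, so management commands run at image build time return immediately.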

1. Download checkpoints (idempotent)

_download_checkpoints checks every expected subdir under CHECKPOINTS_ROOT; if any is empty it calls huggingface_hub.snapshot_download(repo_id="Bani57/checkpoints") to pull the missing files. The repo's directory layout mirrors the on-disk one, so files land in their final location and the scan code doesn't need a redistribution step.

In production the entrypoint.sh script also calls snapshot_download before gunicorn starts. That makes Django's call a no-op on a normal cold start; it only does real work when run outside the container or when the entrypoint was bypassed.
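
A minimal sketch of the idempotent check, under stated assumptions: the function name, the `expected_subdirs` parameter, the injectable `download` callable, and the `local_dir` argument are illustrative choices rather than the registry's actual signature. Only the `snapshot_download(repo_id="Bani57/checkpoints")` call comes from the description above.

```python
from pathlib import Path


def download_missing_checkpoints(root, expected_subdirs, download=None):
    """Fetch checkpoints only when an expected subdirectory is missing or empty."""
    root = Path(root)
    missing = [
        sub for sub in expected_subdirs
        if not (root / sub).is_dir() or not any((root / sub).iterdir())
    ]
    if not missing:
        return False  # everything is already on disk: the call is a no-op
    if download is None:
        # Imported lazily so the emptiness check has no third-party dependency.
        from huggingface_hub import snapshot_download
        download = snapshot_download
    # The repo layout mirrors the on-disk layout, so files land in place.
    download(repo_id="Bani57/checkpoints", local_dir=str(root))
    return True
```

Returning a boolean makes the no-op case visible to the caller: a second call after a successful download returns False without touching the network.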

2. Scan checkpoint directories

Three globs:

COINs-KGGeneration/graph_completion/checkpoints/*.tar → COINs algorithms per dataset.
MultiProxAn/checkpoints/*.ckpt → discrete or continuous (_c suffix) per dataset.
COINs-KGGeneration/graph_generation/checkpoints/*.ckpt → KG anomaly generate or correct (_correct suffix) per dataset.

The result populates coins_checkpoints_available, graphgen_checkpoints_available, and kg_anomaly_checkpoints_available on the registry. Discovery endpoints (/coins/datasets, /graph-generation/datasets, /kg-anomaly/datasets) read these dictionaries.
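
The scan step might look something like the sketch below. The glob patterns are the three listed above; the function names, the dictionary keys, and the choice to return plain file stems are assumptions, since the real registry groups entries per dataset in a way not shown here.

```python
from pathlib import Path

# Glob patterns from the list above, relative to the checkpoints root.
CHECKPOINT_GLOBS = {
    "coins": "COINs-KGGeneration/graph_completion/checkpoints/*.tar",
    "graphgen": "MultiProxAn/checkpoints/*.ckpt",
    "kg_anomaly": "COINs-KGGeneration/graph_generation/checkpoints/*.ckpt",
}


def scan_checkpoints(root):
    """Map each method to the sorted file stems found under its glob."""
    root = Path(root)
    return {
        method: sorted(path.stem for path in root.glob(pattern))
        for method, pattern in CHECKPOINT_GLOBS.items()
    }


def graphgen_variant(stem):
    """Classify a MultiProxAn checkpoint by its `_c` suffix."""
    return "continuous" if stem.endswith("_c") else "discrete"
```

Because the scan only lists files, it stays cheap at boot: no checkpoint is opened until a request needs it.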

3. Load lightweight Loaders

One Loader per COINs dataset (freebase, wordnet, nell). Each Loader wraps the dataset, name maps (entity-id → string, relation-id → string), and graph indices.

After construction, the registry frees the heavy arrays (node_neighbours, community_neighbours, adjacency dicts, machines vector) — they can take ~275 MB each per dataset and aren't needed for the discovery endpoints. They get rebuilt on demand when a full COINs experiment is loaded for inference.

4. Pre-compute sample subgraphs

For each COINs dataset, _build_sample_subgraphs walks the DFS context-subgraph partitioner and stores the first ~40 well-formed subgraphs (alternating bipartite and unipartite). The KG-anomaly demo serves these from /kg-anomaly/datasets/{id}/sample-subgraphs so visitors can click an example instead of constructing a graph by hand.

Lazy weight loading

The Loaders, samples and metadata are eager. The actual inference weights are not. Each method has a per-instance cache:

_coins_experiments[(dataset_id, algorithm)] — full COINs Experiment (embedder + link ranker), built lazily by _load_coins_experiment.
_graphgen_models[(dataset_id, model_type)] — DiGress / lifted diffusion model, built lazily by _load_graphgen_model.
_kg_anomaly_models[(dataset_id, task)] — KG anomaly model, built lazily by _load_kg_anomaly_model. Reuses the matching COINs experiment for its kg_experiment field.

The first request for a (dataset, algorithm), (dataset, model_type), or (dataset, task) combination loads the checkpoint into RAM and keeps it there. Subsequent requests skip the load.

The COINs registry also reuses Loaders across algorithms via _coins_loaders[(dataset_id, seed, leiden_resolution)]: all four transe / distmult / complex / rotate checkpoints on a dataset share the same seed and Leiden resolution, so they share one Loader and don't reload the graph four times.
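
The caching pattern can be illustrated with a small stand-alone sketch. The `LazyModelCache` class, the builder functions, and the placeholder seed and resolution values are hypothetical; the real registry keeps these caches as dictionaries on `ModelRegistry` and builds real Loaders and experiments.

```python
class LazyModelCache:
    """Build a value on the first request for a key, then reuse it."""

    def __init__(self, build):
        self._build = build  # callable(*key) -> loaded object
        self._items = {}

    def get(self, *key):
        if key not in self._items:
            self._items[key] = self._build(*key)
        return self._items[key]


loader_builds = []


def build_loader(dataset_id, seed, leiden_resolution):
    loader_builds.append(dataset_id)  # record each expensive graph load
    return {"dataset": dataset_id}


loaders = LazyModelCache(build_loader)


def build_experiment(dataset_id, algorithm):
    # Placeholder seed and resolution: all algorithms on a dataset share them.
    loader = loaders.get(dataset_id, 0, 1.0)
    return {"algorithm": algorithm, "loader": loader}


experiments = LazyModelCache(build_experiment)
for algorithm in ("transe", "distmult", "complex", "rotate"):
    experiments.get("wordnet", algorithm)
```

After the loop, four experiments exist but `loader_builds` holds a single entry, which is the sharing behaviour described above.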

Monkey-patches around `experiment.prepare()`

Two patches wrap each call to experiment.prepare() in _load_coins_experiment, restored in a finally:

Module.share_memory → no-op. The research code's prepare() calls embedder.share_memory() to share weights across multi-process training workers. Inference is single-process; the call is gratuitous, and on Linux containers with a small /dev/shm (Docker default 64 MB, free HF Spaces tmpfs similar) it raises a Bus error mid-prepare. The no-op makes prepare() return cleanly.
torch.load → TransE-init dim expansion. prepare() loads transe_model.tar to seed the embedder's entity_embeddings_initial buffers. The KBGAT embedder's __init__ then assigns weight.data = init, which silently re-shapes the YAML-declared embedding layer to the init's shape. For wordnet KBGAT this is fatal: the trained checkpoint was 200d but the wordnet TransE init is 100d, so the embedder ends up at 100d and the trained load_state_dict blows up on the dim mismatch. The patch detects TransE state dicts being loaded and repeats them along the embedding axis (e.g. 100d → 200d via cat([init, init])) when the YAML's embedding_dim is an integer multiple of the init's dim — same trick _adapt_kbgat_state_dict already uses for the GATConv multi-head expansion.
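
Both patches follow a patch-then-restore shape, sketched below without PyTorch so it runs anywhere. The `patched` and `tile_to_dim` helpers and the stand-in `Module` class are assumptions for illustration; the real code patches `torch.nn.Module.share_memory` and `torch.load`, and repeats tensors rather than Python lists.

```python
from contextlib import contextmanager


@contextmanager
def patched(owner, name, replacement):
    """Temporarily replace owner.name, restoring the original on exit."""
    original = getattr(owner, name)
    setattr(owner, name, replacement)
    try:
        yield original
    finally:
        setattr(owner, name, original)  # restored even if the body raises


def tile_to_dim(rows, target_dim):
    """Repeat each embedding row to target_dim when it is an integer multiple."""
    dim = len(rows[0])
    if target_dim % dim != 0:
        return rows  # not a clean multiple: leave the init untouched
    return [row * (target_dim // dim) for row in rows]


class Module:  # stand-in for torch.nn.Module
    def share_memory(self):
        raise RuntimeError("Bus error")  # simulates a too-small /dev/shm


def prepare(module):
    module.share_memory()
    return "prepared"


# With the no-op patch in place, prepare() returns cleanly.
with patched(Module, "share_memory", lambda self: self):
    result = prepare(Module())

expanded = tile_to_dim([[1.0, 2.0]], 4)  # a 2d row repeated to 4d
```

Restoring in a `finally` matters because the patched names are process-wide: leaving them replaced would affect every later caller, not just `prepare()`.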

Concurrency and the inference lock

ModelRegistry._inference_lock is a single threading.Lock. Every endpoint that runs PyTorch inference acquires it non-blocking; if it can't, it raises InferenceBusy (HTTP 429). The lock is released in a finally after the response is fully streamed:

COINs /coins/predict — synchronous; releases on return.
/graph-generation/generate, /continue — generator functions release the lock in their finally block.
/kg-anomaly/correct, /continue — same pattern.

The _inference_lock_owner string records who currently holds the lock. The health endpoint /api/v1/health reports it, so a visitor can see what's running.
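
A minimal sketch of the non-blocking lock with generator-based release. The `InferenceGate` class and the `stream` helper are hypothetical names; only `InferenceBusy`, the single `threading.Lock`, the owner string, and the release-in-`finally` pattern come from the description above.

```python
import threading


class InferenceBusy(Exception):
    """Raised when another job holds the lock; the API maps it to HTTP 429."""


class InferenceGate:
    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None  # reported by the health endpoint

    def acquire(self, owner):
        # Non-blocking: fail fast instead of queueing behind a long job.
        if not self._lock.acquire(blocking=False):
            raise InferenceBusy(f"busy: {self.owner}")
        self.owner = owner

    def release(self):
        self.owner = None
        self._lock.release()


def stream(gate, owner, steps):
    """Acquire before streaming starts; release when the generator ends."""
    gate.acquire(owner)

    def events():
        try:
            for step in steps:
                yield step
        finally:
            gate.release()  # runs on exhaustion, error, or generator close()

    return events()
```

Acquiring before the generator is created lets the busy error surface before any response bytes are sent. The trade-off in this sketch is that a generator which is never iterated never reaches its `finally`, which is one way a lock can stick.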

If a client disconnects mid-stream and Python's generator cleanup doesn't fire (rare but possible behind some proxies), the lock can stick. /api/v1/debug/force-unlock releases it, but only with DEBUG=True — not exposed in production.

Memory budget

The free-tier HF Space gives 16 GB RAM. Approximate usage:

Three lightweight COINs Loaders after _free_heavy_arrays: ~150 MB total.
Sample subgraph caches: <10 MB.
One COINs experiment fully loaded (embedder + link ranker for one algorithm/dataset): a few hundred MB.
One MultiProxAn / KG anomaly diffusion model: hundreds of MB to ~1 GB depending on dataset.

Worst case (everything ever requested in one Space lifetime): well under 16 GB. The current code does not evict — caches grow monotonically until the container restarts.

Local-dev memory floor

Loading all three Loaders and computing graph metrics for NELL peaks at ~5–6 GB of transient RAM. The bundled Loader caches (results/<dataset>/*.npz and *.gz, ~10 MB total in the image) avoid that peak when present: the boot reads precomputed arrays instead of recomputing them. WSL2 should be configured for at least 12 GB to give Docker enough headroom; the recommended .wslconfig is:

```ini
[wsl2]
memory=12GB
processors=4
swap=4GB
```

docker-compose.yml also sets shm_size: "2gb" to avoid Bus error from PyTorch's shared-memory paths under Docker's 64 MB /dev/shm default.

MultiProx symmetry safeguard

graphgen_inference._collapse_final symmetrises the edge tensor before calling model.sample_discrete_graph_given_z0. The model has a strict assert (pred_E == pred_E.T).all(); the MultiProx Gibbs aggregation (mean / median over multiple chains) can introduce ULP-level asymmetry that survives into pred_E and trips the assert on some BLAS / vectorization stacks (notably the Linux +cu118 torch wheel inside the deployment container, while the same code runs fine on the Windows wheel in dev). E = (E + E.T) / 2 is a no-op on already-symmetric input and a one-line invariant fix when it isn't.
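
The effect of the symmetrisation can be shown on a tiny matrix. This sketch uses nested Python lists so it runs without PyTorch; the `symmetrise` helper is illustrative, and the real code applies the same formula to the edge tensor.

```python
def symmetrise(matrix):
    """Return (E + E^T) / 2 for a square matrix given as nested lists."""
    n = len(matrix)
    return [[(matrix[i][j] + matrix[j][i]) / 2 for j in range(n)] for i in range(n)]


# 0.1 + 0.2 differs from 0.3 in the last bit, so this matrix is not
# exactly symmetric and would fail a strict equality check.
E = [[0.0, 0.1 + 0.2],
     [0.3, 0.0]]
asymmetric_before = E[0][1] != E[1][0]

S = symmetrise(E)
symmetric_after = S[0][1] == S[1][0]
```

Floating-point addition is commutative, so entries (i, j) and (j, i) of the result are computed from the same two operands and come out bit-identical.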

Cross-references

explanation/architecture.md — where this lifecycle sits in the request flow.
reference/backend-services.md — module-by-module reference for ModelRegistry and the inference helpers.
reference/sse-protocol.md — the streaming envelope each diffusion run produces.