Andrej Janchevski

docs(deploy): refresh for the post-launch deployment iteration

5ed6f37 8 days ago

	# Backend services

	Module-by-module reference for `src/backend/api/`. The Django app is named `api`; the project package (`research_api`) holds the settings, the root URL config, and the WSGI entry point. See [explanation/architecture.md](../explanation/architecture.md) for how these modules fit together.

	## `research_api/` — Django project

	| File | Role |
	|---|---|
	| `settings.py` | All configuration, env-var driven. Adds `src/research/*` to `sys.path` so the research code imports cleanly. Configures WhiteNoise, CORS, DRF, security middleware, and paths under `RESEARCH_ROOT` / `CHECKPOINTS_ROOT`. |
	| `urls.py` | Root URL config. Mounts `/api/v1/` and a non-API SPA catch-all that returns `dist/index.html`. |
	| `wsgi.py` | Standard `get_wsgi_application()` entry point. Used by gunicorn. |
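
The `sys.path` step that `settings.py` performs can be pictured roughly as follows. This is a minimal sketch rather than the project's code: the helper name `add_research_paths` and the choice to add every immediate subdirectory of the research root are assumptions made for illustration.

```python
import sys
from pathlib import Path


def add_research_paths(research_root: Path) -> list[str]:
    """Append each immediate subdirectory of research_root to sys.path.

    Hypothetical helper: the real settings module may pick directories
    differently. Returns the entries that were newly added.
    """
    added: list[str] = []
    if not research_root.is_dir():
        return added
    for child in sorted(research_root.iterdir()):
        entry = str(child)
        # Skip plain files and anything already importable.
        if child.is_dir() and entry not in sys.path:
            sys.path.append(entry)
            added.append(entry)
    return added
```

Calling the helper twice with the same root adds nothing the second time, which keeps repeated settings imports from growing `sys.path`.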

	## `api/` — Django app

	### `apps.py`
	`ApiConfig.ready()` runs once at boot. It calls `ModelRegistry.initialize()` unless one of two skip-checks matches:

	- `sys.argv[1]` is in `_SKIP_REGISTRY_INIT` (`collectstatic`, `migrate`, `makemigrations`, `check`, `shell`, `showmigrations`, `diffsettings`, `test`, `compilemessages`, `makemessages`). This stops `python manage.py collectstatic --noinput` from triggering a multi-GB checkpoint download into a throwaway image layer.
	- The process is the outer `runserver` reloader (`RUN_MAIN != "true"`). This stops dev mode from doing the heavy boot twice.
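
The two skip-checks can be sketched as one pure function. This is an illustration under stated assumptions, not the contents of `apps.py`: the function name `should_initialize_registry` is hypothetical, and the `--noreload` handling is an assumption about how a reloader-free dev server should behave.

```python
import os
import sys

# Management commands that must not trigger the heavy registry boot
# (names copied from the list above).
_SKIP_REGISTRY_INIT = frozenset({
    "collectstatic", "migrate", "makemigrations", "check", "shell",
    "showmigrations", "diffsettings", "test", "compilemessages", "makemessages",
})


def should_initialize_registry(argv=None, environ=None) -> bool:
    """Return True only when this process should load models at boot."""
    argv = sys.argv if argv is None else argv
    environ = os.environ if environ is None else environ
    command = argv[1] if len(argv) > 1 else ""

    # Check 1: lightweight management commands never need the models.
    if command in _SKIP_REGISTRY_INIT:
        return False

    # Check 2: with the autoreloader on, only the child process has
    # RUN_MAIN=true; the outer watcher process should stay light.
    # Assumption: `--noreload` means there is no outer process to skip.
    if command == "runserver" and "--noreload" not in argv:
        return environ.get("RUN_MAIN") == "true"

    return True
```

Passing `argv` and `environ` explicitly keeps the decision testable without spawning a real `manage.py` process.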

	### `urls.py`
	Maps every endpoint listed in [reference/api.md](api.md) to the matching view class.

	### `exceptions.py`
	The error envelope. All exceptions raised inside views inherit from `ApiError`, which has a `code` and a `details` dict. `api_exception_handler` wraps every error in `{"error": {"code": ..., "message": ..., "details": ...}}`. Subclasses:

	| Class | HTTP | `code` |
	|---|---|---|
	| `NotFoundError` | 404 | `NOT_FOUND` |
	| `InvalidRequestError` | 400 | `INVALID_REQUEST` |
	| `InferenceError` | 422 | `INFERENCE_ERROR` |
	| `InferenceBusy` | 429 | `INFERENCE_BUSY` |
	| `ModelUnavailable` | 503 | `MODEL_UNAVAILABLE` |

	### `pagination.py`
	Tiny helper for the entity / relation list endpoints (1-indexed `page`, default `page_size=50`).
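
The 1-indexed convention translates into slice arithmetic like this. The sketch is an assumption about the helper's shape: the function name `paginate` and the response keys (`page`, `page_size`, `total`, `results`) are illustrative and may differ from `pagination.py`.

```python
def paginate(items, page: int = 1, page_size: int = 50) -> dict:
    """Return one page of `items` using a 1-indexed page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    start = (page - 1) * page_size  # page 1 starts at offset 0
    return {
        "page": page,
        "page_size": page_size,
        "total": len(items),
        "results": list(items[start:start + page_size]),
    }
```

A page past the end returns an empty `results` list rather than raising, which is one reasonable choice for list endpoints.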

	### `renderers.py`
	`EventStreamRenderer` declares `text/event-stream` so DRF content negotiation accepts SSE clients. Streaming views return a `StreamingHttpResponse` directly, so the renderer's `render()` is never invoked — this class exists only to satisfy DRF's accept-header machinery.

	### `utils.py`
	String cleanup helpers. `clean_entity_name` and `clean_relation_name` strip dataset-specific prefixes (`/m/...` for Freebase, namespace prefixes for NELL, etc.) so the UI shows readable labels.

	## `api/views/` — endpoint handlers

	| File | Endpoints | Notes |
	|---|---|---|
	| `health.py` | `/`, `/health`, `/methods`, `/debug/force-unlock` | Trivial views. `HealthView` reads the registry's load flags, and `/debug/force-unlock` calls `force_release_inference_lock()`. |
	| `coins.py` | `/coins/*` | Discovery views read directly from the registry's pre-built dictionaries. `CoinsPredictView` calls `ModelRegistry.coins_predict`, which acquires the inference lock. |
	| `graph_generation.py` | `/graph-generation/*` | `GraphGenGenerateView` and `GraphGenContinueView` return `StreamingHttpResponse(generator)` where the generator yields SSE-formatted bytes. The lock is acquired before the generator starts and released in its `finally`. |
	| `kg_anomaly.py` | `/kg-anomaly/*` | Same shape as graph generation. The `correct` task computes a [KG log-likelihood](../glossary.md#sampling-mode) per chain frame. |

	Every view either:
	- Returns a `Response` (DRF JSON), or
	- Returns a `StreamingHttpResponse` whose generator yields `event: ...\ndata: ...\n\n` strings encoded as bytes.
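
The streaming shape (SSE frames as bytes, lock released in the generator's `finally`) can be sketched without Django. The names `format_sse` and `locked_stream` are hypothetical, and JSON-encoding the `data:` payload is an assumption; the authoritative wire format is in the SSE protocol reference.

```python
import json
import threading


def format_sse(event: str, data: dict) -> bytes:
    """Serialise one Server-Sent Events frame as UTF-8 bytes."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n".encode("utf-8")


def locked_stream(lock: threading.Lock, events):
    """Yield SSE frames for (event, data) pairs, then release `lock`.

    The caller is expected to hold `lock` already, mirroring the
    "acquire before the generator starts" ordering described above.
    """
    try:
        for name, payload in events:
            yield format_sse(name, payload)
    finally:
        # Runs on normal completion, on error, and when the consumer
        # closes the generator early (for example on client disconnect).
        lock.release()
```

One design caveat of this pattern: a Python generator that is never iterated does not execute its `finally`, so the code that acquires the lock has to make sure the generator is actually started or release the lock itself on that path.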

	## `api/services/` — business logic

	The heart of the backend. These modules import the research code under `src/research/` and host all PyTorch inference.

	### `constants.py`
	Domain metadata used by the discovery endpoints:

	- `METHODS` — the three research methods with thesis sections.
	- `COINS_DATASET_META` — display names, descriptions, raw-data directory mapping.
	- `COINS_MODELS` — algorithm definitions and supported `query_structure` lists.
	- `QUERY_STRUCTURES` — frontend rendering templates (anchor / variable / relation slot positions, edge connectivity).
	- `COINS_CONFIG_SUFFIX` — yaml-config naming convention for each algorithm.
	- `QUERY_TREE_MAPPINGS` — research-code structure strings (e.g. `1p2i`) and slot mappings consumed by `Query.instantiate`.

	### `registry.py`
	The single most important module. Owns `ModelRegistry`, the in-memory cache of everything the API needs at request time.

	Public surface (used by views):

	| Method | Returns |
	|---|---|
	| `ModelRegistry.get()` | The singleton (raises if not initialized). |
	| `get_loader(dataset_id)` | The lightweight COINs Loader for discovery endpoints. |
	| `get_entity_count`, `get_relation_count` | Cardinalities for `/coins/datasets`. |
	| `get_inverted_name_maps(dataset_id)` | `(inv_node_names, inv_node_types, inv_relation_names)` Series. |
	| `search_entities`, `search_relations` | Substring search over labels, with pagination. |
	| `sample_triples` | Random training triples. Optional `seed` for determinism. |
	| `sample_query` | Calls `Query.instantiate` to walk the graph and produce a structurally valid query. |
	| `coins_predict(...)` | Acquires the lock, runs prediction, releases. |
	| `graphgen_generate_stream(...)` | Returns a generator (lock is held by the generator). |
	| `graphgen_continue_stream(...)` | Decodes a state blob, advances one Gibbs round. |
	| `kg_anomaly_correct_stream(...)` | Same shape as graphgen. |
	| `kg_anomaly_continue_stream(...)` | Same shape. |
	| `force_release_inference_lock()` | Called by the debug endpoint. |
	| `is_coins_loaded`, `is_graphgen_loaded`, `is_kg_anomaly_loaded` | Health-endpoint signals. |
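
The continuation methods in the table above rely on a state blob that the client sends back. The sketch below shows only the base64 round-trip idea. It assumes a JSON-serialisable state, which is a simplification (the real blob presumably carries model tensors), and the function names `encode_blob` and `decode_blob` are hypothetical.

```python
import base64
import json


def encode_blob(state: dict) -> str:
    """Pack a JSON-serialisable state into a URL-safe base64 string."""
    raw = json.dumps(state, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_blob(blob: str) -> dict:
    """Inverse of encode_blob; raises ValueError on malformed input."""
    try:
        raw = base64.urlsafe_b64decode(blob.encode("ascii"))
        return json.loads(raw)
    except (ValueError, UnicodeError) as exc:
        # binascii.Error and json.JSONDecodeError both subclass ValueError.
        raise ValueError("malformed state blob") from exc
```

Because the blob comes from the client, a decoder of this kind should treat it as untrusted input and report failures as a request error instead of letting them surface as a server fault.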

	Internal state:

	- `coins_checkpoints_available`, `graphgen_checkpoints_available`, `kg_anomaly_checkpoints_available` — populated by `_scan_checkpoints`.
	- `loaders` — `{dataset_id: lightweight Loader}` for discovery endpoints.
	- `_coins_experiments`, `_graphgen_models`, `_kg_anomaly_models` — lazy caches keyed by request parameters.
	- `_coins_loaders` — full Loaders shared across algorithms with the same `(dataset, seed, leiden_resolution)`.
	- `_inference_lock` — the global single-flight gate.

	Initialization is a four-step sequence described in [explanation/inference-lifecycle.md](../explanation/inference-lifecycle.md).

	Checkpoint loading helpers live in the same module:

	- `_safe_load_lightning_checkpoint` — loads a Lightning checkpoint without triggering DDP / `deepcopy` crashes.
	- `_adapt_shape_mismatches`, `_adapt_mlp_bn_keys`, `_adapt_kbgat_state_dict` — torch-geometric 2.0.x → 2.3.x weight-format compatibility shims.
	- `_free_heavy_arrays` — discards memory-intensive Loader fields after init.

	`_load_coins_experiment` wraps each `experiment.prepare()` call in two monkey-patches (restored in a `finally`) — see [explanation/inference-lifecycle.md](../explanation/inference-lifecycle.md#monkey-patches-around-experimentprepare) for the rationale:

	- `Module.share_memory` → no-op (avoids `Bus error` from PyTorch shared-memory paths under tight `/dev/shm`).
	- `torch.load` → TransE-init dim expansion (repeats `transe_model.tar` weights along the embedding axis when YAML's `embedding_dim` is an integer multiple of the init's dim, so KBGAT's `weight.data = init` doesn't clobber the model's declared dim).
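
Both patches follow a patch-then-restore pattern, and the second one widens embeddings by an integer factor. The sketch below illustrates those two ideas with plain Python objects. It makes assumptions: `patched` and `expand_embedding_rows` are hypothetical names, nested lists stand in for tensors, and "repeats along the embedding axis" is read as tiling each row.

```python
from contextlib import contextmanager


@contextmanager
def patched(owner, name: str, replacement):
    """Temporarily replace owner.<name>, restoring the original on exit."""
    original = getattr(owner, name)
    setattr(owner, name, replacement)
    try:
        yield original
    finally:
        # Restore even if the wrapped call raised.
        setattr(owner, name, original)


def expand_embedding_rows(rows, target_dim: int):
    """Tile each row until it has target_dim columns.

    Rows come back unchanged unless target_dim is an integer multiple
    (greater than 1) of the current width.
    """
    if not rows:
        return rows
    width = len(rows[0])
    if width == 0 or target_dim <= width or target_dim % width != 0:
        return rows
    factor = target_dim // width
    return [list(row) * factor for row in rows]
```

Wrapping a call as `with patched(SomeClass, "share_memory", lambda self: self): ...` makes the method a no-op only for the duration of the block, so other code paths keep the original behaviour.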

	### `coins_inference.py`
	`coins_predict_inner(experiment, dataset_id, algorithm, query_structure_id, anchors, variables, relations_map, top_k)` — runs a single COINs prediction. Validates the query, builds the embedding query, scores candidate tails, returns the top-k with cleaned names and the community-rank info.

	### `graphgen_inference.py`
	The MultiProxAn / DiGress sampling loop.

	- `run_standard_generation(model, num_nodes, diffusion_steps, chain_frames, dataset_id)` — single denoising chain. Yields `progress`, `preview`, `result` events.
	- `run_multiprox_init(model, num_nodes, n, m, t, t_prime, gibbs_chain_freq, dataset_id)` — initial denoise to step `t_prime`. Returns the partial state for a `/continue` follow-up.
	- `run_multiprox_step(model, state, dataset_id)` — one Gibbs round.
	- `encode_state_blob` / `decode_state_blob` — base64 round-trip for the [continuation token](../glossary.md#continuation-token--state-blob).
	- `_collapse_final` symmetrises `E` (`E = (E + E.T) / 2`) before calling `model.sample_discrete_graph_given_z0`. The model has a strict symmetry assert that's tripped by ULP-level drift from the MultiProx aggregation on some BLAS stacks. See the [MultiProx symmetry safeguard](../explanation/inference-lifecycle.md#multiprox-symmetry-safeguard) note.
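
The symmetrisation step is easy to check on a toy matrix. The sketch below uses nested lists in place of tensors and a single 2-D matrix, whereas the real `E` is a model tensor that likely has extra batch and feature axes; the helper names are hypothetical.

```python
def symmetrise(matrix):
    """Return (M + M^T) / 2 for a square matrix given as nested lists."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square")
    return [
        [(matrix[i][j] + matrix[j][i]) / 2 for j in range(n)]
        for i in range(n)
    ]


def is_exactly_symmetric(matrix) -> bool:
    """Strict equality check, like an assert with no tolerance."""
    n = len(matrix)
    return all(
        matrix[i][j] == matrix[j][i]
        for i in range(n)
        for j in range(i + 1, n)
    )
```

Floating-point addition is commutative, so entries `(i, j)` and `(j, i)` of the result are computed from the same sum and compare equal bit for bit, which is why averaging with the transpose satisfies a strict symmetry assert that rounding drift would otherwise trip.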

	### `kg_anomaly_inference.py`
	The KG-subgraph correction loop. Mirrors `graphgen_inference.py` but operates on knowledge-graph subgraphs and computes the KG log-likelihood metric per frame using the frozen COINs link ranker.

	- `build_kg_tensors(subgraph, loader, model)` — converts the request payload into the model's input tensors.
	- `run_standard_correction(...)` and `run_multiprox_correction_init(...)` / `run_multiprox_correction_step(...)` — analogous to graphgen.

	### `kg_likelihood.py`
	Helper that scores edges with the COINs link ranker and computes the mean log-sigmoid metric the SSE protocol surfaces.
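
The mean log-sigmoid metric is, for edge scores $s_1, \dots, s_n$, the average of $\log \sigma(s_i)$. A minimal sketch follows, assuming the scores are raw logits where a higher value means a more plausible edge; the function names are hypothetical and the real helper works on ranker outputs, not Python floats.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    # For negative x, rewrite so exp() never overflows.
    return x - math.log1p(math.exp(x))


def mean_log_sigmoid(scores) -> float:
    """Average log-sigmoid over a non-empty collection of edge scores."""
    scores = list(scores)
    if not scores:
        raise ValueError("at least one edge score is required")
    return sum(log_sigmoid(s) for s in scores) / len(scores)
```

The metric is always at most zero and moves toward zero as the scored edges become more plausible, so a chain frame with a higher value is a better-scoring subgraph under this reading.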

	## See also

	- [explanation/inference-lifecycle.md](../explanation/inference-lifecycle.md) — boot, lazy load, lock.
	- [reference/api.md](api.md) — endpoint contracts.
	- [reference/sse-protocol.md](sse-protocol.md) — wire format the streaming services produce.