Andrej Janchevski
# Glossary
Domain terms used throughout the codebase and documentation. Other documents reference this file rather than redefining terms in place.
## Knowledge graph reasoning
### Knowledge graph (KG)
A directed multigraph of `(head, relation, tail)` triples where vertices are entities and labelled edges are relations. The three KGs exposed by the site are FB15k-237 (Freebase subset), WN18RR (WordNet subset) and NELL-995.
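As a minimal sketch of this data structure, a KG can be held as a list of triples plus an adjacency index. The triples below are invented for illustration and are not taken from FB15k-237, WN18RR or NELL-995; the `adjacency` layout is one possible choice, not necessarily the one the codebase uses.

```python
from collections import defaultdict

# Toy triples, invented for illustration (not from any of the three datasets).
triples = [
    ("paris", "capital_of", "france"),
    ("paris", "located_in", "france"),  # parallel edge: same endpoints, different relation
    ("france", "member_of", "eu"),
]

# Adjacency index: head -> relation -> set of tails.
adjacency = defaultdict(lambda: defaultdict(set))
for head, relation, tail in triples:
    adjacency[head][relation].add(tail)

# Vertices are entities; edge labels are relations.
entities = {e for h, _, t in triples for e in (h, t)}
relations = {r for _, r, _ in triples}
```

The two `paris`/`france` edges show why the structure is a multigraph: one ordered pair of vertices can carry several differently labelled edges.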
### Link prediction
Given two of `head`, `relation`, `tail`, score and rank candidates for the missing slot. The 1-projection (`1p`) query structure is link prediction.
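The ranking step can be sketched with a TransE-style score, chosen here only because it is the simplest of the supported algorithms. The embeddings are hand-picked 2-d vectors and the helper names (`transe_score`, `rank_tails`) are hypothetical; trained models use learned, higher-dimensional embeddings.

```python
import math

# Hypothetical 2-d embeddings, hand-picked so the arithmetic is easy to follow.
entity_emb = {
    "paris": (0.0, 0.0),
    "france": (1.0, 0.0),
    "berlin": (0.0, 1.0),
    "germany": (1.0, 1.0),
}
relation_emb = {"capital_of": (1.0, 0.0)}


def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 distance between head + relation and tail."""
    h, r, t = entity_emb[head], relation_emb[relation], entity_emb[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(head, relation):
    """Score every entity as the missing tail and sort best-first."""
    scored = [(tail, transe_score(head, relation, tail)) for tail in entity_emb]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_tails("paris", "capital_of")
```

Here `ranking[0]` is `france`, because `paris + capital_of` lands exactly on its embedding. Ranking heads or relations works the same way with a different slot left open.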
### Query structure
A multi-hop / intersection / projection query template over a KG. Supported structures: `1p`, `2p`, `3p` (single chain projections), `2i`, `3i` (intersection of two/three relations), `ip` (intersection then projection), `pi` (projection then intersection). Templates determine which slots the user fills in: anchor entities (`a`, `a1`, `a2`, …), variable entities (`v1`, `v2`) and relations (`r1`, `r2`, `r3`).
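To make the templates concrete, the sketch below answers several of them symbolically, by set operations over a toy KG. The triples, the function names and the argument order of the slots are assumptions for illustration. The site's models answer these queries with embeddings rather than exact traversal.

```python
# Toy KG; entity and relation names are invented for illustration.
TRIPLES = {
    ("turing", "born_in", "london"),
    ("hawking", "born_in", "oxford"),
    ("hawking", "citizen_of", "uk"),
    ("london", "located_in", "uk"),
    ("oxford", "located_in", "uk"),
    ("uk", "member_of", "nato"),
}


def project(entities, relation):
    """Follow `relation` from every entity in `entities`; return the set of tails."""
    return {t for h, r, t in TRIPLES if r == relation and h in entities}


def q_1p(a, r1):
    """1p: a single projection from the anchor."""
    return project({a}, r1)


def q_2p(a, r1, r2):
    """2p: chain two projections; the intermediate set plays the role of v1."""
    return project(q_1p(a, r1), r2)


def q_2i(a1, r1, a2, r2):
    """2i: intersect the answers of two 1p branches."""
    return q_1p(a1, r1) & q_1p(a2, r2)


def q_ip(a1, r1, a2, r2, r3):
    """ip: intersect two branches, then project the result."""
    return project(q_2i(a1, r1, a2, r2), r3)


def q_pi(a1, r1, r2, a2, r3):
    """pi: a 2p branch intersected with a 1p branch."""
    return q_2p(a1, r1, r2) & q_1p(a2, r3)
```

`3p` and `3i` extend `2p` and `2i` by one more projection or one more branch, respectively.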
### COINs
*Community-Informed Graph Embeddings*, the link-prediction / query-answering approach from PhD thesis section 3.1. Partitions the KG into communities via Leiden clustering, learns separate community-local and global embeddings, and combines them at scoring time. Reduces compute relative to full-graph methods on large KGs.
### Leiden clustering
A community-detection algorithm refining Louvain. The `leiden_resolution` parameter trades community count against community size: a higher resolution yields more, smaller communities. Each dataset has one configured resolution, which is shared by all COINs algorithms for that dataset.
### Algorithm (COINs context)
An embedding scoring family. Supported: TransE (translational), DistMult (bilinear), ComplEx (complex-valued bilinear), RotatE (rotation in complex space), Q2B (Query2Box, which represents queries as hyper-rectangles and supports box queries), KBGAT (graph-attention message passing). Each algorithm declares which `query_structure`s it can answer.
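For orientation, the textbook scoring functions of three of these families can be written in a few lines of plain Python. This is a sketch of the published formulas on tiny hand-written vectors, not the repository's implementation; the function names are hypothetical, and Q2B and KBGAT are omitted because they need more machinery than fits here.

```python
import cmath
import math


def distmult_score(h, r, t):
    """DistMult: sum_i h_i * r_i * t_i (symmetric in head and tail)."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def complex_score(h, r, t):
    """ComplEx: Re(sum_i h_i * r_i * conj(t_i)), which can model asymmetric relations."""
    return sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t)).real


def rotate_score(h, phases, t):
    """RotatE: rotate each head coordinate by a phase, then take the negative distance to the tail."""
    rotated = [hi * cmath.exp(1j * p) for hi, p in zip(h, phases)]
    return -sum(abs(x - ti) for x, ti in zip(rotated, t))


# Swapping head and tail leaves the DistMult score unchanged...
dm_forward = distmult_score([1, 2], [1, 1], [3, 4])
dm_backward = distmult_score([3, 4], [1, 1], [1, 2])

# ...but flips the sign of the ComplEx score for this relation.
cx_forward = complex_score([1 + 1j], [1j], [-1 + 1j])
cx_backward = complex_score([-1 + 1j], [1j], [1 + 1j])

# A quarter-turn rotation maps 1 onto i, so this triple scores (almost exactly) zero.
rot = rotate_score([1 + 0j], [math.pi / 2], [1j])
```

The symmetry difference between DistMult and ComplEx is one reason an algorithm may suit some relations or query structures better than others.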
## Graph generation
### MultiProxAn
The graph-generation method from PhD thesis section 4.3. A DiGress-style discrete denoising diffusion model augmented with *MultiProx*, an outer loop over multiple noisy initializations sampled jointly. The Gibbs inner step refines the current graph against several samples, raising sample quality on small graphs (e.g. QM9 molecules).
### DiGress
The base discrete denoising diffusion architecture for graphs. Forward process noises a graph by category permutation; the model learns to reverse the process step by step.
### Sampling mode
Either `standard` (one denoising chain to a single output) or `multiprox` (the outer Gibbs loop wraps several chains). MultiProx adds the parameters `n` (chains), `m` (Gibbs rounds), `t` and `t_prime` (intermediate timesteps), and `gibbs_chain_freq` (preview cadence).
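One way to carry these parameters in client or server code is a small config object. The sketch below is an assumption throughout: the class name, the integer types, the `None` defaults and the `to_payload` helper are hypothetical, and the real API may validate or name things differently. Only the mode values and parameter names come from the definition above.

```python
from dataclasses import asdict, dataclass
from typing import Optional

VALID_MODES = ("standard", "multiprox")


@dataclass
class SamplingConfig:
    mode: str = "standard"
    # MultiProx-only parameters; integer types and None defaults are assumptions.
    n: Optional[int] = None                 # chains
    m: Optional[int] = None                 # Gibbs rounds
    t: Optional[int] = None                 # intermediate timestep
    t_prime: Optional[int] = None           # intermediate timestep
    gibbs_chain_freq: Optional[int] = None  # preview cadence

    def __post_init__(self):
        if self.mode not in VALID_MODES:
            raise ValueError(f"unknown sampling mode: {self.mode!r}")

    def to_payload(self):
        """Drop unset fields so a standard request carries only the mode."""
        return {k: v for k, v in asdict(self).items() if v is not None}


standard = SamplingConfig().to_payload()
# Illustrative values only; sensible ranges depend on the model.
multiprox = SamplingConfig(mode="multiprox", n=4, m=2, t=100, t_prime=50,
                           gibbs_chain_freq=1).to_payload()
```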
### Discrete vs. continuous
Two model variants per dataset. Discrete predicts categorical distributions over node/edge types directly; continuous predicts in a relaxed continuous space and rounds at the end. Checkpoints are named `{dataset}.ckpt` (discrete) and `{dataset}_c.ckpt` (continuous).
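The naming rule is simple enough to express as a pair of helpers. These functions are hypothetical illustrations of the convention, not code from the registry, and the lowercase dataset key `qm9` is an assumed example.

```python
def checkpoint_filename(dataset: str, continuous: bool = False) -> str:
    """Build the checkpoint name: {dataset}.ckpt or {dataset}_c.ckpt."""
    suffix = "_c" if continuous else ""
    return f"{dataset}{suffix}.ckpt"


def parse_checkpoint_filename(filename: str):
    """Invert the convention: return (dataset, continuous)."""
    if not filename.endswith(".ckpt"):
        raise ValueError(f"not a checkpoint file: {filename!r}")
    stem = filename[: -len(".ckpt")]
    if stem.endswith("_c"):
        return stem[: -len("_c")], True
    return stem, False


discrete_name = checkpoint_filename("qm9")
continuous_name = checkpoint_filename("qm9", continuous=True)
```

A dataset whose own name ends in `_c` would parse ambiguously under this rule, so the sketch assumes no such dataset exists.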
## KG anomaly correction
### Subgraph
A small (≤ 20-node) connected sample drawn from a COINs Loader's DFS context-subgraph partitioning. Used as input/output for the KG anomaly demo.
### Task (kg-anomaly)
Either `generate` (sample a fresh subgraph from noise) or `correct` (denoise a user-supplied subgraph back toward something the model considers plausible). Each `(dataset, task)` pair has its own checkpoint.
### Bipartite vs. unipartite subgraph
The DFS partitioner emits both: bipartite subgraphs split nodes into two halves with edges across, unipartite subgraphs are a single connected blob. The frontend renders them differently.
## Inference protocol
### Inference lock
A single `threading.Lock` in `ModelRegistry`. Only one inference runs at a time across the whole process (the free HF Spaces tier has 2 vCPUs and no GPU); a busy server returns HTTP 429 (`INFERENCE_BUSY`). `/api/v1/debug/force-unlock` releases a stuck lock when `DEBUG=True`.
### SSE (Server-Sent Events)
The streaming-inference protocol the graph-generation and kg-anomaly endpoints use. Each event has a `type` (`progress` | `preview` | `result`) and a JSON payload. See [reference/sse-protocol.md](reference/sse-protocol.md).
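As a rough illustration, the sketch below formats and parses events in the standard SSE wire format (`event:` and `data:` fields, blank line between events). It assumes the event `type` travels in the `event:` field and that payloads are single-line JSON; the payload keys are invented. The linked protocol reference is authoritative for the actual layout.

```python
import json

EVENT_TYPES = {"progress", "preview", "result"}


def format_sse(event_type, payload):
    """Serialize one event: an `event:` line, a `data:` line, then a blank line."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type!r}")
    return f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"


def parse_sse(stream_text):
    """Split a buffered stream into (type, payload) pairs."""
    events = []
    for block in stream_text.split("\n\n"):
        event_type, data_lines = None, []
        for line in block.splitlines():
            if line.startswith("event:"):
                event_type = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if data_lines:
            events.append((event_type, json.loads("\n".join(data_lines))))
    return events


# Payload keys ("step", "ok") are hypothetical.
stream = format_sse("progress", {"step": 1}) + format_sse("result", {"ok": True})
events = parse_sse(stream)
```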
### Continuation token / state blob
MultiProx sampling can pause between Gibbs rounds. The `result` event of a `/generate` or `/correct` call returns a base64-encoded state blob; the client posts that blob to the matching `/continue` endpoint to advance one more round.
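The round trip can be sketched as follows. The blob's real contents are not specified here, so this example assumes a JSON-serializable dictionary with a hypothetical `round` counter; the actual state likely holds sampler tensors in a different serialization. Only the base64 encoding comes from the definition above.

```python
import base64
import json


def encode_state(state: dict) -> str:
    """Serialize sampler state to a base64 string safe to embed in a JSON payload."""
    return base64.b64encode(json.dumps(state).encode("utf-8")).decode("ascii")


def decode_state(blob: str) -> dict:
    """Inverse of encode_state."""
    return json.loads(base64.b64decode(blob.encode("ascii")).decode("utf-8"))


def continue_one_round(blob: str) -> str:
    """Stand-in for a /continue handler: decode, advance one Gibbs round, re-encode."""
    state = decode_state(blob)
    state["round"] += 1  # hypothetical field; a real handler would run the sampler here
    return encode_state(state)


first_blob = encode_state({"round": 1, "dataset": "qm9"})
second_blob = continue_one_round(first_blob)
```

Because the blob carries everything needed to resume, the server in this sketch keeps no per-client state between rounds.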
### Inference lifecycle
Boot-time: pre-warm checkpoints from HF Hub, scan checkpoint dirs, load lightweight COINs Loaders, generate sample subgraphs. First-request: lazy-load the relevant model weights into memory. See [explanation/inference-lifecycle.md](explanation/inference-lifecycle.md).
## Deployment
### HF Space
A Hugging Face Spaces application running this repo's `Dockerfile`. The deployed URL is `https://bani57-website.hf.space`. The Space repo is `Bani57/website`.
### HF Hub model repo
`Bani57/checkpoints` holds all PyTorch weights. It mirrors the on-disk layout under `CHECKPOINTS_ROOT` so `huggingface_hub.snapshot_download` populates files in their expected paths and the registry's scan logic finds them unchanged.
### Persistent storage (HF Spaces)
A paid `/data` volume that survives Space restarts. Free Spaces have a 50 GB ephemeral disk that resets on restart. Without persistent storage, every cold start re-downloads checkpoints from HF Hub.