
website / docs /glossary.md

Andrej Janchevski

docs(deploy): refresh for the post-launch deployment iteration

5ed6f37 26 days ago


	# Glossary

	Domain terms used throughout the codebase and documentation. Other documents reference this file rather than redefining terms in place.

	## Knowledge graph reasoning

	### Knowledge graph (KG)
	A directed multigraph of `(head, relation, tail)` triples where vertices are entities and labelled edges are relations. The three KGs exposed by the site are FB15k-237 (Freebase subset), WN18RR (WordNet subset) and NELL-995.

	### Link prediction
	Given two of `head`, `relation`, `tail`, score and rank candidates for the missing slot. The 1-projection (`1p`) query structure is link prediction.
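
As a rough illustration of "score and rank candidates", the sketch below ranks tail candidates for `(head, relation, ?)` with a TransE-style distance. The toy 2-D embeddings, the entity names and the `rank_tails` helper are all assumptions made for this example and are not taken from the codebase.

```python
import math

# Toy 2-D embeddings; the values are made up for illustration only.
ENTITY_EMB = {
    "paris": (1.0, 0.0),
    "france": (1.0, 1.0),
    "berlin": (3.0, 0.0),
    "germany": (3.0, 1.0),
}
RELATION_EMB = {"capital_of": (0.0, 1.0)}


def transe_score(head, relation, tail):
    """Higher is better: negative L2 distance between head + relation and tail."""
    h, r, t = ENTITY_EMB[head], RELATION_EMB[relation], ENTITY_EMB[tail]
    return -math.dist((h[0] + r[0], h[1] + r[1]), t)


def rank_tails(head, relation):
    """Rank every entity except the head as a candidate for the missing tail."""
    candidates = [e for e in ENTITY_EMB if e != head]
    return sorted(
        candidates,
        key=lambda tail: transe_score(head, relation, tail),
        reverse=True,
    )


ranking = rank_tails("paris", "capital_of")
```

Head or relation prediction follows the same pattern with a different slot left open.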

	### Query structure
	A query template over a KG built from projection (relation-following) and intersection steps. Supported structures: `1p`, `2p`, `3p` (chains of one, two or three projections), `2i`, `3i` (intersection of two or three single-hop branches), `ip` (intersection then projection), `pi` (projection then intersection). The template determines which slots the user fills in: anchor entities (`a`, `a1`, `a2`, …), variable entities (`v1`, `v2`) and relations (`r1`, `r2`, `r3`).
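
One way to picture the templates is as a table of slots per structure. The sketch below is hypothetical: the exact slot assignment per structure follows the usual multi-hop query conventions rather than the codebase, and `TEMPLATES` and `missing_slots` are names invented for this example. It also assumes that only anchors and relations must be supplied.

```python
# Hypothetical slot table; the per-structure slot names are an assumption.
TEMPLATES = {
    "1p": {"anchors": ["a"], "relations": ["r1"], "variables": []},
    "2p": {"anchors": ["a"], "relations": ["r1", "r2"], "variables": ["v1"]},
    "3p": {"anchors": ["a"], "relations": ["r1", "r2", "r3"], "variables": ["v1", "v2"]},
    "2i": {"anchors": ["a1", "a2"], "relations": ["r1", "r2"], "variables": []},
    "3i": {"anchors": ["a1", "a2", "a3"], "relations": ["r1", "r2", "r3"], "variables": []},
    "ip": {"anchors": ["a1", "a2"], "relations": ["r1", "r2", "r3"], "variables": ["v1"]},
    "pi": {"anchors": ["a1", "a2"], "relations": ["r1", "r2", "r3"], "variables": ["v1"]},
}


def missing_slots(structure, filled):
    """Return the anchor and relation slots the caller has not filled yet."""
    if structure not in TEMPLATES:
        raise ValueError(f"unknown query structure: {structure!r}")
    template = TEMPLATES[structure]
    required = template["anchors"] + template["relations"]
    return [slot for slot in required if not filled.get(slot)]
```

For instance, `missing_slots("2i", {"a1": "x", "r1": "y"})` reports that `a2` and `r2` are still open.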

	### COINs
	Community-Informed Graph Embeddings, the link-prediction / query-answering approach from PhD thesis section 3.1. Partitions the KG into communities via Leiden clustering, learns separate community-local and global embeddings, and combines them at scoring time. Reduces compute relative to full-graph methods on large KGs.

	### Leiden clustering
	A community-detection algorithm refining Louvain. The `leiden_resolution` parameter trades community count against community size; the configured per-dataset resolutions are stable across all COINs algorithms for that dataset.

	### Algorithm (COINs context)
	The embedding scoring family. Supported: TransE (translation), DistMult (bilinear), ComplEx (bilinear over complex embeddings), RotatE (rotation in complex space), Q2B (Query2Box, which embeds queries as hyper-rectangles), KBGAT (graph-attention message passing). Each algorithm declares which `query_structure`s it can answer.
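
For orientation, the standard triple-scoring functions of the four simplest families can be written in a few lines. This is a textbook-style sketch on plain Python lists, not the code the site runs; margins, norms and batching are omitted, and Q2B and KBGAT are left out because they need more machinery than fits here.

```python
import cmath
import math


def transe(h, r, t):
    """TransE: negative L2 distance between head + relation and tail."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def distmult(h, r, t):
    """DistMult: trilinear product with a diagonal relation matrix."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def complex_score(h, r, t):
    """ComplEx: real part of the trilinear product with the conjugated tail."""
    total = sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t))
    return complex(total).real


def rotate(h, phases, t):
    """RotatE: rotate each complex head coordinate by a phase, compare to tail."""
    return -sum(
        abs(hi * cmath.rect(1.0, phase) - ti)
        for hi, phase, ti in zip(h, phases, t)
    )
```

In all four, a higher score means a more plausible triple.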

	## Graph generation

	### MultiProxAn
	The graph-generation method from PhD thesis section 4.3. A discrete denoising diffusion model — DiGress-style — augmented with MultiProx, an outer loop over multiple noisy initializations sampled jointly. The Gibbs inner step refines the current graph against several samples, raising sample quality on small graphs (e.g. QM9 molecules).

	### DiGress
	The base discrete denoising diffusion architecture for graphs. The forward process noises a graph by randomly changing node and edge categories step by step; the model learns to reverse that process one step at a time.
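
A minimal sketch of a discrete forward process, assuming a uniform transition (each entry is resampled uniformly with probability `beta` per step). The transition the trained models actually use may differ, and `noise_step` and `forward_process` are illustrative names. Node types or flattened edge types can both be passed as `categories`.

```python
import random


def noise_step(categories, num_categories, beta, rng):
    """One forward step: resample each entry uniformly with probability beta."""
    return [
        rng.randrange(num_categories) if rng.random() < beta else c
        for c in categories
    ]


def forward_process(categories, num_categories, betas, seed=0):
    """Apply one noising step per beta and return the whole trajectory."""
    rng = random.Random(seed)
    trajectory = [list(categories)]
    for beta in betas:
        trajectory.append(noise_step(trajectory[-1], num_categories, beta, rng))
    return trajectory
```

With `beta = 0` the graph is untouched; with `beta = 1` every entry is replaced by a uniform draw, which is the pure-noise end of the chain.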

	### Sampling mode
	Either `standard` (one denoising chain to a single output) or `multiprox` (the outer Gibbs loop wraps several chains). MultiProx adds the parameters `n` (chains), `m` (Gibbs rounds), `t` and `t_prime` (intermediate timesteps), and `gibbs_chain_freq` (preview cadence).
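
The parameters could be grouped into a small config object along these lines. The field names come from the entry above; the `SamplingConfig` class, the `None` defaults and the rule that `multiprox` needs every parameter set are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplingConfig:
    mode: str = "standard"
    n: Optional[int] = None                 # chains
    m: Optional[int] = None                 # Gibbs rounds
    t: Optional[int] = None                 # intermediate timestep
    t_prime: Optional[int] = None           # intermediate timestep
    gibbs_chain_freq: Optional[int] = None  # preview cadence

    def validate(self):
        """Reject unknown modes and multiprox configs with unset parameters."""
        if self.mode not in ("standard", "multiprox"):
            raise ValueError(f"unknown sampling mode: {self.mode!r}")
        if self.mode == "multiprox":
            fields = ("n", "m", "t", "t_prime", "gibbs_chain_freq")
            missing = [f for f in fields if getattr(self, f) is None]
            if missing:
                raise ValueError(f"multiprox requires: {', '.join(missing)}")
        return self
```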

	### Discrete vs. continuous
	Two model variants per dataset. Discrete predicts categorical distributions over node/edge types directly; continuous predicts in a relaxed continuous space and rounds at the end. Checkpoints are named `{dataset}.ckpt` (discrete) and `{dataset}_c.ckpt` (continuous).
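
The naming rule is simple enough to state as a helper. `checkpoint_name` is a hypothetical function, and the `"qm9"` dataset key in the usage note is only an example.

```python
def checkpoint_name(dataset, continuous=False):
    """Build the checkpoint filename for a dataset and model variant."""
    suffix = "_c" if continuous else ""
    return f"{dataset}{suffix}.ckpt"
```

So `checkpoint_name("qm9")` gives `qm9.ckpt` and `checkpoint_name("qm9", continuous=True)` gives `qm9_c.ckpt`.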

	## KG anomaly correction

	### Subgraph
	A small (≤ 20-node) connected sample drawn from a COINs Loader's DFS context-subgraph partitioning. Used as input/output for the KG anomaly demo.

	### Task (kg-anomaly)
	Either `generate` (sample a fresh subgraph from noise) or `correct` (denoise a user-supplied subgraph back toward something the model considers plausible). Each `(dataset, task)` pair has its own checkpoint.

	### Bipartite vs. unipartite subgraph
	The DFS partitioner emits both: bipartite subgraphs split nodes into two halves with edges across, unipartite subgraphs are a single connected blob. The frontend renders them differently.

	## Inference protocol

	### Inference lock
	A single `threading.Lock` in `ModelRegistry`. Only one inference runs at a time across the whole process, because the free HF Spaces tier provides 2 vCPUs and no GPU; a busy server returns HTTP 429 (`INFERENCE_BUSY`). `/api/v1/debug/force-unlock` releases a stuck lock when `DEBUG=True`.
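
The busy-rejection behaviour can be sketched with a non-blocking acquire. `ModelRegistry` is the class named above, but the `run_exclusive` method and the `InferenceBusy` exception are hypothetical names for this sketch; the real registry may be structured differently.

```python
import threading


class InferenceBusy(Exception):
    """Raised when the lock is held; a web layer would map this to HTTP 429."""

    status_code = 429
    code = "INFERENCE_BUSY"


class ModelRegistry:
    def __init__(self):
        self._lock = threading.Lock()

    def run_exclusive(self, fn, *args, **kwargs):
        """Run fn while holding the lock, or fail fast if it is already held."""
        if not self._lock.acquire(blocking=False):
            raise InferenceBusy("another inference is already running")
        try:
            return fn(*args, **kwargs)
        finally:
            # Always release, even if inference raises.
            self._lock.release()
```

Failing fast instead of queueing keeps requests from piling up behind a long-running inference on a small CPU-only host.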

	### SSE (Server-Sent Events)
	The streaming-inference protocol the graph-generation and kg-anomaly endpoints use. Each event has a `type` (`progress`, `preview` or `result`) and a JSON payload. See [reference/sse-protocol.md](reference/sse-protocol.md).
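
A minimal sketch of writing and reading such events in the standard SSE wire format. It assumes the event type travels in the SSE `event:` field and the payload in a single `data:` line; the authoritative layout is in the linked reference, and the payload keys in the usage note are invented.

```python
import json

EVENT_TYPES = ("progress", "preview", "result")


def format_sse(event_type, payload):
    """Serialize one event; a blank line terminates it, as SSE requires."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type!r}")
    return f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"


def parse_sse(stream_text):
    """Split a buffered stream into (type, payload) pairs."""
    events = []
    for block in stream_text.split("\n\n"):
        event_type, data_lines = None, []
        for line in block.splitlines():
            if line.startswith("event:"):
                event_type = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if data_lines:
            events.append((event_type, json.loads("\n".join(data_lines))))
    return events
```

For example, `parse_sse(format_sse("progress", {"step": 1}))` yields `[("progress", {"step": 1})]`.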

	### Continuation token / state blob
	MultiProx sampling can pause between Gibbs rounds. The `result` event of a `/generate` or `/correct` call returns a base64-encoded state blob; the client posts that blob to the matching `/continue` endpoint to advance one more round.
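
The round trip can be illustrated with a toy blob. This sketch assumes the state is JSON-serializable and carries a `round` counter; the real blob format is not specified here and likely holds model tensors, and all function names are hypothetical.

```python
import base64
import json


def encode_state(state):
    """Pack a state dict into a base64 string safe to embed in a JSON event."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_state(blob):
    """Reverse encode_state; validate=True rejects non-base64 input."""
    return json.loads(base64.b64decode(blob, validate=True).decode("utf-8"))


def continue_round(blob):
    """Stand-in for a /continue handler: advance one round and re-encode."""
    state = decode_state(blob)
    state["round"] += 1
    return encode_state(state)
```

Because the blob carries everything needed to resume, the server does not have to keep per-client state between rounds.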

	### Inference lifecycle
	Boot-time: pre-warm checkpoints from HF Hub, scan checkpoint dirs, load lightweight COINs Loaders, generate sample subgraphs. First-request: lazy-load the relevant model weights into memory. See [explanation/inference-lifecycle.md](explanation/inference-lifecycle.md).

	## Deployment

	### HF Space
	A Hugging Face Spaces application running this repo's `Dockerfile`. The deployed URL is `https://bani57-website.hf.space`. The Space repo is `Bani57/website`.

	### HF Hub model repo
	`Bani57/checkpoints` — holds all PyTorch weights. Mirrors the on-disk layout under `CHECKPOINTS_ROOT` so `huggingface_hub.snapshot_download` populates files in their expected paths and the registry's scan logic finds them unchanged.

	### Persistent storage (HF Spaces)
	A paid `/data` volume that survives Space restarts. Free Spaces have a 50 GB ephemeral disk that resets on restart. Without persistent storage, every cold start re-downloads checkpoints from HF Hub.
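
A deployment could choose its checkpoint directory based on whether the persistent volume is mounted. The helper below is a hypothetical sketch: the `checkpoints` subdirectory and the `/tmp/checkpoints` fallback are assumptions, and how `CHECKPOINTS_ROOT` is actually set is not shown here.

```python
import os


def pick_checkpoint_root(persistent_dir="/data", fallback="/tmp/checkpoints"):
    """Prefer the persistent volume when it is mounted and writable."""
    if os.path.isdir(persistent_dir) and os.access(persistent_dir, os.W_OK):
        return os.path.join(persistent_dir, "checkpoints")
    # Ephemeral disk: contents are lost on restart and re-downloaded.
    return fallback
```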