Alex Latipov
Add clean HF snapshot deployment path
cfd076a

HF KG Endpoint Backend

This package now serves the Hugging Face Space as a raw SPARQL endpoint proxy over the locally booted Virtuoso instance.

It no longer serves as the challenge's Text2SPARQL generation API.

Active Routes

  • GET /health
  • GET /sparql/dbpedia
  • POST /sparql/dbpedia
  • GET /sparql/corporate
  • POST /sparql/corporate
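
As a rough illustration of how a client might call these routes, the sketch below builds (without sending) a GET or POST request for either dataset. It assumes the proxy passes the standard SPARQL 1.1 Protocol `query` parameter through to Virtuoso unchanged, which this README does not state explicitly; `build_sparql_request` is a hypothetical helper name, and the base URL is the example Space URL used later in this document.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Example Space URL from this README; replace with your own deployment.
BASE_URL = "https://insanalex-iris-at-text2sparql.hf.space"


def build_sparql_request(dataset: str, query: str, method: str = "GET") -> Request:
    """Build (but do not send) a request for /sparql/<dataset>.

    Assumption: the proxy accepts the standard `query` parameter,
    URL-encoded for GET and form-encoded for POST.
    """
    if dataset not in ("dbpedia", "corporate"):
        raise ValueError(f"unknown dataset: {dataset!r}")
    url = f"{BASE_URL}/sparql/{dataset}"
    headers = {"Accept": "application/sparql-results+json"}
    params = urlencode({"query": query})
    if method == "GET":
        return Request(f"{url}?{params}", headers=headers, method="GET")
    if method == "POST":
        headers["Content-Type"] = "application/x-www-form-urlencoded"
        return Request(url, data=params.encode("utf-8"), headers=headers, method="POST")
    raise ValueError(f"unsupported method: {method!r}")
```

A request built this way can be sent with `urllib.request.urlopen(req)`. POST is usually the safer choice for long queries, since some intermediaries limit URL length.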

Behavior

  • DBpedia proxy forwards to the local DBpedia Virtuoso endpoint restored from the snapshot bucket.
  • Corporate proxy forwards to the same local Virtuoso instance, but through the corporate graph-aware endpoint URL.
  • The Space supports two boot modes, selected through HF_BOOT_MODE:
    • legacy_snapshot: the original restore path, driven by the existing snapshot scripts
    • clean_snapshot: restores a separately prepared clean DBpedia snapshot from /data/dbpedia_snapshot_paper_clean, then loads the corporate graph and starts FastAPI
  • The clean path exists so DBpedia can be served from a snapshot known to preserve the explicit graph URI http://dbpedia.org.
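
The boot-mode selection described above could be sketched as follows. This is only an illustration in Python: the actual boot logic lives in the shell scripts listed below, and the function names, the step descriptions, and the choice to reject an unset or unknown HF_BOOT_MODE (rather than fall back to a default) are all assumptions made for this example.

```python
import os
from typing import List, Mapping, Optional

VALID_BOOT_MODES = ("legacy_snapshot", "clean_snapshot")
CLEAN_SNAPSHOT_DIR = "/data/dbpedia_snapshot_paper_clean"


def resolve_boot_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Read HF_BOOT_MODE and reject anything outside the two known modes.

    Assumption: no default is applied here, because this README does not
    say which mode is used when the variable is unset.
    """
    env = os.environ if env is None else env
    mode = env.get("HF_BOOT_MODE", "").strip()
    if mode not in VALID_BOOT_MODES:
        raise ValueError(f"HF_BOOT_MODE must be one of {VALID_BOOT_MODES}, got {mode!r}")
    return mode


def boot_plan(mode: str) -> List[str]:
    """Return a human-readable list of boot steps for the given mode."""
    if mode == "clean_snapshot":
        return [
            f"restore clean DBpedia snapshot from {CLEAN_SNAPSHOT_DIR}",
            "load corporate graph",
            "start FastAPI",
        ]
    if mode == "legacy_snapshot":
        # Only the restore step is documented for the legacy path.
        return ["run the existing legacy snapshot restore scripts"]
    raise ValueError(f"unknown boot mode: {mode!r}")
```

Failing fast on an unrecognized mode keeps a misconfigured Space from silently booting the wrong snapshot.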

Environment Variables

  • DBPEDIA_ENDPOINT_URL: optional override for the internal DBpedia upstream
  • CORPORATE_ENDPOINT_URL: optional override for the internal corporate upstream
  • CORPORATE_GRAPH_URI: optional override for the corporate graph URI
  • HF_BOOT_MODE: selects legacy_snapshot or clean_snapshot
  • PORT: optional FastAPI port (default 7860)
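
One way the application could read these variables is sketched below. The `Settings` class and `load_settings` function are hypothetical names, not taken from the package. Unset overrides are represented as `None` because this README does not give the built-in upstream URLs or the default graph URI; only the PORT default of 7860 comes from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    dbpedia_endpoint_url: Optional[str]    # None means "use the built-in upstream"
    corporate_endpoint_url: Optional[str]  # None means "use the built-in upstream"
    corporate_graph_uri: Optional[str]     # None means "use the built-in graph URI"
    boot_mode: Optional[str]               # legacy_snapshot or clean_snapshot
    port: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Collect the documented variables from the environment."""
    env = os.environ if env is None else env

    def optional(name: str) -> Optional[str]:
        # Treat empty or whitespace-only values the same as unset.
        value = env.get(name, "").strip()
        return value or None

    return Settings(
        dbpedia_endpoint_url=optional("DBPEDIA_ENDPOINT_URL"),
        corporate_endpoint_url=optional("CORPORATE_ENDPOINT_URL"),
        corporate_graph_uri=optional("CORPORATE_GRAPH_URI"),
        boot_mode=optional("HF_BOOT_MODE"),
        port=int(env.get("PORT", "7860")),
    )
```

Passing a plain dictionary as `env` makes the loader easy to exercise in tests without touching the real process environment.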

Clean Snapshot Preparation

These scripts are kept separate from the legacy path:

  • scripts/prepare_dbpedia_snapshot_from_container_clean.sh
    • checkpoints the working local Docker Virtuoso and copies its DB files into hf_upload/dbpedia_snapshot_paper_clean
  • scripts/upload_dbpedia_snapshot_clean_sync.sh
    • syncs that directory into the HF bucket path hf://buckets/InsanAlex/iris-at-text2sparql-storage/dbpedia_snapshot_paper_clean
  • scripts/hf_restore_db_snapshot_clean.sh
  • scripts/hf_space_boot_clean.sh

Optional offline RDF export helper:

  • scripts/export_dbpedia_graph_partitions.py
    • exports http://dbpedia.org into deterministic RDF partitions by MD5(subject) prefix
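
The idea of partitioning by MD5(subject) prefix can be illustrated with a minimal sketch. It is not the code of export_dbpedia_graph_partitions.py: the two-character hex prefix, the in-memory triple tuples, and the function names are assumptions chosen for the example.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def partition_key(subject_iri: str, prefix_len: int = 2) -> str:
    """Return the first `prefix_len` hex characters of MD5(subject)."""
    digest = hashlib.md5(subject_iri.encode("utf-8")).hexdigest()
    return digest[:prefix_len]


def partition_triples(triples: Iterable[Triple], prefix_len: int = 2) -> Dict[str, List[Triple]]:
    """Group triples by the MD5 prefix of their subject.

    All triples that share a subject land in the same partition, and both
    the partition keys and the triples inside each partition are sorted,
    so the result does not depend on input order.
    """
    buckets: Dict[str, List[Triple]] = {}
    for triple in triples:
        buckets.setdefault(partition_key(triple[0], prefix_len), []).append(triple)
    return {key: sorted(buckets[key]) for key in sorted(buckets)}
```

Because the key depends only on the subject IRI, re-running the export over the same graph yields the same partition assignment, which makes partial re-exports and diffs between runs straightforward.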

Expected Public URLs

If the Space URL is:

  • https://insanalex-iris-at-text2sparql.hf.space

then the public raw SPARQL endpoints are:

  • https://insanalex-iris-at-text2sparql.hf.space/sparql/dbpedia
  • https://insanalex-iris-at-text2sparql.hf.space/sparql/corporate

Notes

  • The old challenge-specific /text2sparql routes are no longer part of the active HF app.
  • The main repository still contains the repair pipeline code for local paper experiments; this deployment package now focuses only on stable KG access.