Alex Latipov committed on
Commit cfd076a · 1 Parent(s): 6314cdf

Add clean HF snapshot deployment path
Dockerfile CHANGED
@@ -24,4 +24,4 @@ COPY . ${APP_HOME}
 
 EXPOSE 7860
 
-CMD ["bash", "scripts/hf_space_boot.sh"]
+CMD ["bash", "scripts/hf_space_boot_dispatch.sh"]
deployment/hf_eval_backend/README.md CHANGED
@@ -16,19 +16,35 @@ It no longer acts as the active challenge Text2SPARQL generation API.
 
 - DBpedia proxy forwards to the local DBpedia Virtuoso endpoint restored from the snapshot bucket.
 - Corporate proxy forwards to the same local Virtuoso instance, but through the corporate graph-aware endpoint URL.
-- The startup flow is unchanged:
-  - restore DBpedia snapshot
-  - start Virtuoso
-  - load the corporate graph
-  - start FastAPI
+- The Space now supports two boot modes through `HF_BOOT_MODE`:
+  - `legacy_snapshot`: old restore path from the existing snapshot scripts
+  - `clean_snapshot`: restore a separately prepared clean DBpedia snapshot from `/data/dbpedia_snapshot_paper_clean`, then load corporate and start FastAPI
+- The clean path exists so DBpedia can be served from a snapshot known to preserve the explicit graph URI `http://dbpedia.org`.
 
 ## Environment Variables
 
 - `DBPEDIA_ENDPOINT_URL` optional override for the internal DBpedia upstream
 - `CORPORATE_ENDPOINT_URL` optional override for the internal corporate upstream
 - `CORPORATE_GRAPH_URI` optional corporate graph URI override
+- `HF_BOOT_MODE` chooses `legacy_snapshot` or `clean_snapshot`
 - `PORT` optional FastAPI port, default `7860`
 
+## Clean Snapshot Preparation
+
+These scripts are kept separate from the legacy path:
+
+- `scripts/prepare_dbpedia_snapshot_from_container_clean.sh`
+  - checkpoints the working local Docker Virtuoso and copies its DB files into `hf_upload/dbpedia_snapshot_paper_clean`
+- `scripts/upload_dbpedia_snapshot_clean_sync.sh`
+  - syncs that directory into the HF bucket path `hf://buckets/InsanAlex/iris-at-text2sparql-storage/dbpedia_snapshot_paper_clean`
+- `scripts/hf_restore_db_snapshot_clean.sh`
+- `scripts/hf_space_boot_clean.sh`
+
+Optional offline RDF export helper:
+
+- `scripts/export_dbpedia_graph_partitions.py`
+  - exports `http://dbpedia.org` into deterministic RDF partitions by MD5(subject) prefix
+
 ## Expected Public URLs
 
 If the Space URL is:
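The deterministic MD5(subject)-prefix partitioning mentioned above can be sketched in Python. `partition_of` is an illustrative helper, not part of the repo, but the key computation mirrors the SPARQL filter `SUBSTR(MD5(STR(?s)), 1, 2)` used by the export script:

```python
import hashlib


def partition_of(subject_uri: str) -> str:
    """Two-hex-digit partition key: the first two hex characters of MD5(subject)."""
    return hashlib.md5(subject_uri.encode("utf-8")).hexdigest()[:2]


# Every subject URI maps to exactly one of the 256 partitions "00".."ff",
# so re-running the export always routes a triple to the same file.
key = partition_of("http://dbpedia.org/resource/Angela_Merkel")
```

Because the key depends only on the subject URI, an interrupted export can be resumed per partition without risking duplicates across files.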
scripts/export_dbpedia_graph_partitions.py ADDED
@@ -0,0 +1,87 @@
+#!/usr/bin/env python3
+from __future__ import annotations
+
+import argparse
+import itertools
+from pathlib import Path
+
+import requests
+
+
+DEFAULT_ENDPOINT = "http://127.0.0.1:8890/sparql"
+DEFAULT_GRAPH = "http://dbpedia.org"
+
+
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(
+        description="Export a Virtuoso graph into deterministic RDF partitions by MD5(subject) prefix."
+    )
+    parser.add_argument("--endpoint", default=DEFAULT_ENDPOINT)
+    parser.add_argument("--graph-uri", default=DEFAULT_GRAPH)
+    parser.add_argument("--output-dir", required=True)
+    parser.add_argument(
+        "--prefixes",
+        nargs="*",
+        help="Hex prefixes to export, e.g. 00 01 ff. Default is all 256 two-hex-digit prefixes.",
+    )
+    parser.add_argument("--overwrite", action="store_true")
+    parser.add_argument("--timeout-sec", type=int, default=1800)
+    return parser.parse_args()
+
+
+def all_prefixes() -> list[str]:
+    return [a + b for a, b in itertools.product("0123456789abcdef", repeat=2)]
+
+
+def build_query(graph_uri: str, prefix: str) -> str:
+    return f"""
+CONSTRUCT {{ ?s ?p ?o }}
+WHERE {{
+  GRAPH <{graph_uri}> {{
+    ?s ?p ?o .
+    FILTER (SUBSTR(MD5(STR(?s)), 1, 2) = "{prefix}")
+  }}
+}}
+""".strip()
+
+
+def export_partition(
+    endpoint: str,
+    graph_uri: str,
+    prefix: str,
+    output_path: Path,
+    timeout_sec: int,
+) -> None:
+    params = {
+        "query": build_query(graph_uri, prefix),
+        "format": "text/plain",
+    }
+    with requests.get(endpoint, params=params, stream=True, timeout=timeout_sec) as response:
+        response.raise_for_status()
+        with output_path.open("wb") as handle:
+            for chunk in response.iter_content(chunk_size=1 << 20):
+                if chunk:
+                    handle.write(chunk)
+
+
+def main() -> int:
+    args = parse_args()
+    output_dir = Path(args.output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    prefixes = args.prefixes or all_prefixes()
+    for prefix in prefixes:
+        if len(prefix) != 2 or any(ch not in "0123456789abcdefABCDEF" for ch in prefix):
+            raise ValueError(f"Invalid prefix: {prefix}")
+        prefix = prefix.lower()
+        output_path = output_dir / f"dbpedia_graph_{prefix}.nt"
+        if output_path.exists() and output_path.stat().st_size > 0 and not args.overwrite:
+            print(f"[skip] {output_path}")
+            continue
+        print(f"[export] {prefix} -> {output_path}")
+        export_partition(args.endpoint, args.graph_uri, prefix, output_path, args.timeout_sec)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
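Since the exporter skips non-empty existing files, a resumed run can be sanity-checked afterwards. A minimal sketch, assuming the `dbpedia_graph_{prefix}.nt` naming pattern from the script above (`missing_partitions` is an illustrative helper, not part of the repo):

```python
import itertools
from pathlib import Path


def missing_partitions(output_dir: str) -> list[str]:
    """Return the two-hex-digit prefixes that still lack a non-empty partition file."""
    expected = [a + b for a, b in itertools.product("0123456789abcdef", repeat=2)]
    out = Path(output_dir)
    missing = []
    for prefix in expected:
        path = out / f"dbpedia_graph_{prefix}.nt"
        if not path.exists() or path.stat().st_size == 0:
            missing.append(prefix)
    return missing
```

An empty return value means all 256 partitions exist and contain data, so the export directory is plausibly complete (modulo genuinely empty partitions, which this check cannot distinguish from failed ones).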
scripts/hf_restore_db_snapshot_clean.sh ADDED
@@ -0,0 +1,39 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+SNAPSHOT_DIR="${VIRTUOSO_SNAPSHOT_DIR:-/data/dbpedia_snapshot_paper_clean}"
+RUNTIME_DIR="${VIRTUOSO_RUNTIME_DIR:-/tmp/virtuoso_runtime}"
+RUNTIME_DB_DIR="$RUNTIME_DIR/database"
+
+if [[ ! -d "$SNAPSHOT_DIR" ]]; then
+  echo "Missing clean Virtuoso snapshot directory: $SNAPSHOT_DIR"
+  echo "Upload the prepared clean DB snapshot there before starting the Space."
+  exit 1
+fi
+
+mkdir -p "$RUNTIME_DB_DIR"
+
+echo "Restoring clean Virtuoso snapshot from $SNAPSHOT_DIR to $RUNTIME_DB_DIR ..."
+find "$RUNTIME_DB_DIR" -maxdepth 1 -type f -name 'virtuoso*' -delete
+
+for name in \
+  virtuoso.db \
+  virtuoso-temp.db \
+  virtuoso.pxa \
+  virtuoso.key \
+  virtuoso.crt; do
+  if [[ -f "$SNAPSHOT_DIR/$name" ]]; then
+    cp -f "$SNAPSHOT_DIR/$name" "$RUNTIME_DB_DIR/$name"
+  fi
+done
+
+rm -f \
+  "$RUNTIME_DB_DIR/virtuoso.trx" \
+  "$RUNTIME_DB_DIR/virtuoso-temp.trx"
+
+if [[ ! -f "$RUNTIME_DB_DIR/virtuoso.db" ]]; then
+  echo "Clean snapshot restore failed: virtuoso.db not found in $RUNTIME_DB_DIR"
+  exit 1
+fi
+
+echo "Clean Virtuoso snapshot restored."
scripts/hf_space_boot_clean.sh ADDED
@@ -0,0 +1,32 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+cd /app
+export PYTHONPATH=/app:${PYTHONPATH:-}
+
+export CORPORATE_GRAPH_URI="${CORPORATE_GRAPH_URI:-http://ld.company.org/prod}"
+export DBPEDIA_GRAPH_URI="${DBPEDIA_GRAPH_URI:-http://dbpedia.org}"
+export DBPEDIA_ENDPOINT_URL="${DBPEDIA_ENDPOINT_URL:-http://127.0.0.1:8890/sparql?default-graph-uri=${DBPEDIA_GRAPH_URI}}"
+export CORPORATE_ENDPOINT_URL="${CORPORATE_ENDPOINT_URL:-http://127.0.0.1:8890/sparql?default-graph-uri=${DBPEDIA_GRAPH_URI}&default-graph-uri=${CORPORATE_GRAPH_URI}}"
+export PORT="${PORT:-7860}"
+
+bash scripts/hf_restore_db_snapshot_clean.sh
+bash scripts/hf_prepare_virtuoso_ini.sh
+bash scripts/hf_start_virtuoso.sh
+
+echo "Verifying DBpedia graph availability ..."
+DBPEDIA_VERIFY_RESPONSE="$(
+  curl -fsG \
+    --data-urlencode "query=ASK WHERE { <http://dbpedia.org/resource/Angela_Merkel> ?p ?o }" \
+    --data-urlencode "format=application/sparql-results+json" \
+    "${DBPEDIA_ENDPOINT_URL}" || true
+)"
+if [[ "${DBPEDIA_VERIFY_RESPONSE}" != *'"boolean": true'* && "${DBPEDIA_VERIFY_RESPONSE}" != *'"boolean":true'* ]]; then
+  echo "DBpedia verification failed."
+  echo "Response: ${DBPEDIA_VERIFY_RESPONSE}"
+  exit 1
+fi
+
+bash scripts/hf_load_corporate_graph.sh
+
+exec uvicorn service.app:app --host 0.0.0.0 --port "$PORT"
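The ASK-based verification in `hf_space_boot_clean.sh` matches the JSON body as a substring, which is why it has to accept both `"boolean": true` and `"boolean":true`. The same check can be expressed more robustly by parsing the SPARQL JSON results format; `graph_is_populated` below is an illustrative helper, not part of the repo:

```python
import json


def graph_is_populated(ask_response_body: str) -> bool:
    """Parse a SPARQL ASK result (application/sparql-results+json) and return
    its 'boolean' field; any parse failure counts as 'graph not available'."""
    try:
        return json.loads(ask_response_body).get("boolean") is True
    except (json.JSONDecodeError, AttributeError):
        return False


# The shell script's substring test is whitespace-sensitive; JSON parsing
# handles either spacing transparently:
graph_is_populated('{"head": {}, "boolean": true}')  # True
graph_is_populated('{"head": {}, "boolean":false}')  # False
```

Treating an unparsable or empty body as a failure matches the shell script's behavior, since `curl ... || true` leaves `DBPEDIA_VERIFY_RESPONSE` empty when the endpoint is down.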
scripts/hf_space_boot_dispatch.sh ADDED
@@ -0,0 +1,18 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+BOOT_MODE="${HF_BOOT_MODE:-legacy_snapshot}"
+
+case "$BOOT_MODE" in
+  legacy_snapshot)
+    exec bash scripts/hf_space_boot.sh
+    ;;
+  clean_snapshot)
+    exec bash scripts/hf_space_boot_clean.sh
+    ;;
+  *)
+    echo "Unknown HF_BOOT_MODE: $BOOT_MODE"
+    echo "Supported values: legacy_snapshot, clean_snapshot"
+    exit 1
+    ;;
+esac
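The dispatch above is a plain mode-to-script table with a loud failure on unknown values. In Python terms (script paths taken from the shell script; `resolve_boot_script` is an illustrative name):

```python
# Boot scripts per HF_BOOT_MODE, mirroring the case statement in
# scripts/hf_space_boot_dispatch.sh.
BOOT_SCRIPTS = {
    "legacy_snapshot": "scripts/hf_space_boot.sh",
    "clean_snapshot": "scripts/hf_space_boot_clean.sh",
}


def resolve_boot_script(mode: str) -> str:
    """Map a boot mode to its boot script, failing on unknown modes."""
    try:
        return BOOT_SCRIPTS[mode]
    except KeyError:
        raise ValueError(
            f"Unknown HF_BOOT_MODE: {mode}. "
            f"Supported values: {', '.join(BOOT_SCRIPTS)}"
        ) from None
```

Keeping the dispatch in a separate entry point means the Dockerfile `CMD` never changes when a new boot mode is added; only the table grows.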
scripts/prepare_dbpedia_snapshot_from_container_clean.sh ADDED
@@ -0,0 +1,36 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+CONTAINER_NAME="${1:-dbpedia-virtuoso}"
+SNAPSHOT_DIR="${2:-/home/alatipov/iris-at-text2sparql/hf_upload/dbpedia_snapshot_paper_clean}"
+CONTAINER_DB_DIR="${CONTAINER_DB_DIR:-/database}"
+
+export DOCKER_HOST="${DOCKER_HOST:-unix:///run/user/$(id -u)/docker.sock}"
+
+FILES=(
+  "virtuoso.db"
+  "virtuoso-temp.db"
+  "virtuoso.trx"
+  "virtuoso.pxa"
+  "virtuoso.key"
+  "virtuoso.crt"
+)
+
+docker inspect "$CONTAINER_NAME" >/dev/null
+
+echo "Checkpointing Virtuoso in container ${CONTAINER_NAME} ..."
+docker exec "$CONTAINER_NAME" sh -lc "cat >/tmp/checkpoint.sql <<'SQL'
+checkpoint;
+SQL
+/opt/virtuoso-opensource/bin/isql 1111 dba dba < /tmp/checkpoint.sql >/tmp/checkpoint.log 2>&1"
+
+mkdir -p "$SNAPSHOT_DIR"
+rm -f "$SNAPSHOT_DIR"/virtuoso*
+
+echo "Copying snapshot files from ${CONTAINER_NAME}:${CONTAINER_DB_DIR} to ${SNAPSHOT_DIR} ..."
+for name in "${FILES[@]}"; do
+  docker cp "${CONTAINER_NAME}:${CONTAINER_DB_DIR}/${name}" "$SNAPSHOT_DIR/$name"
+done
+
+echo "Clean snapshot directory ready:"
+ls -lh "$SNAPSHOT_DIR"
scripts/upload_dbpedia_snapshot_clean_sync.sh ADDED
@@ -0,0 +1,13 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+LOCAL_DIR="${1:-/home/alatipov/iris-at-text2sparql/hf_upload/dbpedia_snapshot_paper_clean}"
+BUCKET_URI="${2:-hf://buckets/InsanAlex/iris-at-text2sparql-storage/dbpedia_snapshot_paper_clean}"
+
+if [[ ! -d "$LOCAL_DIR" ]]; then
+  echo "Local clean snapshot directory not found: $LOCAL_DIR"
+  exit 1
+fi
+
+echo "Syncing $LOCAL_DIR -> $BUCKET_URI"
+hf buckets sync "$LOCAL_DIR" "$BUCKET_URI"