cjc0013 committed
Commit 2ecbdee · verified · 1 Parent(s): c350513

Upload hf_app_updated.py

Files changed (1)
  1. hf_app_updated.py +1490 -0
hf_app_updated.py ADDED
@@ -0,0 +1,1490 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
HF Space app for browsing/searching a big SQLite corpus built by build_corpus_sqlite.py.

Goal:
- SIMPLE UI: type -> search -> pick -> open
- Advanced knobs hidden unless you open "Advanced"
- No "runs" UI (no run picking, no runs tab)

What it does:
- Loads corpus.sqlite (read-only)
- FTS keyword search (chunks_fts)
- Browse clusters across ALL runs (cluster_summary)
- Open a uid -> show full text + context window (order_index +/- k within the same run_id)
- Optional: load deterministic method-sanitized signal cards (jsonl/csv) and open linked chunks

How it finds the DB:
  1) If CORPUS_SQLITE_PATH is set, uses that
  2) Else tries common local paths (./data/corpus.sqlite, ./dataset/corpus.sqlite, /data/corpus.sqlite, ./corpus.sqlite)
  3) Else downloads from a dataset repo using huggingface_hub (set DATASET_REPO_ID and optional DATASET_FILENAME)

How it finds optional signal cards:
  1) If METHOD_SIGNALS_PATH is set, uses that
  2) Else tries common local paths:
       ./data/public_method_sanitized_topN.jsonl
       ./data/public_top_signals.jsonl
       ./dataset/public_method_sanitized_topN.jsonl
       ./dataset/public_top_signals.jsonl
       /data/public_method_sanitized_topN.jsonl
       /data/public_top_signals.jsonl
     (and csv variants)
  3) Else optional download via METHOD_SIGNALS_DATASET_REPO_ID + METHOD_SIGNALS_FILENAME

Env vars you can set in the Space:
- CORPUS_SQLITE_PATH : absolute/relative path to the sqlite file if it already exists in the container
- DATASET_REPO_ID    : like "yourname/your-dataset-repo" (repo_type=dataset)
                       (also accepts a full HF URL; we'll extract repo_id)
- DATASET_FILENAME   : default "corpus.sqlite"
- DB_LOCAL_DIR       : default "./data" (where downloaded DB will be copied)

Optional signals env vars:
- METHOD_SIGNALS_PATH : absolute/relative path to jsonl/csv signal file
- METHOD_SIGNALS_DATASET_REPO_ID : optional dataset repo for signals file
- METHOD_SIGNALS_FILENAME : default "public_method_sanitized_topN.jsonl"
- METHOD_SIGNALS_LOCAL_DIR : default "./data"

Notes:
- Opens sqlite in read-only mode
- Uses thread-local sqlite connections (safer with Gradio)
"""

from __future__ import annotations

import csv
import json
import os
import re
import shutil
import sqlite3
import threading
import traceback
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
from urllib.parse import quote, urlparse

import gradio as gr

try:
    from huggingface_hub import hf_hub_download
except Exception:
    hf_hub_download = None  # type: ignore


APP_VERSION = "2026-02-11_app_g_signals"


# -----------------------------
# Env helpers (strip hidden whitespace/newlines)
# -----------------------------
def _clean_env_value(v: str) -> str:
    if v is None:
        return ""
    s = str(v)
    s = s.replace("\r", "").replace("\n", "").replace("\t", " ")
    s = s.strip()
    s = "".join(ch for ch in s if ch.isprintable())
    return s


def _env(name: str, default: str = "") -> str:
    v = os.environ.get(name)
    if v is None:
        return default
    vv = _clean_env_value(v)
    return vv if vv else default


def _parse_dataset_ref(repo_like: str) -> Tuple[str, Optional[str]]:
    """
    Accept either:
      - "user/repo"
      - "https://huggingface.co/datasets/user/repo/blob/main/file.ext"

    Returns: (repo_id, inferred_filename_or_None)
    """
    s = _clean_env_value(repo_like)
    if not s:
        return "", None

    if s.startswith("http://") or s.startswith("https://"):
        u = urlparse(s)
        p = (u.path or "").strip("/")
        parts = p.split("/")

        if len(parts) >= 3 and parts[0] == "datasets":
            repo_id = f"{parts[1]}/{parts[2]}"
            inferred_file: Optional[str] = None

            if "blob" in parts:
                try:
                    i = parts.index("blob")
                    if i + 2 < len(parts):
                        inferred_file = "/".join(parts[i + 2 :])
                except Exception:
                    inferred_file = None

            return repo_id, inferred_file

    if any(ch.isspace() for ch in s):
        s = "".join(s.split())

    return s, None
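
# Illustrative examples of the ref parsing above (values are hypothetical):
#   _parse_dataset_ref("https://huggingface.co/datasets/user/repo/blob/main/corpus.sqlite")
#     -> ("user/repo", "corpus.sqlite")   # repo_id plus the path after blob/<branch>
#   _parse_dataset_ref("user/repo")
#     -> ("user/repo", None)              # plain repo ids pass through unchanged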


# -----------------------------
# Gradio compat shim (Dataframe args differ by version)
# -----------------------------
_UNEXPECTED_KW_RE = re.compile(r"unexpected keyword argument '([^']+)'")


def _df(**kwargs):
    """
    Build gr.Dataframe in a way that survives Gradio version drift.
    If a kwarg isn't supported, drop it and retry.
    """
    k = dict(kwargs)
    for _ in range(32):
        try:
            return gr.Dataframe(**k)
        except TypeError as e:
            msg = str(e)
            m = _UNEXPECTED_KW_RE.search(msg)
            if m:
                bad = m.group(1)
                if bad in k:
                    k.pop(bad, None)
                    continue
            dropped = False
            for bad in ("max_rows", "wrap"):
                if bad in k:
                    k.pop(bad, None)
                    dropped = True
                    break
            if dropped:
                continue
            raise
    return gr.Dataframe(**k)
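
# Sketch of the shim in action (exact behavior depends on your Gradio build):
# on a version whose Dataframe does not accept `wrap`, a call like
#   _df(value=[["a"]], wrap=True)
# catches the TypeError, pops `wrap` (via the regex or the fallback tuple),
# and retries as gr.Dataframe(value=[["a"]]) instead of crashing the UI build.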


# -----------------------------
# DB location / download
# -----------------------------
def _candidate_paths() -> List[Path]:
    p0 = _env("CORPUS_SQLITE_PATH", "")
    cands = [
        Path(p0).expanduser() if p0 else None,
        Path("./data/corpus.sqlite"),
        Path("./dataset/corpus.sqlite"),
        Path("/data/corpus.sqlite"),
        Path("./corpus.sqlite"),
        Path("./data/corpus.db"),
        Path("./dataset/corpus.db"),
        Path("/data/corpus.db"),
    ]
    out: List[Path] = []
    for p in cands:
        if p is None:
            continue
        try:
            out.append(p.resolve())
        except Exception:
            out.append(p)
    return out


def ensure_db_file() -> Path:
    for p in _candidate_paths():
        if p.exists() and p.is_file():
            print(f"[db] using local file: {p}")
            return p

    ds_repo_raw = _env("DATASET_REPO_ID", "")
    ds_repo, inferred_file = _parse_dataset_ref(ds_repo_raw)

    ds_file_raw = _env("DATASET_FILENAME", "corpus.sqlite")
    ds_file = _clean_env_value(ds_file_raw)

    if inferred_file and (not os.environ.get("DATASET_FILENAME") or not ds_file):
        ds_file = inferred_file

    ds_repo = _clean_env_value(ds_repo)
    ds_file = _clean_env_value(ds_file)

    local_dir = Path(_env("DB_LOCAL_DIR", "./data")).expanduser().resolve()
    local_dir.mkdir(parents=True, exist_ok=True)
    target = (local_dir / (ds_file if ds_file else "corpus.sqlite")).resolve()

    print(f"[db] DATASET_REPO_ID={ds_repo!r}")
    print(f"[db] DATASET_FILENAME={ds_file!r}")
    print(f"[db] DB_LOCAL_DIR={str(local_dir)!r}")

    if ds_repo:
        if hf_hub_download is None:
            raise RuntimeError("DATASET_REPO_ID is set, but huggingface_hub is not installed. Add it to requirements.txt.")

        if not ds_file:
            ds_file = "corpus.sqlite"

        cached_path = hf_hub_download(
            repo_id=ds_repo,
            filename=ds_file,
            repo_type="dataset",
        )
        cached_path = str(cached_path)

        try:
            src = Path(cached_path).resolve()
            if target.exists():
                try:
                    if target.stat().st_size == src.stat().st_size:
                        print(f"[db] target already present (same size), using: {target}")
                        return target
                except Exception:
                    pass

            shutil.copy2(str(src), str(target))
            print(f"[db] downloaded -> {target}")
            return target
        except Exception as e:
            print(f"[db] copy to local_dir failed, using cache path instead: {cached_path} ({e})")
            return Path(cached_path).resolve()

    raise RuntimeError(
        "Could not find corpus sqlite file.\n"
        "Fix: set CORPUS_SQLITE_PATH or set DATASET_REPO_ID (and make sure the dataset has corpus.sqlite)."
    )
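
# Resolution order above mirrors the module docstring: explicit CORPUS_SQLITE_PATH,
# then the common local paths, then hf_hub_download. A typical Space configuration
# (illustrative values, not defaults) would be:
#   DATASET_REPO_ID=user/my-corpus
#   DATASET_FILENAME=corpus.sqlite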


# -----------------------------
# Optional method-sanitized signals file location / download
# -----------------------------
_SIGNAL_BASENAMES = [
    "public_method_sanitized_topN.jsonl",
    "public_top_signals.jsonl",
    "public_method_sanitized_topN.csv",
    "public_top_signals.csv",
]


def _candidate_signal_paths() -> List[Path]:
    p0 = _env("METHOD_SIGNALS_PATH", "")
    cands: List[Optional[Path]] = [
        Path(p0).expanduser() if p0 else None,
    ]

    for base in _SIGNAL_BASENAMES:
        cands.extend(
            [
                Path("./data") / base,
                Path("./dataset") / base,
                Path("/data") / base,
                Path(".") / base,
            ]
        )

    out: List[Path] = []
    for p in cands:
        if p is None:
            continue
        try:
            out.append(p.resolve())
        except Exception:
            out.append(p)
    return out


def ensure_signals_file() -> Optional[Path]:
    # local first
    for p in _candidate_signal_paths():
        if p.exists() and p.is_file():
            print(f"[signals] using local file: {p}")
            return p

    # optional dataset download
    ds_repo_raw = _env("METHOD_SIGNALS_DATASET_REPO_ID", "")
    ds_repo, inferred_file = _parse_dataset_ref(ds_repo_raw)
    ds_repo = _clean_env_value(ds_repo)

    ds_file_raw = _env("METHOD_SIGNALS_FILENAME", "public_method_sanitized_topN.jsonl")
    ds_file = _clean_env_value(ds_file_raw)
    if inferred_file and (not os.environ.get("METHOD_SIGNALS_FILENAME") or not ds_file):
        ds_file = inferred_file
    ds_file = _clean_env_value(ds_file)

    if not ds_repo:
        return None

    if hf_hub_download is None:
        print("[signals] METHOD_SIGNALS_DATASET_REPO_ID set but huggingface_hub not installed.")
        return None

    local_dir = Path(_env("METHOD_SIGNALS_LOCAL_DIR", "./data")).expanduser().resolve()
    local_dir.mkdir(parents=True, exist_ok=True)
    target = (local_dir / (ds_file if ds_file else "public_method_sanitized_topN.jsonl")).resolve()

    print(f"[signals] METHOD_SIGNALS_DATASET_REPO_ID={ds_repo!r}")
    print(f"[signals] METHOD_SIGNALS_FILENAME={ds_file!r}")
    print(f"[signals] METHOD_SIGNALS_LOCAL_DIR={str(local_dir)!r}")

    try:
        cached_path = hf_hub_download(
            repo_id=ds_repo,
            filename=ds_file,
            repo_type="dataset",
        )
        cached_path = str(cached_path)
    except Exception as e:
        print(f"[signals] download failed: {e}")
        return None

    try:
        src = Path(cached_path).resolve()
        if target.exists():
            try:
                if target.stat().st_size == src.stat().st_size:
                    print(f"[signals] target already present (same size), using: {target}")
                    return target
            except Exception:
                pass

        shutil.copy2(str(src), str(target))
        print(f"[signals] downloaded -> {target}")
        return target
    except Exception as e:
        print(f"[signals] copy to local_dir failed, using cache path instead: {cached_path} ({e})")
        return Path(cached_path).resolve()


DB_PATH = ensure_db_file()
SIGNALS_PATH = ensure_signals_file()


# -----------------------------
# SQLite connection (thread-local)
# -----------------------------
_tls = threading.local()


def _connect_readonly(db_path: Path) -> sqlite3.Connection:
    uri_path = quote(db_path.as_posix(), safe="/:")
    uri = f"file:{uri_path}?mode=ro"
    conn = sqlite3.connect(uri, uri=True, check_same_thread=False)
    conn.row_factory = sqlite3.Row

    try:
        conn.execute("PRAGMA query_only=ON;")
    except Exception:
        pass
    try:
        conn.execute("PRAGMA temp_store=MEMORY;")
    except Exception:
        pass
    try:
        conn.execute("PRAGMA cache_size=-100000;")
    except Exception:
        pass

    return conn
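
# The URI form built above ("file:/path/to/corpus.sqlite?mode=ro") is what makes
# the connection read-only; sqlite3 only interprets it as a URI because uri=True
# is passed. The PRAGMA query_only=ON is a best-effort second guard on top of it.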


def get_conn() -> sqlite3.Connection:
    c = getattr(_tls, "conn", None)
    if c is None:
        _tls.conn = _connect_readonly(DB_PATH)
        c = _tls.conn
    return c


# -----------------------------
# Query helpers
# -----------------------------
def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    cur = conn.cursor()
    cur.execute("SELECT 1 FROM sqlite_master WHERE type IN ('table','view') AND name=? LIMIT 1;", (name,))
    ok = cur.fetchone() is not None
    cur.close()
    return ok


def normalize_fts_query(q: str) -> str:
    q = (q or "").strip()
    if not q:
        return ""

    ops = ['"', "*", " OR ", " AND ", " NOT ", " NEAR", "(", ")", ":"]
    # The padded, upper-cased copy makes the operator check case-insensitive;
    # symbol operators are unaffected by upper-casing, so one check covers both.
    q_up = f" {q.upper()} "
    if any(op in q_up for op in ops):
        return q

    toks = []
    for t in q.replace("\n", " ").replace("\t", " ").split(" "):
        t = t.strip()
        if not t:
            continue
        t = t.strip(".,;!?[]{}<>")
        if t:
            toks.append(t)
    if not toks:
        return q
    return " AND ".join(toks)


def fetch_meta() -> List[List[Any]]:
    conn = get_conn()
    # Fail soft if this corpus build has no meta table (build_ui calls this at startup).
    if not table_exists(conn, "meta"):
        return [["k", "v"]]
    cur = conn.cursor()
    cur.execute("SELECT k, v FROM meta ORDER BY k;")
    rows = cur.fetchall()
    cur.close()
    out = [["k", "v"]]
    for r in rows:
        out.append([r["k"], r["v"]])
    return out


def fts_search(query: str, cluster_id: str, limit: int) -> List[List[Any]]:
    conn = get_conn()
    if not table_exists(conn, "chunks_fts"):
        return [["error"], ["FTS table (chunks_fts) not found in this DB."]]

    qn = normalize_fts_query(query)
    if not qn:
        return [["error"], ["empty query"]]

    cluster_id = (cluster_id or "").strip()
    limit = int(limit) if limit else 50
    limit = max(1, min(500, limit))

    where = ["(chunks_fts MATCH ?)"]
    params: List[Any] = [qn]

    if cluster_id:
        where.append("c.cluster_id = ?")
        try:
            params.append(int(float(cluster_id)))
        except Exception:
            return [["error"], [f"cluster_id must be an int (got {cluster_id!r})"]]

    where_sql = " AND ".join(where)

    sql_bm25 = f"""
        SELECT
            c.uid,
            c.cluster_id,
            c.order_index,
            c.doc_id,
            c.source_file,
            c.cluster_prob,
            CASE
                WHEN length(c.text) > 220 THEN substr(c.text, 1, 220) || '…'
                ELSE c.text
            END AS preview
        FROM chunks_fts
        JOIN chunks c ON c.uid = chunks_fts.uid
        WHERE {where_sql}
        ORDER BY bm25(chunks_fts)
        LIMIT ?;
    """

    sql_fallback = f"""
        SELECT
            c.uid,
            c.cluster_id,
            c.order_index,
            c.doc_id,
            c.source_file,
            c.cluster_prob,
            CASE
                WHEN length(c.text) > 220 THEN substr(c.text, 1, 220) || '…'
                ELSE c.text
            END AS preview
        FROM chunks_fts
        JOIN chunks c ON c.uid = chunks_fts.uid
        WHERE {where_sql}
        LIMIT ?;
    """

    params2 = params + [limit]

    cur = conn.cursor()
    headers = ["uid", "cluster_id", "order_index", "doc_id", "source_file", "cluster_prob", "preview"]
    out = [headers]

    try:
        cur.execute(sql_bm25, params2)
    except Exception:
        cur.execute(sql_fallback, params2)

    rows = cur.fetchall()
    cur.close()

    for r in rows:
        out.append([r["uid"], r["cluster_id"], r["order_index"], r["doc_id"], r["source_file"], r["cluster_prob"], r["preview"]])
    return out


def get_chunk_by_uid(uid: str) -> Optional[Dict[str, Any]]:
    conn = get_conn()
    cur = conn.cursor()
    cur.execute(
        """
        SELECT uid, run_id, chunk_id, order_index, doc_id, source_file, cluster_id, cluster_prob, bm25_density,
               idf_mass, token_count, unique_token_count, text
        FROM chunks
        WHERE uid=?
        LIMIT 1;
        """,
        (uid,),
    )
    r = cur.fetchone()
    cur.close()
    if not r:
        return None
    return dict(r)


def get_context(run_id: str, order_index: int, window: int) -> List[Dict[str, Any]]:
    conn = get_conn()
    lo = int(order_index) - int(window)
    hi = int(order_index) + int(window)
    cur = conn.cursor()
    cur.execute(
        """
        SELECT uid, order_index, doc_id, source_file, cluster_id, cluster_prob,
               CASE WHEN length(text) > 220 THEN substr(text, 1, 220) || '…' ELSE text END AS preview
        FROM chunks
        WHERE run_id=? AND order_index BETWEEN ? AND ?
        ORDER BY order_index;
        """,
        (run_id, lo, hi),
    )
    rows = cur.fetchall()
    cur.close()
    return [dict(x) for x in rows]
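
# Example: get_context("runA", order_index=10, window=3) returns the chunks of
# run "runA" with order_index 7..13 (inclusive), i.e. the picked chunk plus up
# to `window` neighbors on each side. ("runA" is an illustrative run_id.)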


def fetch_cluster_summary_all(top_n: int) -> List[List[Any]]:
    conn = get_conn()
    if not table_exists(conn, "cluster_summary"):
        return [["error"], ["cluster_summary not found in this DB."]]

    top_n = int(top_n) if top_n else 200
    top_n = max(1, min(2000, top_n))

    cur = conn.cursor()
    cur.execute(
        """
        SELECT run_id, cluster_id, n_chunks, prob_avg, bm25_density_avg, idf_mass_avg, token_count_avg
        FROM cluster_summary
        ORDER BY n_chunks DESC
        LIMIT ?;
        """,
        (top_n,),
    )
    rows = cur.fetchall()
    cur.close()

    out = [["run_id", "cluster_id", "n_chunks", "prob_avg", "bm25_density_avg", "idf_mass_avg", "token_count_avg"]]
    for r in rows:
        out.append([r["run_id"], r["cluster_id"], r["n_chunks"], r["prob_avg"], r["bm25_density_avg"], r["idf_mass_avg"], r["token_count_avg"]])
    return out


def fetch_cluster_chunks(run_id: str, cluster_id: str, limit: int) -> List[List[Any]]:
    conn = get_conn()
    run_id = (run_id or "").strip()
    cluster_id = (cluster_id or "").strip()
    if not run_id:
        return [["error"], ["missing run_id for this cluster"]]
    if not cluster_id:
        return [["error"], ["missing cluster_id"]]

    try:
        cid = int(float(cluster_id))
    except Exception:
        return [["error"], [f"cluster_id must be int (got {cluster_id!r})"]]

    limit = int(limit) if limit else 150
    limit = max(1, min(2000, limit))

    cur = conn.cursor()
    cur.execute(
        """
        SELECT uid, order_index, doc_id, source_file, cluster_prob,
               CASE WHEN length(text) > 220 THEN substr(text, 1, 220) || '…' ELSE text END AS preview
        FROM chunks
        WHERE run_id=? AND cluster_id=?
        ORDER BY cluster_prob DESC, order_index ASC
        LIMIT ?;
        """,
        (run_id, cid, limit),
    )
    rows = cur.fetchall()
    cur.close()

    out = [["uid", "order_index", "doc_id", "source_file", "cluster_prob", "preview"]]
    for r in rows:
        out.append([r["uid"], r["order_index"], r["doc_id"], r["source_file"], r["cluster_prob"], r["preview"]])
    return out


# -----------------------------
# Optional method-sanitized signals helpers
# -----------------------------
def _safe_int(v: Any, default: int = 0) -> int:
    try:
        if v is None or v == "":
            return default
        if isinstance(v, bool):
            return int(v)
        if isinstance(v, int):
            return int(v)
        if isinstance(v, float):
            return int(v)
        s = str(v).strip()
        if not s:
            return default
        return int(float(s))
    except Exception:
        return default


def _safe_float(v: Any, default: float = 0.0) -> float:
    try:
        if v is None or v == "":
            return default
        if isinstance(v, bool):
            return float(int(v))
        if isinstance(v, (int, float)):
            return float(v)
        s = str(v).strip()
        if not s:
            return default
        return float(s)
    except Exception:
        return default


def _safe_str(v: Any) -> str:
    if v is None:
        return ""
    s = str(v)
    s = s.replace("\x00", " ").strip()
    return s


def _truncate(s: str, n: int = 240) -> str:
    s = _safe_str(s).replace("\r", " ").replace("\n", " ")
    s = re.sub(r"\s+", " ", s).strip()
    if len(s) > n:
        return s[:n] + "…"
    return s


def _parse_tags(v: Any) -> List[str]:
    if v is None:
        return []
    if isinstance(v, list):
        tags = [str(x).strip() for x in v if str(x).strip()]
        return tags
    s = str(v).strip()
    if not s:
        return []
    # try json list first
    if s.startswith("[") and s.endswith("]"):
        try:
            obj = json.loads(s)
            if isinstance(obj, list):
                return [str(x).strip() for x in obj if str(x).strip()]
        except Exception:
            pass
    # fallback split
    parts = re.split(r"[,\|;]+", s)
    tags = [p.strip() for p in parts if p.strip()]
    return tags


def _first_present(d: Dict[str, Any], keys: List[str], default: Any = None) -> Any:
    for k in keys:
        if k in d and d[k] is not None and str(d[k]).strip() != "":
            return d[k]
    return default


def _normalize_signal_row(row: Dict[str, Any], idx: int) -> Dict[str, Any]:
    case_id = _safe_str(_first_present(row, ["case_id", "public_case_id", "card_id"], ""))
    if not case_id:
        case_id = f"CASE-{idx+1:04d}"

    has_rank = False
    rank_v = _first_present(row, ["rank", "public_rank", "order"])
    if rank_v is not None and str(rank_v).strip() != "":
        has_rank = True
        rank_i = _safe_int(rank_v, idx + 1)
    else:
        rank_i = idx + 1

    anomaly_score = _safe_float(
        _first_present(row, ["anomaly_score", "convergence_score", "weighted_score", "score"], 0.0), 0.0
    )
    signal_count = _safe_int(_first_present(row, ["signal_count", "signals", "n_signals"], 0), 0)

    uid = _safe_str(_first_present(row, ["uid"], ""))
    chunk_id = _safe_str(_first_present(row, ["chunk_id"], ""))
    run_id = _safe_str(_first_present(row, ["run_id"], ""))
    order_index = _safe_int(_first_present(row, ["order_index"], -1), -1)

    tags = _parse_tags(_first_present(row, ["signal_tags", "theme_tags", "top_hypotheses"], []))
    tags_s = ",".join(tags)

    excerpt = _safe_str(_first_present(row, ["excerpt", "preview", "text", "snippet"], ""))
    context_prev = _safe_str(_first_present(row, ["context_prev", "prev", "left_context"], ""))
    context_next = _safe_str(_first_present(row, ["context_next", "next", "right_context"], ""))

    return {
        "case_id": case_id,
        "rank": rank_i,
        "_has_rank": has_rank,
        "anomaly_score": anomaly_score,
        "signal_count": signal_count,
        "signal_tags": tags_s,
        "uid": uid,
        "chunk_id": chunk_id,
        "run_id": run_id,
        "order_index": order_index,
        "excerpt": excerpt,
        "context_prev": context_prev,
        "context_next": context_next,
    }
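
# Example of the synonym mapping above (hypothetical input row):
#   {"public_case_id": "C-7", "convergence_score": "0.83", "preview": "some text"}
# normalizes to case_id "C-7" and anomaly_score 0.83 (the first matching key
# wins), with the excerpt taken from "preview" since no "excerpt" key is present.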


def _sort_signals(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    out = list(rows)
    out.sort(
        key=lambda r: (
            0 if r.get("_has_rank") else 1,
            _safe_int(r.get("rank"), 10**12),
            -_safe_float(r.get("anomaly_score"), 0.0),
            -_safe_int(r.get("signal_count"), 0),
            _safe_str(r.get("case_id")),
            _safe_str(r.get("uid")),
            _safe_str(r.get("chunk_id")),
        )
    )
    return out
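
# Net ordering: cards with an explicit rank come first (ascending rank), then
# the remaining cards by descending anomaly_score and descending signal_count,
# with case_id/uid/chunk_id as deterministic tie-breakers.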


def _load_signals_rows() -> Tuple[List[Dict[str, Any]], Optional[Path], Optional[str]]:
    p = SIGNALS_PATH if SIGNALS_PATH is not None else ensure_signals_file()
    if p is None or (not p.exists()) or (not p.is_file()):
        return [], None, "No method-sanitized signals file found."

    rows: List[Dict[str, Any]] = []
    suffix = p.suffix.lower()

    try:
        if suffix == ".jsonl":
            with p.open("r", encoding="utf-8", errors="ignore") as f:
                for _i, line in enumerate(f):
                    s = line.strip()
                    if not s:
                        continue
                    try:
                        obj = json.loads(s)
                    except Exception:
                        continue
                    if not isinstance(obj, dict):
                        continue
                    rows.append(_normalize_signal_row(obj, len(rows)))
        elif suffix == ".csv":
            with p.open("r", encoding="utf-8", errors="ignore", newline="") as f:
                reader = csv.DictReader(f)
                for _i, r in enumerate(reader):
                    rr = dict(r) if isinstance(r, dict) else {}
                    rows.append(_normalize_signal_row(rr, len(rows)))
        else:
            return [], p, f"Unsupported signals file extension: {suffix}. Use .jsonl or .csv"
    except Exception as e:
        return [], p, f"Failed reading signals file: {e}"

    rows = _sort_signals(rows)
    return rows, p, None


def _resolve_uid_from_signal_fields(uid: str, run_id: str, chunk_id: str, order_index: int) -> str:
    uid = _safe_str(uid)
    if uid:
        return uid

    conn = get_conn()
    cur = conn.cursor()
    try:
        if run_id and order_index >= 0:
            cur.execute(
                "SELECT uid FROM chunks WHERE run_id=? AND order_index=? LIMIT 1;",
                (run_id, int(order_index)),
            )
            r = cur.fetchone()
            if r is not None and r["uid"]:
                return str(r["uid"])

        if run_id and chunk_id:
            cur.execute(
                "SELECT uid FROM chunks WHERE run_id=? AND chunk_id=? LIMIT 1;",
                (run_id, chunk_id),
            )
            r = cur.fetchone()
            if r is not None and r["uid"]:
                return str(r["uid"])

        if chunk_id:
            cur.execute("SELECT uid, COUNT(*) AS n FROM chunks WHERE chunk_id=?;", (chunk_id,))
            r = cur.fetchone()
            if r is not None and int(r["n"] or 0) == 1 and r["uid"]:
                return str(r["uid"])
    except Exception:
        return uid
    finally:
        cur.close()

    return uid
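
# Fallback order above: trust an explicit uid, else look up (run_id, order_index),
# else (run_id, chunk_id), else a bare chunk_id but only when it is unique across
# runs; ambiguous or failed lookups return the original (possibly empty) uid.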


def _filter_signals(rows: List[Dict[str, Any]], top_n: int, min_score: float, tag_filter: str, text_filter: str) -> List[Dict[str, Any]]:
    top_n = max(1, min(5000, _safe_int(top_n, 350)))
    min_score_f = _safe_float(min_score, float("-inf"))

    tag_tokens = [t.strip().lower() for t in re.split(r"[,\s]+", _safe_str(tag_filter)) if t.strip()]
    text_q = _safe_str(text_filter).lower()

    out: List[Dict[str, Any]] = []
    for r in rows:
        if _safe_float(r.get("anomaly_score"), 0.0) < min_score_f:
            continue

        tags_s = _safe_str(r.get("signal_tags")).lower()
        if tag_tokens:
            # require all tag tokens
            if any(tok not in tags_s for tok in tag_tokens):
                continue

        if text_q:
            hay = " ".join(
                [
                    _safe_str(r.get("case_id")),
                    _safe_str(r.get("uid")),
                    _safe_str(r.get("chunk_id")),
                    _safe_str(r.get("run_id")),
                    _safe_str(r.get("signal_tags")),
                    _safe_str(r.get("excerpt")),
                    _safe_str(r.get("context_prev")),
                    _safe_str(r.get("context_next")),
                ]
            ).lower()
            if text_q not in hay:
                continue

        out.append(r)
        if len(out) >= top_n:
            break
    return out
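
# Example (illustrative filters): tag_filter="outlier rare_entity" keeps only
# cards whose tag string contains BOTH tokens, and text_filter="pump" further
# requires that substring somewhere in the card's ids, tags, excerpt, or context.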


def _blank_signals_table() -> List[List[Any]]:
    return [[
        "case_id", "rank", "anomaly_score", "signal_count", "signal_tags",
        "uid", "chunk_id", "run_id", "order_index", "excerpt"
    ]]


def _signals_to_table(rows: List[Dict[str, Any]]) -> List[List[Any]]:
    out = _blank_signals_table()
    for r in rows:
        out.append(
            [
                _safe_str(r.get("case_id")),
                _safe_int(r.get("rank"), 0),
                _safe_float(r.get("anomaly_score"), 0.0),
                _safe_int(r.get("signal_count"), 0),
                _safe_str(r.get("signal_tags")),
                _safe_str(r.get("uid")),
                _safe_str(r.get("chunk_id")),
                _safe_str(r.get("run_id")),
                _safe_int(r.get("order_index"), -1),
                _truncate(_safe_str(r.get("excerpt")), 220),
            ]
        )
    return out


def _pack_signal_choice(r: Dict[str, Any]) -> str:
    case_id = _safe_str(r.get("case_id"))
    rank = _safe_int(r.get("rank"), 0)
    score = _safe_float(r.get("anomaly_score"), 0.0)
    sig = _safe_int(r.get("signal_count"), 0)
    prev = _truncate(_safe_str(r.get("excerpt")), 100)
    return f"{case_id} | rank={rank} score={score:.6f} sig={sig} | {prev}"


def _extract_case_id(choice: str) -> str:
    s = _safe_str(choice)
    if not s:
        return ""
    if " | " in s:
        return s.split(" | ", 1)[0].strip()
    return s.strip()


def _signal_details_text(r: Dict[str, Any], source_path: Optional[Path]) -> str:
    src = str(source_path) if source_path is not None else "n/a"
    lines = [
        f"case_id: {_safe_str(r.get('case_id'))}",
        f"rank: {_safe_int(r.get('rank'), 0)}",
        f"anomaly_score: {_safe_float(r.get('anomaly_score'), 0.0)}",
        f"signal_count: {_safe_int(r.get('signal_count'), 0)}",
        f"signal_tags: {_safe_str(r.get('signal_tags'))}",
        f"uid: {_safe_str(r.get('uid'))}",
        f"chunk_id: {_safe_str(r.get('chunk_id'))}",
        f"run_id: {_safe_str(r.get('run_id'))}",
        f"order_index: {_safe_int(r.get('order_index'), -1)}",
        f"source_file: {src}",
        "",
        "excerpt:",
        _safe_str(r.get("excerpt")),
        "",
        "context_prev:",
        _safe_str(r.get("context_prev")),
        "",
        "context_next:",
        _safe_str(r.get("context_next")),
    ]
    txt = "\n".join(lines)
    if len(txt) > 20000:
        txt = txt[:20000] + "\n\n…(truncated to 20k chars)…"
    return txt


# -----------------------------
# UI helpers
# -----------------------------
def _fmt_debug(e: BaseException) -> str:
    # traceback.format_exc() only has something to report inside an active
    # `except` block; when called outside one (e.g. with a synthetic error),
    # fall back to formatting the exception that was passed in.
    tb = traceback.format_exc()
    if tb.startswith("NoneType: None"):
        tb = "".join(traceback.format_exception(type(e), e, e.__traceback__))
    if len(tb) > 6000:
        tb = tb[-6000:]
    return f"```text\n{tb}\n```"


def _blank_results_table() -> List[List[Any]]:
    return [["uid", "cluster_id", "order_index", "doc_id", "source_file", "cluster_prob", "preview"]]


def _blank_cluster_table() -> List[List[Any]]:
    return [["run_id", "cluster_id", "n_chunks", "prob_avg", "bm25_density_avg", "idf_mass_avg", "token_count_avg"]]


def _blank_cluster_chunks_table() -> List[List[Any]]:
    return [["uid", "order_index", "doc_id", "source_file", "cluster_prob", "preview"]]


def _blank_ctx_table() -> List[List[Any]]:
    return [["uid", "order_index", "cluster_id", "cluster_prob", "doc_id", "source_file", "preview"]]


def _pack_choice(uid: str, preview: str) -> str:
    uid = (uid or "").strip()
    preview = (preview or "").replace("\n", " ").replace("\r", " ").strip()
    preview = re.sub(r"\s+", " ", preview)
    if len(preview) > 160:
        preview = preview[:160] + "…"
    return f"{uid} | {preview}" if preview else uid


def _extract_uid(choice: str) -> str:
    s = (choice or "").strip()
    if not s:
        return ""
    if " | " in s:
        return s.split(" | ", 1)[0].strip()
    return s
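
# The two helpers above are inverses on the uid part, e.g. (hypothetical uid):
#   _extract_uid(_pack_choice("u-123", "some preview text")) == "u-123"
# because the uid is always the segment before the first " | ".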


def _pack_cluster_choice(run_id: str, cluster_id: Any, n_chunks: Any) -> str:
    r = (str(run_id) if run_id is not None else "").strip()
    c = (str(cluster_id) if cluster_id is not None else "").strip()
    try:
        n = int(n_chunks)
    except Exception:
        n = n_chunks
    # user sees this; keep it readable and stable
    return f"{r} / {c} | {n}"


def _extract_cluster_key(choice: str) -> Tuple[str, str]:
    """
    choice format: "run_id / cluster_id | n"
    """
    s = (choice or "").strip()
    if not s:
        return "", ""
    left = s.split(" | ", 1)[0].strip()
    if " / " in left:
        a, b = left.split(" / ", 1)
        return a.strip(), b.strip()
    # fallback: if someone pasted just a cluster_id
    return "", left.strip()


def _show_uid(uid: str, window: int) -> Tuple[str, str, List[List[Any]]]:
    uid = (uid or "").strip()
    if not uid:
        return "", "", _blank_ctx_table()

    ch = get_chunk_by_uid(uid)
    if not ch:
        return "", "", _blank_ctx_table()

    meta_lines = [
        f"uid: {ch.get('uid','')}",
        f"run_id: {ch.get('run_id','')}",
        f"chunk_id: {ch.get('chunk_id','')}",
        f"order_index: {ch.get('order_index','')}",
        f"doc_id: {ch.get('doc_id','')}",
        f"source_file: {ch.get('source_file','')}",
        f"cluster_id: {ch.get('cluster_id','')}",
        f"cluster_prob: {ch.get('cluster_prob','')}",
        f"bm25_density: {ch.get('bm25_density','')}",
        f"idf_mass: {ch.get('idf_mass','')}",
        f"token_count: {ch.get('token_count','')}",
        f"unique_token_count: {ch.get('unique_token_count','')}",
    ]
    meta_text = "\n".join(meta_lines)

    full_text = ch.get("text", "") or ""
    if len(full_text) > 20000:
        full_text = full_text[:20000] + "\n\n…(truncated to 20k chars)…"

    ctx = get_context(run_id=str(ch["run_id"]), order_index=int(ch["order_index"] or 0), window=int(window or 3))
    ctx_table = _blank_ctx_table()
    for r in ctx:
        ctx_table.append(
            [
                r.get("uid", ""),
                r.get("order_index", ""),
                r.get("cluster_id", ""),
                r.get("cluster_prob", ""),
                r.get("doc_id", ""),
                r.get("source_file", ""),
                r.get("preview", ""),
            ]
        )

    return meta_text, full_text, ctx_table


# -----------------------------
# Callbacks
# -----------------------------
def ui_search(query: str, limit: int, cluster_id: str):
    try:
        tbl = fts_search(query=query, cluster_id=cluster_id, limit=limit)

        if tbl and tbl[0] and tbl[0][0] == "error":
            return (
                gr.update(choices=[], value=""),
                "",
                tbl,
                "⚠️ " + (tbl[1][0] if len(tbl) > 1 and tbl[1] else "Search error"),
                _fmt_debug(RuntimeError("search error")),
            )

        choices: List[str] = []
        if len(tbl) >= 2:
            for row in tbl[1:]:
                if not row or len(row) < 7:
                    continue
                uid = str(row[0])
                preview = str(row[6])
                choices.append(_pack_choice(uid, preview))

        status = f"✅ Found {len(choices)} results."
        debug = ""
        first_uid = _extract_uid(choices[0]) if choices else ""
        return (
            gr.update(choices=choices, value=(choices[0] if choices else "")),
            first_uid,
            tbl,
            status,
            debug,
        )
    except Exception as e:
        return (
            gr.update(choices=[], value=""),
            "",
            _blank_results_table(),
            f"⚠️ {type(e).__name__}: {e}",
            _fmt_debug(e),
        )


def ui_pick_result(choice: str):
    return _extract_uid(choice)


def ui_open_uid(uid: str, ctx_window: int):
    try:
        uid = (uid or "").strip()
        if not uid:
            return "", "", _blank_ctx_table(), "⚠️ Enter/pick a uid first.", ""

        meta, text, ctx = _show_uid(uid, ctx_window)
        if not meta and not text:
            return "", "", _blank_ctx_table(), f"⚠️ uid not found: {uid}", ""

        return meta, text, ctx, f"✅ Opened uid {uid}", ""
    except Exception as e:
        return "", "", _blank_ctx_table(), f"⚠️ {type(e).__name__}: {e}", _fmt_debug(e)


def ui_load_clusters_all(top_n: int):
    try:
        tbl = fetch_cluster_summary_all(top_n=top_n)
        if tbl and tbl[0] and tbl[0][0] == "error":
            return tbl, gr.update(choices=[], value=""), "⚠️ " + (tbl[1][0] if len(tbl) > 1 and tbl[1] else "Cluster summary error"), ""

        choices: List[str] = []
        if len(tbl) >= 2:
            for row in tbl[1:]:
                if not row or len(row) < 3:
                    continue
                choices.append(_pack_cluster_choice(str(row[0]), row[1], row[2]))

        status = f"✅ Loaded {len(choices)} clusters."
        return tbl, gr.update(choices=choices, value=(choices[0] if choices else "")), status, ""
    except Exception as e:
        return _blank_cluster_table(), gr.update(choices=[], value=""), f"⚠️ {type(e).__name__}: {e}", _fmt_debug(e)


def ui_load_cluster_chunks(cluster_choice: str, limit: int):
    try:
        run_id, cluster_id = _extract_cluster_key(cluster_choice)
        if not run_id:
            return (
                _blank_cluster_chunks_table(),
                gr.update(choices=[], value=""),
                "",
                "⚠️ Pick a cluster from the list.",
                "",
            )

        tbl = fetch_cluster_chunks(run_id=run_id, cluster_id=cluster_id, limit=limit)

        if tbl and tbl[0] and tbl[0][0] == "error":
            return (
                tbl,
                gr.update(choices=[], value=""),
                "",
                "⚠️ " + (tbl[1][0] if len(tbl) > 1 and tbl[1] else "Cluster error"),
                "",
            )

        choices: List[str] = []
        if len(tbl) >= 2:
            for row in tbl[1:]:
                if not row or len(row) < 6:
                    continue
                uid = str(row[0])
                preview = str(row[5])
                choices.append(_pack_choice(uid, preview))

        first_uid = _extract_uid(choices[0]) if choices else ""
        return (
            tbl,
            gr.update(choices=choices, value=(choices[0] if choices else "")),
            first_uid,
            f"✅ Loaded {len(choices)} chunks.",
            "",
        )
    except Exception as e:
        return _blank_cluster_chunks_table(), gr.update(choices=[], value=""), "", f"⚠️ {type(e).__name__}: {e}", _fmt_debug(e)


def ui_reload_meta():
    try:
        meta_table = fetch_meta()
        return meta_table, "✅ Reloaded.", ""
    except Exception as e:
        return [["error"], ["failed"]], f"⚠️ {type(e).__name__}: {e}", _fmt_debug(e)


def ui_signals_load(top_n: int, min_score: float, tag_filter: str, text_filter: str):
    try:
        rows_all, src_path, err = _load_signals_rows()
        if err:
            state = {}
            return (
                _blank_signals_table(),
                gr.update(choices=[], value=""),
                "",
                "",
                state,
                f"⚠️ {err}",
                "",
            )

        rows = _filter_signals(
            rows=rows_all,
            top_n=top_n,
            min_score=min_score,
            tag_filter=tag_filter,
            text_filter=text_filter,
        )
        table = _signals_to_table(rows)

        # state by case_id
        st: Dict[str, Dict[str, Any]] = {}
        choices: List[str] = []
        for r in rows:
            cid = _safe_str(r.get("case_id"))
            if not cid:
                continue
            st[cid] = r
            choices.append(_pack_signal_choice(r))

        uid = ""
        details = ""
        if choices:
            cid0 = _extract_case_id(choices[0])
            r0 = st.get(cid0, {})
            uid = _resolve_uid_from_signal_fields(
                uid=_safe_str(r0.get("uid")),
                run_id=_safe_str(r0.get("run_id")),
                chunk_id=_safe_str(r0.get("chunk_id")),
                order_index=_safe_int(r0.get("order_index"), -1),
            )
            details = _signal_details_text(r0, src_path)

        source_msg = f" source={src_path}" if src_path is not None else ""
        status = f"✅ Loaded {len(rows)} signal cards.{source_msg}"
        return (
            table,
            gr.update(choices=choices, value=(choices[0] if choices else "")),
            uid,
            details,
            st,
            status,
            "",
        )
    except Exception as e:
        return (
            _blank_signals_table(),
            gr.update(choices=[], value=""),
            "",
            "",
            {},
            f"⚠️ {type(e).__name__}: {e}",
            _fmt_debug(e),
        )


def ui_signals_pick(choice: str, state: Dict[str, Dict[str, Any]]):
    try:
        cid = _extract_case_id(choice)
        r = state.get(cid) if isinstance(state, dict) else None
        if not r:
            return "", "", "⚠️ Pick a signal card first.", ""

        uid = _resolve_uid_from_signal_fields(
            uid=_safe_str(r.get("uid")),
            run_id=_safe_str(r.get("run_id")),
            chunk_id=_safe_str(r.get("chunk_id")),
            order_index=_safe_int(r.get("order_index"), -1),
        )

        details = _signal_details_text(r, SIGNALS_PATH)
        return uid, details, f"✅ Picked {cid}", ""
    except Exception as e:
        return "", "", f"⚠️ {type(e).__name__}: {e}", _fmt_debug(e)


# -----------------------------
# UI build
# -----------------------------
CSS = """
#app { max-width: 1100px; margin: 0 auto; }
h1,h2,h3 { margin-bottom: 0.4rem; }
.note { font-size: 0.95rem; opacity: 0.9; }
"""


def build_ui() -> gr.Blocks:
    meta_table = fetch_meta()

    with gr.Blocks(title="Corpus Browser", css=CSS) as demo:
        gr.Markdown(
            f"""
# Corpus Browser
<span class="note">version: <code>{APP_VERSION}</code> - db: <code>{DB_PATH}</code></span>

**Use it like this:**
- **Search:** type words -> Search -> pick result -> Open
- **Clusters:** Load clusters -> pick one -> Load chunks -> pick chunk -> Open
- **Signals (optional):** Load signal cards -> pick card -> Open linked chunk
"""
        )

        status = gr.Markdown("Ready.", elem_id="status")
        with gr.Accordion("Debug details", open=False):
            debug = gr.Markdown("")

        with gr.Tab("Search"):
            with gr.Row():
                q = gr.Textbox(label="Search words", placeholder="Type words to search", lines=2)
                search_btn = gr.Button("Search", variant="primary")

            with gr.Accordion("Advanced", open=False):
                with gr.Row():
                    limit_in = gr.Slider(1, 500, value=50, step=1, label="Max results")
                    cluster_in = gr.Textbox(label="Filter by cluster_id (optional)", placeholder="Leave blank")
                    ctx_window = gr.Slider(0, 12, value=3, step=1, label="Context window")

            gr.Markdown("### Results")
            result_pick = gr.Dropdown(choices=[], value="", label="Pick a result", interactive=True)
            uid_box = gr.Textbox(label="UID", placeholder="Auto-filled when you pick a result (or paste one)")
            open_btn = gr.Button("Open", variant="secondary")

            with gr.Row():
                text_out = gr.Textbox(label="Text", lines=18)

            with gr.Accordion("More details", open=False):
                meta_out = gr.Textbox(label="Meta", lines=10)
                ctx_tbl = _df(value=_blank_ctx_table(), label="Nearby chunks (context)", interactive=False, wrap=True)

            with gr.Accordion("Show table (power users)", open=False):
                results_tbl = _df(value=_blank_results_table(), label="Raw results table", interactive=False, wrap=True)

            search_btn.click(
                ui_search,
                inputs=[q, limit_in, cluster_in],
                outputs=[result_pick, uid_box, results_tbl, status, debug],
            )
            result_pick.change(ui_pick_result, inputs=[result_pick], outputs=[uid_box])

            open_btn.click(
                ui_open_uid,
                inputs=[uid_box, ctx_window],
                outputs=[meta_out, text_out, ctx_tbl, status, debug],
            )

        with gr.Tab("Clusters"):
            with gr.Row():
                load_clusters_btn = gr.Button("Load clusters", variant="primary")

            with gr.Accordion("Advanced", open=False):
                with gr.Row():
                    topn = gr.Slider(1, 2000, value=200, step=1, label="How many clusters to list")
                    sample_n = gr.Slider(1, 2000, value=150, step=1, label="How many chunks to list")
                    ctx_window2 = gr.Slider(0, 12, value=3, step=1, label="Context window")

            cluster_pick = gr.Dropdown(choices=[], value="", label="Pick a cluster", interactive=True)
            load_chunks_btn = gr.Button("Load chunks", variant="secondary")

            chunk_pick = gr.Dropdown(choices=[], value="", label="Pick a chunk", interactive=True)
            uid_box2 = gr.Textbox(label="UID", placeholder="Auto-filled when you pick a chunk (or paste one)")
            open_btn2 = gr.Button("Open", variant="secondary")

            with gr.Accordion("Show tables", open=False):
                cluster_tbl = _df(value=_blank_cluster_table(), label="Clusters table", interactive=False, wrap=True)
                chunk_tbl = _df(value=_blank_cluster_chunks_table(), label="Chunks table", interactive=False, wrap=True)

            with gr.Row():
                text_out2 = gr.Textbox(label="Text", lines=18)

            with gr.Accordion("More details", open=False):
                meta_out2 = gr.Textbox(label="Meta", lines=10)
                ctx_tbl2 = _df(value=_blank_ctx_table(), label="Nearby chunks (context)", interactive=False, wrap=True)

            load_clusters_btn.click(
                ui_load_clusters_all,
                inputs=[topn],
                outputs=[cluster_tbl, cluster_pick, status, debug],
            )

            load_chunks_btn.click(
                ui_load_cluster_chunks,
                inputs=[cluster_pick, sample_n],
                outputs=[chunk_tbl, chunk_pick, uid_box2, status, debug],
            )
            chunk_pick.change(ui_pick_result, inputs=[chunk_pick], outputs=[uid_box2])

            open_btn2.click(
                ui_open_uid,
                inputs=[uid_box2, ctx_window2],
                outputs=[meta_out2, text_out2, ctx_tbl2, status, debug],
            )

        with gr.Tab("Signals"):
            signals_state = gr.State({})

            with gr.Row():
                load_signals_btn = gr.Button("Load signal cards", variant="primary")

            with gr.Accordion("Advanced", open=False):
                with gr.Row():
                    topn_sig = gr.Slider(1, 2000, value=350, step=1, label="Max cards")
                    min_score_sig = gr.Number(value=0.0, label="Min anomaly_score")
                with gr.Row():
                    tag_filter_sig = gr.Textbox(label="Tag filter (space/comma separated; all required)", placeholder="ex: outlier rare_entity")
                    text_filter_sig = gr.Textbox(label="Text filter (substring)", placeholder="search in case_id/tags/excerpt/context")

                ctx_window3 = gr.Slider(0, 12, value=3, step=1, label="Context window")

            signal_pick = gr.Dropdown(choices=[], value="", label="Pick a signal card", interactive=True)
            uid_box3 = gr.Textbox(label="Resolved UID", placeholder="Auto-filled if mapping exists")
            open_btn3 = gr.Button("Open linked chunk", variant="secondary")

            signal_details = gr.Textbox(label="Signal card details", lines=14)

            with gr.Accordion("Show signal table", open=False):
                signal_tbl = _df(value=_blank_signals_table(), label="Signal cards", interactive=False, wrap=True)

            with gr.Row():
                text_out3 = gr.Textbox(label="Text", lines=18)

            with gr.Accordion("More details", open=False):
                meta_out3 = gr.Textbox(label="Meta", lines=10)
                ctx_tbl3 = _df(value=_blank_ctx_table(), label="Nearby chunks (context)", interactive=False, wrap=True)

            load_signals_btn.click(
                ui_signals_load,
                inputs=[topn_sig, min_score_sig, tag_filter_sig, text_filter_sig],
                outputs=[signal_tbl, signal_pick, uid_box3, signal_details, signals_state, status, debug],
            )

            signal_pick.change(
                ui_signals_pick,
                inputs=[signal_pick, signals_state],
                outputs=[uid_box3, signal_details, status, debug],
            )

            open_btn3.click(
                ui_open_uid,
                inputs=[uid_box3, ctx_window3],
                outputs=[meta_out3, text_out3, ctx_tbl3, status, debug],
            )

        with gr.Tab("About"):
            reload_meta_btn = gr.Button("Reload meta", variant="primary")
            meta_tbl = _df(value=meta_table, label="Meta", interactive=False, wrap=True)
            reload_meta_btn.click(
                ui_reload_meta,
                inputs=[],
                outputs=[meta_tbl, status, debug],
            )

    return demo


demo = build_ui()

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=int(_env("PORT", "7860")))
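
# Assumed Space dependencies (matching the imports above; pin versions as needed
# in requirements.txt):
#   gradio
#   huggingface_hub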