cjc0013 committed on
Commit 2862cfc · verified · 1 Parent(s): 38bdac2

Update README.md

  1. README.md +75 -1
README.md CHANGED
@@ -7,6 +7,80 @@ sdk: gradio
  sdk_version: 6.5.1
  app_file: app.py
  pinned: false
+ license: mit
  ---
+ # Epstein Corpus Explorer (Space + Dataset split)
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ This Space is a read-only browser for a large SQLite corpus plus optional signal cards.
+
+ - Space: `cjc0013/EpsteinWithAnomScore`
+ - Dataset: `cjc0013/EpsteinWithAnomScore`
+
+ ## Links
+
+ - Space: https://huggingface.co/spaces/cjc0013/EpsteinWithAnomScore
+ - Dataset file (DB): https://huggingface.co/datasets/cjc0013/EpsteinWithAnomScore/blob/main/corpus.sqlite
+
+ ## What this app does
+
+ - Opens `corpus.sqlite` in read-only mode
+ - FTS keyword search (`chunks_fts`)
+ - Cluster browsing across runs (`cluster_summary`)
+ - Open any `uid` and view local context window (`order_index +/- k`)
+ - Optional Signals tab for method-sanitized signal cards (JSONL/CSV), then open linked chunks
+
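The read-only open and the `order_index +/- k` context window described in the list above can be sketched as follows. This is a minimal illustration, not the code in `app.py`: the table name `chunks` and its `text` column are assumptions (the README only names `uid`, `order_index`, `chunks_fts`, and `cluster_summary`), and the helper names are hypothetical.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open an existing SQLite file read-only using a file: URI with mode=ro."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def context_window(conn: sqlite3.Connection, uid: str, k: int = 2):
    """Return rows whose order_index lies within +/- k of the given uid.

    Assumes a hypothetical `chunks(uid, order_index, text)` table; the real
    schema may differ (for example, it may scope order_index per document).
    """
    row = conn.execute(
        "SELECT order_index FROM chunks WHERE uid = ?", (uid,)
    ).fetchone()
    if row is None:
        return []  # unknown uid: nothing to show
    center = row[0]
    return conn.execute(
        "SELECT uid, order_index, text FROM chunks "
        "WHERE order_index BETWEEN ? AND ? ORDER BY order_index",
        (center - k, center + k),
    ).fetchall()
```

With `mode=ro`, any write attempted through the connection raises `sqlite3.OperationalError`, which matches the "raw data is not modified" intent. The same connection could also serve keyword search through a `MATCH` query against `chunks_fts`, though that table's columns are not documented here.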
+ ## Core principle
+
+ Raw data is not modified here.
+ This app is for indexing, browsing, and narrowing search space.
+ Signal/anomaly values are triage hints, not proof.
+
+ ## How DB loading works
+
+ Priority order:
+
+ 1. `CORPUS_SQLITE_PATH` (if set)
+ 2. Local paths like `./data/corpus.sqlite`
+ 3. Download from dataset repo using:
+ - `DATASET_REPO_ID`
+ - `DATASET_FILENAME` (default: `corpus.sqlite`)
+
+ Recommended Space variables:
+
+ - `DATASET_REPO_ID = cjc0013/EpsteinWithAnomScore`
+ - `DATASET_FILENAME = corpus.sqlite`
+ - `DB_LOCAL_DIR = ./data` (optional)
+
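One way the DB priority order above could be implemented is sketched below. The function name `resolve_db_path` and the `local_candidates` parameter are hypothetical, and how the real app treats a `CORPUS_SQLITE_PATH` that points at a missing file is not stated in the README; this sketch simply returns it. The download step uses `hf_hub_download` from `huggingface_hub`, imported lazily so the first two steps work without network access.

```python
import os
from pathlib import Path


def resolve_db_path(local_candidates=("./data/corpus.sqlite",)) -> str:
    """Resolve the corpus DB path using the three-step priority order."""
    # 1. Explicit override via environment variable.
    explicit = os.environ.get("CORPUS_SQLITE_PATH")
    if explicit:
        return explicit

    # 2. First local candidate that exists on disk.
    for candidate in local_candidates:
        if Path(candidate).is_file():
            return str(candidate)

    # 3. Download from the dataset repo named by DATASET_REPO_ID.
    repo_id = os.environ.get("DATASET_REPO_ID")
    if not repo_id:
        raise FileNotFoundError(
            "No local DB found and DATASET_REPO_ID is not set."
        )
    from huggingface_hub import hf_hub_download  # lazy: only needed here

    return hf_hub_download(
        repo_id=repo_id,
        filename=os.environ.get("DATASET_FILENAME", "corpus.sqlite"),
        repo_type="dataset",
        # Assumption: DB_LOCAL_DIR, when set, is the download target directory.
        local_dir=os.environ.get("DB_LOCAL_DIR") or None,
    )
```

Keeping the download as the last step means a Space with a bundled or mounted DB never touches the network on startup.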
+ ## Optional Signals file loading
+
+ If you publish a signals file in the dataset, the app can load it automatically.
+
+ Supported names:
+ - `public_method_sanitized_topN.jsonl`
+ - `public_top_signals.jsonl`
+ - CSV variants of the same names
+
+ Priority order:
+
+ 1. `METHOD_SIGNALS_PATH` (if set)
+ 2. Common local paths (`./data`, `./dataset`, `/data`)
+ 3. Download from dataset repo with:
+ - `METHOD_SIGNALS_DATASET_REPO_ID`
+ - `METHOD_SIGNALS_FILENAME`
+
+ Recommended variables (if signals are in same dataset repo):
+ - `METHOD_SIGNALS_DATASET_REPO_ID = cjc0013/EpsteinWithAnomScore`
+ - `METHOD_SIGNALS_FILENAME = public_method_sanitized_topN.jsonl`
+
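As a rough illustration of the local part of the signals lookup, the sketch below scans the listed directories for the supported file names and parses either format into a list of dicts. The helper names are hypothetical, the fields inside a signal card are not specified in the README, and the exact search order in `app.py` may differ.

```python
import csv
import json
from pathlib import Path

# Base names from the README; each may appear as .jsonl or .csv.
SIGNAL_BASENAMES = ("public_method_sanitized_topN", "public_top_signals")


def find_local_signals(dirs=("./data", "./dataset", "/data")):
    """Return the first supported signals file found in dirs, or None."""
    for directory in dirs:
        for base in SIGNAL_BASENAMES:
            for ext in (".jsonl", ".csv"):
                candidate = Path(directory) / (base + ext)
                if candidate.is_file():
                    return candidate
    return None


def load_signal_cards(path):
    """Parse a JSONL or CSV signals file into a list of dicts."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".jsonl":
        with path.open(encoding="utf-8") as fh:
            # One JSON object per non-empty line.
            return [json.loads(line) for line in fh if line.strip()]
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as fh:
            # CSV values stay strings; callers convert scores as needed.
            return list(csv.DictReader(fh))
    raise ValueError(f"Unsupported signals format: {path.name}")
```

Returning `None` when nothing is found lets the caller fall through to the dataset download step, or hide the Signals tab, without treating a missing file as an error.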
+ ## Space files
+
+ Minimum:
+
+ - `app.py` (your HF app)
+ - `requirements.txt`
+
+ Suggested `requirements.txt`:
+
+ ```txt
+ gradio>=4.0.0
+ huggingface_hub>=0.20.0