---
title: EpsteinWithAnomScore
emoji: 👁
colorFrom: gray
colorTo: green
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
license: mit
---
# Epstein Corpus Explorer (Space + Dataset split)

This Space is a read-only browser for a large SQLite corpus plus optional signal cards.

- Space: `cjc0013/EpsteinWithAnomScore`
- Dataset: `cjc0013/EpsteinWithAnomScore`

## Links

- Space: https://huggingface.co/spaces/cjc0013/EpsteinWithAnomScore
- Dataset file (DB): https://huggingface.co/datasets/cjc0013/EpsteinWithAnomScore/blob/main/corpus.sqlite

## What this app does

- Opens `corpus.sqlite` in read-only mode
- FTS keyword search (`chunks_fts`)
- Cluster browsing across runs (`cluster_summary`)
- Open any `uid` and view its local context window (`order_index +/- k`)
- Optional Signals tab that lists method-sanitized signal cards (JSONL/CSV) and opens their linked chunks
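
The read-only open, keyword search, and context-window steps can be sketched roughly as follows. This is a minimal illustration, not the code in `app.py`. It assumes `chunks_fts` is an FTS5 table with `uid` and `text` columns, and that a hypothetical `chunks` table holds `uid`, `doc_id`, `order_index`, and `text`. Only `chunks_fts`, `uid`, and `order_index` are named in this README; the rest is assumed.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open an existing SQLite file so that any write attempt fails."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    conn.row_factory = sqlite3.Row
    return conn


def search_chunks(conn: sqlite3.Connection, query: str, limit: int = 20):
    """Keyword search. Assumes chunks_fts is an FTS5 table (uid, text)."""
    sql = (
        "SELECT uid, text FROM chunks_fts "
        "WHERE chunks_fts MATCH ? ORDER BY rank LIMIT ?"
    )
    return conn.execute(sql, (query, limit)).fetchall()


def context_window(conn: sqlite3.Connection, uid: str, k: int = 2):
    """Return the chunk plus up to k neighbours on each side.

    Assumes a hypothetical chunks(uid, doc_id, order_index, text) table,
    where doc_id groups chunks that belong to the same source document.
    """
    row = conn.execute(
        "SELECT doc_id, order_index FROM chunks WHERE uid = ?", (uid,)
    ).fetchone()
    if row is None:
        return []
    sql = (
        "SELECT uid, order_index, text FROM chunks "
        "WHERE doc_id = ? AND order_index BETWEEN ? AND ? "
        "ORDER BY order_index"
    )
    lo, hi = row["order_index"] - k, row["order_index"] + k
    return conn.execute(sql, (row["doc_id"], lo, hi)).fetchall()
```

Opening with `mode=ro` means an accidental `INSERT` or `DELETE` raises `sqlite3.OperationalError` instead of touching the corpus.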

## Core principle

Raw data is not modified here.  
This app is for indexing, browsing, and narrowing search space.  
Signal/anomaly values are triage hints, not proof.

## How DB loading works

Priority order:

1. `CORPUS_SQLITE_PATH` (if set)
2. Local paths like `./data/corpus.sqlite`
3. Download from dataset repo using:
   - `DATASET_REPO_ID`
   - `DATASET_FILENAME` (default: `corpus.sqlite`)

Recommended Space variables:

- `DATASET_REPO_ID = cjc0013/EpsteinWithAnomScore`
- `DATASET_FILENAME = corpus.sqlite`
- `DB_LOCAL_DIR = ./data` (optional)
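
The priority order above could be implemented along these lines. This is a sketch under assumptions, not the app's actual resolver: the function names are hypothetical, the only local path checked is `DB_LOCAL_DIR/DATASET_FILENAME`, and `CORPUS_SQLITE_PATH` is returned as-is without an existence check. The download step uses `hf_hub_download` from `huggingface_hub` and is injectable so the logic can be exercised offline.

```python
import os
from pathlib import Path
from typing import Callable, Mapping, Optional


def download_from_hub(repo_id: str, filename: str, local_dir: str) -> str:
    """Fetch one file from a dataset repo; returns the local file path."""
    from huggingface_hub import hf_hub_download  # imported lazily

    return hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        repo_type="dataset",
        local_dir=local_dir,
    )


def resolve_db_path(
    env: Mapping[str, str] = os.environ,
    download: Callable[[str, str, str], str] = download_from_hub,
) -> Optional[str]:
    """Hypothetical resolver that follows the documented priority order."""
    # 1. An explicit path wins when set (assumption: used without checks).
    explicit = env.get("CORPUS_SQLITE_PATH")
    if explicit:
        return explicit

    # 2. A local copy, e.g. ./data/corpus.sqlite.
    filename = env.get("DATASET_FILENAME", "corpus.sqlite")
    local_dir = env.get("DB_LOCAL_DIR", "./data")
    local = Path(local_dir) / filename
    if local.is_file():
        return str(local)

    # 3. Fall back to downloading from the dataset repo, if one is configured.
    repo_id = env.get("DATASET_REPO_ID")
    if repo_id:
        return download(repo_id, filename, local_dir)
    return None
```

Checking local paths before downloading avoids re-fetching a large database on every restart when a copy is already on disk.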

## Optional Signals file loading

If you publish a signals file in the dataset, the app can load it automatically.

Supported names:
- `public_method_sanitized_topN.jsonl`
- `public_top_signals.jsonl`
- CSV variants of the same names

Priority order:

1. `METHOD_SIGNALS_PATH` (if set)
2. Common local paths (`./data`, `./dataset`, `/data`)
3. Download from dataset repo with:
   - `METHOD_SIGNALS_DATASET_REPO_ID`
   - `METHOD_SIGNALS_FILENAME`

Recommended variables (if the signals file is in the same dataset repo):
- `METHOD_SIGNALS_DATASET_REPO_ID = cjc0013/EpsteinWithAnomScore`
- `METHOD_SIGNALS_FILENAME = public_method_sanitized_topN.jsonl`
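
Finding and parsing a local signals file might look roughly like this. The sketch is illustrative only: the helper names are hypothetical, the order in which candidate names are tried is an assumption, and each card is parsed generically because this README does not document the card fields.

```python
import csv
import json
from pathlib import Path
from typing import Iterable, Optional

# Names from the list above; trying JSONL before CSV is an assumption.
SIGNAL_NAMES = (
    "public_method_sanitized_topN.jsonl",
    "public_top_signals.jsonl",
    "public_method_sanitized_topN.csv",
    "public_top_signals.csv",
)
LOCAL_DIRS = ("./data", "./dataset", "/data")


def find_local_signals(dirs: Iterable[str] = LOCAL_DIRS) -> Optional[Path]:
    """Return the first supported signals file found in the given directories."""
    for directory in dirs:
        for name in SIGNAL_NAMES:
            candidate = Path(directory) / name
            if candidate.is_file():
                return candidate
    return None


def load_signal_cards(path) -> list:
    """Parse a JSONL or CSV signals file into a list of dicts."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix == ".jsonl":
        with p.open(encoding="utf-8") as fh:
            # One JSON object per line; blank lines are skipped.
            return [json.loads(line) for line in fh if line.strip()]
    if suffix == ".csv":
        with p.open(encoding="utf-8", newline="") as fh:
            return list(csv.DictReader(fh))
    raise ValueError(f"Unsupported signals file type: {p.name}")
```

Note that CSV values arrive as strings, so any numeric score would need converting before sorting or thresholding.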


## Requirements

```txt
gradio>=4.0.0
huggingface_hub>=0.20.0
```