metadata
title: EpsteinWithAnomScore
emoji: 👁
colorFrom: gray
colorTo: green
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
license: mit

Epstein Corpus Explorer (Space + Dataset split)

This Space is a read-only browser for a large SQLite corpus plus optional signal cards.

  • Space: cjc0013/EpsteinWithAnomScore
  • Dataset: cjc0013/EpsteinWithAnomScore

Links

What this app does

  • Opens corpus.sqlite in read-only mode
  • FTS keyword search (chunks_fts)
  • Cluster browsing across runs (cluster_summary)
  • Open any uid and view its local context window (order_index ± k)
  • Optional Signals tab that lists method-sanitized signal cards (JSONL/CSV) and opens the chunks they link to
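
The read-only open, keyword search, and context-window lookup listed above can be sketched with the standard sqlite3 module. This is a minimal illustration, not the app's actual code: only chunks_fts, uid, and order_index come from this README. The chunks table name, its text column, the presence of a uid column in chunks_fts, and the helper names are all assumptions, and the sketch treats order_index as a single global ordering.

```python
import sqlite3
from pathlib import Path
from typing import List, Tuple


def open_readonly(db_path: str) -> sqlite3.Connection:
    """Open the database via a SQLite URI with mode=ro so writes are rejected."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def fts_search(conn: sqlite3.Connection, query: str, limit: int = 20) -> List[str]:
    """Return uids of chunks matching an FTS query.

    Assumes chunks_fts is an FTS5 table that exposes a uid column.
    """
    rows = conn.execute(
        "SELECT uid FROM chunks_fts WHERE chunks_fts MATCH ? LIMIT ?",
        (query, limit),
    ).fetchall()
    return [row[0] for row in rows]


def context_window(
    conn: sqlite3.Connection, uid: str, k: int = 2
) -> List[Tuple[str, int, str]]:
    """Return rows whose order_index lies within k of the given uid's position.

    Assumes a hypothetical chunks(uid, order_index, text) table.
    """
    row = conn.execute(
        "SELECT order_index FROM chunks WHERE uid = ?", (uid,)
    ).fetchone()
    if row is None:
        return []  # unknown uid: nothing to show
    center = row[0]
    return conn.execute(
        "SELECT uid, order_index, text FROM chunks "
        "WHERE order_index BETWEEN ? AND ? ORDER BY order_index",
        (center - k, center + k),
    ).fetchall()
```

With mode=ro, any INSERT, UPDATE, or DELETE on the connection raises sqlite3.OperationalError, which matches the principle that raw data is never modified through the browser.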

Core principle

Raw data is not modified here.
This app is for indexing, browsing, and narrowing search space.
Signal/anomaly values are triage hints, not proof.

How DB loading works

Priority order:

  1. CORPUS_SQLITE_PATH (if set)
  2. Local paths like ./data/corpus.sqlite
  3. Download from dataset repo using:
    • DATASET_REPO_ID
    • DATASET_FILENAME (default: corpus.sqlite)
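
The priority order above could be implemented roughly as follows. This is a hedged sketch rather than the code in app.py: the function name resolve_corpus_path and its parameters are hypothetical, the candidate list contains only the one local path this README names, and using DB_LOCAL_DIR as the download target is an assumption. The download step relies on hf_hub_download from huggingface_hub, imported lazily so the first two steps work without it.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Sequence


def resolve_corpus_path(
    env: Optional[Mapping[str, str]] = None,
    local_candidates: Sequence[str] = ("./data/corpus.sqlite",),
) -> Path:
    """Resolve the corpus location: explicit path, then local files, then download."""
    env = os.environ if env is None else env

    # 1. An explicit path wins whenever the variable is set.
    explicit = env.get("CORPUS_SQLITE_PATH")
    if explicit:
        return Path(explicit)

    # 2. First local candidate that exists on disk.
    for candidate in local_candidates:
        path = Path(candidate)
        if path.is_file():
            return path

    # 3. Fall back to downloading from the dataset repo.
    repo_id = env.get("DATASET_REPO_ID")
    if not repo_id:
        raise FileNotFoundError(
            "No local corpus found and DATASET_REPO_ID is not set"
        )
    filename = env.get("DATASET_FILENAME", "corpus.sqlite")
    local_dir = env.get("DB_LOCAL_DIR", "./data")  # assumed download target

    from huggingface_hub import hf_hub_download  # lazy import: only needed here

    return Path(
        hf_hub_download(
            repo_id=repo_id,
            filename=filename,
            repo_type="dataset",
            local_dir=local_dir,
        )
    )
```

Passing env as a plain dict makes the resolution order easy to check without touching the real environment or the network.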

Recommended Space variables:

  • DATASET_REPO_ID = cjc0013/EpsteinWithAnomScore
  • DATASET_FILENAME = corpus.sqlite
  • DB_LOCAL_DIR = ./data (optional)

Optional Signals file loading

If you publish a signals file in the dataset, the app can load it automatically.

Supported names:

  • public_method_sanitized_topN.jsonl
  • public_top_signals.jsonl
  • CSV variants of the same names

Priority order:

  1. METHOD_SIGNALS_PATH (if set)
  2. Common local paths (./data, ./dataset, /data)
  3. Download from dataset repo with:
    • METHOD_SIGNALS_DATASET_REPO_ID
    • METHOD_SIGNALS_FILENAME
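
As a rough illustration of the local-path step and the two supported formats, the sketch below scans the common directories for a supported file name and parses it into a list of dicts. The helper names are hypothetical, and the preference order among directories, base names, and extensions is an assumption; this README does not specify it. No card fields are assumed: each JSONL line or CSV row is returned as-is.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List, Optional, Sequence

# Base names from the "Supported names" list; CSV variants share the same stems.
SUPPORTED_STEMS = ("public_method_sanitized_topN", "public_top_signals")
SUFFIXES = (".jsonl", ".csv")


def find_signals_file(
    search_dirs: Sequence[str] = ("./data", "./dataset", "/data"),
) -> Optional[Path]:
    """Return the first supported signals file found, or None."""
    for directory in search_dirs:
        for stem in SUPPORTED_STEMS:
            for suffix in SUFFIXES:
                path = Path(directory) / (stem + suffix)
                if path.is_file():
                    return path
    return None


def load_signal_cards(path: Path) -> List[Dict[str, object]]:
    """Parse a JSONL or CSV signals file into one dict per card."""
    suffix = path.suffix.lower()
    if suffix == ".jsonl":
        with path.open(encoding="utf-8") as handle:
            # Skip blank lines so a trailing newline does not break parsing.
            return [json.loads(line) for line in handle if line.strip()]
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            return [dict(row) for row in csv.DictReader(handle)]
    raise ValueError("Unsupported signals file type: " + suffix)
```

Note that CSV values arrive as strings while JSONL preserves numeric types, so any score column would need explicit conversion before sorting.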

Recommended variables (if the signals file is in the same dataset repo):

  • METHOD_SIGNALS_DATASET_REPO_ID = cjc0013/EpsteinWithAnomScore
  • METHOD_SIGNALS_FILENAME = public_method_sanitized_topN.jsonl

Dependencies

  • gradio>=4.0.0
  • huggingface_hub>=0.20.0