Chucks90 committed on
Commit 91dc2cd · verified · 1 Parent(s): ea194bb

add DATA_AVAILABILITY.md: what's included, public dataset/backbone sources, reproduce-via-HF-Jobs, DOI note

Files changed (1)
  1. DATA_AVAILABILITY.md +51 -0
DATA_AVAILABILITY.md ADDED
@@ -0,0 +1,51 @@
+ # Data & Code Availability
+
+ This repository (`Chucks90/covtoken` on the Hugging Face Hub) holds the **code and experiment
+ artifacts** for the covtoken study (label-free mid-layer lesion subspaces for token-economical
+ medical imaging). All compute was run as Hugging Face Jobs; every reported number is reproducible
+ from the scripts here against the public backbones and datasets listed below.
+
+ ## What is in this repository
+ - **Code**: `jobs/` (PEP 723 `uv` job scripts, one per experiment), `subspace/`, `coverage/`,
+ `gate/`, `arch/`, `eval/`, `data/`, `backbone/`, and `tests/` (incl. the label-leak guard).
+ - **Decision records**: `gate_reports/` (per-gate JSON with metric, comparator, threshold, and
+ statistical test; `NEGATIVE_RESULT.md`; `SUMMARY.md`).
+ - **Research-program results**: `research_v2/` (S1–S5), `research_v3/` (F1–F4), `research_v4/`
+ (G1/spectra/rarity-route) as JSON + summaries.
+ - **Manuscripts & figures**: `paper/` (three drafts, `make_figures.py`, `figures/`).
+ - **Specs**: `research_specs/`, `configs/thresholds.lock.json`.
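
To illustrate how a per-gate record with a metric, comparator, and threshold might be checked, here is a minimal sketch. The field names (`gate`, `metric`, `value`, `comparator`, `threshold`, `statistical_test`), the comparator vocabulary, and every number are assumptions made for illustration; the actual schema is whatever the files in `gate_reports/` use.

```python
import json
import operator

# Hypothetical comparator vocabulary; the real reports may encode this differently.
COMPARATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
}


def gate_passes(report: dict) -> bool:
    """Return True when the observed value satisfies `comparator threshold`."""
    compare = COMPARATORS[report["comparator"]]
    return compare(report["value"], report["threshold"])


# Illustrative record only: placeholder names and made-up numbers,
# not results from the study.
example = json.loads("""
{
  "gate": "example_gate",
  "metric": "example_metric",
  "value": 0.71,
  "comparator": ">=",
  "threshold": 0.65,
  "statistical_test": "placeholder"
}
""")

print(gate_passes(example))  # True, since 0.71 >= 0.65
```

Keeping the comparator inside the record, rather than hard-coding it in the checker, lets one function evaluate both "higher is better" and "lower is better" gates.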
17
+
18
+ ## What is NOT in this repository (and why)
19
+ Raw token banks, model weights, and materialized image/mask tensors are **not** included: they are
20
+ large, and the imaging data are governed by their original third-party licenses. They are
21
+ regenerated deterministically by the scripts in `jobs/` from the public sources below. Reported
22
+ metrics depend only on those public sources + the scripts here.
23
+
+ ## Backbones (public, frozen; no fine-tuning)
+ - **MedDINOv3 ViT-B/16 (CT-3M)**: `ricklisz123/MedDINOv3-ViTB-16-CT-3M` (Hugging Face)
+ - **DINOv2-base**: `facebook/dinov2-base`
+ - Cross-objective controls: `google/vit-base-patch16-224` (supervised), `facebook/vit-mae-base` (MAE)
+
+ ## Imaging datasets (public, third-party; used eval-only, and labels never touch subspace construction)
+ - **LIDC-IDRI** (lung CT): The Cancer Imaging Archive, https://www.cancerimagingarchive.net/collection/lidc-idri/
+ - **KiTS23** (kidney CT): https://kits-challenge.org/kits23/
+ - **Medical Segmentation Decathlon** (Task03 Liver, Task07 Pancreas; CT): http://medicaldecathlon.com/
+ - **BUSI** (breast ultrasound): Al-Dhabyani et al., "Dataset of breast ultrasound images", *Data in Brief*, 2020
+
+ Each dataset retains its original license/terms; obtain it from the source above.
+
+ ## Reproducing a result
+ Every experiment is a self-contained job script. Example:
+
+ ```bash
+ hf jobs uv run --flavor t4-medium --timeout 2h --secrets HF_TOKEN \
+   -v hf://buckets/<your-bucket>:/mnt --detach jobs/<experiment>_job.py
+ ```
+
+ Each script declares its inline dependencies (PEP 723), reads inputs from the mounted bucket,
+ writes a result JSON, and prints a `*_RESULT` line. The mapping from claims to scripts/artifacts is
+ in each `gate_reports/*.json` and the `research_v*/SUMMARY.md` files.
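
As a rough picture of the script shape described above, the sketch below shows a PEP 723 header followed by a job that reads from a mounted directory, writes a result JSON, and prints one `*_RESULT` line. It is a hypothetical skeleton and not one of the repository's scripts: the `inputs/` and `results/` subdirectories, the `EXAMPLE_RESULT` prefix, and the result fields are all assumptions.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = []
# ///
"""Hypothetical job skeleton; not one of the repository's actual scripts."""
import json
import sys
from pathlib import Path


def run_job(input_dir: Path, output_path: Path) -> dict:
    """Summarise the files under input_dir and write the summary as JSON."""
    files = sorted(p.name for p in input_dir.iterdir() if p.is_file())
    result = {"experiment": "example", "n_inputs": len(files), "inputs": files}
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(json.dumps(result, indent=2, sort_keys=True))
    # One greppable line per run, mirroring the `*_RESULT` convention.
    print("EXAMPLE_RESULT " + json.dumps(result, sort_keys=True))
    return result


if __name__ == "__main__":
    # /mnt matches the mount point in the command above; the
    # subdirectory layout below it is an assumption of this sketch.
    mount = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("/mnt")
    if (mount / "inputs").is_dir():
        run_job(mount / "inputs", mount / "results" / "example_result.json")
    else:
        print(f"no inputs directory under {mount}; nothing to do")
```

Because the dependency list lives in the comment header, `uv run` can resolve the environment from the script alone, so each job can ship as a single file.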
+
+ ## Citing this repository
+ A DOI for the archival snapshot can be generated via the repository's **Settings → Generate DOI** on the
+ Hugging Face Hub; cite that DOI in the manuscript's Data Availability Statement.