add DATA_AVAILABILITY.md: what's included, public dataset/backbone sources, reproduce-via-HF-Jobs, DOI note
DATA_AVAILABILITY.md
# Data & Code Availability

This repository (`Chucks90/covtoken` on the Hugging Face Hub) holds the **code and experiment
artifacts** for the covtoken study (label-free mid-layer lesion subspaces for token-economical
medical imaging). All compute was run as Hugging Face Jobs; every reported number is reproducible
from the scripts here against the public backbones and datasets listed below.

## What is in this repository
- **Code**: `jobs/` (PEP 723 `uv` job scripts, one per experiment), `subspace/`, `coverage/`,
  `gate/`, `arch/`, `eval/`, `data/`, `backbone/`, and `tests/` (including the label-leak guard).
- **Decision records**: `gate_reports/` (per-gate JSON with metric, comparator, threshold, and
  statistical test; `NEGATIVE_RESULT.md`; `SUMMARY.md`).
- **Research-program results**: `research_v2/` (S1-S5), `research_v3/` (F1-F4), `research_v4/`
  (G1/spectra/rarity-route) as JSON + summaries.
- **Manuscripts & figures**: `paper/` (three drafts, `make_figures.py`, `figures/`).
- **Specs**: `research_specs/`, `configs/thresholds.lock.json`.

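
The per-gate records in `gate_reports/` are described as carrying a metric, comparator, threshold, and statistical test. As a minimal sketch of how such a record could be checked, the snippet below assumes hypothetical field names (`metric_value`, `comparator`, `threshold`) and made-up values; the actual JSON schema in this repository may differ.

```python
import json
import operator

# Hypothetical mapping from comparator strings to Python operators.
COMPARATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
}


def gate_passes(record: dict) -> bool:
    """Return True if the record's metric satisfies its threshold.

    Field names are assumptions for illustration, not the repository's schema.
    """
    compare = COMPARATORS[record["comparator"]]
    return compare(record["metric_value"], record["threshold"])


# Illustrative record with made-up values (not a result from the study).
example = json.loads(
    '{"gate": "example", "metric_value": 0.5, "comparator": ">=", "threshold": 0.25}'
)
print(gate_passes(example))
```

Keeping the comparator in the record, rather than hard-coding it in the checker, lets one function serve both "higher is better" and "lower is better" gates.
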
## What is NOT in this repository (and why)
Raw token banks, model weights, and materialized image/mask tensors are **not** included: they are
large, and the imaging data are governed by their original third-party licenses. They are
regenerated deterministically by the scripts in `jobs/` from the public sources below. Reported
metrics depend only on those public sources and the scripts here.

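
One way to confirm that a regenerated artifact is byte-identical across runs is to compare content digests. The sketch below is an assumption about how such a check could look, not the repository's own tooling; `regenerate` is a hypothetical stand-in for a job script's seeded computation.

```python
import hashlib
import random
import struct


def regenerate(seed: int, n: int = 1000) -> bytes:
    """Hypothetical stand-in for a seeded, deterministic artifact build."""
    rng = random.Random(seed)  # local RNG, so global random state cannot leak in
    values = [rng.random() for _ in range(n)]
    # Pack as little-endian doubles so the byte layout does not depend on the host.
    return struct.pack(f"<{n}d", *values)


def digest(data: bytes) -> str:
    """SHA-256 hex digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()


first = digest(regenerate(seed=0))
second = digest(regenerate(seed=0))
print(first == second)
```

Recording the digest alongside each result JSON would let a reader verify a rebuilt artifact without redistributing the artifact itself.
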
## Backbones (public, frozen; no fine-tuning)
- **MedDINOv3 ViT-B/16 (CT-3M)**: `ricklisz123/MedDINOv3-ViTB-16-CT-3M` (Hugging Face)
- **DINOv2-base**: `facebook/dinov2-base`
- Cross-objective controls: `google/vit-base-patch16-224` (supervised), `facebook/vit-mae-base` (MAE)

## Imaging datasets (public, third-party; used eval-only; labels never touch subspace construction)
- **LIDC-IDRI** (lung CT): The Cancer Imaging Archive, https://www.cancerimagingarchive.net/collection/lidc-idri/
- **KiTS23** (kidney CT): https://kits-challenge.org/kits23/
- **Medical Segmentation Decathlon**, Task03 Liver, Task07 Pancreas (CT): http://medicaldecathlon.com/
- **BUSI** (breast ultrasound): Al-Dhabyani et al., *Data in Brief* 2020 (Dataset of breast ultrasound images)

Each dataset retains its original license/terms; obtain it from the source above.

## Reproducing a result
Every experiment is a self-contained job script. Example:

```bash
hf jobs uv run --flavor t4-medium --timeout 2h --secrets HF_TOKEN \
  -v hf://buckets/<your-bucket>:/mnt --detach jobs/<experiment>_job.py
```

Each script declares its inline dependencies (PEP 723), reads inputs from the mounted bucket,
writes a result JSON, and prints a `*_RESULT` line. The mapping from claims to scripts/artifacts is
in each `gate_reports/*.json` and the `research_v*/SUMMARY.md` files.

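
The script contract described above (inline PEP 723 metadata, inputs read from the mount, a result JSON, and a `*_RESULT` line) can be sketched as follows. Everything specific here is a hypothetical placeholder: the `EXAMPLE_RESULT` tag, the `inputs.json` and `example_result.json` file names, and the `mean` metric are illustrative assumptions, not the contents of any script in `jobs/`.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = []
# ///
"""Hypothetical skeleton of a self-contained job script (illustration only)."""
import json
import tempfile
from pathlib import Path


def run(mount: Path) -> dict:
    # Read inputs from the mounted bucket (file name is an assumption).
    values = json.loads((mount / "inputs.json").read_text())["values"]
    result = {"n": len(values), "mean": sum(values) / len(values)}
    # Write the result JSON next to the inputs.
    (mount / "example_result.json").write_text(json.dumps(result, indent=2))
    # Emit a single machine-readable line that can be found in the job logs.
    print("EXAMPLE_RESULT " + json.dumps(result, sort_keys=True))
    return result


# Demonstration against a temporary directory standing in for the mount.
with tempfile.TemporaryDirectory() as tmp:
    demo_mount = Path(tmp)
    (demo_mount / "inputs.json").write_text(json.dumps({"values": [1.0, 2.0, 3.0]}))
    demo = run(demo_mount)
```

In a real job the temporary directory would be replaced by the mount point passed with `-v` (here `/mnt`), and the `dependencies` list would name the packages the experiment needs so that `uv` can resolve them at launch.
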
## Citing this repository
A DOI for the archival snapshot is available via the repository's **Settings → Generate DOI** on the
Hugging Face Hub; cite that DOI in the manuscript's Data Availability Statement.