# TCGA & IMPACT Genomic Biomarker WSI Training Checkpoints
This repository hosts the full set of 200th-epoch classification checkpoints
used for genomic biomarker prediction across TCGA and IMPACT cohorts.
Checkpoints are organized strictly by:
- Dataset source (`TCGA` or `IMPACT`)
- Tumor type (e.g., `HNSC`, `UCS`, `BRCA`)
- Gene (e.g., `PIK3CA`, `FBXW7`, `BRAF`)
- Encoder (e.g., `virchow`, `gigapath_ft`)
- Data split index (`split_1`, `split_2`, ...)
---
## Repository Structure
The exact directory layout in this Hugging Face repo is:
```text
TCGA_Genomic_Biomarker_WSI_Training/
├── TCGA/
│   └── checkpoints/
│       └── <TUMOR>/
│           └── <GENE>/
│               └── TCGA_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth
│
└── IMPACT/
    └── checkpoints/
        └── <TUMOR>/
            └── <GENE>/
                └── IMPACT_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth
```
### Examples
```text
TCGA/checkpoints/HNSC/PIK3CA/
    TCGA_trained_HNSC_PIK3CA_virchow_gma_1_200.pth
    TCGA_trained_HNSC_PIK3CA_virchow_gma_2_200.pth
    TCGA_trained_HNSC_PIK3CA_gigapath_ft_gma_1_200.pth

IMPACT/checkpoints/UCS/FBXW7/
    IMPACT_trained_UCS_FBXW7_virchow_gma_1_200.pth
    IMPACT_trained_UCS_FBXW7_gigapath_ft_gma_2_200.pth
```
Each checkpoint filename is self-descriptive:
```text
<SOURCE>_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth
```
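This pattern can be split back into its components. Below is a minimal sketch (the helper name `parse_ckpt_name` is my own, not part of this repo); it assumes tumor and gene codes never contain underscores, while the encoder name may (e.g. `gigapath_ft`), so the encoder is matched up to the `_gma_` marker:

```python
import re

# Matches <SOURCE>_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth.
# Tumor and gene are assumed underscore-free; encoder may contain underscores.
CKPT_RE = re.compile(
    r"^(?P<source>TCGA|IMPACT)_trained_"
    r"(?P<tumor>[^_]+)_(?P<gene>[^_]+)_"
    r"(?P<encoder>.+)_gma_(?P<split>\d+)_200\.pth$"
)

def parse_ckpt_name(filename):
    """Return the naming-convention fields of a checkpoint filename as a dict."""
    m = CKPT_RE.match(filename)
    if m is None:
        raise ValueError(f"not a recognized checkpoint name: {filename}")
    return m.groupdict()

# parse_ckpt_name("TCGA_trained_HNSC_PIK3CA_gigapath_ft_gma_1_200.pth")
# -> {'source': 'TCGA', 'tumor': 'HNSC', 'gene': 'PIK3CA',
#     'encoder': 'gigapath_ft', 'split': '1'}
```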
---
## Downloading
### 1. Clone with Git LFS (recommended)
```bash
git lfs install
git clone https://huggingface.co/chadvanderbilt/TCGA_Genomic_Biomarker_WSI_Training
cd TCGA_Genomic_Biomarker_WSI_Training
```
### 2. Download an individual checkpoint
```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="chadvanderbilt/TCGA_Genomic_Biomarker_WSI_Training",
    filename="TCGA/checkpoints/HNSC/PIK3CA/TCGA_trained_HNSC_PIK3CA_virchow_gma_1_200.pth",
)
print(ckpt_path)
```
---
## Checksum Logs (SHA256)
Each upload run writes a checksum log under:
```text
logs/checkpoint_checksums_YYYYMMDD_HHMMSS.json
```
Each entry in this JSON file includes:
- `source` (`TCGA` or `IMPACT`)
- `tumor`
- `gene`
- `encoder`
- `split`
- `remote_path` (path inside this repo)
- `size_bytes`
- `sha256`
- `timestamp`
These logs allow you to verify that your local copies of the checkpoints
match the originals used at upload time.
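A single record might look like the following. The field values here are illustrative placeholders, not taken from an actual log, and exact value types (e.g. whether `split` is a string or an integer) may differ:

```json
{
  "source": "TCGA",
  "tumor": "HNSC",
  "gene": "PIK3CA",
  "encoder": "virchow",
  "split": 1,
  "remote_path": "TCGA/checkpoints/HNSC/PIK3CA/TCGA_trained_HNSC_PIK3CA_virchow_gma_1_200.pth",
  "size_bytes": 123456789,
  "sha256": "<64-character hex digest>",
  "timestamp": "<upload timestamp>"
}
```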
---
## Verifying Checkpoints After Download
This repo includes a helper script `verify_checkpoints.py` for checksum verification.
### Usage
From the root of the cloned repo:
```bash
python verify_checkpoints.py logs/checkpoint_checksums_YYYYMMDD_HHMMSS.json
```
The script will:
1. Read the JSON log.
2. For each record, look up the file at `remote_path` under the repo root.
3. Recompute SHA256 and size.
4. Compare with the logged `sha256` and `size_bytes`.
Example output:
```text
OK       : 128
MISMATCH : 0
MISSING  : 0
```
- **OK**: the file exists and matches the logged checksum and size.
- **MISMATCH**: the file exists but its checksum or size does not match the log.
- **MISSING**: a file listed in the log is not present on disk.
The script exits with a non-zero status code if there are any mismatches or missing files.
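If you only want to spot-check a single checkpoint rather than run the full script, the same recomputation can be done by hand. A minimal stdlib sketch (the helper name `sha256_and_size` is my own; the commented-out path is illustrative):

```python
import hashlib


def sha256_and_size(path):
    """Return (hex SHA-256 digest, size in bytes) of one file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        while chunk := f.read(1024 * 1024):
            h.update(chunk)
            size += len(chunk)
    return h.hexdigest(), size


# Substitute any checkpoint path from your clone, then compare the results
# against the `sha256` and `size_bytes` fields of the matching log record:
# digest, size = sha256_and_size(
#     "TCGA/checkpoints/HNSC/PIK3CA/TCGA_trained_HNSC_PIK3CA_virchow_gma_1_200.pth"
# )
```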
---
## `verify_checkpoints.py`
For convenience, the expected content of `verify_checkpoints.py` is:
```python
import hashlib
import json
import sys
from pathlib import Path


def sha256_file(path, buf=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading in `buf`-sized chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(buf)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()


def main(log_json: str):
    log_file = Path(log_json)
    if not log_file.is_file():
        print(f"ERROR: log not found: {log_json}")
        sys.exit(1)

    with log_file.open() as f:
        records = json.load(f)

    # Files are resolved relative to the directory containing this script,
    # which is assumed to be the repo root.
    repo_root = Path(__file__).resolve().parent
    ok = mismatch = missing = 0

    for rec in records:
        remote_path = rec["remote_path"]
        expected_sha = rec["sha256"]
        expected_size = rec["size_bytes"]
        local_path = repo_root / remote_path

        if not local_path.exists():
            print(f"[MISSING] {remote_path}")
            missing += 1
            continue

        actual_size = local_path.stat().st_size
        actual_sha = sha256_file(local_path)

        if actual_sha == expected_sha and actual_size == expected_size:
            ok += 1
        else:
            mismatch += 1
            print(f"[MISMATCH] {remote_path}")
            print(f"  expected sha : {expected_sha}")
            print(f"  actual sha   : {actual_sha}")
            print(f"  expected size: {expected_size}")
            print(f"  actual size  : {actual_size}")

    print()
    print(f"OK       : {ok}")
    print(f"MISMATCH : {mismatch}")
    print(f"MISSING  : {missing}")

    if mismatch or missing:
        sys.exit(1)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python verify_checkpoints.py logs/checkpoint_checksums_YYYYMMDD_HHMMSS.json")
        sys.exit(1)
    main(sys.argv[1])
```
You can either copy this script into your local clone, or use the version
shipped directly in the repository (if present).
---
license: mit
---