# TCGA & IMPACT Genomic Biomarker WSI Training Checkpoints

This repository hosts the full set of 200th-epoch classification checkpoints
used for genomic biomarker prediction across TCGA and IMPACT cohorts.

Checkpoints are organized strictly by:

- Dataset source (`TCGA` or `IMPACT`)
- Tumor type (e.g., `HNSC`, `UCS`, `BRCA`)
- Gene (e.g., `PIK3CA`, `FBXW7`, `BRAF`)
- Encoder (e.g., `virchow`, `gigapath_ft`)
- Data split index (`split_1`, `split_2`, ...; written as a bare number such as `1` or `2` in filenames)

---

## Repository Structure

The exact directory layout in this Hugging Face repo is:

```text
TCGA_Genomic_Biomarker_WSI_Training/
├── TCGA/
│   └── checkpoints/
│       └── <TUMOR>/
│           └── <GENE>/
│               └── TCGA_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth
│
└── IMPACT/
    └── checkpoints/
        └── <TUMOR>/
            └── <GENE>/
                └── IMPACT_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth
```

### Examples

```text
TCGA/checkpoints/HNSC/PIK3CA/
    TCGA_trained_HNSC_PIK3CA_virchow_gma_1_200.pth
    TCGA_trained_HNSC_PIK3CA_virchow_gma_2_200.pth
    TCGA_trained_HNSC_PIK3CA_gigapath_ft_gma_1_200.pth

IMPACT/checkpoints/UCS/FBXW7/
    IMPACT_trained_UCS_FBXW7_virchow_gma_1_200.pth
    IMPACT_trained_UCS_FBXW7_gigapath_ft_gma_2_200.pth
```

Each checkpoint filename is self-descriptive:

```text
<SOURCE>_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth
```

---

## Downloading

### 1. Clone with Git LFS (recommended)

```bash
git lfs install
git clone https://huggingface.co/chadvanderbilt/TCGA_Genomic_Biomarker_WSI_Training
cd TCGA_Genomic_Biomarker_WSI_Training
```

### 2. Download an individual checkpoint

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="chadvanderbilt/TCGA_Genomic_Biomarker_WSI_Training",
    filename="TCGA/checkpoints/HNSC/PIK3CA/TCGA_trained_HNSC_PIK3CA_virchow_gma_1_200.pth"
)
print(ckpt_path)
```
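
To avoid typing long paths by hand, the `filename` argument can be assembled from its components. This is a small sketch based on the layout described above; `checkpoint_repo_path` is a hypothetical helper, not something shipped in this repo, and it does not check that the resulting file actually exists.

```python
def checkpoint_repo_path(source: str, tumor: str, gene: str, encoder: str, split: int) -> str:
    """Build the repo-relative path of a checkpoint from its components."""
    if source not in ("TCGA", "IMPACT"):
        raise ValueError(f"unknown source: {source!r}")
    # Filename pattern: <SOURCE>_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth
    name = f"{source}_trained_{tumor}_{gene}_{encoder}_gma_{split}_200.pth"
    return f"{source}/checkpoints/{tumor}/{gene}/{name}"


print(checkpoint_repo_path("TCGA", "HNSC", "PIK3CA", "virchow", 1))
```

The returned string can be passed as `filename` to `hf_hub_download`. Not every tumor, gene, encoder, and split combination is necessarily present, so a failed download may simply mean that checkpoint was never uploaded.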

---

## Checksum Logs (SHA256)

Each upload run writes a checksum log under:

```text
logs/checkpoint_checksums_YYYYMMDD_HHMMSS.json
```

Each entry in this JSON file includes:

- `source` (`TCGA` or `IMPACT`)
- `tumor`
- `gene`
- `encoder`
- `split`
- `remote_path` (path inside this repo)
- `size_bytes`
- `sha256`
- `timestamp`

These logs allow you to verify that your local copies of the checkpoints
match the originals used at upload time.
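
As an illustration of how a log might be consumed, the sketch below indexes records by `remote_path` and checks that each record carries the fields listed above. It assumes the log is a JSON list of objects. The sample record uses placeholder values only, and `index_by_remote_path` is a hypothetical helper.

```python
import json

# Field names as listed in this section.
EXPECTED_FIELDS = {
    "source", "tumor", "gene", "encoder", "split",
    "remote_path", "size_bytes", "sha256", "timestamp",
}


def index_by_remote_path(records):
    """Map remote_path -> record, raising if a record lacks an expected field."""
    index = {}
    for rec in records:
        missing = EXPECTED_FIELDS - rec.keys()
        if missing:
            raise KeyError(f"record is missing fields: {sorted(missing)}")
        index[rec["remote_path"]] = rec
    return index


# Hypothetical log content: every value here is a placeholder, not real data.
example_log = json.loads("""
[
  {
    "source": "TCGA",
    "tumor": "HNSC",
    "gene": "PIK3CA",
    "encoder": "virchow",
    "split": "PLACEHOLDER",
    "remote_path": "TCGA/checkpoints/HNSC/PIK3CA/TCGA_trained_HNSC_PIK3CA_virchow_gma_1_200.pth",
    "size_bytes": 0,
    "sha256": "0000000000000000000000000000000000000000000000000000000000000000",
    "timestamp": "PLACEHOLDER"
  }
]
""")

index = index_by_remote_path(example_log)
print(sorted(index))
```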

---

## Verifying Checkpoints After Download

This repo includes a helper script `verify_checkpoints.py` for checksum verification.

### Usage

From the root of the cloned repo:

```bash
python verify_checkpoints.py logs/checkpoint_checksums_YYYYMMDD_HHMMSS.json
```

The script will:

1. Read the JSON log.
2. For each record, look up the file at `remote_path` under the repo root.
3. Recompute SHA256 and size.
4. Compare with the logged `sha256` and `size_bytes`.

Example output:

```text
OK       : 128
MISMATCH : 0
MISSING  : 0
```

- **OK** – file exists and matches checksum and size.
- **MISMATCH** – file exists but checksum or size does not match the log.
- **MISSING** – file listed in the log is not present on disk.

The script exits with a non-zero status code if there are any mismatches or missing files.
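
The script resolves paths relative to the repo root, so it expects a cloned layout. For a single file stored elsewhere (for example, one fetched with `hf_hub_download`), the same comparison can be sketched as below. `check_file` is a hypothetical helper, and the demo uses a throwaway temporary file rather than a real checkpoint.

```python
import hashlib
import tempfile
from pathlib import Path


def check_file(local_path, record, buf=1024 * 1024):
    """Return "OK", "MISMATCH", or "MISSING" for one file against one log record."""
    path = Path(local_path)
    if not path.is_file():
        return "MISSING"
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large checkpoints are not loaded into memory at once.
        for chunk in iter(lambda: f.read(buf), b""):
            h.update(chunk)
    same = (
        h.hexdigest() == record["sha256"]
        and path.stat().st_size == record["size_bytes"]
    )
    return "OK" if same else "MISMATCH"


# Demo with a temporary file standing in for a checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"abc")
    record = {"sha256": hashlib.sha256(b"abc").hexdigest(), "size_bytes": 3}
    print(check_file(demo, record))
    print(check_file(Path(tmp) / "absent.bin", record))
```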

---

## `verify_checkpoints.py`

For convenience, the expected content of `verify_checkpoints.py` is:

```python
import hashlib
import json
import sys
from pathlib import Path


def sha256_file(path, buf=1024 * 1024):
    """Return the SHA256 hex digest of a file, reading it in `buf`-byte chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(buf)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()


def main(log_json: str):
    log_file = Path(log_json)
    if not log_file.is_file():
        print(f"ERROR: log not found: {log_json}")
        sys.exit(1)

    # The log is a JSON list of records, one per uploaded checkpoint.
    with log_file.open() as f:
        records = json.load(f)

    # Paths in the log are relative to the repo root, where this script lives.
    repo_root = Path(__file__).resolve().parent

    ok = mismatch = missing = 0

    for rec in records:
        remote_path = rec["remote_path"]
        expected_sha = rec["sha256"]
        expected_size = rec["size_bytes"]

        local_path = repo_root / remote_path

        if not local_path.exists():
            print(f"[MISSING] {remote_path}")
            missing += 1
            continue

        actual_size = local_path.stat().st_size
        actual_sha = sha256_file(local_path)

        # A file counts as OK only if both the checksum and the size match.
        if actual_sha == expected_sha and actual_size == expected_size:
            ok += 1
        else:
            mismatch += 1
            print(f"[MISMATCH] {remote_path}")
            print(f"  expected sha : {expected_sha}")
            print(f"  actual sha   : {actual_sha}")
            print(f"  expected size: {expected_size}")
            print(f"  actual size  : {actual_size}")

    # Summary counts.
    print()
    print(f"OK       : {ok}")
    print(f"MISMATCH : {mismatch}")
    print(f"MISSING  : {missing}")

    # Non-zero exit status signals failure to calling scripts.
    if mismatch or missing:
        sys.exit(1)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python verify_checkpoints.py logs/checkpoint_checksums_YYYYMMDD_HHMMSS.json")
        sys.exit(1)
    main(sys.argv[1])
```

You can either copy this script into your local clone, or use the version
shipped directly in the repository (if present).


---
license: mit
---