# TCGA & IMPACT Genomic Biomarker WSI Training Checkpoints
This repository hosts the full set of 200th-epoch classification checkpoints
used for genomic biomarker prediction across TCGA and IMPACT cohorts.
Checkpoints are organized strictly by:
- Dataset source (`TCGA` or `IMPACT`)
- Tumor type (e.g., `HNSC`, `UCS`, `BRCA`)
- Gene (e.g., `PIK3CA`, `FBXW7`, `BRAF`)
- Encoder (e.g., `virchow`, `gigapath_ft`)
- Data split index (`split_1`, `split_2`, ...)
---
## Repository Structure
The exact directory layout in this Hugging Face repo is:
```text
TCGA_Genomic_Biomarker_WSI_Training/
├── TCGA/
│   └── checkpoints/
│       └── <TUMOR>/
│           └── <GENE>/
│               └── TCGA_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth
│
└── IMPACT/
    └── checkpoints/
        └── <TUMOR>/
            └── <GENE>/
                └── IMPACT_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth
```
### Examples
```text
TCGA/checkpoints/HNSC/PIK3CA/
    TCGA_trained_HNSC_PIK3CA_virchow_gma_1_200.pth
    TCGA_trained_HNSC_PIK3CA_virchow_gma_2_200.pth
    TCGA_trained_HNSC_PIK3CA_gigapath_ft_gma_1_200.pth

IMPACT/checkpoints/UCS/FBXW7/
    IMPACT_trained_UCS_FBXW7_virchow_gma_1_200.pth
    IMPACT_trained_UCS_FBXW7_gigapath_ft_gma_2_200.pth
```
Each checkpoint filename is self-descriptive:
```text
<SOURCE>_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth
```
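This pattern can be split back into its components. Below is a minimal sketch (the helper name `parse_ckpt_name` is my own, not part of this repo); it assumes tumor and gene codes never contain underscores, while the encoder name may (e.g. `gigapath_ft`), so the encoder is matched up to the `_gma_` marker:

```python
import re

# Matches <SOURCE>_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth.
# Tumor and gene are assumed underscore-free; encoder may contain underscores.
CKPT_RE = re.compile(
    r"^(?P<source>TCGA|IMPACT)_trained_"
    r"(?P<tumor>[^_]+)_(?P<gene>[^_]+)_"
    r"(?P<encoder>.+)_gma_(?P<split>\d+)_200\.pth$"
)

def parse_ckpt_name(filename):
    """Return the naming-convention fields of a checkpoint filename as a dict."""
    m = CKPT_RE.match(filename)
    if m is None:
        raise ValueError(f"not a recognized checkpoint name: {filename}")
    return m.groupdict()

# parse_ckpt_name("TCGA_trained_HNSC_PIK3CA_gigapath_ft_gma_1_200.pth")
# -> {'source': 'TCGA', 'tumor': 'HNSC', 'gene': 'PIK3CA',
#     'encoder': 'gigapath_ft', 'split': '1'}
```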
---
## Downloading
### 1. Clone with Git LFS (recommended)
```bash
git lfs install
git clone https://huggingface.co/chadvanderbilt/TCGA_Genomic_Biomarker_WSI_Training
cd TCGA_Genomic_Biomarker_WSI_Training
```
### 2. Download an individual checkpoint
```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="chadvanderbilt/TCGA_Genomic_Biomarker_WSI_Training",
    filename="TCGA/checkpoints/HNSC/PIK3CA/TCGA_trained_HNSC_PIK3CA_virchow_gma_1_200.pth",
)
print(ckpt_path)
```
---
## Checksum Logs (SHA256)
Each upload run writes a checksum log under:
```text
logs/checkpoint_checksums_YYYYMMDD_HHMMSS.json
```
Each entry in this JSON file includes:
- `source` (`TCGA` or `IMPACT`)
- `tumor`
- `gene`
- `encoder`
- `split`
- `remote_path` (path inside this repo)
- `size_bytes`
- `sha256`
- `timestamp`
These logs allow you to verify that your local copies of the checkpoints
match the originals used at upload time.
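A single record might look like the following. The field values here are illustrative placeholders, not taken from an actual log, and exact value types (e.g. whether `split` is a string or an integer) may differ:

```json
{
  "source": "TCGA",
  "tumor": "HNSC",
  "gene": "PIK3CA",
  "encoder": "virchow",
  "split": 1,
  "remote_path": "TCGA/checkpoints/HNSC/PIK3CA/TCGA_trained_HNSC_PIK3CA_virchow_gma_1_200.pth",
  "size_bytes": 123456789,
  "sha256": "<64-character hex digest>",
  "timestamp": "<upload timestamp>"
}
```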
---
## Verifying Checkpoints After Download
This repo includes a helper script `verify_checkpoints.py` for checksum verification.
### Usage
From the root of the cloned repo:
```bash
python verify_checkpoints.py logs/checkpoint_checksums_YYYYMMDD_HHMMSS.json
```
The script will:
1. Read the JSON log.
2. For each record, look up the file at `remote_path` under the repo root.
3. Recompute SHA256 and size.
4. Compare with the logged `sha256` and `size_bytes`.
Example output:
```text
OK       : 128
MISMATCH : 0
MISSING  : 0
```
- **OK**: the file exists and matches the logged checksum and size.
- **MISMATCH**: the file exists but its checksum or size does not match the log.
- **MISSING**: a file listed in the log is not present on disk.
The script exits with a non-zero status code if there are any mismatches or missing files.
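If you only want to spot-check a single checkpoint rather than run the full script, the same recomputation can be done by hand. A minimal stdlib sketch (the helper name `sha256_and_size` is my own; the commented-out path is illustrative):

```python
import hashlib


def sha256_and_size(path):
    """Return (hex SHA-256 digest, size in bytes) of one file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        while chunk := f.read(1024 * 1024):
            h.update(chunk)
            size += len(chunk)
    return h.hexdigest(), size


# Substitute any checkpoint path from your clone, then compare the results
# against the `sha256` and `size_bytes` fields of the matching log record:
# digest, size = sha256_and_size(
#     "TCGA/checkpoints/HNSC/PIK3CA/TCGA_trained_HNSC_PIK3CA_virchow_gma_1_200.pth"
# )
```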
---
## `verify_checkpoints.py`
For convenience, the expected content of `verify_checkpoints.py` is:
```python
import hashlib
import json
import sys
from pathlib import Path


def sha256_file(path, buf=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading in `buf`-sized chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(buf)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()


def main(log_json: str):
    log_file = Path(log_json)
    if not log_file.is_file():
        print(f"ERROR: log not found: {log_json}")
        sys.exit(1)

    with log_file.open() as f:
        records = json.load(f)

    # Files are resolved relative to the directory containing this script,
    # which is assumed to be the repo root.
    repo_root = Path(__file__).resolve().parent
    ok = mismatch = missing = 0

    for rec in records:
        remote_path = rec["remote_path"]
        expected_sha = rec["sha256"]
        expected_size = rec["size_bytes"]
        local_path = repo_root / remote_path

        if not local_path.exists():
            print(f"[MISSING] {remote_path}")
            missing += 1
            continue

        actual_size = local_path.stat().st_size
        actual_sha = sha256_file(local_path)

        if actual_sha == expected_sha and actual_size == expected_size:
            ok += 1
        else:
            mismatch += 1
            print(f"[MISMATCH] {remote_path}")
            print(f"  expected sha : {expected_sha}")
            print(f"  actual sha   : {actual_sha}")
            print(f"  expected size: {expected_size}")
            print(f"  actual size  : {actual_size}")

    print()
    print(f"OK       : {ok}")
    print(f"MISMATCH : {mismatch}")
    print(f"MISSING  : {missing}")

    if mismatch or missing:
        sys.exit(1)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python verify_checkpoints.py logs/checkpoint_checksums_YYYYMMDD_HHMMSS.json")
        sys.exit(1)
    main(sys.argv[1])
```
You can either copy this script into your local clone, or use the version
shipped directly in the repository (if present).
---
license: mit
---