# TCGA & IMPACT Genomic Biomarker WSI Training Checkpoints

This repository hosts the full set of 200th-epoch classification checkpoints used for genomic biomarker prediction across TCGA and IMPACT cohorts.

Checkpoints are organized strictly by:

- Dataset source (`TCGA` or `IMPACT`)
- Tumor type (e.g., `HNSC`, `UCS`, `BRCA`)
- Gene (e.g., `PIK3CA`, `FBXW7`, `BRAF`)
- Encoder (e.g., `virchow`, `gigapath_ft`)
- Data split index (`split_1`, `split_2`, ...)

---

## Repository Structure

The exact directory layout in this Hugging Face repo is:

```text
TCGA_Genomic_Biomarker_WSI_Training/
├── TCGA/
│   └── checkpoints/
│       └── <TUMOR>/
│           └── <GENE>/
│               └── TCGA_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth
└── IMPACT/
    └── checkpoints/
        └── <TUMOR>/
            └── <GENE>/
                └── IMPACT_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth
```

### Examples

```text
TCGA/checkpoints/HNSC/PIK3CA/
    TCGA_trained_HNSC_PIK3CA_virchow_gma_1_200.pth
    TCGA_trained_HNSC_PIK3CA_virchow_gma_2_200.pth
    TCGA_trained_HNSC_PIK3CA_gigapath_ft_gma_1_200.pth

IMPACT/checkpoints/UCS/FBXW7/
    IMPACT_trained_UCS_FBXW7_virchow_gma_1_200.pth
    IMPACT_trained_UCS_FBXW7_gigapath_ft_gma_2_200.pth
```

Each checkpoint filename is self-descriptive:

```text
<SOURCE>_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_200.pth
```

---

## Downloading

### 1. Clone with Git LFS (recommended)

```bash
git lfs install
git clone https://huggingface.co/chadvanderbilt/TCGA_Genomic_Biomarker_WSI_Training
cd TCGA_Genomic_Biomarker_WSI_Training
```

### 2. Download an individual checkpoint
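Because the naming convention is fully regular, the repo-relative path of any checkpoint can be assembled programmatically. A minimal sketch (the `checkpoint_path` helper is illustrative and not part of this repository):

```python
from pathlib import PurePosixPath


def checkpoint_path(source: str, tumor: str, gene: str,
                    encoder: str, split: int, epoch: int = 200) -> str:
    """Build the repo-relative path for a checkpoint following the
    <SOURCE>_trained_<TUMOR>_<GENE>_<ENCODER>_gma_<SPLIT>_<EPOCH>.pth convention."""
    name = f"{source}_trained_{tumor}_{gene}_{encoder}_gma_{split}_{epoch}.pth"
    return str(PurePosixPath(source) / "checkpoints" / tumor / gene / name)


print(checkpoint_path("TCGA", "HNSC", "PIK3CA", "virchow", 1))
# TCGA/checkpoints/HNSC/PIK3CA/TCGA_trained_HNSC_PIK3CA_virchow_gma_1_200.pth
```

The resulting string can be used directly as the `filename` argument when downloading an individual checkpoint from this repo.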
Use `hf_hub_download` from `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="chadvanderbilt/TCGA_Genomic_Biomarker_WSI_Training",
    filename="TCGA/checkpoints/HNSC/PIK3CA/TCGA_trained_HNSC_PIK3CA_virchow_gma_1_200.pth",
)
print(ckpt_path)
```

---

## Checksum Logs (SHA256)

Each upload run writes a checksum log under:

```text
logs/checkpoint_checksums_YYYYMMDD_HHMMSS.json
```

Each entry in this JSON file includes:

- `source` (`TCGA` or `IMPACT`)
- `tumor`
- `gene`
- `encoder`
- `split`
- `remote_path` (path inside this repo)
- `size_bytes`
- `sha256`
- `timestamp`

These logs allow you to verify that your local copies of the checkpoints match the originals used at upload time.

---

## Verifying Checkpoints After Download

This repo includes a helper script, `verify_checkpoints.py`, for checksum verification.

### Usage

From the root of the cloned repo:

```bash
python verify_checkpoints.py logs/checkpoint_checksums_YYYYMMDD_HHMMSS.json
```

The script will:

1. Read the JSON log.
2. For each record, look up the file at `remote_path` under the repo root.
3. Recompute the SHA256 and size.
4. Compare them with the logged `sha256` and `size_bytes`.

Example output:

```text
OK       : 128
MISMATCH : 0
MISSING  : 0
```

- **OK** – file exists and matches checksum and size.
- **MISMATCH** – file exists but checksum or size does not match the log.
- **MISSING** – file listed in the log is not present on disk.

The script exits with a non-zero status code if there are any mismatches or missing files.
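The upload-side script that writes these logs is not shipped here, but an entry in the documented format can be reproduced locally for your own files. A minimal sketch (the `log_entry` helper name is an assumption, not part of the repo):

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, buf: int = 1024 * 1024) -> str:
    # Stream the file in chunks so large checkpoints need not fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(buf):
            h.update(chunk)
    return h.hexdigest()


def log_entry(repo_root: Path, remote_path: str, source: str, tumor: str,
              gene: str, encoder: str, split: int) -> dict:
    # Mirror the fields documented for logs/checkpoint_checksums_*.json.
    p = repo_root / remote_path
    return {
        "source": source,
        "tumor": tumor,
        "gene": gene,
        "encoder": encoder,
        "split": split,
        "remote_path": remote_path,
        "size_bytes": p.stat().st_size,
        "sha256": sha256_file(p),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

A list of such dictionaries, serialized with `json.dump`, has the same shape that `verify_checkpoints.py` expects as input.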
---

## `verify_checkpoints.py`

For convenience, the expected content of `verify_checkpoints.py` is:

```python
import json
import hashlib
import sys
from pathlib import Path


def sha256_file(path, buf=1024 * 1024):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(buf)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()


def main(log_json: str):
    log_file = Path(log_json)
    if not log_file.is_file():
        print(f"ERROR: log not found: {log_json}")
        sys.exit(1)

    with log_file.open() as f:
        records = json.load(f)

    repo_root = Path(__file__).resolve().parent
    ok = mismatch = missing = 0

    for rec in records:
        remote_path = rec["remote_path"]
        expected_sha = rec["sha256"]
        expected_size = rec["size_bytes"]

        local_path = repo_root / remote_path
        if not local_path.exists():
            print(f"[MISSING] {remote_path}")
            missing += 1
            continue

        actual_size = local_path.stat().st_size
        actual_sha = sha256_file(local_path)

        if actual_sha == expected_sha and actual_size == expected_size:
            ok += 1
        else:
            mismatch += 1
            print(f"[MISMATCH] {remote_path}")
            print(f"  expected sha : {expected_sha}")
            print(f"  actual sha   : {actual_sha}")
            print(f"  expected size: {expected_size}")
            print(f"  actual size  : {actual_size}")

    print()
    print(f"OK       : {ok}")
    print(f"MISMATCH : {mismatch}")
    print(f"MISSING  : {missing}")

    if mismatch or missing:
        sys.exit(1)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python verify_checkpoints.py logs/checkpoint_checksums_YYYYMMDD_HHMMSS.json")
        sys.exit(1)
    main(sys.argv[1])
```

You can either copy this script into your local clone or use the version shipped directly in the repository (if present).

---
license: mit
---