---
library_name: pytorch
tags:
- multimodal
- histology
- codex
- retrieval
license: other
---
# Haiku: Trimodal (CODEX + H&E + Text) Retrieval Model

This repo bundles a fine-tuned Haiku checkpoint together with the tokenizer
and marker assets needed to run inference without any additional downloads
from `xiangjx/musk` or `microsoft/BiomedNLP-BiomedBERT-*`.

## Contents

- `haiku_state_dict.pt`: model weights (CODEX + H&E + Text encoders + projections)
- `config.json`: architecture config + marker lists
- `tokenizer/`: BiomedBERT tokenizer files (+ BERT config)
- `esm_embeddings/`: per-biomarker ESM embeddings (also embedded in the state dict; kept here for downstream use)
- `vocab.pkl`: marker vocabulary
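
If you work from a local copy of the repo (for example one fetched with `huggingface_hub.snapshot_download`), a quick check that every asset listed above is present can save a confusing load error later. The following is a minimal sketch; `EXPECTED_ASSETS` and `missing_assets` are hypothetical helper names, not part of the Haiku code, and the file list is copied from the Contents section.

```python
from pathlib import Path

# Entries copied from the Contents list; a trailing "/" marks a directory.
EXPECTED_ASSETS = [
    "haiku_state_dict.pt",
    "config.json",
    "tokenizer/",
    "esm_embeddings/",
    "vocab.pkl",
]


def missing_assets(repo_dir):
    """Return the expected entries that are absent from repo_dir (hypothetical helper)."""
    root = Path(repo_dir)
    missing = []
    for entry in EXPECTED_ASSETS:
        path = root / entry.rstrip("/")
        # Directories and files are checked separately so that a stray file
        # named "tokenizer" does not count as the tokenizer directory.
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing
```

Calling `missing_assets("/path/to/local/copy")` returns an empty list when the copy is complete; the path here is a placeholder.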

## Quick start

```python
from models import Haiku

# Returns the model together with its tokenizer and marker embeddings.
model, tokenizer, marker_embedding = Haiku.from_pretrained(
    "zhihuanglab/Haiku",
    device="cuda",
    token="hf_...",  # omit if HF_TOKEN is set or you have run `hf auth login`
)
model.eval()  # switch to inference mode
```
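
This README does not document the encoder methods, so the sketch below does not call the model. It only illustrates the retrieval step that typically follows encoding, assuming (this is an assumption, not something stated above) that the projections place all modalities in one shared embedding space where candidates can be ranked by cosine similarity. The vectors are placeholders, and `cosine` and `top_k` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, candidates, k=3):
    """Return (index, score) pairs for the k candidates most similar to query."""
    scored = [(i, cosine(query, c)) for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Placeholder vectors standing in for embeddings; not real model outputs.
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
ranking = top_k(query, candidates, k=2)
```

With real embeddings you would usually do the same computation with batched tensor operations rather than Python lists.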