---
license: mit
language:
- en
base_model:
- arcinstitute/evo2_1b_base
---

# Model Card for EVHost

> **Status: Pre-publication release.** Author list, contact email, and paper link are placeholders and will be updated upon paper acceptance. See [CITATION.md](CITATION.md) for the current placeholder citation.

This is the model card for the EVHost fusion classifier that pairs with the Evo2 evolutionary language model for viral host prediction. The checkpoint file `evhost_best.pt` contains the trained `FusionClassifier` weights and all hyperparameters required for inference.

## Model Details

- **Model name:** EVHost (fusion classifier)
- **Architecture:** `FusionClassifier`, a multi-layer perceptron that fuses a 1920-dim Evo2 embedding (post-projection) with 211-dim hand-crafted genomic features (CUB, dinucleotide, CPB, AA frequency, host adaptation, zoonotic) through a 512-dim hidden layer.
- **Parameters:** ~2.5M trainable (fusion head only; Evo2 backbone is frozen at inference and not part of this checkpoint)
- **Checkpoint size:** ~500 MB
- **Framework:** PyTorch 2.7.0
- **License:** MIT

The 211-dim genomic-feature vector is the post-CPB-compression representation fed to the fusion MLP. Pre-CPB-compression dimensionality is 1149 (CUB 64 + dinuc 16 + CPB 256 + AA 20 + bridge-dinuc 16 + adaptation 24 + zoonotic 7 + Evo2-projection 512 = 1915 → fused 1149). See `src/evhost/models/fusion.py` for the implementation.

## Intended Use

### Primary use

- **Research**: predict the human-host probability of a viral genome sequence given pre-computed Evo2 embeddings and genomic features. Designed for viral surveillance, zoonotic-risk screening, and pandemic-preparedness research.

### Out-of-scope use

- **Not a clinical diagnostic tool.** Do not use for patient-level decision-making or pathogen identification in clinical workflows.
- **Not validated for non-viral sequences** (bacteria, plasmids, host genomes).
- **Not validated for sequences substantially diverged from NCBI/VHDB reference taxa** used during training (coronaviruses, influenza, rabies, etc.).

## How to Use

### Option 1: Direct download (curl/wget)

```bash
mkdir -p models
curl -L "https://huggingface.co/Adorably/EVHost/resolve/main/evhost_best.pt" -o models/evhost_best.pt
```

### Option 2: Python via `huggingface_hub`

```python
from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download(
    repo_id="Adorably/EVHost",
    filename="evhost_best.pt",
    local_dir="models",
)
checkpoint = torch.load(ckpt_path, map_location="cpu", weights_only=False)
```

### Loading and running inference

```python
import torch
from evhost.models import FusionClassifier

checkpoint = torch.load("models/evhost_best.pt", map_location="cpu", weights_only=False)

# Rebuild the classifier from the hyperparameters stored in the checkpoint.
model = FusionClassifier(
    d_evo=checkpoint["d_evo"],
    k_bio=checkpoint["k_bio"],
    d_bio=checkpoint["d_bio"],
    fusion_hidden=checkpoint["fusion_hidden"],
    evo_reduced_dim=checkpoint["evo_reduced_dim"],
    cpb_dim=checkpoint["cpb_dim"],
    cpb_compressed_dim=checkpoint["cpb_compressed_dim"],
).to("cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Inputs are pre-computed tensors and are not defined in this snippet:
#   `embedding`: 1920-dim Evo2 vector
#   `cpb`:       256-dim codon-pair-bias vector (before compression)
#   `non_cpb`:   concatenated vector of the remaining (non-CPB) genomic features
# The 211-dim genomic total quoted above is the size after CPB compression.
with torch.no_grad():
    logit = model(embedding, cpb, non_cpb)
    probability = torch.sigmoid(logit).item()
```

For a full end-to-end demo (FASTA → embedding + features → prediction), see the GitHub repository: `examples/simple_prediction.py` and `scripts/predict_host.py`.

## Training Data

- **Source corpora:** NCBI Virus + VHDB + BERT-Infect.
- **Total sequences:** 128,761 viral genome records across 26 families.
- **Labels:** 64,680 labeled-positive (`Homo sapiens` host records); 64,081 unlabeled (all other hosts).
- **Observed positive fraction:** 50.2%; estimated true positive class prior π̂ = 0.62 (95% CI 0.59–0.65).
- **Split:** date-based: <2015 train, 2015–2018 validation, >2018 prediction. Sarbecovirus sequences were excluded from training and used only for prediction.
- **Data redistribution:** the original training data is **not** redistributed with this checkpoint. Users must obtain it from the source databases per their respective licenses.

## Training Procedure

- **Backbone:** Evo2 (1B-parameter), fine-tuned on viral sequences for ~20 hours on a single NVIDIA H100 (4,688 optimization steps, sequence length 8,192, effective batch size 32, peak LR 1.5×10⁻⁵, min LR 1.5×10⁻⁶, 400 warm-up steps). The fine-tuned Evo2 backbone is **not** included in this checkpoint; see the [Evo2 project](https://www.arcinstitute.org/) for backbone weights.
- **Fusion head:** trained for 40 epochs (final 15 used for checkpoint selection) with batch size 64, learning rates 3×10⁻⁵ (Evo2 projection) / 1×10⁻⁴ (fusion head) / 3×10⁻⁴ (BioMLP), weight decay 5×10⁻³, gradient clip 0.65.
- **Objective:** positive-unlabeled (PU) learning with `nnPU` correction (β = 0.1). An architecture-matched binary-cross-entropy variant, EVHost (BC), is also trained for direct comparison with baselines.
- **Reproducibility:** see the [GitHub repository](https://github.com/Adorably/EVHost) (training scripts under `scripts/train/`, configuration under `configs/`).

## Evaluation

Performance below is on the post-2018 held-out prediction set, using host-isolate labels as the reference. Recall / precision / F1 / ROC-AUC values are taken from Figure 2 of the paper; species-level detection rates are taken from Figure 4a.
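Precision, recall, and F1 are threshold-based summaries of the predicted human-host probabilities. As a rough sketch of how the three relate, the snippet below computes them from probabilities and reference labels. The function name `binary_metrics`, the 0.5 threshold, and the toy inputs are illustrative assumptions; they are not the evaluation code or settings used for the paper.

```python
def binary_metrics(probabilities, labels, threshold=0.5):
    """Precision, recall and F1 for the positive (human-host) class.

    `threshold` is an assumed cut-off; the model card does not state
    the one used for the reported numbers.
    """
    predicted = [p >= threshold for p in probabilities]
    pairs = list(zip(predicted, labels))
    tp = sum(1 for y_hat, y in pairs if y_hat and y)
    fp = sum(1 for y_hat, y in pairs if y_hat and not y)
    fn = sum(1 for y_hat, y in pairs if not y_hat and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example with made-up values (not data from the paper).
toy_probabilities = [0.91, 0.72, 0.65, 0.30, 0.08]
toy_labels = [1, 1, 0, 1, 0]
metrics = binary_metrics(toy_probabilities, toy_labels)
```

Lowering the threshold generally raises recall and lowers precision. ROC-AUC is threshold-free and is not computed in this sketch.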
### Sequence-level metrics

| Model | F1 | Recall | Precision | ROC-AUC |
|-------|-----|--------|-----------|---------|
| **EVHost (PU)** | 0.789 | **0.994** | 0.665 | 0.953 |
| EVHost (BC) | **0.901** | 0.934 | 0.870 | 0.958 |
| BERT-infect (DNABERT) | 0.891 | 0.956 | 0.834 | - |
| BERT-infect (VIBE) | 0.813 | 0.955 | 0.706 | - |
| DeePaC-vir | 0.724 | 0.663 | 0.798 | - |
| Zoonotic rank | 0.928 | 0.955 | 0.903 | - |
| BLAST | 0.844 | 0.942 | 0.764 | - |
| kNN | 0.893 | 0.929 | 0.860 | - |

**Notes:**

- EVHost (PU) trades precision for recall by design. Many of its "false positives" under incomplete host-isolate labels map to species with independent human-host evidence in VHDB, so the reported precision likely understates its true precision.
- EVHost (BC) reaches an F1 of 0.901, above every baseline in the table except Zoonotic rank (0.928), showing that the fusion architecture is strong even without PU learning.

## Limitations and Biases

- **Host label noise.** Host-isolate labels from NCBI/VHDB are incomplete and biased toward well-studied zoonoses. Both EVHost variants may under-predict human host probability for viruses with sparse surveillance records (especially non-mammalian reservoirs).
- **Taxonomic coverage.** Trained on 26 viral families; performance on families outside this set (e.g., bacteriophages, plant viruses) is unvalidated.
- **Length dependence.** Evo2 embeddings are computed at 8,192-token context. Sequences substantially shorter or longer are truncated/aggregated by the upstream embedding script, which may degrade performance.
- **Temporal drift.** The post-2018 test set is used for headline metrics, but viral evolution outpaces any fixed model. Periodic re-evaluation is required.
- **Confirmation bias from PU learning.** The PU-trained variant optimistically assumes the unlabeled set is mostly negative; for surveillance use, prefer the (BC) variant if false-positive cost is high.
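The PU objective behind the last limitation is given under Training Procedure as `nnPU` with β = 0.1. As a minimal sketch of the standard non-negative PU risk estimator, the snippet below evaluates the risk on plain Python floats so the arithmetic is easy to follow. The function names, the sigmoid surrogate loss, the use of π̂ = 0.62 as the class prior, and the reading of β as the threshold that triggers the correction step are all assumptions; the training code in the repository may differ.

```python
import math


def sigmoid(z):
    """Logistic function, used here as the (assumed) surrogate loss."""
    return 1.0 / (1.0 + math.exp(-z))


def nnpu_risk(logits, is_labeled_positive, prior, beta=0.1):
    """Non-negative PU risk for one batch of raw classifier logits.

    Returns (risk, negative_risk, needs_correction).
    """
    pos = [z for z, p in zip(logits, is_labeled_positive) if p]
    unl = [z for z, p in zip(logits, is_labeled_positive) if not p]
    if not pos or not unl:
        raise ValueError("need labeled-positive and unlabeled examples")

    def mean(values):
        return sum(values) / len(values)

    # Sigmoid loss: predicting "+1" costs sigmoid(-z), predicting "-1" costs sigmoid(z).
    risk_pos_as_pos = mean([sigmoid(-z) for z in pos])
    risk_pos_as_neg = mean([sigmoid(z) for z in pos])
    risk_unl_as_neg = mean([sigmoid(z) for z in unl])

    # Negative-class risk estimated from unlabeled data, minus the share
    # attributed to positives hidden in the unlabeled set.
    negative_risk = risk_unl_as_neg - prior * risk_pos_as_neg
    # Clamping at zero is the "non-negative" part of nnPU.
    risk = prior * risk_pos_as_pos + max(0.0, negative_risk)
    # Below -beta, nnPU takes a corrective step on the negative term instead.
    needs_correction = negative_risk < -beta
    return risk, negative_risk, needs_correction


# Toy batch with made-up logits (not data from the paper).
toy_logits = [3.0, 2.0, -2.0, -3.0]
toy_is_positive = [True, True, False, False]
risk, negative_risk, needs_correction = nnpu_risk(
    toy_logits, toy_is_positive, prior=0.62
)
```

In a training loop, a set `needs_correction` flag would switch the update to the negated negative-risk term rather than the full risk; that gradient logic is omitted here.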
## How to Cite

See [CITATION.md](https://github.com/Adorably/EVHost/blob/main/CITATION.md) in the repository for the current (placeholder) citation. The canonical citation will be added upon paper acceptance.

## License

MIT. See the [LICENSE](https://github.com/Adorably/EVHost/blob/main/LICENSE) file in the repository.

## Acknowledgments

- **Evo2** model by the Arc Institute.
- **NCBI Virus** and **VHDB** for viral genome and host metadata.
- **BERT-Infect** dataset authors.