| --- |
| license: mit |
| language: |
| - en |
| base_model: |
| - arcinstitute/evo2_1b_base |
| --- |
| # Model Card for EVHost |
|
|
| > **Status: Pre-publication release.** Author list, contact email, and paper link are placeholders and will be updated upon paper acceptance. See [CITATION.md](CITATION.md) for the current placeholder citation. |
|
|
| This is the model card for the EVHost fusion classifier that pairs with the Evo2 evolutionary language model for viral host prediction. The checkpoint file `evhost_best.pt` contains the trained `FusionClassifier` weights and all hyperparameters required for inference. |
|
|
| ## Model Details |
|
|
| - **Model name:** EVHost (fusion classifier) |
| - **Architecture:** `FusionClassifier`, a multi-layer perceptron that fuses a 1920-dim Evo2 embedding (post-projection) with 211-dim hand-crafted genomic features (codon usage bias (CUB), dinucleotide, codon-pair bias (CPB), amino-acid (AA) frequency, host adaptation, zoonotic) through a 512-dim hidden layer. |
| - **Parameters:** ~2.5M trainable (fusion head only; Evo2 backbone is frozen at inference and not part of this checkpoint) |
| - **Checkpoint size:** ~500 MB |
| - **Framework:** PyTorch 2.7.0 |
| - **License:** MIT |
|
|
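| The 211-dim genomic-feature vector is the post-CPB-compression representation fed to the fusion MLP. The pre-compression breakdown is reported as CUB 64 + dinuc 16 + CPB 256 + AA 20 + bridge-dinuc 16 + adaptation 24 + zoonotic 7 + Evo2-projection 512 = 1915 → fused 1149, with a stated pre-CPB-compression dimensionality of 1149. Note that the listed component sizes sum to 915 (403 genomic + 512 Evo2 projection), not 1915, so treat `src/evhost/models/fusion.py` as the authoritative source for the exact dimensions. |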
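
The dimension bookkeeping for the genomic features can be checked with a few lines of arithmetic. The sketch below assumes (the card does not state this explicitly) that only the 256-dim CPB block is compressed and that every other genomic block passes through unchanged. The names `GENOMIC_BLOCKS`, `pre_compression_dim`, and `compressed_cpb_width` are illustrative and do not come from the EVHost code base.

```python
# Component sizes of the hand-crafted genomic features, as listed in this card.
GENOMIC_BLOCKS = {
    "cub": 64,
    "dinuc": 16,
    "cpb": 256,
    "aa": 20,
    "bridge_dinuc": 16,
    "adaptation": 24,
    "zoonotic": 7,
}
POST_COMPRESSION_DIM = 211  # genomic-feature width fed to the fusion MLP


def pre_compression_dim(blocks):
    """Total genomic-feature width before any compression."""
    return sum(blocks.values())


def compressed_cpb_width(blocks, post_dim):
    """Width the CPB block shrinks to, ASSUMING only CPB is compressed."""
    non_cpb = sum(size for name, size in blocks.items() if name != "cpb")
    return post_dim - non_cpb


print(pre_compression_dim(GENOMIC_BLOCKS))                         # 403
print(compressed_cpb_width(GENOMIC_BLOCKS, POST_COMPRESSION_DIM))  # 64
```

Under that assumption the non-CPB features occupy 147 dims and the compressed CPB block 64 dims. The `cpb_dim` and `cpb_compressed_dim` values stored in the checkpoint remain the authoritative numbers.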
|
|
| ## Intended Use |
|
|
| ### Primary use |
| - **Research**: predict the human-host probability of a viral genome sequence given pre-computed Evo2 embeddings and genomic features. Designed for viral surveillance, zoonotic-risk screening, and pandemic-preparedness research. |
|
|
| ### Out-of-scope use |
| - **Not a clinical diagnostic tool.** Do not use for patient-level decision-making or pathogen identification in clinical workflows. |
| - **Not validated for non-viral sequences** (bacteria, plasmids, host genomes). |
| - **Not validated for sequences substantially diverged from NCBI/VHDB reference taxa** used during training (coronaviruses, influenza, rabies, etc.). |
|
|
| ## How to Use |
|
|
| ### Option 1: Direct download (curl/wget) |
|
|
| ```bash |
| mkdir -p models |
| curl -L "https://huggingface.co/Adorably/EVHost/resolve/main/evhost_best.pt" -o models/evhost_best.pt |
| ``` |
|
|
| ### Option 2: Python via `huggingface_hub` |
| ```python |
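| from huggingface_hub import hf_hub_download |
| import torch |
|
| # Fetch evhost_best.pt from the Hub into ./models. |
| ckpt_path = hf_hub_download( |
|     repo_id="Adorably/EVHost", |
|     filename="evhost_best.pt", |
|     local_dir="models", |
| ) |
| # weights_only=False unpickles arbitrary Python objects; only load checkpoints you trust. |
| checkpoint = torch.load(ckpt_path, map_location="cpu", weights_only=False) |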
| ``` |
| ### Loading and running inference |
| |
| ```python |
| import torch |
| from evhost.models import FusionClassifier |
|
| # weights_only=False unpickles arbitrary Python objects; only load checkpoints you trust. |
| checkpoint = torch.load("models/evhost_best.pt", map_location="cpu", weights_only=False) |
|
| # Rebuild the classifier from the hyperparameters stored in the checkpoint. |
| model = FusionClassifier( |
|     d_evo=checkpoint["d_evo"], |
|     k_bio=checkpoint["k_bio"], |
|     d_bio=checkpoint["d_bio"], |
|     fusion_hidden=checkpoint["fusion_hidden"], |
|     evo_reduced_dim=checkpoint["evo_reduced_dim"], |
|     cpb_dim=checkpoint["cpb_dim"], |
|     cpb_compressed_dim=checkpoint["cpb_compressed_dim"], |
| ).to("cpu") |
| model.load_state_dict(checkpoint["model_state_dict"]) |
| model.eval() |
|
| # `embedding`, `cpb`, and `non_cpb` must be supplied by the caller as tensors: |
| # `embedding` is a 1920-dim Evo2 vector, `cpb` is a 256-dim codon-pair-bias vector, |
| # and `non_cpb` holds the remaining genomic features (211 minus the compressed CPB width). |
| with torch.no_grad(): |
|     logit = model(embedding, cpb, non_cpb) |
|     probability = torch.sigmoid(logit).item()  # human-host probability for one sequence |
| ``` |
|
|
| For a full end-to-end demo (FASTA → embedding + features → prediction), see the GitHub repository: `examples/simple_prediction.py` and `scripts/predict_host.py`. |
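
The final step of the inference snippet maps a raw logit to a probability with a sigmoid. Here is a dependency-free sketch of that step, useful when post-processing saved logits without PyTorch. The function names are illustrative, and the 0.5 decision threshold is a hypothetical default: the card does not specify an operating threshold.

```python
import math


def logit_to_probability(logit: float) -> float:
    """Numerically stable sigmoid: avoids overflow for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def call_human_host(logit: float, threshold: float = 0.5) -> bool:
    """Binary call from a logit; threshold=0.5 is an assumption, not a published value."""
    return logit_to_probability(logit) >= threshold


for raw in (-2.0, 0.0, 2.0):
    print(raw, round(logit_to_probability(raw), 4), call_human_host(raw))
```

For surveillance screening, a lower threshold favours recall and a higher one favours precision. Any value used in practice should be chosen on the validation split.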
|
|
| ## Training Data |
|
|
| - **Source corpora:** NCBI Virus + VHDB + BERT-Infect. |
| - **Total sequences:** 128,761 viral genome records across 26 families. |
| - **Labels:** 64,680 labeled-positive (`Homo sapiens` host records); 64,081 unlabeled (all other hosts). |
| - **Observed positive fraction:** 50.2%; estimated true positive class prior π̂ = 0.62 (95% CI 0.59–0.65). |
| - **Split:** date-based: <2015 train, 2015–2018 validation, >2018 prediction. Sarbecovirus sequences were excluded from training and used only for prediction. |
| - **Data redistribution:** the original training data is **not** redistributed with this checkpoint. Users must obtain it from the source databases per their respective licenses. |
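
The date-based split described above can be sketched as a small routing function. This illustrates the stated rule and is not the project's preprocessing code: the name `assign_split` and its `year` / `is_sarbecovirus` arguments are hypothetical, and the card does not say how records without a collection date are handled.

```python
def assign_split(year: int, is_sarbecovirus: bool = False) -> str:
    """Route a record to train / validation / prediction by collection year."""
    # Sarbecovirus sequences are held out of training and used only for prediction.
    if is_sarbecovirus:
        return "prediction"
    if year < 2015:
        return "train"
    if year <= 2018:
        return "validation"
    return "prediction"


# Label counts reported in this card.
n_positive, n_unlabeled = 64_680, 64_081
observed_positive_fraction = n_positive / (n_positive + n_unlabeled)
print(f"{observed_positive_fraction:.1%}")  # 50.2%
```

Splitting by date rather than at random keeps later isolates out of training, so the post-2018 metrics reflect prediction on sequences the model could not have seen.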
|
|
| ## Training Procedure |
|
|
| - **Backbone:** Evo2 (1B-parameter), fine-tuned on viral sequences for ~20 hours on a single NVIDIA H100 (4,688 optimization steps, sequence length 8,192, effective batch size 32, peak LR 1.5×10⁻⁵, min LR 1.5×10⁻⁶, 400 warm-up steps). The fine-tuned Evo2 backbone is **not** included in this checkpoint; see the [Evo2 project](https://www.arcinstitute.org/) for backbone weights. |
| - **Fusion head:** trained for 40 epochs (final 15 used for checkpoint selection) with batch size 64, learning rates 3×10⁻⁵ (Evo2 projection) / 1×10⁻⁴ (fusion head) / 3×10⁻⁴ (BioMLP), weight decay 5×10⁻³, gradient clip 0.65. |
| - **Objective:** positive-unlabeled (PU) learning with `nnPU` correction (β = 0.1). An architecture-matched binary-cross-entropy variant, EVHost (BC), is also trained for direct comparison with baselines. |
| - **Reproducibility:** see the [GitHub repository](https://github.com/Adorably/EVHost) β training scripts under `scripts/train/`, configuration under `configs/`. |
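
For readers unfamiliar with the objective, the non-negative PU (nnPU) risk estimator of Kiryo et al. can be sketched in plain Python. Several choices here are assumptions that this card does not confirm: the sigmoid surrogate loss, the reading of β = 0.1 as the threshold below which the estimated negative risk is treated as a sign of overfitting, and the use of the estimated prior π̂ = 0.62 as the class prior. The function names are illustrative; the actual training code lives under `scripts/train/`.

```python
import math
from statistics import fmean


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_loss(score: float, label: int) -> float:
    """Surrogate loss l(z, y) = sigmoid(-y * z) with y in {+1, -1} (assumed choice)."""
    return _sigmoid(-label * score)


def nnpu_risk(pos_scores, unl_scores, prior=0.62, beta=0.1):
    """Return (risk, corrected) for one mini-batch of classifier scores.

    `corrected` is True when the estimated negative-class risk drops below
    -beta, the case in which nnPU applies a corrective update instead of
    minimising the full risk.
    """
    r_p_plus = fmean(sigmoid_loss(s, +1) for s in pos_scores)   # positives as positive
    r_p_minus = fmean(sigmoid_loss(s, -1) for s in pos_scores)  # positives as negative
    r_u_minus = fmean(sigmoid_loss(s, -1) for s in unl_scores)  # unlabeled as negative
    negative_risk = r_u_minus - prior * r_p_minus
    corrected = negative_risk < -beta
    risk = prior * r_p_plus + max(0.0, negative_risk)  # clamp keeps the estimate >= 0
    return risk, corrected


print(nnpu_risk([0.0, 0.0], [0.0, 0.0]))
```

The clamp is what separates nnPU from the unbiased PU estimator: without it, a flexible model can drive the negative-risk term below zero by memorising the unlabeled set.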
|
|
| ## Evaluation |
|
|
| Performance below is on the post-2018 held-out prediction set, using host-isolate labels as the reference. Recall / precision / F1 / ROC-AUC values are taken from Figure 2 of the paper; species-level detection rates are taken from Figure 4a. |
|
|
| ### Sequence-level metrics |
|
|
| | Model | F1 | Recall | Precision | ROC-AUC | |
| |-------|-----|--------|-----------|---------| |
| | **EVHost (PU)** | 0.789 | **0.994** | 0.665 | 0.953 | |
| | EVHost (BC) | **0.901** | 0.934 | 0.870 | 0.958 | |
| | BERT-infect (DNABERT) | 0.891 | 0.956 | 0.834 | - | |
| | BERT-infect (VIBE) | 0.813 | 0.955 | 0.706 | - | |
| | DeePaC-vir | 0.724 | 0.663 | 0.798 | - | |
| | Zoonotic rank | 0.928 | 0.955 | 0.903 | - | |
| | BLAST | 0.844 | 0.942 | 0.764 | - | |
| | kNN | 0.893 | 0.929 | 0.860 | - | |
|
|
| **Notes:** |
| - EVHost (PU) trades precision for recall by design. Many of its "false positives" under incomplete host-isolate labels map to species with independent human-host evidence in VHDB, so the precision value likely understates true performance. |
| - EVHost (BC) reaches an F1 of 0.901, higher than every baseline in the table except Zoonotic rank (0.928), showing that the fusion architecture is competitive even without PU learning. |
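
F1 is the harmonic mean of precision and recall, so individual table rows can be sanity-checked directly. The helper name `f1_score` is illustrative, and the two rows are copied from the table above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs taken from the sequence-level metrics table.
rows = {
    "EVHost (BC)": (0.870, 0.934),
    "Zoonotic rank": (0.903, 0.955),
}
for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
```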
|
|
| ## Limitations and Biases |
|
|
| - **Host label noise.** Host-isolate labels from NCBI/VHDB are incomplete and biased toward well-studied zoonoses. Both EVHost variants may under-predict human host probability for viruses with sparse surveillance records (especially non-mammalian reservoirs). |
| - **Taxonomic coverage.** Trained on 26 viral families; performance on families outside this set (e.g., bacteriophages, plant viruses) is unvalidated. |
| - **Length dependence.** Evo2 embeddings are computed at 8,192-token context. Sequences substantially shorter or longer are truncated/aggregated by the upstream embedding script, which may degrade performance. |
| - **Temporal drift.** The post-2018 test set is used for headline metrics, but viral evolution outpaces any fixed model. Periodic re-evaluation is required. |
| - **Confirmation bias from PU learning.** The PU-trained variant optimistically assumes the unlabeled set is mostly negative; for surveillance use, prefer the (BC) variant if false-positive cost is high. |
|
|
| ## How to Cite |
|
|
| See [CITATION.md](https://github.com/Adorably/EVHost/blob/main/CITATION.md) in the repository for the current (placeholder) citation. The canonical citation will be added upon paper acceptance. |
|
|
| ## License |
|
|
| MIT. See the [LICENSE](https://github.com/Adorably/EVHost/blob/main/LICENSE) file in the repository. |
|
|
| ## Acknowledgments |
|
|
| - **Evo2** model by the Arc Institute. |
| - **NCBI Virus** and **VHDB** for viral genome and host metadata. |
| - **BERT-Infect** dataset authors. |