---
license: mit
language:
- en
base_model:
- arcinstitute/evo2_1b_base
---
# Model Card for EVHost
> **Status: Pre-publication release.** Author list, contact email, and paper link are placeholders and will be updated upon paper acceptance. See [CITATION.md](CITATION.md) for the current placeholder citation.

This is the model card for the EVHost fusion classifier that pairs with the Evo2 evolutionary language model for viral host prediction. The checkpoint file `evhost_best.pt` contains the trained `FusionClassifier` weights and all hyperparameters required for inference.
## Model Details
- **Model name:** EVHost (fusion classifier)
- **Architecture:** `FusionClassifier`, a multi-layer perceptron that fuses a 1920-dim Evo2 embedding (post-projection) with 211-dim hand-crafted genomic features (CUB, dinucleotide, CPB, AA frequency, host adaptation, zoonotic) through a 512-dim hidden layer.
- **Parameters:** ~2.5M trainable (fusion head only; Evo2 backbone is frozen at inference and not part of this checkpoint)
- **Checkpoint size:** ~500 MB
- **Framework:** PyTorch 2.7.0
- **License:** MIT
The 211-dim genomic-feature vector is the post-CPB-compression representation fed to the fusion MLP. Pre-CPB-compression dimensionality is 1149 (CUB 64 + dinuc 16 + CPB 256 + AA 20 + bridge-dinuc 16 + adaptation 24 + zoonotic 7 + Evo2-projection 512 = 915 → fused 1149). See `src/evhost/models/fusion.py` for the implementation.
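As a quick check of the 211-dim figure, the sketch below adds up the genomic-feature group sizes listed above, leaving out the Evo2 projection. It assumes the 256-dim CPB block is compressed to 64 dims. That value is inferred here (211 minus the non-CPB total) and is not stated in this card; the authoritative value is `cpb_compressed_dim` in the checkpoint.

```python
# Genomic-feature group sizes as listed in this card (Evo2 projection excluded).
GROUP_DIMS = {
    "cub": 64,
    "dinuc": 16,
    "cpb": 256,
    "aa": 20,
    "bridge_dinuc": 16,
    "adaptation": 24,
    "zoonotic": 7,
}

# ASSUMPTION: compressed CPB width. Read `cpb_compressed_dim` from the checkpoint instead.
ASSUMED_CPB_COMPRESSED_DIM = 64

raw_dim = sum(GROUP_DIMS.values())                 # all genomic features, uncompressed
non_cpb_dim = raw_dim - GROUP_DIMS["cpb"]          # everything except the CPB block
fused_bio_dim = non_cpb_dim + ASSUMED_CPB_COMPRESSED_DIM

print(raw_dim, non_cpb_dim, fused_bio_dim)  # 403 147 211
```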
## Intended Use
### Primary use
- **Research**: predict the human-host probability of a viral genome sequence given pre-computed Evo2 embeddings and genomic features. Designed for viral surveillance, zoonotic-risk screening, and pandemic-preparedness research.
### Out-of-scope use
- **Not a clinical diagnostic tool.** Do not use for patient-level decision-making or pathogen identification in clinical workflows.
- **Not validated for non-viral sequences** (bacteria, plasmids, host genomes).
- **Not validated for sequences substantially diverged from NCBI/VHDB reference taxa** used during training (coronaviruses, influenza, rabies, etc.).
## How to Use
### Option 1: Direct download (curl/wget)
```bash
mkdir -p models
curl -L "https://huggingface.co/Adorably/EVHost/resolve/main/evhost_best.pt" -o models/evhost_best.pt
```
### Option 2: Python via `huggingface_hub`
```python
from huggingface_hub import hf_hub_download
import torch
ckpt_path = hf_hub_download(
    repo_id="Adorably/EVHost",
    filename="evhost_best.pt",
    local_dir="models",
)
# weights_only=False unpickles arbitrary Python objects; only load checkpoints you trust.
checkpoint = torch.load(ckpt_path, map_location="cpu", weights_only=False)
```
### Loading and running inference
```python
import torch
from evhost.models import FusionClassifier
checkpoint = torch.load("models/evhost_best.pt", map_location="cpu", weights_only=False)
model = FusionClassifier(
    d_evo=checkpoint["d_evo"],
    k_bio=checkpoint["k_bio"],
    d_bio=checkpoint["d_bio"],
    fusion_hidden=checkpoint["fusion_hidden"],
    evo_reduced_dim=checkpoint["evo_reduced_dim"],
    cpb_dim=checkpoint["cpb_dim"],
    cpb_compressed_dim=checkpoint["cpb_compressed_dim"],
).to("cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
# Inputs you supply: `embedding` is the 1920-dim Evo2 vector, `cpb` is the
# 256-dim codon-pair-bias vector, and `non_cpb` is the concatenation of the
# remaining genomic features (the 211-dim vector is formed after `cpb` is compressed).
with torch.no_grad():
    logit = model(embedding, cpb, non_cpb)
probability = torch.sigmoid(logit).item()
```
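The loading snippet above reads seven hyperparameter keys and `model_state_dict` from the checkpoint. A small pre-flight check, sketched below, can report missing keys before `FusionClassifier` is constructed. The helper name `missing_checkpoint_keys` is hypothetical and not part of the `evhost` package.

```python
# Keys read by the loading snippet: seven hyperparameters plus the weights entry.
REQUIRED_KEYS = (
    "d_evo", "k_bio", "d_bio", "fusion_hidden",
    "evo_reduced_dim", "cpb_dim", "cpb_compressed_dim",
    "model_state_dict",
)


def missing_checkpoint_keys(checkpoint):
    """Return the required keys that are absent from a loaded checkpoint dict."""
    return [key for key in REQUIRED_KEYS if key not in checkpoint]


# Toy dict standing in for the output of torch.load(...); values are placeholders.
toy_checkpoint = {key: None for key in REQUIRED_KEYS if key != "cpb_dim"}
print(missing_checkpoint_keys(toy_checkpoint))  # ['cpb_dim']
```

In practice the same function can be called on the dict returned by `torch.load` right before building the model.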
For a full end-to-end demo (FASTA → embedding + features → prediction), see the GitHub repository: `examples/simple_prediction.py` and `scripts/predict_host.py`.
## Training Data
- **Source corpora:** NCBI Virus + VHDB + BERT-Infect.
- **Total sequences:** 128,761 viral genome records across 26 families.
- **Labels:** 64,680 labeled-positive (`Homo sapiens` host records); 64,081 unlabeled (all other hosts).
- **Observed positive fraction:** 50.2%; estimated true positive class prior π̂ = 0.62 (95% CI 0.59–0.65).
- **Split:** date-based: <2015 train, 2015–2018 validation, >2018 prediction. Sarbecovirus sequences were excluded from training and used only for prediction.
- **Data redistribution:** the original training data is **not** redistributed with this checkpoint. Users must obtain it from the source databases per their respective licenses.
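The date-based split and the observed positive fraction above can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `assign_split` is hypothetical, and the card does not say which date field (collection or release) defines the year, so a single integer year is assumed.

```python
def assign_split(year, is_sarbecovirus=False):
    """Map a record to 'train', 'validation', or 'prediction'.

    ASSUMPTION: `year` is one integer per record; the card does not name the date field.
    """
    if is_sarbecovirus:
        return "prediction"   # sarbecoviruses are held out of training entirely
    if year < 2015:
        return "train"
    if year <= 2018:
        return "validation"
    return "prediction"


# Label counts quoted in this card.
labeled_positive, unlabeled = 64_680, 64_081
total = labeled_positive + unlabeled

print(assign_split(2012), assign_split(2016), assign_split(2020))
print(total, round(labeled_positive / total, 3))  # 128761 0.502
```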
## Training Procedure
- **Backbone:** Evo2 (1B-parameter), fine-tuned on viral sequences for ~20 hours on a single NVIDIA H100 (4,688 optimization steps, sequence length 8,192, effective batch size 32, peak LR 1.5×10⁻⁵, min LR 1.5×10⁻⁶, 400 warm-up steps). The fine-tuned Evo2 backbone is **not** included in this checkpoint; see the [Evo2 project](https://www.arcinstitute.org/) for backbone weights.
- **Fusion head:** trained for 40 epochs (final 15 used for checkpoint selection) with batch size 64, learning rates 3×10⁻⁵ (Evo2 projection) / 1×10⁻⁴ (fusion head) / 3×10⁻⁴ (BioMLP), weight decay 5×10⁻³, gradient clip 0.65.
- **Objective:** positive-unlabeled (PU) learning with `nnPU` correction (β = 0.1). An architecture-matched binary-cross-entropy variant (EVHost(BC)) is also trained for direct comparison with baselines.
- **Reproducibility:** see the [GitHub repository](https://github.com/Adorably/EVHost): training scripts under `scripts/train/`, configuration under `configs/`.
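For readers unfamiliar with the objective, here is a minimal pure-Python sketch of the non-negative PU (nnPU) risk with a β threshold. It is not the EVHost training code. Several things are assumptions: the sigmoid surrogate loss, the step discount `gamma`, the function names, and the use of the estimated class prior (0.62) as `prior`. The card states only that nnPU is used with β = 0.1.

```python
import math


def sigmoid_loss(score, label):
    """Sigmoid surrogate loss l(z, y) = 1 / (1 + exp(y * z)), with y in {+1, -1}."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    return sum(values) / len(values)


def nnpu_objective(pos_scores, unl_scores, prior, beta=0.1, gamma=1.0):
    """Return (value_to_minimize, corrected) for one mini-batch of raw scores.

    ASSUMPTIONS: sigmoid surrogate loss and `gamma`; the card only gives beta.
    """
    pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])

    # Estimated risk on the (unobserved) negative class.
    negative_risk = unl_as_neg - prior * pos_as_neg
    if negative_risk < -beta:
        # The estimate dropped below -beta: step against it instead of the full risk.
        return -gamma * negative_risk, True
    return prior * pos_as_pos + negative_risk, False


value, corrected = nnpu_objective([2.0, 1.0], [-1.0, 0.5], prior=0.62)
print(round(value, 4), corrected)
```

The correction branch fires when the model scores unlabeled examples far lower than the class prior allows, which is the overfitting pattern nnPU is designed to counter.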
## Evaluation
Performance below is on the post-2018 held-out prediction set, using host-isolate labels as the reference. Recall / precision / F1 / ROC-AUC values are taken from Figure 2 of the paper; species-level detection rates are taken from Figure 4a.
### Sequence-level metrics
| Model | F1 | Recall | Precision | ROC-AUC |
|-------|-----|--------|-----------|---------|
| **EVHost (PU)** | 0.789 | **0.994** | 0.665 | 0.953 |
| EVHost (BC) | **0.901** | 0.934 | 0.870 | 0.958 |
| BERT-infect (DNABERT) | 0.891 | 0.956 | 0.834 | n/a |
| BERT-infect (VIBE) | 0.813 | 0.955 | 0.706 | n/a |
| DeePaC-vir | 0.724 | 0.663 | 0.798 | n/a |
| Zoonotic rank | 0.928 | 0.955 | 0.903 | n/a |
| BLAST | 0.844 | 0.942 | 0.764 | n/a |
| kNN | 0.893 | 0.929 | 0.860 | n/a |
**Notes:**
- EVHost(PU) trades precision for recall by design. Many of its "false positives" under incomplete host-isolate labels map to species with independent human-host evidence in VHDB, so the precision value understates true performance.
- EVHost(BC) reaches F1 = 0.901, higher than every baseline in the table except Zoonotic rank (0.928), which suggests the fusion architecture is strong even without PU learning.
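F1 is the harmonic mean of precision and recall. The short sketch below recomputes the EVHost (BC) entry from its table row. Because the table values are rounded to three decimals, recomputing other rows this way may differ slightly in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# EVHost (BC) row: precision 0.870, recall 0.934.
print(round(f1_score(0.870, 0.934), 3))  # 0.901
```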
## Limitations and Biases
- **Host label noise.** Host-isolate labels from NCBI/VHDB are incomplete and biased toward well-studied zoonoses. Both EVHost variants may under-predict human host probability for viruses with sparse surveillance records (especially non-mammalian reservoirs).
- **Taxonomic coverage.** Trained on 26 viral families; performance on families outside this set (e.g., bacteriophages, plant viruses) is unvalidated.
- **Length dependence.** Evo2 embeddings are computed at 8,192-token context. Sequences substantially shorter or longer are truncated/aggregated by the upstream embedding script, which may degrade performance.
- **Temporal drift.** The post-2018 test set is used for headline metrics, but viral evolution outpaces any fixed model. Periodic re-evaluation is required.
- **Confirmation bias from PU learning.** The PU-trained variant optimistically assumes the unlabeled set is mostly negative; for surveillance use, prefer the (BC) variant if false-positive cost is high.
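To make the length-dependence point concrete, the sketch below shows one common way to handle a sequence longer than the context window: split it into fixed-size windows and mean-pool the per-window embeddings. This is a hypothetical illustration. The upstream embedding script is not shown in this card and may truncate or aggregate differently, and the code assumes one token per nucleotide.

```python
CONTEXT_LENGTH = 8192  # context size quoted in this card


def split_into_windows(sequence, window=CONTEXT_LENGTH):
    """Split a sequence into consecutive, non-overlapping windows (last one may be short)."""
    return [sequence[i:i + window] for i in range(0, len(sequence), window)]


def mean_pool(vectors):
    """Element-wise mean of equal-length vectors, e.g. one embedding per window."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# ASSUMPTION: one token per nucleotide, so string length equals token count.
genome = "ACGT" * 5000  # toy 20,000-nt sequence
windows = split_into_windows(genome)
print([len(w) for w in windows])  # [8192, 8192, 3616]

# Toy 2-dim "embeddings" standing in for real per-window Evo2 vectors.
print(mean_pool([[1.0, 3.0], [3.0, 5.0]]))  # [2.0, 4.0]
```

A short final window contributes as much to the mean as a full one here, which is one reason aggregated embeddings for unusual lengths may behave differently from those seen in training.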
## How to Cite
See [CITATION.md](https://github.com/Adorably/EVHost/blob/main/CITATION.md) in the repository for the current (placeholder) citation. The canonical citation will be added upon paper acceptance.
## License
MIT; see the [LICENSE](https://github.com/Adorably/EVHost/blob/main/LICENSE) file in the repository.
## Acknowledgments
- **Evo2** model by the Arc Institute.
- **NCBI Virus** and **VHDB** for viral genome and host metadata.
- **BERT-Infect** dataset authors.