	---
	license: mit
	language:
	- en
	base_model:
	- arcinstitute/evo2_1b_base
	---
	# Model Card for EVHost

	> Status: Pre-publication release. Author list, contact email, and paper link are placeholders and will be updated upon paper acceptance. See [CITATION.md](CITATION.md) for the current placeholder citation.

	This is the model card for the EVHost fusion classifier that pairs with the Evo2 evolutionary language model for viral host prediction. The checkpoint file `evhost_best.pt` contains the trained `FusionClassifier` weights and all hyperparameters required for inference.

	## Model Details

	- Model name: EVHost (fusion classifier)
	- Architecture: `FusionClassifier` — a multi-layer perceptron that fuses a 1920-dim Evo2 embedding (post-projection) with 211-dim hand-crafted genomic features (CUB, dinucleotide, CPB, AA frequency, host adaptation, zoonotic) through a 512-dim hidden layer.
	- Parameters: ~2.5M trainable (fusion head only; Evo2 backbone is frozen at inference and not part of this checkpoint)
	- Checkpoint size: ~500 MB
	- Framework: PyTorch 2.7.0
	- License: MIT

	The 211-dim genomic-feature vector is the post-CPB-compression representation fed to the fusion MLP. Pre-CPB-compression dimensionality is 1149 (CUB 64 + dinuc 16 + CPB 256 + AA 20 + bridge-dinuc 16 + adaptation 24 + zoonotic 7 + Evo2-projection 512 = 1915 → fused 1149). See `src/evhost/models/fusion.py` for the implementation.
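
The dimension bookkeeping above can be checked with a few lines of arithmetic. The sketch below takes the hand-crafted block widths exactly as listed and works out which compressed CPB width a 211-dim post-compression vector would imply. The dictionary and variable names are illustrative only, and the authoritative value is the `cpb_compressed_dim` entry stored in the checkpoint.

```python
# Hand-crafted feature blocks and their widths, as listed in this model card.
handcrafted = {
    "cub": 64,
    "dinuc": 16,
    "cpb": 256,
    "aa": 20,
    "bridge_dinuc": 16,
    "adaptation": 24,
    "zoonotic": 7,
}

pre_compression = sum(handcrafted.values())      # all blocks, CPB uncompressed
non_cpb = pre_compression - handcrafted["cpb"]   # everything except the CPB block
implied_cpb_compressed = 211 - non_cpb           # room left for CPB within 211 dims

print(pre_compression, non_cpb, implied_cpb_compressed)  # 403 147 64
```

Under this reading, the CPB block would shrink from 256 to 64 dimensions before fusion; treat that as an inference from the listed widths rather than a documented setting.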

	## Intended Use

	### Primary use
	- Research: predict the human-host probability of a viral genome sequence given pre-computed Evo2 embeddings and genomic features. Designed for viral surveillance, zoonotic-risk screening, and pandemic-preparedness research.

	### Out-of-scope use
	- Not a clinical diagnostic tool. Do not use for patient-level decision-making or pathogen identification in clinical workflows.
	- Not validated for non-viral sequences (bacteria, plasmids, host genomes).
	- Not validated for sequences substantially diverged from NCBI/VHDB reference taxa used during training (coronaviruses, influenza, rabies, etc.).

	## How to Use

	### Option 1 — Direct download (curl/wget)

	```bash
	mkdir -p models
	curl -L "https://huggingface.co/Adorably/EVHost/resolve/main/evhost_best.pt" -o models/evhost_best.pt
	```

	### Option 2 — Python via `huggingface_hub`
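
	```python
	from huggingface_hub import hf_hub_download
	import torch

	# Download the checkpoint into ./models.
	ckpt_path = hf_hub_download(
	    repo_id="Adorably/EVHost",
	    filename="evhost_best.pt",
	    local_dir="models",
	)
	# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
	checkpoint = torch.load(ckpt_path, map_location="cpu", weights_only=False)
	```
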
	### Loading and running inference

	```python
	import torch
	from evhost.models import FusionClassifier

	# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
	checkpoint = torch.load("models/evhost_best.pt", map_location="cpu", weights_only=False)

	# Rebuild the classifier from the hyperparameters stored in the checkpoint.
	model = FusionClassifier(
	    d_evo=checkpoint["d_evo"],
	    k_bio=checkpoint["k_bio"],
	    d_bio=checkpoint["d_bio"],
	    fusion_hidden=checkpoint["fusion_hidden"],
	    evo_reduced_dim=checkpoint["evo_reduced_dim"],
	    cpb_dim=checkpoint["cpb_dim"],
	    cpb_compressed_dim=checkpoint["cpb_compressed_dim"],
	).to("cpu")
	model.load_state_dict(checkpoint["model_state_dict"])
	model.eval()

	# `embedding`, `cpb`, and `non_cpb` are pre-computed tensors for one genome:
	# `embedding` is a 1920-dim Evo2 vector, `cpb` is a 256-dim codon-pair-bias vector,
	# and `non_cpb` concatenates the remaining genomic features (the 211-dim total
	# counts the CPB block after compression to `cpb_compressed_dim`).
	with torch.no_grad():
	    logit = model(embedding, cpb, non_cpb)
	    probability = torch.sigmoid(logit).item()
	```
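
Because the constructor call above reads several hyperparameters straight from the checkpoint dictionary, a quick key check before building the model can turn a bare `KeyError` into a clearer message. The helper below is a hypothetical convenience and not part of the `evhost` package; the key names are the ones used in the loading snippet above.

```python
# Keys that the loading snippet reads from the checkpoint dictionary.
REQUIRED_KEYS = (
    "d_evo", "k_bio", "d_bio", "fusion_hidden",
    "evo_reduced_dim", "cpb_dim", "cpb_compressed_dim",
    "model_state_dict",
)

def missing_checkpoint_keys(checkpoint):
    """Return the required keys absent from `checkpoint`, in a stable order."""
    return [key for key in REQUIRED_KEYS if key not in checkpoint]

# Stand-in dictionary for illustration; a real one would come from torch.load(...).
example = {key: None for key in REQUIRED_KEYS if key != "cpb_dim"}
print(missing_checkpoint_keys(example))  # ['cpb_dim']
```

An empty list means every key the snippet needs is present.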

	For a full end-to-end demo (FASTA → embedding + features → prediction), see the GitHub repository: `examples/simple_prediction.py` and `scripts/predict_host.py`.

	## Training Data

	- Source corpora: NCBI Virus + VHDB + BERT-Infect.
	- Total sequences: 128,761 viral genome records across 26 families.
	- Labels: 64,680 labeled-positive (`Homo sapiens` host records); 64,081 unlabeled (all other hosts).
	- Observed positive fraction: 50.2%; estimated true positive class prior π̂ = 0.62 (95% CI 0.59–0.65).
	- Split: date-based — <2015 train, 2015–2018 validation, >2018 prediction. Sarbecovirus sequences were excluded from training and used only for prediction.
	- Data redistribution: the original training data is not redistributed with this checkpoint. Users must obtain it from the source databases per their respective licenses.
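
The label counts and the date rule above can be written out as a short worked example. The sketch recomputes the observed positive fraction and, under the common selected-completely-at-random (SCAR) assumption, derives the label frequency and the share of unlabeled records that would be hidden positives if π̂ = 0.62 describes the whole corpus. Both the SCAR assumption and the reading of π̂ as a corpus-wide prior are assumptions made here, not statements from the source. `assign_split` is a hypothetical helper: treating 2015 and 2018 as inclusive validation boundaries, and routing every Sarbecovirus record to prediction regardless of date, is one reading of the rule.

```python
n_labeled_pos = 64_680
n_unlabeled = 64_081
n_total = n_labeled_pos + n_unlabeled      # 128,761 records
prior = 0.62                               # estimated class prior quoted in the card

observed_pos = n_labeled_pos / n_total     # fraction carrying a human-host label
label_freq = observed_pos / prior          # SCAR: P(labeled | truly positive)
hidden_pos = (prior - observed_pos) / (1 - observed_pos)  # positives among unlabeled

def assign_split(year, is_sarbecovirus=False):
    """Map a record year to a split name (boundary handling is an assumption)."""
    if is_sarbecovirus:
        return "prediction"
    if year < 2015:
        return "train"
    if year <= 2018:
        return "validation"
    return "prediction"

print(f"{observed_pos:.3f} {label_freq:.3f} {hidden_pos:.3f}")
print(assign_split(2012), assign_split(2016), assign_split(2021))
```

Under these assumptions, roughly four in five true positives carry a label, and a bit under a quarter of the unlabeled records would be unrecorded human-host viruses.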

	## Training Procedure

	- Backbone: Evo2 (1B-parameter), fine-tuned on viral sequences for ~20 hours on a single NVIDIA H100 (4,688 optimization steps, sequence length 8,192, effective batch size 32, peak LR 1.5×10⁻⁵, min LR 1.5×10⁻⁶, 400 warm-up steps). The fine-tuned Evo2 backbone is not included in this checkpoint — see the [Evo2 project](https://www.arcinstitute.org/) for backbone weights.
	- Fusion head: trained for 40 epochs (final 15 used for checkpoint selection) with batch size 64, learning rates 3×10⁻⁵ (Evo2 projection) / 1×10⁻⁴ (fusion head) / 3×10⁻⁴ (BioMLP), weight decay 5×10⁻³, gradient clip 0.65.
	- Objective: positive-unlabeled (PU) learning with `nnPU` correction (β = 0.1). An architecture-matched binary-cross-entropy variant (EVHost(BC)) is also trained for direct comparison with baselines.
	- Reproducibility: see the [GitHub repository](https://github.com/Adorably/EVHost) — training scripts under `scripts/train/`, configuration under `configs/`.
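
For readers unfamiliar with the objective, the non-negative PU risk can be sketched in a few lines. This follows the general nnPU formulation (Kiryo et al., 2017) with a sigmoid surrogate loss and plain Python floats. It is not taken from the EVHost training scripts, and the surrogate loss, the `gamma` parameter, and the function names are assumptions. The class prior 0.62 and β = 0.1 are the values quoted in this card.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean(values):
    return sum(values) / len(values) if values else 0.0

def nnpu_objective(logits, is_positive, prior=0.62, beta=0.1, gamma=1.0):
    """Return (objective, corrected) for one mini-batch of scalar logits.

    Sigmoid surrogate: the cost of calling a sample positive is sigmoid(-z),
    the cost of calling it negative is sigmoid(z).
    """
    pos = [z for z, y in zip(logits, is_positive) if y]
    unl = [z for z, y in zip(logits, is_positive) if not y]

    risk_p_pos = mean([sigmoid(-z) for z in pos])  # labeled positives scored as positive
    risk_p_neg = mean([sigmoid(z) for z in pos])   # labeled positives scored as negative
    risk_u_neg = mean([sigmoid(z) for z in unl])   # unlabeled samples scored as negative

    # Estimated negative-class risk; a strongly negative value signals overfitting.
    neg_risk = risk_u_neg - prior * risk_p_neg
    if neg_risk < -beta:
        # Correction step: minimizing -gamma * neg_risk pushes the estimate back up.
        return -gamma * neg_risk, True
    return prior * risk_p_pos + neg_risk, False

objective, corrected = nnpu_objective([2.0, -1.0, 0.5, -2.0], [1, 0, 1, 0])
print(round(objective, 4), corrected)
```

In a PyTorch training loop the same expressions would be applied to tensors, with the returned objective passed to `backward()`.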

	## Evaluation

	Performance below is on the post-2018 held-out prediction set, using host-isolate labels as the reference. Recall / precision / F1 / ROC-AUC values are taken from Figure 2 of the paper; species-level detection rates are taken from Figure 4a.

	### Sequence-level metrics

	| Model | F1 | Recall | Precision | ROC-AUC |
	|-------|-----|--------|-----------|---------|
	| EVHost (PU) | 0.789 | 0.994 | 0.665 | 0.953 |
	| EVHost (BC) | 0.901 | 0.934 | 0.870 | 0.958 |
	| BERT-infect (DNABERT) | 0.891 | 0.956 | 0.834 | — |
	| BERT-infect (VIBE) | 0.813 | 0.955 | 0.706 | — |
	| DeePaC-vir | 0.724 | 0.663 | 0.798 | — |
	| Zoonotic rank | 0.928 | 0.955 | 0.903 | — |
	| BLAST | 0.844 | 0.942 | 0.764 | — |
	| kNN | 0.893 | 0.929 | 0.860 | — |

	Notes:
	- EVHost(PU) trades precision for recall by design. Many of its "false positives" under incomplete host-isolate labels map to species with independent human-host evidence in VHDB, so the precision value understates true performance.
	- EVHost(BC) reaches an F1 of 0.901, above every baseline in the table except Zoonotic rank (0.928), which indicates that the fusion architecture is strong even without PU learning.
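
The F1 column is the harmonic mean of precision and recall, so a row can be sanity-checked from its other two entries. A minimal example using the EVHost (BC) row (the function name is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# EVHost (BC) row: precision 0.870, recall 0.934
print(round(f1_score(0.870, 0.934), 3))  # 0.901
```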

	## Limitations and Biases

	- Host label noise. Host-isolate labels from NCBI/VHDB are incomplete and biased toward well-studied zoonoses. Both EVHost variants may under-predict human host probability for viruses with sparse surveillance records (especially non-mammalian reservoirs).
	- Taxonomic coverage. Trained on 26 viral families; performance on families outside this set (e.g., bacteriophages, plant viruses) is unvalidated.
	- Length dependence. Evo2 embeddings are computed at 8,192-token context. Sequences substantially shorter or longer are truncated/aggregated by the upstream embedding script, which may degrade performance.
	- Temporal drift. The post-2018 test set is used for headline metrics, but viral evolution outpaces any fixed model. Periodic re-evaluation is required.
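	- Confirmation bias from PU learning. The PU-trained variant treats part of the unlabeled set as unrecorded human-host positives (π̂ = 0.62 against an observed positive fraction of 50.2%), which raises recall at the cost of more positive calls. For surveillance use, prefer the (BC) variant if false-positive cost is high.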

	## How to Cite

	See [CITATION.md](https://github.com/Adorably/EVHost/blob/main/CITATION.md) in the repository for the current (placeholder) citation. The canonical citation will be added upon paper acceptance.

	## License

	MIT — see the [LICENSE](https://github.com/Adorably/EVHost/blob/main/LICENSE) file in the repository.

	## Acknowledgments

	- Evo2 model by the Arc Institute.
	- NCBI Virus and VHDB for viral genome and host metadata.
	- BERT-Infect dataset authors.