+---
+license: other
+language:
+- en
+tags:
+- metagenomics
+- viral-identification
+- hierarchical-classification
+- taxonomic-classification
+- DNABERT-2
+- bioinformatics
+pipeline_tag: text-classification
+---
+# PACMT
+PACMT is a pretrained sequence model-based framework for viral identification and hierarchical taxonomic classification of metagenomic sequences.
+This repository contains the trained PACMT model files and taxonomy resources. The source code, example files and detailed usage instructions are available at:
+```text
+https://github.com/luanbei/PACMT
+```
+## Model description
+PACMT uses a two-stage serial workflow:
+1. **Binary viral screening**: a binary classifier predicts whether an input sequence is viral or non-viral.
+2. **Hierarchical viral classification**: sequences predicted as viral are further classified at the order, family, genus and species levels.
+For hierarchical classification, PACMT uses taxonomy-consistent path decoding to select a biologically valid order-family-genus-species prediction path.
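As an illustration of taxonomy-consistent path decoding, the sketch below scores every valid order-family-genus-species path by the sum of its per-rank log-probabilities and keeps the best one. This is a minimal sketch under stated assumptions, not the PACMT implementation: the function name `decode_path`, the input layout and the product-of-probabilities scoring rule are hypothetical, and the probabilities are invented.

```python
import math

RANKS = ("order", "family", "genus", "species")

def decode_path(rank_probs, valid_paths):
    """Return the valid path with the highest joint log-score.

    rank_probs: dict mapping rank name -> list of probabilities indexed by label ID.
    valid_paths: iterable of (order_id, family_id, genus_id, species_id) tuples.
    Assumption: the joint score is the product of the per-rank probabilities.
    """
    best_path, best_log_score = None, -math.inf
    for path in valid_paths:
        # Summing logs equals the log of the product and avoids underflow.
        log_score = sum(
            math.log(max(rank_probs[rank][label_id], 1e-12))
            for rank, label_id in zip(RANKS, path)
        )
        if log_score > best_log_score:
            best_path, best_log_score = path, log_score
    return best_path, best_log_score

# Toy data: taking the argmax at each rank independently would give
# (0, 1, 0, 0), which is not a valid path in this made-up taxonomy.
rank_probs = {
    "order": [0.6, 0.4],
    "family": [0.45, 0.55],
    "genus": [0.7, 0.3],
    "species": [0.8, 0.2],
}
valid_paths = [(0, 0, 0, 0), (1, 1, 1, 1)]
path, log_score = decode_path(rank_probs, valid_paths)
print(path, round(math.exp(log_score), 4))
```

Restricting the search to the rows of `taxonomy_paths.csv` is what guarantees that the returned order, family, genus and species are mutually consistent.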
+## Repository contents
+The recommended file structure of this Hugging Face repository is:
+```text
+PACMT/
+├── README.md
+├── backbone/
+│   ├── config.json
+│   ├── pytorch_model.bin
+│   ├── tokenizer.json
+│   ├── tokenizer_config.json
+│   ├── configuration_bert.py
+│   ├── bert_layers.py
+│   ├── bert_padding.py
+│   └── flash_attn_triton.py
+├── binary_model/
+│   ├── pytorch_model.bin
+│   ├── head_config.json
+│   ├── tokenizer.json
+│   ├── tokenizer_config.json
+│   └── special_tokens_map.json
+├── hierarchy_model/
+│   ├── pytorch_model.bin
+│   ├── head_config.json
+│   ├── tokenizer.json
+│   ├── tokenizer_config.json
+│   ├── special_tokens_map.json
+│   ├── label_taxonomy_mapping.csv
+│   ├── taxonomy_paths.csv
+│   ├── taxonomy_paths_with_names.csv
+│   └── label_sizes.json
+└── taxonomy/
+    ├── label_taxonomy_mapping.csv
+    └── taxonomy_paths.csv
+```
+## Required files
+To run the complete PACMT prediction workflow, the following files or directories are required:
+```text
+backbone/
+binary_model/
+hierarchy_model/
+taxonomy/label_taxonomy_mapping.csv
+taxonomy/taxonomy_paths.csv
+```
+The `label_taxonomy_mapping.csv` file maps internal label IDs to taxonomy names and should contain at least the following columns:
+```text
+rank,label_id,taxonomy_name
+```
+The `taxonomy_paths.csv` file defines the valid hierarchical taxonomy paths and should contain at least the following columns:
+```text
+order_id,family_id,genus_id,species_id
+```
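Before running a prediction, it can help to confirm that both taxonomy files carry the columns listed above. The following is a minimal sketch using only the Python standard library; the helper name `missing_columns` and the sample rows are hypothetical and are not part of PACMT.

```python
import csv
import io

# Column names taken from the two header lines listed above.
REQUIRED_COLUMNS = {
    "label_taxonomy_mapping.csv": {"rank", "label_id", "taxonomy_name"},
    "taxonomy_paths.csv": {"order_id", "family_id", "genus_id", "species_id"},
}

def missing_columns(file_name, handle):
    """Return the required columns absent from the CSV header, sorted."""
    header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return sorted(REQUIRED_COLUMNS[file_name] - present)

# In-memory stand-ins for the real files; the rows are invented.
mapping_csv = io.StringIO("rank,label_id,taxonomy_name\norder,0,ExampleOrder\n")
paths_csv = io.StringIO("order_id,family_id,genus_id\n0,0,0\n")

print(missing_columns("label_taxonomy_mapping.csv", mapping_csv))  # []
print(missing_columns("taxonomy_paths.csv", paths_csv))            # ['species_id']
```

For the real files, pass a handle opened with `open(path, newline="")` instead of the `io.StringIO` objects.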
+## Installation and usage
+Please install PACMT from the GitHub repository:
+```bash
+git clone https://github.com/luanbei/PACMT.git
+cd PACMT
+conda create -n pacmt python=3.8 -y
+conda activate pacmt
+pip install -r requirements.txt
+```
+Download this Hugging Face model repository and place the files under the `models/` directory:
+```bash
+pip install -U huggingface_hub
+hf download luanbei/PACMT --local-dir models
+```
+After downloading, the local model directory should look like:
+```text
+models/
+├── backbone/
+├── binary_model/
+├── hierarchy_model/
+└── taxonomy/
+```
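A quick way to confirm the download is complete is to check for the paths named under "Required files". The snippet below is a minimal sketch; `missing_model_paths` is a hypothetical helper, and it assumes the files were downloaded into `models/` as shown above.

```python
from pathlib import Path

# Paths taken from the "Required files" list in this README.
REQUIRED_PATHS = (
    "backbone",
    "binary_model",
    "hierarchy_model",
    "taxonomy/label_taxonomy_mapping.csv",
    "taxonomy/taxonomy_paths.csv",
)

def missing_model_paths(model_root):
    """Return the required paths that do not exist under model_root."""
    root = Path(model_root)
    return [rel for rel in REQUIRED_PATHS if not (root / rel).exists()]

if __name__ == "__main__":
    missing = missing_model_paths("models")
    print("All required paths found." if not missing else f"Missing: {missing}")
```

The check only tests for existence; it does not verify that the model weights inside each directory are intact.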
+## Complete prediction workflow
+The complete two-stage PACMT workflow first performs binary viral screening and then applies hierarchical taxonomic classification to sequences predicted as viral.
+```bash
+python scripts/predict_binary_hierarchy.py \
+  --backbone_dir models/backbone \
+  --binary_ckpt_dir models/binary_model \
+  --hierarchy_ckpt_dir models/hierarchy_model \
+  --mapping_csv models/taxonomy/label_taxonomy_mapping.csv \
+  --taxonomy_path_csv models/taxonomy/taxonomy_paths.csv \
+  --input_csv examples/example.csv \
+  --seq_col seq \
+  --id_col id \
+  --seg_len 500 \
+  --stride 250 \
+  --max_length 512 \
+  --batch_size 32 \
+  --device cuda \
+  --virus_threshold 0.5 \
+  --tau 0.2 \
+  --out_csv pacmt_predictions.csv
+```
+For FASTA input, replace the CSV input arguments (`--input_csv`, `--seq_col` and `--id_col`) with:
+```bash
+--input_fasta examples/example.fasta
+```
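The `--seg_len 500` and `--stride 250` arguments suggest that long sequences are cut into overlapping 500 bp windows that start every 250 bp. The sketch below shows one way such windowing could work. It is an assumption for illustration only: the function names are hypothetical, and the handling of short sequences and of the trailing window may differ from what the PACMT scripts actually do.

```python
def segment_starts(seq_len, seg_len=500, stride=250):
    """Return window start offsets covering a sequence of length seq_len.

    Assumptions: a sequence no longer than seg_len forms a single segment,
    and a final window is anchored to the sequence end so the tail is kept.
    """
    if seq_len <= seg_len:
        return [0]
    starts = list(range(0, seq_len - seg_len + 1, stride))
    if starts[-1] + seg_len < seq_len:
        starts.append(seq_len - seg_len)
    return starts

def split_sequence(seq, seg_len=500, stride=250):
    """Cut seq into overlapping segments at the offsets from segment_starts."""
    return [seq[s:s + seg_len] for s in segment_starts(len(seq), seg_len, stride)]

# A made-up 1200 bp sequence.
segments = split_sequence("ACGT" * 300)
print(len(segments), [len(s) for s in segments])
```

Under these assumptions, the number of windows produced for a sequence would correspond to the `n_segments` column of the output.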
+## Binary viral screening only
+```bash
+python scripts/predict_binary.py \
+  --backbone_dir models/backbone \
+  --ckpt_dir models/binary_model \
+  --input_csv examples/example.csv \
+  --seq_col seq \
+  --id_col id \
+  --seg_len 500 \
+  --stride 250 \
+  --max_length 512 \
+  --batch_size 32 \
+  --device cuda \
+  --tau 0.2 \
+  --threshold 0.5 \
+  --out_csv binary_predictions.csv
+```
+## Hierarchical classification only
+```bash
+python scripts/predict_hierarchy.py \
+  --backbone_dir models/backbone \
+  --ckpt_dir models/hierarchy_model \
+  --mapping_csv models/taxonomy/label_taxonomy_mapping.csv \
+  --taxonomy_path_csv models/taxonomy/taxonomy_paths.csv \
+  --input_csv examples/example.csv \
+  --seq_col seq \
+  --id_col id \
+  --seg_len 500 \
+  --stride 250 \
+  --max_length 512 \
+  --batch_size 32 \
+  --device cuda \
+  --tau 0.2 \
+  --out_csv hierarchy_predictions.csv
+```
+## Output
+The complete workflow writes a CSV file (the path given by `--out_csv`) with the following columns:
+```text
+id
+seq_len
+n_segments
+is_virus
+virus_confidence
+order_id, order_name, order_conf
+family_id, family_name, family_conf
+genus_id, genus_name, genus_conf
+species_id, species_name, species_conf
+joint_score
+log_joint_score
+```
+`is_virus=1` indicates that the input sequence is predicted as viral. If `is_virus=0`, the hierarchical taxonomic fields are left empty.
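The output file can be post-processed with the standard `csv` module. The sketch below counts predicted families among viral rows. It assumes that `is_virus` is written as the literal `1` or `0` and that empty taxonomic fields are empty strings; the helper name and the sample rows are invented.

```python
import csv
import io
from collections import Counter

def viral_family_counts(handle):
    """Count predicted families among rows with is_virus == 1."""
    counts = Counter()
    for row in csv.DictReader(handle):
        if row["is_virus"].strip() != "1":
            continue  # non-viral rows have empty taxonomic fields
        counts[row["family_name"] or "unassigned"] += 1
    return counts

# Invented rows using a subset of the documented columns.
sample = io.StringIO(
    "id,is_virus,virus_confidence,family_name,family_conf\n"
    "contig_1,1,0.97,FamilyA,0.88\n"
    "contig_2,0,0.12,,\n"
    "contig_3,1,0.81,FamilyA,0.64\n"
)
print(viral_family_counts(sample))
```

For a real run, open the file with `open("pacmt_predictions.csv", newline="")` and pass the handle to the function.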
+## Intended use
+PACMT is intended for research use in viral sequence screening and hierarchical taxonomic annotation of metagenomic sequences.
+## Limitations
+- Species-level prediction is generally more difficult than higher-rank prediction.
+- Predictions for short, divergent or underrepresented viral sequences should be interpreted carefully.
+- The hierarchical classifier relies on the released taxonomy mapping files and valid taxonomy paths.
+- PACMT should be used as a research tool and should not be used as the sole basis for clinical decision-making.
+## Citation
+If you use PACMT, please cite:
+```text
+Luan B, Li P, et al. PACMT: a pretrained language model-based framework for viral identification and hierarchical taxonomic classification of metagenomic data.
+```