---
license: other
language:
- en
tags:
- metagenomics
- viral-identification
- hierarchical-classification
- taxonomic-classification
- DNABERT-2
- bioinformatics
pipeline_tag: text-classification
---

# PACMT

PACMT is a pretrained sequence model-based framework for viral identification and hierarchical taxonomic classification of metagenomic sequences.

This repository contains the trained PACMT model files and taxonomy resources. The source code, example files and detailed usage instructions are available at:

```text
https://github.com/luanbei/PACMT
```

## Model description

PACMT uses a two-stage serial workflow:

1. **Binary viral screening**: a binary classifier predicts whether an input sequence is viral or non-viral.
2. **Hierarchical viral classification**: sequences predicted as viral are further classified at the order, family, genus and species levels.

For hierarchical classification, PACMT uses taxonomy-consistent path decoding to select a biologically valid order-family-genus-species prediction path.
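The idea behind taxonomy-consistent path decoding can be illustrated with a minimal sketch. It assumes that each rank yields a probability vector indexed by label ID and that a path is scored by the sum of its per-rank log-probabilities; the function name `decode_path`, the scoring rule and the toy numbers are illustrative assumptions, not the PACMT implementation.

```python
import math

RANKS = ("order", "family", "genus", "species")


def decode_path(rank_probs, valid_paths, eps=1e-12):
    """Return the valid path with the highest summed log-probability.

    rank_probs: dict mapping rank name -> list of probabilities indexed by label ID.
    valid_paths: iterable of (order_id, family_id, genus_id, species_id) tuples,
        e.g. the rows of taxonomy_paths.csv.
    """
    best_path, best_score = None, -math.inf
    for path in valid_paths:
        # Assumed scoring rule: joint log-probability of the four ranks.
        score = sum(
            math.log(rank_probs[rank][label_id] + eps)
            for rank, label_id in zip(RANKS, path)
        )
        if score > best_score:
            best_path, best_score = path, score
    return best_path, best_score


# Toy probabilities (made up). Taking the argmax at each rank independently
# would give (0, 1, 1, 1), which is not among the valid paths below.
rank_probs = {
    "order": [0.7, 0.3],
    "family": [0.4, 0.6],
    "genus": [0.2, 0.8],
    "species": [0.1, 0.9],
}
valid_paths = [(0, 0, 0, 0), (1, 1, 1, 1)]

best_path, log_joint = decode_path(rank_probs, valid_paths)
print(best_path, math.exp(log_joint))
```

Restricting the search to the listed paths means the returned order, family, genus and species always belong to one lineage, even when the per-rank top predictions disagree.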
## Repository contents

The recommended file structure of this Hugging Face repository is:

```text
PACMT/
├── README.md
├── backbone/
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   ├── configuration_bert.py
│   ├── bert_layers.py
│   ├── bert_padding.py
│   └── flash_attn_triton.py
├── binary_model/
│   ├── pytorch_model.bin
│   ├── head_config.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── hierarchy_model/
│   ├── pytorch_model.bin
│   ├── head_config.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── label_taxonomy_mapping.csv
│   ├── taxonomy_paths.csv
│   ├── taxonomy_paths_with_names.csv
│   └── label_sizes.json
└── taxonomy/
    ├── label_taxonomy_mapping.csv
    └── taxonomy_paths.csv
```

## Required files

To run the complete PACMT prediction workflow, the following files or directories are required:

```text
backbone/
binary_model/
hierarchy_model/
taxonomy/label_taxonomy_mapping.csv
taxonomy/taxonomy_paths.csv
```

The `label_taxonomy_mapping.csv` file maps internal label IDs to taxonomy names and should contain at least:

```text
rank,label_id,taxonomy_name
```

The `taxonomy_paths.csv` file defines valid hierarchical taxonomy paths and should contain at least:

```text
order_id,family_id,genus_id,species_id
```

## Installation and usage

Please install PACMT from the GitHub repository:

```bash
git clone https://github.com/luanbei/PACMT.git
cd PACMT
conda create -n pacmt python=3.8 -y
conda activate pacmt
pip install -r requirements.txt
```

Download this Hugging Face model repository and place the files under the `models/` directory:

```bash
pip install -U huggingface_hub
hf download luanbei/PACMT --local-dir models
```

After downloading, the local model directory should look like:

```text
models/
├── backbone/
├── binary_model/
├── hierarchy_model/
└── taxonomy/
```

## Complete prediction workflow

The complete two-stage PACMT workflow first performs binary viral
screening and then applies hierarchical taxonomic classification to sequences predicted as viral.

```bash
python scripts/predict_binary_hierarchy.py \
  --backbone_dir models/backbone \
  --binary_ckpt_dir models/binary_model \
  --hierarchy_ckpt_dir models/hierarchy_model \
  --mapping_csv models/taxonomy/label_taxonomy_mapping.csv \
  --taxonomy_path_csv models/taxonomy/taxonomy_paths.csv \
  --input_csv examples/example.csv \
  --seq_col seq \
  --id_col id \
  --seg_len 500 \
  --stride 250 \
  --max_length 512 \
  --batch_size 32 \
  --device cuda \
  --virus_threshold 0.5 \
  --tau 0.2 \
  --out_csv pacmt_predictions.csv
```

For FASTA input, replace the CSV input arguments with:

```bash
--input_fasta examples/example.fasta
```

## Binary viral screening only

```bash
python scripts/predict_binary.py \
  --backbone_dir models/backbone \
  --ckpt_dir models/binary_model \
  --input_csv examples/example.csv \
  --seq_col seq \
  --id_col id \
  --seg_len 500 \
  --stride 250 \
  --max_length 512 \
  --batch_size 32 \
  --device cuda \
  --tau 0.2 \
  --threshold 0.5 \
  --out_csv binary_predictions.csv
```

## Hierarchical classification only

```bash
python scripts/predict_hierarchy.py \
  --backbone_dir models/backbone \
  --ckpt_dir models/hierarchy_model \
  --mapping_csv models/taxonomy/label_taxonomy_mapping.csv \
  --taxonomy_path_csv models/taxonomy/taxonomy_paths.csv \
  --input_csv examples/example.csv \
  --seq_col seq \
  --id_col id \
  --seg_len 500 \
  --stride 250 \
  --max_length 512 \
  --batch_size 32 \
  --device cuda \
  --tau 0.2 \
  --out_csv hierarchy_predictions.csv
```

## Output

The complete workflow outputs a CSV file containing:

```text
id
seq_len
n_segments
is_virus
virus_confidence
order_id, order_name, order_conf
family_id, family_name, family_conf
genus_id, genus_name, genus_conf
species_id, species_name, species_conf
joint_score
log_joint_score
```

`is_virus=1` indicates that the input sequence is predicted as viral. If `is_virus=0`, the hierarchical taxonomic fields are left empty.
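For downstream analysis, the output CSV can be read with the Python standard library. The sketch below keeps only rows predicted as viral and joins their rank names into a lineage string. The helper name `viral_lineages` and the toy rows are hypothetical, and the exact serialization of `is_virus` (for example `1` versus `1.0`) is an assumption; only the column names come from the list above.

```python
import csv
import io

RANKS = ("order", "family", "genus", "species")


def viral_lineages(csv_file):
    """Yield (id, lineage, virus_confidence) for rows predicted as viral."""
    for row in csv.DictReader(csv_file):
        # Accept "1" or "1.0"; treat an empty field as non-viral.
        if float(row["is_virus"] or 0) != 1:
            continue
        lineage = "; ".join(row[f"{rank}_name"] for rank in RANKS)
        yield row["id"], lineage, float(row["virus_confidence"])


# Toy rows with placeholder names (made up, reduced to the columns used here).
toy_csv = io.StringIO(
    "id,is_virus,virus_confidence,order_name,family_name,genus_name,species_name\n"
    "contig_1,1,0.97,OrderA,FamilyA,GenusA,SpeciesA\n"
    "contig_2,0,0.12,,,,\n"
)

results = list(viral_lineages(toy_csv))
for seq_id, lineage, conf in results:
    print(seq_id, lineage, conf)
```

With a real run, the same function can be applied to a file handle opened with `open("pacmt_predictions.csv", newline="")`.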
## Intended use

PACMT is intended for research use in viral sequence screening and hierarchical taxonomic annotation of metagenomic sequences.

## Limitations

- Species-level prediction is generally more difficult than higher-rank prediction.
- Predictions for short, divergent or underrepresented viral sequences should be interpreted carefully.
- The hierarchical classifier relies on the released taxonomy mapping files and valid taxonomy paths.
- PACMT should be used as a research tool and should not be used as the sole basis for clinical decision-making.

## Citation

If you use PACMT, please cite:

```text
Luan B, Li P, et al. PACMT: a pretrained language model-based framework for viral identification and hierarchical taxonomic classification of metagenomic data.
```