| --- |
| license: other |
| language: |
| - en |
| tags: |
| - metagenomics |
| - viral-identification |
| - hierarchical-classification |
| - taxonomic-classification |
| - DNABERT-2 |
| - bioinformatics |
| pipeline_tag: text-classification |
| --- |
| |
| # PACMT |
|
|
| PACMT is a framework built on a pretrained sequence model for viral identification and hierarchical taxonomic classification of metagenomic sequences. |
|
|
| This repository contains the trained PACMT model files and taxonomy resources. The source code, example files and detailed usage instructions are available at: |
|
|
| ```text |
| https://github.com/luanbei/PACMT |
| ``` |
|
|
| ## Model description |
|
|
| PACMT uses a two-stage serial workflow: |
|
|
| 1. **Binary viral screening**: a binary classifier predicts whether an input sequence is viral or non-viral. |
| 2. **Hierarchical viral classification**: sequences predicted as viral are further classified at the order, family, genus and species levels. |
|
|
| For hierarchical classification, PACMT uses taxonomy-consistent path decoding to select a biologically valid order-family-genus-species prediction path. |
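The idea behind path decoding can be illustrated with a minimal sketch. It assumes that each rank yields a probability per label ID and that a path is scored by the sum of its per-rank log-probabilities; the actual scoring used by PACMT is not described here and may differ. The function name `decode_path` and the toy numbers are hypothetical.

```python
import math

RANKS = ("order", "family", "genus", "species")

def decode_path(rank_probs, valid_paths):
    """Return the valid path with the highest joint log-probability.

    rank_probs: dict mapping rank name -> list of probabilities indexed by label ID.
    valid_paths: iterable of (order_id, family_id, genus_id, species_id) tuples.
    Scoring by summed log-probabilities is an assumption, not PACMT's documented rule.
    """
    best_path, best_score = None, -math.inf
    for path in valid_paths:
        # Sum log-probabilities across ranks; the epsilon guards against log(0).
        score = sum(math.log(rank_probs[r][i] + 1e-12) for r, i in zip(RANKS, path))
        if score > best_score:
            best_path, best_score = path, score
    return best_path, best_score

# Toy example with made-up probabilities for two labels per rank.
rank_probs = {
    "order":   [0.6, 0.4],
    "family":  [0.3, 0.7],
    "genus":   [0.5, 0.5],
    "species": [0.9, 0.1],
}
valid_paths = [(0, 0, 0, 0), (1, 1, 1, 1)]
path, log_score = decode_path(rank_probs, valid_paths)
```

In this toy case, taking the top label at each rank independently would pair order 0 with family 1, which is not a listed path. Restricting the search to `valid_paths` rules out such inconsistent combinations.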
|
|
| ## Repository contents |
|
|
| The recommended file structure of this Hugging Face repository is: |
|
|
| ```text |
| PACMT/ |
| ├── README.md |
| ├── backbone/ |
| │   ├── config.json |
| │   ├── pytorch_model.bin |
| │   ├── tokenizer.json |
| │   ├── tokenizer_config.json |
| │   ├── configuration_bert.py |
| │   ├── bert_layers.py |
| │   ├── bert_padding.py |
| │   └── flash_attn_triton.py |
| ├── binary_model/ |
| │   ├── pytorch_model.bin |
| │   ├── head_config.json |
| │   ├── tokenizer.json |
| │   ├── tokenizer_config.json |
| │   └── special_tokens_map.json |
| ├── hierarchy_model/ |
| │   ├── pytorch_model.bin |
| │   ├── head_config.json |
| │   ├── tokenizer.json |
| │   ├── tokenizer_config.json |
| │   ├── special_tokens_map.json |
| │   ├── label_taxonomy_mapping.csv |
| │   ├── taxonomy_paths.csv |
| │   ├── taxonomy_paths_with_names.csv |
| │   └── label_sizes.json |
| └── taxonomy/ |
|     ├── label_taxonomy_mapping.csv |
|     └── taxonomy_paths.csv |
| ``` |
|
|
| ## Required files |
|
|
| To run the complete PACMT prediction workflow, the following files or directories are required: |
|
|
| ```text |
| backbone/ |
| binary_model/ |
| hierarchy_model/ |
| taxonomy/label_taxonomy_mapping.csv |
| taxonomy/taxonomy_paths.csv |
| ``` |
|
|
| The `label_taxonomy_mapping.csv` file maps internal label IDs to taxonomy names and should contain at least: |
|
|
| ```text |
| rank,label_id,taxonomy_name |
| ``` |
|
|
| The `taxonomy_paths.csv` file defines valid hierarchical taxonomy paths and should contain at least: |
|
|
| ```text |
| order_id,family_id,genus_id,species_id |
| ``` |
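As a rough illustration of how the two files relate, the sketch below reads both with the standard `csv` module and translates a path of IDs into names. It assumes that the `rank` column holds the strings `order`, `family`, `genus` and `species`, and that `label_id` is an integer that is unique within a rank; neither is confirmed here. The helper names and the placeholder rows are hypothetical.

```python
import csv
import io

RANKS = ("order", "family", "genus", "species")

def load_mapping(handle):
    """Read rank,label_id,taxonomy_name rows into {(rank, label_id): name}."""
    return {
        (row["rank"], int(row["label_id"])): row["taxonomy_name"]
        for row in csv.DictReader(handle)
    }

def load_paths(handle):
    """Read order_id,family_id,genus_id,species_id rows into a list of tuples."""
    return [tuple(int(row[f"{r}_id"]) for r in RANKS) for row in csv.DictReader(handle)]

def name_path(path, mapping):
    """Translate one ID path into taxonomy names, rank by rank."""
    return [mapping[(r, i)] for r, i in zip(RANKS, path)]

# Placeholder contents standing in for the released files (not real taxonomy data).
mapping_csv = io.StringIO(
    "rank,label_id,taxonomy_name\n"
    "order,0,OrderA\nfamily,0,FamilyA\ngenus,0,GenusA\nspecies,0,SpeciesA\n"
)
paths_csv = io.StringIO("order_id,family_id,genus_id,species_id\n0,0,0,0\n")

mapping = load_mapping(mapping_csv)
paths = load_paths(paths_csv)
names = name_path(paths[0], mapping)
```

For the real files, replace the `io.StringIO` objects with `open(path, newline="")` handles.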
|
|
| ## Installation and usage |
|
|
| Please install PACMT from the GitHub repository: |
|
|
| ```bash |
| git clone https://github.com/luanbei/PACMT.git |
| cd PACMT |
| conda create -n pacmt python=3.8 -y |
| conda activate pacmt |
| pip install -r requirements.txt |
| ``` |
|
|
| Download this Hugging Face model repository and place the files under the `models/` directory: |
|
|
| ```bash |
| pip install -U huggingface_hub |
| hf download luanbei/PACMT --local-dir models |
| ``` |
|
|
| After downloading, the local model directory should look like: |
|
|
| ```text |
| models/ |
| ├── backbone/ |
| ├── binary_model/ |
| ├── hierarchy_model/ |
| └── taxonomy/ |
| ``` |
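A quick way to confirm that the download produced the expected layout is a small check like the one below. It only tests that the entries listed under "Required files" exist; it does not validate their contents. The function name `missing_entries` is hypothetical and not part of PACMT.

```python
from pathlib import Path

# Entries listed as required for the complete prediction workflow.
REQUIRED = [
    "backbone",
    "binary_model",
    "hierarchy_model",
    "taxonomy/label_taxonomy_mapping.csv",
    "taxonomy/taxonomy_paths.csv",
]

def missing_entries(model_dir):
    """Return the required files or directories that are absent under model_dir."""
    root = Path(model_dir)
    return [rel for rel in REQUIRED if not (root / rel).exists()]

if __name__ == "__main__":
    missing = missing_entries("models")
    print("missing:", missing if missing else "none")
```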
|
|
| ## Complete prediction workflow |
|
|
| The complete two-stage PACMT workflow first performs binary viral screening and then applies hierarchical taxonomic classification to sequences predicted as viral. |
|
|
| ```bash |
| python scripts/predict_binary_hierarchy.py \ |
| --backbone_dir models/backbone \ |
| --binary_ckpt_dir models/binary_model \ |
| --hierarchy_ckpt_dir models/hierarchy_model \ |
| --mapping_csv models/taxonomy/label_taxonomy_mapping.csv \ |
| --taxonomy_path_csv models/taxonomy/taxonomy_paths.csv \ |
| --input_csv examples/example.csv \ |
| --seq_col seq \ |
| --id_col id \ |
| --seg_len 500 \ |
| --stride 250 \ |
| --max_length 512 \ |
| --batch_size 32 \ |
| --device cuda \ |
| --virus_threshold 0.5 \ |
| --tau 0.2 \ |
| --out_csv pacmt_predictions.csv |
| ``` |
|
|
| For FASTA input, replace the CSV input arguments (`--input_csv`, `--seq_col` and `--id_col`) with: |
|
|
| ```bash |
| --input_fasta examples/example.fasta |
| ``` |
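The `--seg_len` and `--stride` options in the command above suggest that long sequences are split into overlapping windows before prediction, which would also explain the `n_segments` output field. The sketch below shows a generic sliding-window split under that assumption. How the PACMT scripts treat short sequences and the final partial window is not documented here, so the tail handling in this example is a guess, and the function name `segment` is hypothetical.

```python
def segment(seq, seg_len=500, stride=250):
    """Split seq into overlapping windows of seg_len, advancing by stride.

    Assumptions (not confirmed for PACMT): sequences no longer than seg_len
    are kept whole, and a final end-aligned window covers any leftover tail.
    """
    if len(seq) <= seg_len:
        return [seq]
    starts = list(range(0, len(seq) - seg_len + 1, stride))
    # Add an end-aligned window if the last regular window stops short of the end.
    if starts[-1] + seg_len < len(seq):
        starts.append(len(seq) - seg_len)
    return [seq[s:s + seg_len] for s in starts]

# A 1,200 bp toy sequence yields windows starting at 0, 250, 500 and 700.
windows = segment("ACGT" * 300)
```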
|
|
| ## Binary viral screening only |
|
|
| ```bash |
| python scripts/predict_binary.py \ |
| --backbone_dir models/backbone \ |
| --ckpt_dir models/binary_model \ |
| --input_csv examples/example.csv \ |
| --seq_col seq \ |
| --id_col id \ |
| --seg_len 500 \ |
| --stride 250 \ |
| --max_length 512 \ |
| --batch_size 32 \ |
| --device cuda \ |
| --tau 0.2 \ |
| --threshold 0.5 \ |
| --out_csv binary_predictions.csv |
| ``` |
|
|
| ## Hierarchical classification only |
|
|
| ```bash |
| python scripts/predict_hierarchy.py \ |
| --backbone_dir models/backbone \ |
| --ckpt_dir models/hierarchy_model \ |
| --mapping_csv models/taxonomy/label_taxonomy_mapping.csv \ |
| --taxonomy_path_csv models/taxonomy/taxonomy_paths.csv \ |
| --input_csv examples/example.csv \ |
| --seq_col seq \ |
| --id_col id \ |
| --seg_len 500 \ |
| --stride 250 \ |
| --max_length 512 \ |
| --batch_size 32 \ |
| --device cuda \ |
| --tau 0.2 \ |
| --out_csv hierarchy_predictions.csv |
| ``` |
|
|
| ## Output |
|
|
| The complete workflow writes a CSV file (the path given by `--out_csv`) with the following columns: |
|
|
| ```text |
| id |
| seq_len |
| n_segments |
| is_virus |
| virus_confidence |
| order_id, order_name, order_conf |
| family_id, family_name, family_conf |
| genus_id, genus_name, genus_conf |
| species_id, species_name, species_conf |
| joint_score |
| log_joint_score |
| ``` |
|
|
| `is_virus=1` indicates that the input sequence is predicted as viral. If `is_virus=0`, the hierarchical taxonomic fields are left empty. |
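The output can be filtered with the standard `csv` module, as in the sketch below. It assumes that `is_virus` is written as the literal `1` or `0` and uses a reduced set of columns with made-up values. The function name `viral_rows` is hypothetical.

```python
import csv
import io

def viral_rows(handle):
    """Return the rows of a PACMT-style output CSV whose is_virus field is 1."""
    return [row for row in csv.DictReader(handle) if row["is_virus"].strip() == "1"]

# Made-up example rows using a subset of the documented columns.
example = io.StringIO(
    "id,is_virus,virus_confidence,species_name,species_conf\n"
    "seq1,1,0.97,SpeciesA,0.81\n"
    "seq2,0,0.12,,\n"
)
rows = viral_rows(example)
viral_ids = [row["id"] for row in rows]
```

To process a real run, pass `open("pacmt_predictions.csv", newline="")` instead of the `io.StringIO` object.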
|
|
| ## Intended use |
|
|
| PACMT is intended for research use in viral sequence screening and hierarchical taxonomic annotation of metagenomic sequences. |
|
|
| ## Limitations |
|
|
| - Species-level prediction is generally more difficult than higher-rank prediction. |
| - Predictions for short, divergent or underrepresented viral sequences should be interpreted carefully. |
| - The hierarchical classifier relies on the released taxonomy mapping files and valid taxonomy paths. |
| - PACMT should be used as a research tool and should not be used as the sole basis for clinical decision-making. |
|
|
| ## Citation |
|
|
| If you use PACMT, please cite: |
|
|
| ```text |
| Luan B, Li P, et al. PACMT: a pretrained language model-based framework for viral identification and hierarchical taxonomic classification of metagenomic data. |
| ``` |
|
|