---
license: other
language:
- en
tags:
- metagenomics
- viral-identification
- hierarchical-classification
- taxonomic-classification
- DNABERT-2
- bioinformatics
pipeline_tag: text-classification
---

# PACMT

PACMT is a framework for viral identification and hierarchical taxonomic classification of metagenomic sequences, built on a pretrained sequence model.

This repository contains the trained PACMT model files and taxonomy resources. The source code, example files and detailed usage instructions are available at:

```text
https://github.com/luanbei/PACMT
```

## Model description

PACMT uses a two-stage serial workflow:

1. **Binary viral screening**: a binary classifier predicts whether an input sequence is viral or non-viral.
2. **Hierarchical viral classification**: sequences predicted as viral are further classified at the order, family, genus and species levels.

For hierarchical classification, PACMT uses taxonomy-consistent path decoding to select a biologically valid order-family-genus-species prediction path.
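The idea behind taxonomy-consistent path decoding can be illustrated with a minimal sketch. It assumes that each rank produces an independent probability vector and that the best path is the valid one with the highest summed log-probability; the actual PACMT scoring may differ, and `decode_path` is a hypothetical helper, not part of the PACMT code base.

```python
import math

RANKS = ("order", "family", "genus", "species")

def decode_path(rank_probs, valid_paths):
    """Return the valid path with the highest joint log-probability.

    rank_probs: dict mapping rank name -> list of probabilities indexed by label ID.
    valid_paths: iterable of (order_id, family_id, genus_id, species_id) tuples.
    """
    best_path, best_log = None, -math.inf
    for path in valid_paths:
        # Sum log-probabilities across the four ranks; clamp to avoid log(0).
        log_joint = sum(
            math.log(max(rank_probs[rank][label_id], 1e-12))
            for rank, label_id in zip(RANKS, path)
        )
        if log_joint > best_log:
            best_path, best_log = path, log_joint
    return best_path, best_log

# Toy example (made-up numbers): the per-rank argmax would pick family 1,
# but no valid path combines order 0 with family 1.
rank_probs = {
    "order": [0.7, 0.3],
    "family": [0.4, 0.6],
    "genus": [0.5, 0.5],
    "species": [0.9, 0.1],
}
valid_paths = [(0, 0, 0, 0), (1, 1, 1, 1)]
path, log_joint = decode_path(rank_probs, valid_paths)
```

Restricting the search to the listed paths guarantees that the reported order, family, genus and species always belong to the same lineage, even when the independent per-rank predictions disagree.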

## Repository contents

The recommended file structure of this Hugging Face repository is:

```text
PACMT/
├── README.md
├── backbone/
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   ├── configuration_bert.py
│   ├── bert_layers.py
│   ├── bert_padding.py
│   └── flash_attn_triton.py
├── binary_model/
│   ├── pytorch_model.bin
│   ├── head_config.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── hierarchy_model/
│   ├── pytorch_model.bin
│   ├── head_config.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── label_taxonomy_mapping.csv
│   ├── taxonomy_paths.csv
│   ├── taxonomy_paths_with_names.csv
│   └── label_sizes.json
└── taxonomy/
    ├── label_taxonomy_mapping.csv
    └── taxonomy_paths.csv
```

## Required files

To run the complete PACMT prediction workflow, the following files or directories are required:

```text
backbone/
binary_model/
hierarchy_model/
taxonomy/label_taxonomy_mapping.csv
taxonomy/taxonomy_paths.csv
```

The `label_taxonomy_mapping.csv` file maps internal label IDs to taxonomy names and should contain at least:

```text
rank,label_id,taxonomy_name
```

The `taxonomy_paths.csv` file defines valid hierarchical taxonomy paths and should contain at least:

```text
order_id,family_id,genus_id,species_id
```
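Before running a prediction, it may help to confirm that both CSV files expose the columns listed above. The following is a minimal sketch using only the Python standard library; `missing_columns` is a hypothetical helper and not part of the PACMT scripts.

```python
import csv

MAPPING_COLUMNS = {"rank", "label_id", "taxonomy_name"}
PATH_COLUMNS = {"order_id", "family_id", "genus_id", "species_id"}

def missing_columns(csv_path, required):
    """Return the required column names absent from the CSV header, sorted."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Read only the first row; an empty file yields an empty header.
        header = next(csv.reader(handle), [])
    return sorted(required - {name.strip() for name in header})
```

For example, `missing_columns("models/taxonomy/taxonomy_paths.csv", PATH_COLUMNS)` returns an empty list when all four ID columns are present.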

## Installation and usage

Please install PACMT from the GitHub repository:

```bash
git clone https://github.com/luanbei/PACMT.git
cd PACMT
conda create -n pacmt python=3.8 -y
conda activate pacmt
pip install -r requirements.txt
```

Download this Hugging Face model repository and place the files under the `models/` directory:

```bash
pip install -U huggingface_hub
hf download luanbei/PACMT --local-dir models
```

After downloading, the local model directory should look like:

```text
models/
├── backbone/
├── binary_model/
├── hierarchy_model/
└── taxonomy/
```
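A quick way to verify the download is to check that the required directories and taxonomy files exist. This is a small sketch under the layout shown above; `missing_model_files` is a hypothetical helper, not a PACMT function.

```python
from pathlib import Path

# Paths required by the complete workflow, relative to the model root.
REQUIRED = [
    "backbone",
    "binary_model",
    "hierarchy_model",
    "taxonomy/label_taxonomy_mapping.csv",
    "taxonomy/taxonomy_paths.csv",
]

def missing_model_files(root="models"):
    """Return the required paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in REQUIRED if not (root / rel).exists()]

if __name__ == "__main__":
    missing = missing_model_files("models")
    print("All required files found." if not missing else f"Missing: {missing}")
```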

## Complete prediction workflow

The complete two-stage PACMT workflow first performs binary viral screening and then applies hierarchical taxonomic classification to sequences predicted as viral.

```bash
python scripts/predict_binary_hierarchy.py \
  --backbone_dir models/backbone \
  --binary_ckpt_dir models/binary_model \
  --hierarchy_ckpt_dir models/hierarchy_model \
  --mapping_csv models/taxonomy/label_taxonomy_mapping.csv \
  --taxonomy_path_csv models/taxonomy/taxonomy_paths.csv \
  --input_csv examples/example.csv \
  --seq_col seq \
  --id_col id \
  --seg_len 500 \
  --stride 250 \
  --max_length 512 \
  --batch_size 32 \
  --device cuda \
  --virus_threshold 0.5 \
  --tau 0.2 \
  --out_csv pacmt_predictions.csv
```

For FASTA input, replace the CSV input arguments with:

```bash
--input_fasta examples/example.fasta
```
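The `--seg_len` and `--stride` options in the command above suggest that long sequences are split into overlapping windows before inference. The sketch below shows one common way to do this; how PACMT treats short sequences and the final partial window is an assumption here, and `segment_starts` and `segment` are hypothetical helpers.

```python
def segment_starts(seq_len, seg_len=500, stride=250):
    """Start offsets of fixed-length windows over a sequence of length seq_len."""
    # Assumption: a sequence no longer than seg_len is kept as a single window.
    if seq_len <= seg_len:
        return [0]
    starts = list(range(0, seq_len - seg_len + 1, stride))
    # Assumption: add a final window flush with the end so no bases are dropped.
    if starts[-1] + seg_len < seq_len:
        starts.append(seq_len - seg_len)
    return starts

def segment(seq, seg_len=500, stride=250):
    """Split seq into overlapping windows of at most seg_len characters."""
    return [seq[s:s + seg_len] for s in segment_starts(len(seq), seg_len, stride)]
```

With the default values, a 1,200 bp sequence would yield windows starting at offsets 0, 250, 500 and 700 under these assumptions.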

## Binary viral screening only

```bash
python scripts/predict_binary.py \
  --backbone_dir models/backbone \
  --ckpt_dir models/binary_model \
  --input_csv examples/example.csv \
  --seq_col seq \
  --id_col id \
  --seg_len 500 \
  --stride 250 \
  --max_length 512 \
  --batch_size 32 \
  --device cuda \
  --tau 0.2 \
  --threshold 0.5 \
  --out_csv binary_predictions.csv
```

## Hierarchical classification only

```bash
python scripts/predict_hierarchy.py \
  --backbone_dir models/backbone \
  --ckpt_dir models/hierarchy_model \
  --mapping_csv models/taxonomy/label_taxonomy_mapping.csv \
  --taxonomy_path_csv models/taxonomy/taxonomy_paths.csv \
  --input_csv examples/example.csv \
  --seq_col seq \
  --id_col id \
  --seg_len 500 \
  --stride 250 \
  --max_length 512 \
  --batch_size 32 \
  --device cuda \
  --tau 0.2 \
  --out_csv hierarchy_predictions.csv
```

## Output

The complete workflow outputs a CSV file containing:

```text
id
seq_len
n_segments
is_virus
virus_confidence
order_id, order_name, order_conf
family_id, family_name, family_conf
genus_id, genus_name, genus_conf
species_id, species_name, species_conf
joint_score
log_joint_score
```

`is_virus=1` indicates that the input sequence is predicted as viral. If `is_virus=0`, the hierarchical taxonomic fields are left empty.
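
The output CSV can be filtered with the Python standard library. The sketch below keeps only rows predicted as viral, using the column names listed above; the `min_confidence` parameter and the `viral_predictions` helper are hypothetical, and the code assumes `is_virus` is written as a numeric 0/1 value.

```python
import csv

def viral_predictions(csv_path, min_confidence=0.0):
    """Yield rows predicted as viral with virus_confidence >= min_confidence."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Assumption: is_virus is stored as 0/1 (possibly as 0.0/1.0).
            if float(row["is_virus"]) != 1:
                continue
            if float(row["virus_confidence"]) >= min_confidence:
                yield row

if __name__ == "__main__":
    for row in viral_predictions("pacmt_predictions.csv", min_confidence=0.9):
        print(row["id"], row["species_name"], row["species_conf"])
```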

## Intended use

PACMT is intended for research use in viral sequence screening and hierarchical taxonomic annotation of metagenomic sequences.

## Limitations

- Species-level prediction is generally more difficult than higher-rank prediction.
- Predictions for short, divergent or underrepresented viral sequences should be interpreted carefully.
- The hierarchical classifier relies on the released taxonomy mapping files and valid taxonomy paths.
- PACMT should be used as a research tool and should not be used as the sole basis for clinical decision-making.

## Citation

If you use PACMT, please cite:

```text
Luan B, Li P, et al. PACMT: a pretrained language model-based framework for viral identification and hierarchical taxonomic classification of metagenomic data.
```