luanbei committed on
Commit 2d91c8e · verified · 1 Parent(s): bc621b7

Upload README.md

Files changed (1)
  1. README.md +230 -0
README.md ADDED
@@ -0,0 +1,230 @@
+ ---
+ license: other
+ language:
+ - en
+ tags:
+ - metagenomics
+ - viral-identification
+ - hierarchical-classification
+ - taxonomic-classification
+ - DNABERT-2
+ - bioinformatics
+ pipeline_tag: text-classification
+ ---
+
+ # PACMT
+
+ PACMT is a pretrained sequence model-based framework for viral identification and hierarchical taxonomic classification of metagenomic sequences.
+
+ This repository contains the trained PACMT model files and taxonomy resources. The source code, example files and detailed usage instructions are available at:
+
+ ```text
+ https://github.com/luanbei/PACMT
+ ```
+
+ ## Model description
+
+ PACMT uses a two-stage serial workflow:
+
+ 1. **Binary viral screening**: a binary classifier predicts whether an input sequence is viral or non-viral.
+ 2. **Hierarchical viral classification**: sequences predicted as viral are further classified at the order, family, genus and species levels.
+
+ For hierarchical classification, PACMT uses taxonomy-consistent path decoding to select a biologically valid order-family-genus-species prediction path.
+
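The path decoding step can be illustrated with a minimal sketch. It assumes that each rank yields a probability vector indexed by label ID and that a candidate path is scored by the product of its four per-rank probabilities. The `joint_score` and `log_joint_score` output columns suggest this, but the README does not confirm it. The function name `decode_path` and the toy numbers are hypothetical, and the implementation in the GitHub repository may differ.

```python
import math


def decode_path(rank_probs, valid_paths):
    """Pick the valid taxonomy path with the highest joint score.

    rank_probs: dict with keys "order", "family", "genus", "species";
        each value is a list of probabilities indexed by label ID.
    valid_paths: iterable of (order_id, family_id, genus_id, species_id).
    Returns (best_path, log_joint_score).
    """
    ranks = ("order", "family", "genus", "species")
    best_path, best_log = None, -math.inf
    for path in valid_paths:
        # Summing log-probabilities equals the log of the product score.
        log_joint = sum(
            math.log(max(rank_probs[rank][label_id], 1e-12))
            for rank, label_id in zip(ranks, path)
        )
        if log_joint > best_log:
            best_path, best_log = path, log_joint
    return best_path, best_log


# Toy probabilities (made up for illustration only).
rank_probs = {
    "order": [0.6, 0.4],
    "family": [0.3, 0.7],
    "genus": [0.5, 0.5],
    "species": [0.9, 0.1],
}
# Only these two order-family-genus-species combinations are valid.
valid_paths = [(0, 0, 0, 0), (1, 1, 1, 1)]
best_path, best_log = decode_path(rank_probs, valid_paths)
```

In this toy case an independent per-rank argmax would pair order 0 with family 1, a combination that is not among the valid paths. Restricting the search to the listed paths avoids that kind of inconsistency.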
+ ## Repository contents
+
+ The recommended file structure of this Hugging Face repository is:
+
+ ```text
+ PACMT/
+ ├── README.md
+ ├── backbone/
+ │   ├── config.json
+ │   ├── pytorch_model.bin
+ │   ├── tokenizer.json
+ │   ├── tokenizer_config.json
+ │   ├── configuration_bert.py
+ │   ├── bert_layers.py
+ │   ├── bert_padding.py
+ │   └── flash_attn_triton.py
+ ├── binary_model/
+ │   ├── pytorch_model.bin
+ │   ├── head_config.json
+ │   ├── tokenizer.json
+ │   ├── tokenizer_config.json
+ │   └── special_tokens_map.json
+ ├── hierarchy_model/
+ │   ├── pytorch_model.bin
+ │   ├── head_config.json
+ │   ├── tokenizer.json
+ │   ├── tokenizer_config.json
+ │   ├── special_tokens_map.json
+ │   ├── label_taxonomy_mapping.csv
+ │   ├── taxonomy_paths.csv
+ │   ├── taxonomy_paths_with_names.csv
+ │   └── label_sizes.json
+ └── taxonomy/
+     ├── label_taxonomy_mapping.csv
+     └── taxonomy_paths.csv
+ ```
+
+ ## Required files
+
+ To run the complete PACMT prediction workflow, the following files or directories are required:
+
+ ```text
+ backbone/
+ binary_model/
+ hierarchy_model/
+ taxonomy/label_taxonomy_mapping.csv
+ taxonomy/taxonomy_paths.csv
+ ```
+
+ The `label_taxonomy_mapping.csv` file maps internal label IDs to taxonomy names and should contain at least:
+
+ ```text
+ rank,label_id,taxonomy_name
+ ```
+
+ The `taxonomy_paths.csv` file defines valid hierarchical taxonomy paths and should contain at least:
+
+ ```text
+ order_id,family_id,genus_id,species_id
+ ```
+
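Before running prediction, it can help to confirm that both CSV files carry the columns listed above. The following is a minimal sketch using only the Python standard library. The helper name `missing_columns` and the in-memory sample rows (including the placeholder name `ExampleOrder`) are hypothetical and are not part of PACMT.

```python
import csv
import io

# Column names taken from the two header lines shown above.
MAPPING_COLUMNS = {"rank", "label_id", "taxonomy_name"}
PATH_COLUMNS = {"order_id", "family_id", "genus_id", "species_id"}


def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])
    return sorted(required - header)


# In-memory stand-ins for the real files (contents are made up).
mapping_csv = io.StringIO("rank,label_id,taxonomy_name\norder,0,ExampleOrder\n")
paths_csv = io.StringIO("order_id,family_id,genus_id\n0,0,0\n")

mapping_missing = missing_columns(mapping_csv, MAPPING_COLUMNS)
paths_missing = missing_columns(paths_csv, PATH_COLUMNS)
```

For real files, pass an open file handle instead, for example `open("models/taxonomy/taxonomy_paths.csv", newline="")`.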
+ ## Installation and usage
+
+ Please install PACMT from the GitHub repository:
+
+ ```bash
+ git clone https://github.com/luanbei/PACMT.git
+ cd PACMT
+ conda create -n pacmt python=3.8 -y
+ conda activate pacmt
+ pip install -r requirements.txt
+ ```
+
+ Download this Hugging Face model repository and place the files under the `models/` directory:
+
+ ```bash
+ pip install -U huggingface_hub
+ hf download luanbei/PACMT --local-dir models
+ ```
+
+ After downloading, the local model directory should look like:
+
+ ```text
+ models/
+ ├── backbone/
+ ├── binary_model/
+ ├── hierarchy_model/
+ └── taxonomy/
+ ```
+
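A quick way to verify the download is to check that the required directories and taxonomy files exist under `models/`. This is a minimal sketch; the helper `missing_model_paths` is hypothetical and does not ship with PACMT.

```python
from pathlib import Path

# Paths listed under "Required files", relative to the local model directory.
REQUIRED_PATHS = (
    "backbone",
    "binary_model",
    "hierarchy_model",
    "taxonomy/label_taxonomy_mapping.csv",
    "taxonomy/taxonomy_paths.csv",
)


def missing_model_paths(model_root="models"):
    """Return the required relative paths that do not exist under model_root."""
    root = Path(model_root)
    return [rel for rel in REQUIRED_PATHS if not (root / rel).exists()]
```

Calling `missing_model_paths("models")` returns an empty list when the layout is complete, and otherwise the entries that still need to be downloaded.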
+ ## Complete prediction workflow
+
+ The complete two-stage PACMT workflow first performs binary viral screening and then applies hierarchical taxonomic classification to sequences predicted as viral.
+
+ ```bash
+ python scripts/predict_binary_hierarchy.py \
+ --backbone_dir models/backbone \
+ --binary_ckpt_dir models/binary_model \
+ --hierarchy_ckpt_dir models/hierarchy_model \
+ --mapping_csv models/taxonomy/label_taxonomy_mapping.csv \
+ --taxonomy_path_csv models/taxonomy/taxonomy_paths.csv \
+ --input_csv examples/example.csv \
+ --seq_col seq \
+ --id_col id \
+ --seg_len 500 \
+ --stride 250 \
+ --max_length 512 \
+ --batch_size 32 \
+ --device cuda \
+ --virus_threshold 0.5 \
+ --tau 0.2 \
+ --out_csv pacmt_predictions.csv
+ ```
+
+ For FASTA input, replace the CSV input arguments with:
+
+ ```bash
+ --input_fasta examples/example.fasta
+ ```
+
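The `--seg_len 500` and `--stride 250` arguments in the command above, together with the `n_segments` output column, suggest that long sequences are cut into overlapping windows before prediction. The following minimal sketch shows one way such a split could work. How the scripts actually treat short sequences and the final partial window is not stated here, so the tail handling below is an assumption, and `segment_sequence` is a hypothetical name.

```python
def segment_sequence(seq, seg_len=500, stride=250):
    """Split seq into overlapping windows of at most seg_len characters."""
    if seg_len <= 0 or stride <= 0:
        raise ValueError("seg_len and stride must be positive")
    if len(seq) <= seg_len:
        # Assumption: a short sequence is kept as a single segment.
        return [seq]
    starts = list(range(0, len(seq) - seg_len + 1, stride))
    # Assumption: add one window flush with the end so the tail is covered.
    if starts[-1] + seg_len < len(seq):
        starts.append(len(seq) - seg_len)
    return [seq[start:start + seg_len] for start in starts]


# A 1200-base toy sequence gives windows starting at 0, 250, 500 and 700.
segments = segment_sequence("ACGT" * 300)
```

With a stride of half the window length, every base away from the sequence ends is covered by at least two windows.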
+ ## Binary viral screening only
+
+ ```bash
+ python scripts/predict_binary.py \
+ --backbone_dir models/backbone \
+ --ckpt_dir models/binary_model \
+ --input_csv examples/example.csv \
+ --seq_col seq \
+ --id_col id \
+ --seg_len 500 \
+ --stride 250 \
+ --max_length 512 \
+ --batch_size 32 \
+ --device cuda \
+ --tau 0.2 \
+ --threshold 0.5 \
+ --out_csv binary_predictions.csv
+ ```
+
+ ## Hierarchical classification only
+
+ ```bash
+ python scripts/predict_hierarchy.py \
+ --backbone_dir models/backbone \
+ --ckpt_dir models/hierarchy_model \
+ --mapping_csv models/taxonomy/label_taxonomy_mapping.csv \
+ --taxonomy_path_csv models/taxonomy/taxonomy_paths.csv \
+ --input_csv examples/example.csv \
+ --seq_col seq \
+ --id_col id \
+ --seg_len 500 \
+ --stride 250 \
+ --max_length 512 \
+ --batch_size 32 \
+ --device cuda \
+ --tau 0.2 \
+ --out_csv hierarchy_predictions.csv
+ ```
+
+ ## Output
+
+ The complete workflow outputs a CSV file containing:
+
+ ```text
+ id
+ seq_len
+ n_segments
+ is_virus
+ virus_confidence
+ order_id, order_name, order_conf
+ family_id, family_name, family_conf
+ genus_id, genus_name, genus_conf
+ species_id, species_name, species_conf
+ joint_score
+ log_joint_score
+ ```
+
+ `is_virus=1` indicates that the input sequence is predicted as viral. If `is_virus=0`, the hierarchical taxonomic fields are left empty.
+
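Downstream analysis usually starts by keeping only the rows predicted as viral. This is a minimal sketch with the standard `csv` module. It assumes `is_virus` is written as the literal text `1` or `0`, uses only a subset of the columns, and the sample rows are made up. `viral_rows` is a hypothetical helper.

```python
import csv
import io


def viral_rows(csv_file):
    """Return the rows of a PACMT prediction CSV whose is_virus field is 1."""
    reader = csv.DictReader(csv_file)
    # Assumption: the flag is serialized as the text "1" for viral sequences.
    return [row for row in reader if row["is_virus"].strip() == "1"]


# In-memory stand-in for pacmt_predictions.csv (values are made up).
sample = io.StringIO(
    "id,is_virus,virus_confidence,species_name\n"
    "seq1,1,0.97,ExampleSpecies\n"
    "seq2,0,0.12,\n"
)
viral = viral_rows(sample)
viral_ids = [row["id"] for row in viral]
```

For a real run, open the file given to `--out_csv`, for example `open("pacmt_predictions.csv", newline="")`, and pass the handle to `viral_rows`.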
+ ## Intended use
+
+ PACMT is intended for research use in viral sequence screening and hierarchical taxonomic annotation of metagenomic sequences.
+
+ ## Limitations
+
+ - Species-level prediction is generally more difficult than higher-rank prediction.
+ - Predictions for short, divergent or underrepresented viral sequences should be interpreted carefully.
+ - The hierarchical classifier relies on the released taxonomy mapping files and valid taxonomy paths.
+ - PACMT should be used as a research tool and should not be used as the sole basis for clinical decision-making.
+
+ ## Citation
+
+ If you use PACMT, please cite:
+
+ ```text
+ Luan B, Li P, et al. PACMT: a pretrained language model-based framework for viral identification and hierarchical taxonomic classification of metagenomic data.
+ ```