---
license: mit
---

# BCE-Vir-Prediction

A virus epitope prediction tool based on ESM (Evolutionary Scale Modeling). This tool uses a pre-trained ESM classification model to perform sliding window predictions on protein sequences, identifying potential antigen epitopes and functional domains.

## Features

- **Epitope Prediction** (`bcepre_predict_logits.py`): Splits protein sequences into subsequences with sliding windows, classifies each subsequence with a pre-trained ESM classification model (e.g., whether it is an antigen epitope or a functional domain), and saves the predictions along with the corresponding logits.
- **Amino Acid Probability Prediction** (`bcepre_predict_softmax.py`): Converts sliding window prediction results into probability values aggregated by amino acid position, outputting a results table containing amino acid types, epitope probabilities, and coverage counts.

## Model

The pre-trained model can be downloaded from Hugging Face:

- **Model Repository**: [jackkuo/BCE-Vir-Prediction_model](https://huggingface.co/jackkuo/BCE-Vir-Prediction_model)

- **Code Repository**: [JackKuo666/BCE-Vir-Prediction](https://github.com/JackKuo666/BCE-Vir-Prediction)


# Model Download Instructions

This folder is used to store the trained ESM model files.

## How to Download the Model

### Method 1: Using Hugging Face Hub (Recommended)

Use the `huggingface_hub` library to download the model:

```bash
pip install huggingface_hub
```

Then run the following Python code:

```python
from huggingface_hub import snapshot_download

# Download the model to the current folder
snapshot_download(
    repo_id="jackkuo/BCE-Vir-Prediction_model",
    local_dir="./",
    local_dir_use_symlinks=False
)
```

Or use `huggingface-cli` in the command line:

```bash
huggingface-cli download jackkuo/BCE-Vir-Prediction_model --local-dir ./ --local-dir-use-symlinks False
```

### Method 2: Using Git LFS

If Git LFS is installed, you can clone the repository directly into this folder (the folder must be empty):

```bash
git lfs install
git clone https://huggingface.co/jackkuo/BCE-Vir-Prediction_model .
```

### Method 3: Manual Download

Visit the model page: https://huggingface.co/jackkuo/BCE-Vir-Prediction_model

Download the required files from the file list and save them to this folder.

## Model File Structure

After downloading, this folder should contain the following files:
- `config.json` - Model configuration file
- `model.safetensors` - Model weights file (in safetensors format)
- `tokenizer_config.json` - Tokenizer configuration file
- `vocab.txt` - Vocabulary file
- `special_tokens_map.json` - Special tokens mapping file
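
To confirm the download completed, the list above can be checked with a short script. This is a minimal sketch, not part of the repository: the helper name `missing_model_files` is hypothetical, and it only tests that each file exists, not that its contents are valid.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "vocab.txt",
    "special_tokens_map.json",
]


def missing_model_files(model_dir="."):
    """Return the expected file names that are not present in model_dir."""
    folder = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


if __name__ == "__main__":
    # Hypothetical usage: check the current folder.
    missing = missing_model_files(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected model files are present.")
```

If the model lives elsewhere, pass that folder instead of `"."`.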

## Usage

### Step 1: Download the Model

First, download the pre-trained model into the `trained_esm_model` folder using one of the methods described above.

### Step 2: Prepare Input Files

Place the FASTA file containing the protein sequences to predict in the `example_data` folder, or edit the input file path in the script.
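
For reference, a FASTA file consists of header lines starting with `>` followed by one or more sequence lines. The sketch below shows one way to read such a file in plain Python. The function `read_fasta` is a hypothetical helper for illustration; the parser used by `bcepre_predict_logits.py` may differ.

```python
def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                # Sequence lines may be wrapped; join them and upper-case.
                chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Calling `read_fasta` on a file in `example_data` returns one `(header, sequence)` pair per record, which is a quick way to check that the input is well formed before running the prediction script.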

### Step 3: Run Epitope Prediction

Run the `bcepre_predict_logits.py` script for epitope prediction:

```bash
python bcepre_predict_logits.py
```

This script will:
- Read the protein sequence file in FASTA format
- Split each sequence into subsequences using sliding windows (the default minimum window size is 5)
- Perform classification predictions on each subsequence
- Output a CSV file containing the following fields:
  - `sequence`: Subsequence
  - `window_size`: Window size
  - `prediction`: Predicted class
  - `logit_0`, `logit_1`, ...: Logits values for each class

Output files are saved in the `predictions/` folder by default.
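
The sliding-window step can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: only the minimum window size of 5 comes from the description above, while the `max_window` and `step` parameters and the function name `sliding_windows` are hypothetical.

```python
def sliding_windows(sequence, min_window=5, max_window=None, step=1):
    """Yield (start, window_size, subsequence) tuples; start is 0-based.

    max_window and step are assumptions made for this sketch.
    """
    if max_window is None:
        max_window = len(sequence)
    max_window = min(max_window, len(sequence))
    for size in range(min_window, max_window + 1):
        for start in range(0, len(sequence) - size + 1, step):
            yield start, size, sequence[start:start + size]


# Toy example: a 7-residue sequence with window sizes 5 and 6.
for start, size, subseq in sliding_windows("ACDEFGH", min_window=5, max_window=6):
    print(start, size, subseq)
```

Each yielded subsequence corresponds to one row of the output CSV, with `window_size` recorded alongside the prediction and logits.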

### Step 4: Calculate Amino Acid Position Probabilities

Run the `bcepre_predict_softmax.py` script to convert prediction results into aggregated probabilities by amino acid position:

```bash
python bcepre_predict_softmax.py
```

This script will:
- Read the CSV file generated by `bcepre_predict_logits.py`
- Calculate the epitope probability for each subsequence by applying the softmax function to its logits
- Aggregate probability values by amino acid position
- Output a CSV file containing the following fields:
  - `position`: Amino acid position (starting from 1)
  - `amino_acid`: Amino acid type
  - `probability`: Epitope probability at this position (average of all window predictions covering this position)
  - `coverage`: Number of windows covering this position
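
The aggregation described above can be sketched in plain Python. Several details here are assumptions rather than facts about `bcepre_predict_softmax.py`: that class index 1 is the epitope class, that the 0-based start offset of each window is known, and the helper names `softmax` and `aggregate_by_position`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_by_position(sequence, windows, epitope_class=1):
    """Average window-level epitope probabilities per amino acid position.

    windows: iterable of (start, window_size, logits), start being 0-based.
    epitope_class=1 is an assumption about the label order.
    """
    sums = [0.0] * len(sequence)
    coverage = [0] * len(sequence)
    for start, size, logits in windows:
        prob = softmax(logits)[epitope_class]
        for i in range(start, start + size):
            sums[i] += prob
            coverage[i] += 1
    rows = []
    for i, residue in enumerate(sequence):
        # Positions covered by no window get None instead of a probability.
        probability = sums[i] / coverage[i] if coverage[i] else None
        rows.append({
            "position": i + 1,  # positions start from 1
            "amino_acid": residue,
            "probability": probability,
            "coverage": coverage[i],
        })
    return rows


# Toy example with made-up logits for two overlapping windows.
for row in aggregate_by_position("ACDEFG", [(0, 5, [0.2, 1.3]), (1, 5, [1.0, -0.5])]):
    print(row)
```

Because every window contributes equally to each position it covers, residues near the sequence ends have lower `coverage` than those in the middle, which is worth keeping in mind when comparing probabilities across positions.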

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Citation

If you use this tool for research, please cite the relevant models and code repositories.

## Contact

For questions or suggestions, please contact us through GitHub Issues.