---
base_model: westlake-repl/SaProt_35M_AF2
library_name: peft
license: mit
metrics:
- accuracy
accuracy: 0.68
---

Base model: westlake-repl/SaProt_35M_AF2

Task type: protein-level classification

Dataset: This model classifies proteins into the 6 major EC classes (EC1-EC6). EC7 was excluded because only 31 samples were available.
To address class imbalance, Label 4 (EC5) was duplicated 2 times and Label 5 (EC6) was duplicated 1 time in the training set.
Training data was obtained from: https://academic.oup.com/nar/article/54/D1/D643/8313833
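
The duplication step described above can be sketched as follows. This is a minimal illustration, not the original preprocessing script: the helper name `oversample`, the `(sequence, label)` tuple layout, and the toy sequences are hypothetical, and it assumes "duplicated N times" means N extra copies are appended.

```python
from collections import Counter


def oversample(samples, extra_copies):
    """Repeat samples of selected labels.

    samples: list of (sequence, label) pairs.
    extra_copies: dict mapping label -> number of additional copies (assumed meaning).
    """
    out = []
    for seq, label in samples:
        # Keep the original sample and append the requested extra copies.
        out.extend([(seq, label)] * (1 + extra_copies.get(label, 0)))
    return out


# Toy data only; the real training set is not reproduced here.
toy = [("MKT", 0), ("GAV", 4), ("LLS", 5)]
balanced = oversample(toy, {4: 2, 5: 1})
counts = Counter(label for _, label in balanced)
print(counts)
```

Duplication should be applied to the training split only, so that the validation and test distributions stay untouched.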

Label mapping:
- Label 0: Oxidoreductase (EC1)
- Label 1: Transferase (EC2)
- Label 2: Hydrolase (EC3)
- Label 3: Lyase (EC4)
- Label 4: Isomerase (EC5)
- Label 5: Ligase (EC6)
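
For post-processing predictions, the label mapping can be kept as a small lookup table. The names `ID2LABEL` and `decode` are illustrative and not part of the released model files.

```python
# Label index -> EC class name, copied from the mapping in this card.
ID2LABEL = {
    0: "Oxidoreductase (EC1)",
    1: "Transferase (EC2)",
    2: "Hydrolase (EC3)",
    3: "Lyase (EC4)",
    4: "Isomerase (EC5)",
    5: "Ligase (EC6)",
}


def decode(label_id: int) -> str:
    """Translate a predicted label index into its EC class name."""
    return ID2LABEL[label_id]


print(decode(2))  # Hydrolase (EC3)
```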

Training set distribution:
- Label 0: 1497 (28.5%)
- Label 2: 1217 (23.2%)
- Label 1: 1050 (20.0%)
- Label 3: 512 (9.7%)
- Label 4: 496 (9.4%)
- Label 5: 483 (9.2%)

Total: 5255 samples

Validation set distribution:
- Label 0: 187 (32.0%)
- Label 2: 152 (26.0%)
- Label 1: 131 (22.4%)
- Label 3: 64 (10.9%)
- Label 4: 31 (5.3%)
- Label 5: 20 (3.4%)

Total: 585 samples

Test set distribution:
- Label 0: 188 (31.8%)
- Label 2: 153 (25.9%)
- Label 1: 132 (22.3%)
- Label 3: 65 (11.0%)
- Label 4: 32 (5.4%)
- Label 5: 21 (3.6%)

Total: 591 samples

Model input type: Amino acid sequence

Performance (on test set): 0.68 accuracy
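
To put the 0.68 figure in context, the test set distribution above gives the accuracy of a trivial classifier that always predicts the most frequent class. The sketch below only redoes that arithmetic from the counts in this card; the variable names are illustrative.

```python
# Test set counts per label, taken from the distribution listed in this card.
test_counts = {0: 188, 1: 132, 2: 153, 3: 65, 4: 32, 5: 21}

total = sum(test_counts.values())
majority_label = max(test_counts, key=test_counts.get)

# Accuracy obtained by always predicting the majority class.
baseline_accuracy = test_counts[majority_label] / total

print(total)                        # 591
print(majority_label)               # 0
print(round(baseline_accuracy, 3))  # 0.318
```

Because the test set is imbalanced, overall accuracy is dominated by Labels 0 to 2; per-class metrics are not reported in this card.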

LoRA config:
- r: 8
- lora_dropout: 0.1
- lora_alpha: 16
- target_modules: ["key", "value", "output.dense", "intermediate.dense", "query"]
- modules_to_save: ["classifier"]
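
The LoRA settings above map onto a `peft` `LoraConfig` roughly as sketched below. The `task_type="SEQ_CLS"` value is an assumption inferred from the protein-level classification task and is not stated in this card; `LORA_KWARGS` and `build_lora_config` are illustrative names. The `peft` import is deferred so the dictionary can be inspected without the library installed.

```python
# Values copied from the LoRA config in this card.
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.1,
    "target_modules": ["key", "value", "output.dense", "intermediate.dense", "query"],
    "modules_to_save": ["classifier"],
}

# With the default LoRA scaling, adapter updates are multiplied by lora_alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(scaling)  # 2.0


def build_lora_config():
    """Create a peft LoraConfig from the values above (requires `peft`)."""
    from peft import LoraConfig

    # task_type is an assumption, not taken from this card.
    return LoraConfig(task_type="SEQ_CLS", **LORA_KWARGS)
```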

Training config:
- optimizer:
  - class: AdamW
  - betas: (0.9, 0.98)
  - weight_decay: 0.01
- learning rate: 0.0005
- epoch: 25
- batch size: 64
- precision: 16-mixed
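
As a rough sketch, the optimizer settings above correspond to the following `torch.optim.AdamW` arguments. The step counts assume no gradient accumulation and that the last partial batch is kept, neither of which is stated in this card; `ADAMW_KWARGS` and `build_optimizer` are illustrative names. The `torch` import is deferred so the arithmetic runs without it.

```python
import math

# Optimizer hyperparameters copied from the training config in this card.
ADAMW_KWARGS = {"lr": 5e-4, "betas": (0.9, 0.98), "weight_decay": 0.01}

TRAIN_SAMPLES = 5255
BATCH_SIZE = 64
EPOCHS = 25

# Assumes the final partial batch is kept and there is no gradient accumulation.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
print(steps_per_epoch, total_steps)  # 83 2075


def build_optimizer(params):
    """Create an AdamW optimizer over `params` (requires `torch`)."""
    import torch

    return torch.optim.AdamW(params, **ADAMW_KWARGS)
```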