evankomp committed on
Commit bd6f161 · 1 Parent(s): 4e990ac

Update README.md

  1. README.md +115 -9
README.md CHANGED
@@ -7,14 +7,6 @@ tags:
 
  __Purpose__: classifies protein sequence into Thermophilic (> 60C) or Mesophilic (<40C) by host organism growth temperature.
 
- __Training__:
- ProteinBERT (Rostlab/prot_bert) was fine tuned on a class balanced version of learn2therm (see [here]()), about 250k protein amino acid sequences.
-
- Training parameters below:
- TODO
-
- See the [training repository](https://github.com/BeckResearchLab/learn2thermML) for code.
-
  __Usage__:
  Prepare sequences identically to using the original pretrained model:
 
@@ -30,4 +22,118 @@ encoded_input = tokenizer(sequence_Example, return_tensors='pt')
  output = torch.argmax(model(**encoded_input), dim=1)
  ```
 
- 1 indicates thermophilic, 0 mesophilic.
+ 1 indicates thermophilic, 0 mesophilic.
+
+ __Training__:
+ ProteinBERT (Rostlab/prot_bert) was fine tuned on a class balanced version of learn2therm (see [here]()), about 250k protein amino acid sequences.
+
+ Training parameters below:
+ TrainingArguments(
+ _n_gpu=1,
+ adafactor=False,
+ adam_beta1=0.9,
+ adam_beta2=0.999,
+ adam_epsilon=1e-08,
+ auto_find_batch_size=False,
+ bf16=False,
+ bf16_full_eval=False,
+ data_seed=None,
+ dataloader_drop_last=False,
+ dataloader_num_workers=0,
+ dataloader_pin_memory=True,
+ ddp_bucket_cap_mb=None,
+ ddp_find_unused_parameters=None,
+ ddp_timeout=1800,
+ debug=[],
+ deepspeed=None,
+ disable_tqdm=False,
+ do_eval=True,
+ do_predict=False,
+ do_train=True,
+ eval_accumulation_steps=25,
+ eval_delay=0,
+ eval_steps=6,
+ evaluation_strategy=steps,
+ fp16=True,
+ fp16_backend=auto,
+ fp16_full_eval=False,
+ fp16_opt_level=O1,
+ fsdp=[],
+ fsdp_min_num_params=0,
+ fsdp_transformer_layer_cls_to_wrap=None,
+ full_determinism=False,
+ gradient_accumulation_steps=25,
+ gradient_checkpointing=True,
+ greater_is_better=False,
+ group_by_length=False,
+ half_precision_backend=cuda_amp,
+ hub_model_id=None,
+ hub_private_repo=False,
+ hub_strategy=every_save,
+ hub_token=<HUB_TOKEN>,
+ ignore_data_skip=False,
+ include_inputs_for_metrics=False,
+ jit_mode_eval=False,
+ label_names=None,
+ label_smoothing_factor=0.0,
+ learning_rate=5e-05,
+ length_column_name=length,
+ load_best_model_at_end=True,
+ local_rank=0,
+ log_level=info,
+ log_level_replica=passive,
+ log_on_each_node=True,
+ logging_dir=./data/ogt_protein_classifier/model/runs/Jun19_12-16-35_g3070,
+ logging_first_step=False,
+ logging_nan_inf_filter=True,
+ logging_steps=1,
+ logging_strategy=steps,
+ lr_scheduler_type=linear,
+ max_grad_norm=1.0,
+ max_steps=-1,
+ metric_for_best_model=loss,
+ mp_parameters=,
+ no_cuda=False,
+ num_train_epochs=2,
+ optim=adamw_hf,
+ optim_args=None,
+ output_dir=./data/ogt_protein_classifier/model,
+ overwrite_output_dir=False,
+ past_index=-1,
+ per_device_eval_batch_size=32,
+ per_device_train_batch_size=32,
+ prediction_loss_only=False,
+ push_to_hub=False,
+ push_to_hub_model_id=None,
+ push_to_hub_organization=None,
+ push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
+ ray_scope=last,
+ remove_unused_columns=True,
+ report_to=['tensorboard', 'codecarbon'],
+ resume_from_checkpoint=None,
+ run_name=./data/ogt_protein_classifier/model,
+ save_on_each_node=False,
+ save_steps=6,
+ save_strategy=steps,
+ save_total_limit=None,
+ seed=42,
+ sharded_ddp=[],
+ skip_memory_metrics=True,
+ tf32=None,
+ torch_compile=False,
+ torch_compile_backend=None,
+ torch_compile_mode=None,
+ torchdynamo=None,
+ tpu_metrics_debug=False,
+ tpu_num_cores=None,
+ use_ipex=False,
+ use_legacy_prediction_loop=False,
+ use_mps_device=False,
+ warmup_ratio=0.0,
+ warmup_steps=0,
+ weight_decay=0.0,
+ xpu_backend=None,
+ )
+
+
+ See the [training repository](https://github.com/BeckResearchLab/learn2thermML) for code.
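
The usage text in this README asks for sequences to be prepared as for the original pretrained model. On the `Rostlab/prot_bert` model card that means uppercase residues separated by single spaces, with the rare amino acids U, Z, O and B mapped to X. Below is a minimal sketch of that preparation, plus decoding of the class index according to the line "1 indicates thermophilic, 0 mesophilic". The helper names `prepare_sequence` and `decode_label` are hypothetical and do not come from the repository.

```python
import re

# Class indices as stated in the README: 1 = thermophilic, 0 = mesophilic.
LABELS = {0: "mesophilic", 1: "thermophilic"}


def prepare_sequence(raw):
    """Format a raw amino acid string the way the prot_bert model card does."""
    seq = re.sub(r"\s+", "", raw).upper()   # drop whitespace, force uppercase
    seq = re.sub(r"[UZOB]", "X", seq)       # map rare residues to X
    return " ".join(seq)                    # one space between residues


def decode_label(logits):
    """Return the label for one sequence from its two class logits (argmax)."""
    index = max(range(len(logits)), key=lambda i: logits[i])
    return LABELS[index]


print(prepare_sequence("AETCZAO"))   # A E T C X A X
print(decode_label([-1.2, 0.7]))     # thermophilic
```

If `model` is a standard `transformers` sequence-classification model, its call returns an output object rather than a tensor, so the argmax in the usage snippet would typically be taken over the `.logits` attribute of that object.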
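
The batch-related values in the added `TrainingArguments` can be combined to estimate how much data each optimizer step and each evaluation covers. This is a rough sketch only. It assumes all of the "about 250k" sequences are used for training, since the README does not state the train/validation split, and the helper name `training_schedule` is hypothetical. The step counts are approximate, and the Trainer's own rounding may differ by a step.

```python
def training_schedule(n_sequences, per_device_batch, grad_accum, n_gpu,
                      eval_steps, epochs):
    """Estimate step counts from the batch settings (hypothetical helper)."""
    # Sequences consumed per optimizer update.
    effective_batch = per_device_batch * grad_accum * n_gpu
    # Approximate optimizer updates per pass over the data.
    steps_per_epoch = n_sequences / effective_batch
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
        # eval_steps and save_steps are counted in optimizer updates.
        "sequences_per_eval": effective_batch * eval_steps,
    }


# Values taken from the TrainingArguments dump; 250_000 is the assumed
# training-set size.
schedule = training_schedule(
    n_sequences=250_000,
    per_device_batch=32,   # per_device_train_batch_size
    grad_accum=25,         # gradient_accumulation_steps
    n_gpu=1,               # _n_gpu
    eval_steps=6,          # eval_steps (save_steps has the same value)
    epochs=2,              # num_train_epochs
)
print(schedule)
```

Under these assumptions each optimizer update sees 800 sequences, and the model is evaluated and checkpointed every 4,800 sequences, which gives `load_best_model_at_end=True` with `metric_for_best_model=loss` many candidate checkpoints to choose from.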