Update README.md
README.md CHANGED
In the example below, ProtGPT2 generates sequences that follow the amino acid 'M':
```
>>> from transformers import pipeline
>>> protgpt2 = pipeline('text-generation', model="nferruz/ProtGPT2")
# length is expressed in tokens, where each token has an average length of 4 amino acids.
>>> sequences = protgpt2("<|endoftext|>", max_length=100, do_sample=True, top_k=950, repetition_penalty=1.2, num_return_sequences=10, eos_token_id=0)
>>> for seq in sequences:
        print(seq)
{'generated_text': 'MINDLLDISRIISGKMTLDRAEVNLTAIARQVVEEQRQAAEAKSIQLLCSTPDTNHYVFG\nDFDRLKQTLWNLLSNAVKFTPSGGTVELELGYNAEGMEVYVKDSGIGIDPAFLPYVFDRF\nRQSDAADSRNYGGLGLGLAIVKHLLDLHEGNVSAQSEGFGKGATFTVLLPLKPLKRELAA\nVNRHTAVQQSAPLNDNLAGMKILIVEDRPDTNEMVSYILEEAGAIVETAESGAAALTSLK\nSYSPDLVLSDIGMPMMDGYEMIEYIREWKTTKGG'}
```
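Since each token averages about four amino acids, `max_length=100` yields sequences of roughly 400 residues. Note also that the generated text contains newline characters every 60 residues, FASTA-style. As a minimal sketch (the cleanup below is our own suggestion, not part of the pipeline API), one might flatten each sequence before downstream use:

```
# Flatten the FASTA-style line breaks in each generated sequence.
# `sequences` is the list of dicts returned by the pipeline call above.
clean = [s['generated_text'].replace('\n', '') for s in sequences]
print(clean[0])  # one continuous amino acid string
```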
The HuggingFace script run_clm.py can be found here: https://github.com/huggingf…

### **How to select the best sequences**
We've observed that perplexity values correlate with AlphaFold2's pLDDT. We recommend computing the perplexity of each sequence as follows:

```
import math
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

def calculatePerplexity(sequence, model, tokenizer):
    # Encode the sequence and use it as its own labels so the model
    # returns the causal language-modeling loss.
    input_ids = tokenizer.encode(sequence, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
    loss, logits = outputs[:2]
    return math.exp(loss)

# Generate sequences by loading model and tokenizer (previously downloaded)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = AutoTokenizer.from_pretrained('/path/to/tokenizer') # replace with the actual path
model = GPT2LMHeadModel.from_pretrained('/path/to/output').to(device)
input_ids = tokenizer.encode("<|endoftext|>", return_tensors='pt').to(device)
outputs = model.generate(input_ids, max_length=400, do_sample=True, top_k=950, repetition_penalty=1.2, num_return_sequences=10, eos_token_id=0)

# Take (for example) the first sequence and decode it back to text
sequence = tokenizer.decode(outputs[0])
ppl = calculatePerplexity(sequence, model, tokenizer)
```
Here, `ppl` is the perplexity value for that sequence. We do not yet have a threshold for what perplexity value gives a 'good' or 'bad' sequence, but given the fast inference times, the best approach is to sample many sequences, rank them by perplexity, and select those with the lowest values (the lower the better).
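As a sketch of that selection step (assuming `model`, `tokenizer`, `outputs`, and `calculatePerplexity` from the block above; the ranking itself is our own illustration):

```
# Rank the generated sequences by perplexity and keep the lowest-scoring ones.
candidates = [tokenizer.decode(out) for out in outputs]
ranked = sorted(candidates, key=lambda s: calculatePerplexity(s, model, tokenizer))
best = ranked[:5]  # the 5 lowest-perplexity sequences (lower is better)
```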