Ninjani committed (verified)
Commit 251f5cc · Parent(s): 1059e0e

Update README.md

Files changed (1): README.md (+91 −3)

README.md CHANGED
@@ -5,6 +5,94 @@ tags:
  - pytorch_model_hub_mixin
 ---

- This model has been pushed to the Hub using the [PyTorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- - Library: tea
- - Docs: [More Information Needed]
# The Embedded Alphabet (TEA)

![Model Architecture](Model_Architecture.png)

This repository contains the code accompanying our pre-print (link coming soon).

## Installation

```bash
pip install git+https://github.com/PickyBinders/tea.git
```

## Sequence Conversion with TEA

The `tea_convert` command reads protein sequences from a FASTA file and writes a new tea-FASTA file. It supports confidence-based output, in which low-confidence positions are written in lowercase, and has options for saving logits and per-residue entropy. If `--save_avg_entropy` is set, the FASTA identifiers will contain the average entropy of the sequence in the format `<key>|avg_entropy=<avg_entropy>`.

```bash
usage: tea_convert [-h] -f FASTA_FILE -o OUTPUT_FILE [-s] [-e] [-r] [-l] [-t ENTROPY_THRESHOLD]

options:
  -h, --help            show this help message and exit
  -f FASTA_FILE, --fasta_file FASTA_FILE
                        Input FASTA file containing protein amino acid sequences
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Output FASTA file for generated tea sequences
  -s, --save_logits     Save per-residue logits to .pt file
  -e, --save_avg_entropy
                        Save average entropy values in FASTA identifiers
  -r, --save_residue_entropy
                        Save per-residue entropy values to .pt file
  -l, --lowercase_entropy
                        Save residues with entropy > threshold in lowercase
  -t ENTROPY_THRESHOLD, --entropy_threshold ENTROPY_THRESHOLD
                        Entropy threshold for lowercase conversion
```
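
With `--save_avg_entropy`, each identifier carries its sequence's average entropy in the `<key>|avg_entropy=<avg_entropy>` format described above. A minimal sketch for splitting such identifiers back apart (the helper name is mine, not part of the tea package):

```python
def parse_tea_header(header: str):
    """Split a tea-FASTA identifier of the form `<key>|avg_entropy=<avg_entropy>`."""
    key, sep, value = header.partition("|avg_entropy=")
    # Identifiers written without --save_avg_entropy have no entropy suffix
    return (key, float(value)) if sep else (key, None)

print(parse_tea_header("seq1|avg_entropy=0.42"))  # ('seq1', 0.42)
```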

### Using the Hugging Face model

```python
import re

import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

from tea.model import Tea

device = "cuda" if torch.cuda.is_available() else "cpu"
tea = Tea.from_pretrained("PickyBinders/tea").to(device)

# ESM-2 provides the per-residue embeddings that TEA converts into tea sequences
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
bnb_config = BitsAndBytesConfig(load_in_4bit=True) if torch.cuda.is_available() else None
esm2 = AutoModel.from_pretrained(
    "facebook/esm2_t33_650M_UR50D",
    torch_dtype="auto",
    quantization_config=bnb_config,
    add_pooling_layer=False,
)
if bnb_config is None:
    # 4-bit quantized weights are already placed on the GPU; only move the full-precision model
    esm2 = esm2.to(device)
esm2.eval()

sequence_examples = ["PRTEINO", "SEQWENCE"]
# Map rare/ambiguous amino acids to X and separate residues with spaces
sequence_examples = [" ".join(re.sub(r"[UZOBJ]", "X", sequence)) for sequence in sequence_examples]
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids["input_ids"]).to(device)
attention_mask = torch.tensor(ids["attention_mask"]).to(device)

with torch.no_grad():
    x = esm2(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
results = tea.to_sequences(
    embeddings=x,
    input_ids=input_ids,
    return_avg_entropy=True,
    return_logits=False,
    return_residue_entropy=False,
)
print(results)
```
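
As with `tea_convert --lowercase_entropy`, low-confidence positions in a tea sequence are marked by lowercase letters. A small helper (my own, not part of the package) to quantify how much of a sequence is low-confidence:

```python
def low_confidence_fraction(tea_seq: str) -> float:
    """Fraction of alphabetic positions written in lowercase (i.e. low confidence)."""
    letters = [c for c in tea_seq if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.islower() for c in letters) / len(letters)

print(low_confidence_fraction("ABcdEF"))  # 2 of 6 residues are lowercase
```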

## Using tea sequences with MMseqs2

The `matcha.out` substitution matrix is included with the tea package. You can get its path programmatically:

```python
from tea import get_matrix_path

matcha_path = get_matrix_path()
print(f"Matrix path: {matcha_path}")
```

Then use it with MMseqs2:

```bash
mmseqs easy-search tea_query.fasta tea_target.fasta results.m8 tmp/ \
    --comp-bias-corr 0 \
    --mask 0 \
    --gap-open 18 \
    --gap-extend 3 \
    --sub-mat /path/to/matcha.out \
    --seed-sub-mat /path/to/matcha.out \
    --exact-kmer-matching 1
```
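
The matrix path and the search command above can also be stitched together in Python, e.g. to drive MMseqs2 via `subprocess`. This sketch only assembles and prints the command; the matrix path is a placeholder (in an installed environment it would come from `get_matrix_path()`):

```python
import shlex

# Placeholder; in practice: from tea import get_matrix_path; matcha_path = str(get_matrix_path())
matcha_path = "/path/to/matcha.out"

# Same arguments as the mmseqs easy-search invocation above
cmd = [
    "mmseqs", "easy-search",
    "tea_query.fasta", "tea_target.fasta", "results.m8", "tmp/",
    "--comp-bias-corr", "0",
    "--mask", "0",
    "--gap-open", "18",
    "--gap-extend", "3",
    "--sub-mat", matcha_path,
    "--seed-sub-mat", matcha_path,
    "--exact-kmer-matching", "1",
]
print(shlex.join(cmd))  # pass `cmd` to subprocess.run to actually execute the search
```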