datasets:
- adrianhenkel/tokenized-total-512-reduced

This is the tokenizer used in the lucid prots project. The uppercase letters are one-letter amino acid codes. The lowercase letters represent the 3Di state of a residue, as introduced in the [Foldseek](https://www.nature.com/articles/s41587-023-01773-0) paper.

| Token ID | Token |
|----------|--------|
| 0 | [PAD] |
| 1 | [UNK] |
| 2 | [CLS] |
| 3 | [SEP] |
| 4 | [MASK] |
| 5 | L |
| 6 | A |
| 7 | G |
| 8 | V |
| 9 | E |
| 10 | S |
| 11 | I |
| 12 | K |
| 13 | R |
| 14 | D |
| 15 | T |
| 16 | P |
| 17 | N |
| 18 | Q |
| 19 | F |
| 20 | Y |
| 21 | M |
| 22 | H |
| 23 | C |
| 24 | W |
| 25 | X |
| 26 | U |
| 27 | B |
| 28 | Z |
| 29 | O |
| 30 | a |
| 31 | c |
| 32 | d |
| 33 | e |
| 34 | f |
| 35 | g |
| 36 | h |
| 37 | i |
| 38 | k |
| 39 | l |
| 40 | m |
| 41 | n |
| 42 | p |
| 43 | q |
| 44 | r |
| 45 | s |
| 46 | t |
| 47 | v |
| 48 | w |
| 49 | y |
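
As a rough illustration of how the vocabulary above maps characters to IDs, the following sketch rebuilds the table as a Python dictionary and encodes a sequence one character at a time. It is not the project's tokenizer implementation. The names `VOCAB`, `encode`, and `decode` are hypothetical, and the per-character splitting and the fallback to `[UNK]` for unlisted characters are assumptions. The sketch adds no `[CLS]` or `[SEP]` tokens, since the source does not say how they are placed.

```python
from typing import List

# Token order copied from the table above; list position = token ID.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]  # IDs 0-4
AMINO_ACIDS = list("LAGVESIKRDTPNQFYMHCWXUBZO")                  # IDs 5-29
THREE_DI_STATES = list("acdefghiklmnpqrstvwy")                   # IDs 30-49

ID_TO_TOKEN = SPECIAL_TOKENS + AMINO_ACIDS + THREE_DI_STATES
VOCAB = {token: idx for idx, token in enumerate(ID_TO_TOKEN)}


def encode(sequence: str) -> List[int]:
    """Map each character to its ID (assumption: one token per character).

    Characters that are not in the table fall back to [UNK].
    """
    unk_id = VOCAB["[UNK]"]
    return [VOCAB.get(ch, unk_id) for ch in sequence]


def decode(ids: List[int]) -> str:
    """Inverse lookup: concatenate the tokens for the given IDs."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


if __name__ == "__main__":
    print(encode("MKV"))   # amino acid letters
    print(encode("dvv"))   # 3Di state letters
    print(decode(encode("MKVdvv")))
```

Because upper and lower case map to different IDs, amino acid and 3Di sequences can share one vocabulary without colliding, for example `A` (6) versus `a` (30).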