Protein Amino-Acid Fast Tokenizer

Fast Rust-backed tokenizer for protein sequences.

Features

1 token = 1 amino acid: character-level tokenization, so each residue maps to exactly one ID
Fast Rust backend: sequences are processed by the HuggingFace Tokenizers library
Transformers-ready: loads through AutoTokenizer.from_pretrained

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pszmk/protein-aa-fast-tokenizer")

# Single sequence
tokens = tokenizer("MKTLLILAVAVCSAA")
print(tokens)
# {'input_ids': [2, 16, 14, ...], 'attention_mask': [1, 1, ...]}

# Batch with padding
batch = tokenizer(
    ["MKTLLILAVAVCSAA", "ACDEFGHIK"],
    padding=True,
    return_tensors="pt",
)
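
To illustrate what padding=True produces, here is a minimal pure-Python sketch. It assumes right-side padding (the usual Transformers default; the published tokenizer's configuration may differ) and uses a hypothetical helper, pad_batch, that is not part of the tokenizer's API. The pad ID 0 is the `<PAD>` entry from the vocabulary table.

```python
from typing import Dict, List

PAD_ID = 0  # <PAD> in the vocabulary table


def pad_batch(batch: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Right-pad every ID list to the longest one and build the matching attention mask.

    Hypothetical helper for illustration; assumes right-side padding.
    """
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([[2, 16, 14, 4], [2, 6, 4]])
print(padded["input_ids"])       # [[2, 16, 14, 4], [2, 6, 4, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

With return_tensors="pt", the same rectangular lists would be returned as PyTorch tensors.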

Vocabulary

ID	Token	Description
0	`<PAD>`	Padding
1	`<MASK>`	Masked token
2	`<CLS>`	Classification / Start
3	`<SEP>`	Separator
4	`<EOS>`	End of sequence
5	`<UNK>`	Unknown
6-25	A-Y	The 20 standard amino acids (one-letter codes from A to Y, skipping B, J, O, U, and X)
26	X	Any amino acid
27	B	Asparagine or Aspartic acid
28	Z	Glutamine or Glutamic acid
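
The table can be rebuilt as a plain Python mapping for quick inspection. This is a sketch, not the published vocabulary file: it assumes IDs 6-25 follow alphabetical order of the one-letter codes, which matches the sample output in Usage (M = 16, K = 14) but is not stated explicitly. The names VOCAB and encode_residues are hypothetical.

```python
from typing import List

# Hypothetical reconstruction of the vocabulary table; not loaded from the tokenizer itself.
SPECIAL_TOKENS = ["<PAD>", "<MASK>", "<CLS>", "<SEP>", "<EOS>", "<UNK>"]  # IDs 0-5
STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"  # IDs 6-25, alphabetical order is an assumption
AMBIGUOUS_AA = "XBZ"  # IDs 26-28, in table order

VOCAB = {
    token: idx
    for idx, token in enumerate(SPECIAL_TOKENS + list(STANDARD_AA) + list(AMBIGUOUS_AA))
}


def encode_residues(sequence: str) -> List[int]:
    """Map each residue to its ID, one token per character; unlisted letters become <UNK>."""
    unk_id = VOCAB["<UNK>"]
    return [VOCAB.get(residue, unk_id) for residue in sequence]


print(encode_residues("MKTL"))  # [16, 14, 22, 15] under the assumed ordering
```

Note that this helper encodes residues only; it does not add the `<CLS>` and `<EOS>` tokens that the real tokenizer inserts.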

Template Processing

Single sequence: <CLS> SEQUENCE <EOS>
Sequence pair: <CLS> SEQ_A <SEP> SEQ_B <EOS>
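
The two templates above can be sketched in plain Python at the token level. The function name apply_template is hypothetical and only mirrors the templates; the real tokenizer applies them internally through its post-processor.

```python
from typing import List, Optional


def apply_template(seq_a: str, seq_b: Optional[str] = None) -> List[str]:
    """Wrap residue tokens with the special tokens from the templates above.

    Hypothetical helper for illustration; one token per residue character.
    """
    tokens = ["<CLS>", *seq_a]
    if seq_b is not None:
        # Pair template: the separator goes between the two sequences.
        tokens += ["<SEP>", *seq_b]
    tokens.append("<EOS>")
    return tokens


print(apply_template("MKT"))       # ['<CLS>', 'M', 'K', 'T', '<EOS>']
print(apply_template("MK", "AC"))  # ['<CLS>', 'M', 'K', '<SEP>', 'A', 'C', '<EOS>']
```

A single sequence of n residues therefore yields n + 2 tokens, and a pair of lengths n and m yields n + m + 3.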

Citation

Part of the LAMP (Latent Anti-Microbial Peptides) project.
