Protein Amino-Acid Fast Tokenizer

Fast Rust-backed tokenizer for protein sequences.

Features

  • 1 token = 1 amino acid: character-level tokenization, one ID per residue
  • Fast Rust backend: efficient processing via the Hugging Face Tokenizers library
  • Transformers-ready: loads through AutoTokenizer from the transformers library

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pszmk/protein-aa-fast-tokenizer")

# Single sequence
tokens = tokenizer("MKTLLILAVAVCSAA")
print(tokens)
# {'input_ids': [2, 16, 14, ...], 'attention_mask': [1, 1, ...]}

# Batch with padding (return_tensors="pt" requires PyTorch to be installed)
batch = tokenizer(
    ["MKTLLILAVAVCSAA", "ACDEFGHIK"],
    padding=True,
    return_tensors="pt",
)
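
To illustrate what padding=True produces, here is a minimal pure-Python sketch of padding a batch to its longest sequence. It assumes right-side padding (the usual default in transformers, not confirmed for this tokenizer) and uses the <PAD> ID from the vocabulary table; pad_batch is a hypothetical helper, not part of the library, and the residue IDs are placeholders.

```python
PAD_ID = 0  # <PAD>, per the vocabulary table


def pad_batch(sequences):
    """Right-pad lists of token IDs to the longest length and build attention masks.

    Hypothetical helper for illustration; right-side padding is an assumption.
    """
    max_len = max(len(ids) for ids in sequences)
    input_ids = [ids + [PAD_ID] * (max_len - len(ids)) for ids in sequences]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Placeholder IDs: 15 and 9 residues, each wrapped in <CLS> (2) and <EOS> (4).
long_seq = [2] + [6] * 15 + [4]
short_seq = [2] + [6] * 9 + [4]
batch = pad_batch([long_seq, short_seq])

print(len(batch["input_ids"][1]))       # 17: padded to the longest sequence
print(sum(batch["attention_mask"][1]))  # 11: only real tokens are attended to
```

The attention mask lets a model ignore the padded positions, which is why the two fields are returned together.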

Vocabulary

ID      Token    Description
0       <PAD>    Padding
1       <MASK>   Masked token
2       <CLS>    Classification / Start
3       <SEP>    Separator
4       <EOS>    End of sequence
5       <UNK>    Unknown
6-25    A-Y      Standard amino acids
26      X        Any amino acid
27      B        Asparagine or Aspartic acid
28      Z        Glutamine or Glutamic acid
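
The table can be sketched as a plain Python mapping. This assumes IDs 6-25 follow the alphabetical order of the one-letter codes (ACDEFGHIKLMNPQRSTVWY), which is consistent with the usage example where M and K map to 16 and 14 but is not stated explicitly. VOCAB and encode_residues are hypothetical names, and mapping unlisted characters to <UNK> is likewise an assumption.

```python
SPECIAL_TOKENS = ["<PAD>", "<MASK>", "<CLS>", "<SEP>", "<EOS>", "<UNK>"]  # IDs 0-5
STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"  # IDs 6-25; alphabetical order is an assumption
AMBIGUOUS = ["X", "B", "Z"]           # IDs 26-28

# Token -> ID mapping built in table order.
VOCAB = {
    token: idx
    for idx, token in enumerate(SPECIAL_TOKENS + list(STANDARD_AA) + AMBIGUOUS)
}


def encode_residues(sequence):
    """Map each residue to its ID, without special tokens.

    Falling back to <UNK> for unlisted characters is an assumption.
    """
    return [VOCAB.get(residue, VOCAB["<UNK>"]) for residue in sequence]


print(encode_residues("MK"))  # [16, 14]
print(len(VOCAB))             # 29
```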

Template Processing

  • Single sequence: <CLS> SEQUENCE <EOS>
  • Pair sequences: <CLS> SEQ_A <SEP> SEQ_B <EOS>
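
The two templates above can be sketched in plain Python as follows. The special-token IDs come from the vocabulary table; apply_template is a hypothetical helper that only illustrates the layout and is not the tokenizer's actual post-processor.

```python
CLS_ID, SEP_ID, EOS_ID = 2, 3, 4  # <CLS>, <SEP>, <EOS> per the vocabulary table


def apply_template(ids_a, ids_b=None):
    """Wrap already-encoded residue IDs in the single or pair template.

    Hypothetical helper for illustration only.
    """
    if ids_b is None:
        # Single sequence: <CLS> SEQUENCE <EOS>
        return [CLS_ID, *ids_a, EOS_ID]
    # Pair of sequences: <CLS> SEQ_A <SEP> SEQ_B <EOS>
    return [CLS_ID, *ids_a, SEP_ID, *ids_b, EOS_ID]


print(apply_template([16, 14]))          # [2, 16, 14, 4]
print(apply_template([16, 14], [6, 7]))  # [2, 16, 14, 3, 6, 7, 4]
```

Each encoded sequence is therefore two tokens longer than the raw residue count, and a pair adds three.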

Citation

Part of the LAMP (Latent Anti-Microbial Peptides) project.
