Gene Tokenizer

This directory contains code, scripts, and configuration files for the Gene Tokenizer, which is used to tokenize gene identifiers, gene expression data, and related metadata for downstream tasks.

Files and Their Roles

vocab.json

  • An alternative to the plain-text vocab.txt, in JSON format, mapping gene identifiers (e.g., ENSG00000000003) to their token IDs.
  • This can be loaded by the GeneTokenizer if you prefer JSON-based vocab mapping rather than a plain text vocab.
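A minimal sketch of what such a JSON vocab might look like and how it can be parsed (the identifiers and IDs below are illustrative, not the actual vocabulary):

```python
import json

# Illustrative vocab: special tokens plus Ensembl gene IDs mapped to token IDs.
vocab_json = """
{
  "<pad>": 0,
  "<unk>": 1,
  "ENSG00000000003": 2,
  "ENSG00000000005": 3
}
"""

vocab = json.loads(vocab_json)
print(vocab["ENSG00000000003"])  # 2
```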

gene_tokenizer.py

  • The main implementation of the GeneTokenizer class, which inherits from Hugging Face’s PreTrainedTokenizer.
  • Key responsibilities:
    • Loading vocabularies (vocab.txt, vocab.json).
    • Tokenizing gene identifiers or text into token IDs (e.g., WordPiece tokenization).
    • Adding special tokens like <cls>, <sep>, <pad>, <unk>, and <mask>.
    • Saving/loading the tokenizer configuration.
  • Important methods:
    • load_vocab: Loads the vocabulary from a file.
    • _tokenize: Performs the core tokenization of text/gene identifiers.
    • build_inputs_with_special_tokens: Adds special tokens to sequences.
    • save_vocabulary: Saves the tokenizer vocabulary files.
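The shape of those methods can be sketched with a simplified, self-contained stand-in (the real GeneTokenizer inherits from Hugging Face's PreTrainedTokenizer; the class, vocabulary, and whitespace-splitting behavior below are illustrative assumptions, not the actual implementation):

```python
from collections import OrderedDict


class MiniGeneTokenizer:
    """Simplified stand-in for GeneTokenizer (hypothetical sketch)."""

    def __init__(self, vocab, unk_token="<unk>", cls_token="<cls>", sep_token="<sep>"):
        self.vocab = vocab
        self.unk_token = unk_token
        self.cls_token = cls_token
        self.sep_token = sep_token

    @staticmethod
    def load_vocab(path):
        # Mirrors load_vocab: one token per line, line number = token ID.
        vocab = OrderedDict()
        with open(path, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                vocab[line.rstrip("\n")] = idx
        return vocab

    def _tokenize(self, text):
        # Core split: whitespace-separated gene IDs (simplified; the real
        # tokenizer may also apply WordPiece to free text).
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(t, self.vocab[self.unk_token]) for t in tokens]

    def build_inputs_with_special_tokens(self, ids):
        # Wrap a single sequence as <cls> ... <sep>.
        return [self.vocab[self.cls_token]] + ids + [self.vocab[self.sep_token]]


vocab = {"<cls>": 0, "<sep>": 1, "<unk>": 2,
         "ENSG00000000003": 3, "ENSG00000000005": 4}
tok = MiniGeneTokenizer(vocab)
tokens = tok._tokenize("ENSG00000000003 ENSG00000000005")
ids = tok.build_inputs_with_special_tokens(tok.convert_tokens_to_ids(tokens))
print(ids)  # [0, 3, 4, 1]
```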

How These Files Work Together

  1. Vocabulary Loading

    • The GeneTokenizer can load its vocabulary from gene_freq_tokenizer/vocab.txt or vocab.json.
    • During initialization, the tokenizer references the tokenizer_config.json and other metadata files in gene_freq_tokenizer/ (if you’re using that pre-built tokenizer) or directly uses a JSON/text vocab file specified in your code.
  2. Special Tokens

    • Special tokens are defined in the tokenizer_config.json and special_tokens_map.json files (both inside gene_freq_tokenizer/).
    • These tokens (e.g., <cls>, <pad>, <mask>) are injected automatically during the tokenization process (via build_inputs_with_special_tokens).
  3. Annotation Tokens

    • annotation_tokens.txt lists extra domain-specific tokens (e.g., <cell>, <disease>), which can be appended to the main vocabulary.
    • If bio_annotations is set to True in your TokenizationArgs, you can incorporate these tokens into your final vocabulary or your input sequences.
  4. Tokenization Process

    • You create an instance of GeneTokenizer (from gene_tokenizer.py) and pass it the relevant vocab/config.
    • The GeneTokenizer then uses methods like _tokenize to split input strings (gene IDs or free text) into subtokens.
    • If configured, it adds <cls> at the start and <sep> at the end, and pads with <pad> to ensure uniform sequence lengths.
  5. Saving/Loading

    • The tokenizer can be saved via tokenizer.save_pretrained(<directory>), which writes out vocab.txt, tokenizer_config.json, and related files to the specified directory, and later reloaded via GeneTokenizer.from_pretrained(<directory>).
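The padding behavior in step 4 can be sketched with a small helper (a simplified stand-in; in practice the tokenizer pads internally when you request fixed-length batches, and the pad ID of 0 is an assumption):

```python
def pad_batch(sequences, pad_id):
    """Right-pad each ID sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Two tokenized sequences of different lengths; 0 is the assumed <pad> ID.
batch = pad_batch([[5, 7, 9], [5, 7]], pad_id=0)
print(batch)  # [[5, 7, 9], [5, 7, 0]]
```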