This directory contains code, scripts, and configuration files for the Gene Tokenizer, which is used to tokenize gene identifiers, gene expression data, and related metadata for downstream tasks.
## Files and Their Roles
### `vocab.json`

- An alternative vocabulary file in JSON format, mapping gene identifiers (e.g., `ENSG00000000003`) to their token IDs.
- This can be loaded by the `GeneTokenizer` if you prefer a JSON-based vocab mapping rather than a plain-text vocab.
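For illustration, a `vocab.json` of this shape can be read with the standard `json` module (the gene IDs and token IDs below are made up for the example):

```python
import json
import tempfile

# Hypothetical vocab.json contents: gene identifiers mapped to token IDs.
vocab_json = '{"<pad>": 0, "<unk>": 1, "ENSG00000000003": 2, "ENSG00000000005": 3}'

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(vocab_json)
    path = f.name

with open(path, encoding="utf-8") as f:
    vocab = json.load(f)

# Look up a gene's token ID, falling back to <unk> for unseen genes.
tok_id = vocab.get("ENSG00000000003", vocab["<unk>"])
```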
### `gene_tokenizer.py`

- The main implementation of the `GeneTokenizer` class, which inherits from Hugging Face's `PreTrainedTokenizer`.
- Key responsibilities:
  - Loading vocabularies (`vocab.txt`, `vocab.json`).
  - Tokenizing gene identifiers or text into token IDs (e.g., WordPiece tokenization).
  - Adding special tokens like `<cls>`, `<sep>`, `<pad>`, `<unk>`, and `<mask>`.
  - Saving/loading the tokenizer configuration.
- Important methods:
  - `load_vocab`: loads the vocabulary from a file.
  - `_tokenize`: performs the core tokenization of text/gene identifiers.
  - `build_inputs_with_special_tokens`: adds special tokens to sequences.
  - `save_vocabulary`: saves the tokenizer vocabulary files.
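The method layout can be sketched as a standalone class. Treat this as a mock, not the actual implementation: the real `GeneTokenizer` subclasses `transformers.PreTrainedTokenizer` (which supplies padding, truncation, and the encoding pipeline), and the whitespace `_tokenize` here stands in for the real tokenization logic.

```python
class GeneTokenizerSketch:
    """Mock of the GeneTokenizer method layout; the real class inherits
    from transformers.PreTrainedTokenizer."""

    def __init__(self, vocab):
        self.vocab = vocab  # token -> id, e.g. from load_vocab or vocab.json

    @staticmethod
    def load_vocab(vocab_file):
        # Assumed vocab.txt layout: one token per line, id = line number.
        with open(vocab_file, encoding="utf-8") as f:
            return {line.rstrip("\n"): i for i, line in enumerate(f)}

    def _tokenize(self, text):
        # Simplest possible split; real WordPiece would break unknown
        # words into subword pieces.
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        unk = self.vocab["<unk>"]
        return [self.vocab.get(t, unk) for t in tokens]

    def build_inputs_with_special_tokens(self, ids):
        # Wrap a single sequence as <cls> ... <sep>.
        return [self.vocab["<cls>"]] + ids + [self.vocab["<sep>"]]

# Usage with a toy vocab (IDs are arbitrary):
vocab = {"<pad>": 0, "<unk>": 1, "<cls>": 2, "<sep>": 3, "ENSG00000000003": 4}
tok = GeneTokenizerSketch(vocab)
ids = tok.convert_tokens_to_ids(tok._tokenize("ENSG00000000003 ENSG_UNSEEN"))
inputs = tok.build_inputs_with_special_tokens(ids)
```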
## How These Files Work Together
### Vocabulary Loading

- The `GeneTokenizer` can load its vocabulary from `gene_freq_tokenizer/vocab.txt` or `vocab.json`.
- During initialization, the tokenizer references `tokenizer_config.json` and other metadata files in `gene_freq_tokenizer/` (if you're using that pre-built tokenizer), or directly uses a JSON/text vocab file specified in your code.
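A sketch of the dual-format loading, assuming JSON vocab files store the token-to-ID map directly and plain-text files assign IDs by line order:

```python
import json
import os
import tempfile

def load_any_vocab(path):
    """Load a vocab from .json (token -> id map) or plain text
    (one token per line, id = line number). Both layouts are assumed."""
    if path.endswith(".json"):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}

# Demo with a throwaway vocab.txt.
tmpdir = tempfile.mkdtemp()
txt_path = os.path.join(tmpdir, "vocab.txt")
with open(txt_path, "w", encoding="utf-8") as f:
    f.write("<pad>\n<unk>\nENSG00000000003\n")

vocab = load_any_vocab(txt_path)
```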
### Special Tokens

- Special tokens are defined in the `tokenizer_config.json` and `special_tokens_map.json` files (both inside `gene_freq_tokenizer/`).
- These tokens (e.g., `<cls>`, `<pad>`, `<mask>`) are injected automatically during tokenization (via `build_inputs_with_special_tokens`).
### Annotation Tokens

- `annotation_tokens.txt` lists extra domain-specific tokens (e.g., `<cell>`, `<disease>`), which can be appended to the main vocabulary.
- If `bio_annotations` is set to `True` in your `TokenizationArgs`, you can incorporate these tokens into your final vocabulary or your input sequences.
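Appending annotation tokens can be sketched as extending the vocab at the next free IDs (the token names come from this directory's `annotation_tokens.txt`; the merge logic itself is an assumption):

```python
# Hypothetical contents of annotation_tokens.txt, one token per line.
annotation_tokens = ["<cell>", "<disease>"]

# Base vocab as it would come out of vocab.txt / vocab.json.
vocab = {"<pad>": 0, "<unk>": 1, "ENSG00000000003": 2}

# Append each annotation token at the next free ID, skipping duplicates.
for token in annotation_tokens:
    if token not in vocab:
        vocab[token] = len(vocab)
```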
### Tokenization Process

- You create an instance of `GeneTokenizer` (from `gene_tokenizer.py`) and pass it the relevant vocab/config.
- The `GeneTokenizer` then uses methods like `_tokenize` to split input strings (gene IDs or free text) into subtokens.
- If configured, it adds `<cls>` at the start or `<sep>` at the end, and uses `<pad>` to ensure uniform sequence lengths.
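The three steps above can be run end to end in miniature: split, map to IDs, wrap with special tokens, then right-pad a batch to a uniform length. Token names follow the special-token list in this README; the IDs and padding convention are assumptions for the example.

```python
# Toy vocab; IDs are made up.
vocab = {"<pad>": 0, "<unk>": 1, "<cls>": 2, "<sep>": 3,
         "ENSG00000000003": 4, "ENSG00000000005": 5}
unk, pad = vocab["<unk>"], vocab["<pad>"]

def encode(text):
    # Split into tokens, map to IDs, then wrap as <cls> ... <sep>.
    ids = [vocab.get(t, unk) for t in text.split()]
    return [vocab["<cls>"]] + ids + [vocab["<sep>"]]

batch = [encode("ENSG00000000003"), encode("ENSG00000000003 ENSG00000000005")]

# Right-pad every sequence with <pad> to the batch maximum.
max_len = max(len(seq) for seq in batch)
padded = [seq + [pad] * (max_len - len(seq)) for seq in batch]
```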
### Saving/Loading

- The tokenizer can be saved via `tokenizer.save_pretrained(<directory>)`, which writes out `vocab.txt`, `tokenizer_config.json`, etc. to the specified directory.
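A minimal sketch of the files this round trip produces, assuming `vocab.txt` is written one token per line in ID order (the real `save_pretrained` writes these and more from the tokenizer's own state, and the config values here are placeholders):

```python
import json
import os
import tempfile

vocab = {"<pad>": 0, "<unk>": 1, "ENSG00000000003": 2}
config = {"tokenizer_class": "GeneTokenizer"}  # illustrative, not the real config

def save_vocabulary(directory, vocab, config):
    """Sketch of what save_pretrained(<directory>) writes out."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "vocab.txt"), "w", encoding="utf-8") as f:
        for token in sorted(vocab, key=vocab.get):  # one token per line, in ID order
            f.write(token + "\n")
    with open(os.path.join(directory, "tokenizer_config.json"), "w",
              encoding="utf-8") as f:
        json.dump(config, f)

out_dir = os.path.join(tempfile.mkdtemp(), "gene_tok")
save_vocabulary(out_dir, vocab, config)

# Round-trip: reload the vocab by line order.
with open(os.path.join(out_dir, "vocab.txt"), encoding="utf-8") as f:
    reloaded = {line.rstrip("\n"): i for i, line in enumerate(f)}
```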