This directory contains code, scripts, and configuration files for the **Gene Tokenizer**, which tokenizes gene identifiers, gene expression data, and related metadata for downstream tasks.

## Files and Their Roles

### `vocab.json`

- An alternative vocabulary file in JSON format, mapping gene identifiers (e.g., `ENSG00000000003`) to their token IDs.
- This can be loaded by the `GeneTokenizer` if you prefer a JSON-based vocab mapping rather than a plain-text vocab.

### `gene_tokenizer.py`

- The **main implementation** of the `GeneTokenizer` class, which inherits from Hugging Face's `PreTrainedTokenizer`.
- Key responsibilities:
  - **Loading vocabularies** (`vocab.txt`, `vocab.json`).
  - **Tokenizing** gene identifiers or text into token IDs (e.g., WordPiece tokenization).
  - **Adding special tokens** such as `[CLS]`, `[SEP]`, and `[PAD]`.
  - **Saving/loading** the tokenizer configuration.
- Important methods:
  - `load_vocab`: Loads the vocabulary from a file.
  - `_tokenize`: Performs the core tokenization of text/gene identifiers.
  - `build_inputs_with_special_tokens`: Adds special tokens to sequences.
  - `save_vocabulary`: Saves the tokenizer vocabulary files.

---

## How These Files Work Together

1. **Vocabulary Loading**
   - The `GeneTokenizer` can load its vocabulary from `gene_freq_tokenizer/vocab.txt` or `vocab.json`.
   - During initialization, the tokenizer references `tokenizer_config.json` and other metadata files in `gene_freq_tokenizer/` (if you're using that pre-built tokenizer) or directly uses a JSON/text vocab file specified in your code.
2. **Special Tokens**
   - Special tokens are defined in the `tokenizer_config.json` and `special_tokens_map.json` files (both inside `gene_freq_tokenizer/`).
   - These tokens (e.g., `[CLS]`, `[SEP]`, `[PAD]`) are injected automatically during tokenization (via `build_inputs_with_special_tokens`).
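The vocabulary lookup and special-token injection described above can be sketched as follows. This is a minimal illustration, not the actual `GeneTokenizer` implementation: the vocabulary entries, token IDs, and helper function names here are invented for the example (only `build_inputs_with_special_tokens` mirrors a method named in this README).

```python
# Illustrative vocab: gene identifiers and special tokens mapped to IDs.
# The real IDs come from vocab.txt / vocab.json; these are made up.
vocab = {
    "[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3,
    "ENSG00000000003": 4,
    "ENSG00000000005": 5,
}

def convert_tokens_to_ids(tokens):
    # Unknown gene identifiers fall back to the [UNK] token ID.
    return [vocab.get(t, vocab["[UNK]"]) for t in tokens]

def build_inputs_with_special_tokens(token_ids):
    # Mirrors the role of GeneTokenizer.build_inputs_with_special_tokens:
    # prepend [CLS] and append [SEP] to a single sequence.
    return [vocab["[CLS]"]] + token_ids + [vocab["[SEP]"]]

ids = convert_tokens_to_ids(["ENSG00000000003", "ENSG00000000005"])
print(build_inputs_with_special_tokens(ids))  # [1, 4, 5, 2]
```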
3. **Annotation Tokens**
   - `annotation_tokens.txt` lists extra domain-specific tokens, which can be appended to the main vocabulary.
   - If `bio_annotations` is set to `True` in your `TokenizationArgs`, you can incorporate these tokens into your final vocabulary or your input sequences.
4. **Tokenization Process**
   - You create an instance of `GeneTokenizer` (from `gene_tokenizer.py`) and pass it the relevant vocab/config.
   - The `GeneTokenizer` then uses methods like `_tokenize` to split input strings (gene IDs or free text) into subtokens.
   - If configured, it adds `[CLS]` at the start or `[SEP]` at the end, and uses `[PAD]` to ensure uniform sequence lengths.
5. **Saving/Loading**
   - The tokenizer can be saved via `tokenizer.save_pretrained()`, which writes `vocab.txt`, `tokenizer_config.json`, etc. to a specified directory.
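The padding step in the tokenization process above can be sketched as a small standalone helper. This is an assumption-laden illustration, not code from `gene_tokenizer.py`: the `pad_sequence` name and the `[PAD]` ID of `0` are hypothetical, and in practice the inherited `PreTrainedTokenizer` machinery handles padding and truncation for you.

```python
# Hypothetical helper showing how [PAD] produces uniform sequence lengths.
PAD_ID = 0  # illustrative ID for the [PAD] token

def pad_sequence(ids, max_length):
    # Right-pad with the [PAD] token ID, truncating if the input is longer
    # than max_length.
    return (ids + [PAD_ID] * max_length)[:max_length]

print(pad_sequence([1, 4, 5, 2], 6))  # [1, 4, 5, 2, 0, 0]
```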