This directory contains code, scripts, and configuration files for the **Gene Tokenizer**, which is used to tokenize gene identifiers, gene expression data, and related metadata for downstream tasks.

## Files and Their Roles

### `vocab.json`

- An alternative vocabulary file in JSON format, mapping gene identifiers (e.g., `ENSG00000000003`) to their token IDs.
- This can be loaded by the `GeneTokenizer` if you prefer a JSON-based vocab mapping rather than a plain-text vocab.
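The file's exact contents aren't shown here, but a flat token-to-ID map of this shape can be inspected with the standard `json` module. The tokens and IDs below are illustrative, not the repo's real values:

```python
import json
import tempfile

# Hypothetical vocab.json contents: gene identifiers plus special tokens,
# each mapped to an integer token ID. Real IDs will differ.
vocab = {
    "<pad>": 0,
    "<unk>": 1,
    "<cls>": 2,
    "<sep>": 3,
    "<mask>": 4,
    "ENSG00000000003": 5,
    "ENSG00000000005": 6,
}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(vocab, f)
    path = f.name

# Loading mirrors what a JSON-based vocab load would do.
with open(path) as f:
    loaded = json.load(f)

print(loaded["ENSG00000000003"])  # 5
```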
### `gene_tokenizer.py`

- The **main implementation** of the `GeneTokenizer` class, which inherits from Hugging Face's `PreTrainedTokenizer`.
- Key responsibilities:
  - **Loading vocabularies** (`vocab.txt`, `vocab.json`).
  - **Tokenizing** gene identifiers or text into token IDs (e.g., WordPiece tokenization).
  - **Adding special tokens** like `<cls>`, `<sep>`, `<pad>`, `<unk>`, and `<mask>`.
  - **Saving/loading** the tokenizer configuration.
- Important methods:
  - `load_vocab`: Loads the vocabulary from a file.
  - `_tokenize`: Performs the core tokenization of text/gene identifiers.
  - `build_inputs_with_special_tokens`: Adds special tokens to sequences.
  - `save_vocabulary`: Saves the tokenizer vocabulary files.
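As a rough, self-contained sketch of how those methods fit together (a simplified stand-in, not the real class — the actual implementation inherits from `PreTrainedTokenizer`, and its tokens, IDs, and behavior will differ):

```python
class MiniGeneTokenizer:
    """Illustrative sketch of the GeneTokenizer's core methods.

    Names mirror the real class, but the logic here is a simplified
    assumption for demonstration purposes only.
    """

    def __init__(self, vocab):
        self.vocab = dict(vocab)
        self.unk_token = "<unk>"

    @classmethod
    def load_vocab(cls, path):
        # Plain-text vocab: one token per line, line number = token ID.
        with open(path) as f:
            vocab = {line.strip(): i for i, line in enumerate(f) if line.strip()}
        return cls(vocab)

    def _tokenize(self, text):
        # Core tokenization: here, whitespace-split gene identifiers.
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        unk = self.vocab[self.unk_token]
        return [self.vocab.get(t, unk) for t in tokens]

    def build_inputs_with_special_tokens(self, ids):
        # <cls> at the start, <sep> at the end.
        return [self.vocab["<cls>"]] + ids + [self.vocab["<sep>"]]


tok = MiniGeneTokenizer({"<pad>": 0, "<unk>": 1, "<cls>": 2, "<sep>": 3,
                         "ENSG00000000003": 4})
ids = tok.convert_tokens_to_ids(tok._tokenize("ENSG00000000003 ENSG00000000005"))
print(tok.build_inputs_with_special_tokens(ids))  # [2, 4, 1, 3]
```

Note how the unknown gene `ENSG00000000005` falls back to the `<unk>` ID, the same role `<unk>` plays in the real tokenizer.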
---

## How These Files Work Together

1. **Vocabulary Loading**
   - The `GeneTokenizer` can load its vocabulary from `gene_freq_tokenizer/vocab.txt` or `vocab.json`.
   - During initialization, the tokenizer references `tokenizer_config.json` and the other metadata files in `gene_freq_tokenizer/` (if you're using that pre-built tokenizer), or directly uses a JSON/text vocab file specified in your code.
2. **Special Tokens**
   - Special tokens are defined in the `tokenizer_config.json` and `special_tokens_map.json` files (both inside `gene_freq_tokenizer/`).
   - These tokens (e.g., `<cls>`, `<pad>`, `<mask>`) are injected automatically during tokenization (via `build_inputs_with_special_tokens`).
3. **Annotation Tokens**
   - `annotation_tokens.txt` lists extra domain-specific tokens (e.g., `<cell>`, `<disease>`) that can be appended to the main vocabulary.
   - If `bio_annotations` is set to `True` in your `TokenizationArgs`, you can incorporate these tokens into the final vocabulary and use them in your input sequences.
4. **Tokenization Process**
   - You create an instance of `GeneTokenizer` (from `gene_tokenizer.py`) and pass it the relevant vocab/config.
   - The `GeneTokenizer` then uses methods like `_tokenize` to split input strings (gene IDs or free text) into subtokens.
   - If configured, it adds `<cls>` at the start and `<sep>` at the end, and uses `<pad>` to ensure uniform sequence lengths.
5. **Saving/Loading**
   - The tokenizer can be saved via `tokenizer.save_pretrained(<directory>)`, which writes `vocab.txt`, `tokenizer_config.json`, etc. to the specified directory.
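The steps above can be sketched end to end in plain Python. Everything here — file formats, token names, IDs, and the padding length — is an illustrative assumption, not the repo's real values:

```python
import os
import tempfile

# Step 1: a plain-text-style vocab (one token per line, ID = line number).
base_vocab = ["<pad>", "<unk>", "<cls>", "<sep>", "<mask>",
              "ENSG00000000003", "ENSG00000000005"]

# Step 3: hypothetical annotation tokens, as listed in annotation_tokens.txt,
# appended to the main vocabulary.
annotation_tokens = ["<cell>", "<disease>"]

vocab = {tok: i for i, tok in enumerate(base_vocab + annotation_tokens)}


def encode(text, max_len=6):
    # Step 4: split, map to IDs, inject special tokens, pad to uniform length.
    tokens = text.split()
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]
    ids = [vocab["<cls>"]] + ids + [vocab["<sep>"]]   # special tokens
    ids += [vocab["<pad>"]] * (max_len - len(ids))    # pad to max_len
    return ids[:max_len]


ids = encode("ENSG00000000003 <cell>")
print(ids)  # [2, 5, 7, 3, 0, 0]

# Step 5: write the vocabulary back out, one token per line, in ID order —
# the shape a save_vocabulary/save_pretrained call would produce.
out_dir = tempfile.mkdtemp()
with open(os.path.join(out_dir, "vocab.txt"), "w") as f:
    f.write("\n".join(sorted(vocab, key=vocab.get)))
```

In the real workflow, `tokenizer.save_pretrained(<directory>)` additionally writes `tokenizer_config.json` and `special_tokens_map.json`, so the saved directory can be reloaded as a pre-built tokenizer.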