This directory contains code, scripts, and configuration files for the **Gene Tokenizer**, which is used to tokenize gene identifiers, gene expression data, and related metadata for downstream tasks.
## Files and Their Roles
### `vocab.json`
- An alternative vocabulary file in JSON format, mapping gene identifiers (e.g., `ENSG00000000003`) to their token IDs.
- This can be loaded by the `GeneTokenizer` if you prefer JSON-based vocab mapping rather than a plain text vocab.
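As a minimal sketch of what the JSON vocabulary looks like and how it can be loaded (the token entries below are illustrative; the real `vocab.json` maps thousands of gene identifiers):

```python
import json

# Illustrative fragment of a JSON vocab: gene identifier -> token ID.
vocab_json = """
{
  "<pad>": 0,
  "<unk>": 1,
  "ENSG00000000003": 5
}
"""

vocab = json.loads(vocab_json)                     # token -> ID
ids_to_tokens = {i: t for t, i in vocab.items()}   # reverse lookup for decoding

print(vocab["ENSG00000000003"])  # -> 5
print(ids_to_tokens[0])          # -> <pad>
```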
### `gene_tokenizer.py`
- The **main implementation** of the `GeneTokenizer` class, which inherits from Hugging Face’s `PreTrainedTokenizer`.
- Key responsibilities:
- **Loading vocabularies** (`vocab.txt`, `vocab.json`).
  - **Tokenizing** gene identifiers or free text into tokens and converting them to token IDs (e.g., via WordPiece).
- **Adding special tokens** like `<cls>`, `<sep>`, `<pad>`, `<unk>`, and `<mask>`.
- **Saving/loading** the tokenizer configuration.
- Important methods:
- `load_vocab`: Loads the vocabulary from a file.
- `_tokenize`: Performs the core tokenization of text/gene identifiers.
- `build_inputs_with_special_tokens`: Adds special tokens to sequences.
- `save_vocabulary`: Saves the tokenizer vocabulary files.
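The methods above can be sketched with a dependency-free, simplified stand-in (the real class inherits from `PreTrainedTokenizer`, and its actual signatures may differ):

```python
import json
from pathlib import Path

class SimpleGeneTokenizer:
    """Simplified illustration of the GeneTokenizer method layout.
    The real implementation subclasses Hugging Face's PreTrainedTokenizer."""

    def __init__(self, vocab):
        self.vocab = dict(vocab)
        self.unk_token = "<unk>"

    @staticmethod
    def load_vocab(path):
        # JSON vocab: {"token": id, ...}; a vocab.txt would be one token per line.
        return json.loads(Path(path).read_text())

    def _tokenize(self, text):
        # Gene IDs are whitespace-separated, so a plain split suffices here;
        # the real tokenizer may apply WordPiece to free text.
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        unk_id = self.vocab[self.unk_token]
        return [self.vocab.get(t, unk_id) for t in tokens]

    def build_inputs_with_special_tokens(self, ids):
        # Wrap the sequence in <cls> ... <sep>.
        return [self.vocab["<cls>"]] + ids + [self.vocab["<sep>"]]

    def save_vocabulary(self, directory):
        out = Path(directory) / "vocab.json"
        out.write_text(json.dumps(self.vocab))
        return str(out)

vocab = {"<pad>": 0, "<unk>": 1, "<cls>": 2, "<sep>": 3, "<mask>": 4,
         "ENSG00000000003": 5, "ENSG00000000005": 6}
tok = SimpleGeneTokenizer(vocab)
ids = tok.convert_tokens_to_ids(tok._tokenize("ENSG00000000003 ENSG00000000005"))
print(tok.build_inputs_with_special_tokens(ids))  # -> [2, 5, 6, 3]
```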
---
## How These Files Work Together
1. **Vocabulary Loading**
- The `GeneTokenizer` can load its vocabulary from `gene_freq_tokenizer/vocab.txt` or `vocab.json`.
   - During initialization, the tokenizer reads `tokenizer_config.json` and the other metadata files in `gene_freq_tokenizer/` (if you are using that pre-built tokenizer), or it uses a JSON/text vocab file specified directly in your code.
2. **Special Tokens**
- Special tokens are defined in the `tokenizer_config.json` and `special_tokens_map.json` files (both inside `gene_freq_tokenizer/`).
- These tokens (e.g., `<cls>`, `<pad>`, `<mask>`) are injected automatically during the tokenization process (via `build_inputs_with_special_tokens`).
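For reference, a `special_tokens_map.json` for these tokens might look roughly like the following (illustrative; the actual file may store each entry as an object with extra fields such as `lstrip`/`rstrip`):

```json
{
  "cls_token": "<cls>",
  "sep_token": "<sep>",
  "pad_token": "<pad>",
  "unk_token": "<unk>",
  "mask_token": "<mask>"
}
```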
3. **Annotation Tokens**
- `annotation_tokens.txt` lists extra domain-specific tokens (e.g., `<cell>`, `<disease>`), which can be appended to the main vocabulary.
- If `bio_annotations` is set to `True` in your `TokenizationArgs`, you can incorporate these tokens into your final vocabulary or your input sequences.
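Appending annotation tokens to an existing vocabulary can be sketched like this (the token list and vocab below are illustrative stand-ins for `annotation_tokens.txt` and the real vocab files):

```python
# Illustrative annotation tokens, as listed one per line in annotation_tokens.txt.
annotation_tokens = ["<cell>", "<disease>"]

vocab = {"<pad>": 0, "<unk>": 1, "ENSG00000000003": 2}

# Assign each new token the next free ID, mirroring how extra tokens
# are typically appended to the end of an existing vocabulary.
for token in annotation_tokens:
    if token not in vocab:
        vocab[token] = len(vocab)

print(vocab["<cell>"])     # -> 3
print(vocab["<disease>"])  # -> 4
```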
4. **Tokenization Process**
- You create an instance of `GeneTokenizer` (from `gene_tokenizer.py`) and pass it the relevant vocab/config.
- The `GeneTokenizer` then uses methods like `_tokenize` to split input strings (gene IDs or free text) into subtokens.
   - If configured, it adds `<cls>` at the start or `<sep>` at the end, and uses `<pad>` to ensure uniform sequence lengths.
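The padding step above can be illustrated with a small helper (a sketch, not the tokenizer's actual padding code; Hugging Face tokenizers normally handle this via `padding=True`):

```python
def pad_sequences(batch, pad_id, max_len=None):
    # Pad every sequence in the batch to a common length with pad_id.
    max_len = max_len or max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

batch = [[2, 5, 3], [2, 5, 6, 7, 3]]
print(pad_sequences(batch, pad_id=0))  # -> [[2, 5, 3, 0, 0], [2, 5, 6, 7, 3]]
```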
5. **Saving/Loading**
   - The tokenizer can be saved via `tokenizer.save_pretrained(<directory>)`, which writes `vocab.txt`, `tokenizer_config.json`, and related files to that directory; since `GeneTokenizer` inherits from `PreTrainedTokenizer`, it can then be reloaded with `GeneTokenizer.from_pretrained(<directory>)`.
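The round trip of the vocabulary portion of that save/load cycle can be mimicked with the standard library alone (a sketch; the real calls are `save_pretrained` and `from_pretrained`):

```python
import json
import tempfile
from pathlib import Path

vocab = {"<pad>": 0, "<unk>": 1, "ENSG00000000003": 2}

with tempfile.TemporaryDirectory() as d:
    # Save: write the vocab out, as save_pretrained would for the vocab file.
    path = Path(d) / "vocab.json"
    path.write_text(json.dumps(vocab))
    # Load: read it back, recovering an identical mapping.
    reloaded = json.loads(path.read_text())

print(reloaded == vocab)  # -> True
```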