# Gene Tokenizer

This directory contains code, scripts, and configuration files for the **Gene Tokenizer**, which tokenizes gene identifiers, gene expression data, and related metadata for downstream tasks.

## Files and Their Roles

### `vocab.json`
- A vocabulary file in JSON format, mapping gene identifiers (e.g., `ENSG00000000003`) to their token IDs.
- The `GeneTokenizer` can load this file if you prefer a JSON-based vocab mapping over the plain-text `vocab.txt`.
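As a rough sketch, a JSON vocab of this shape can be loaded with nothing more than the standard library. The helper name `load_json_vocab` and the tiny vocab below are illustrative, not part of the codebase:

```python
import json
import tempfile

def load_json_vocab(path):
    """Load a gene-ID -> token-ID mapping from a JSON vocab file."""
    with open(path, encoding="utf-8") as f:
        vocab = json.load(f)
    # Normalize IDs to int in case the JSON stored them as strings.
    return {str(tok): int(idx) for tok, idx in vocab.items()}

# Tiny illustrative vocab written to a temp file for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"<pad>": 0, "<unk>": 1, "ENSG00000000003": 2}, f)
    path = f.name

vocab = load_json_vocab(path)
print(vocab["ENSG00000000003"])  # 2
```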

### `gene_tokenizer.py`
- The **main implementation** of the `GeneTokenizer` class, which inherits from Hugging Face’s `PreTrainedTokenizer`.
- Key responsibilities:
  - **Loading vocabularies** (`vocab.txt`, `vocab.json`).
  - **Tokenizing** gene identifiers or text into token IDs (e.g., WordPiece tokenization).
  - **Adding special tokens** like `<cls>`, `<sep>`, `<pad>`, `<unk>`, and `<mask>`.
  - **Saving/loading** the tokenizer configuration.
- Important methods:
  - `load_vocab`: Loads the vocabulary from a file.
  - `_tokenize`: Performs the core tokenization of text/gene identifiers.
  - `build_inputs_with_special_tokens`: Adds special tokens to sequences.
  - `save_vocabulary`: Saves the tokenizer vocabulary files.
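The responsibilities above can be sketched with a stripped-down, standard-library-only stand-in. The real `GeneTokenizer` subclasses Hugging Face's `PreTrainedTokenizer`; the class name `MiniGeneTokenizer`, the whitespace-based `_tokenize`, and the example vocab here are assumptions made for illustration:

```python
from typing import Dict, List

class MiniGeneTokenizer:
    """Illustrative stand-in showing the method roles described above."""

    def __init__(self, vocab: Dict[str, int],
                 cls_token: str = "<cls>", sep_token: str = "<sep>",
                 unk_token: str = "<unk>"):
        self.vocab = vocab
        self.unk_token = unk_token
        self.cls_id = vocab[cls_token]
        self.sep_id = vocab[sep_token]

    def _tokenize(self, text: str) -> List[str]:
        # Gene IDs are atomic symbols, so whitespace splitting suffices here;
        # the real tokenizer may apply WordPiece to free text.
        return text.split()

    def convert_tokens_to_ids(self, tokens: List[str]) -> List[int]:
        unk = self.vocab[self.unk_token]
        return [self.vocab.get(t, unk) for t in tokens]

    def build_inputs_with_special_tokens(self, ids: List[int]) -> List[int]:
        # Wrap the sequence as <cls> ... <sep>.
        return [self.cls_id] + ids + [self.sep_id]

vocab = {"<pad>": 0, "<unk>": 1, "<cls>": 2, "<sep>": 3, "<mask>": 4,
         "ENSG00000000003": 5, "ENSG00000000005": 6}
tok = MiniGeneTokenizer(vocab)
ids = tok.convert_tokens_to_ids(tok._tokenize("ENSG00000000003 ENSG00000000005"))
print(tok.build_inputs_with_special_tokens(ids))  # [2, 5, 6, 3]
```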

---

## How These Files Work Together

1. **Vocabulary Loading**  
   - The `GeneTokenizer` can load its vocabulary from `gene_freq_tokenizer/vocab.txt` or `vocab.json`.  
   - During initialization, the tokenizer references the `tokenizer_config.json` and other metadata files in `gene_freq_tokenizer/` (if you’re using that pre-built tokenizer) or directly uses a JSON/text vocab file specified in your code.

2. **Special Tokens**  
   - Special tokens are defined in the `tokenizer_config.json` and `special_tokens_map.json` files (both inside `gene_freq_tokenizer/`).  
   - These tokens (e.g., `<cls>`, `<pad>`, `<mask>`) are injected automatically during the tokenization process (via `build_inputs_with_special_tokens`).

3. **Annotation Tokens**  
   - `annotation_tokens.txt` lists extra domain-specific tokens (e.g., `<cell>`, `<disease>`), which can be appended to the main vocabulary.  
   - If `bio_annotations` is set to `True` in your `TokenizationArgs`, you can incorporate these tokens into your final vocabulary or your input sequences.

4. **Tokenization Process**  
   - You create an instance of `GeneTokenizer` (from `gene_tokenizer.py`) and pass it the relevant vocab/config.  
   - The `GeneTokenizer` then uses methods like `_tokenize` to split input strings (gene IDs or free text) into subtokens.  
   - If configured, it adds `<cls>` at the start and `<sep>` at the end, and uses `<pad>` to bring sequences to a uniform length.

5. **Saving/Loading**  
   - The tokenizer can be saved via `tokenizer.save_pretrained(<directory>)`, which writes `vocab.txt`, `tokenizer_config.json`, and related files to that directory.
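As a minimal sketch of the save/load round trip, the snippet below writes a vocab file and a config JSON with the file names described above, then reloads the vocab. It uses only the standard library; the helper names and the tiny vocab are illustrative, not the actual `save_pretrained` implementation:

```python
import json
import os
import tempfile

def save_vocab_and_config(directory, vocab, tokenizer_config):
    """Sketch of what save_pretrained writes: a vocab file plus a config JSON."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "vocab.txt"), "w", encoding="utf-8") as f:
        # One token per line, ordered by token ID, as in a standard vocab.txt.
        for tok in sorted(vocab, key=vocab.get):
            f.write(tok + "\n")
    with open(os.path.join(directory, "tokenizer_config.json"), "w",
              encoding="utf-8") as f:
        json.dump(tokenizer_config, f, indent=2)

def load_vocab(directory):
    """Rebuild the token -> ID mapping from vocab.txt line order."""
    with open(os.path.join(directory, "vocab.txt"), encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}

vocab = {"<pad>": 0, "<unk>": 1, "<cls>": 2, "ENSG00000000003": 3}
with tempfile.TemporaryDirectory() as d:
    save_vocab_and_config(d, vocab, {"cls_token": "<cls>", "pad_token": "<pad>"})
    reloaded = load_vocab(d)
print(reloaded == vocab)  # True
```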