This directory contains code, scripts, and configuration files for the **Gene Tokenizer**, which is used to tokenize gene identifiers, gene expression data, and related metadata for downstream tasks.
## Files and Their Roles
### `vocab.json`
- An alternative vocabulary file in JSON format, mapping gene identifiers (e.g., `ENSG00000000003`) to their token IDs.
- This can be loaded by the `GeneTokenizer` if you prefer JSON-based vocab mapping rather than a plain text vocab.
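As a minimal sketch of what the JSON vocabulary looks like and how it can be loaded (the token entries below are illustrative; the real `vocab.json` maps thousands of gene identifiers):

```python
import json

# Illustrative fragment of a JSON vocab: gene identifier -> token ID.
vocab_json = """
{
  "<pad>": 0,
  "<unk>": 1,
  "ENSG00000000003": 5
}
"""

vocab = json.loads(vocab_json)                     # token -> ID
ids_to_tokens = {i: t for t, i in vocab.items()}   # reverse lookup for decoding

print(vocab["ENSG00000000003"])  # -> 5
print(ids_to_tokens[0])          # -> <pad>
```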
### `gene_tokenizer.py`
- The **main implementation** of the `GeneTokenizer` class, which inherits from Hugging Face’s `PreTrainedTokenizer`.
- Key responsibilities:
- **Loading vocabularies** (`vocab.txt`, `vocab.json`).
  - **Tokenizing** gene identifiers or free text into tokens and converting them to token IDs (e.g., via WordPiece).
- **Adding special tokens** like `<cls>`, `<sep>`, `<pad>`, `<unk>`, and `<mask>`.
- **Saving/loading** the tokenizer configuration.
- Important methods:
- `load_vocab`: Loads the vocabulary from a file.
- `_tokenize`: Performs the core tokenization of text/gene identifiers.
- `build_inputs_with_special_tokens`: Adds special tokens to sequences.
- `save_vocabulary`: Saves the tokenizer vocabulary files.
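The methods above can be sketched with a dependency-free, simplified stand-in (the real class inherits from `PreTrainedTokenizer`, and its actual signatures may differ):

```python
import json
from pathlib import Path

class SimpleGeneTokenizer:
    """Simplified illustration of the GeneTokenizer method layout.
    The real implementation subclasses Hugging Face's PreTrainedTokenizer."""

    def __init__(self, vocab):
        self.vocab = dict(vocab)
        self.unk_token = "<unk>"

    @staticmethod
    def load_vocab(path):
        # JSON vocab: {"token": id, ...}; a vocab.txt would be one token per line.
        return json.loads(Path(path).read_text())

    def _tokenize(self, text):
        # Gene IDs are whitespace-separated, so a plain split suffices here;
        # the real tokenizer may apply WordPiece to free text.
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        unk_id = self.vocab[self.unk_token]
        return [self.vocab.get(t, unk_id) for t in tokens]

    def build_inputs_with_special_tokens(self, ids):
        # Wrap the sequence in <cls> ... <sep>.
        return [self.vocab["<cls>"]] + ids + [self.vocab["<sep>"]]

    def save_vocabulary(self, directory):
        out = Path(directory) / "vocab.json"
        out.write_text(json.dumps(self.vocab))
        return str(out)

vocab = {"<pad>": 0, "<unk>": 1, "<cls>": 2, "<sep>": 3, "<mask>": 4,
         "ENSG00000000003": 5, "ENSG00000000005": 6}
tok = SimpleGeneTokenizer(vocab)
ids = tok.convert_tokens_to_ids(tok._tokenize("ENSG00000000003 ENSG00000000005"))
print(tok.build_inputs_with_special_tokens(ids))  # -> [2, 5, 6, 3]
```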
---
## How These Files Work Together
1. **Vocabulary Loading**
- The `GeneTokenizer` can load its vocabulary from `gene_freq_tokenizer/vocab.txt` or `vocab.json`.
   - During initialization, the tokenizer reads `tokenizer_config.json` and the other metadata files in `gene_freq_tokenizer/` (if you are using that pre-built tokenizer), or it uses a JSON/text vocab file specified directly in your code.
2. **Special Tokens**
- Special tokens are defined in the `tokenizer_config.json` and `special_tokens_map.json` files (both inside `gene_freq_tokenizer/`).
- These tokens (e.g., `<cls>`, `<pad>`, `<mask>`) are injected automatically during the tokenization process (via `build_inputs_with_special_tokens`).
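For reference, a `special_tokens_map.json` for these tokens might look roughly like the following (illustrative; the actual file may store each entry as an object with extra fields such as `lstrip`/`rstrip`):

```json
{
  "cls_token": "<cls>",
  "sep_token": "<sep>",
  "pad_token": "<pad>",
  "unk_token": "<unk>",
  "mask_token": "<mask>"
}
```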
3. **Annotation Tokens**
- `annotation_tokens.txt` lists extra domain-specific tokens (e.g., `<cell>`, `<disease>`), which can be appended to the main vocabulary.
- If `bio_annotations` is set to `True` in your `TokenizationArgs`, you can incorporate these tokens into your final vocabulary or your input sequences.
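Appending annotation tokens to an existing vocabulary can be sketched like this (the token list and vocab below are illustrative stand-ins for `annotation_tokens.txt` and the real vocab files):

```python
# Illustrative annotation tokens, as listed one per line in annotation_tokens.txt.
annotation_tokens = ["<cell>", "<disease>"]

vocab = {"<pad>": 0, "<unk>": 1, "ENSG00000000003": 2}

# Assign each new token the next free ID, mirroring how extra tokens
# are typically appended to the end of an existing vocabulary.
for token in annotation_tokens:
    if token not in vocab:
        vocab[token] = len(vocab)

print(vocab["<cell>"])     # -> 3
print(vocab["<disease>"])  # -> 4
```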
4. **Tokenization Process**
- You create an instance of `GeneTokenizer` (from `gene_tokenizer.py`) and pass it the relevant vocab/config.
- The `GeneTokenizer` then uses methods like `_tokenize` to split input strings (gene IDs or free text) into subtokens.
   - If configured, it adds `<cls>` at the start or `<sep>` at the end, and uses `<pad>` to ensure uniform sequence lengths.
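The padding step above can be illustrated with a small helper (a sketch, not the tokenizer's actual padding code; Hugging Face tokenizers normally handle this via `padding=True`):

```python
def pad_sequences(batch, pad_id, max_len=None):
    # Pad every sequence in the batch to a common length with pad_id.
    max_len = max_len or max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

batch = [[2, 5, 3], [2, 5, 6, 7, 3]]
print(pad_sequences(batch, pad_id=0))  # -> [[2, 5, 3, 0, 0], [2, 5, 6, 7, 3]]
```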
5. **Saving/Loading**
   - The tokenizer can be saved via `tokenizer.save_pretrained(<directory>)`, which writes `vocab.txt`, `tokenizer_config.json`, and related files to that directory; since `GeneTokenizer` inherits from `PreTrainedTokenizer`, it can then be reloaded with `GeneTokenizer.from_pretrained(<directory>)`.
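The round trip of the vocabulary portion of that save/load cycle can be mimicked with the standard library alone (a sketch; the real calls are `save_pretrained` and `from_pretrained`):

```python
import json
import tempfile
from pathlib import Path

vocab = {"<pad>": 0, "<unk>": 1, "ENSG00000000003": 2}

with tempfile.TemporaryDirectory() as d:
    # Save: write the vocab out, as save_pretrained would for the vocab file.
    path = Path(d) / "vocab.json"
    path.write_text(json.dumps(vocab))
    # Load: read it back, recovering an identical mapping.
    reloaded = json.loads(path.read_text())

print(reloaded == vocab)  # -> True
```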