NVIDIA-Nemotron-3-Bio-tokenizer

A tokenizer based on NVIDIA Nemotron-3, extended with five biomedical modalities: single-cell transcriptomics (scRNA-seq), BEL pathways, protein sequences, DNA methylation, and biomedical text.

Overview

This tokenizer was created by:

  1. Removing tokens for distant writing systems (Arabic, Cyrillic, CJK, Hangul, Devanagari, and 14 other scripts) from the original Nemotron-3
  2. Removing long non-bio Latin tokens (diacritics-heavy and final-merge tokens) that are not needed for biomedical NLP
  3. Injecting 44,943 HGNC gene symbols as single never-split tokens
  4. Injecting ~300 special tokens for multi-modal bio input: modality markers, BEL DSL, protein amino acids, bin tokens, namespace prefixes

The vocab size is preserved at 131,072 (2^17), matching the original Nemotron-3 architecture.

Key Features

| Feature | Detail |
| --- | --- |
| Vocab size | 131,072 (unchanged from original) |
| Merges | 199,108 |
| BPE type | Byte-level (GPT-2 style) |
| HGNC genes | 44,943 single tokens (100% single-token encoding) |
| Modalities | Text, scRNA-seq, BEL pathways, Proteins (FASTA), DNA methylation |
| Scripts kept | Latin + Greek |
| Base model | nvidia/Nemotron-3-Nano-3B-Instruct |

Supported Modalities

1. Biomedical Text

Standard English biomedical text. Biomedical terms (immunohistochemistry, phosphorylation, chromatography, etc.) are preserved, and Greek letters are kept for formulas.

2. Single-cell Transcriptomics (scRNA-seq)

<SC> CELL 17
HGNC:MS4A1 EXPR_BIN_73
HGNC:CD79A EXPR_BIN_68
HGNC:CD74  EXPR_BIN_61
HGNC:LYZ   EXPR_BIN_5
</SC>

Each gene is a single token. Expression values are single bin tokens: EXPR_BIN_0 through EXPR_BIN_100 (101 bins).
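The layout above can be generated from (gene, bin) pairs. A minimal sketch, assuming the bin indices have already been computed; the helper name `format_cell` is hypothetical and not part of this repository:

```python
def format_cell(cell_id, gene_bins):
    """Render one cell in the <SC> layout shown above.

    gene_bins: iterable of (HGNC symbol, bin index) pairs, bin index in 0..100.
    """
    lines = [f"<SC> CELL {cell_id}"]
    for symbol, bin_idx in gene_bins:
        # Only EXPR_BIN_0 .. EXPR_BIN_100 exist as tokens.
        if not 0 <= bin_idx <= 100:
            raise ValueError(f"bin index out of range: {bin_idx}")
        lines.append(f"HGNC:{symbol} EXPR_BIN_{bin_idx}")
    lines.append("</SC>")
    return "\n".join(lines)


print(format_cell(17, [("MS4A1", 73), ("CD79A", 68), ("CD74", 61), ("LYZ", 5)]))
```

Rejecting out-of-range bins early avoids emitting strings such as EXPR_BIN_101, which would not map to a dedicated token.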

3. BEL Pathways (Biological Expression Language)

<BEL>
p(HGNC:AKT1) directlyIncreases act(p(HGNC:MTOR))
p(HGNC:TP53) decreases bp(GO:"cell cycle")
</BEL>

BEL functions (p(, a(, bp(, etc.) and relations (increases, directlyIncreases, etc.) are single tokens.
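A BEL block can be assembled from subject/relation/object triples. The sketch below is illustrative only: `bel_block` is a hypothetical helper, and the relation set is copied from the Token Inventory section of this document rather than from the BEL specification.

```python
# Relations listed in the Token Inventory of this README (21 entries).
RELATIONS = {
    "increases", "directlyIncreases", "decreases", "directlyDecreases",
    "association", "regulates", "positiveCorrelation", "negativeCorrelation",
    "causesNoChange", "transcribedTo", "translatedTo", "hasComponent",
    "hasComponents", "hasMember", "hasMembers", "hasActivity", "isA",
    "subProcessOf", "rateLimitingStepOf", "orthologous", "noCorrelation",
}


def bel_block(statements):
    """Wrap (subject, relation, object) triples in <BEL> ... </BEL> markers."""
    lines = ["<BEL>"]
    for subject, relation, obj in statements:
        # Relations outside the inventory have no dedicated token.
        if relation not in RELATIONS:
            raise ValueError(f"relation has no dedicated token: {relation}")
        lines.append(f"{subject} {relation} {obj}")
    lines.append("</BEL>")
    return "\n".join(lines)


print(bel_block([
    ("p(HGNC:AKT1)", "directlyIncreases", "act(p(HGNC:MTOR))"),
    ("p(HGNC:TP53)", "decreases", 'bp(GO:"cell cycle")'),
]))
```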

4. Protein Sequences (FASTA-style)

<PROT>
<p>M <p>V <p>L <p>S <p>P <p>A <p>D <p>K
</PROT>

Amino acids use a <p> prefix to avoid conflicts with English letters. There are 20 standard and 7 extended amino acid tokens.
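Converting a plain one-letter sequence into this format is mechanical. A minimal sketch, where `format_protein` is a hypothetical helper and the two alphabets are taken from the Token Inventory section:

```python
STANDARD = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids
EXTENDED = "XBZUO*-"               # 7 extended symbols listed in the inventory


def format_protein(sequence):
    """Render a one-letter amino acid sequence as <p>-prefixed tokens."""
    allowed = set(STANDARD + EXTENDED)
    tokens = []
    for residue in sequence.upper():
        if residue not in allowed:
            raise ValueError(f"no token for residue: {residue!r}")
        tokens.append(f"<p>{residue}")
    return "<PROT>\n" + " ".join(tokens) + "\n</PROT>"


print(format_protein("MVLSPADK"))
```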

5. DNA Methylation

<METH>
HGNC:TP53 METH_BIN_82
HGNC:MLH1 METH_BIN_95
</METH>

Gene-centric mode: the same HGNC gene tokens are paired with single bin tokens, METH_BIN_0 through METH_BIN_100 (101 bins, beta-values on a 0-100 scale).
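The exact binning rule is not specified here. As an illustration only, the sketch below assumes beta-values in [0, 1] are multiplied by 100 and rounded to the nearest integer; both that rule and the helper name `meth_token` are assumptions.

```python
def meth_token(beta):
    """Map a beta-value in [0, 1] to a METH_BIN_<n> token.

    Assumption: bin index = beta * 100 rounded to the nearest integer.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError(f"beta-value out of range: {beta}")
    return f"METH_BIN_{round(beta * 100)}"


def format_methylation(gene_betas):
    """Render (HGNC symbol, beta) pairs in the <METH> layout shown above."""
    lines = ["<METH>"]
    for symbol, beta in gene_betas:
        lines.append(f"HGNC:{symbol} {meth_token(beta)}")
    lines.append("</METH>")
    return "\n".join(lines)


print(format_methylation([("TP53", 0.82), ("MLH1", 0.95)]))
```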

Token Inventory

Modality Markers (12 tokens, special=True)

<TEXT> </TEXT> <SC> </SC> <BEL> </BEL> <PROT> </PROT> <METH> </METH> <SEP> <MASK>

Structural Tokens (8)

CELL SAMPLE META BATCH DONOR TISSUE MASK_GENE MASK_VALUE

Bin Tokens (202)

Expression bins: EXPR_BIN_0 through EXPR_BIN_100 (101 tokens for scRNA-seq quantile bins)
Methylation bins: METH_BIN_0 through METH_BIN_100 (101 tokens for DNA methylation beta-values)

Namespace Prefixes (6)

HGNC: CHEBI: GO: GOBP: MESH: DO:

BEL DSL (43 tokens)

Functions: a( act( bp( sec( surf( complex( composite( deg( frag( fus( g( loc( m( ma( path( pop( p( pmod( rxn( r( tloc( var(

Relations: increases directlyIncreases decreases directlyDecreases association regulates positiveCorrelation negativeCorrelation causesNoChange transcribedTo translatedTo hasComponent hasComponents hasMember hasMembers hasActivity isA subProcessOf rateLimitingStepOf orthologous noCorrelation

Protein Amino Acids (27)

Standard: <p>A through <p>Y (20 tokens)
Extended: <p>X <p>B <p>Z <p>U <p>O <p>* <p>-

HGNC Genes (44,943)

Full HGNC "complete set" as HGNC:<SYMBOL> tokens. Examples: HGNC:TP53, HGNC:BRCA1, HGNC:EGFR, HGNC:MS4A1.

Source: https://www.genenames.org/download/

What Was Removed

| Category | Tokens removed |
| --- | --- |
| Non-Latin/non-Greek scripts (Arabic, Cyrillic, CJK, Hangul, Devanagari, Hiragana, Armenian, Hebrew, Telugu, Bengali, Katakana, Thai, Kannada, Tamil, Georgian, Malayalam, Gujarati, Myanmar) | 34,952 |
| <SPECIAL_N> placeholder tokens | 982 |
| Final Latin tokens with diacritics (German, French, Spanish, Portuguese, Turkish, etc.) | 5,705 |
| Long non-bio ASCII Latin final tokens (len >= 11) | 3,602 |
| Mixed-script tokens | 2 |
| Total removed | 45,243 |

Preserved: All bio/medical English terms (170 tokens like immunohistochemistry, phosphorylation, chromatography, mitochondrial, etc.), all Greek letters, all merge-component tokens (BPE chain integrity).

Usage

Python (tokenizers)

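from tokenizers import Tokenizer

# Load the tokenizer from the JSON file shipped in this repository
tokenizer = Tokenizer.from_file("tokenizer.json")

# Single-cell example
text = "<SC> CELL 1 HGNC:TP53 EXPR_BIN_92 HGNC:BRCA1 EXPR_BIN_45 </SC>"
encoded = tokenizer.encode(text)
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")

# decode() drops special tokens by default; keep them so <SC> and </SC> survive the roundtrip
decoded = tokenizer.decode(encoded.ids, skip_special_tokens=False)
print(f"Decoded: {decoded}")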

Python (transformers)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("transhumanist-already-exists/NVIDIA-Nemotron-3-Bio-tokenizer")

text = "<SC> CELL 1 HGNC:TP53 EXPR_BIN_92 </SC>"
# add_special_tokens=False: encode the text as-is, without template tokens such as BOS/EOS
tokens = tokenizer(text, add_special_tokens=False)
print(tokens["input_ids"])

Validation

| Test | Result |
| --- | --- |
| Vocab size = 131,072 | PASS |
| IDs contiguous 0-131,071 | PASS |
| Merge integrity (0 errors) | PASS |
| HGNC single-token encoding (100/100 sampled) | PASS |
| English text roundtrip | PASS |
| Greek letters roundtrip | PASS |
| Bio terms preserved | PASS |
| Non-Latin/non-Greek leak check (0 leaked) | PASS |
| scRNA format encode/decode | PASS |
| BEL format encode/decode | PASS |
| Protein format encode/decode | PASS |
| Methylation format encode/decode | PASS |
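Two of these checks can be reproduced on any token-to-ID mapping. The sketch below is not the validation script used for this tokenizer. The function names are hypothetical, and a toy dictionary stands in for the real vocabulary.

```python
def ids_contiguous(vocab):
    """True if the IDs are exactly 0..len(vocab)-1, with no gaps or duplicates."""
    return sorted(vocab.values()) == list(range(len(vocab)))


def missing_tokens(vocab, expected):
    """Return the expected tokens that are absent from the vocabulary."""
    return [tok for tok in expected if tok not in vocab]


# Toy vocabulary standing in for the real 131,072-entry mapping.
toy_vocab = {"<SC>": 0, "</SC>": 1, "HGNC:TP53": 2, "EXPR_BIN_92": 3}

print(ids_contiguous(toy_vocab))                               # True
print(missing_tokens(toy_vocab, ["HGNC:TP53", "HGNC:BRCA1"]))  # ['HGNC:BRCA1']
```

With the tokenizers package, the real mapping should be available through Tokenizer.from_file("tokenizer.json").get_vocab(). Presence in the vocabulary is only a necessary condition. Confirming single-token encoding still requires encoding each symbol and checking that exactly one ID comes back.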

Files

| File | Size | Description |
| --- | --- | --- |
| tokenizer.json | 13 MB | Main tokenizer (HuggingFace JSON format) |
| tokenizer_config.json | 440 B | Tokenizer config for transformers |
| special_tokens_map.json | 251 B | Special tokens mapping |
| merge_info.json | 2.6 MB | Full list of removed and added tokens |

merge_info.json

Contains parallel lists of removed/added token IDs and migration metadata:

{
  "replaced_nemotron_total_tokens_count": 45243,
  "added_hgnc_gene_tokens_count": 44944,
  "added_bio_special_tokens_count": 299,
  "added_bin_tokens_count": 202,
  "replaced_nemotron_tokens_ids": [1692, 1702, ...],
  "bio_tokens_ids_in_bio_vocab": [86009, 86010, ...],
  "bin_token_migration": {
    "old_tokens_removed": ["EXPR", "BIN", "METH"],
    "new_tokens_added_count": 202,
    "expr_bin_range": "EXPR_BIN_0 to EXPR_BIN_100",
    "meth_bin_range": "METH_BIN_0 to METH_BIN_100"
  }
}
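The counts in this file can be cross-checked against each other. A minimal sketch, assuming the removed slots are refilled one-for-one by HGNC and bio special tokens, and that the 202 bin tokens are counted within the bio special tokens; neither is stated explicitly in the excerpt. `check_counts` is a hypothetical helper.

```python
import json


def check_counts(info):
    """Return a list of inconsistencies found in the count fields (empty if none)."""
    problems = []
    # Assumption: every removed slot is refilled by an HGNC or bio special token.
    added = (info["added_hgnc_gene_tokens_count"]
             + info["added_bio_special_tokens_count"])
    if added != info["replaced_nemotron_total_tokens_count"]:
        problems.append("added tokens do not match replaced tokens")
    # The bin count appears twice in the file and should agree with itself.
    migration = info["bin_token_migration"]
    if migration["new_tokens_added_count"] != info["added_bin_tokens_count"]:
        problems.append("bin token counts disagree")
    return problems


# Count fields copied from the excerpt above; the ID lists are omitted.
excerpt = json.loads("""
{
  "replaced_nemotron_total_tokens_count": 45243,
  "added_hgnc_gene_tokens_count": 44944,
  "added_bio_special_tokens_count": 299,
  "added_bin_tokens_count": 202,
  "bin_token_migration": {"new_tokens_added_count": 202}
}
""")

print(check_counts(excerpt))  # []
```

For the real file, the dictionary would come from json.load on an open handle to merge_info.json instead of the inline excerpt.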

Technical Details

Unified Bin Tokens

Each bin value (0-100) is a single dedicated token, enabling 2-token gene-value encoding:

  • scRNA: HGNC:<GENE> EXPR_BIN_<0-100> (e.g., HGNC:TP53 EXPR_BIN_92 = 2 tokens)
  • Methylation: HGNC:<GENE> METH_BIN_<0-100> (e.g., HGNC:MLH1 METH_BIN_95 = 2 tokens)

This maximizes compression: each gene-value pair is exactly 2 tokens (gene + bin), compared to 4 tokens with the old multi-token format.
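The effect on sequence length is simple arithmetic. The sketch below estimates how many gene-value pairs fit in one context window. The context length of 4,096 and the fixed overhead of 4 tokens for the markers and cell header are illustrative assumptions, and any separator tokens between pairs are ignored.

```python
def max_pairs(context_len, overhead=4, tokens_per_pair=2):
    """Number of gene-value pairs that fit in a context window.

    overhead: tokens spent on markers and the cell header (assumed value).
    """
    if context_len < overhead:
        return 0
    return (context_len - overhead) // tokens_per_pair


print(max_pairs(4096))                     # 2046 pairs at 2 tokens per pair
print(max_pairs(4096, tokens_per_pair=4))  # 1023 pairs at 4 tokens per pair
```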

BPE Merge Chain Integrity

When removing tokens, all merges referencing removed tokens were also cleaned (70,335 orphaned merges removed). Only tokens that are not used as components in any merge (final/leaf tokens) were candidates for removal from the Latin set, preserving BPE chain integrity.
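The idea can be illustrated on a toy vocabulary. This is a sketch of the general technique, not the script used to build this tokenizer. It assumes merges are (left, right) pairs whose result is the concatenation left + right, and the function names are hypothetical.

```python
def leaf_tokens(vocab, merges):
    """Tokens that never appear as the left or right component of a merge."""
    components = {part for pair in merges for part in pair}
    return {tok for tok in vocab if tok not in components}


def prune_merges(merges, removed):
    """Keep only merges whose components and result all survive removal."""
    return [
        (left, right)
        for left, right in merges
        if left not in removed
        and right not in removed
        and left + right not in removed
    ]


vocab = ["a", "b", "c", "d", "ab", "abc"]
merges = [("a", "b"), ("ab", "c")]

print(sorted(leaf_tokens(vocab, merges)))  # ['abc', 'd']
print(prune_merges(merges, {"abc"}))       # [('a', 'b')]
```

Removing a leaf such as "abc" deletes only the merge that produces it. Removing a non-leaf such as "ab" would also delete the merge that builds "abc", leaving "abc" in the vocabulary but unreachable, which is why restricting removal to leaf tokens keeps the remaining chain intact.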

Embedding Layer Update

To use this tokenizer with a Nemotron model, you need to:

  1. Replace the embedding and output projection layers
  2. Initialize new bio token embeddings (45,243 tokens at IDs 86,009+)
  3. Fine-tune / continually pre-train on bio corpora

Recommended initialization strategies:

  • Random init + targeted pre-training
  • NACHOS-style embedding initialization (see Kiulian et al., 2025)
  • Mean of subword embeddings from original tokenizer
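The last strategy, mean of subword embeddings, can be sketched without any framework. In this toy version plain lists stand in for embedding rows and `old_encode` is a hypothetical stand-in for the original tokenizer's encode step; a real setup would read rows from the model's embedding matrix instead.

```python
def mean_init(new_token, old_encode, old_embeddings):
    """Average the old embedding rows of the pieces a new token used to split into.

    old_encode: callable mapping text to a list of old token IDs (assumed interface).
    old_embeddings: sequence of rows, indexed by old token ID.
    """
    ids = old_encode(new_token)
    if not ids:
        raise ValueError(f"old tokenizer produced no pieces for {new_token!r}")
    dim = len(old_embeddings[ids[0]])
    return [
        sum(old_embeddings[i][d] for i in ids) / len(ids)
        for d in range(dim)
    ]


# Toy data: three old rows of dimension 2, and a fake split into pieces 0 and 2.
old_rows = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
fake_encode = lambda text: [0, 2]

print(mean_init("HGNC:TP53", fake_encode, old_rows))  # [2.0, 1.5]
```

Averaging gives each new token a starting point inside the region the model already associates with its pieces, which tends to be a gentler start than random initialization before continued pre-training.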

References

  • Vocabulary design: Based on token_groups_blueprint.md multi-modal token design
  • HGNC gene nomenclature: genenames.org
  • BEL language specification: language.bel.bio
  • Protein tokenization approach: BioT5+ (Pei et al., 2024), ProtT5 (Elnaggar et al., 2022)
  • Base model: nvidia/Nemotron-3-Nano-3B-Instruct
  • Embedding initialization: Kiulian et al. (2025) "From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages"

License

Follows the base model license.

Created

2026-03-03

Author

Bogdan Didenko
