---
tags: ['tokenizer', 'bert', 'wordpiece']
language: en
license: mit
---

# bert-astronomy-tokenizer

## Description

WordPiece tokenizer with a 30,000-token vocabulary, shared across all astronomy models in this project.

## Tokenizer Details

- **Type**: WordPiece (BERT-style)
- **Vocabulary Size**: 30,000 tokens
- **Special Tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Trained On**: 95,000 Wikipedia documents (full corpus train split)
- **Normalization**: Lowercase, NFD, strip accents
- **Max Length**: 256 tokens

## Usage

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")

# Tokenize text
text = "The Hubble telescope orbits Earth."
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['the', 'hub', '##ble', 'telescope', 'orbit', '##s', 'earth', '.']
```

## Research Context

This tokenizer is part of a research project studying the effect of corpus composition on language model performance.

- **Project**: Effect of Corpus on Language Model Performance
- **Institution**: [Your University]
- **Course**: NLP - Master's Computer Science
- **Date**: November 2024
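
## Batch Encoding Example

The snippet below is a sketch of batch encoding under the settings listed in Tokenizer Details; the example sentences are invented for illustration, and `return_tensors="pt"` assumes PyTorch is installed.

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")

# Batch-encode two sentences with the limits stated above:
# special tokens added, padded/truncated to the 256-token maximum.
texts = [
    "Galaxies like Andromeda contain hundreds of billions of stars.",
    "Beyond Pluto lies the Kuiper Belt.",
]
encoded = tokenizer(
    texts,
    padding="max_length",   # pad with [PAD] up to max_length
    truncation=True,        # drop tokens beyond max_length
    max_length=256,         # matches the card's stated maximum
    return_tensors="pt",    # requires PyTorch
)
print(encoded["input_ids"].shape)  # torch.Size([2, 256])

# Recover the text; skip_special_tokens drops [CLS]/[SEP]/[PAD].
print(tokenizer.decode(encoded["input_ids"][0], skip_special_tokens=True))
```

Since the card lists `[PAD]` among the special tokens, `padding="max_length"` should work without further configuration. Note that the lowercase + NFD + strip-accents normalizer folds accented and capitalized input before WordPiece splitting, so decoded text comes back lowercased.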