bert-astronomy-tokenizer
Description
A WordPiece tokenizer (30k vocabulary) shared across all astronomy models in this project.
Tokenizer Details
- Type: WordPiece (BERT-style)
- Vocabulary Size: 30,000 tokens
- Special Tokens:
  [PAD], [UNK], [CLS], [SEP], [MASK]
- Trained On: 95,000 Wikipedia documents (full corpus train split)
- Normalization: Lowercase, NFD, strip accents
- Max Length: 256 tokens
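The normalization settings above can be inspected directly on the fast tokenizer's backend. A minimal sketch, assuming the normalizer is configured as listed (if none is attached, backend_tokenizer.normalizer may be None):

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")

# Lowercase + NFD + strip-accents: accented input should map to plain lowercase ASCII
normalizer = tokenizer.backend_tokenizer.normalizer
if normalizer is not None:
    print(normalizer.normalize_str("Ångström Crater"))  # expected: "angstrom crater"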
Usage
from transformers import PreTrainedTokenizerFast

# Load the shared tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")
# Tokenize text
text = "The Hubble telescope orbits Earth."
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['the', 'hub', '##ble', 'telescope', 'orbit', '##s', 'earth', '.']
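To produce model-ready input IDs rather than string tokens, call the tokenizer directly. A short sketch using the standard fast-tokenizer call signature; whether [CLS]/[SEP] are inserted around the sequence depends on the post-processor bundled with this tokenizer:

# Encode with truncation to the tokenizer's 256-token max length
enc = tokenizer(text, truncation=True, max_length=256)
print(enc["input_ids"])
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))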
Research Context
This tokenizer is part of a research project studying the effect of corpus composition on language model performance.
Project: Effect of Corpus on Language Model Performance
Institution: [Your University]
Course: NLP (Master's in Computer Science)
Date: November 2024