---
tags: ['tokenizer', 'bert', 'wordpiece']
language: en
license: mit
---

# bert-astronomy-tokenizer

## Description

A WordPiece tokenizer with a 30,000-token vocabulary, shared across all astronomy models in this project.

## Tokenizer Details

- **Type**: WordPiece (BERT-style)
- **Vocabulary Size**: 30,000 tokens
- **Special Tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Trained On**: 95,000 Wikipedia documents (full corpus train split)
- **Normalization**: lowercase, NFD, strip accents (see the sketch below)
- **Max Length**: 256 tokens
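
These normalization steps can be reproduced with the `tokenizers` library. A minimal sketch, assuming the usual BERT-style pipeline order (NFD before accent stripping); the example string is hypothetical:

```python
from tokenizers import normalizers

# Rebuild the normalization pipeline listed above:
# NFD Unicode decomposition, lowercasing, accent stripping.
normalizer = normalizers.Sequence([
    normalizers.NFD(),
    normalizers.Lowercase(),
    normalizers.StripAccents(),
])

print(normalizer.normalize_str("Quasars émit X-rays"))
# -> "quasars emit x-rays"
```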

## Usage

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")

# Tokenize text
text = "The Hubble telescope orbits Earth."
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['the', 'hub', '##ble', 'telescope', 'orbit', '##s', 'earth', '.']
```
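
For model input, call the tokenizer directly to get token IDs with special tokens applied and the 256-token limit enforced. A minimal sketch using standard `transformers` arguments; whether `[CLS]`/`[SEP]` are actually inserted depends on the post-processor saved with the tokenizer:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")

# Encode to IDs, truncating and padding to the stated 256-token limit.
encoded = tokenizer(
    "The Hubble telescope orbits Earth.",
    max_length=256,
    truncation=True,
    padding="max_length",
)
print(len(encoded["input_ids"]))                 # 256
print(tokenizer.decode(encoded["input_ids"][:8]))
```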

## Research Context

This tokenizer is part of a research project studying the effect of corpus composition on language model performance.

**Project**: Effect of Corpus on Language Model Performance
**Institution**: [Your University]
**Course**: NLP - Master's Computer Science
**Date**: November 2024