---
tags: ['tokenizer', 'bert', 'wordpiece']
language: en
license: mit
---

# bert-astronomy-tokenizer

## Description
A WordPiece tokenizer (30,000-token vocabulary) shared across all of the project's astronomy models.

## Tokenizer Details
- **Type**: WordPiece (BERT-style)
- **Vocabulary Size**: 30,000 tokens
- **Special Tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Trained On**: 95,000 Wikipedia documents (full corpus train split)
- **Normalization**: Lowercase, NFD, strip accents
- **Max Length**: 256 tokens
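
The normalization and special-token settings above can be sanity-checked directly. The sketch below is illustrative: the example subwords are hypothetical, and whether `[CLS]`/`[SEP]` are inserted depends on the post-processor saved with the tokenizer (BERT-style tokenizers usually include one).

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")

# Accented input should be lowercased and accent-stripped by the
# normalizer (lowercase + NFD + strip accents) before WordPiece runs.
encoded = tokenizer("Crêpe Nebula", truncation=True, max_length=256)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# e.g. ['[CLS]', 'crepe', 'nebula', '[SEP]'] -- exact subwords depend on the vocab
```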

## Usage

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")

# Tokenize text
text = "The Hubble telescope orbits Earth."
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['the', 'hub', '##ble', 'telescope', 'orbit', '##s', 'earth', '.']
```
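
To prepare batched model inputs rather than inspect tokens, call the tokenizer directly with padding and truncation. A minimal sketch, reusing the `tokenizer` loaded above; it assumes the saved `[PAD]` token is configured as the pad token, and `return_tensors="pt"` requires PyTorch:

```python
# Batch-encode two sentences, padding to the longest and truncating
# at the tokenizer's 256-token limit.
batch = tokenizer(
    ["The Hubble telescope orbits Earth.", "Galaxies cluster along cosmic filaments."],
    padding=True,
    truncation=True,
    max_length=256,
    return_tensors="pt",  # requires PyTorch
)
print(batch["input_ids"].shape)  # (2, seq_len); seq_len depends on tokenization
print(batch["attention_mask"])   # 1 for real tokens, 0 for [PAD] positions
```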

## Research Context
This tokenizer is part of a research project studying the effect of corpus composition on language model performance.

**Project**: Effect of Corpus on Language Model Performance  
**Institution**: [Your University]  
**Course**: NLP - Master's Computer Science  
**Date**: November 2024