---
tags: ['tokenizer', 'bert', 'wordpiece']
language: en
license: mit
---

# bert-astronomy-tokenizer

## Description

A WordPiece tokenizer with a 30,000-token vocabulary, shared across all astronomy models in this project.

## Tokenizer Details

- **Type**: WordPiece (BERT-style)
- **Vocabulary Size**: 30,000 tokens
- **Special Tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Trained On**: 95,000 Wikipedia documents (full corpus train split)
- **Normalization**: lowercase, NFD, strip accents (see the sketch below)
- **Max Length**: 256 tokens
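
These normalization steps can be reproduced with the `tokenizers` library. A minimal sketch, assuming the usual BERT-style pipeline order (NFD before accent stripping); the example string is hypothetical:

```python
from tokenizers import normalizers

# Rebuild the normalization pipeline listed above:
# NFD Unicode decomposition, lowercasing, accent stripping.
normalizer = normalizers.Sequence([
    normalizers.NFD(),
    normalizers.Lowercase(),
    normalizers.StripAccents(),
])

print(normalizer.normalize_str("Quasars émit X-rays"))
# -> "quasars emit x-rays"
```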

## Usage

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")

# Tokenize text
text = "The Hubble telescope orbits Earth."
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['the', 'hub', '##ble', 'telescope', 'orbit', '##s', 'earth', '.']
```
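
For model input, call the tokenizer directly to get token IDs with special tokens applied and the 256-token limit enforced. A minimal sketch using standard `transformers` arguments; whether `[CLS]`/`[SEP]` are actually inserted depends on the post-processor saved with the tokenizer:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")

# Encode to IDs, truncating and padding to the stated 256-token limit.
encoded = tokenizer(
    "The Hubble telescope orbits Earth.",
    max_length=256,
    truncation=True,
    padding="max_length",
)
print(len(encoded["input_ids"]))                 # 256
print(tokenizer.decode(encoded["input_ids"][:8]))
```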

## Research Context

This tokenizer is part of a research project studying the effect of corpus composition on language model performance.

**Project**: Effect of Corpus on Language Model Performance
**Institution**: [Your University]
**Course**: NLP - Master's Computer Science
**Date**: November 2024