vraj1 committed on
Commit 52d283e · verified · 1 Parent(s): 57e75ac

Upload WordPiece tokenizer (30k vocab) shared across all astronomy models

README.md ADDED
@@ -0,0 +1,40 @@
+ ---
+ tags: ['tokenizer', 'bert', 'wordpiece']
+ language: en
+ license: mit
+ ---
+
+ # bert-astronomy-tokenizer
+
+ ## Description
+ WordPiece tokenizer (30k vocab) shared across all astronomy models
+
+ ## Tokenizer Details
+ - **Type**: WordPiece (BERT-style)
+ - **Vocabulary Size**: 30,000 tokens
+ - **Special Tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
+ - **Trained On**: 95,000 Wikipedia documents (full corpus train split)
+ - **Normalization**: Lowercase, NFD, strip accents
+ - **Max Length**: 256 tokens
+
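+ A tokenizer with the configuration listed above can be reproduced roughly with the `tokenizers` library. The snippet below is a minimal sketch under those assumed settings, not the actual training script used for this repository; `wiki_texts` is a hypothetical iterator over the ~95,000 Wikipedia documents.
+
+ ```python
+ from tokenizers import Tokenizer, normalizers, pre_tokenizers
+ from tokenizers.models import WordPiece
+ from tokenizers.trainers import WordPieceTrainer
+
+ # WordPiece model with the BERT-style unknown token
+ tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
+
+ # Normalization listed above: NFD, lowercase, strip accents
+ tokenizer.normalizer = normalizers.Sequence([
+     normalizers.NFD(),
+     normalizers.Lowercase(),
+     normalizers.StripAccents(),
+ ])
+ tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
+
+ # 30k vocabulary; special-token order matches the IDs in tokenizer_config.json (0-4)
+ trainer = WordPieceTrainer(
+     vocab_size=30_000,
+     special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
+ )
+
+ # wiki_texts: hypothetical iterable of raw document strings
+ tokenizer.train_from_iterator(wiki_texts, trainer=trainer)
+ tokenizer.save("tokenizer-wp.json")
+ ```
+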
+ ## Usage
+
+ ```python
+ from transformers import PreTrainedTokenizerFast
+
+ tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")
+
+ # Tokenize text
+ text = "The Hubble telescope orbits Earth."
+ tokens = tokenizer.tokenize(text)
+ print(tokens)
+ # Output: ['the', 'hub', '##ble', 'telescope', 'orbit', '##s', 'earth', '.']
+ ```
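+
+ Note that the uploaded `tokenizer_config.json` leaves `model_max_length` at the library's very-large default rather than 256, so the 256-token limit described above has to be requested explicitly when encoding. A small sketch:
+
+ ```python
+ # Pad/truncate to the 256-token limit used by the project
+ enc = tokenizer(
+     "The Hubble telescope orbits Earth.",
+     truncation=True,
+     padding="max_length",
+     max_length=256,
+ )
+ print(len(enc["input_ids"]))  # 256
+ ```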
+
+ ## Research Context
+ This tokenizer is part of a research project studying the effect of corpus composition on language model performance.
+
+ - **Project**: Effect of Corpus on Language Model Performance
+ - **Institution**: [Your University]
+ - **Course**: NLP - Master's Computer Science
+ - **Date**: November 2024
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer-wp.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,53 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "tokenizer_class": "PreTrainedTokenizerFast",
+   "unk_token": "[UNK]"
+ }
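
The `added_tokens_decoder` block above pins the five special tokens to IDs 0-4, while `model_max_length` is left at the library's no-limit sentinel rather than the project's 256. A quick optional check after loading (a sketch, using the repo id from the README):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")

# Special-token IDs should follow added_tokens_decoder: [PAD]=0 ... [MASK]=4
for token in ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:
    print(token, tokenizer.convert_tokens_to_ids(token))
```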