vraj1 committed on
Commit 52d283e · verified · 1 Parent(s): 57e75ac

Upload WordPiece tokenizer (30k vocab) shared across all astronomy models

README.md ADDED
@@ -0,0 +1,40 @@
+ ---
+ tags: ['tokenizer', 'bert', 'wordpiece']
+ language: en
+ license: mit
+ ---
+
+ # bert-astronomy-tokenizer
+
+ ## Description
+ WordPiece tokenizer (30k vocab) shared across all astronomy models
+
+ ## Tokenizer Details
+ - **Type**: WordPiece (BERT-style)
+ - **Vocabulary Size**: 30,000 tokens
+ - **Special Tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
+ - **Trained On**: 95,000 Wikipedia documents (full corpus train split)
+ - **Normalization**: Lowercase, NFD, strip accents
+ - **Max Length**: 256 tokens
+
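+ A tokenizer with the configuration listed above can be reproduced roughly with the `tokenizers` library. The snippet below is a minimal sketch under those assumed settings, not the actual training script used for this repository; `wiki_texts` is a hypothetical iterator over the ~95,000 Wikipedia documents.
+
+ ```python
+ from tokenizers import Tokenizer, normalizers, pre_tokenizers
+ from tokenizers.models import WordPiece
+ from tokenizers.trainers import WordPieceTrainer
+
+ # WordPiece model with the BERT-style unknown token
+ tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
+
+ # Normalization listed above: NFD, lowercase, strip accents
+ tokenizer.normalizer = normalizers.Sequence([
+     normalizers.NFD(),
+     normalizers.Lowercase(),
+     normalizers.StripAccents(),
+ ])
+ tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
+
+ # 30k vocabulary; special-token order matches the IDs in tokenizer_config.json (0-4)
+ trainer = WordPieceTrainer(
+     vocab_size=30_000,
+     special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
+ )
+
+ # wiki_texts: hypothetical iterable of raw document strings
+ tokenizer.train_from_iterator(wiki_texts, trainer=trainer)
+ tokenizer.save("tokenizer-wp.json")
+ ```
+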
+ ## Usage
+
+ ```python
+ from transformers import PreTrainedTokenizerFast
+
+ tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")
+
+ # Tokenize text
+ text = "The Hubble telescope orbits Earth."
+ tokens = tokenizer.tokenize(text)
+ print(tokens)
+ # Output: ['the', 'hub', '##ble', 'telescope', 'orbit', '##s', 'earth', '.']
+ ```
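+
+ Note that the uploaded `tokenizer_config.json` leaves `model_max_length` at the library's very-large default rather than 256, so the 256-token limit described above has to be requested explicitly when encoding. A small sketch:
+
+ ```python
+ # Pad/truncate to the 256-token limit used by the project
+ enc = tokenizer(
+     "The Hubble telescope orbits Earth.",
+     truncation=True,
+     padding="max_length",
+     max_length=256,
+ )
+ print(len(enc["input_ids"]))  # 256
+ ```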
+
+ ## Research Context
+ This tokenizer is part of a research project studying the effect of corpus composition on language model performance.
+
+ - **Project**: Effect of Corpus on Language Model Performance
+ - **Institution**: [Your University]
+ - **Course**: NLP - Master's Computer Science
+ - **Date**: November 2024
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer-wp.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,53 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "tokenizer_class": "PreTrainedTokenizerFast",
+   "unk_token": "[UNK]"
+ }
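
The `added_tokens_decoder` block above pins the five special tokens to IDs 0-4, while `model_max_length` is left at the library's no-limit sentinel rather than the project's 256. A quick optional check after loading (a sketch, using the repo id from the README):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")

# Special-token IDs should follow added_tokens_decoder: [PAD]=0 ... [MASK]=4
for token in ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:
    print(token, tokenizer.convert_tokens_to_ids(token))
```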