pszemraj committed · Commit 4867f26 · verified · 0 Parent(s)

Super-squash branch 'main' using huggingface_hub

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,119 @@
+ ---
+ license: artistic-2.0
+ language:
+ - en
+ tags:
+ - '16384'
+ - 16k
+ ---
+ 
+ # mega-encoder-small-16k-v1
+ 
+ This is a "huggingface-native" pretrained encoder-only model with a 16384-token context length. The model architecture is [MEGA](https://arxiv.org/abs/2209.10655).
+ 
+ ## Numbers
+ 
+ Despite being a long-context model evaluated on a short-context benchmark (GLUE), this model holds up decently:
+ 
+ | Model | Size | CTX | Avg |
+ | :------------------------ | :---- | ----: | -----: |
+ | mega-encoder-small-16k-v1 | 122M | 16384 | 0.777 |
+ | bert-base-uncased | 110M | 512 | 0.7905 |
+ | roberta-base | 125M | 514 | 0.86 |
+ | [bert-plus-L8-4096-v1.0](https://huggingface.co/BEE-spoke-data/bert-plus-L8-4096-v1.0) | 88.1M | 4096 | 0.8278 |
+ | [mega-wikitext103](https://huggingface.co/mnaylor/mega-base-wikitext) | 7.0M | 10000 | 0.48 |
+ 
+ <details>
+ <summary><strong>GLUE Details</strong></summary>
+ 
+ | Model | Size | CTX | Avg | CoLA | SST2 | MRPC | STSB | QQP | MNLI | QNLI | RTE |
+ | :------------------------ | :---- | ----: | -----: | -----: | ----: | -----: | -----: | ----: | ----: | ----: | -----: |
+ | mega-encoder-small-16k-v1 | 122M | 16384 | 0.777 | 0.454 | 0.914 | 0.8404 | 0.906 | 0.894 | 0.806 | 0.842 | 0.556 |
+ | bert-base-uncased | 110M | 512 | 0.7905 | 0.521 | 0.935 | 0.889 | 0.858 | 0.712 | 0.84 | 0.905 | 0.664 |
+ | roberta-base | 125M | 514 | 0.86 | 0.64 | 0.95 | 0.9 | 0.91 | 0.92 | 0.88 | 0.93 | 0.79 |
+ | bert-plus-L8-4096-v1.0 | 88.1M | 4096 | 0.8278 | 0.6272 | 0.906 | 0.8659 | 0.9207 | 0.906 | 0.832 | 0.9 | 0.6643 |
+ | mega-wikitext103 | 7M | 10000 | 0.480 | 0.00 | 0.732 | 0.748 | -0.087 | 0.701 | 0.54 | 0.598 | 0.513 |
+ 
+ The evals for the MEGA and bert-plus models can be found in [this open wandb project](https://wandb.ai/pszemraj/glue-benchmarking) and are the maximum values observed on the validation sets. The values for the other models are as reported in their papers.
+ </details>
+ 
+ ## Design
+ 
+ ### Architecture
+ 
+ This encoder model has 8 layers, a hidden size of 768, and a 3x feedforward ratio, for a total size of 122M params.
+ 
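+ These numbers can be read straight off the hub config (a minimal sketch, assuming a `transformers` version that still ships the MEGA architecture; the field names below are exactly those in this repo's `config.json`):
+ 
+ ```python
+ from transformers import AutoConfig
+ 
+ # pull this repo's config and print the headline architecture fields
+ cfg = AutoConfig.from_pretrained("BEE-spoke-data/mega-encoder-small-16k-v1")
+ print(cfg.num_hidden_layers)         # 8
+ print(cfg.hidden_size)               # 768
+ print(cfg.nffn_hidden_size)          # 2304, i.e. 3x hidden
+ print(cfg.chunk_size)                # 1024 (MEGA chunking)
+ print(cfg.ema_projection_size)       # 32
+ print(cfg.max_positions)             # 16384
+ print(cfg.relative_positional_bias)  # "simple"
+ ```
+ 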
+ <details>
+ <summary><strong>Architecture Details</strong></summary>
+ 
+ Details:
+ 
+ 1. We use a hidden size of 768 and a 3x hidden:feedforward ratio.
+    - This contrasts with the 2x ratio used in the paper.
+ 2. To handle the long context, we use MEGA's chunking mechanism with a chunk length of 1024. As a result, VRAM usage grows linearly with each multiple of 1024 tokens beyond that length.
+ 3. EMA dimension: we use an EMA dimension of 32 in the interest of modeling long and (potentially) complex sequences.
+ 4. We use 8 layers and a context length of 16384 tokens.
+ 5. We use `"simple"` relative positional embeddings instead of the rotary embeddings touted in the paper.
+    - This choice came from examining [the detailed logs of models](https://github.com/facebookresearch/mega/blob/aeaa4b44592cd1d60a9a34554e359eda2a62b03b/examples/mega/README.lra.md) trained/evaluated on [the LRA benchmark](https://paperswithcode.com/sota/long-range-modeling-on-lra): the models geared towards encoder-type tasks all use simple relative positional embeddings.
+    - We observed poor performance and inexplicable 'walls' in previous experiments using rotary positional embeddings with MEGA as an encoder.
+ 6. BART tokenizer: we use the tokenizer from `facebook/bart-large`.
+    - This choice was motivated mostly by the desire to use the MEGA encoder in combination with a decoder model in the [HF EncoderDecoderModel class](https://huggingface.co/docs/transformers/model_doc/encoder-decoder) in a "huggingface-native" way. BART is supported as a decoder for this class, **and** BART's tokenizer has the necessary preprocessing for encoder training.
+      - Example usage of MEGA+BART to create an encoder-decoder model is [here](https://colab.research.google.com/gist/pszemraj/4bac8635361543b66207d73e4b25a13a/mega-encoder-small-16k-v1-for-text2text.ipynb); a minimal sketch also follows this section.
+    - The tokenizer's vocab is **exactly** the same as RoBERTa's.
+ </details>
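+ 
+ Pairing this encoder with a BART decoder is one line with the `EncoderDecoderModel` class (a minimal sketch; see the linked notebook for a worked example):
+ 
+ ```python
+ from transformers import EncoderDecoderModel
+ 
+ # tie the MEGA encoder to a BART decoder; both sides share the BART/RoBERTa vocab
+ model = EncoderDecoderModel.from_encoder_decoder_pretrained(
+     "BEE-spoke-data/mega-encoder-small-16k-v1",
+     "facebook/bart-large",
+ )
+ ```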
+ 
+ ### Training
+ 
+ This model was trained with the transformers package. You can find (mostly unorganized) [training runs on wandb here](https://wandb.ai/pszemraj/mega-tuning-longctx).
+ 
+ <details>
+ <summary><strong>Training Details</strong></summary>
+ 
+ 1. **Multi-task training:** the majority of training is "standard" MLM, with no next-sentence prediction, etc. However, in the interest of pretraining a _useful_ encoder for fine-tuning on various tasks, we mix in such tasks between several of the MLM phases, carrying over the model's backbone to the next training phase.
+    - An example would be multiple-choice tuning on the [swag](https://huggingface.co/datasets/swag) dataset.
+ 2. **MLM mask ratio of 40% by default:** we use a 40% masking ratio, following [Wettig et al. 2022](https://arxiv.org/abs/2202.08005). This is decreased slightly when training at longer sequences (8192+) to encourage the model to learn/leverage the available context in its predictions (a collator sketch follows this section).
+ 3. AMP with bf16.
+ 4. **Gradient checkpointing implementation:** training this (or a similar) model at context 8192 or longer becomes quite VRAM-intensive despite the linear growth in memory usage; see the gradient checkpointing section below.
+ </details>
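+ 
+ The 40% mask ratio amounts to a single argument on the standard MLM collator (a minimal sketch, assuming `DataCollatorForLanguageModeling` from `transformers`):
+ 
+ ```python
+ from transformers import AutoTokenizer, DataCollatorForLanguageModeling
+ 
+ tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/mega-encoder-small-16k-v1")
+ # 40% masking ratio per Wettig et al. 2022, vs. the usual 15%
+ collator = DataCollatorForLanguageModeling(
+     tokenizer=tokenizer, mlm=True, mlm_probability=0.4
+ )
+ ```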
+ 
+ ## Usage
+ 
+ This is a pretrained model intended to be [fine-tuned on various encoder-compatible tasks](https://github.com/huggingface/transformers/tree/831bc25d8fdb85768402f772cf65cc3d7872b211/examples/pytorch). However, if you are interested in testing inference with this model or have a deep passion for predicting mask tokens, you can use the following code:
+ 
+ ```python
+ import json
+ from transformers import pipeline
+ 
+ # load the fill-mask pipeline with this checkpoint
+ pipe = pipeline("fill-mask", model="BEE-spoke-data/mega-encoder-small-16k-v1")
+ text = "I love to <mask> memes."
+ result = pipe(text)
+ print(json.dumps(result, indent=2))
+ ```
+ 
+ ### Gradient checkpointing implementation
+ 
+ If fine-tuning this model on `<task>`, using gradient checkpointing makes training at 16384 context quite feasible. By installing the transformers fork below and passing `gradient_checkpointing=True` in the training args, you should be able to fine-tune at batch size 1 with VRAM to spare on a single 3090/4090.
+ 
+ ```sh
+ pip uninstall -y transformers
+ pip install -U git+https://github.com/pszemraj/transformers.git@mega-gradient-checkpointing
+ pip install -U huggingface-hub
+ ```
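+ 
+ With the fork installed, checkpointing is a single flag in the training arguments (a minimal sketch; `output_dir` and the other values are illustrative placeholders):
+ 
+ ```python
+ from transformers import TrainingArguments
+ 
+ args = TrainingArguments(
+     output_dir="mega-encoder-small-16k-ft",  # placeholder
+     per_device_train_batch_size=1,
+     gradient_checkpointing=True,  # enables the fork's MEGA checkpointing
+     bf16=True,
+ )
+ ```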
+ 
+ If there is sufficient interest, we can look at making a PR into the official repo.
+ 
+ ## Citation
+ 
+ If you find this useful, please consider citing this DOI; it would make us happy.
+ 
+ ```
+ @misc{beespoke_data_2024,
+     author = {Peter Szemraj and Vincent Haines and {BEEspoke Data}},
+     title = {mega-encoder-small-16k-v1 (Revision 1476bcf)},
+     year = 2024,
+     url = {https://huggingface.co/BEE-spoke-data/mega-encoder-small-16k-v1},
+     doi = {10.57967/hf/1837},
+     publisher = {Hugging Face}
+ }
+ ```
added_tokens.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "<SEP>": 50265
+ }
config.json ADDED
@@ -0,0 +1,48 @@
+ {
+   "_name_or_path": "BEE-spoke-data/mega-enc-MKVs-L8-v0.8-dolma-xlong_16384",
+   "activation": "silu",
+   "add_lm_hidden_dense_layer": false,
+   "add_token_type_embeddings": true,
+   "architectures": [
+     "MegaForMaskedLM"
+   ],
+   "attention_activation": "softmax",
+   "attention_probs_dropout_prob": 0,
+   "bidirectional": true,
+   "bos_token_id": 0,
+   "chunk_size": 1024,
+   "classifier_dropout": null,
+   "dropout_prob": 0.05,
+   "ema_beta_range": 0.02,
+   "ema_delta_alpha_range": 0.2,
+   "ema_gamma_omega_range": 1.0,
+   "ema_projection_size": 32,
+   "eos_token_id": 2,
+   "hidden_dropout_prob": 0,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 2304,
+   "max_positions": 16384,
+   "model_type": "mega",
+   "nffn_activation_dropout_prob": 0,
+   "nffn_hidden_size": 2304,
+   "norm_affine": true,
+   "normalization_type": "scalenorm",
+   "normalize_before_ffn": false,
+   "normalize_before_mega": false,
+   "num_attention_heads": 1,
+   "num_hidden_layers": 8,
+   "pad_token_id": 1,
+   "relative_positional_bias": "simple",
+   "sep_token_id": 2,
+   "shared_representation_size": 192,
+   "torch_dtype": "float32",
+   "transformers_version": "4.38.2",
+   "truncation": null,
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "use_chunking": true,
+   "use_feature_dropout": false,
+   "use_normalized_ffn": true,
+   "vocab_size": 50304
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d1acb34fbb0279191844c0fd0ca8c0dfd86f1633760b01a831f2a68a11515501
+ size 488057176
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,72 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "50264": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "50265": {
+       "content": "<SEP>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "errors": "replace",
+   "mask_token": "<mask>",
+   "max_length": 16384,
+   "model_max_length": 16384,
+   "pad_to_multiple_of": 1024,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "</s>",
+   "stride": 0,
+   "tokenizer_class": "BartTokenizer",
+   "trim_offsets": true,
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "</s>"
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff