---
language:
- en
library_name: pytorch
tags:
- language-model
- gpt2
- transformer
- wikitext-103
model-index:
- name: gpt2_wt103-40m_12-layer
  results:
  - task:
      type: language-modeling
    dataset:
      type: wikitext
      name: Wikitext-103
    metrics:
    - type: perplexity
      value: 40.6
---

# Model description
|
|
paper: [Characterizing Verbatim Short-Term Memory in Neural Language Models](https://arxiv.org/abs/2210.13569)
|
|
This is a gpt2-small-like decoder-only transformer model trained on the [wikitext-103 dataset](https://paperswithcode.com/dataset/wikitext-103).
|
|
# Usage
|
|
You can download and load the model as follows:
|
|
```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("Kristijan/gpt2_wt103_12-layer")
```
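
Once loaded, the checkpoint behaves like any other `GPT2LMHeadModel`. As a minimal sketch of the usual inference setup (standard Transformers/PyTorch usage, not specific to this checkpoint):

```python
import torch

# Standard inference setup: move to GPU if one is available and disable dropout.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()
```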
|
|
Alternatively, if you've downloaded the checkpoint files in this repository, you could also do:
|
|
```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained(path_to_folder_with_checkpoint_files)
```
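
Note that `path_to_folder_with_checkpoint_files` above is a placeholder for the local directory; `from_pretrained` expects it to contain the usual Transformers checkpoint files (a `config.json` together with the model weights).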
|
|
## BPE Tokenizer
|
|
You should first pretokenize your text using the [MosesTokenizer](https://pypi.org/project/mosestokenizer/):
|
|
```python
from mosestokenizer import MosesTokenizer

# text_string holds your raw input text; Moses pretokenization splits off
# punctuation etc. before BPE tokenization.
with MosesTokenizer('en') as pretokenize:
    pretokenized_text = " ".join(pretokenize(text_string))
```
|
|
Then, to BPE tokenize your text for this model, you should use the [tokenizer trained on Wikitext-103](https://huggingface.co/Kristijan/wikitext-103_tokenizer_v2):
|
|
```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("Kristijan/wikitext-103-tokenizer_v2")
tokenized_text = tokenizer.tokenize(pretokenized_text)
```
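
As a rough end-to-end sketch (reusing `model`, `tokenizer`, and `tokenized_text` from the snippets above), you can convert the BPE tokens to ids and compute the model's perplexity on a piece of text:

```python
import torch

# Convert the BPE tokens to ids and add a batch dimension.
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokenized_text)]).to(model.device)

# With labels provided, the model returns the average cross-entropy loss;
# exponentiating it gives the perplexity on this text.
with torch.no_grad():
    outputs = model(input_ids, labels=input_ids)

print(torch.exp(outputs.loss).item())
```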
|
|
# Intended uses
|
|
This checkpoint is intended for research purposes, for example for those interested in studying the behavior of transformer language models trained on smaller datasets.