---
library_name: transformers
tags: []
---

# Model Card for chessGPT2

A chess-playing GPT-2 that generates moves in UCI notation, finetuned with a fixed 3-tokens-per-ply tokenization scheme.



## Model Details

### Model Description

This is a chess-playing GPT-2.
It was finetuned from the [austindavis/chessGPT_d12](https://huggingface.co/austindavis/chessGPT2) model, but uses a fixed 3-tokens-per-ply tokenization scheme rather than the variable-length tokenization of chessGPT_d12 (where promotion tokens interrupt the otherwise consistent 2-tokens-per-ply structure).
The model was finetuned on the [Feb 2023 Lichess UCI](https://huggingface.co/datasets/austindavis/lichess-uci/viewer/202302) dataset.
Training progress and configuration are logged in the Weights & Biases run at [https://wandb.ai/austinleedavis/chess_public/runs/itgnfae4](https://wandb.ai/austinleedavis/chess_public/runs/itgnfae4).
Although 27 epochs were completed, the checkpoint published here is from epoch 20 (step 399,825) because validation loss spiked sharply during epoch 25.
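
As a quick illustration of the fixed-length scheme, the sketch below (assuming the custom tokenizer described under Uses is installed) prints the tokens for a plain move and a promotion move; under the 3-tokens-per-ply scheme, both should consume the same per-ply budget:
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("austindavis/chessGPT2")

# A plain move and a promotion move; with the fixed 3-tokens-per-ply
# scheme, neither should break the per-ply alignment.
for move in ["e2e4", "e7e8q"]:
    ids = tokenizer(move)["input_ids"]
    print(move, tokenizer.convert_ids_to_tokens(ids))
```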


## Uses

This model requires a custom build of the Tokenizers library. The customization adds an `Append` normalizer that appends a space to the end of every input sequence.
To install the custom tokenizer, run:
```sh
pip install git+https://github.com/austindavis/tokenizers.git#subdirectory=bindings/python
```
Without this customization you can still run the model, but you must remove the following `normalizer` entry from `tokenizer.json` (lines 43 through 46) and manually append a space to the end of every input sequence:
```json
"normalizer": {
    "type": "Append",
    "append": " "
  },
```
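
If you take the manual route, the preprocessing amounts to appending the space yourself before tokenizing. Here is a minimal sketch; the `encode_uci` helper is illustrative, not part of this repo:
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("austindavis/chessGPT2")

def encode_uci(text: str) -> list[int]:
    # Stand-in for the removed `Append` normalizer: every input
    # sequence must end with a single trailing space.
    if not text.endswith(" "):
        text += " "
    return tokenizer(text)["input_ids"]
```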

Here's a small lambda that facilitates decoding into valid UCI by removing the extra spaces added by the tokenizer:
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("austindavis/chessGPT2")
# Intra-ply single spaces are stripped; ply-boundary double spaces are
# collapsed to single spaces via a temporary "_" placeholder.
decode = lambda ids: tokenizer.decode(ids).replace("  ", "_").replace(" ", "").replace("_", " ")
```
Then, you can encode/decode as follows:
```python
>>> decode(tokenizer("e2e4")['input_ids'])
'<|startoftext|>e2e4 '
```
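
Putting it together, a minimal generation sketch might look like the following. It assumes the checkpoint loads with the standard `AutoModelForCausalLM` class and repeats the `decode` helper from above for self-containment; adjust the sampling settings to taste:
```python
import torch
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("austindavis/chessGPT2")
model = AutoModelForCausalLM.from_pretrained("austindavis/chessGPT2")

# Encode an opening; the custom tokenizer appends the trailing space itself.
inputs = tokenizer("e2e4 e7e5", return_tensors="pt")

# Sample one ply (3 tokens under the fixed tokenization scheme).
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=3, do_sample=True)

decode = lambda ids: tokenizer.decode(ids).replace("  ", "_").replace(" ", "").replace("_", " ")
print(decode(output[0].tolist()))
```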