---
license: other
library_name: pytorch
tags:
  - pytorch
  - gpt
  - gpt2-style
  - polish
  - tokenizer
  - byte-level-bpe
  - causal-lm
language:
  - pl
pipeline_tag: text-generation
---

# Slayer GPT Tokenizer Model

Teaching archive for the Slayer GPT-style Polish language-model experiment.

This repo is meant to help people replicate the workflow, not just store artifacts. It includes a runnable local GPT checkpoint, the custom tokenizer it was trained with, a later tokenizer variant, the Slayer H100 training scripts, run logs, and replication docs.

This is a raw PyTorch/custom-code checkpoint, not a Transformers-native `AutoModelForCausalLM` repository.

## Start Here

Read:

- `docs/REPLICATION_GUIDE.md` - full step-by-step replication lesson.
- `docs/TOKENIZER_NOTES.md` - tokenizer-specific teaching notes.
- `metadata/artifact_manifest.json` - exact artifact provenance and model/tokenizer metadata.

Run the saved model:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python scripts/sample_mac.py "Polska jest" 80
```

## Inference From Hugging Face

This is a custom PyTorch checkpoint, so use the included model code instead of `AutoModelForCausalLM`.

Option 1: clone the model repo and run the bundled sampler:

```bash
git lfs install
git clone https://huggingface.co/SlayerLab/slayer-gpt-tokenizer-model
cd slayer-gpt-tokenizer-model
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python scripts/sample_mac.py "Polska jest" 80
```

Option 2: download only the needed files via `huggingface_hub`:

```bash
pip install torch tokenizers huggingface-hub
python examples/inference_from_hf.py "Polska jest" 80
```

Minimal Python pattern:

```python
import importlib.util
import sys
import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

repo_id = "SlayerLab/slayer-gpt-tokenizer-model"

# Fetch the model definition, the checkpoint, and the matching 32k tokenizer.
model_py = hf_hub_download(repo_id, "scripts/model.py")
ckpt_path = hf_hub_download(repo_id, "model/ckpt.pt")
tok_path = hf_hub_download(repo_id, "tokenizers/polish_bpe_32k.json")

# Import scripts/model.py as a module; the repo is not an installable package.
spec = importlib.util.spec_from_file_location("slayer_gpt_model", model_py)
module = importlib.util.module_from_spec(spec)
sys.modules[spec.name] = module
spec.loader.exec_module(module)

# Rebuild the model from the saved constructor arguments, then load the weights.
ckpt = torch.load(ckpt_path, map_location="cpu")
model = module.GPT(module.GPTConfig(**ckpt["model_args"]))
model.load_state_dict(ckpt["model"])
model.eval()  # inference mode: disables dropout
tok = Tokenizer.from_file(tok_path)
```

## What Is Included

- `model/ckpt.pt` - runnable nanoGPT-style checkpoint from `/Users/kacper/Local/Ventures/Slayer/gpt2-pl-mac/ckpt.pt`.
- `tokenizers/polish_bpe_32k.json` - custom byte-level BPE tokenizer paired with `model/ckpt.pt`.
- `tokenizers/rxlm_polish_bpe_65k.json` - separate later 65k custom tokenizer from RXLM/Slayer work.
- `scripts/model.py` - GPT model definition for the checkpoint.
- `scripts/sample_mac.py` - local sampler.
- `scripts/knowbench_mac.py`, `scripts/syntaxbench_mac.py` - simple probes.
- `examples/prepare_corpus.py` - reference corpus/tokenizer/shard preparation script.
- `training/` - Slayer remote nanoGPT training code and launch script.
- `logs/` and `metadata/` - run evidence.

## Key Compatibility Rule

Use this pairing:

```text
model/ckpt.pt -> tokenizers/polish_bpe_32k.json
```

Do not sample `model/ckpt.pt` with `tokenizers/rxlm_polish_bpe_65k.json`. That tokenizer is a separate later artifact.

Why this matters:

- `model/ckpt.pt` was trained with `vocab_size=32768`, so its token embedding table and output head have 32768 rows.
- `tokenizers/rxlm_polish_bpe_65k.json` has 65536 vocabulary entries and can emit token IDs that the model does not have embeddings for.
- Even for token IDs below 32768, the same ID is not guaranteed to map to the same text fragment in both tokenizers.
- To use the 65k tokenizer correctly, train a separate model with a matching 65536-token vocabulary.
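
The out-of-range problem described above can be checked mechanically before sampling. The following is a minimal sketch, assuming the token IDs are already available as a plain list of integers (for example from the `ids` attribute of an encoding). The helper name `out_of_range_ids` and the sample IDs are hypothetical and not part of this repo.

```python
def out_of_range_ids(token_ids, vocab_size):
    """Return the token IDs that have no row in the model's embedding table."""
    # Valid embedding rows are 0 .. vocab_size - 1.
    return [i for i in token_ids if not 0 <= i < vocab_size]


MODEL_VOCAB = 32768  # vocab_size of model/ckpt.pt, as stated in this README

# Made-up IDs for illustration; the last two could only come from a 65k vocabulary.
sample_ids = [17, 4096, 32767, 32768, 65535]
bad_ids = out_of_range_ids(sample_ids, MODEL_VOCAB)
print(bad_ids)
```

An empty result only rules out index errors. It says nothing about whether the IDs carry the same meaning in both tokenizers, so the pairing rule still applies.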

## Tokenizer Construction

![Byte-level BPE tokenizer pipeline](docs/assets/tokenizer_bpe_pipeline.png)

`tokenizers/polish_bpe_32k.json` is a pure statistical byte-level BPE tokenizer. It was not built as a morphological tokenizer:

- no Polish inflection rules,
- no lemmatizer,
- no morpheme dictionary,
- no hand-written segmentation grammar.

The tokenizer learns frequent byte/subword merges from the corpus. Polish-looking pieces emerge only because they were statistically useful in the training text.
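
To make "frequent byte/subword merges" concrete, here is a toy sketch of the merge loop in pure Python. It is an illustration of the general byte-level BPE idea only, not the code that built `polish_bpe_32k.json`: it skips pre-tokenization and the byte-to-character mapping that production tokenizers use, and the function names and the three-word corpus are assumptions made up for this example.

```python
from collections import Counter


def most_frequent_pair(seqs):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    # Ties go to the pair that was seen first.
    return max(counts, key=counts.get) if counts else None


def merge_pair(seq, pair, new_id):
    """Replace every occurrence of `pair` in `seq` with the single symbol `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_toy_bpe(words, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes (IDs 0..255)."""
    seqs = [list(w.encode("utf-8")) for w in words]
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        new_id = 256 + step  # new symbols are numbered after the 256 byte values
        merges[pair] = new_id
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
    return merges, seqs


# Hypothetical three-word corpus: "nie" is frequent, so it collapses into one symbol.
merges, seqs = train_toy_bpe(["nie", "nie", "niebo"], num_merges=2)
print(merges)
print(seqs)
```

Nothing in the loop knows that "nie" is a Polish word or prefix. It becomes a single symbol purely because its byte pairs are the most frequent in the input.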

## Checkpoint Summary

`model/ckpt.pt`:

- 12 layers
- 12 attention heads
- 768 embedding dimension
- 1024 token context
- 32768 vocabulary size
- about 136M state-dict parameters
- checkpoint step stored as `iter=500`
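
As a back-of-the-envelope check, the listed dimensions can be turned into a rough parameter count. This sketch assumes a standard GPT-2 layout (learned positional embeddings, a 4x MLP expansion) and ignores biases and LayerNorm weights. It also counts the output head separately from the token embedding, on the assumption that the state dict stores both tensors; none of these assumptions is confirmed by the checkpoint itself, and `estimate_state_dict_params` is a hypothetical helper.

```python
def estimate_state_dict_params(n_layer, n_embd, block_size, vocab_size):
    """Rough weight count for a GPT-2-style model (biases and LayerNorm ignored)."""
    wte = vocab_size * n_embd          # token embedding table
    wpe = block_size * n_embd          # learned positional embeddings
    per_block = 12 * n_embd * n_embd   # 4*d^2 for attention + 8*d^2 for the MLP
    lm_head = vocab_size * n_embd      # output head, counted as its own tensor
    return wte + wpe + n_layer * per_block + lm_head


total = estimate_state_dict_params(
    n_layer=12, n_embd=768, block_size=1024, vocab_size=32768
)
print(f"{total:,} weights, about {total / 1e6:.0f}M")
```

Under these assumptions the estimate is 136,052,736, which is in line with the "about 136M" figure above. If the output head shares its weights with the token embedding, the number of unique parameters would be lower by one embedding table (32768 x 768).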

## Remote Slayer Training States

The full remote training states are not committed because each is about 3.6 GB and includes optimizer/runtime state.

Known local copies:

```text
/Users/kacper/Local/Ventures/Slayer/slayer-nanogpt/ckpt/state_step000500.pt
/Users/kacper/Local/Ventures/Slayer/slayer-nanogpt/ckpt/state_step001000.pt
/Users/kacper/Local/Ventures/Slayer/slayer-nanogpt/ckpt/state_step001500.pt
```

Known remote copies:

```text
ssh slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step000500.pt
ssh slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001000.pt
ssh slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt
```

Fetch explicitly when needed:

```bash
mkdir -p training-states
rsync -avP slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt training-states/
```