---
license: other
library_name: pytorch
tags:
- pytorch
- gpt
- gpt2-style
- polish
- tokenizer
- byte-level-bpe
- causal-lm
language:
- pl
pipeline_tag: text-generation
---
# Slayer GPT Tokenizer Model
Teaching archive for the Slayer GPT-style Polish language-model experiment.
This repo is meant to help people replicate the workflow, not just store artifacts. It includes a runnable local GPT checkpoint, the custom tokenizer it was trained with, a later tokenizer variant, the Slayer H100 training scripts, run logs, and replication docs.
This is a raw PyTorch/custom-code checkpoint, not a Transformers-native `AutoModelForCausalLM` repository.
## Start Here
Read:
- `docs/REPLICATION_GUIDE.md` - full step-by-step replication lesson.
- `docs/TOKENIZER_NOTES.md` - tokenizer-specific teaching notes.
- `metadata/artifact_manifest.json` - exact artifact provenance and model/tokenizer metadata.
Run the saved model:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python scripts/sample_mac.py "Polska jest" 80
```
## Inference From Hugging Face
This is a custom PyTorch checkpoint, so use the included model code instead of `AutoModelForCausalLM`.
Option 1: clone the model repo and run the bundled sampler:
```bash
git lfs install
git clone https://huggingface.co/SlayerLab/slayer-gpt-tokenizer-model
cd slayer-gpt-tokenizer-model
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python scripts/sample_mac.py "Polska jest" 80
```
Option 2: download only the needed files via `huggingface_hub`:
```bash
pip install torch tokenizers huggingface-hub
python examples/inference_from_hf.py "Polska jest" 80
```
Minimal Python pattern:
```python
import importlib.util
import sys

import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

repo_id = "SlayerLab/slayer-gpt-tokenizer-model"

# Download only the model code, the checkpoint, and its paired 32k tokenizer.
model_py = hf_hub_download(repo_id, "scripts/model.py")
ckpt_path = hf_hub_download(repo_id, "model/ckpt.pt")
tok_path = hf_hub_download(repo_id, "tokenizers/polish_bpe_32k.json")

# Import scripts/model.py as a module; this repo is not Transformers-native.
spec = importlib.util.spec_from_file_location("slayer_gpt_model", model_py)
module = importlib.util.module_from_spec(spec)
sys.modules[spec.name] = module
spec.loader.exec_module(module)

# Rebuild the model from the saved config, then load the weights on CPU.
ckpt = torch.load(ckpt_path, map_location="cpu")
model = module.GPT(module.GPTConfig(**ckpt["model_args"]))
model.load_state_dict(ckpt["model"])
model.eval()

# Load the tokenizer that matches this checkpoint.
tok = Tokenizer.from_file(tok_path)
```
## What Is Included
- `model/ckpt.pt` - runnable nanoGPT-style checkpoint from `/Users/kacper/Local/Ventures/Slayer/gpt2-pl-mac/ckpt.pt`.
- `tokenizers/polish_bpe_32k.json` - custom byte-level BPE tokenizer paired with `model/ckpt.pt`.
- `tokenizers/rxlm_polish_bpe_65k.json` - separate later 65k custom tokenizer from RXLM/Slayer work.
- `scripts/model.py` - GPT model definition for the checkpoint.
- `scripts/sample_mac.py` - local sampler.
- `scripts/knowbench_mac.py`, `scripts/syntaxbench_mac.py` - simple probes.
- `examples/prepare_corpus.py` - reference corpus/tokenizer/shard preparation script.
- `training/` - Slayer remote nanoGPT training code and launch script.
- `logs/` and `metadata/` - run evidence.
## Key Compatibility Rule
Use this pairing:
```text
model/ckpt.pt -> tokenizers/polish_bpe_32k.json
```
Do not sample `model/ckpt.pt` with `tokenizers/rxlm_polish_bpe_65k.json`. That tokenizer is a separate later artifact.
Why this matters:
- `model/ckpt.pt` was trained with `vocab_size=32768`, so its token embedding table and output head have 32768 rows.
- `tokenizers/rxlm_polish_bpe_65k.json` has 65536 vocabulary entries and can emit token IDs that the model does not have embeddings for.
- Even if a token ID is below 32768, the two tokenizers do not guarantee that the same ID means the same text fragment.
- To use the 65k tokenizer correctly, train a separate model with a matching 65536-token vocabulary.
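
The rule above can be checked before sampling. This is a minimal sketch, not part of the repo: `check_pairing` is a hypothetical helper that only compares sizes and ID ranges.

```python
def check_pairing(model_vocab_size, tokenizer_vocab_size, token_ids=()):
    """Return a list of problems; an empty list means no size or range mismatch."""
    problems = []
    # The embedding table has exactly model_vocab_size rows.
    if tokenizer_vocab_size != model_vocab_size:
        problems.append(
            f"tokenizer has {tokenizer_vocab_size} entries, "
            f"model expects {model_vocab_size}"
        )
    # Any ID outside [0, model_vocab_size) has no embedding row to look up.
    out_of_range = [t for t in token_ids if not 0 <= t < model_vocab_size]
    if out_of_range:
        problems.append(
            f"{len(out_of_range)} token IDs fall outside the embedding table, "
            f"e.g. {out_of_range[0]}"
        )
    return problems


# Matching pair: no problems reported.
print(check_pairing(32768, 32768, [0, 5, 32767]))
# 65k tokenizer against the 32k model: size mismatch plus an out-of-range ID.
print(check_pairing(32768, 65536, [10, 40000]))
```

With the real artifacts, the inputs would likely come from `ckpt["model_args"]["vocab_size"]` (assuming the checkpoint stores that key), `tok.get_vocab_size()`, and `tok.encode(text).ids`. An empty result does not prove compatibility: two tokenizers of the same size can still map the same ID to different text.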
## Tokenizer Construction

`tokenizers/polish_bpe_32k.json` is a pure statistical byte-level BPE tokenizer. It was not built as a morphological tokenizer:
- no Polish inflection rules,
- no lemmatizer,
- no morpheme dictionary,
- no hand-written segmentation grammar.
The tokenizer learns frequent byte/subword merges from the corpus. Polish-looking pieces emerge only because they were statistically useful in the training text.
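
To make "frequent byte/subword merges" concrete, here is a toy sketch of byte-level BPE merge learning in plain Python. It is not the script used to build `polish_bpe_32k.json`. The function name, the whitespace splitting, and the tie-break rule are illustrative assumptions, and production tokenizers add pre-tokenization and special tokens on top of this.

```python
from collections import Counter


def learn_bpe_merges(corpus, num_merges):
    """Learn up to num_merges merges from raw UTF-8 bytes (toy version)."""
    # Start from single bytes, so multi-byte characters such as "ł" begin as several symbols.
    words = Counter(
        tuple(bytes([b]) for b in word.encode("utf-8")) for word in corpus.split()
    )
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself (assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges


print(learn_bpe_merges("kot kot kot kota", 2))
```

Nothing in the loop knows about Polish. A stem-like piece appears only because its bytes co-occur often in the input.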
## Checkpoint Summary
`model/ckpt.pt`:
- 12 layers
- 12 attention heads
- 768 embedding dimension
- 1024 token context
- 32768 vocabulary size
- about 136M state-dict parameters
- checkpoint step stored as `iter=500`
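
The parameter figure can be sanity-checked with rough arithmetic. The sketch below assumes a standard GPT-2-style block (fused QKV projection, 4x MLP, two layer norms per block) and counts the output head separately from the token embedding, as a state dict does when it lists both tensors. These layout details are assumptions, so treat `scripts/model.py` as the authority.

```python
def gpt2_style_param_count(n_layer, n_embd, vocab_size, block_size,
                           bias=True, count_lm_head=True):
    """Estimate parameter count for a GPT-2-style decoder (hypothetical helper)."""
    wte = vocab_size * n_embd          # token embedding table
    wpe = block_size * n_embd          # learned position embeddings
    attn = n_embd * 3 * n_embd + n_embd * n_embd      # QKV projection + output projection
    mlp = n_embd * 4 * n_embd + 4 * n_embd * n_embd   # up projection + down projection
    per_block = attn + mlp + 2 * n_embd               # plus two layer-norm weight vectors
    if bias:
        # Biases: QKV, attention output, MLP up, MLP down, two layer norms.
        per_block += 3 * n_embd + n_embd + 4 * n_embd + n_embd + 2 * n_embd
    final_ln = n_embd * (2 if bias else 1)
    total = wte + wpe + n_layer * per_block + final_ln
    if count_lm_head:
        total += vocab_size * n_embd   # output head listed as its own tensor
    return total


total = gpt2_style_param_count(n_layer=12, n_embd=768, vocab_size=32768, block_size=1024)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the estimate is about 136M with or without bias terms. With `count_lm_head=False`, as when the head shares weights with the embedding, it drops to about 111M.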
## Remote Slayer Training States
The full remote training states are not committed because each is about 3.6 GB and includes optimizer/runtime state.
Known local copies:
```text
/Users/kacper/Local/Ventures/Slayer/slayer-nanogpt/ckpt/state_step000500.pt
/Users/kacper/Local/Ventures/Slayer/slayer-nanogpt/ckpt/state_step001000.pt
/Users/kacper/Local/Ventures/Slayer/slayer-nanogpt/ckpt/state_step001500.pt
```
Known remote copies:
```text
ssh slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step000500.pt
ssh slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001000.pt
ssh slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt
```
Fetch explicitly when needed:
```bash
mkdir -p training-states
rsync -avP slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt training-states/
```