| --- |
| license: other |
| library_name: pytorch |
| tags: |
| - pytorch |
| - gpt |
| - gpt2-style |
| - polish |
| - tokenizer |
| - byte-level-bpe |
| - causal-lm |
| language: |
| - pl |
| pipeline_tag: text-generation |
| --- |
| |
| # Slayer GPT Tokenizer Model |
|
|
| Teaching archive for the Slayer GPT-style Polish language-model experiment. |
|
|
| This repo is meant to help people replicate the workflow, not just store artifacts. It includes a runnable local GPT checkpoint, the custom tokenizer it was trained with, a later tokenizer variant, the Slayer H100 training scripts, run logs, and replication docs. |
|
|
| This is a raw PyTorch/custom-code checkpoint, not a Transformers-native `AutoModelForCausalLM` repository. |
|
|
| ## Start Here |
|
|
| Read: |
|
|
| - `docs/REPLICATION_GUIDE.md` - full step-by-step replication lesson. |
| - `docs/TOKENIZER_NOTES.md` - tokenizer-specific teaching notes. |
| - `metadata/artifact_manifest.json` - exact artifact provenance and model/tokenizer metadata. |
|
|
| Run the saved model: |
|
|
| ```bash |
| python3 -m venv .venv |
| source .venv/bin/activate |
| pip install -r requirements.txt |
| python scripts/sample_mac.py "Polska jest" 80 |
| ``` |
|
|
| ## Inference From Hugging Face |
|
|
| This is a custom PyTorch checkpoint, so use the included model code instead of `AutoModelForCausalLM`. |
|
|
| Option 1: clone the model repo and run the bundled sampler: |
|
|
| ```bash |
| git lfs install |
| git clone https://huggingface.co/SlayerLab/slayer-gpt-tokenizer-model |
| cd slayer-gpt-tokenizer-model |
| python3 -m venv .venv |
| source .venv/bin/activate |
| pip install -r requirements.txt |
| python scripts/sample_mac.py "Polska jest" 80 |
| ``` |
|
|
| Option 2: download only the needed files via `huggingface_hub`: |
|
|
| ```bash |
| pip install torch tokenizers huggingface-hub |
| python examples/inference_from_hf.py "Polska jest" 80 |
| ``` |
|
|
| Minimal Python pattern: |
|
|
| ```python |
| import importlib.util |
| import sys |
| import torch |
| from huggingface_hub import hf_hub_download |
| from tokenizers import Tokenizer |
| |
| repo_id = "SlayerLab/slayer-gpt-tokenizer-model" |
| |
| model_py = hf_hub_download(repo_id, "scripts/model.py") |
| ckpt_path = hf_hub_download(repo_id, "model/ckpt.pt") |
| tok_path = hf_hub_download(repo_id, "tokenizers/polish_bpe_32k.json") |
| |
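| # Import scripts/model.py from its downloaded path under a private module name. |
| spec = importlib.util.spec_from_file_location("slayer_gpt_model", model_py) |
| module = importlib.util.module_from_spec(spec) |
| sys.modules[spec.name] = module |
| spec.loader.exec_module(module) |
| |
| # Load on CPU. The checkpoint keeps the GPTConfig arguments under "model_args" |
| # and the weights under "model". |
| ckpt = torch.load(ckpt_path, map_location="cpu") |
| model = module.GPT(module.GPTConfig(**ckpt["model_args"])) |
| model.load_state_dict(ckpt["model"]) |
| model.eval()  # inference mode: disables dropout |
| # Load the 32k tokenizer this checkpoint was trained with. |
| tok = Tokenizer.from_file(tok_path) |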
| ``` |
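
The pattern above stops once the model and tokenizer are loaded. As a rough sketch of the remaining step, the loop below shows autoregressive decoding with the context cropped to the checkpoint's 1024-token window. It uses greedy decoding for simplicity, which is not necessarily what `scripts/sample_mac.py` does. `next_token_logits` is a hypothetical callable that you would write around `model`; how it calls the model depends on the forward signature in `scripts/model.py`, which is not assumed here. With the real tokenizer, the prompt IDs would come from `tok.encode("Polska jest").ids` and the result would be turned back into text with `tok.decode(ids)`.

```python
def greedy_generate(next_token_logits, prompt_ids, max_new_tokens, block_size=1024):
    """Greedy autoregressive decoding over plain Python lists.

    next_token_logits: hypothetical callable (not part of this repo) that maps
        a list of token IDs to a list of scores, one per vocabulary entry.
    block_size: maximum context length the model accepts.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # Keep only the most recent block_size tokens as context.
        context = ids[-block_size:]
        logits = next_token_logits(context)
        # Greedy choice: take the index of the highest score.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
    return ids
```

Greedy decoding tends to produce repetitive text from small models; temperature or top-k sampling would replace the `max(...)` line.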
|
|
| ## What Is Included |
|
|
| - `model/ckpt.pt` - runnable nanoGPT-style checkpoint, copied from the local path `/Users/kacper/Local/Ventures/Slayer/gpt2-pl-mac/ckpt.pt`. |
| - `tokenizers/polish_bpe_32k.json` - custom byte-level BPE tokenizer paired with `model/ckpt.pt`. |
| - `tokenizers/rxlm_polish_bpe_65k.json` - separate later 65k custom tokenizer from RXLM/Slayer work. |
| - `scripts/model.py` - GPT model definition for the checkpoint. |
| - `scripts/sample_mac.py` - local sampler. |
| - `scripts/knowbench_mac.py`, `scripts/syntaxbench_mac.py` - simple probes. |
| - `examples/prepare_corpus.py` - reference corpus/tokenizer/shard preparation script. |
| - `training/` - Slayer remote nanoGPT training code and launch script. |
| - `logs/` and `metadata/` - run evidence. |
|
|
| ## Key Compatibility Rule |
|
|
| Use this pairing: |
|
|
| ```text |
| model/ckpt.pt -> tokenizers/polish_bpe_32k.json |
| ``` |
|
|
| Do not sample `model/ckpt.pt` with `tokenizers/rxlm_polish_bpe_65k.json`. That tokenizer is a separate later artifact. |
|
|
| Why this matters: |
|
|
| - `model/ckpt.pt` was trained with `vocab_size=32768`, so its token embedding table and output head have 32768 rows. |
| - `tokenizers/rxlm_polish_bpe_65k.json` has 65536 vocabulary entries and can emit token IDs that the model does not have embeddings for. |
| - Even for IDs below 32768, nothing guarantees that the same ID maps to the same text fragment in both tokenizers, so the model would receive input it was never trained to interpret. |
| - To use the 65k tokenizer correctly, train a separate model with a matching 65536-token vocabulary. |
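
A small guard can catch the out-of-range case before token IDs reach the embedding table. The helper below is an illustrative sketch, not code from this repo: `check_token_ids` is a hypothetical name, and in practice the IDs would come from `tok.encode(text).ids` and the limit from the checkpoint's stored model arguments. It only detects IDs the model has no embedding row for; it cannot detect the second problem above, where an in-range ID means a different text fragment.

```python
def check_token_ids(token_ids, model_vocab_size):
    """Return the IDs as a list, or raise if any ID has no embedding row.

    Hypothetical helper for illustration; model_vocab_size is 32768
    for model/ckpt.pt according to this README.
    """
    bad = [t for t in token_ids if not 0 <= t < model_vocab_size]
    if bad:
        raise ValueError(
            f"{len(bad)} token ID(s) outside [0, {model_vocab_size}): "
            f"first offender is {bad[0]}. Wrong tokenizer for this checkpoint?"
        )
    return list(token_ids)
```

Passing this check is necessary but not sufficient: the only reliable rule is still to pair `model/ckpt.pt` with `tokenizers/polish_bpe_32k.json`.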
|
|
| ## Tokenizer Construction |
|
|
|
|
| `tokenizers/polish_bpe_32k.json` is a pure statistical byte-level BPE tokenizer. It was not built as a morphological tokenizer: |
|
|
| - no Polish inflection rules, |
| - no lemmatizer, |
| - no morpheme dictionary, |
| - no hand-written segmentation grammar. |
|
|
| The tokenizer learns frequent byte/subword merges from the corpus. Polish-looking pieces emerge only because they were statistically useful in the training text. |
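
To make "learns frequent byte/subword merges" concrete, here is a toy byte-level BPE trainer in plain Python. It is a teaching sketch only, not the procedure used to build `polish_bpe_32k.json`: production tokenizers add pre-tokenization, a byte-to-character mapping, and much faster pair counting. All function names are made up for this example.

```python
from collections import Counter


def most_frequent_pair(seqs):
    """Return the most common adjacent ID pair across all sequences, or None."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_toy_bpe(texts, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes (IDs 0..255)."""
    seqs = [list(t.encode("utf-8")) for t in texts]
    merges = []
    next_id = 256  # first ID after the 256 base byte tokens
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:  # nothing left to merge
            break
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        merges.append((pair, next_id))
        next_id += 1
    return merges, seqs
```

Because the base alphabet is bytes, a Polish letter such as `ą` starts out as two tokens (its two UTF-8 bytes) and becomes a single token only if the corpus makes that merge frequent enough. No linguistic rule is involved at any point.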
|
|
| ## Checkpoint Summary |
|
|
| `model/ckpt.pt`: |
|
|
| - 12 layers |
| - 12 attention heads |
| - 768 embedding dimension |
| - 1024 token context |
| - 32768 vocabulary size |
| - about 136M state-dict parameters |
| - checkpoint step stored as `iter=500` |
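
The "about 136M" figure can be sanity-checked from the dimensions above. The sketch below assumes a standard GPT-2 block layout (fused QKV projection, 4x MLP expansion, LayerNorm and linear biases enabled) and assumes the state dict lists the output head separately from the token embedding, as nanoGPT-style checkpoints do even when the two are tied. Neither assumption is confirmed by this README; `scripts/model.py` is the authority. Non-parameter buffers such as attention masks are ignored.

```python
def gpt2_style_param_counts(n_layer, n_embd, block_size, vocab_size, bias=True):
    """Estimate parameter counts for an assumed GPT-2-style architecture."""
    b = 1 if bias else 0
    layer_norm = n_embd + b * n_embd                   # weight + optional bias
    attn = (n_embd * 3 * n_embd + b * 3 * n_embd)      # fused QKV projection
    attn += n_embd * n_embd + b * n_embd               # attention output projection
    mlp = (n_embd * 4 * n_embd + b * 4 * n_embd)       # MLP up-projection
    mlp += 4 * n_embd * n_embd + b * n_embd            # MLP down-projection
    per_block = 2 * layer_norm + attn + mlp
    wte = vocab_size * n_embd                          # token embedding table
    wpe = block_size * n_embd                          # position embedding table
    unique = n_layer * per_block + layer_norm + wte + wpe
    return {
        "unique": unique,                # each tensor counted once
        "state_dict": unique + wte,      # output head counted as its own entry
    }


counts = gpt2_style_param_counts(n_layer=12, n_embd=768, block_size=1024, vocab_size=32768)
print(counts)
```

Under these assumptions the state-dict total is 136,174,080, which matches the "about 136M" above, while the number of distinct weights is roughly 111M because the embedding table is counted twice in the larger figure.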
|
|
| ## Remote Slayer Training States |
|
|
| The full remote training states are not committed because each is about 3.6 GB and includes optimizer/runtime state. |
|
|
| Known local copies (on the author's machine): |
|
|
| ```text |
| /Users/kacper/Local/Ventures/Slayer/slayer-nanogpt/ckpt/state_step000500.pt |
| /Users/kacper/Local/Ventures/Slayer/slayer-nanogpt/ckpt/state_step001000.pt |
| /Users/kacper/Local/Ventures/Slayer/slayer-nanogpt/ckpt/state_step001500.pt |
| ``` |
|
|
| Known remote copies (paths on the `slayer` SSH host): |
|
|
| ```text |
| ssh slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step000500.pt |
| ssh slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001000.pt |
| ssh slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt |
| ``` |
|
|
| Fetch explicitly when needed: |
|
|
| ```bash |
| mkdir -p training-states |
| rsync -avP slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt training-states/ |
| ``` |
|
|