Slayer GPT Tokenizer Model

Teaching archive for the Slayer GPT-style Polish language-model experiment.

This repo is meant to help people replicate the workflow, not just store artifacts. It includes a runnable local GPT checkpoint, the custom tokenizer it was trained with, a later tokenizer variant, the Slayer H100 training scripts, run logs, and replication docs.

This is a raw PyTorch/custom-code checkpoint, not a Transformers-native AutoModelForCausalLM repository.

Start Here

Read:

  • docs/REPLICATION_GUIDE.md - full step-by-step replication lesson.
  • docs/TOKENIZER_NOTES.md - tokenizer-specific teaching notes.
  • metadata/artifact_manifest.json - exact artifact provenance and model/tokenizer metadata.

Run the saved model:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python scripts/sample_mac.py "Polska jest" 80

Inference From Hugging Face

This is a custom PyTorch checkpoint, so use the included model code instead of AutoModelForCausalLM.

Option 1: clone the model repo and run the bundled sampler:

git lfs install
git clone https://huggingface.co/SlayerLab/slayer-gpt-tokenizer-model
cd slayer-gpt-tokenizer-model
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python scripts/sample_mac.py "Polska jest" 80

Option 2: download only the needed files via huggingface_hub:

pip install torch tokenizers huggingface-hub
python examples/inference_from_hf.py "Polska jest" 80

Minimal Python pattern:

import importlib.util
import sys
import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

repo_id = "SlayerLab/slayer-gpt-tokenizer-model"

# Download only the three files needed for inference.
model_py = hf_hub_download(repo_id, "scripts/model.py")
ckpt_path = hf_hub_download(repo_id, "model/ckpt.pt")
tok_path = hf_hub_download(repo_id, "tokenizers/polish_bpe_32k.json")

# Import scripts/model.py by file path, since the repo is not an installable package.
spec = importlib.util.spec_from_file_location("slayer_gpt_model", model_py)
module = importlib.util.module_from_spec(spec)
sys.modules[spec.name] = module
spec.loader.exec_module(module)

# Rebuild the model from the hyperparameters saved in the checkpoint, then load the weights.
ckpt = torch.load(ckpt_path, map_location="cpu")
model = module.GPT(module.GPTConfig(**ckpt["model_args"]))
model.load_state_dict(ckpt["model"])
model.eval()

# Load the 32k tokenizer that is paired with this checkpoint.
tok = Tokenizer.from_file(tok_path)
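
The pattern above stops once the model and tokenizer are loaded. Generation then repeats one step: take the logits for the last position, pick a token id, append it, and run the model again. The sketch below shows one common way to make that pick (temperature scaling plus top-k filtering) in plain Python. It is not taken from scripts/sample_mac.py, and the function name sample_next_token is hypothetical.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Pick one token id from a list of raw logits (illustrative helper)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()

    # Rank token ids by score and keep the top_k best (all ids if top_k is None).
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]

    # Temperature-scaled softmax weights, shifted by the max for numerical stability.
    scaled = [logits[i] / temperature for i in ranked]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]

    # Draw one id in proportion to its weight.
    return rng.choices(ranked, weights=weights, k=1)[0]
```

With top_k=1 the draw is greedy (always the highest-scoring id). Lower temperatures concentrate probability on the top candidates, and higher temperatures spread it out.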

What Is Included

  • model/ckpt.pt - runnable nanoGPT-style checkpoint from /Users/kacper/Local/Ventures/Slayer/gpt2-pl-mac/ckpt.pt.
  • tokenizers/polish_bpe_32k.json - custom byte-level BPE tokenizer paired with model/ckpt.pt.
  • tokenizers/rxlm_polish_bpe_65k.json - separate later 65k custom tokenizer from RXLM/Slayer work.
  • scripts/model.py - GPT model definition for the checkpoint.
  • scripts/sample_mac.py - local sampler.
  • scripts/knowbench_mac.py, scripts/syntaxbench_mac.py - simple probes.
  • examples/prepare_corpus.py - reference corpus/tokenizer/shard preparation script.
  • training/ - Slayer remote nanoGPT training code and launch script.
  • logs/ and metadata/ - run evidence.

Key Compatibility Rule

Use this pairing:

model/ckpt.pt -> tokenizers/polish_bpe_32k.json

Do not sample model/ckpt.pt with tokenizers/rxlm_polish_bpe_65k.json. That tokenizer is a separate later artifact.
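
A cheap guard against the wrong pairing is to compare vocabulary sizes before sampling. The sketch below is a minimal check, not part of the repo: check_pairing is a hypothetical helper, and the "vocab_size" key in ckpt["model_args"] is assumed from the nanoGPT convention.

```python
def check_pairing(tokenizer_vocab_size: int, model_vocab_size: int) -> None:
    """Raise if the tokenizer can emit ids the model has no embedding row for."""
    if tokenizer_vocab_size > model_vocab_size:
        raise ValueError(
            f"tokenizer has {tokenizer_vocab_size} entries but the model "
            f"only has {model_vocab_size} embedding rows"
        )


# Usage with the objects from the loading pattern
# (the "vocab_size" key name is an assumption):
# check_pairing(tok.get_vocab_size(), ckpt["model_args"]["vocab_size"])
```

The 65k tokenizer fails this check against the 32768-entry model. A passing check only rules out out-of-range ids. It does not prove the tokenizer is the one the model was trained with, so the pairing above still applies.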

Checkpoint Summary

model/ckpt.pt:

  • 12 layers
  • 12 attention heads
  • 768 embedding dimension
  • 1024 token context
  • 32768 vocabulary size
  • about 136M state-dict parameters
  • checkpoint step stored as iter=500
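
The "about 136M" figure can be reproduced from the numbers above under some assumptions: a GPT-2-style block (fused QKV projection, 4x MLP, two LayerNorms), bias terms enabled, and a tied output head that appears in the state dict as a second copy of the token embedding. None of these are confirmed by this README. The function below is a back-of-the-envelope sketch.

```python
def gpt2_style_param_count(n_layer, n_embd, block_size, vocab_size, bias=True):
    """Estimate parameter counts for a GPT-2-style model (assumed layout)."""
    d = n_embd
    attn = d * 3 * d + d * d          # fused QKV projection + output projection
    mlp = d * 4 * d + 4 * d * d       # up-projection + down-projection
    per_block = attn + mlp + 2 * d    # plus two LayerNorm weight vectors
    if bias:
        # Linear biases (3d + d + 4d + d) and two LayerNorm biases (2d).
        per_block += 3 * d + d + 4 * d + d + 2 * d
    final_ln = d + (d if bias else 0)
    wte = vocab_size * d              # token embedding
    wpe = block_size * d              # learned position embedding
    unique = n_layer * per_block + final_ln + wte + wpe
    # A tied lm_head shares storage with wte but is listed again in the state dict.
    return {"unique": unique, "state_dict": unique + wte}


counts = gpt2_style_param_count(n_layer=12, n_embd=768, block_size=1024, vocab_size=32768)
print(counts)
```

Under these assumptions the state dict holds 136,174,080 entries, of which 111,008,256 are unique parameters. The difference is the 32768 x 768 embedding matrix counted twice.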

Remote Slayer Training States

The full remote training states are not committed because each is about 3.6 GB and includes optimizer/runtime state.

Known local copies:

/Users/kacper/Local/Ventures/Slayer/slayer-nanogpt/ckpt/state_step000500.pt
/Users/kacper/Local/Ventures/Slayer/slayer-nanogpt/ckpt/state_step001000.pt
/Users/kacper/Local/Ventures/Slayer/slayer-nanogpt/ckpt/state_step001500.pt

Known remote copies:

ssh slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step000500.pt
ssh slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001000.pt
ssh slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt

Fetch explicitly when needed:

mkdir -p training-states
rsync -avP slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt training-states/