	---
	license: other
	library_name: pytorch
	tags:
	- pytorch
	- gpt
	- gpt2-style
	- polish
	- tokenizer
	- byte-level-bpe
	- causal-lm
	language:
	- pl
	pipeline_tag: text-generation
	---

	# Slayer GPT Tokenizer Model

	Teaching archive for the Slayer GPT-style Polish language-model experiment.

	This repo is meant to help people replicate the workflow, not just store artifacts. It includes a runnable local GPT checkpoint, the custom tokenizer it was trained with, a later tokenizer variant, the Slayer H100 training scripts, run logs, and replication docs.

	This is a raw PyTorch/custom-code checkpoint, not a Transformers-native `AutoModelForCausalLM` repository.

	## Start Here

	Read:

	- `docs/REPLICATION_GUIDE.md` - full step-by-step replication lesson.
	- `docs/TOKENIZER_NOTES.md` - tokenizer-specific teaching notes.
	- `metadata/artifact_manifest.json` - exact artifact provenance and model/tokenizer metadata.

	Run the saved model:

	```bash
	python3 -m venv .venv
	source .venv/bin/activate
	pip install -r requirements.txt
	python scripts/sample_mac.py "Polska jest" 80
	```

	## Inference From Hugging Face

	This is a custom PyTorch checkpoint, so use the included model code instead of `AutoModelForCausalLM`.

	Option 1: clone the model repo and run the bundled sampler:

	```bash
	git lfs install
	git clone https://huggingface.co/SlayerLab/slayer-gpt-tokenizer-model
	cd slayer-gpt-tokenizer-model
	python3 -m venv .venv
	source .venv/bin/activate
	pip install -r requirements.txt
	python scripts/sample_mac.py "Polska jest" 80
	```

	Option 2: download only the needed files via `huggingface_hub`:

	```bash
	pip install torch tokenizers huggingface-hub
	python examples/inference_from_hf.py "Polska jest" 80
	```

	Minimal Python pattern:

	```python
	import importlib.util
	import sys
	import torch
	from huggingface_hub import hf_hub_download
	from tokenizers import Tokenizer

	repo_id = "SlayerLab/slayer-gpt-tokenizer-model"

	# Download only the three files needed for inference (cached by huggingface_hub).
	model_py = hf_hub_download(repo_id, "scripts/model.py")
	ckpt_path = hf_hub_download(repo_id, "model/ckpt.pt")
	tok_path = hf_hub_download(repo_id, "tokenizers/polish_bpe_32k.json")

	# Import the repo's GPT definition directly from the downloaded file.
	spec = importlib.util.spec_from_file_location("slayer_gpt_model", model_py)
	module = importlib.util.module_from_spec(spec)
	sys.modules[spec.name] = module
	spec.loader.exec_module(module)

	# Rebuild the model from the saved config, then load the weights on CPU.
	ckpt = torch.load(ckpt_path, map_location="cpu")
	model = module.GPT(module.GPTConfig(**ckpt["model_args"]))
	model.load_state_dict(ckpt["model"])
	model.eval()

	# Load the 32k tokenizer that this checkpoint was trained with.
	tok = Tokenizer.from_file(tok_path)
	```

	## What Is Included

	- `model/ckpt.pt` - runnable nanoGPT-style checkpoint from `/Users/kacper/Local/Ventures/Slayer/gpt2-pl-mac/ckpt.pt`.
	- `tokenizers/polish_bpe_32k.json` - custom byte-level BPE tokenizer paired with `model/ckpt.pt`.
	- `tokenizers/rxlm_polish_bpe_65k.json` - separate later 65k custom tokenizer from RXLM/Slayer work.
	- `scripts/model.py` - GPT model definition for the checkpoint.
	- `scripts/sample_mac.py` - local sampler.
	- `scripts/knowbench_mac.py`, `scripts/syntaxbench_mac.py` - simple knowledge and syntax probes.
	- `examples/prepare_corpus.py` - reference corpus/tokenizer/shard preparation script.
	- `training/` - Slayer remote nanoGPT training code and launch script.
	- `logs/` and `metadata/` - run logs and artifact metadata that serve as evidence of the training run.

	## Key Compatibility Rule

	Use this pairing:

	```text
	model/ckpt.pt -> tokenizers/polish_bpe_32k.json
	```

	Do not sample `model/ckpt.pt` with `tokenizers/rxlm_polish_bpe_65k.json`. That tokenizer is a separate later artifact.

	Why this matters:

	- `model/ckpt.pt` was trained with `vocab_size=32768`, so its token embedding table and output head have 32768 rows.
	- `tokenizers/rxlm_polish_bpe_65k.json` has 65536 vocabulary entries and can emit token IDs that the model does not have embeddings for.
	- Even if a token ID is below 32768, the two tokenizers do not guarantee that the same ID means the same text fragment.
	- To use the 65k tokenizer correctly, train a separate model with a matching 65536-token vocabulary.
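
The points above can be turned into a small guard that runs before sampling. This is a minimal sketch, not code from this repo: `out_of_range_ids` and `check_pairing` are hypothetical helpers, and the vocabulary sizes are passed in as plain integers. With the real artifacts, the model size would presumably come from the checkpoint's `model_args` (assuming it stores a `vocab_size` entry) and the tokenizer size from `Tokenizer.get_vocab_size()`.

```python
def out_of_range_ids(token_ids, model_vocab_size):
    """Return the token IDs that have no row in the model's embedding table."""
    return [t for t in token_ids if not 0 <= t < model_vocab_size]


def check_pairing(model_vocab_size, tokenizer_vocab_size):
    """Raise if the tokenizer can emit IDs that the model cannot embed."""
    if tokenizer_vocab_size > model_vocab_size:
        raise ValueError(
            f"tokenizer has {tokenizer_vocab_size} entries but the model only "
            f"has {model_vocab_size} embedding rows"
        )


# IDs a 65536-entry tokenizer could emit (made-up values for illustration).
ids_from_65k_tokenizer = [17, 32767, 32768, 65535]

# The last two IDs would index past the end of a 32768-row embedding table.
bad = out_of_range_ids(ids_from_65k_tokenizer, 32768)
print(bad)  # [32768, 65535]

check_pairing(32768, 32768)  # matching sizes: passes silently
```

A size check only catches the out-of-range case. It cannot detect that two tokenizers of the same size assign different text fragments to the same ID, so the explicit pairing above remains the rule to follow.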

	## Tokenizer Construction

	![Byte-level BPE tokenizer pipeline](docs/assets/tokenizer_bpe_pipeline.png)

	`tokenizers/polish_bpe_32k.json` is a pure statistical byte-level BPE tokenizer. It was not built as a morphological tokenizer:

	- no Polish inflection rules,
	- no lemmatizer,
	- no morpheme dictionary,
	- no hand-written segmentation grammar.

	The tokenizer learns frequent byte/subword merges from the corpus. Polish-looking pieces emerge only because they were statistically useful in the training text.
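
As a toy illustration of what "learns frequent byte/subword merges" means, the sketch below runs the basic BPE loop over raw UTF-8 bytes. It is not the code that produced `polish_bpe_32k.json`: `learn_bpe_merges` is a hypothetical helper, it splits on whitespace only, and it skips the pre-tokenization and byte-to-character mapping that a production byte-level BPE trainer would normally apply.

```python
from collections import Counter


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merges from `corpus`, starting from single bytes."""
    # Every word starts as a sequence of one-byte symbols.
    words = [[bytes([b]) for b in word.encode("utf-8")] for word in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all words.
        pairs = Counter()
        for symbols in words:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by byte order for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the winning pair with the merged symbol.
        new_words = []
        for symbols in words:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges, words


merges, words = learn_bpe_merges("kota kota kota psa", 3)
print(words[0])  # [b'kota']
print(words[3])  # [b'p', b's', b'a']
```

The frequent word collapses into a single symbol after three merges, while the rare word stays split into bytes. No linguistic rule is involved: only pair counts decide what gets merged, which is why Polish-looking pieces appear only when the corpus makes them frequent.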

	## Checkpoint Summary

	`model/ckpt.pt`:

	- 12 layers
	- 12 attention heads
	- 768 embedding dimension
	- 1024 token context
	- 32768 vocabulary size
	- about 136M state-dict parameters
	- checkpoint step stored as `iter=500`
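
The "about 136M" figure can be sanity-checked with arithmetic. The sketch below assumes a standard GPT-2/nanoGPT layout (biases on linear and LayerNorm layers, fused QKV projection, 4x MLP width) and counts the output head separately from the token embedding, as a state dict lists both tensors even when they are weight-tied. `gpt2_style_param_count` is a hypothetical helper, and the actual `scripts/model.py` may differ in small details such as bias settings.

```python
def gpt2_style_param_count(n_layer, n_embd, block_size, vocab_size,
                           bias=True, count_lm_head=True):
    """Estimate parameter count for an assumed GPT-2-style architecture."""
    layer_norm = n_embd * (2 if bias else 1)           # weight (+ bias)
    attention = 4 * n_embd * n_embd                    # fused QKV + output projection
    mlp = 8 * n_embd * n_embd                          # n_embd -> 4*n_embd -> n_embd
    linear_biases = (4 * n_embd + 5 * n_embd) if bias else 0
    block = 2 * layer_norm + attention + mlp + linear_biases

    token_embedding = vocab_size * n_embd
    position_embedding = block_size * n_embd
    total = token_embedding + position_embedding + n_layer * block + layer_norm
    if count_lm_head:
        # The output head has the same shape as the token embedding table.
        total += vocab_size * n_embd
    return total


# Values taken from the checkpoint summary above.
as_state_dict = gpt2_style_param_count(12, 768, 1024, 32768)
with_tied_head = gpt2_style_param_count(12, 768, 1024, 32768, count_lm_head=False)
print(as_state_dict)   # 136174080
print(with_tied_head)  # 111008256
```

Under these assumptions the state-dict total lands at roughly 136M, while the count with a tied output head is roughly 111M. The difference is the 32768 x 768 output head being counted a second time.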

	## Remote Slayer Training States

	The full remote training states are not committed because each is about 3.6 GB and includes optimizer/runtime state.

	Known local copies:

	```text
	/Users/kacper/Local/Ventures/Slayer/slayer-nanogpt/ckpt/state_step000500.pt
	/Users/kacper/Local/Ventures/Slayer/slayer-nanogpt/ckpt/state_step001000.pt
	/Users/kacper/Local/Ventures/Slayer/slayer-nanogpt/ckpt/state_step001500.pt
	```

	Known remote copies:

	```text
	ssh slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step000500.pt
	ssh slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001000.pt
	ssh slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt
	```

	Fetch one explicitly when needed (the command assumes `slayer` is configured as an SSH host alias):

	```bash
	mkdir -p training-states
	rsync -avP slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt training-states/
	```