Add tokenizer, inference code, model card, and 20-query report

ca2f8ca verified 26 days ago

5.38 kB

	---
	language:
	- en
	license: mit
	tags:
	- tiny-llm
	- causal-lm
	- llama-like
	- rope
	- rmsnorm
	- swiglu
	- gqa
	- openwebtext
	- smoltalk
	- pytorch
	pipeline_tag: text-generation
	library_name: pytorch
	---

	# TinyLLM 75M OpenWebText Chat

	This repository contains an experimental 75,074,112 parameter decoder-only tiny language model trained from scratch/near-scratch and then supervised-finetuned for chat.

	> Important quality note: This is a successful end-to-end training pipeline artifact and research toy model, not a production assistant. It can load and generate text, but factual accuracy, instruction following, arithmetic, and repetition control are weak.

	## Model summary

	- Model name: `razor5050/tinyllm-75m-openwebtext-chat`
	- Architecture: LLaMA/SmolLM-style decoder-only causal LM
	- Parameters: 75,074,112
	- Context length: 1024 tokens
	- Vocabulary: 32,000 ByteLevel BPE tokens
	- Tokenizer: custom ByteLevel BPE trained for this run
	- Checkpoint format: PyTorch `.pt` checkpoints
	- Primary final checkpoint: `final.pt`
	- Best checkpoint: `best.pt`

	## Architecture

	The model uses modern tiny-LM components:

	- decoder-only causal Transformer
	- RoPE positional embeddings
	- RMSNorm
	- SwiGLU MLP
	- grouped-query/key-value reduction via fewer KV heads
	- tied input/output token embeddings
	- no attention/MLP bias
	- PyTorch SDPA causal attention

	Approximate config:

	```yaml
	vocab_size: 32000
	hidden_size: 576
	num_hidden_layers: 16
	num_attention_heads: 9
	num_key_value_heads: 3
	intermediate_size: 1536
	max_position_embeddings: 1024
	rope_theta: 10000.0
	rms_norm_eps: 1e-5
	tie_word_embeddings: true
	attention_bias: false
	mlp_bias: false
	dropout: 0.0
	```

	## Training data

	### Base pretraining

	- Dataset: [`Skylion007/openwebtext`](https://huggingface.co/datasets/Skylion007/openwebtext)
	- Rows used: 1,000,000 selected rows
	- Final tokenized train tokens: 1,143,301,833
	- Final tokenized validation tokens: 34,486,473
	- Epochs: 1
	- Optimizer steps: 4,361

	### Chat/SFT

	- Dataset: [`HuggingFaceTB/smol-smoltalk`](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk)
	- Train examples: 100,000
	- Validation examples: 3,000
	- Epochs: 1
	- Optimizer steps: 781
	- Loss masking: assistant-response tokens only

	## Training results

	### Pretraining

	- Final/latest train loss near end: about `4.997`
	- Latest validation loss: about `5.049` at step 4000

	### SFT

	- SFT completed at step `781`
	- Validation trend:
	- step 250: `2.6031`
	- step 500: `2.4505`
	- step 750: `2.3313`

	SFT improved chat formatting and response style, but the model remains very small and undertrained by modern assistant standards.

	## Hardware/run

	- Cloud GPU: Vast.ai RTX 5070 Ti, 16GB VRAM
	- Precision: CUDA/PyTorch mixed precision during training where supported
	- Checkpointing: periodic `latest`, `best`, final, and step checkpoints
	- Training artifacts were preserved separately outside the instance before teardown.

	## Files in this repo

	- `final.pt` — final SFT checkpoint
	- `best.pt` — best SFT checkpoint
	- `latest.pt` — latest SFT checkpoint
	- `metrics.jsonl` — SFT metrics
	- `step_609.pt` — intermediate SFT checkpoint
	- `tokenizer/vocab.json` and `tokenizer/merges.txt` — tokenizer files
	- `configs/model_75m.yaml` — architecture config
	- `src/tinyllm/` — minimal PyTorch model implementation
	- `scripts/infer_tinyllm.py` — simple local inference helper

	## Quick inference

	Clone/download the repo, install dependencies, then run:

	```bash
	pip install torch tokenizers pyyaml huggingface_hub
	python scripts/infer_tinyllm.py \
	--checkpoint final.pt \
	--prompt "What is the capital of France?"
	```

	The chat prompt format used during SFT is:

	```text
	<\|system\|>
	You are a helpful, concise assistant.
	<\|end\|>
	<\|user\|>
	USER_QUESTION
	<\|end\|>
	<\|assistant\|>
	```

	## Observed sample behavior

	In a post-upload local inference test, the model generated text and loaded cleanly, but quality was mixed:

	- Correct on: “What is the capital of France?” → answered Paris, with repetition.
	- Weak on: simple science/world facts, often rambling or hallucinating.
	- Weak on: arithmetic and short-answer discipline.
	- Repetition and generic phrasing are common.

	This is expected for a 75M-parameter scratch-trained model with about 1.14B pretraining tokens and one SFT pass.

	## Limitations

	- Not suitable for factual QA or production use.
	- Hallucinates frequently.
	- Repetition loops occur.
	- Arithmetic is unreliable.
	- Safety behavior was not evaluated.
	- Model is not aligned beyond basic supervised chat finetuning.
	- The checkpoint is a custom PyTorch model, not a standard `transformers` model class.

	## Intended use

	- Educational tiny-LLM experiment
	- Pipeline validation
	- Small-model architecture experimentation
	- Baseline for future 150M+ runs

	## Recommended next steps

	To improve quality meaningfully:

	1. Train a larger ~150M model.
	2. Use more unique pretraining tokens, e.g. ~5B+.
	3. Improve preprocessing/tokenization throughput with multiprocessing/sharding.
	4. Add stronger instruction data and possibly preference tuning.
	5. Export to a standard Hugging Face `transformers` compatible format.

	## Citation / attribution

	Training datasets:

	- `Skylion007/openwebtext`
	- `HuggingFaceTB/smol-smoltalk`

	This repository is an experimental model artifact from a custom tiny-LLM training pipeline.