---
language:
- en
license: mit
tags:
- tiny-llm
- causal-lm
- llama-like
- rope
- rmsnorm
- swiglu
- gqa
- openwebtext
- smoltalk
- pytorch
pipeline_tag: text-generation
library_name: pytorch
---
# TinyLLM 75M OpenWebText Chat
This repository contains an experimental **75,074,112-parameter decoder-only tiny language model**, trained from scratch/near-scratch and then supervised-finetuned for chat.
> **Important quality note:** This is a successful end-to-end training pipeline artifact and research toy model, not a production assistant. It can load and generate text, but factual accuracy, instruction following, arithmetic, and repetition control are weak.
## Model summary
- **Model name:** `razor5050/tinyllm-75m-openwebtext-chat`
- **Architecture:** LLaMA/SmolLM-style decoder-only causal LM
- **Parameters:** 75,074,112
- **Context length:** 1024 tokens
- **Vocabulary:** 32,000 ByteLevel BPE tokens
- **Tokenizer:** custom ByteLevel BPE trained for this run
- **Checkpoint format:** PyTorch `.pt` checkpoints
- **Primary final checkpoint:** `final.pt`
- **Best checkpoint:** `best.pt`
## Architecture
The model uses modern tiny-LM components:
- decoder-only causal Transformer
- RoPE positional embeddings
- RMSNorm
- SwiGLU MLP
- grouped-query attention (GQA): fewer KV heads than query heads, so several query heads share each KV head
- tied input/output token embeddings
- no attention/MLP bias
- PyTorch SDPA causal attention
Approximate config:
```yaml
vocab_size: 32000
hidden_size: 576
num_hidden_layers: 16
num_attention_heads: 9
num_key_value_heads: 3
intermediate_size: 1536
max_position_embeddings: 1024
rope_theta: 10000.0
rms_norm_eps: 1e-5
tie_word_embeddings: true
attention_bias: false
mlp_bias: false
dropout: 0.0
```
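As a sanity check, the config above can be turned into a parameter count. The sketch below is not taken from `src/tinyllm/`; it assumes the usual LLaMA-style layout (separate q/k/v/o projections, a three-matrix SwiGLU MLP, two RMSNorm weights per layer plus one final norm, and tied embeddings counted once). Under those assumptions it reproduces the stated total.

```python
# Values copied from the approximate config above.
cfg = {
    "vocab_size": 32000,
    "hidden_size": 576,
    "num_hidden_layers": 16,
    "num_attention_heads": 9,
    "num_key_value_heads": 3,
    "intermediate_size": 1536,
}


def count_parameters(cfg):
    """Count weights for an assumed LLaMA-style layout with no biases."""
    d = cfg["hidden_size"]
    head_dim = d // cfg["num_attention_heads"]       # 576 / 9 = 64
    kv_dim = head_dim * cfg["num_key_value_heads"]   # 64 * 3 = 192

    embed = cfg["vocab_size"] * d                    # tied with the output head, counted once
    attn = d * d + 2 * d * kv_dim + d * d            # q_proj, k_proj + v_proj, o_proj
    mlp = 3 * d * cfg["intermediate_size"]           # gate, up, down projections
    norms = 2 * d                                    # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    final_norm = d

    return embed + cfg["num_hidden_layers"] * per_layer + final_norm


total = count_parameters(cfg)
print(f"{total:,}")  # 75,074,112
```

About a quarter of the parameters (18,432,000) sit in the tied embedding matrix, which is typical for a model this small with a 32k vocabulary.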
## Training data
### Base pretraining
- Dataset: [`Skylion007/openwebtext`](https://huggingface.co/datasets/Skylion007/openwebtext)
- Rows used: 1,000,000 selected rows
- Final tokenized train tokens: 1,143,301,833
- Final tokenized validation tokens: 34,486,473
- Epochs: 1
- Optimizer steps: 4,361
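
The token and step counts above imply how many tokens each optimizer step consumed. The batch shape is not stated in this card, so the 256 × 1024 figure below is only a hypothetical effective batch that happens to be consistent with the numbers, not a confirmed setting.

```python
# Figures taken from the list above and from the model summary.
train_tokens = 1_143_301_833
optimizer_steps = 4_361
context_length = 1024

tokens_per_step = train_tokens / optimizer_steps
sequences_per_step = tokens_per_step / context_length

# ASSUMPTION: a hypothetical effective batch of 256 sequences of 1024 tokens.
assumed_batch_tokens = 256 * context_length
implied_steps = train_tokens // assumed_batch_tokens  # full batches in one epoch

print(f"tokens per optimizer step    ~ {tokens_per_step:,.0f}")
print(f"sequences per optimizer step ~ {sequences_per_step:.2f}")
print(f"steps for one epoch at 256 x 1024 = {implied_steps}")
```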
### Chat/SFT
- Dataset: [`HuggingFaceTB/smol-smoltalk`](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk)
- Train examples: 100,000
- Validation examples: 3,000
- Epochs: 1
- Optimizer steps: 781
- Loss masking: assistant-response tokens only
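
Assistant-only loss masking is commonly implemented by replacing the labels of all non-assistant positions with an ignore value, so that only response tokens contribute to the cross-entropy. The sketch below illustrates the idea in plain Python; `build_labels` and the boolean mask are hypothetical names, and the actual training code in this pipeline may build its mask differently.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(token_ids, assistant_mask):
    """Copy token ids into labels, ignoring every non-assistant position.

    token_ids      -- list of int token ids for the full chat sequence
    assistant_mask -- list of bool, True where the token belongs to an
                      assistant response (hypothetical input format)
    """
    if len(token_ids) != len(assistant_mask):
        raise ValueError("token_ids and assistant_mask must have equal length")
    return [
        tok if is_assistant else IGNORE_INDEX
        for tok, is_assistant in zip(token_ids, assistant_mask)
    ]


# Toy sequence: three prompt tokens followed by three response tokens.
token_ids = [11, 12, 13, 21, 22, 23]
assistant_mask = [False, False, False, True, True, True]
labels = build_labels(token_ids, assistant_mask)
print(labels)  # [-100, -100, -100, 21, 22, 23]
```

The usual one-position shift between logits and labels for next-token prediction is left to the loss computation and is not shown here.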
## Training results
### Pretraining
- Train loss near the end of the run: about `4.997`
- Latest validation loss: about `5.049` at step 4000
### SFT
- SFT completed at step `781`
- Validation trend:
  - step 250: `2.6031`
  - step 500: `2.4505`
  - step 750: `2.3313`
SFT improved chat formatting and response style, but the model remains very small and undertrained by modern assistant standards.
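
The losses above can be read as perplexities by exponentiating them. This assumes they are mean per-token cross-entropy values in nats, which is what PyTorch's cross-entropy reports by default; the card does not state the unit explicitly. Because the SFT loss is computed on assistant-response tokens only, the SFT and pretraining numbers are not directly comparable.

```python
import math

# Validation losses copied from the lists above.
val_losses = {
    "pretrain step 4000": 5.049,
    "sft step 250": 2.6031,
    "sft step 500": 2.4505,
    "sft step 750": 2.3313,
}

# ASSUMPTION: losses are mean per-token cross-entropy in nats.
perplexities = {name: math.exp(loss) for name, loss in val_losses.items()}

for name, ppl in perplexities.items():
    print(f"{name:>20}: loss {val_losses[name]:.4f} -> perplexity {ppl:.1f}")
```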
## Hardware/run
- Cloud GPU: Vast.ai RTX 5070 Ti, 16GB VRAM
- Precision: CUDA/PyTorch mixed precision during training where supported
- Checkpointing: periodic `latest`, `best`, final, and step checkpoints
- Training artifacts were preserved separately outside the instance before teardown.
## Files in this repo
- `final.pt` — final SFT checkpoint
- `best.pt` — best SFT checkpoint
- `latest.pt` — latest SFT checkpoint
- `metrics.jsonl` — SFT metrics
- `step_609.pt` — intermediate SFT checkpoint
- `tokenizer/vocab.json` and `tokenizer/merges.txt` — tokenizer files
- `configs/model_75m.yaml` — architecture config
- `src/tinyllm/` — minimal PyTorch model implementation
- `scripts/infer_tinyllm.py` — simple local inference helper
## Quick inference
Clone/download the repo, install dependencies, then run:
```bash
pip install torch tokenizers pyyaml huggingface_hub
python scripts/infer_tinyllm.py \
--checkpoint final.pt \
--prompt "What is the capital of France?"
```
The chat prompt format used during SFT is:
```text
<|system|>
You are a helpful, concise assistant.
<|end|>
<|user|>
USER_QUESTION
<|end|>
<|assistant|>
```
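For callers who want to build this prompt themselves, a minimal helper might look like the sketch below. `build_prompt` is a hypothetical name, not part of `scripts/infer_tinyllm.py`, and the exact newline placement (including the trailing newline after `<|assistant|>`) is an assumption read off the template above; check it against the inference script before relying on it.

```python
DEFAULT_SYSTEM = "You are a helpful, concise assistant."


def build_prompt(user_question, system=DEFAULT_SYSTEM):
    """Format one system + user turn and open the assistant turn.

    ASSUMPTION: each marker sits on its own line, as in the template above.
    """
    return (
        "<|system|>\n"
        f"{system}\n"
        "<|end|>\n"
        "<|user|>\n"
        f"{user_question}\n"
        "<|end|>\n"
        "<|assistant|>\n"
    )


prompt = build_prompt("What is the capital of France?")
print(prompt)
```

Generation would then continue from the end of this string until the model emits `<|end|>` or hits a token limit.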
## Observed sample behavior
In a post-upload local inference test, the model loaded cleanly and generated text, but output quality was mixed:
- Correct on: “What is the capital of France?” → answered Paris, with repetition.
- Weak on: simple science/world facts, often rambling or hallucinating.
- Weak on: arithmetic and short-answer discipline.
- Repetition and generic phrasing are common.
This is expected for a 75M-parameter scratch-trained model with about 1.14B pretraining tokens and one SFT pass.
## Limitations
- Not suitable for factual QA or production use.
- Hallucinates frequently.
- Repetition loops occur.
- Arithmetic is unreliable.
- Safety behavior was not evaluated.
- Model is not aligned beyond basic supervised chat finetuning.
- The checkpoint is a custom PyTorch model, not a standard `transformers` model class.
## Intended use
- Educational tiny-LLM experiment
- Pipeline validation
- Small-model architecture experimentation
- Baseline for future 150M+ runs
## Recommended next steps
To improve quality meaningfully:
1. Train a larger ~150M model.
2. Use more unique pretraining tokens, e.g. 5B or more.
3. Improve preprocessing/tokenization throughput with multiprocessing/sharding.
4. Add stronger instruction data and possibly preference tuning.
5. Export to a standard Hugging Face `transformers`-compatible format.
## Citation / attribution
Training datasets:
- `Skylion007/openwebtext`
- `HuggingFaceTB/smol-smoltalk`
This repository is an experimental model artifact from a custom tiny-LLM training pipeline.