TinyLLM 75M OpenWebText Chat

This repository contains an experimental 75,074,112-parameter decoder-only tiny language model, trained from scratch and then supervised fine-tuned for chat.

Important quality note: this model is an artifact of a working end-to-end training pipeline and a research toy, not a production assistant. It loads and generates text, but factual accuracy, instruction following, arithmetic, and repetition control are all weak.

Model summary

  • Model name: razor5050/tinyllm-75m-openwebtext-chat
  • Architecture: LLaMA/SmolLM-style decoder-only causal LM
  • Parameters: 75,074,112
  • Context length: 1024 tokens
  • Vocabulary: 32,000 ByteLevel BPE tokens
  • Tokenizer: custom ByteLevel BPE trained for this run
  • Checkpoint format: PyTorch .pt checkpoints
  • Primary final checkpoint: final.pt
  • Best checkpoint: best.pt

Architecture

The model uses modern tiny-LM components:

  • decoder-only causal Transformer
  • RoPE positional embeddings
  • RMSNorm
  • SwiGLU MLP
  • grouped-query attention (fewer key/value heads than query heads; see the sketch after this list)
  • tied input/output token embeddings
  • no attention/MLP bias
  • PyTorch SDPA causal attention
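
A minimal sketch of the attention layout implied by the grouped-query and SDPA bullets above (9 query heads sharing 3 key/value heads, bias-free projections, causal SDPA). RoPE and the KV cache are omitted; the real implementation lives in src/tinyllm/, so treat this as an illustration rather than the exact code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GQASelfAttention(nn.Module):
    # Grouped-query attention: 9 query heads share 3 key/value heads.
    def __init__(self, hidden_size=576, n_heads=9, n_kv_heads=3):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = hidden_size // n_heads  # 64
        self.q_proj = nn.Linear(hidden_size, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, hidden_size, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each KV head 3x so it lines up with the 9 query heads.
        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))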

Approximate config:

vocab_size: 32000
hidden_size: 576
num_hidden_layers: 16
num_attention_heads: 9
num_key_value_heads: 3
intermediate_size: 1536
max_position_embeddings: 1024
rope_theta: 10000.0
rms_norm_eps: 1e-5
tie_word_embeddings: true
attention_bias: false
mlp_bias: false
dropout: 0.0
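
As a sanity check, the headline parameter count can be reproduced from this config: tied embeddings, bias-free q/k/v/o projections, SwiGLU gate/up/down projections, and two RMSNorms per layer plus a final one. A back-of-the-envelope sketch derived from the numbers above, not read from the checkpoint:

V, d, L, H, KV, F = 32000, 576, 16, 9, 3, 1536
head_dim = d // H                                    # 64
embed = V * d                                        # tied input/output embeddings
attn = d * H * head_dim + 2 * d * KV * head_dim + H * head_dim * d   # q, k, v, o
mlp = 3 * d * F                                      # gate, up, down
norms = 2 * d                                        # two RMSNorms per layer
total = embed + L * (attn + mlp + norms) + d         # plus final RMSNorm
print(total)                                         # 75074112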

Training data

Base pretraining

  • Dataset: Skylion007/openwebtext
  • Rows used: 1,000,000
  • Final tokenized train tokens: 1,143,301,833
  • Final tokenized validation tokens: 34,486,473
  • Epochs: 1
  • Optimizer steps: 4,361
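
Back-of-the-envelope: 1,143,301,833 train tokens over 4,361 optimizer steps works out to roughly 262k tokens per step, i.e. about 256 packed sequences of 1,024 tokens per optimizer step (assuming full-length packing).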

Chat/SFT

  • Dataset: HuggingFaceTB/smol-smoltalk
  • Train examples: 100,000
  • Validation examples: 3,000
  • Epochs: 1
  • Optimizer steps: 781
  • Loss masking: loss computed on assistant-response tokens only (see the sketch after this list)
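
The masking itself follows the standard recipe: every token before the assistant response gets label -100 so PyTorch's cross-entropy ignores it, and only assistant tokens contribute to the loss. A minimal sketch (the helper shown here is an assumption; the exact implementation in the training code may differ):

def mask_non_assistant(input_ids, assistant_start_idx):
    # input_ids: token ids of one formatted chat example
    # assistant_start_idx: index of the first assistant-response token
    labels = list(input_ids)
    for i in range(assistant_start_idx):
        labels[i] = -100  # ignored by nn.CrossEntropyLoss(ignore_index=-100)
    return labels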

Training results

Pretraining

  • Train loss at the end of pretraining: about 4.997
  • Validation loss: about 5.049 at step 4,000 (last evaluation)

SFT

  • SFT completed at step 781
  • Validation trend:
    • step 250: 2.6031
    • step 500: 2.4505
    • step 750: 2.3313

SFT improved chat formatting and response style, but the model remains very small and undertrained by modern assistant standards.

Hardware/run

  • Cloud GPU: Vast.ai RTX 5070 Ti, 16GB VRAM
  • Precision: PyTorch automatic mixed precision on CUDA where supported
  • Checkpointing: periodic latest, best, final, and step checkpoints
  • Training artifacts were preserved separately outside the instance before teardown.

Files in this repo

  • final.pt - final SFT checkpoint
  • best.pt - best SFT checkpoint
  • latest.pt - latest SFT checkpoint
  • metrics.jsonl - SFT metrics
  • step_609.pt - intermediate SFT checkpoint
  • tokenizer/vocab.json and tokenizer/merges.txt - tokenizer files
  • configs/model_75m.yaml - architecture config
  • src/tinyllm/ - minimal PyTorch model implementation
  • scripts/infer_tinyllm.py - simple local inference helper

Quick inference

Clone/download the repo, install dependencies, then run:

pip install torch tokenizers pyyaml huggingface_hub
python scripts/infer_tinyllm.py \
  --checkpoint final.pt \
  --prompt "What is the capital of France?"

The chat prompt format used during SFT is:

<|system|>
You are a helpful, concise assistant.
<|end|>
<|user|>
USER_QUESTION
<|end|>
<|assistant|>
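
For reference, building and tokenizing that prompt by hand looks roughly like this, using the repo's ByteLevel BPE files. scripts/infer_tinyllm.py does this end to end, and its exact whitespace and special-token handling may differ, so treat this as a sketch:

from tokenizers import ByteLevelBPETokenizer

tok = ByteLevelBPETokenizer("tokenizer/vocab.json", "tokenizer/merges.txt")

def build_prompt(question, system="You are a helpful, concise assistant."):
    return (f"<|system|>\n{system}\n<|end|>\n"
            f"<|user|>\n{question}\n<|end|>\n"
            f"<|assistant|>\n")

ids = tok.encode(build_prompt("What is the capital of France?")).ids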

Observed sample behavior

In a post-upload local inference test, the model generated text and loaded cleanly, but quality was mixed:

  • Correct on: "What is the capital of France?" - answered Paris, with repetition.
  • Weak on: simple science/world facts, often rambling or hallucinating.
  • Weak on: arithmetic and short-answer discipline.
  • Repetition and generic phrasing are common.

This is expected for a 75M-parameter scratch-trained model with about 1.14B pretraining tokens and one SFT pass.

Limitations

  • Not suitable for factual QA or production use.
  • Hallucinates frequently.
  • Repetition loops occur.
  • Arithmetic is unreliable.
  • Safety behavior was not evaluated.
  • Model is not aligned beyond basic supervised chat finetuning.
  • The checkpoint is a custom PyTorch model, not a standard transformers model class.

Intended use

  • Educational tiny-LLM experiment
  • Pipeline validation
  • Small-model architecture experimentation
  • Baseline for future 150M+ runs

Recommended next steps

To improve quality meaningfully:

  1. Train a larger ~150M model.
  2. Use more unique pretraining tokens, e.g. ~5B+.
  3. Improve preprocessing/tokenization throughput with multiprocessing/sharding.
  4. Add stronger instruction data and possibly preference tuning.
  5. Export to a standard Hugging Face transformers compatible format.
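
For step 5, the architecture maps almost one-to-one onto transformers' LlamaConfig; a sketch of the config side (the .pt state dict would still need its weight names remapped, which is not shown here):

from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=576,
    intermediate_size=1536,
    num_hidden_layers=16,
    num_attention_heads=9,
    num_key_value_heads=3,
    max_position_embeddings=1024,
    rope_theta=10000.0,
    rms_norm_eps=1e-5,
    tie_word_embeddings=True,
    attention_bias=False,
    mlp_bias=False,
)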

Citation / attribution

Training datasets:

  • Skylion007/openwebtext
  • HuggingFaceTB/smol-smoltalk

This repository is an experimental model artifact from a custom tiny-LLM training pipeline.
