TinyLLM 75M OpenWebText Chat

This repository contains an experimental 75,074,112-parameter decoder-only tiny language model, trained from scratch and then supervised fine-tuned for chat.

Important quality note: this model is an artifact of a working end-to-end training pipeline and a research toy, not a production assistant. It loads and generates text, but factual accuracy, instruction following, arithmetic, and repetition control are all weak.

Model summary

  • Model name: razor5050/tinyllm-75m-openwebtext-chat
  • Architecture: LLaMA/SmolLM-style decoder-only causal LM
  • Parameters: 75,074,112
  • Context length: 1024 tokens
  • Vocabulary: 32,000 ByteLevel BPE tokens
  • Tokenizer: custom ByteLevel BPE trained for this run
  • Checkpoint format: PyTorch .pt checkpoints
  • Primary final checkpoint: final.pt
  • Best checkpoint: best.pt

Architecture

The model uses modern tiny-LM components:

  • decoder-only causal Transformer
  • RoPE positional embeddings
  • RMSNorm
  • SwiGLU MLP
  • grouped-query attention (fewer key/value heads than query heads; see the sketch after this list)
  • tied input/output token embeddings
  • no attention/MLP bias
  • PyTorch SDPA causal attention
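
A minimal sketch of the attention layout implied by the grouped-query and SDPA bullets above (9 query heads sharing 3 key/value heads, bias-free projections, causal SDPA). RoPE and the KV cache are omitted; the real implementation lives in src/tinyllm/, so treat this as an illustration rather than the exact code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GQASelfAttention(nn.Module):
    # Grouped-query attention: 9 query heads share 3 key/value heads.
    def __init__(self, hidden_size=576, n_heads=9, n_kv_heads=3):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = hidden_size // n_heads  # 64
        self.q_proj = nn.Linear(hidden_size, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, hidden_size, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each KV head 3x so it lines up with the 9 query heads.
        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))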

Approximate config:

vocab_size: 32000
hidden_size: 576
num_hidden_layers: 16
num_attention_heads: 9
num_key_value_heads: 3
intermediate_size: 1536
max_position_embeddings: 1024
rope_theta: 10000.0
rms_norm_eps: 1e-5
tie_word_embeddings: true
attention_bias: false
mlp_bias: false
dropout: 0.0
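
As a sanity check, the headline parameter count can be reproduced from this config: tied embeddings, bias-free q/k/v/o projections, SwiGLU gate/up/down projections, and two RMSNorms per layer plus a final one. A back-of-the-envelope sketch derived from the numbers above, not read from the checkpoint:

V, d, L, H, KV, F = 32000, 576, 16, 9, 3, 1536
head_dim = d // H                                    # 64
embed = V * d                                        # tied input/output embeddings
attn = d * H * head_dim + 2 * d * KV * head_dim + H * head_dim * d   # q, k, v, o
mlp = 3 * d * F                                      # gate, up, down
norms = 2 * d                                        # two RMSNorms per layer
total = embed + L * (attn + mlp + norms) + d         # plus final RMSNorm
print(total)                                         # 75074112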

Training data

Base pretraining

  • Dataset: Skylion007/openwebtext
  • Rows used: 1,000,000
  • Final tokenized train tokens: 1,143,301,833
  • Final tokenized validation tokens: 34,486,473
  • Epochs: 1
  • Optimizer steps: 4,361
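
Back-of-the-envelope: 1,143,301,833 train tokens over 4,361 optimizer steps works out to roughly 262k tokens per step, i.e. about 256 packed sequences of 1,024 tokens per optimizer step (assuming full-length packing).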

Chat/SFT

  • Dataset: HuggingFaceTB/smol-smoltalk
  • Train examples: 100,000
  • Validation examples: 3,000
  • Epochs: 1
  • Optimizer steps: 781
  • Loss masking: loss computed on assistant-response tokens only (see the sketch after this list)
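
The masking itself follows the standard recipe: every token before the assistant response gets label -100 so PyTorch's cross-entropy ignores it, and only assistant tokens contribute to the loss. A minimal sketch (the helper shown here is an assumption; the exact implementation in the training code may differ):

def mask_non_assistant(input_ids, assistant_start_idx):
    # input_ids: token ids of one formatted chat example
    # assistant_start_idx: index of the first assistant-response token
    labels = list(input_ids)
    for i in range(assistant_start_idx):
        labels[i] = -100  # ignored by nn.CrossEntropyLoss(ignore_index=-100)
    return labels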

Training results

Pretraining

  • Train loss at the end of pretraining: about 4.997
  • Validation loss: about 5.049 at step 4,000 (last evaluation)

SFT

  • SFT completed at step 781
  • Validation trend:
    • step 250: 2.6031
    • step 500: 2.4505
    • step 750: 2.3313

SFT improved chat formatting and response style, but the model remains very small and undertrained by modern assistant standards.

Hardware/run

  • Cloud GPU: Vast.ai RTX 5070 Ti, 16GB VRAM
  • Precision: PyTorch automatic mixed precision on CUDA where supported
  • Checkpointing: periodic latest, best, final, and step checkpoints
  • Training artifacts were preserved separately outside the instance before teardown.

Files in this repo

  • final.pt - final SFT checkpoint
  • best.pt - best SFT checkpoint
  • latest.pt - latest SFT checkpoint
  • metrics.jsonl - SFT metrics
  • step_609.pt - intermediate SFT checkpoint
  • tokenizer/vocab.json and tokenizer/merges.txt - tokenizer files
  • configs/model_75m.yaml - architecture config
  • src/tinyllm/ - minimal PyTorch model implementation
  • scripts/infer_tinyllm.py - simple local inference helper

Quick inference

Clone/download the repo, install dependencies, then run:

pip install torch tokenizers pyyaml huggingface_hub
python scripts/infer_tinyllm.py \
  --checkpoint final.pt \
  --prompt "What is the capital of France?"

The chat prompt format used during SFT is:

<|system|>
You are a helpful, concise assistant.
<|end|>
<|user|>
USER_QUESTION
<|end|>
<|assistant|>
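
For reference, building and tokenizing that prompt by hand looks roughly like this, using the repo's ByteLevel BPE files. scripts/infer_tinyllm.py does this end to end, and its exact whitespace and special-token handling may differ, so treat this as a sketch:

from tokenizers import ByteLevelBPETokenizer

tok = ByteLevelBPETokenizer("tokenizer/vocab.json", "tokenizer/merges.txt")

def build_prompt(question, system="You are a helpful, concise assistant."):
    return (f"<|system|>\n{system}\n<|end|>\n"
            f"<|user|>\n{question}\n<|end|>\n"
            f"<|assistant|>\n")

ids = tok.encode(build_prompt("What is the capital of France?")).ids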

Observed sample behavior

In a post-upload local inference test, the model generated text and loaded cleanly, but quality was mixed:

  • Correct on: "What is the capital of France?" - answered Paris, with repetition.
  • Weak on: simple science/world facts, often rambling or hallucinating.
  • Weak on: arithmetic and short-answer discipline.
  • Repetition and generic phrasing are common.

This is expected for a 75M-parameter scratch-trained model with about 1.14B pretraining tokens and one SFT pass.

Limitations

  • Not suitable for factual QA or production use.
  • Hallucinates frequently.
  • Repetition loops occur.
  • Arithmetic is unreliable.
  • Safety behavior was not evaluated.
  • Model is not aligned beyond basic supervised chat finetuning.
  • The checkpoint is a custom PyTorch model, not a standard transformers model class.

Intended use

  • Educational tiny-LLM experiment
  • Pipeline validation
  • Small-model architecture experimentation
  • Baseline for future 150M+ runs

Recommended next steps

To improve quality meaningfully:

  1. Train a larger ~150M model.
  2. Use more unique pretraining tokens, e.g. ~5B+.
  3. Improve preprocessing/tokenization throughput with multiprocessing/sharding.
  4. Add stronger instruction data and possibly preference tuning.
  5. Export to a standard Hugging Face transformers compatible format.
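
For step 5, the architecture maps almost one-to-one onto transformers' LlamaConfig; a sketch of the config side (the .pt state dict would still need its weight names remapped, which is not shown here):

from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=576,
    intermediate_size=1536,
    num_hidden_layers=16,
    num_attention_heads=9,
    num_key_value_heads=3,
    max_position_embeddings=1024,
    rope_theta=10000.0,
    rms_norm_eps=1e-5,
    tie_word_embeddings=True,
    attention_bias=False,
    mlp_bias=False,
)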

Citation / attribution

Training datasets:

  • Skylion007/openwebtext
  • HuggingFaceTB/smol-smoltalk

This repository is an experimental model artifact from a custom tiny-LLM training pipeline.
