---
language:
- en
license: mit
tags:
- tiny-llm
- causal-lm
- llama-like
- rope
- rmsnorm
- swiglu
- gqa
- openwebtext
- smoltalk
- pytorch
pipeline_tag: text-generation
library_name: pytorch
---

# TinyLLM 75M OpenWebText Chat

This repository contains an experimental **75,074,112-parameter decoder-only tiny language model**, pretrained from scratch on OpenWebText and then supervised-finetuned (SFT) for chat.

> **Important quality note:** This model is the output of a successful end-to-end training pipeline run and a research toy, not a production assistant. It loads and generates text, but factual accuracy, instruction following, arithmetic, and repetition control are all weak.

## Model summary

- **Model name:** `razor5050/tinyllm-75m-openwebtext-chat`
- **Architecture:** LLaMA/SmolLM-style decoder-only causal LM
- **Parameters:** 75,074,112
- **Context length:** 1024 tokens
- **Vocabulary:** 32,000 ByteLevel BPE tokens
- **Tokenizer:** custom ByteLevel BPE trained for this run
- **Checkpoint format:** PyTorch `.pt` checkpoints
- **Primary final checkpoint:** `final.pt`
- **Best checkpoint:** `best.pt`

## Architecture

The model uses modern tiny-LM components:

- decoder-only causal Transformer
- RoPE positional embeddings
- RMSNorm
- SwiGLU MLP
- grouped-query attention (GQA): the query heads share a smaller number of key/value heads
- tied input/output token embeddings
- no bias terms in the attention or MLP projections
- PyTorch SDPA causal attention
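
To make the grouped-query attention item concrete, here is a minimal sketch using the head counts from the approximate config in this card (9 query heads, 3 KV heads, 16 layers, hidden size 576). It assumes the common convention in which consecutive query heads form a group, a head dimension of 576 / 9 = 64, and 2-byte (fp16/bf16) cache entries. The function names are illustrative and are not taken from `src/tinyllm/`.

```python
def kv_head_for_query_head(q_head: int, num_attention_heads: int = 9,
                           num_key_value_heads: int = 3) -> int:
    """Map a query head index to the KV head it reads from (assumed contiguous grouping)."""
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    group_size = num_attention_heads // num_key_value_heads
    return q_head // group_size


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for a single sequence."""
    # Leading factor of 2: one tensor for keys, one for values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


head_map = [kv_head_for_query_head(h) for h in range(9)]
gqa_bytes = kv_cache_bytes(num_layers=16, num_kv_heads=3, head_dim=64, seq_len=1024)
mha_bytes = kv_cache_bytes(num_layers=16, num_kv_heads=9, head_dim=64, seq_len=1024)

print(head_map)           # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(gqa_bytes / 2**20)  # 12.0 (MiB with 3 KV heads)
print(mha_bytes / 2**20)  # 36.0 (MiB if all 9 heads had their own K/V)
```

Under these assumptions, sharing KV heads cuts the full-context KV cache to one third of what it would be with one KV head per query head.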

Approximate config:

```yaml
vocab_size: 32000
hidden_size: 576
num_hidden_layers: 16
num_attention_heads: 9
num_key_value_heads: 3
intermediate_size: 1536
max_position_embeddings: 1024
rope_theta: 10000.0
rms_norm_eps: 1e-5
tie_word_embeddings: true
attention_bias: false
mlp_bias: false
dropout: 0.0
```
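
As a sanity check, the reported parameter count can be reproduced from this config. The sketch below assumes a standard LLaMA-style layout: separate Q/K/V/O projections, a three-matrix SwiGLU MLP (gate, up, down), two RMSNorm weight vectors per layer plus one final norm, no biases, and an output head that reuses the embedding matrix. The function name is illustrative.

```python
def count_parameters(vocab_size=32000, hidden_size=576, num_hidden_layers=16,
                     num_attention_heads=9, num_key_value_heads=3,
                     intermediate_size=1536, tie_word_embeddings=True):
    """Count weights for an assumed LLaMA-style decoder with GQA and SwiGLU."""
    head_dim = hidden_size // num_attention_heads          # 64
    kv_dim = num_key_value_heads * head_dim                # 192

    embed = vocab_size * hidden_size
    # Q and O projections are hidden x hidden; K and V are hidden x kv_dim.
    attn = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # SwiGLU: gate, up, and down projections.
    mlp = 3 * hidden_size * intermediate_size
    # Two RMSNorm weight vectors per layer (pre-attention, pre-MLP).
    norms = 2 * hidden_size
    per_layer = attn + mlp + norms

    final_norm = hidden_size
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size
    return embed + num_hidden_layers * per_layer + final_norm + lm_head


total = count_parameters()
print(f"{total:,}")  # 75,074,112
```

The result matches the stated 75,074,112 parameters, which suggests the assumed layout is at least consistent with the actual implementation.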

## Training data

### Base pretraining

- Dataset: [`Skylion007/openwebtext`](https://huggingface.co/datasets/Skylion007/openwebtext)
- Rows used: 1,000,000 selected rows
- Final tokenized train tokens: 1,143,301,833
- Final tokenized validation tokens: 34,486,473
- Epochs: 1
- Optimizer steps: 4,361
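
The token and step counts above imply an effective batch size, which the card does not state. The arithmetic below is only an inference from those two numbers: they are consistent with roughly 256 sequences of 1024 tokens per optimizer step, but the real batch size and gradient-accumulation settings are not documented here.

```python
train_tokens = 1_143_301_833
optimizer_steps = 4_361
context_length = 1024

tokens_per_step = train_tokens / optimizer_steps
sequences_per_step = tokens_per_step / context_length

# Hypothetical setting: 256 sequences of 1024 tokens per step,
# with the final partial batch dropped.
assumed_tokens_per_step = 256 * context_length
implied_steps = train_tokens // assumed_tokens_per_step

print(round(tokens_per_step))        # 262165
print(round(sequences_per_step, 2))  # 256.02
print(implied_steps)                 # 4361
```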

### Chat/SFT

- Dataset: [`HuggingFaceTB/smol-smoltalk`](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk)
- Train examples: 100,000
- Validation examples: 3,000
- Epochs: 1
- Optimizer steps: 781
- Loss masking: assistant-response tokens only
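
Assistant-only loss masking is commonly implemented by replacing the labels of all non-assistant positions with an ignore value, so that those positions contribute nothing to the cross-entropy loss. The sketch below illustrates that idea in plain Python. The helper name, the boolean mask input, and the token ids are made up for illustration, and how this repository's training script builds its mask is not shown in this card.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy


def build_labels(token_ids, assistant_mask):
    """Copy token ids as labels, hiding every non-assistant position from the loss."""
    if len(token_ids) != len(assistant_mask):
        raise ValueError("token_ids and assistant_mask must have the same length")
    return [
        tok if is_assistant else IGNORE_INDEX
        for tok, is_assistant in zip(token_ids, assistant_mask)
    ]


# Toy example: 4 system/user prompt tokens followed by 3 assistant tokens.
token_ids = [11, 12, 13, 14, 21, 22, 23]
assistant_mask = [False] * 4 + [True] * 3

labels = build_labels(token_ids, assistant_mask)
print(labels)  # [-100, -100, -100, -100, 21, 22, 23]
```

In a typical next-token setup, these labels are then shifted by one position relative to the logits before the loss is computed.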

## Training results

### Pretraining

- Train loss near the end of training: about `4.997`
- Most recent validation loss: about `5.049`, measured at step 4000 (of 4,361)

### SFT

- SFT completed at step `781`
- Validation loss trend:
  - step 250: `2.6031`
  - step 500: `2.4505`
  - step 750: `2.3313`
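
For readers who prefer perplexity, the losses reported in this section can be converted with a simple exponential. This assumes each value is a mean per-token cross-entropy in nats, which is the PyTorch default but is not stated explicitly in this card.

```python
import math


def perplexity(mean_loss_nats: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_loss_nats)


reported_losses = {
    "pretrain val, step 4000": 5.049,
    "SFT val, step 250": 2.6031,
    "SFT val, step 500": 2.4505,
    "SFT val, step 750": 2.3313,
}

for name, loss in reported_losses.items():
    print(f"{name}: loss={loss:.4f}  perplexity={perplexity(loss):.1f}")
```

Under that assumption, the pretraining validation perplexity is roughly 156 and the SFT validation perplexity falls from about 13.5 to about 10.3. The two stages are not directly comparable, since the SFT loss is computed on assistant-response tokens only and on a different dataset.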

SFT improved chat formatting and response style, but the model remains very small and undertrained by modern assistant standards.

## Hardware/run

- Cloud GPU: Vast.ai RTX 5070 Ti, 16GB VRAM
- Precision: CUDA/PyTorch mixed precision during training where supported
- Checkpointing: periodic `latest`, `best`, final, and step checkpoints
- Training artifacts were preserved separately outside the instance before teardown.

## Files in this repo

- `final.pt` — final SFT checkpoint
- `best.pt` — best SFT checkpoint
- `latest.pt` — latest SFT checkpoint
- `metrics.jsonl` — SFT metrics
- `step_609.pt` — intermediate SFT checkpoint
- `tokenizer/vocab.json` and `tokenizer/merges.txt` — tokenizer files
- `configs/model_75m.yaml` — architecture config
- `src/tinyllm/` — minimal PyTorch model implementation
- `scripts/infer_tinyllm.py` — simple local inference helper

## Quick inference

Clone or download the repo, install the dependencies, then run:

```bash
pip install torch tokenizers pyyaml huggingface_hub
python scripts/infer_tinyllm.py \
  --checkpoint final.pt \
  --prompt "What is the capital of France?"
```

The chat prompt format used during SFT is:

```text
<|system|>
You are a helpful, concise assistant.
<|end|>
<|user|>
USER_QUESTION
<|end|>
<|assistant|>
```
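
If you call the model from your own code, a small helper can assemble this format. The sketch below assumes single newlines between segments and a trailing newline after `<|assistant|>`, as the template above suggests. The function name is illustrative, and this card does not say whether `scripts/infer_tinyllm.py` already applies the template itself.

```python
DEFAULT_SYSTEM = "You are a helpful, concise assistant."


def build_chat_prompt(user_message: str, system_message: str = DEFAULT_SYSTEM) -> str:
    """Wrap a user message in the chat template shown above (assumed whitespace)."""
    return (
        "<|system|>\n"
        f"{system_message}\n"
        "<|end|>\n"
        "<|user|>\n"
        f"{user_message}\n"
        "<|end|>\n"
        "<|assistant|>\n"
    )


prompt = build_chat_prompt("What is the capital of France?")
print(prompt)
```

The model's reply would then be read from the generated text up to the next `<|end|>` marker, assuming generation is stopped there.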

## Observed sample behavior

In a post-upload local inference test, the model loaded cleanly and generated text, but output quality was mixed:

- Correct on: “What is the capital of France?” → answered Paris, with repetition.
- Weak on: simple science/world facts, often rambling or hallucinating.
- Weak on: arithmetic and short-answer discipline.
- Repetition and generic phrasing are common.

This is expected for a 75M-parameter model trained from scratch on about 1.14B pretraining tokens (roughly 15 tokens per parameter) with a single SFT pass.

## Limitations

- Not suitable for factual QA or production use.
- Hallucinates frequently.
- Repetition loops occur.
- Arithmetic is unreliable.
- Safety behavior was not evaluated.
- Model is not aligned beyond basic supervised chat finetuning.
- The checkpoint is a custom PyTorch model, not a standard `transformers` model class.

## Intended use

- Educational tiny-LLM experiment
- Pipeline validation
- Small-model architecture experimentation
- Baseline for future 150M+ runs

## Recommended next steps

To improve quality meaningfully:

1. Train a larger ~150M model.
2. Use more unique pretraining tokens, e.g. ~5B+.
3. Improve preprocessing/tokenization throughput with multiprocessing/sharding.
4. Add stronger instruction data and possibly preference tuning.
5. Export to a standard Hugging Face `transformers`-compatible format.

## Citation / attribution

Training datasets:

- `Skylion007/openwebtext`
- `HuggingFaceTB/smol-smoltalk`

This repository is an experimental model artifact from a custom tiny-LLM training pipeline.