---
language:
- en
license: mit
tags:
- tiny-llm
- causal-lm
- llama-like
- rope
- rmsnorm
- swiglu
- gqa
- openwebtext
- smoltalk
- pytorch
pipeline_tag: text-generation
library_name: pytorch
---

# TinyLLM 75M OpenWebText Chat

This repository contains an experimental **75,074,112 parameter decoder-only tiny language model** trained from scratch/near-scratch and then supervised-finetuned for chat.

> **Important quality note:** This is a successful end-to-end training pipeline artifact and research toy model, not a production assistant. It can load and generate text, but factual accuracy, instruction following, arithmetic, and repetition control are weak.

## Model summary

- **Model name:** `razor5050/tinyllm-75m-openwebtext-chat`
- **Architecture:** LLaMA/SmolLM-style decoder-only causal LM
- **Parameters:** 75,074,112
- **Context length:** 1024 tokens
- **Vocabulary:** 32,000 ByteLevel BPE tokens
- **Tokenizer:** custom ByteLevel BPE trained for this run
- **Checkpoint format:** PyTorch `.pt` checkpoints
- **Primary final checkpoint:** `final.pt`
- **Best checkpoint:** `best.pt`

## Architecture

The model uses modern tiny-LM components:

- decoder-only causal Transformer
- RoPE positional embeddings
- RMSNorm
- SwiGLU MLP
- grouped-query/key-value reduction via fewer KV heads
- tied input/output token embeddings
- no attention/MLP bias
- PyTorch SDPA causal attention

Approximate config:

```yaml
vocab_size: 32000
hidden_size: 576
num_hidden_layers: 16
num_attention_heads: 9
num_key_value_heads: 3
intermediate_size: 1536
max_position_embeddings: 1024
rope_theta: 10000.0
rms_norm_eps: 1e-5
tie_word_embeddings: true
attention_bias: false
mlp_bias: false
dropout: 0.0
```

## Training data

### Base pretraining

- Dataset: [`Skylion007/openwebtext`](https://huggingface.co/datasets/Skylion007/openwebtext)
- Rows used: 1,000,000 selected rows
- Final tokenized train tokens: 1,143,301,833
- Final tokenized validation tokens: 34,486,473
- Epochs: 1
- Optimizer steps: 4,361

### Chat/SFT

- Dataset: [`HuggingFaceTB/smol-smoltalk`](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk)
- Train examples: 100,000
- Validation examples: 3,000
- Epochs: 1
- Optimizer steps: 781
- Loss masking: assistant-response tokens only

## Training results

### Pretraining

- Final/latest train loss near end: about `4.997`
- Latest validation loss: about `5.049` at step 4000

### SFT

- SFT completed at step `781`
- Validation trend:
  - step 250: `2.6031`
  - step 500: `2.4505`
  - step 750: `2.3313`

SFT improved chat formatting and response style, but the model remains very small and undertrained by modern assistant standards.

## Hardware/run

- Cloud GPU: Vast.ai RTX 5070 Ti, 16GB VRAM
- Precision: CUDA/PyTorch mixed precision during training where supported
- Checkpointing: periodic `latest`, `best`, final, and step checkpoints
- Training artifacts were preserved separately outside the instance before teardown.

## Files in this repo

- `final.pt`: final SFT checkpoint
- `best.pt`: best SFT checkpoint
- `latest.pt`: latest SFT checkpoint
- `metrics.jsonl`: SFT metrics
- `step_609.pt`: intermediate SFT checkpoint
- `tokenizer/vocab.json` and `tokenizer/merges.txt`: tokenizer files
- `configs/model_75m.yaml`: architecture config
- `src/tinyllm/`: minimal PyTorch model implementation
- `scripts/infer_tinyllm.py`: simple local inference helper

## Quick inference

Clone/download the repo, install dependencies, then run:

```bash
pip install torch tokenizers pyyaml huggingface_hub
python scripts/infer_tinyllm.py \
  --checkpoint final.pt \
  --prompt "What is the capital of France?"
```

The chat prompt format used during SFT is:

```text
<|system|>
You are a helpful, concise assistant.
<|end|>
<|user|>
USER_QUESTION
<|end|>
<|assistant|>
```

## Observed sample behavior

In a post-upload local inference test, the model generated text and loaded cleanly, but quality was mixed:

- Correct on: “What is the capital of France?” → answered Paris, with repetition.
- Weak on: simple science/world facts, often rambling or hallucinating.
- Weak on: arithmetic and short-answer discipline.
- Repetition and generic phrasing are common.

This is expected for a 75M-parameter scratch-trained model with about 1.14B pretraining tokens and one SFT pass.

## Limitations

- Not suitable for factual QA or production use.
- Hallucinates frequently.
- Repetition loops occur.
- Arithmetic is unreliable.
- Safety behavior was not evaluated.
- Model is not aligned beyond basic supervised chat finetuning.
- The checkpoint is a custom PyTorch model, not a standard `transformers` model class.

## Intended use

- Educational tiny-LLM experiment
- Pipeline validation
- Small-model architecture experimentation
- Baseline for future 150M+ runs

## Recommended next steps

To improve quality meaningfully:

1. Train a larger ~150M model.
2. Use more unique pretraining tokens, e.g. ~5B+.
3. Improve preprocessing/tokenization throughput with multiprocessing/sharding.
4. Add stronger instruction data and possibly preference tuning.
5. Export to a standard Hugging Face `transformers` compatible format.

## Citation / attribution

Training datasets:

- `Skylion007/openwebtext`
- `HuggingFaceTB/smol-smoltalk`

This repository is an experimental model artifact from a custom tiny-LLM training pipeline.
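
## Sanity check: parameter count

The 75,074,112 figure in the model summary can be reproduced from the approximate config in the Architecture section. The sketch below is only an arithmetic check. It assumes the usual bias-free LLaMA-style layout (separate q/k/v/o projections, a gate/up/down SwiGLU MLP, two RMSNorm weight vectors per block, one final RMSNorm, and tied embeddings). The helper name `count_parameters` is hypothetical, and the code does not read the implementation in `src/tinyllm/`, so treat the layout as an assumption.

```python
# Values copied from the "Approximate config" block in this card.
config = {
    "vocab_size": 32000,
    "hidden_size": 576,
    "num_hidden_layers": 16,
    "num_attention_heads": 9,
    "num_key_value_heads": 3,
    "intermediate_size": 1536,
    "tie_word_embeddings": True,
}


def count_parameters(cfg):
    """Count weights for a bias-free LLaMA-style decoder (assumed layout)."""
    d = cfg["hidden_size"]
    head_dim = d // cfg["num_attention_heads"]      # 576 / 9 = 64
    kv_dim = head_dim * cfg["num_key_value_heads"]  # 64 * 3 = 192 (GQA)

    embed = cfg["vocab_size"] * d
    # With tied embeddings the output head reuses the embedding matrix.
    lm_head = 0 if cfg["tie_word_embeddings"] else cfg["vocab_size"] * d

    attn = d * d + 2 * d * kv_dim + d * d   # q, k, v, o projections
    mlp = 3 * d * cfg["intermediate_size"]  # gate, up, down projections
    norms = 2 * d                           # pre-attention and pre-MLP RMSNorm
    per_layer = attn + mlp + norms

    final_norm = d
    return embed + lm_head + cfg["num_hidden_layers"] * per_layer + final_norm


total = count_parameters(config)
print(f"{total:,}")  # 75,074,112
```

Under these assumptions the total matches the stated parameter count exactly. About 18.4M of the parameters sit in the tied embedding matrix, and the remaining ~56.6M are in the 16 Transformer blocks.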