| --- |
| language: |
| - en |
| license: mit |
| tags: |
| - tiny-llm |
| - causal-lm |
| - llama-like |
| - rope |
| - rmsnorm |
| - swiglu |
| - gqa |
| - openwebtext |
| - smoltalk |
| - pytorch |
| pipeline_tag: text-generation |
| library_name: pytorch |
| --- |
| |
| # TinyLLM 75M OpenWebText Chat |
|
|
| This repository contains an experimental **75,074,112 parameter decoder-only tiny language model** trained from scratch on OpenWebText and then fine-tuned for chat with supervised fine-tuning (SFT). |
|
|
| > **Important quality note:** This is a successful end-to-end training pipeline artifact and research toy model, not a production assistant. It can load and generate text, but factual accuracy, instruction following, arithmetic, and repetition control are weak. |
|
|
| ## Model summary |
|
|
| - **Model name:** `razor5050/tinyllm-75m-openwebtext-chat` |
| - **Architecture:** LLaMA/SmolLM-style decoder-only causal LM |
| - **Parameters:** 75,074,112 |
| - **Context length:** 1024 tokens |
| - **Vocabulary:** 32,000 ByteLevel BPE tokens |
| - **Tokenizer:** custom ByteLevel BPE trained for this run |
| - **Checkpoint format:** PyTorch `.pt` checkpoints |
| - **Primary final checkpoint:** `final.pt` |
| - **Best checkpoint:** `best.pt` |
|
|
| ## Architecture |
|
|
| The model uses modern tiny-LM components: |
|
|
| - decoder-only causal Transformer |
| - RoPE positional embeddings |
| - RMSNorm |
| - SwiGLU MLP |
| - grouped-query attention (GQA): 9 query heads share 3 KV heads |
| - tied input/output token embeddings |
| - no attention/MLP bias |
| - PyTorch SDPA causal attention |
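As a rough illustration of the grouped-query layout, the sketch below maps each query head to the KV head it would share, using the head counts from the config listed in this card (9 query heads, 3 KV heads, hidden size 576). The contiguous grouping rule (`q_head // group_size`) is the common LLaMA-style convention and is an assumption here; the actual mapping in `src/tinyllm/` may differ, and `kv_head_for` is a hypothetical helper name.

```python
# Head counts and hidden size are taken from this card's config.
# The contiguous grouping rule below is an assumption, not confirmed by the repo.
NUM_ATTENTION_HEADS = 9
NUM_KEY_VALUE_HEADS = 3
HIDDEN_SIZE = 576


def kv_head_for(q_head: int) -> int:
    """Return the KV head index shared by query head `q_head` (contiguous groups)."""
    group_size = NUM_ATTENTION_HEADS // NUM_KEY_VALUE_HEADS
    return q_head // group_size


head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS
mapping = {q: kv_head_for(q) for q in range(NUM_ATTENTION_HEADS)}

# K and V projections are narrower than the Q projection by the same 3x factor.
q_width = NUM_ATTENTION_HEADS * head_dim
kv_width = NUM_KEY_VALUE_HEADS * head_dim

print(mapping)
print(f"head_dim={head_dim}, q_width={q_width}, kv_width={kv_width}")
```

Under these assumptions, the K and V projections and the KV cache are one third the width they would have with full multi-head attention.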
|
|
| Approximate config: |
|
|
| ```yaml |
| vocab_size: 32000 |
| hidden_size: 576 |
| num_hidden_layers: 16 |
| num_attention_heads: 9 |
| num_key_value_heads: 3 |
| intermediate_size: 1536 |
| max_position_embeddings: 1024 |
| rope_theta: 10000.0 |
| rms_norm_eps: 1e-5 |
| tie_word_embeddings: true |
| attention_bias: false |
| mlp_bias: false |
| dropout: 0.0 |
| ``` |
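The stated parameter count can be cross-checked against this config with a few lines of arithmetic. The sketch below assumes a standard LLaMA-style layout: bias-free Q/K/V/O projections, a three-matrix SwiGLU MLP, two RMSNorm weight vectors per layer, one final RMSNorm, and tied embeddings counted once. The norm placement is an assumption, not something the card states.

```python
# Values copied from the config in this card.
vocab_size = 32000
hidden_size = 576
num_layers = 16
num_heads = 9
num_kv_heads = 3
intermediate_size = 1536

head_dim = hidden_size // num_heads
kv_width = num_kv_heads * head_dim

# Tied embeddings: the output head reuses the input embedding matrix.
embedding = vocab_size * hidden_size

# Attention without biases: Q and O are full width, K and V use the reduced KV width.
attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_width

# SwiGLU MLP without biases: gate, up, and down projections.
mlp = 3 * hidden_size * intermediate_size

# Assumed: two RMSNorm weight vectors per layer plus one final RMSNorm.
norms_per_layer = 2 * hidden_size
final_norm = hidden_size

total = embedding + num_layers * (attention + mlp + norms_per_layer) + final_norm
print(f"{total:,}")
```

With these assumptions the total comes out to 75,074,112, matching the figure in the model summary.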
|
|
| ## Training data |
|
|
| ### Base pretraining |
|
|
| - Dataset: [`Skylion007/openwebtext`](https://huggingface.co/datasets/Skylion007/openwebtext) |
| - Rows used: 1,000,000 selected rows |
| - Final tokenized train tokens: 1,143,301,833 |
| - Final tokenized validation tokens: 34,486,473 |
| - Epochs: 1 |
| - Optimizer steps: 4,361 |
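The batch size is not stated in this card, but the figures above allow a back-of-the-envelope estimate. The sketch below assumes the single epoch consumed essentially every train token in full 1024-token sequences; if the pipeline dropped or padded data, the real numbers differ.

```python
# Figures copied from this card.
train_tokens = 1_143_301_833
optimizer_steps = 4_361
parameters = 75_074_112
context_length = 1_024

# Assumption: one epoch consumed (almost) all train tokens in full-length sequences.
tokens_per_step = train_tokens / optimizer_steps
sequences_per_step = tokens_per_step / context_length
tokens_per_parameter = train_tokens / parameters

print(f"tokens per optimizer step: {tokens_per_step:,.0f}")
print(f"sequences per optimizer step: {sequences_per_step:.1f}")
print(f"pretraining tokens per parameter: {tokens_per_parameter:.2f}")
```

The estimate lands near 256 sequences (about 262k tokens) per optimizer step and roughly 15 pretraining tokens per parameter, which is below the commonly cited rule of thumb of about 20 tokens per parameter for compute-optimal training.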
|
|
| ### Chat/SFT |
|
|
| - Dataset: [`HuggingFaceTB/smol-smoltalk`](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) |
| - Train examples: 100,000 |
| - Validation examples: 3,000 |
| - Epochs: 1 |
| - Optimizer steps: 781 |
| - Loss masking: assistant-response tokens only |
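Assistant-only loss masking is commonly implemented by replacing the label at every non-assistant position with an ignore value, so those positions contribute nothing to the cross-entropy. The sketch below shows that idea in plain Python; `mask_labels` and the token ids are hypothetical, and `-100` is used because it is the default `ignore_index` of `torch.nn.CrossEntropyLoss`. How the actual pipeline builds its mask is not described in this card.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_labels(token_ids, is_assistant):
    """Copy token ids to labels, hiding every position that is not assistant output."""
    if len(token_ids) != len(is_assistant):
        raise ValueError("token_ids and is_assistant must have the same length")
    return [
        tok if keep else IGNORE_INDEX
        for tok, keep in zip(token_ids, is_assistant)
    ]


# Hypothetical token ids: 3 prompt tokens followed by 2 assistant tokens.
token_ids = [11, 12, 13, 21, 22]
is_assistant = [False, False, False, True, True]

labels = mask_labels(token_ids, is_assistant)
print(labels)
```

In a causal LM the labels are additionally shifted by one position relative to the logits before the loss is computed; that step is omitted here.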
|
|
| ## Training results |
|
|
| ### Pretraining |
|
|
| - Train loss near the end of the run: about `4.997` |
| - Latest validation loss: about `5.049` at step 4000 |
|
|
| ### SFT |
|
|
| - SFT completed at step `781` |
| - Validation loss trend: |
|   - step 250: `2.6031` |
|   - step 500: `2.4505` |
|   - step 750: `2.3313` |
|
|
| SFT improved chat formatting and response style, but the model remains very small and undertrained by modern assistant standards. |
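To make the reported losses easier to interpret, they can be converted to perplexity. The sketch below assumes each loss is a mean per-token cross-entropy in nats, which is what PyTorch's cross-entropy returns by default; the card does not state the unit explicitly.

```python
import math

# Validation losses copied from this card.
reported_losses = {
    "pretraining val, step 4000": 5.049,
    "SFT val, step 250": 2.6031,
    "SFT val, step 500": 2.4505,
    "SFT val, step 750": 2.3313,
}

# Assumption: losses are mean per-token cross-entropy in nats, so perplexity = exp(loss).
perplexities = {name: math.exp(loss) for name, loss in reported_losses.items()}

for name, ppl in perplexities.items():
    print(f"{name}: loss={reported_losses[name]:.4f} perplexity={ppl:.1f}")
```

The pretraining and SFT numbers are not directly comparable: they are measured on different datasets, and the SFT loss covers assistant-response tokens only.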
|
|
| ## Hardware/run |
|
|
| - Cloud GPU: Vast.ai RTX 5070 Ti, 16GB VRAM |
| - Precision: CUDA/PyTorch mixed precision during training where supported |
| - Checkpointing: periodic `latest`, `best`, `final`, and per-step checkpoints |
| - Training artifacts were preserved separately outside the instance before teardown. |
|
|
| ## Files in this repo |
|
|
| - `final.pt` — final SFT checkpoint |
| - `best.pt` — best SFT checkpoint |
| - `latest.pt` — latest SFT checkpoint |
| - `metrics.jsonl` — SFT metrics |
| - `step_609.pt` — intermediate SFT checkpoint |
| - `tokenizer/vocab.json` and `tokenizer/merges.txt` — tokenizer files |
| - `configs/model_75m.yaml` — architecture config |
| - `src/tinyllm/` — minimal PyTorch model implementation |
| - `scripts/infer_tinyllm.py` — simple local inference helper |
|
|
| ## Quick inference |
|
|
| Clone or download the repo, install the dependencies, then run: |
|
|
| ```bash |
| pip install torch tokenizers pyyaml huggingface_hub |
| python scripts/infer_tinyllm.py \ |
|   --checkpoint final.pt \ |
|   --prompt "What is the capital of France?" |
| ``` |
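One way to fetch the files without cloning is `snapshot_download` from `huggingface_hub`. The sketch below assumes the model name in the summary is also the Hub repo id, and that the paths in the file list are accurate; `download_repo`, `WANTED`, and the `tinyllm-75m` directory name are hypothetical choices for illustration.

```python
# Assumption: the model name from this card is also the Hugging Face Hub repo id.
REPO_ID = "razor5050/tinyllm-75m-openwebtext-chat"

# Files needed by the inference command; paths are taken from the file list in this card.
WANTED = ["final.pt", "tokenizer/*", "configs/*", "src/*", "scripts/*"]


def download_repo(local_dir: str = "tinyllm-75m") -> str:
    """Download the selected repo files and return the local directory path."""
    # Imported lazily so this module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        local_dir=local_dir,
        allow_patterns=WANTED,
    )
```

Calling `download_repo()` and then running the inference command from inside the returned directory should be equivalent to a manual download, skipping the extra checkpoints (`best.pt`, `latest.pt`, `step_609.pt`).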
|
|
| The chat prompt format used during SFT is: |
|
|
| ```text |
| <|system|> |
| You are a helpful, concise assistant. |
| <|end|> |
| <|user|> |
| USER_QUESTION |
| <|end|> |
| <|assistant|> |
| ``` |
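A small helper can render this template for a single-turn prompt. The sketch below is an assumption-laden illustration: `build_chat_prompt` is a hypothetical name, the trailing newline after `<|assistant|>` is a guess, and the card does not say whether the markers are dedicated special tokens or plain text, nor whether `scripts/infer_tinyllm.py` already applies the template itself.

```python
DEFAULT_SYSTEM = "You are a helpful, concise assistant."


def build_chat_prompt(user_message: str, system_message: str = DEFAULT_SYSTEM) -> str:
    """Render a single-turn prompt that ends where the assistant reply should begin."""
    return (
        "<|system|>\n"
        f"{system_message}\n"
        "<|end|>\n"
        "<|user|>\n"
        f"{user_message}\n"
        "<|end|>\n"
        "<|assistant|>\n"  # assumed trailing newline; not confirmed by the card
    )


prompt = build_chat_prompt("What is the capital of France?")
print(prompt)
```

At generation time, decoding would typically stop at the first `<|end|>` the model emits.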
|
|
| ## Observed sample behavior |
|
|
| In a local inference test after upload, the model loaded cleanly and generated text, but output quality was mixed: |
|
|
| - Correct on: “What is the capital of France?” → answered Paris, with repetition. |
| - Weak on: simple science/world facts, often rambling or hallucinating. |
| - Weak on: arithmetic and short-answer discipline. |
| - Repetition and generic phrasing are common. |
|
|
| This is expected for a 75M-parameter scratch-trained model with about 1.14B pretraining tokens and one SFT pass. |
|
|
| ## Limitations |
|
|
| - Not suitable for factual QA or production use. |
| - Hallucinates frequently. |
| - Repetition loops occur. |
| - Arithmetic is unreliable. |
| - Safety behavior was not evaluated. |
| - Model is not aligned beyond basic supervised chat finetuning. |
| - The checkpoint is a custom PyTorch model, not a standard `transformers` model class. |
|
|
| ## Intended use |
|
|
| - Educational tiny-LLM experiment |
| - Pipeline validation |
| - Small-model architecture experimentation |
| - Baseline for future 150M+ runs |
|
|
| ## Recommended next steps |
|
|
| To improve quality meaningfully: |
|
|
| 1. Train a larger ~150M model. |
| 2. Use more unique pretraining tokens, e.g. ~5B+. |
| 3. Improve preprocessing/tokenization throughput with multiprocessing/sharding. |
| 4. Add stronger instruction data and possibly preference tuning. |
| 5. Export to a standard Hugging Face `transformers`-compatible format. |
|
|
| ## Citation / attribution |
|
|
| Training datasets: |
|
|
| - `Skylion007/openwebtext` |
| - `HuggingFaceTB/smol-smoltalk` |
|
|
| This repository is an experimental model artifact from a custom tiny-LLM training pipeline. |
|
|