---
language: en
license: mit
tags:
- tiny
- language-model
- causal-lm
- pytorch
datasets:
- roneneldan/TinyStories
- Skylion007/openwebtext
pipeline_tag: text-generation
library_name: transformers
---

# TinyLM

A 3.4M-parameter causal language model trained from scratch for experimentation.
## Architecture

| Hyperparameter | Value |
|---|---|
| Parameters | 3,403,968 |
| Layers | 4 |
| Hidden size | 64 |
| Attention heads | 4 |
| FFN dim | 192 |
| Embedding rank | 32 |
| Context length | 256 |
| Tokenizer | GPT-2 (50,257 vocab) |

Uses a **factored (low-rank) embedding** to keep the vocab projection from eating the entire parameter budget, with weight tying on the output head.
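The idea can be sketched as follows. This is a minimal illustration of a factored embedding with a tied output head, not the repo's actual `modeling_tinylm.py`; the class and method names are hypothetical:

```python
import torch
import torch.nn as nn


class FactoredEmbedding(nn.Module):
    """Embed tokens via a low-rank factorization: vocab -> rank -> hidden.

    Parameter cost is vocab*rank + rank*hidden instead of vocab*hidden,
    which matters when the vocab (50,257) dwarfs the hidden size (64).
    """

    def __init__(self, vocab_size: int, rank: int, hidden: int):
        super().__init__()
        self.low_rank = nn.Embedding(vocab_size, rank)       # vocab x rank
        self.up_proj = nn.Linear(rank, hidden, bias=False)   # rank -> hidden

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Look up the rank-32 vector, then project up to the hidden size.
        return self.up_proj(self.low_rank(token_ids))

    def logits(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Tied output head: reuse the same two matrices in reverse to map
        # hidden states back to vocab logits, adding no new parameters.
        down = hidden_states @ self.up_proj.weight    # (..., rank)
        return down @ self.low_rank.weight.t()        # (..., vocab)
```

With vocab 50,257, rank 32, and hidden size 64, this layer costs 50,257×32 + 32×64 = 1,610,272 parameters, versus 50,257×64 ≈ 3.2M for a full embedding matrix alone.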
## Training

| Setting | Value |
|---|---|
| Datasets | Skylion007/openwebtext (10k samples), roneneldan/TinyStories (10k samples) |
| Optimizer | AdamW (lr=3e-3, weight_decay=0.01) |
| Scheduler | Cosine annealing with warm restarts |
| Mixed precision | fp16 (torch.cuda.amp) |
| Hardware | NVIDIA P100 |
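The optimizer/scheduler/AMP wiring above might look roughly like this. This is a sketch, not the actual training script: the stand-in model, loss, and restart period `T_0` are placeholders, only the optimizer and scheduler settings come from the table:

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = nn.Linear(64, 64)  # stand-in for the TinyLM model

# Settings from the table above.
optimizer = AdamW(model.parameters(), lr=3e-3, weight_decay=0.01)
# T_0 (steps until the first warm restart) is a placeholder value.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=1000)
# GradScaler prevents fp16 gradient underflow; disabled on CPU-only machines.
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())


def train_step(batch, target):
    optimizer.zero_grad(set_to_none=True)
    # Forward pass runs in fp16 where safe under autocast.
    with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
        loss = nn.functional.mse_loss(model(batch), target)  # stand-in loss
    # Scale the loss, backprop, unscale-and-step, then update the scale.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    return loss.item()
```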
## Usage

```python
from huggingface_hub import snapshot_download
import importlib.util
import torch

# Download the model files from the Hub
snapshot_download(repo_id="Fu01978/TinyLM", local_dir="./tinylm")

# Load the model code via its standalone script
spec = importlib.util.spec_from_file_location("modeling_tinylm", "./tinylm/modeling_tinylm.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

model, tokenizer, config = module.load_tinylm("./tinylm")
model.eval()

# Generate text
output = module.generate(model, tokenizer, "Once upon a time, ")
print(output)
```