---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- causal-lm
- small-lm
- gpt
- fine-tuned
- lyrics
- hip-hop
- wu-tang
- style-transfer
pipeline_tag: text-generation
---

# tiny-tang-38m

A 37.8M-parameter decoder-only transformer fine-tuned from **[darthcrawl/tiny-30m](https://huggingface.co/darthcrawl/tiny-30m)** on Wu-Tang Clan lyrics. Same architecture, same tokenizer, full-weights continuation training (no LoRA).

Fan project. Educational. Not affiliated with Wu-Tang Clan, RZA, Loud Records, or anyone in the actual catalog. Outputs riff on the corpus's surface style; they do not reproduce specific copyrighted lyrics.

## Quick start

```python
import json, sys, torch
from pathlib import Path
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
from safetensors.torch import load_file

# Download the repo and make its bundled model.py / config.py importable.
local = snapshot_download("darthcrawl/tiny-tang-38m")
sys.path.insert(0, local)
from config import ModelConfig
from model import GPT

# Build the config, dropping any config.json keys that ModelConfig does not define.
cfg_dict = json.loads((Path(local) / "config.json").read_text())
valid = set(ModelConfig.__dataclass_fields__)
cfg = ModelConfig(**{k: v for k, v in cfg_dict.items() if k in valid})

model = GPT(cfg).eval()
# strict=False tolerates missing or unexpected keys in the checkpoint.
model.load_state_dict(load_file(f"{local}/model.safetensors"), strict=False)

tok = Tokenizer.from_file(f"{local}/tokenizer.json")
eot = tok.token_to_id("<|endoftext|>")

# Prime generation with a verse header.
prompt = "[verse 1: rza]\n"
ids = torch.tensor([tok.encode(prompt).ids], dtype=torch.long)
out = model.generate(ids, max_new_tokens=160, temperature=0.9, top_k=100, eos_id=eot)
print(tok.decode(out[0].tolist()))
```

Try seeding with verse markers: `[verse 1: gza]`, `[hook]`, `[chorus]`, `[intro]`. Higher temperature (0.9-1.1) reads better than the base model's 0.8.

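The `temperature` and `top_k` arguments passed to `generate` are standard sampling knobs. As a rough illustration only, the pure-Python sketch below shows how this kind of sampling typically works. The function names and toy logits are hypothetical, and the repo's own `generate` in `model.py` may be implemented differently.

```python
import math
import random

def top_k_distribution(logits, temperature=0.9, top_k=100):
    """Return {token_id: probability} after top-k filtering and temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = min(top_k, len(logits))
    # Keep the k highest-scoring token ids; everything else gets probability 0.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Dividing by temperature flattens (>1) or sharpens (<1) the distribution.
    scaled = [logits[i] / temperature for i in top_ids]
    peak = max(scaled)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return {i: w / total for i, w in zip(top_ids, weights)}

def sample_token(logits, temperature=0.9, top_k=100, rng=None):
    """Draw one token id from the filtered, rescaled distribution."""
    rng = rng or random.Random()
    dist = top_k_distribution(logits, temperature, top_k)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
for t in (0.8, 1.1):
    print(t, top_k_distribution(toy_logits, temperature=t, top_k=3))
print(sample_token(toy_logits, rng=random.Random(0)))
```

Moving from 0.8 to 1.1 shifts probability mass from the top candidate to the runners-up, which is one plausible reason the looser setting reads better for lyrics.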
## Architecture

Same as the base. Inherited verbatim during fine-tune.

| | |
|---|---|
| Type | Decoder-only transformer |
| Parameters | 37.8M |
| Layers | 8 |
| Hidden dim | 512 |
| Attention heads | 8 |
| Context length | 1024 |
| Vocab size | 8192 |
| Position encoding | RoPE |
| Norm | RMSNorm (pre-norm) |
| MLP | SwiGLU |
| Attention | PyTorch SDPA, causal |
| Embedding tying | Yes |

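As a sanity check on the 37.8M figure, the count can be reproduced from the table with a few lines of arithmetic. This sketch assumes an MLP hidden width of 2048, bias-free linear layers, and two RMSNorm gain vectors per block plus a final one. None of those details is stated in this card, so treat them as guesses that happen to match; `count_params` is a hypothetical helper, not part of the repo.

```python
def count_params(n_layer=8, d_model=512, vocab=8192, d_ff=2048, tied=True):
    """Estimate the parameter count of a pre-norm, bias-free decoder-only transformer."""
    embed = vocab * d_model              # token embedding matrix
    attn = 4 * d_model * d_model         # Q, K, V and output projections
    mlp = 3 * d_model * d_ff             # SwiGLU: gate, up and down projections
    norms = 2 * d_model                  # two RMSNorm gain vectors per block
    per_layer = attn + mlp + norms
    final_norm = d_model
    head = 0 if tied else vocab * d_model  # tied output head adds no parameters
    # RoPE is parameter-free, so position encoding contributes nothing.
    return embed + n_layer * per_layer + final_norm + head

total = count_params()  # d_ff=2048 is an ASSUMPTION, not a value from the card
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
# prints: 37,757,440 parameters (~37.8M)
```

Under these assumptions roughly 4.2M parameters sit in the tied embedding and about 4.2M in each of the 8 blocks.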
## Fine-tune details

| | |
|---|---|
| Base model | [darthcrawl/tiny-30m](https://huggingface.co/darthcrawl/tiny-30m) |
| Fine-tune corpus | OHHLA Wu-Tang archive, 159 songs / ~488K chars / ~90K words |
| Source spec (inherited from base) | `tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10` |
| Tokens seen (FT) | 477,521,740 |
| Best ckpt step | 19950 (continuation from base) |
| Best val loss | 5.0161 |
| Optimizer | AdamW (β=(0.9, 0.95), wd=0.1) |
| LR | Cosine tail of base schedule (low, no fresh warmup) |
| Batch size | 32 × grad_accum 4 (effective 128 sequences per step) |
| Precision | bfloat16 (AMP) |

Tokenizer is unchanged from the base, so token IDs map cleanly between models.

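The "cosine tail" in the LR row means the fine-tune resumes near the end of the base schedule, where the learning rate has already decayed close to its floor. The sketch below shows a generic warmup-plus-cosine schedule to make that concrete. Every number in `SCHEDULE` is hypothetical and not taken from this card or the upstream training code.

```python
import math

def cosine_lr(step, max_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup followed by cosine decay from max_lr down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical schedule values, NOT taken from this card.
SCHEDULE = dict(max_lr=6e-4, min_lr=6e-5, warmup_steps=500, total_steps=20_000)

# Resuming ~500 steps before the end lands in the flat tail of the cosine.
for step in (500, 10_000, 19_500, 19_999):
    print(step, f"{cosine_lr(step, **SCHEDULE):.2e}")
```

Because the step counter carries over from the base checkpoint, no warmup branch is taken and the fine-tune only ever sees rates near `min_lr`.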
## What it does

Outputs lyric-style text with Wu-Tang-flavored vocabulary, verse/section markers, internal rhyme attempts, and the occasional Shaolin reference. Best with a prompt that primes a verse header.

## What it doesn't do

- Coherent narrative outside the lyrics register (the fine-tune pulls the base toward the lyrics distribution).
- Faithful imitation of any specific member's flow. The model averages over all members in the corpus.
- Reproduction of copyrighted lyrics. Output may overlap on common rap idioms (any short n-gram from any rap song); it does not reproduce full verses verbatim.
- Anything actually clever. It's 37.8M params.

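One way to check the verbatim-reproduction point on your own samples is to measure the longest run of consecutive words that a generated sample shares with a reference text. This is a minimal sketch, assuming whitespace tokenization and case-insensitive matching. `longest_shared_run` is a hypothetical helper, not part of this repo, and the strings are placeholders rather than real lyrics.

```python
def longest_shared_run(generated, reference):
    """Length, in words, of the longest contiguous word sequence found in both texts."""
    a = generated.lower().split()
    b = reference.lower().split()
    best = 0
    # Longest-common-substring dynamic programme over words, one row at a time.
    prev = [0] * (len(b) + 1)
    for word_a in a:
        cur = [0] * (len(b) + 1)
        for j, word_b in enumerate(b, start=1):
            if word_a == word_b:
                cur[j] = prev[j - 1] + 1  # extend the run ending at the previous pair
                best = max(best, cur[j])
        prev = cur
    return best

generated = "the quick brown fox jumps over the lazy dog"
reference = "a quick brown fox sleeps under the lazy dog tonight"
print(longest_shared_run(generated, reference))  # 3
```

Short matches are expected on common idioms; how long a run has to be before it counts as reproduction is a judgment call.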
## Data and copyright

Source: lyrics scraped from [OHHLA](http://www.ohhla.com) (the Original Hip-Hop Lyrics Archive) via the [cing/rapwords](https://github.com/cing/rapwords) GitHub mirror. Lyrics belong to their respective writers, performers, and publishers. This model is a learned distribution over the byte-level BPE tokens that compose those lyrics, distributed for educational and research use. The Apache 2.0 license covers the *code* and *weights*, not the underlying source text.

If you are a rights holder and want this taken down, open an issue on the upstream repo.

## Limitations

No safety alignment. No instruction tuning. Outputs include profanity, slurs, and other content present in the corpus. Not for any user-facing or production application.

## Reproducibility

`model.py`, `config.py`, and `sample.py` ship in this repo. The fine-tune was run via the upstream training pipeline:

```bash
# data prep with the SAME tokenizer as the base
python data_prep.py --mix "wu_tang:100" --tokenizer data/tokenizer.json --out-dir data_wu

# fine-tune. --reset-best forces the new best.pt to track val-on-wu;
# without it, best.pt would never update (base's TinyStories val < FT's wu val).
python train.py \
  --resume <base.pt> --data-dir data_wu \
  --reset-best --max-steps <base_step + ~500> \
  --save-interval 50 --eval-interval 100
```

The shipped checkpoint is `latest.pt` from a ~500-step FT run. `best.pt` (lowest val) typically produces more grounded but less stylistically distinctive output; `latest.pt` favors style at the cost of generalization.

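The reason `--reset-best` matters can be shown with a toy best-checkpoint tracker. This is only a sketch of the logic described in the comments above, not the upstream `train.py` code. The class name and the base validation loss of 1.5 are hypothetical; 5.0161 is the fine-tune validation loss reported in this card.

```python
import math

class BestCheckpointTracker:
    """Tracks the lowest validation loss seen so far (hypothetical helper)."""

    def __init__(self, best_val=math.inf):
        self.best_val = best_val

    def update(self, val_loss):
        """Return True when val_loss is a new best, i.e. best.pt should be rewritten."""
        if val_loss < self.best_val:
            self.best_val = val_loss
            return True
        return False

BASE_BEST_VAL = 1.5   # HYPOTHETICAL base val loss on TinyStories-style data
FT_VAL = 5.0161       # fine-tune val loss reported in this card

# Resuming without a reset: the inherited best blocks every update.
resumed = BestCheckpointTracker(best_val=BASE_BEST_VAL)
print(resumed.update(FT_VAL))   # False

# With a reset (what --reset-best is described as doing): tracking starts fresh.
fresh = BestCheckpointTracker()
print(fresh.update(FT_VAL))     # True
```

The two validation sets are not comparable, so carrying the base's best loss across the resume would pin `best.pt` to the base weights.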
## License

Apache 2.0 for the code and weights. Underlying source lyrics: see Data and copyright above.
