---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- causal-lm
- small-lm
- gpt
- fine-tuned
- lyrics
- hip-hop
- wu-tang
- style-transfer
pipeline_tag: text-generation
---

# tiny-tang-38m

A 37.8M-parameter decoder-only transformer fine-tuned from **[darthcrawl/tiny-30m](https://huggingface.co/darthcrawl/tiny-30m)** on Wu-Tang Clan lyrics. Same architecture, same tokenizer, full-weights continuation training (no LoRA).

Fan project. Educational. Not affiliated with Wu-Tang Clan, RZA, Loud Records, or anyone in the actual catalog. Outputs riff on the corpus's surface style; they do not reproduce specific copyrighted lyrics.

## Quick start

```python
import json, sys, torch
from pathlib import Path
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
from safetensors.torch import load_file

# Download the repo snapshot and make its model.py / config.py importable.
local = snapshot_download("darthcrawl/tiny-tang-38m")
sys.path.insert(0, local)
from config import ModelConfig
from model import GPT

# Build the config, keeping only keys that ModelConfig actually defines.
cfg_dict = json.loads((Path(local) / "config.json").read_text())
valid = set(ModelConfig.__dataclass_fields__)
cfg = ModelConfig(**{k: v for k, v in cfg_dict.items() if k in valid})

# strict=False tolerates key mismatches between the file and the module.
model = GPT(cfg).eval()
model.load_state_dict(load_file(f"{local}/model.safetensors"), strict=False)

tok = Tokenizer.from_file(f"{local}/tokenizer.json")
eot = tok.token_to_id("<|endoftext|>")

# Prime the model with a verse header, then sample.
prompt = "[verse 1: rza]\n"
ids = torch.tensor([tok.encode(prompt).ids], dtype=torch.long)
out = model.generate(ids, max_new_tokens=160, temperature=0.9, top_k=100, eos_id=eot)
print(tok.decode(out[0].tolist()))
```

Try seeding with verse markers: `[verse 1: gza]`, `[hook]`, `[chorus]`, `[intro]`. Higher temperature (0.9-1.1) reads better than the base model's 0.8.

## Architecture

Same as the base. Inherited verbatim during fine-tune.

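As a rough sanity check, the 37.8M figure can be reproduced from the dimensions listed in the architecture table. The sketch below is an estimate, not the repo's own accounting. It assumes a SwiGLU hidden width of 2048 (not stated in this card), bias-free linear layers, one weight vector per RMSNorm (two per layer plus a final norm), no learned parameters for RoPE, and tied embeddings counted once. `Dims` and `count_params` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Dims:
    vocab_size: int = 8192   # from the architecture table
    d_model: int = 512       # from the architecture table
    n_layers: int = 8        # from the architecture table
    mlp_hidden: int = 2048   # ASSUMPTION: not stated in this card


def count_params(d: Dims) -> int:
    """Estimate parameters, assuming no biases and tied embeddings."""
    embedding = d.vocab_size * d.d_model      # shared with the output head
    attention = 4 * d.d_model * d.d_model     # Q, K, V and output projections
    mlp = 3 * d.d_model * d.mlp_hidden        # SwiGLU: gate, up and down
    norms = 2 * d.d_model                     # two RMSNorm weights per layer
    final_norm = d.d_model
    return embedding + d.n_layers * (attention + mlp + norms) + final_norm


total = count_params(Dims())
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
# prints: 37,757,440 parameters (~37.8M)
```

Under these assumptions the estimate rounds to the stated 37.8M, which suggests the MLP hidden width is about four times the model width. The exact value lives in `config.json`.
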
| | |
|---|---|
| Type | Decoder-only transformer |
| Parameters | 37.8M |
| Layers | 8 |
| Hidden dim | 512 |
| Attention heads | 8 |
| Context length | 1024 |
| Vocab size | 8192 |
| Position encoding | RoPE |
| Norm | RMSNorm (pre-norm) |
| MLP | SwiGLU |
| Attention | PyTorch SDPA, causal |
| Embedding tying | Yes |

## Fine-tune details

| | |
|---|---|
| Base model | [darthcrawl/tiny-30m](https://huggingface.co/darthcrawl/tiny-30m) |
| Fine-tune corpus | OHHLA Wu-Tang archive, 159 songs / ~488K chars / ~90K words |
| Source spec | `tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10` |
| Tokens seen (FT) | 477,521,740 |
| Best ckpt step | 19950 (continuation from base) |
| Best val loss | 5.0161 |
| Optimizer | AdamW (β=(0.9, 0.95), wd=0.1) |
| LR | Cosine tail of base schedule (low, no fresh warmup) |
| Batch size | 32 × grad_accum 4 |
| Precision | bfloat16 (AMP) |

Tokenizer is unchanged from the base, so token IDs map cleanly between models.

## What it does

Outputs lyric-style text with Wu-Tang-flavored vocabulary, verse/section markers, internal rhyme attempts, and the occasional Shaolin reference. Best with a prompt that primes a verse header.

## What it doesn't do

- Coherent narrative outside the lyrics register (the fine-tune drifts the base toward this distribution).
- Faithful imitation of any specific member's flow. The model averages over all members in the corpus.
- Reproduction of copyrighted lyrics. Output may overlap on common rap idioms (any short n-gram from any rap song); it does not reproduce full verses verbatim.
- Anything actually clever. It's 37.8M params.

## Data and copyright

Source: lyrics scraped from [OHHLA](http://www.ohhla.com) (the Original Hip-Hop Lyrics Archive) via the [cing/rapwords](https://github.com/cing/rapwords) GitHub mirror. Lyrics belong to their respective writers, performers, and publishers.
This model is a learned distribution over the byte-level BPE tokens that compose those lyrics, distributed for educational and research use under Apache 2.0 (the *code* and *weights*, not the underlying source text).

If you are a rights holder and want this taken down, open an issue on the upstream repo.

## Limitations

No safety alignment. No instruction tuning. Outputs include profanity, slurs, and other content present in the corpus. Not for any user-facing or production application.

## Reproducibility

`model.py`, `config.py`, `sample.py` ship in this repo. Fine-tune was run via the upstream training pipeline:

```bash
# data prep with the SAME tokenizer as the base
python data_prep.py --mix "wu_tang:100" --tokenizer data/tokenizer.json --out-dir data_wu

# fine-tune. --reset-best forces the new best.pt to track val-on-wu;
# without it, best.pt would never update (base's TinyStories val < FT's wu val).
python train.py \
    --resume --data-dir data_wu \
    --reset-best --max-steps \
    --save-interval 50 --eval-interval 100
```

This shipped checkpoint is `latest.pt` from a ~500-step FT run. `best.pt` (lowest val) typically produces more grounded but less stylistically distinctive output; `latest.pt` favors style at the cost of generalization.

## License

Apache 2.0 for the code and weights. Underlying source lyrics: see Data and copyright above.