tiny-tang-38m
A 37.8M-parameter decoder-only transformer fine-tuned from darthcrawl/tiny-30m on Wu-Tang Clan lyrics. Same architecture, same tokenizer, full-weights continuation training (no LoRA).
Fan project. Educational. Not affiliated with Wu-Tang Clan, RZA, Loud Records, or anyone in the actual catalog. Outputs riff on the corpus's surface style; they do not reproduce specific copyrighted lyrics.
Quick start
import json, sys, torch
from pathlib import Path
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
from safetensors.torch import load_file
local = snapshot_download("darthcrawl/tiny-tang-38m")
sys.path.insert(0, local)
from config import ModelConfig
from model import GPT
cfg_dict = json.loads((Path(local) / "config.json").read_text())
valid = set(ModelConfig.__dataclass_fields__)
cfg = ModelConfig(**{k: v for k, v in cfg_dict.items() if k in valid})
model = GPT(cfg).eval()
# strict=False: with tied embeddings, the output-head weight may not be stored as a separate tensor
model.load_state_dict(load_file(f"{local}/model.safetensors"), strict=False)
tok = Tokenizer.from_file(f"{local}/tokenizer.json")
eot = tok.token_to_id("<|endoftext|>")
prompt = "[verse 1: rza]\n"
ids = torch.tensor([tok.encode(prompt).ids], dtype=torch.long)
out = model.generate(ids, max_new_tokens=160, temperature=0.9, top_k=100, eos_id=eot)
print(tok.decode(out[0].tolist()))
Try seeding with verse markers: [verse 1: gza], [hook], [chorus], [intro]. Higher temperature (0.9-1.1) reads better than the base model's 0.8.
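The `generate` call above takes `temperature` and `top_k` arguments; a self-contained sketch of what standard temperature/top-k sampling does at each decoding step (an illustration of the technique, not the repo's actual `generate` implementation):

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 0.9, top_k: int = 100) -> torch.Tensor:
    # Scale logits by temperature, keep only the top_k candidates,
    # then sample from the renormalized distribution.
    logits = logits / temperature
    vals, idx = torch.topk(logits, min(top_k, logits.size(-1)))
    probs = torch.softmax(vals, dim=-1)
    return idx[torch.multinomial(probs, 1)]

torch.manual_seed(0)
logits = torch.randn(8192)          # one step's logits over the 8192-token vocab
next_id = sample_next(logits)
print(int(next_id))                 # a valid token id in [0, 8192)
```

Raising the temperature flattens the distribution before the top-k cut, which is why 0.9-1.1 produces looser, more lyric-like output than the base's 0.8.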
Architecture
Same as the base. Inherited verbatim during fine-tune.
| Type | Decoder-only transformer |
| Parameters | 37.8M |
| Layers | 8 |
| Hidden dim | 512 |
| Attention heads | 8 |
| Context length | 1024 |
| Vocab size | 8192 |
| Position encoding | RoPE |
| Norm | RMSNorm (pre-norm) |
| MLP | SwiGLU |
| Attention | PyTorch SDPA, causal |
| Embedding tying | Yes |
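The SwiGLU MLP row can be made concrete. A minimal sketch at the table's hidden dim of 512; the inner width of 2048 is an assumption (it is consistent with the 37.8M total given tied embeddings, but is not confirmed by the repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated MLP: down(silu(gate(x)) * up(x)). Inner width 2048 is assumed."""
    def __init__(self, dim: int = 512, hidden: int = 2048):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up   = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(1, 16, 512)          # (batch, seq, dim)
print(SwiGLU()(x).shape)             # torch.Size([1, 16, 512])
```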
Fine-tune details
| Base model | darthcrawl/tiny-30m |
| Fine-tune corpus | OHHLA Wu-Tang archive, 159 songs / ~488K chars / ~90K words |
| Source spec | tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10 |
| Tokens seen (FT) | 477521740 |
| Best ckpt step | 19950 (continuation from base) |
| Best val loss | 5.0161 |
| Optimizer | AdamW (β=(0.9, 0.95), wd=0.1) |
| LR | Cosine tail of base schedule (low, no fresh warmup) |
| Batch size | 32 × grad_accum 4 |
| Precision | bfloat16 (AMP) |
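The optimizer and batch rows translate to a setup like the following sketch; the LR value is illustrative (the table only says "low cosine tail"), and the stand-in model is not the actual GPT:

```python
import torch

model = torch.nn.Linear(512, 512)    # stand-in for the GPT
opt = torch.optim.AdamW(
    model.parameters(),
    lr=3e-5,                          # illustrative "cosine tail" value, not from the repo
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

# Gradient accumulation: effective batch = 32 micro-batch × 4 accumulation steps.
opt.zero_grad()
for micro in range(4):
    x = torch.randn(32, 512)
    loss = model(x).pow(2).mean() / 4  # divide so grads average over micro-batches
    loss.backward()
opt.step()
```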
Tokenizer is unchanged from the base, so token IDs map cleanly between models.
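Since tokenizer.json ships unchanged, ID compatibility can be verified by fingerprinting the two token-to-id tables; a hypothetical check with toy vocabularies (`vocab_fingerprint` is not part of the repo):

```python
import hashlib
import json

def vocab_fingerprint(vocab: dict) -> str:
    """Hash a token->id table; equal hashes mean ids transfer verbatim."""
    blob = json.dumps(sorted(vocab.items())).encode()
    return hashlib.sha256(blob).hexdigest()

base_vocab = {"<|endoftext|>": 0, "wu": 71, "tang": 72}   # toy tables for illustration
ft_vocab   = {"<|endoftext|>": 0, "wu": 71, "tang": 72}
print(vocab_fingerprint(base_vocab) == vocab_fingerprint(ft_vocab))  # True
```

In practice you would pass `Tokenizer.from_file(...).get_vocab()` for each model instead of the toy dicts.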
What it does
Outputs lyric-style text with Wu-Tang-flavored vocabulary, verse/section markers, internal rhyme attempts, and the occasional Shaolin reference. Best with a prompt that primes a verse header.
What it doesn't do
- Coherent narrative outside the lyrics register (the fine-tune pulls the base away from general prose, toward this distribution).
- Faithful imitation of any specific member's flow. The model averages over all members in the corpus.
- Reproduction of copyrighted lyrics. Output may overlap on common rap idioms (any short n-gram from any rap song); it does not reproduce full verses verbatim.
- Anything actually clever. It's 38M params.
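One rough way to sanity-check the no-verbatim claim is word-level n-gram overlap between a sample and the corpus; a hypothetical helper (not part of this repo, and crude compared to proper dedup tooling):

```python
def ngram_overlap(candidate: str, corpus: str, n: int = 5) -> float:
    """Fraction of the candidate's word n-grams that also appear in the corpus."""
    def grams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    cand = grams(candidate)
    return len(cand & grams(corpus)) / max(len(cand), 1)

# Identical 5-word runs score 1.0; disjoint text scores 0.0.
print(ngram_overlap("a b c d e", "a b c d e f"))  # 1.0
```

A generation whose overlap stays near zero at n=5 or higher shares idioms, not verses.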
Data and copyright
Source: lyrics scraped from OHHLA (the Original Hip-Hop Lyrics Archive) via the cing/rapwords GitHub mirror. Lyrics belong to their respective writers, performers, and publishers. This model is a learned distribution over the byte-level BPE tokens that compose those lyrics, distributed for educational and research use under Apache 2.0 (the code and weights, not the underlying source text).
If you are a rights holder and want this taken down, open an issue on the upstream repo.
Limitations
No safety alignment. No instruction tuning. Outputs include profanity, slurs, and other content present in the corpus. Not for any user-facing or production application.
Reproducibility
model.py, config.py, sample.py ship in this repo. Fine-tune was run via the upstream training pipeline:
# data prep with the SAME tokenizer as the base
python data_prep.py --mix "wu_tang:100" --tokenizer data/tokenizer.json --out-dir data_wu
# fine-tune. --reset-best forces the new best.pt to track val-on-wu;
# without it, best.pt would never update (base's TinyStories val < FT's wu val).
python train.py \
--resume <base.pt> --data-dir data_wu \
--reset-best --max-steps <base_step + ~500> \
--save-interval 50 --eval-interval 100
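The `--reset-best` comment above can be made concrete with a toy model of the assumed checkpoint logic (a sketch of the described behavior, not the actual `train.py` code; the loss values are illustrative):

```python
import math

def maybe_update_best(val_loss: float, best_val: float, reset_best: bool = False):
    """Return (new_best, updated). Resetting best_val to +inf guarantees
    the first fine-tune eval becomes the new best checkpoint."""
    if reset_best:
        best_val = math.inf
    if val_loss < best_val:
        return val_loss, True
    return best_val, False

# Illustrative numbers: a low base-domain val loss blocks any update...
print(maybe_update_best(5.0, 2.1))        # (2.1, False)
# ...unless the stored best is reset first.
print(maybe_update_best(5.0, 2.1, True))  # (5.0, True)
```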
The shipped checkpoint is latest.pt from a ~500-step FT run. best.pt (lowest val loss) typically produces more grounded but less stylistically distinctive output; latest.pt favors style at the cost of generalization.
License
Apache 2.0 for the code and weights. Underlying source lyrics: see Data and copyright above.