tiny-tang-38m

A 37.8M-parameter decoder-only transformer fine-tuned from darthcrawl/tiny-30m on Wu-Tang Clan lyrics. Same architecture, same tokenizer, full-weight continuation training (no LoRA).

Fan project. Educational. Not affiliated with Wu-Tang Clan, RZA, Loud Records, or anyone in the actual catalog. Outputs riff on the corpus's surface style; they do not reproduce specific copyrighted lyrics.

Quick start

import json, sys, torch
from pathlib import Path
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
from safetensors.torch import load_file

# download the repo and import the model code it ships with
local = snapshot_download("darthcrawl/tiny-tang-38m")
sys.path.insert(0, local)
from config import ModelConfig
from model import GPT

# keep only the config keys the dataclass actually defines
cfg_dict = json.loads((Path(local) / "config.json").read_text())
valid = set(ModelConfig.__dataclass_fields__)
cfg = ModelConfig(**{k: v for k, v in cfg_dict.items() if k in valid})

model = GPT(cfg).eval()
# strict=False tolerates missing/unexpected keys in the checkpoint
model.load_state_dict(load_file(f"{local}/model.safetensors"), strict=False)

tok = Tokenizer.from_file(f"{local}/tokenizer.json")
eot = tok.token_to_id("<|endoftext|>")

prompt = "[verse 1: rza]\n"
ids = torch.tensor([tok.encode(prompt).ids], dtype=torch.long)
out = model.generate(ids, max_new_tokens=160, temperature=0.9, top_k=100, eos_id=eot)
print(tok.decode(out[0].tolist()))

Try seeding with verse markers: [verse 1: gza], [hook], [chorus], [intro]. Higher temperature (0.9-1.1) reads better than the base model's 0.8.
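A quick sweep over those markers, reusing the objects from the quick start above (the marker strings and sampling settings are just starting points, not tuned values):

# sample each section marker at the suggested higher temperature
for seed in ["[verse 1: gza]\n", "[hook]\n", "[chorus]\n", "[intro]\n"]:
    ids = torch.tensor([tok.encode(seed).ids], dtype=torch.long)
    out = model.generate(ids, max_new_tokens=120, temperature=1.0, top_k=100, eos_id=eot)
    print(tok.decode(out[0].tolist()), "\n---")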

Architecture

Same as the base. Inherited verbatim during fine-tune.

Type               Decoder-only transformer
Parameters         37.8M
Layers             8
Hidden dim         512
Attention heads    8
Context length     1024
Vocab size         8192
Position encoding  RoPE
Norm               RMSNorm (pre-norm)
MLP                SwiGLU
Attention          PyTorch SDPA, causal
Embedding tying    Yes
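
The headline count can be sanity-checked from the table. One assumption: the SwiGLU hidden dim is not listed, and 4 × 512 = 2048 is inferred here because it makes the total land on 37.8M:

vocab, d, layers, ffn = 8192, 512, 8, 2048   # ffn = 4*d is an assumption

embed = vocab * d                  # ~4.19M, tied with the output head
attn  = 4 * d * d                  # q, k, v, o projections per layer
mlp   = 3 * d * ffn                # gate, up, down projections (SwiGLU)
norms = (2 * layers + 1) * d       # RMSNorm gains: 2 per block + final

total = embed + layers * (attn + mlp) + norms   # RoPE adds no parameters
print(f"{total / 1e6:.1f}M")       # 37.8M, matching the table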

Fine-tune details

Base model         darthcrawl/tiny-30m
Fine-tune corpus   OHHLA Wu-Tang archive, 159 songs / ~488K chars / ~90K words
Base pretrain mix  tinystories:60, tinystories_instruct:15, simple_wiki:15, tiny_textbooks:10
Tokens seen (FT)   477,521,740
Best ckpt step     19950 (continuation from base)
Best val loss      5.0161
Optimizer          AdamW (β = (0.9, 0.95), wd = 0.1)
LR                 Cosine tail of base schedule (low, no fresh warmup)
Batch size         32 × grad_accum 4
Precision          bfloat16 (AMP)
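
What "cosine tail, no fresh warmup" means in practice: resuming at the base's final step places the run on the flat end of the decay, so the fine-tune sees a small, slowly shrinking LR rather than a warmup spike. A minimal sketch with hypothetical constants (the real values live in train.py):

import math

max_lr, min_lr = 3e-4, 3e-5          # hypothetical; not from this card
warmup, max_steps = 500, 20450       # hypothetical base warmup + ~500 FT steps

def lr_at(step):
    if step < warmup:
        return max_lr * (step + 1) / warmup
    t = (step - warmup) / (max_steps - warmup)   # 0 -> 1 across the decay
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

print(f"{lr_at(19950):.1e}")  # resuming here: already near min_lr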

The tokenizer is unchanged from the base, so token IDs map cleanly between models.
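
A quick way to confirm this, reusing the imports from the quick start and assuming the base repo also ships its tokenizer.json at the root:

# any string should encode to identical IDs under either model's tokenizer
base = snapshot_download("darthcrawl/tiny-30m")
tok_base = Tokenizer.from_file(f"{base}/tokenizer.json")
s = "[verse 1: rza]\nshaolin shadowboxing"
assert tok.encode(s).ids == tok_base.encode(s).ids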

What it does

Outputs lyric-style text with Wu-Tang-flavored vocabulary, verse/section markers, internal rhyme attempts, and the occasional Shaolin reference. Best with a prompt that primes a verse header.

What it doesn't do

  • Coherent narrative outside the lyrics register (the fine-tune drifts the base toward this distribution).
  • Faithful imitation of any specific member's flow. The model averages over all members in the corpus.
  • Reproduction of copyrighted lyrics. Output may overlap with the corpus on common rap idioms (short n-grams shared across many songs), but it does not reproduce full verses verbatim; a rough way to check is sketched after this list.
  • Anything actually clever. It's 38M params.
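
To eyeball the overlap claim yourself, a crude check: find the longest run of consecutive words in a sample that also appears in the training text. This reuses tok and out from the quick start; the corpus path is hypothetical, so point it at whatever text you prepared:

def longest_shared_run(sample: str, corpus: str) -> int:
    # longest stretch of consecutive sample words found verbatim in the corpus
    words, hay, best = sample.lower().split(), corpus.lower(), 0
    for i in range(len(words)):
        j = i + best + 1                      # only test spans longer than best
        while j <= len(words) and " ".join(words[i:j]) in hay:
            best, j = j - i, j + 1
    return best

sample = tok.decode(out[0].tolist())
corpus = Path("data_wu/corpus.txt").read_text()   # hypothetical path
print(longest_shared_run(sample, corpus))          # a few words: idiom; dozens: memorization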

Data and copyright

Source: lyrics scraped from OHHLA (the Original Hip-Hop Lyrics Archive) via the cing/rapwords GitHub mirror. Lyrics belong to their respective writers, performers, and publishers. This model is a learned distribution over the byte-level BPE tokens that compose those lyrics, distributed for educational and research use under Apache 2.0 (the code and weights, not the underlying source text).

If you are a rights holder and want this taken down, open an issue on the upstream repo.

Limitations

No safety alignment. No instruction tuning. Outputs include profanity, slurs, and other content present in the corpus. Not for any user-facing or production application.

Reproducibility

model.py, config.py, and sample.py ship in this repo. The fine-tune was run via the upstream training pipeline:

# data prep with the SAME tokenizer as the base
python data_prep.py --mix "wu_tang:100" --tokenizer data/tokenizer.json --out-dir data_wu

# fine-tune. --reset-best forces the new best.pt to track val-on-wu;
# without it, best.pt would never update (base's TinyStories val < FT's wu val).
python train.py \
  --resume <base.pt> --data-dir data_wu \
  --reset-best --max-steps <base_step + ~500> \
  --save-interval 50 --eval-interval 100

The shipped checkpoint is latest.pt from the ~500-step FT run. best.pt (lowest val loss) typically produces more grounded but less stylistically distinctive output; latest.pt favors style at the cost of generalization.

License

Apache 2.0 for the code and weights. Underlying source lyrics: see Data and copyright above.
