	---
	license: apache-2.0
	language:
	- en
	library_name: pytorch
	tags:
	- causal-lm
	- small-lm
	- gpt
	- fine-tuned
	- lyrics
	- hip-hop
	- wu-tang
	- style-transfer
	pipeline_tag: text-generation
	---

	# tiny-tang-38m

	A 37.8M-parameter decoder-only transformer fine-tuned from [darthcrawl/tiny-30m](https://huggingface.co/darthcrawl/tiny-30m) on Wu-Tang Clan lyrics. Same architecture, same tokenizer, full-weights continuation training (no LoRA).

	Fan project. Educational. Not affiliated with Wu-Tang Clan, RZA, Loud Records, or anyone in the actual catalog. Outputs riff on the corpus's surface style; they do not reproduce specific copyrighted lyrics.

	## Quick start

	```python
	import json, sys, torch
	from pathlib import Path
	from huggingface_hub import snapshot_download
	from tokenizers import Tokenizer
	from safetensors.torch import load_file

	local = snapshot_download("darthcrawl/tiny-tang-38m")
	sys.path.insert(0, local)
	from config import ModelConfig
	from model import GPT

	cfg_dict = json.loads((Path(local) / "config.json").read_text())
	valid = {f for f in ModelConfig.__dataclass_fields__}
	cfg = ModelConfig(**{k: v for k, v in cfg_dict.items() if k in valid})

	model = GPT(cfg).eval()
	model.load_state_dict(load_file(f"{local}/model.safetensors"), strict=False)

	tok = Tokenizer.from_file(f"{local}/tokenizer.json")
	eot = tok.token_to_id("<|endoftext|>")

	prompt = "[verse 1: rza]\n"
	ids = torch.tensor([tok.encode(prompt).ids], dtype=torch.long)
	out = model.generate(ids, max_new_tokens=160, temperature=0.9, top_k=100, eos_id=eot)
	print(tok.decode(out[0].tolist()))
	```

	Try seeding with verse markers: `[verse 1: gza]`, `[hook]`, `[chorus]`, `[intro]`. A higher temperature (0.9-1.1) reads better here than the 0.8 used with the base model.
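
For intuition on what `temperature` and `top_k` control, here is a minimal plain-Python sketch of standard temperature plus top-k sampling. It is an illustration of the general technique, not the code in this repo's `model.generate`, which may differ in detail; `sample_top_k` and the toy logits are hypothetical names and values.

```python
import math
import random

def sample_top_k(logits, temperature=0.9, top_k=100, rng=random):
    """Sample one token id from raw logits using temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the top_k highest-scoring token ids.
    k = min(top_k, len(logits))
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [logits[i] / temperature for i in kept]
    # Numerically stable softmax weights over the surviving logits.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]

# Toy 5-token vocabulary (made up); with top_k=2 only ids 3 and 1 can be drawn.
toy_logits = [0.1, 2.0, -1.0, 3.5, 0.0]
draws = {sample_top_k(toy_logits, temperature=1.0, top_k=2) for _ in range(200)}
print(draws)
```

Raising the temperature gives lower-ranked tokens inside the top-k set more probability mass, which is one plausible reason the lyrics register reads better at 0.9-1.1.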

	## Architecture

	Same as the base. Inherited verbatim during fine-tune.

	| | |
	|---|---|
	| Type | Decoder-only transformer |
	| Parameters | 37.8M |
	| Layers | 8 |
	| Hidden dim | 512 |
	| Attention heads | 8 |
	| Context length | 1024 |
	| Vocab size | 8192 |
	| Position encoding | RoPE |
	| Norm | RMSNorm (pre-norm) |
	| MLP | SwiGLU |
	| Attention | PyTorch SDPA, causal |
	| Embedding tying | Yes |
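
The 37.8M figure can be sanity-checked from the table with a rough count. This sketch makes two assumptions that are not stated in the card: the SwiGLU hidden width is 2048 (4x the hidden dim) and no layer uses bias terms. RoPE adds no learned parameters, and the tied embedding is counted once. `count_params` is a hypothetical helper, not part of `model.py`.

```python
def count_params(vocab=8192, d_model=512, n_layers=8, mlp_hidden=2048):
    """Weight count for a bias-free pre-norm decoder with tied embeddings.

    mlp_hidden=2048 is an assumption; the card does not state the MLP width.
    """
    embedding = vocab * d_model           # shared with the output head (tied)
    attention = 4 * d_model * d_model     # Q, K, V and output projections
    swiglu = 3 * d_model * mlp_hidden     # gate, up and down projections
    norms = 2 * d_model                   # two RMSNorm gain vectors per block
    per_layer = attention + swiglu + norms
    final_norm = d_model                  # one RMSNorm before the output head
    return embedding + n_layers * per_layer + final_norm

total = count_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
# prints: 37,757,440 parameters (~37.8M)
```

Under those assumptions the count lands on the stated size; for the exact layout, check `config.json` and `model.py`.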

	## Fine-tune details

	| | |
	|---|---|
	| Base model | [darthcrawl/tiny-30m](https://huggingface.co/darthcrawl/tiny-30m) |
	| Fine-tune corpus | OHHLA Wu-Tang archive, 159 songs / ~488K chars / ~90K words |
	| Source spec | `tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10` |
	| Tokens seen (FT) | 477,521,740 |
	| Best ckpt step | 19950 (continuation from base) |
	| Best val loss | 5.0161 |
	| Optimizer | AdamW (β=(0.9, 0.95), wd=0.1) |
	| LR | Cosine tail of base schedule (low, no fresh warmup) |
	| Batch size | 32 × grad_accum 4 |
	| Precision | bfloat16 (AMP) |

	Tokenizer is unchanged from the base, so token IDs map cleanly between models.
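
To make the batch-size row concrete, a small sketch of the tokens consumed per optimizer step. It assumes the batch size counts sequences and that every sequence is packed to the full 1024-token context; neither is confirmed by the card, and `tokens_per_step` is a hypothetical helper.

```python
BATCH_SIZE = 32        # sequences per micro-batch (from the table)
GRAD_ACCUM = 4         # micro-batches accumulated per optimizer step (from the table)
CONTEXT_LENGTH = 1024  # tokens per sequence, assuming full-length packing

def tokens_per_step(batch_size=BATCH_SIZE, grad_accum=GRAD_ACCUM, context=CONTEXT_LENGTH):
    """Tokens contributing to one optimizer update under the assumptions above."""
    return batch_size * grad_accum * context

print(tokens_per_step())  # 131072
```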

	## What it does

	Outputs lyric-style text with Wu-Tang-flavored vocabulary, verse/section markers, internal rhyme attempts, and the occasional Shaolin reference. Best with a prompt that primes a verse header.

	## What it doesn't do

	- Coherent narrative outside the lyrics register (the fine-tune drifts the base toward this distribution).
	- Faithful imitation of any specific member's flow. The model averages over all members in the corpus.
	- Reproduction of copyrighted lyrics. Output may overlap on common rap idioms (any short n-gram from any rap song); it does not reproduce full verses verbatim.
	- Anything actually clever. It's 37.8M params.
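
If you want to check the verbatim-overlap point on your own samples, one simple approach is to measure the longest run of words a generation shares with a reference text. The sketch below is a hypothetical helper, not something shipped in this repo, and the example strings are neutral placeholders rather than lyrics.

```python
def longest_shared_ngram(generated, reference):
    """Length in words of the longest word sequence found in both texts."""
    gen = generated.lower().split()
    ref = reference.lower().split()
    best = 0
    # Longest-common-substring dynamic program over words, one row at a time.
    prev = [0] * (len(ref) + 1)
    for g in gen:
        curr = [0] * (len(ref) + 1)
        for j, r in enumerate(ref, start=1):
            if g == r:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

# Placeholder texts; swap in a model sample and a lyric from your own copy of the corpus.
reference = "the quick brown fox jumps over the lazy dog"
generated = "a quick brown fox sleeps over the hill"
print(longest_shared_ngram(generated, reference))  # 3
```

Short shared runs are expected from common idioms; a long run would be worth a closer look. The cutoff between the two is a judgment call.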

	## Data and copyright

	Source: lyrics scraped from [OHHLA](http://www.ohhla.com) (the Original Hip-Hop Lyrics Archive) via the [cing/rapwords](https://github.com/cing/rapwords) GitHub mirror. Lyrics belong to their respective writers, performers, and publishers. This model is a learned distribution over the byte-level BPE tokens that compose those lyrics, distributed for educational and research use under Apache 2.0 (the code and weights, not the underlying source text).

	If you are a rights holder and want this taken down, open an issue on the upstream repo.

	## Limitations

	No safety alignment. No instruction tuning. Outputs include profanity, slurs, and other content present in the corpus. Not for any user-facing or production application.

	## Reproducibility

	`model.py`, `config.py`, `sample.py` ship in this repo. Fine-tune was run via the upstream training pipeline:

	```bash
	# data prep with the SAME tokenizer as the base
	python data_prep.py --mix "wu_tang:100" --tokenizer data/tokenizer.json --out-dir data_wu

	# fine-tune. --reset-best forces the new best.pt to track val-on-wu;
	# without it, best.pt would never update (base's TinyStories val < FT's wu val).
	python train.py \
	--resume <base.pt> --data-dir data_wu \
	--reset-best --max-steps <base_step + ~500> \
	--save-interval 50 --eval-interval 100
	```

	The shipped checkpoint is `latest.pt` from a ~500-step fine-tuning run. `best.pt` (lowest val loss) typically produces more grounded but less stylistically distinctive output; `latest.pt` favors style at the cost of generalization.
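
Why `--reset-best` matters, as a minimal sketch: if the tracker resumes with the base run's best val loss, every fine-tune val loss on the new data compares as worse and `best.pt` is never rewritten. `BestCheckpointTracker` and all loss values below are hypothetical; the actual logic in `train.py` is not shown in this card.

```python
class BestCheckpointTracker:
    """Tracks the lowest validation loss seen and reports when to overwrite best.pt."""

    def __init__(self, best_val=float("inf")):
        self.best_val = best_val

    def update(self, val_loss):
        """Return True if val_loss is a new best, meaning best.pt should be saved."""
        if val_loss < self.best_val:
            self.best_val = val_loss
            return True
        return False

BASE_BEST_VAL = 1.5              # hypothetical base val loss on its own val set
ft_val_losses = [5.6, 5.3, 5.1]  # hypothetical fine-tune val losses on the lyrics set

# Resuming without a reset: the inherited best is never beaten.
resumed = BestCheckpointTracker(best_val=BASE_BEST_VAL)
saved_without_reset = [resumed.update(v) for v in ft_val_losses]

# With a reset, the tracker starts from infinity and follows the new val set.
fresh = BestCheckpointTracker()
saved_with_reset = [fresh.update(v) for v in ft_val_losses]

print(saved_without_reset)  # [False, False, False]
print(saved_with_reset)     # [True, True, True]
```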

	## License

	Apache 2.0 for the code and weights. Underlying source lyrics: see Data and copyright above.