---
license: apache-2.0
language:
  - en
library_name: pytorch
tags:
  - causal-lm
  - small-lm
  - gpt
  - fine-tuned
  - lyrics
  - hip-hop
  - wu-tang
  - style-transfer
pipeline_tag: text-generation
---

# tiny-tang-38m

A 37.8M-parameter decoder-only transformer fine-tuned from **[darthcrawl/tiny-30m](https://huggingface.co/darthcrawl/tiny-30m)** on Wu-Tang Clan lyrics. Same architecture, same tokenizer, full-weights continuation training (no LoRA).

Fan project. Educational. Not affiliated with Wu-Tang Clan, RZA, Loud Records, or anyone in the actual catalog. Outputs riff on the corpus's surface style; they do not reproduce specific copyrighted lyrics.

## Quick start

```python
import json
import sys
from pathlib import Path

import torch
from huggingface_hub import snapshot_download
from safetensors.torch import load_file
from tokenizers import Tokenizer

# Download the repo and make its bundled config.py / model.py importable.
local = snapshot_download("darthcrawl/tiny-tang-38m")
sys.path.insert(0, local)
from config import ModelConfig
from model import GPT

# Build the config, ignoring any config.json keys ModelConfig does not define.
cfg_dict = json.loads((Path(local) / "config.json").read_text())
valid = set(ModelConfig.__dataclass_fields__)
cfg = ModelConfig(**{k: v for k, v in cfg_dict.items() if k in valid})

# strict=False tolerates missing or unexpected keys in the checkpoint.
model = GPT(cfg).eval()
model.load_state_dict(load_file(f"{local}/model.safetensors"), strict=False)

tok = Tokenizer.from_file(f"{local}/tokenizer.json")
eot = tok.token_to_id("<|endoftext|>")

# Prime generation with a section header, then sample without tracking gradients.
prompt = "[verse 1: rza]\n"
ids = torch.tensor([tok.encode(prompt).ids], dtype=torch.long)
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=160, temperature=0.9, top_k=100, eos_id=eot)
print(tok.decode(out[0].tolist()))
```

Try seeding with section markers such as `[verse 1: gza]`, `[hook]`, `[chorus]`, or `[intro]`. A higher temperature (0.9 to 1.1) reads better than the base model's 0.8.
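
To make the `temperature` and `top_k` knobs concrete, here is a minimal pure-Python sketch of temperature scaling plus top-k filtering on a toy logit vector. It is an illustration of the general technique, not the code behind `model.generate` in this repo; the function name `sample_top_k` and the toy logits are hypothetical.

```python
import math
import random


def sample_top_k(logits, temperature=0.9, top_k=100, rng=random):
    """Sample one token id from raw logits with temperature and top-k filtering.

    Returns the sampled id and the probabilities of the surviving candidates.
    """
    # Keep only the k highest-scoring token ids.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Divide by temperature: values below 1 sharpen, values above 1 flatten.
    scaled = [logits[i] / temperature for i in top_ids]
    # Numerically stable softmax over the surviving candidates.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    choice = rng.choices(top_ids, weights=probs, k=1)[0]
    return choice, dict(zip(top_ids, probs))


# Toy 4-token vocabulary (hypothetical logits).
toy_logits = [2.0, 1.0, 0.5, -1.0]
_, cool = sample_top_k(toy_logits, temperature=0.8, top_k=3)
_, warm = sample_top_k(toy_logits, temperature=1.1, top_k=3)
print(f"p(top token) at T=0.8: {cool[0]:.3f}, at T=1.1: {warm[0]:.3f}")
```

At the higher temperature the top token gets less probability mass, so lower-ranked tokens are sampled more often. That is a plausible reason a style-heavy model reads better with a bit more randomness.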

## Architecture

Same as the base. Inherited verbatim during fine-tune.

| | |
|---|---|
| Type | Decoder-only transformer |
| Parameters | 37.8M |
| Layers | 8 |
| Hidden dim | 512 |
| Attention heads | 8 |
| Context length | 1024 |
| Vocab size | 8192 |
| Position encoding | RoPE |
| Norm | RMSNorm (pre-norm) |
| MLP | SwiGLU |
| Attention | PyTorch SDPA, causal |
| Embedding tying | Yes |
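
The 37.8M figure can be sanity-checked from the table. The sketch below assumes two things the table does not state: a SwiGLU hidden width of 2048 (4 × the hidden dim) and no bias terms in any linear layer. RoPE adds no learned parameters, and the tied embedding is counted once. The helper `count_params` is illustrative and not part of the repo.

```python
def count_params(vocab, d_model, n_layers, d_ff, tied=True):
    """Rough parameter count for a pre-norm decoder-only transformer."""
    embed = vocab * d_model            # token embedding (shared with the output head when tied)
    attn = 4 * d_model * d_model       # Q, K, V and output projections, no biases (assumption)
    mlp = 3 * d_model * d_ff           # SwiGLU: gate, up and down projections
    norms = 2 * d_model                # two RMSNorm weight vectors per block
    per_layer = attn + mlp + norms
    total = embed + n_layers * per_layer + d_model  # + final RMSNorm
    if not tied:
        total += vocab * d_model       # separate output head
    return total


# d_ff=2048 is an assumption; the other values come from the table above.
total = count_params(vocab=8192, d_model=512, n_layers=8, d_ff=2048)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
# 37,757,440 parameters (~37.8M)
```

Under those assumptions the count lands on the stated 37.8M, which suggests the MLP width guess is close.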

## Fine-tune details

| | |
|---|---|
| Base model | [darthcrawl/tiny-30m](https://huggingface.co/darthcrawl/tiny-30m) |
| Fine-tune corpus | OHHLA Wu-Tang archive, 159 songs / ~488K chars / ~90K words |
| Source spec (base pretraining) | `tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10` |
| Source spec (fine-tune) | `wu_tang:100` |
| Tokens seen (FT) | 477,521,740 |
| Best ckpt step | 19950 (continuation from base) |
| Best val loss | 5.0161 |
| Optimizer | AdamW (β=(0.9, 0.95), wd=0.1) |
| LR | Cosine tail of base schedule (low, no fresh warmup) |
| Batch size | 32 × grad_accum 4 |
| Precision | bfloat16 (AMP) |
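
The "cosine tail" row can be pictured with a generic warmup-plus-cosine schedule: resuming near the end of the decay means the learning rate is already close to its floor, so no new warmup is needed. This is a sketch of the common recipe, not the upstream `train.py` logic, and every number in it (peak and minimum LR, warmup length, step counts) is a hypothetical value chosen for illustration.

```python
import math


def cosine_lr(step, max_steps, warmup_steps, peak_lr, min_lr):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical schedule parameters, for illustration only.
PEAK_LR, MIN_LR = 6e-4, 6e-5
WARMUP, MAX_STEPS = 500, 20_000
RESUME_STEP = 19_500  # hypothetical point where a fine-tune would pick up

lr_mid = cosine_lr(10_250, MAX_STEPS, WARMUP, PEAK_LR, MIN_LR)
lr_at_resume = cosine_lr(RESUME_STEP, MAX_STEPS, WARMUP, PEAK_LR, MIN_LR)
print(f"mid-run LR: {lr_mid:.2e}, LR at resume: {lr_at_resume:.2e}")
```

With these made-up numbers the LR at the resume point sits within a few percent of the minimum, which is the sense in which the fine-tune runs at a "low" rate.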

Tokenizer is unchanged from the base, so token IDs map cleanly between models.

## What it does

Outputs lyric-style text with Wu-Tang-flavored vocabulary, verse/section markers, internal rhyme attempts, and the occasional Shaolin reference. Best with a prompt that primes a verse header.

## What it doesn't do

- Coherent narrative outside the lyrics register (fine-tuning shifts the base model toward the lyrics distribution and away from its original prose).
- Faithful imitation of any specific member's flow. The model averages over all members in the corpus.
- Reproduction of copyrighted lyrics. Output may overlap on common rap idioms (any short n-gram from any rap song); it does not reproduce full verses verbatim.
- Anything actually clever. It's 37.8M params.
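
One way to check the verbatim-overlap point on your own samples is to measure the longest run of consecutive words that a generated text shares with the training corpus. The sketch below is a hypothetical helper (not shipped in this repo) using a standard longest-common-substring dynamic program over word lists; the example strings are neutral placeholders, not lyrics.

```python
def longest_shared_ngram(generated: str, corpus: str) -> int:
    """Length in words of the longest word sequence found in both texts."""
    gen = generated.lower().split()
    ref = corpus.lower().split()
    best = 0
    # prev[j] = length of the shared run ending at the previous generated word and ref[j-1].
    prev = [0] * (len(ref) + 1)
    for g in gen:
        cur = [0] * (len(ref) + 1)
        for j, r in enumerate(ref, start=1):
            if g == r:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


# Placeholder texts for illustration.
corpus_text = "the quick brown fox jumps over the lazy dog"
sample_text = "a quick brown fox sleeps near the lazy cat"
print(longest_shared_ngram(sample_text, corpus_text))
# 3
```

Short shared runs are expected from common idioms; a long run (a full line or more) would be a sign of memorization worth inspecting.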

## Data and copyright

Source: lyrics scraped from [OHHLA](http://www.ohhla.com) (the Original Hip-Hop Lyrics Archive) via the [cing/rapwords](https://github.com/cing/rapwords) GitHub mirror. Lyrics belong to their respective writers, performers, and publishers. This model is a learned distribution over the byte-level BPE tokens that compose those lyrics, distributed for educational and research use under Apache 2.0 (the *code* and *weights*, not the underlying source text).

If you are a rights holder and want this taken down, open an issue on the upstream repo.

## Limitations

No safety alignment. No instruction tuning. Outputs include profanity, slurs, and other content present in the corpus. Not for any user-facing or production application.

## Reproducibility

`model.py`, `config.py`, `sample.py` ship in this repo. Fine-tune was run via the upstream training pipeline:

```bash
# data prep with the SAME tokenizer as the base
python data_prep.py --mix "wu_tang:100" --tokenizer data/tokenizer.json --out-dir data_wu

# fine-tune. --reset-best forces the new best.pt to track val-on-wu;
# without it, best.pt would never update (base's TinyStories val < FT's wu val).
python train.py \
  --resume <base.pt> --data-dir data_wu \
  --reset-best --max-steps <base_step + ~500> \
  --save-interval 50 --eval-interval 100
```
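
The `--reset-best` comment above comes down to a simple comparison. The sketch below uses a hypothetical `BestTracker` class (not the upstream `train.py` code) to show why a best-loss threshold carried over from the base run blocks every save. The base loss and the first two fine-tune losses are made-up numbers; only 5.0161 comes from the table in this card.

```python
import math


class BestTracker:
    """Tracks the lowest validation loss seen and says when to save best.pt."""

    def __init__(self, best_val=math.inf):
        self.best_val = best_val

    def update(self, val_loss):
        """Return True (save a new best.pt) only if val_loss beats the record."""
        if val_loss < self.best_val:
            self.best_val = val_loss
            return True
        return False


BASE_BEST = 1.5                 # hypothetical base val loss on its own data
FT_VALS = [5.6, 5.2, 5.0161]    # first two values are hypothetical

# Without a reset: the base record is far below any loss on the lyrics val set.
carried = BestTracker(best_val=BASE_BEST)
saves_without_reset = [carried.update(v) for v in FT_VALS]

# With a reset: the record starts at infinity, so improvements are tracked.
fresh = BestTracker()
saves_with_reset = [fresh.update(v) for v in FT_VALS]

print(saves_without_reset, saves_with_reset)
```

The two loss values are measured on different validation sets, so comparing them directly is meaningless; resetting the record is what makes `best.pt` track the fine-tune.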

The shipped checkpoint is `latest.pt` from a ~500-step FT run. `best.pt` (lowest val loss) typically produces more grounded but less stylistically distinctive output; `latest.pt` favors style at the cost of generalization.

## License

Apache 2.0 for the code and weights. Underlying source lyrics: see Data and copyright above.