---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- causal-lm
- small-lm
- gpt
- fine-tuned
- lyrics
- hip-hop
- wu-tang
- style-transfer
pipeline_tag: text-generation
---
# tiny-tang-38m
A 37.8M-parameter decoder-only transformer fine-tuned from **[darthcrawl/tiny-30m](https://huggingface.co/darthcrawl/tiny-30m)** on Wu-Tang Clan lyrics. Same architecture, same tokenizer, full-weights continuation training (no LoRA).
Fan project. Educational. Not affiliated with Wu-Tang Clan, RZA, Loud Records, or anyone in the actual catalog. Outputs riff on the corpus's surface style; they do not reproduce specific copyrighted lyrics.
## Quick start
```python
import json, sys, torch
from pathlib import Path
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
from safetensors.torch import load_file

# Download the repo and make its bundled model.py / config.py importable.
local = snapshot_download("darthcrawl/tiny-tang-38m")
sys.path.insert(0, local)
from config import ModelConfig
from model import GPT

# Build the config, dropping any config.json keys that ModelConfig does not define.
cfg_dict = json.loads((Path(local) / "config.json").read_text())
valid = set(ModelConfig.__dataclass_fields__)
cfg = ModelConfig(**{k: v for k, v in cfg_dict.items() if k in valid})

# Load the weights. strict=False ignores missing or unexpected keys instead of raising.
model = GPT(cfg).eval()
model.load_state_dict(load_file(f"{local}/model.safetensors"), strict=False)

tok = Tokenizer.from_file(f"{local}/tokenizer.json")
eot = tok.token_to_id("<|endoftext|>")

# Prime the model with a verse header, then sample a continuation.
prompt = "[verse 1: rza]\n"
ids = torch.tensor([tok.encode(prompt).ids], dtype=torch.long)
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=160, temperature=0.9, top_k=100, eos_id=eot)
print(tok.decode(out[0].tolist()))
```
Try seeding with verse markers: `[verse 1: gza]`, `[hook]`, `[chorus]`, `[intro]`. Higher temperature (0.9-1.1) reads better than the base model's 0.8.
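For intuition about what `temperature` and `top_k` control, here is a minimal standard-library sketch of temperature plus top-k sampling over a toy logit vector. It is not taken from `model.py`; the helper name `sample_top_k` and the toy scores are made up for illustration, and the shipped `generate` may differ in detail.

```python
import math
import random

def sample_top_k(logits, temperature=0.9, top_k=100, rng=random):
    """Sample one token id from raw logits using temperature and top-k filtering."""
    # Keep only the top_k highest-scoring token ids.
    k = min(top_k, len(logits))
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Divide by temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [logits[i] / temperature for i in kept]
    # Numerically stable softmax weights over the surviving candidates.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # Draw one id in proportion to its weight.
    return rng.choices(kept, weights=weights, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
rng = random.Random(0)
draws = [sample_top_k(toy_logits, temperature=0.9, top_k=3, rng=rng) for _ in range(1000)]
print({i: draws.count(i) for i in sorted(set(draws))})
```

With `top_k=3` only ids 0, 1 and 2 can ever be drawn; raising the temperature shifts more of the probability mass toward the lower-ranked survivors.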
## Architecture
Same as the base. Inherited verbatim during fine-tune.
| | |
|---|---|
| Type | Decoder-only transformer |
| Parameters | 37.8M |
| Layers | 8 |
| Hidden dim | 512 |
| Attention heads | 8 |
| Context length | 1024 |
| Vocab size | 8192 |
| Position encoding | RoPE |
| Norm | RMSNorm (pre-norm) |
| MLP | SwiGLU |
| Attention | PyTorch SDPA, causal |
| Embedding tying | Yes |
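
As a sanity check on the 37.8M figure, the table's dimensions can be turned into a rough parameter count. The sketch below assumes a SwiGLU hidden width of 2048 and no bias terms; neither is stated above, so treat both as guesses and check `config.json` for the real values. RoPE contributes no learned parameters.

```python
def count_params(vocab=8192, dim=512, layers=8, mlp_hidden=2048):
    """Count weights for a bias-free, tied-embedding decoder (mlp_hidden is assumed)."""
    embedding = vocab * dim          # shared with the output head (tied)
    attention = 4 * dim * dim        # Q, K, V and output projections
    swiglu = 3 * dim * mlp_hidden    # gate, up and down projections
    norms = 2 * dim                  # two RMSNorm gains per block
    per_layer = attention + swiglu + norms
    final_norm = dim                 # RMSNorm before the output head
    return embedding + layers * per_layer + final_norm

total = count_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
# prints: 37,757,440 parameters (~37.8M)
```

Under those assumptions the count matches the reported size, which suggests the MLP is about 4x the hidden dim.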
## Fine-tune details
| | |
|---|---|
| Base model | [darthcrawl/tiny-30m](https://huggingface.co/darthcrawl/tiny-30m) |
| Fine-tune corpus | OHHLA Wu-Tang archive, 159 songs / ~488K chars / ~90K words |
| Source spec (base model's pretraining mix; the fine-tune used `wu_tang:100`) | `tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10` |
| Tokens seen (FT) | 477,521,740 |
| Best ckpt step | 19950 (continuation from base) |
| Best val loss | 5.0161 |
| Optimizer | AdamW (β=(0.9, 0.95), wd=0.1) |
| LR | Cosine tail of base schedule (low, no fresh warmup) |
| Batch size | 32 × grad_accum 4 |
| Precision | bfloat16 (AMP) |
Tokenizer is unchanged from the base, so token IDs map cleanly between models.
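
To illustrate the "cosine tail" entry in the LR row: resuming near the end of a warmup-plus-cosine schedule means the fine-tune runs at close to the minimum learning rate. The sketch below is a generic schedule, not the upstream `train.py` code, and every constant in it (`MAX_LR`, `MIN_LR`, `WARMUP`, `TOTAL`) is a hypothetical value chosen only to show the shape of the curve.

```python
import math

def cosine_lr(step, max_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup followed by cosine decay from max_lr to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical values, chosen only to illustrate the shape of the curve.
MAX_LR, MIN_LR, WARMUP, TOTAL = 6e-4, 6e-5, 500, 20_000

for step in (0, 500, 10_000, 19_500, 19_950):
    print(step, f"{cosine_lr(step, MAX_LR, MIN_LR, WARMUP, TOTAL):.2e}")
```

Because the fine-tune resumes at a late step rather than restarting at step 0, it never revisits the warmup ramp or the high-LR part of the curve.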
## What it does
Outputs lyric-style text with Wu-Tang-flavored vocabulary, verse/section markers, internal rhyme attempts, and the occasional Shaolin reference. Best with a prompt that primes a verse header.
## What it doesn't do
- Coherent narrative outside the lyrics register (the fine-tune drifts the base toward this distribution).
- Faithful imitation of any specific member's flow. The model averages over all members in the corpus.
- Reproduction of copyrighted lyrics. Output may overlap on common rap idioms (any short n-gram from any rap song); it does not reproduce full verses verbatim.
- Anything actually clever. It's 37.8M params.
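
One way to check the verbatim-reproduction point yourself is to measure the longest word sequence a sample shares with the training text. The sketch below is an illustrative check, not something shipped in this repo; the function names are made up and the two strings are placeholders for a real sample and corpus.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def longest_shared_ngram(generated, corpus, max_n=50):
    """Length in words of the longest word sequence present in both texts."""
    longest = 0
    for n in range(1, max_n + 1):
        if ngrams(generated, n) & ngrams(corpus, n):
            longest = n
        else:
            break  # no shared n-gram of length n means none of length n + 1 either
    return longest

# Placeholder strings; substitute a model sample and the training text.
corpus = "the quick brown fox jumps over the lazy dog"
sample = "a quick brown fox sleeps under the lazy cat"
print(longest_shared_ngram(sample, corpus))  # prints 3 ("quick brown fox")
```

Short overlaps of a few words are expected from shared idioms; a long run would indicate memorized text.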
## Data and copyright
Source: lyrics scraped from [OHHLA](http://www.ohhla.com) (the Original Hip-Hop Lyrics Archive) via the [cing/rapwords](https://github.com/cing/rapwords) GitHub mirror. Lyrics belong to their respective writers, performers, and publishers. This model is a learned distribution over the byte-level BPE tokens that compose those lyrics, distributed for educational and research use under Apache 2.0 (the *code* and *weights*, not the underlying source text).
If you are a rights holder and want this taken down, open an issue on the upstream repo.
## Limitations
No safety alignment. No instruction tuning. Outputs include profanity, slurs, and other content present in the corpus. Not for any user-facing or production application.
## Reproducibility
`model.py`, `config.py`, `sample.py` ship in this repo. Fine-tune was run via the upstream training pipeline:
```bash
# Data prep with the SAME tokenizer as the base.
python data_prep.py --mix "wu_tang:100" --tokenizer data/tokenizer.json --out-dir data_wu

# Fine-tune. --reset-best forces the new best.pt to track val-on-wu;
# without it, best.pt would never update (base's TinyStories val < FT's wu val).
python train.py \
  --resume <base.pt> --data-dir data_wu \
  --reset-best --max-steps <base_step + ~500> \
  --save-interval 50 --eval-interval 100
```
The shipped checkpoint is `latest.pt` from a ~500-step FT run. `best.pt` (lowest val loss) typically produces more grounded but less stylistically distinctive output; `latest.pt` favors style at the cost of generalization.
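
The reason `--reset-best` matters can be shown with a toy version of best-checkpoint tracking. This is a sketch of the general pattern, not the upstream `train.py` logic; `track_best` and the loss values are made up for illustration.

```python
import math

def track_best(val_losses, resumed_best, reset_best):
    """Return the eval indices at which best.pt would be overwritten."""
    # Without a reset, the threshold is the best val loss stored in the resumed checkpoint.
    best = math.inf if reset_best else resumed_best
    saved_at = []
    for i, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            saved_at.append(i)
    return saved_at

# Made-up numbers: a low base-model val loss, then higher losses on the new corpus.
base_best = 1.8
ft_val = [5.6, 5.2, 5.0, 5.1]

print(track_best(ft_val, base_best, reset_best=False))  # []
print(track_best(ft_val, base_best, reset_best=True))   # [0, 1, 2]
```

Without the reset, the threshold carried over from the base run is never beaten on the new validation set, so `best.pt` stays frozen at the base weights.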
## License
Apache 2.0 for the code and weights. Underlying source lyrics: see Data and copyright above.