TheArtist Music Transformer - F4 (Pop 1K Mix, jazz-leaning)
Jazz-adapted chord model with a 1,000-sequence pop rehearsal buffer. The jazz-leaning endpoint of the mix-ratio sweep. Highest jazz top-1 in the collection (81.50%), at the cost of 1.22 percentage points of pop top-1 accuracy.
One of six checkpoints released alongside the paper Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation (Lee, 2026). See the collection overview at PearlLeeStudio/TheArtist-MusicTransformer-pop-baseline.
Demo
Watch TheArtist in action on YouTube: interactive staff editor, MIDI input, AI generation with live progress, and per-genre LoRA playback across the 13-genre vocabulary.
Model summary
| Field | Value |
|---|---|
| Architecture | Music Transformer with relative positional attention |
| Parameters | 25,661,440 |
| Vocabulary size | 351 tokens |
| Max sequence length | 256 |
| d_model / heads / FFN / layers | 512 / 8 / 2048 / 8 |
| Fine-tune resumed from | Phase 0 pop baseline |
| Best epoch | 6 |
Training data
All 1,513 jazz training sequences plus 1,000 pop rehearsal sequences (seed 42). Pop:jazz ≈ 0.66:1, that is, less pop than jazz in the mix.
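The ratio follows directly from the two sequence counts. A quick check, with variable names chosen here purely for illustration:

```python
# Sequence counts stated above for the F4 training mix.
n_jazz = 1513  # all jazz training sequences
n_pop = 1000   # pop rehearsal buffer (seed 42)

pop_to_jazz = n_pop / n_jazz   # pop sequences per jazz sequence
total_sequences = n_pop + n_jazz

print(f"pop:jazz = {pop_to_jazz:.2f}:1 over {total_sequences} sequences")
```

This counts sequences only; it says nothing about token-level proportions, which depend on sequence lengths.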
Evaluation (held-out per-genre test sets)
| Metric | Pop test | Jazz test |
|---|---|---|
| Top-1 accuracy | 83.02% | 81.50% |
| Top-5 accuracy | 96.93% | 92.59% |
| Perplexity | 1.81 | 2.26 |
| Δ vs. Phase 0 baseline | −1.22 | +8.64 |
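The table can be read back into two implied quantities. The sketch below rests on two assumptions that the card does not state explicitly: that the delta row is F4 top-1 minus Phase 0 top-1 in percentage points, and that perplexity is the exponential of the mean per-token cross-entropy in nats (the usual convention).

```python
import math

# Values copied from the evaluation table above.
top1 = {"pop": 83.02, "jazz": 81.50}
delta_vs_phase0 = {"pop": -1.22, "jazz": 8.64}
perplexity = {"pop": 1.81, "jazz": 2.26}

# Assumption: delta = F4 top-1 minus Phase 0 top-1, in percentage points.
implied_baseline = {g: top1[g] - delta_vs_phase0[g] for g in top1}
# Assumption: perplexity = exp(mean cross-entropy in nats).
implied_loss = {g: math.log(perplexity[g]) for g in perplexity}

for genre in top1:
    print(f"{genre}: implied Phase 0 top-1 = {implied_baseline[genre]:.2f}%, "
          f"implied cross-entropy = {implied_loss[genre]:.3f} nats")
```

If either assumption does not match the paper's definitions, the implied values do not apply.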
F4 is the jazz-leaning endpoint of the mix-ratio sweep. It produces the most jazz-flavoured continuations among the released checkpoints, with secondary dominants, tritone substitutions, modal interchange, and II-V chains across distant keys. The cost is roughly one point of pop top-1 accuracy. Qualitative samples (paper §6.4) on a minor ii-V prompt show the bebop-style harmonic motion that this checkpoint commits to more strongly than F3.
Intended use
Recommended for jazz-flavoured chord composition where the user is willing to trade some pop fluency for stronger jazz identity. F3 (ft-pop50) is the balanced alternative; F1 (ft-pop80) is the symmetric pop-leaning endpoint.
Out of scope: melody or audio generation; genres outside pop, rock, and jazz; real-time low-latency settings.
Usage
The repo bundles the project's model.py and tokenizer.py at the repo
root, so external users can load the checkpoint end-to-end without
cloning anything from GitHub. snapshot_download materializes the full
repo on disk; sys.path makes the bundled model.py / tokenizer.py
importable.
Required dependencies: torch, huggingface_hub.
import sys

import torch
from huggingface_hub import snapshot_download

# Download the full repo (model.py, tokenizer.py, best.pt, config.json).
ckpt_dir = snapshot_download(repo_id="PearlLeeStudio/TheArtist-MusicTransformer-ft-pop29")
sys.path.insert(0, ckpt_dir)  # so the next two imports resolve

from model import MusicTransformer
from tokenizer import ChordTokenizer

tokenizer = ChordTokenizer()
ckpt = torch.load(f"{ckpt_dir}/best.pt", map_location="cpu", weights_only=False)
model = MusicTransformer(
    vocab_size=tokenizer.vocab_size,
    d_model=512, n_heads=8, d_ff=2048, n_layers=8,
    max_seq_len=256, dropout=0.0, pad_id=tokenizer.pad_id,
)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Prompt = ii-V-I in C major; ask for a jazz-flavoured continuation.
song = {
    "key": "Cmaj", "time_signature": "4/4", "genre": "jazz",
    "bars": [["Dm7", "G7"], ["Cmaj7"]],
}
prompt_ids = tokenizer.encode_sequence(song)[:-1]  # drop the final token so the model continues
ids = torch.tensor([prompt_ids])

with torch.no_grad():
    for _ in range(32):  # generate at most 32 new tokens
        logits = model(ids)
        # Sample the next token from the last position at temperature 0.8.
        next_id = torch.multinomial(
            torch.softmax(logits[:, -1, :] / 0.8, dim=-1), 1,
        )
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_id:
            break

print(tokenizer.decode(ids[0].tolist()))
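The sampling loop divides the last-position logits by 0.8 before the softmax. The dependency-free sketch below shows what that temperature does to a distribution; the logits are made-up values, not model output, and the helper name is illustrative rather than part of the bundled code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed stably."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up values, not model output
plain = softmax_with_temperature(toy_logits, 1.0)
sharper = softmax_with_temperature(toy_logits, 0.8)

print([round(p, 3) for p in plain])
print([round(p, 3) for p in sharper])
```

A temperature below 1 moves probability mass toward the highest-scoring token, so 0.8 gives somewhat more conservative continuations than sampling at 1.0 while still allowing variety.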
For per-genre adaptation beyond pop and jazz, see the 11 LoRA adapter repos at PearlLeeStudio; they chain on top of this base.
Per-genre real-song eval (held-out 130-song set, 2026-05)
First per-genre evaluation of ft-pop29 beyond the pop/jazz split that the original paper reports.
Eval results
| Genre | n_songs | Top-1 (%) | Top-5 (%) | val_loss |
|---|---|---|---|---|
| pop | 10 | 85.57 | 95.91 | 0.5859 |
| rock | 10 | 87.28 | 97.50 | 0.4677 |
| jazz | 10 | 71.42 | 85.54 | 1.3367 |
| blues | 10 | 81.99 | 93.86 | 0.7970 |
| bossa | 10 | 82.62 | 95.73 | 0.7226 |
| classical | 10 | 49.65 | 81.51 | 2.1079 |
| country | 10 | 86.30 | 98.18 | 0.5191 |
| electronic | 10 | 86.81 | 98.48 | 0.5100 |
| folk | 10 | 84.81 | 98.63 | 0.5337 |
| funk | 10 | 83.39 | 96.13 | 0.6959 |
| gospel | 10 | 80.25 | 96.58 | 0.7343 |
| hip_hop | 10 | 90.34 | 98.58 | 0.3982 |
| rnb_soul | 10 | 85.04 | 97.04 | 0.5907 |
On this eval set F4 peaks on hip_hop (90.34%) and struggles most on classical (49.65%).
This is auxiliary signal: the 11 per-genre LoRAs (sister lora-* repos) are the recommended path for production use on the 9 non-pop, non-jazz genres. F-series cells on those genres show what the base model produces under [GENRE:none] conditioning (the model's [GENRE:X] token does not exist for the 9 new genres in the F-series vocab=351).
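The per-genre table can be summarised with a few lines of Python. The unweighted mean across genres below is a derived figure, not a number reported in the card, and it ignores differences in song length between genres.

```python
# Top-1 accuracy (%) per genre, copied from the eval results table above.
top1_by_genre = {
    "pop": 85.57, "rock": 87.28, "jazz": 71.42, "blues": 81.99,
    "bossa": 82.62, "classical": 49.65, "country": 86.30,
    "electronic": 86.81, "folk": 84.81, "funk": 83.39,
    "gospel": 80.25, "hip_hop": 90.34, "rnb_soul": 85.04,
}

best = max(top1_by_genre, key=top1_by_genre.get)
worst = min(top1_by_genre, key=top1_by_genre.get)
# Unweighted (macro) mean: every genre counts equally.
macro_mean = sum(top1_by_genre.values()) / len(top1_by_genre)

print(f"best: {best}, worst: {worst}, macro mean top-1: {macro_mean:.2f}%")
```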
Eval dataset composition
130 songs total, 10 per genre × 13 genres. Drawn from the same splits/val.jsonl + splits/test.jsonl partitions every F-series model was held out from during training, so there is no train-set leakage. Built by ai/training/build_eval_real_songs.py --seed 42 --per-genre 10 (deterministic).
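A seeded per-genre draw of this kind can be sketched as follows. This is not the code of build_eval_real_songs.py, whose internals are not shown here; the function name, the record fields ("id", "genre"), and the toy pool are all hypothetical.

```python
import random
from collections import defaultdict

def sample_per_genre(records, per_genre=10, seed=42):
    """Pick `per_genre` records from each genre, reproducibly."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    by_genre = defaultdict(list)
    for record in records:
        by_genre[record["genre"]].append(record)
    chosen = []
    for genre in sorted(by_genre):  # fixed genre order keeps the draw deterministic
        pool = by_genre[genre]
        chosen.extend(rng.sample(pool, min(per_genre, len(pool))))
    return chosen

# Toy held-out pool: 3 genres x 25 hypothetical records.
toy_pool = [
    {"id": f"{genre}-{i}", "genre": genre}
    for genre in ("pop", "jazz", "rock")
    for i in range(25)
]
subset = sample_per_genre(toy_pool)
print(len(subset))
```

Using the same seed and the same input order yields the same subset on every run, which is the property the "deterministic" note refers to.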
| Genre | n | Source(s) | Bar range | Avg duration · named |
|---|---|---|---|---|
| pop | 10 | billboard | 58–116 | 189s · 10/10 named |
| rock | 10 | chordonomicon_rock | 52–87 | 127s · 0/10 named |
| jazz | 10 | choco:jazz-corpus, choco:real-book, jazzstandards, jht | 16–89 | 72s · 10/10 named |
| blues | 10 | chordonomicon_blues | 24–46 | 93s · 0/10 named |
| bossa | 10 | chordonomicon_bossa | 24–78 | 88s · 0/10 named |
| classical | 10 | chordonomicon_classical | 11–40 | 60s · 10/10 named |
| country | 10 | chordonomicon_country | 30–81 | 110s · 0/10 named |
| electronic | 10 | chordonomicon_electronic | 25–84 | 89s · 0/10 named |
| folk | 10 | chordonomicon_folk | 33–82 | 114s · 0/10 named |
| funk | 10 | chordonomicon_funk | 30–60 | 92s · 0/10 named |
| gospel | 10 | chordonomicon_gospel | 24–85 | 98s · 0/10 named |
| hip_hop | 10 | chordonomicon_hip_hop | 24–81 | 136s · 0/10 named |
| rnb_soul | 10 | chordonomicon_rnb_soul | 34–82 | 128s · 0/10 named |
Source license summary: McGill Billboard (CC0, named pop songs), Jazz Harmony Treebank / JazzStandards / WJazzD (Public / community-redistributed, named jazz standards), Bach chorales via music21 (public domain, named pieces), Chordonomicon per-genre subsets (CC BY-NC 4.0; titles are Spotify track IDs by upstream dataset policy, but the progressions come from real songs). See docs/EVAL.md for full breakdown.
Methodology
Teacher-forced next-token cross-entropy / top-1 / top-5 over each song's token sequence (BOS + key + time_sig + genre + bars + EOS, truncated to max_seq_len=256). Same evaluate() call as ai/results/f1_per_genre_baseline.csv, just narrowed to the curated 130-song subset. Token-level metrics; not a generation-quality eval (free-generation comparison with R1 Sethares + R2 theory RAG rerank is documented separately in ai/results/eval_report.md).
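The metrics named above can be sketched for a single sequence as follows. This is an illustration of teacher-forced cross-entropy and top-k accuracy in plain Python, not the project's evaluate() function; the helper name and the toy logits are assumptions made for the example.

```python
import math

def teacher_forced_metrics(logits, targets, k=5):
    """Mean cross-entropy (nats), top-1 and top-k accuracy over one sequence.

    logits[t] holds the scores for the next token given the true tokens up
    to position t (teacher forcing); targets[t] is that true next token id.
    """
    total_loss, top1_hits, topk_hits = 0.0, 0, 0
    for scores, target in zip(logits, targets):
        peak = max(scores)
        # Stable log-sum-exp, so the loss is -log softmax(scores)[target].
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total_loss += log_norm - scores[target]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        top1_hits += ranked[0] == target
        topk_hits += target in ranked[:k]
    n = len(targets)
    return total_loss / n, top1_hits / n, topk_hits / n

# Toy 4-token vocabulary and 3 positions; values are made up.
toy_logits = [[3.0, 1.0, 0.0, -1.0], [0.5, 2.0, 1.5, 0.0], [0.0, 0.1, 0.2, 4.0]]
toy_targets = [0, 2, 3]
loss, top1, top2 = teacher_forced_metrics(toy_logits, toy_targets, k=2)
print(f"loss={loss:.3f} top-1={top1:.2f} top-2={top2:.2f}")
```

Because every prediction is conditioned on the ground-truth prefix, these numbers measure next-token fit rather than the quality of freely generated progressions, which matches the distinction drawn above.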
Caveats:
- The classical val partition is intrinsically small (37 sequences in the full eval); the 10-song subset here has even wider confidence bands. The directional finding (LoRA helps a lot on Bach harmony) is robust, but exact pp deltas are noisy.
- F-series numbers on the 9 LoRA-only genres are conditioned without a genre tag (vocab=351 has no [GENRE:country] token, etc.). This is the realistic "F-series alone" condition, not a controlled ablation.
Source CSV: ai/results/real_song_eval.csv (17 models × 130 songs, long format).
Training-data licenses
| Dataset | License |
|---|---|
| Chordonomicon | Public (user-generated) |
| McGill Billboard | CC0 |
| Jazz Harmony Treebank | Public |
| JazzStandards (iReal Pro) | Community redistribution |
| Weimar Jazz Database | ODbL |
| JAAH | Research-use public |
Citation
Cite the original mix-ratio paper. The companion per-genre LoRA paper (chord-symbol time-series adaptation) is now on arXiv: arXiv:2606.07334.
@misc{lee2026chordmix,
title = {Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation},
author = {Lee, Jinju},
year = {2026},
eprint = {2605.04998},
archivePrefix = {arXiv}
}
@misc{lee2026chordtimeseries,
title = {How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity?},
author = {Lee, Jinju},
year = {2026},
eprint = {2606.07334},
archivePrefix = {arXiv}
}