---
language: en
tags:
  - audio
  - music
  - guitar
  - chart-generation
  - transformer
  - encoder-decoder
license: mit
---

# Tab Hero — ChartTransformer

An encoder-decoder transformer that generates guitar/bass charts from audio. Given a mel spectrogram, the model autoregressively produces a sequence of note tokens compatible with Clone Hero (.mid + song.ini).

## Model Description

| Property | Value |
| --- | --- |
| Architecture | Encoder-decoder transformer |
| Parameters | ~150M (Large config) |
| Audio input | Mel spectrogram (22050 Hz, 128 mels, hop=256) |
| Output | Note token sequence |
| Vocabulary size | 740 tokens |
| Training precision | bf16-mixed |
| Best validation loss | 0.1085 |
| Training duration | 65 epochs (195,030 steps) |

## Architecture

The audio encoder projects mel frames through a linear layer, then a Conv1D stack with 4× temporal downsampling (~46 ms per encoded frame). The decoder is a causal transformer with:

- RoPE positional encoding (enables generation beyond the training length)
- Flash Attention 2 via `scaled_dot_product_attention`
- SwiGLU feed-forward networks
- Difficulty and instrument conditioning embeddings
- Weight-tied input/output embeddings

Full architecture details: `docs/architecture.md`
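For reference, the SwiGLU feed-forward block listed above can be sketched as follows. This is a minimal illustration, not the repo's actual module (names and layer details are assumptions); the dimensions match the Large config (`decoder_dim=768`, `ffn_dim=3072`).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: down( SiLU(gate(x)) * up(x) )."""

    def __init__(self, dim: int, ffn_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, ffn_dim, bias=False)
        self.up = nn.Linear(dim, ffn_dim, bias=False)
        self.down = nn.Linear(ffn_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gated activation: SiLU gate modulates the up-projection.
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = SwiGLU(dim=768, ffn_dim=3072)
y = ffn(torch.randn(1, 16, 768))  # shape preserved: (batch, seq, dim)
```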

## Tokenization

Each note is a 4-token quad: `[TIME_DELTA] [FRET_COMBINATION] [MODIFIER] [DURATION]`

| Range | Type | Count | Description |
| --- | --- | --- | --- |
| 0 | PAD | 1 | Padding |
| 1 | BOS | 1 | Beginning of sequence |
| 2 | EOS | 1 | End of sequence |
| 3–503 | TIME_DELTA | 501 | Time since previous note (10 ms bins, 0–5000 ms) |
| 504–630 | FRET | 127 | All non-empty subsets of 7 frets |
| 631–638 | MODIFIER | 8 | HOPO / TAP / Star Power combinations |
| 639–739 | DURATION | 101 | Sustain length (50 ms bins, 0–5000 ms) |
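To make the vocabulary layout concrete, here is a hedged sketch of how a note could be encoded into its 4-token quad. The exact bin rounding and fret-bitmask convention are assumptions; the canonical mapping lives in `tab_hero.data.tokenizer.ChartTokenizer`.

```python
# Token-range offsets from the table above (assumed layout).
PAD, BOS, EOS = 0, 1, 2
TIME_OFFSET, FRET_OFFSET, MOD_OFFSET, DUR_OFFSET = 3, 504, 631, 639

def encode_note(delta_ms: int, frets, modifier_id: int, duration_ms: int):
    """frets: iterable of fret indices 0-6, at least one (empty subset excluded)."""
    time_tok = TIME_OFFSET + min(delta_ms // 10, 500)   # 10 ms bins, capped at 5000 ms
    mask = 0
    for f in frets:
        mask |= 1 << f                                  # 7-bit bitmask, values 1..127
    fret_tok = FRET_OFFSET + (mask - 1)                 # shift past the empty subset
    mod_tok = MOD_OFFSET + modifier_id                  # 0..7
    dur_tok = DUR_OFFSET + min(duration_ms // 50, 100)  # 50 ms bins, capped at 5000 ms
    return [time_tok, fret_tok, mod_tok, dur_tok]

# A two-fret chord (frets 0 and 1), 240 ms after the previous note,
# no modifier, sustained for 500 ms:
quad = encode_note(240, [0, 1], 0, 500)  # -> [27, 506, 631, 649]
```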

## Conditioning

The model supports 4 difficulty levels (Easy / Medium / Hard / Expert) and 4 instrument types (lead / bass / rhythm / keys), passed as integer IDs at inference time.

## Usage

```python
import torch
from tab_hero.model.chart_transformer import ChartTransformer
from tab_hero.data.tokenizer import ChartTokenizer

tok = ChartTokenizer()

model = ChartTransformer(
    vocab_size=tok.vocab_size,
    audio_input_dim=128,
    encoder_dim=768,
    decoder_dim=768,
    n_decoder_layers=8,
    n_heads=12,
    ffn_dim=3072,
    max_seq_len=8192,
    dropout=0.1,
    audio_downsample=4,
    use_flash=True,
    use_rope=True,
)

ckpt = torch.load("best_model.pt", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# audio_mel: (1, n_frames, 128) mel spectrogram tensor
tokens = model.generate(
    audio_embeddings=audio_mel,
    difficulty_id=torch.tensor([3]),   # 0=Easy 1=Medium 2=Hard 3=Expert
    instrument_id=torch.tensor([0]),   # 0=lead 1=bass 2=rhythm 3=keys
    temperature=1.0,
    top_k=50,
    top_p=0.95,
)
```

See `notebooks/inference_demo.ipynb` for a full end-to-end example including audio loading and chart export.

## Training

- Optimizer: AdamW (lr=1e-4, weight_decay=0.01, betas=(0.9, 0.95))
- Scheduler: Cosine annealing with 1000-step linear warmup
- Batch size: 16 (effective 32 with gradient accumulation)
- Gradient clipping: max norm 1.0
- Early stopping: patience 15 epochs
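The optimizer and schedule above can be reproduced in plain PyTorch roughly as follows. This is a sketch under assumptions: the repo's training loop may wire the warmup differently, and `T_max` is estimated as total steps minus warmup (195,030 − 1,000).

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for ChartTransformer

opt = torch.optim.AdamW(
    model.parameters(), lr=1e-4, weight_decay=0.01, betas=(0.9, 0.95)
)

# 1000-step linear warmup, then cosine annealing over the remaining steps.
warmup = torch.optim.lr_scheduler.LinearLR(
    opt, start_factor=1e-3, end_factor=1.0, total_iters=1000
)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=194_030)
sched = torch.optim.lr_scheduler.SequentialLR(
    opt, schedulers=[warmup, cosine], milestones=[1000]
)

lr_start = opt.param_groups[0]["lr"]  # tiny at step 0, ramps up over warmup
opt.step()
sched.step()
lr_after_one = opt.param_groups[0]["lr"]
```

Gradient clipping would be applied each step with `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` before `opt.step()`.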

## Limitations

- Trained on a specific dataset of Clone Hero charts; quality varies by genre and playing style.
- Source separation (HTDemucs) is recommended for mixed audio but not required.
- Mel spectrograms are lossy — the model cannot recover audio from its inputs.
- Output requires post-processing via SongExporter to produce playable chart files.

## Repository

https://github.com/MattGroho/tab-hero