---
language: en
tags:
- audio
- music
- guitar
- chart-generation
- transformer
- encoder-decoder
license: mit
---

# Tab Hero – ChartTransformer

An encoder-decoder transformer that generates guitar/bass charts from audio. Given a mel spectrogram, the model autoregressively produces a sequence of note tokens compatible with Clone Hero (`.mid` + `song.ini`).
## Model Description
| Property | Value |
|---|---|
| Architecture | Encoder-Decoder Transformer |
| Parameters | ~150M (Large config) |
| Audio input | Mel spectrogram (22050 Hz, 128 mels, hop=256) |
| Output | Note token sequence |
| Vocabulary size | 740 tokens |
| Training precision | bf16-mixed |
| Best validation loss | 0.1085 |
| Trained for | 65 epochs (195,030 steps) |
### Architecture

The **audio encoder** projects mel frames through a linear layer followed by a Conv1D stack with 4x temporal downsampling (~46 ms per encoder frame). The **decoder** is a causal transformer with:

- RoPE positional encoding (enables generation beyond the training length)
- Flash Attention 2 via `scaled_dot_product_attention`
- SwiGLU feed-forward networks
- Difficulty and instrument conditioning embeddings
- Weight-tied input/output embeddings
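The ~46 ms figure follows directly from the front-end parameters stated above (22050 Hz sample rate, hop of 256, 4x downsampling); a quick sanity check:

```python
SR, HOP, DOWNSAMPLE = 22050, 256, 4

mel_frame_ms = HOP / SR * 1000                 # ~11.61 ms per mel frame
encoder_frame_ms = mel_frame_ms * DOWNSAMPLE   # ~46.44 ms per encoder frame
print(round(encoder_frame_ms, 2))  # -> 46.44
```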

Full architecture details: [docs/architecture.md](https://github.com/MattGroho/tab-hero/blob/main/docs/architecture.md)
### Tokenization

Each note is a 4-token quad: `[TIME_DELTA] [FRET_COMBINATION] [MODIFIER] [DURATION]`
| Range | Type | Count | Description |
|---|---|---|---|
| 0 | PAD | 1 | Padding |
| 1 | BOS | 1 | Beginning of sequence |
| 2 | EOS | 1 | End of sequence |
| 3–503 | TIME_DELTA | 501 | Time since previous note (10 ms bins, 0–5000 ms) |
| 504–630 | FRET | 127 | All non-empty subsets of 7 frets |
| 631–638 | MODIFIER | 8 | HOPO / TAP / Star Power combinations |
| 639–739 | DURATION | 101 | Sustain length (50 ms bins, 0–5000 ms) |
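As a hedged illustration of how one quad maps back to a note, the sketch below inverts the ID ranges in the table above. The function name, the bitmask encoding of fret subsets, and the bit order are assumptions for illustration; the repo's `ChartTokenizer` is the authoritative implementation.

```python
def decode_quad(t_delta_id, fret_id, modifier_id, duration_id):
    """Invert one [TIME_DELTA][FRET][MODIFIER][DURATION] quad (illustrative only)."""
    assert 3 <= t_delta_id <= 503 and 504 <= fret_id <= 630
    assert 631 <= modifier_id <= 638 and 639 <= duration_id <= 739
    time_delta_ms = (t_delta_id - 3) * 10      # 10 ms bins, 0-5000 ms
    fret_mask = fret_id - 504 + 1              # 1..127: a non-empty subset of 7 frets
    frets = [f for f in range(7) if fret_mask & (1 << f)]
    modifier = modifier_id - 631               # 8 HOPO / TAP / Star Power combinations
    duration_ms = (duration_id - 639) * 50     # 50 ms bins, 0-5000 ms
    return time_delta_ms, frets, modifier, duration_ms

print(decode_quad(13, 504, 631, 659))  # -> (100, [0], 0, 1000)
```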

### Conditioning

The model supports 4 difficulty levels (Easy / Medium / Hard / Expert) and 4 instrument types (lead / bass / rhythm / keys), passed as integer IDs at inference time.
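For readability, the integer IDs (matching the comments in the usage example's `generate` call) can be wrapped in small lookup tables. These dict names are illustrative, not part of the repo's API:

```python
# ID assignments follow the comments in the usage example below
DIFFICULTY_ID = {"easy": 0, "medium": 1, "hard": 2, "expert": 3}
INSTRUMENT_ID = {"lead": 0, "bass": 1, "rhythm": 2, "keys": 3}

print(DIFFICULTY_ID["expert"], INSTRUMENT_ID["bass"])  # -> 3 1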

## Usage

```python
import torch
from tab_hero.model.chart_transformer import ChartTransformer
from tab_hero.data.tokenizer import ChartTokenizer

tok = ChartTokenizer()

model = ChartTransformer(
    vocab_size=tok.vocab_size,
    audio_input_dim=128,
    encoder_dim=768,
    decoder_dim=768,
    n_decoder_layers=8,
    n_heads=12,
    ffn_dim=3072,
    max_seq_len=8192,
    dropout=0.1,
    audio_downsample=4,
    use_flash=True,
    use_rope=True,
)

ckpt = torch.load("best_model.pt", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# audio_mel: (1, n_frames, 128) mel spectrogram tensor
tokens = model.generate(
    audio_embeddings=audio_mel,
    difficulty_id=torch.tensor([3]),  # 0=Easy 1=Medium 2=Hard 3=Expert
    instrument_id=torch.tensor([0]),  # 0=lead 1=bass 2=rhythm 3=keys
    temperature=1.0,
    top_k=50,
    top_p=0.95,
)
```
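The `temperature` / `top_k` / `top_p` parameters follow the usual sampling conventions. Below is a hedged NumPy sketch of the filtering step; the model's actual `generate` implementation may differ in details such as tie-breaking:

```python
import numpy as np

def sample_probs(logits, temperature=1.0, top_k=50, top_p=0.95):
    # Temperature scaling sharpens (<1) or flattens (>1) the distribution
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Top-k: mask everything below the k-th largest logit
    kth = np.sort(logits)[-min(top_k, logits.size)]
    logits = np.where(logits < kth, -np.inf, logits)
    # Softmax over the surviving logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p (nucleus): keep the smallest prefix of tokens whose mass exceeds top_p
    order = np.argsort(probs)[::-1]
    beyond = order[np.cumsum(probs[order]) > top_p]
    probs[beyond[1:]] = 0.0  # always keep the token that crosses the threshold
    return probs / probs.sum()
```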

See [`notebooks/inference_demo.ipynb`](https://github.com/MattGroho/tab-hero/blob/main/notebooks/inference_demo.ipynb) for a full end-to-end example, including audio loading and chart export.
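The usage snippet above leaves `audio_mel` as a placeholder. As a rough, dependency-free sketch of how a `(n_frames, 128)` log-mel matrix could be computed with the documented front-end parameters (22050 Hz, 128 mels, hop=256), where `n_fft=1024`, the Hann window, and the `log1p` compression are assumptions (the repo likely uses librosa or torchaudio):

```python
import numpy as np

SR, N_MELS, HOP, N_FFT = 22050, 128, 256, 1024  # n_fft=1024 is an assumption

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    # Triangular filters spaced evenly on the mel scale
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for j in range(l, c):
            fb[i - 1, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):
            fb[i - 1, j] = (r - j) / max(r - c, 1)
    return fb

def mel_spectrogram(y, sr=SR, n_fft=N_FFT, hop=HOP, n_mels=N_MELS):
    # Frame and window the signal, then take the power spectrum of each frame
    n_frames = 1 + (len(y) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([y[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Project onto the mel filterbank and compress with log
    return np.log1p(power @ mel_filterbank(sr, n_fft, n_mels).T)  # (n_frames, n_mels)

# One second of noise -> roughly 83 frames of 128 mel bins
y = np.random.default_rng(0).standard_normal(SR).astype(np.float32)
mel = mel_spectrogram(y)
print(mel.shape)  # -> (83, 128)
```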

## Training

- **Optimizer**: AdamW (lr=1e-4, weight_decay=0.01, betas=(0.9, 0.95))
- **Scheduler**: Cosine annealing with 1000-step linear warmup
- **Batch size**: 16 (effective 32 with gradient accumulation)
- **Gradient clipping**: max norm 1.0
- **Early stopping**: patience 15 epochs
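A minimal sketch of the learning-rate schedule implied by the list above (linear warmup, then cosine decay). Decaying to zero over the full 195,030 steps is an assumption:

```python
import math

BASE_LR, WARMUP_STEPS, TOTAL_STEPS = 1e-4, 1000, 195_030

def lr_at(step):
    # Linear warmup from 0 to BASE_LR over the first 1000 steps
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    # Cosine annealing from BASE_LR toward 0 over the remaining steps
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))

print(lr_at(500), lr_at(1000))  # -> 5e-05 0.0001
```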

## Limitations

- Trained on a specific dataset of Clone Hero charts; quality varies by genre and playing style.
- Source separation (HTDemucs) is recommended for mixed audio but not required.
- Mel spectrograms are lossy; the model cannot recover audio from its inputs.
- Output requires post-processing via `SongExporter` to produce playable chart files.

## Repository

[https://github.com/MattGroho/tab-hero](https://github.com/MattGroho/tab-hero)