+---
+language: en
+tags:
+  - audio
+  - music
+  - guitar
+  - chart-generation
+  - transformer
+  - encoder-decoder
+license: mit
+---
+# Tab Hero — ChartTransformer
+An encoder-decoder transformer that generates guitar/bass charts from audio. Given a mel spectrogram, the model autoregressively produces a sequence of note tokens that can be exported to Clone Hero chart files (`.mid` + `song.ini`).
+## Model Description
+| Property | Value |
+|---|---|
+| Architecture | Encoder-Decoder Transformer |
+| Parameters | ~150M (Large config) |
+| Audio input | Mel spectrogram (22050 Hz, 128 mels, hop=256) |
+| Output | Note token sequence |
+| Vocabulary size | 740 tokens |
+| Training precision | bf16-mixed |
+| Best validation loss | 0.1085 |
+| Trained for | 65 epochs (195,030 steps) |
+### Architecture
+The **audio encoder** projects mel frames through a linear layer, then a Conv1D stack with 4x temporal downsampling (hop 256 at 22050 Hz gives ~11.6 ms per mel frame, so ~46 ms per encoder frame). The **decoder** is a causal transformer with:
+- RoPE positional encoding (enables generation beyond training length)
+- Flash Attention 2 via PyTorch `scaled_dot_product_attention` (used when the backend supports it)
+- SwiGLU feed-forward networks
+- Difficulty and instrument conditioning embeddings
+- Weight-tied input/output embeddings
+Full architecture details: [docs/architecture.md](https://github.com/MattGroho/tab-hero/blob/main/docs/architecture.md)
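The ~46 ms figure quoted above follows directly from the audio settings in the model description table. A minimal sketch of that arithmetic, using only the values stated in this card (the function names are illustrative and not part of the `tab_hero` package):

```python
# Values taken from the "Audio input" row and the encoder description above.
SAMPLE_RATE_HZ = 22050
HOP_LENGTH = 256
AUDIO_DOWNSAMPLE = 4


def mel_frame_ms(sample_rate=SAMPLE_RATE_HZ, hop=HOP_LENGTH):
    """Time step between consecutive mel frames, in milliseconds."""
    return 1000.0 * hop / sample_rate


def encoder_frame_ms(sample_rate=SAMPLE_RATE_HZ, hop=HOP_LENGTH,
                     downsample=AUDIO_DOWNSAMPLE):
    """Time step between consecutive encoder output frames, in milliseconds."""
    return mel_frame_ms(sample_rate, hop) * downsample


print(f"mel frame:     {mel_frame_ms():.2f} ms")      # about 11.61 ms
print(f"encoder frame: {encoder_frame_ms():.2f} ms")  # about 46.44 ms
```

The same numbers give a rough sense of sequence length: one minute of audio corresponds to roughly 1,290 encoder frames, ignoring any padding or edge handling in the real pipeline.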
+### Tokenization
+Each note is a 4-token quad: `[TIME_DELTA] [FRET_COMBINATION] [MODIFIER] [DURATION]`
+| Range | Type | Count | Description |
+|---|---|---|---|
+| 0 | PAD | 1 | Padding |
+| 1 | BOS | 1 | Beginning of sequence |
+| 2 | EOS | 1 | End of sequence |
+| 3–503 | TIME_DELTA | 501 | Time since previous note (10ms bins, 0–5000ms) |
+| 504–630 | FRET | 127 | All non-empty subsets of 7 frets |
+| 631–638 | MODIFIER | 8 | HOPO / TAP / Star Power combinations |
+| 639–739 | DURATION | 101 | Sustain length (50ms bins, 0–5000ms) |
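To make the table concrete, the sketch below maps one note to a 4-token quad and back using the range boundaries listed above. It is not the `ChartTokenizer` implementation: the ordering inside each range (fret bitmask minus one, modifier as a 3-bit mask) and the round-and-clamp binning are assumptions made for illustration, and `encode_note` / `decode_note` are hypothetical names.

```python
# Range boundaries copied from the vocabulary table.
PAD, BOS, EOS = 0, 1, 2
TIME_DELTA_START, TIME_DELTA_BINS, TIME_BIN_MS = 3, 501, 10
FRET_START, FRET_COUNT = 504, 127
MODIFIER_START, MODIFIER_COUNT = 631, 8
DURATION_START, DURATION_BINS, DURATION_BIN_MS = 639, 101, 50
VOCAB_SIZE = DURATION_START + DURATION_BINS  # 740


def _to_bin(value_ms, bin_ms, n_bins):
    """Round to the nearest bin and clamp into [0, n_bins - 1]."""
    return min(max(int(value_ms / bin_ms + 0.5), 0), n_bins - 1)


def encode_note(delta_ms, fret_mask, modifier_bits, sustain_ms):
    """Return [TIME_DELTA, FRET, MODIFIER, DURATION] token IDs.

    Assumed layout (not confirmed by the source): fret_mask is a 7-bit
    bitmask in 1..127 stored at offset mask - 1; modifier_bits is 0..7.
    """
    if not 1 <= fret_mask <= FRET_COUNT:
        raise ValueError("fret_mask must be a non-empty 7-bit mask (1..127)")
    if not 0 <= modifier_bits < MODIFIER_COUNT:
        raise ValueError("modifier_bits must be in 0..7")
    return [
        TIME_DELTA_START + _to_bin(delta_ms, TIME_BIN_MS, TIME_DELTA_BINS),
        FRET_START + fret_mask - 1,
        MODIFIER_START + modifier_bits,
        DURATION_START + _to_bin(sustain_ms, DURATION_BIN_MS, DURATION_BINS),
    ]


def decode_note(tokens):
    """Invert encode_note: returns (delta_ms, fret_mask, modifier_bits, sustain_ms)."""
    t, f, m, d = tokens
    return (
        (t - TIME_DELTA_START) * TIME_BIN_MS,
        f - FRET_START + 1,
        m - MODIFIER_START,
        (d - DURATION_START) * DURATION_BIN_MS,
    )


quad = encode_note(delta_ms=250, fret_mask=0b0000101, modifier_bits=0, sustain_ms=400)
print(quad)               # [28, 508, 631, 647]
print(decode_note(quad))  # (250, 5, 0, 400)
```

The range sizes add up to the reported vocabulary: 3 special tokens + 501 + 127 + 8 + 101 = 740.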
+### Conditioning
+The model supports 4 difficulty levels (Easy / Medium / Hard / Expert) and 4 instrument types (lead / bass / rhythm / keys), passed as integer IDs at inference time.
+## Usage
+```python
+import torch
+from tab_hero.model.chart_transformer import ChartTransformer
+from tab_hero.data.tokenizer import ChartTokenizer
+tok = ChartTokenizer()
+model = ChartTransformer(
+    vocab_size=tok.vocab_size,
+    audio_input_dim=128,
+    encoder_dim=768,
+    decoder_dim=768,
+    n_decoder_layers=8,
+    n_heads=12,
+    ffn_dim=3072,
+    max_seq_len=8192,
+    dropout=0.1,
+    audio_downsample=4,
+    use_flash=True,
+    use_rope=True,
+)
+# weights_only=False unpickles arbitrary Python objects; only load checkpoints you trust.
+ckpt = torch.load("best_model.pt", map_location="cpu", weights_only=False)
+model.load_state_dict(ckpt["model_state_dict"])
+model.eval()
+# audio_mel: (1, n_frames, 128) mel spectrogram tensor
+tokens = model.generate(
+    audio_embeddings=audio_mel,
+    difficulty_id=torch.tensor([3]),   # 0=Easy 1=Medium 2=Hard 3=Expert
+    instrument_id=torch.tensor([0]),   # 0=lead 1=bass 2=rhythm 3=keys
+    temperature=1.0,
+    top_k=50,
+    top_p=0.95,
+)
+```
+See [`notebooks/inference_demo.ipynb`](https://github.com/MattGroho/tab-hero/blob/main/notebooks/inference_demo.ipynb) for a full end-to-end example including audio loading and chart export.
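The `temperature`, `top_k`, and `top_p` arguments in the snippet above are standard sampling controls. The sketch below shows how these three filters are commonly combined for a single decoding step. It is a generic pure-Python illustration and not the code inside `ChartTransformer.generate`; the function name `sample_token` and the order in which the filters are applied are assumptions.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=50, top_p=0.95, rng=random):
    """Sample one index from a list of logits with temperature, top-k and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring candidates (k <= 0 disables it).
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k and top_k > 0:
        order = order[:top_k]

    # Softmax over the surviving candidates (max subtracted for stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for idx, p in zip(order, probs):
        kept_ids.append(idx)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break

    # random.choices renormalises the remaining weights internally.
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]


# With top_k=1 the call reduces to greedy decoding.
print(sample_token([0.0, 10.0, 0.0], top_k=1))  # 1
```

Lower `temperature` or smaller `top_k` / `top_p` make the output more deterministic; larger values give more varied output, with a higher chance of unlikely tokens.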
+## Training
+- **Optimizer**: AdamW (lr=1e-4, weight_decay=0.01, betas=(0.9, 0.95))
+- **Scheduler**: Cosine annealing with 1000-step linear warmup
+- **Batch size**: 16 (effective 32 with gradient accumulation)
+- **Gradient clipping**: max norm 1.0
+- **Early stopping**: patience 15 epochs
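For readers who want to see the shape of the learning-rate schedule listed above, here is a minimal sketch of cosine annealing with a 1000-step linear warmup. The peak rate and warmup length come from this card. The decay horizon (set here to the reported 195,030 steps), the final rate of 0, and the function name `lr_at_step` are assumptions; the actual training configuration may differ.

```python
import math


def lr_at_step(step, base_lr=1e-4, warmup_steps=1000,
               total_steps=195_030, min_lr=0.0):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < warmup_steps:
        # Linear warmup: ramps up to base_lr over warmup_steps steps.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to min_lr over the remaining steps.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 499, 999, 98_015, 195_030):
    print(s, f"{lr_at_step(s):.2e}")
```

The rate starts at 1e-7, reaches the 1e-4 peak at the end of warmup, passes 5e-5 at the midpoint of the decay, and ends at the assumed floor of 0.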
+## Limitations
+- Trained on a specific dataset of Clone Hero charts; quality varies by genre and playing style.
+- Source separation (HTDemucs) is recommended for mixed audio but not required.
+- Mel spectrograms are lossy (phase is discarded), so the original audio cannot be reconstructed from the model's inputs.
+- Output requires post-processing via `SongExporter` to produce playable chart files.
+## Repository
+[https://github.com/MattGroho/tab-hero](https://github.com/MattGroho/tab-hero)