---
language: en
tags:
  - audio
  - music
  - guitar
  - chart-generation
  - transformer
  - encoder-decoder
license: mit
---

# Tab Hero — ChartTransformer

An encoder-decoder transformer that generates guitar/bass charts from audio. Given a mel spectrogram, the model autoregressively produces a sequence of note tokens compatible with Clone Hero (`.mid` + `song.ini`).

## Model Description

| Property | Value |
|---|---|
| Architecture | Encoder-Decoder Transformer |
| Parameters | ~150M (Large config) |
| Audio input | Mel spectrogram (22050 Hz, 128 mels, hop=256) |
| Output | Note token sequence |
| Vocabulary size | 740 tokens |
| Training precision | bf16-mixed |
| Best validation loss | 0.1085 |
| Trained for | 65 epochs (195,030 steps) |

### Architecture

The **audio encoder** projects mel frames through a linear layer followed by a Conv1D stack that downsamples 4x in time (~46 ms per encoder frame). The **decoder** is a causal transformer with:
- RoPE positional encoding (enables generation beyond training length)
- Flash Attention 2 via `scaled_dot_product_attention`
- SwiGLU feed-forward networks
- Difficulty and instrument conditioning embeddings
- Weight-tied input/output embeddings

Full architecture details: [docs/architecture.md](https://github.com/MattGroho/tab-hero/blob/main/docs/architecture.md)
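The encoder's effective frame rate follows directly from the spectrogram parameters listed above; a quick sanity check in plain arithmetic (no project code required):

```python
# Mel spectrogram parameters from the model card
sample_rate = 22050      # Hz
hop_length = 256         # audio samples per mel frame
downsample = 4           # Conv1D temporal downsampling factor

ms_per_mel_frame = hop_length / sample_rate * 1000       # ~11.6 ms per mel frame
ms_per_encoder_frame = ms_per_mel_frame * downsample     # ~46.4 ms per encoder frame
print(round(ms_per_encoder_frame, 1))
```

So each encoder timestep summarizes roughly 46 ms of audio, which is the temporal resolution available to the decoder's cross-attention.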

### Tokenization

Each note is a 4-token quad: `[TIME_DELTA] [FRET_COMBINATION] [MODIFIER] [DURATION]`

| Range | Type | Count | Description |
|---|---|---|---|
| 0 | PAD | 1 | Padding |
| 1 | BOS | 1 | Beginning of sequence |
| 2 | EOS | 1 | End of sequence |
| 3–503 | TIME_DELTA | 501 | Time since previous note (10ms bins, 0–5000ms) |
| 504–630 | FRET | 127 | All non-empty subsets of 7 frets |
| 631–638 | MODIFIER | 8 | HOPO / TAP / Star Power combinations |
| 639–739 | DURATION | 101 | Sustain length (50ms bins, 0–5000ms) |
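The quad layout above can be illustrated with a minimal encoder. This is a standalone sketch of the table's ID arithmetic, not the project's `ChartTokenizer`; the clamping of out-of-range values to the last bin is an assumption.

```python
# Token ID base offsets from the vocabulary table (sketch, not the real ChartTokenizer)
TIME_BASE, FRET_BASE, MOD_BASE, DUR_BASE = 3, 504, 631, 639

def encode_note(time_delta_ms, frets, modifier_bits=0, duration_ms=0):
    """Encode one note as a [TIME_DELTA, FRET, MODIFIER, DURATION] quad.

    frets: non-empty set of lane indices 0-6.
    modifier_bits: 0-7 (HOPO / TAP / Star Power flags).
    """
    t = TIME_BASE + min(time_delta_ms // 10, 500)   # 10 ms bins, clamped at 5000 ms
    mask = sum(1 << f for f in frets)               # non-empty subset of 7 lanes: 1..127
    fr = FRET_BASE + mask - 1                       # maps 1..127 onto 504..630
    m = MOD_BASE + modifier_bits                    # 631..638
    d = DUR_BASE + min(duration_ms // 50, 100)      # 50 ms bins, clamped at 5000 ms
    return [t, fr, m, d]

# A two-lane chord (lanes 0 and 2) 230 ms after the previous note, sustained 400 ms
print(encode_note(230, {0, 2}, modifier_bits=0, duration_ms=400))  # [26, 508, 631, 647]
```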

### Conditioning

The model supports 4 difficulty levels (Easy / Medium / Hard / Expert) and 4 instrument types (lead / bass / rhythm / keys), passed as integer IDs at inference time.
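For readability, the integer IDs can be wrapped in small lookup tables. These maps are hypothetical helpers matching the ordering above, not part of the package:

```python
# Hypothetical name-to-ID maps for the conditioning inputs described above
DIFFICULTY_IDS = {"easy": 0, "medium": 1, "hard": 2, "expert": 3}
INSTRUMENT_IDS = {"lead": 0, "bass": 1, "rhythm": 2, "keys": 3}

print(DIFFICULTY_IDS["expert"], INSTRUMENT_IDS["bass"])  # 3 1
```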

## Usage

```python
import torch
from tab_hero.model.chart_transformer import ChartTransformer
from tab_hero.data.tokenizer import ChartTokenizer

tok = ChartTokenizer()

model = ChartTransformer(
    vocab_size=tok.vocab_size,
    audio_input_dim=128,
    encoder_dim=768,
    decoder_dim=768,
    n_decoder_layers=8,
    n_heads=12,
    ffn_dim=3072,
    max_seq_len=8192,
    dropout=0.1,
    audio_downsample=4,
    use_flash=True,
    use_rope=True,
)

ckpt = torch.load("best_model.pt", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# audio_mel: (1, n_frames, 128) mel spectrogram tensor
tokens = model.generate(
    audio_embeddings=audio_mel,
    difficulty_id=torch.tensor([3]),   # 0=Easy 1=Medium 2=Hard 3=Expert
    instrument_id=torch.tensor([0]),   # 0=lead 1=bass 2=rhythm 3=keys
    temperature=1.0,
    top_k=50,
    top_p=0.95,
)
```

See [`notebooks/inference_demo.ipynb`](https://github.com/MattGroho/tab-hero/blob/main/notebooks/inference_demo.ipynb) for a full end-to-end example including audio loading and chart export.

## Training

- **Optimizer**: AdamW (lr=1e-4, weight_decay=0.01, betas=(0.9, 0.95))
- **Scheduler**: Cosine annealing with 1000-step linear warmup
- **Batch size**: 16 (effective 32 with gradient accumulation)
- **Gradient clipping**: max norm 1.0
- **Early stopping**: patience 15 epochs
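The warmup-plus-cosine schedule can be sketched as follows. Decaying to zero over the full 195,030 steps is an assumption; the card does not specify the decay horizon or floor.

```python
import math

def lr_at(step, base_lr=1e-4, warmup_steps=1000, total_steps=195_030):
    """Linear warmup followed by cosine annealing (decay-to-zero assumed)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(500))    # halfway through warmup: 5e-5
print(lr_at(1000))   # peak learning rate: 1e-4
```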

## Limitations

- Trained on a specific dataset of Clone Hero charts; quality varies by genre and playing style.
- Source separation (HTDemucs) is recommended for mixed audio but not required.
- Mel spectrograms are lossy, so the model works from a reduced representation and cannot recover the original audio from its inputs.
- Output requires post-processing via `SongExporter` to produce playable chart files.

## Repository

[https://github.com/MattGroho/tab-hero](https://github.com/MattGroho/tab-hero)