MattGroho committed on
Commit
07ce42c
·
verified ·
1 Parent(s): 5ff2f25

Add model card

  1. README.md +118 -0
README.md ADDED
@@ -0,0 +1,118 @@
+ ---
+ language: en
+ tags:
+ - audio
+ - music
+ - guitar
+ - chart-generation
+ - transformer
+ - encoder-decoder
+ license: mit
+ ---
+
+ # Tab Hero — ChartTransformer
+
+ An encoder-decoder transformer that generates guitar/bass charts from audio. Given a mel spectrogram, the model autoregressively produces a sequence of note tokens that can be exported to Clone Hero chart files (`.mid` + `song.ini`).
+
+ ## Model Description
+
+ | Property | Value |
+ |---|---|
+ | Architecture | Encoder-Decoder Transformer |
+ | Parameters | ~150M (Large config) |
+ | Audio input | Mel spectrogram (22,050 Hz sample rate, 128 mel bins, hop length 256 samples) |
+ | Output | Note token sequence |
+ | Vocabulary size | 740 tokens |
+ | Training precision | bf16-mixed |
+ | Best validation loss | 0.1085 |
+ | Trained for | 65 epochs (195,030 steps) |
+
+ ### Architecture
+
+ The **audio encoder** projects mel frames through a linear layer, then a Conv1D stack with 4x temporal downsampling (about 46 ms per encoder frame). The **decoder** is a causal transformer with:
+ - RoPE positional encoding (enables generation beyond training length)
+ - Flash Attention 2 via PyTorch `scaled_dot_product_attention`
+ - SwiGLU feed-forward networks
+ - Difficulty and instrument conditioning embeddings
+ - Weight-tied input/output embeddings
+
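The per-frame figure quoted above follows from the front-end settings in the Model Description table. A minimal sketch of that arithmetic, using only values stated in this card; the helper names are illustrative and are not part of the `tab_hero` package, and the exact frame count in practice depends on padding choices the card does not specify:

```python
SAMPLE_RATE = 22050      # Hz, from the model card
HOP_LENGTH = 256         # samples between consecutive mel frames, from the model card
AUDIO_DOWNSAMPLE = 4     # temporal downsampling applied by the Conv1D stack


def encoder_frame_ms(sample_rate=SAMPLE_RATE, hop=HOP_LENGTH, downsample=AUDIO_DOWNSAMPLE):
    """Duration covered by one encoder frame, in milliseconds."""
    return 1000.0 * hop * downsample / sample_rate


def approx_encoder_frames(duration_s, sample_rate=SAMPLE_RATE, hop=HOP_LENGTH,
                          downsample=AUDIO_DOWNSAMPLE):
    """Rough number of encoder frames for a clip (ignores edge padding)."""
    mel_frames = int(duration_s * sample_rate) // hop
    return mel_frames // downsample


print(round(encoder_frame_ms(), 2))   # 46.44
print(approx_encoder_frames(180.0))   # 3875
```

Under these assumptions a three-minute clip maps to roughly 3,900 encoder frames for the decoder to attend over.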
+ Full architecture details: [docs/architecture.md](https://github.com/MattGroho/tab-hero/blob/main/docs/architecture.md)
+
+ ### Tokenization
+
+ Each note is a 4-token quad: `[TIME_DELTA] [FRET_COMBINATION] [MODIFIER] [DURATION]`
+
+ | Range | Type | Count | Description |
+ |---|---|---|---|
+ | 0 | PAD | 1 | Padding |
+ | 1 | BOS | 1 | Beginning of sequence |
+ | 2 | EOS | 1 | End of sequence |
+ | 3–503 | TIME_DELTA | 501 | Time since previous note (10 ms bins, 0–5000 ms) |
+ | 504–630 | FRET | 127 | All non-empty subsets of 7 frets |
+ | 631–638 | MODIFIER | 8 | HOPO / TAP / Star Power combinations |
+ | 639–739 | DURATION | 101 | Sustain length (50 ms bins, 0–5000 ms) |
+
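The ranges in the table suggest a simple offset-plus-bin encoding. The sketch below shows one way a note could map to its quad. The range starts and bin widths come from the table; the ordering inside each range (fret bitmask order, modifier bit order) and the rounding and clamping rules are assumptions, so the real `ChartTokenizer` may assign different IDs.

```python
# Range starts taken from the tokenization table.
TIME_DELTA_START, FRET_START, MODIFIER_START, DURATION_START = 3, 504, 631, 639
MAX_MS = 5000


def _bin(ms, bin_ms):
    """Clamp to 0..5000 ms and round to the nearest bin index (assumed rule)."""
    ms = min(max(ms, 0), MAX_MS)
    return int(ms / bin_ms + 0.5)


def encode_note(delta_ms, frets, hopo=False, tap=False, star_power=False, sustain_ms=0):
    """Return a hypothetical 4-token quad for one note.

    ASSUMPTION: frets are numbered 0..6 and packed as a bitmask; modifier flags
    are packed as bits (HOPO=1, TAP=2, Star Power=4). The card does not state
    the actual ordering.
    """
    if not frets or any(f < 0 or f > 6 for f in frets):
        raise ValueError("frets must be a non-empty subset of 0..6")
    mask = sum(1 << f for f in set(frets))                      # 1..127
    mod = int(hopo) | (int(tap) << 1) | (int(star_power) << 2)  # 0..7
    return [
        TIME_DELTA_START + _bin(delta_ms, 10),
        FRET_START + mask - 1,
        MODIFIER_START + mod,
        DURATION_START + _bin(sustain_ms, 50),
    ]


# A two-fret HOPO chord 250 ms after the previous note, held for 500 ms.
print(encode_note(250, [0, 2], hopo=True, sustain_ms=500))  # [28, 508, 632, 649]
```

The extremes line up with the table: the largest values in every field produce tokens 503, 630, 638, and 739, which is consistent with a 740-token vocabulary.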
+ ### Conditioning
+
+ The model is conditioned on one of four difficulty levels (Easy / Medium / Hard / Expert) and one of four instrument types (lead / bass / rhythm / keys), each passed as an integer ID at inference time.
+
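The ID assignments appear only as comments in the Usage example (0=Easy through 3=Expert, 0=lead through 3=keys). A small lookup helper, sketched here with hypothetical names that are not part of the `tab_hero` package, keeps those integers out of calling code:

```python
# Mappings copied from the comments in the Usage example.
DIFFICULTY_IDS = {"easy": 0, "medium": 1, "hard": 2, "expert": 3}
INSTRUMENT_IDS = {"lead": 0, "bass": 1, "rhythm": 2, "keys": 3}


def conditioning_ids(difficulty, instrument):
    """Translate human-readable names into the integer IDs the model expects."""
    try:
        return DIFFICULTY_IDS[difficulty.lower()], INSTRUMENT_IDS[instrument.lower()]
    except KeyError as exc:
        raise ValueError(f"unknown conditioning value: {exc.args[0]!r}") from None


print(conditioning_ids("Expert", "lead"))  # (3, 0)
```

Each returned integer would then be wrapped as `torch.tensor([value])` before being passed to `generate`.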
+ ## Usage
+
+ ```python
+ import torch
+ from tab_hero.model.chart_transformer import ChartTransformer
+ from tab_hero.data.tokenizer import ChartTokenizer
+
+ tok = ChartTokenizer()
+
+ # Hyperparameters must match the checkpoint being loaded.
+ model = ChartTransformer(
+     vocab_size=tok.vocab_size,
+     audio_input_dim=128,
+     encoder_dim=768,
+     decoder_dim=768,
+     n_decoder_layers=8,
+     n_heads=12,
+     ffn_dim=3072,
+     max_seq_len=8192,
+     dropout=0.1,
+     audio_downsample=4,
+     use_flash=True,
+     use_rope=True,
+ )
+
+ # weights_only=False unpickles arbitrary objects; only load checkpoints you trust.
+ ckpt = torch.load("best_model.pt", map_location="cpu", weights_only=False)
+ model.load_state_dict(ckpt["model_state_dict"])
+ model.eval()
+
+ # audio_mel: (1, n_frames, 128) mel spectrogram tensor
+ tokens = model.generate(
+     audio_embeddings=audio_mel,
+     difficulty_id=torch.tensor([3]),  # 0=Easy 1=Medium 2=Hard 3=Expert
+     instrument_id=torch.tensor([0]),  # 0=lead 1=bass 2=rhythm 3=keys
+     temperature=1.0,
+     top_k=50,
+     top_p=0.95,
+ )
+ ```
+
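The `temperature`, `top_k`, and `top_p` arguments are standard sampling controls, but this card does not show how `generate` applies them. The standard-library sketch below illustrates one common way to combine them (temperature scaling, then top-k, then nucleus filtering over the renormalised survivors). It is an illustration of the general technique, not the model's actual implementation, and `sample_token` is a hypothetical name.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=50, top_p=0.95, rng=random):
    """Sample one token index; temperature must be greater than zero."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalised mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break

    # random.choices renormalises the surviving weights before drawing.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Lower `temperature` and smaller `top_k` or `top_p` make the output more deterministic; with `top_k=1` the function reduces to greedy decoding.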
+ See [`notebooks/inference_demo.ipynb`](https://github.com/MattGroho/tab-hero/blob/main/notebooks/inference_demo.ipynb) for a full end-to-end example including audio loading and chart export.
+
+ ## Training
+
+ - **Optimizer**: AdamW (lr=1e-4, weight_decay=0.01, betas=(0.9, 0.95))
+ - **Scheduler**: Cosine annealing with 1000-step linear warmup
+ - **Batch size**: 16 (effective 32 with gradient accumulation)
+ - **Gradient clipping**: max norm 1.0
+ - **Early stopping**: patience 15 epochs
+
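The scheduler entry can be read as a linear ramp followed by cosine decay. A minimal sketch of that shape, assuming the warmup starts from zero, the floor is zero, and the decay horizon equals the 195,030 steps reported in the Model Description table; none of these three details is stated in the card, and `lr_at` is a hypothetical helper:

```python
import math

BASE_LR = 1e-4          # from the optimizer settings
WARMUP_STEPS = 1000     # from the scheduler settings
TOTAL_STEPS = 195_030   # ASSUMPTION: decay horizon equals the reported training length


def lr_at(step, base_lr=BASE_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS, min_lr=0.0):
    """Learning rate at a given optimizer step."""
    if step < warmup:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / warmup
    # Cosine annealing from base_lr down to min_lr over the remaining steps.
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1000, 98_015, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

The batch-size entry implies two gradient-accumulation steps per optimizer update (16 x 2 = 32), assuming training on a single device.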
+ ## Limitations
+
+ - Trained on a specific dataset of Clone Hero charts; quality varies by genre and playing style.
+ - Source separation (HTDemucs) is recommended for mixed audio but not required.
+ - Mel spectrograms are lossy; the model cannot recover audio from its inputs.
+ - Output requires post-processing via `SongExporter` to produce playable chart files.
+
+ ## Repository
+
+ [https://github.com/MattGroho/tab-hero](https://github.com/MattGroho/tab-hero)
+