---
language: en
tags:
- audio
- music
- guitar
- chart-generation
- transformer
- encoder-decoder
license: mit
---

# Tab Hero – ChartTransformer

An encoder-decoder transformer that generates guitar/bass charts from audio. Given a mel spectrogram, the model autoregressively produces a sequence of note tokens compatible with Clone Hero (`.mid` + `song.ini`).
## Model Description
| Property | Value |
|---|---|
| Architecture | Encoder-Decoder Transformer |
| Parameters | ~150M (Large config) |
| Audio input | Mel spectrogram (22050 Hz, 128 mels, hop=256) |
| Output | Note token sequence |
| Vocabulary size | 740 tokens |
| Training precision | bf16-mixed |
| Best validation loss | 0.1085 |
| Trained for | 65 epochs (195,030 steps) |
### Architecture

The **audio encoder** projects mel frames through a linear layer followed by a Conv1D stack with 4x temporal downsampling (~46 ms per encoder frame). The **decoder** is a causal transformer with:

- RoPE positional encoding (enables generation beyond the training length)
- Flash Attention 2 via `scaled_dot_product_attention`
- SwiGLU feed-forward networks
- Difficulty and instrument conditioning embeddings
- Weight-tied input/output embeddings
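The ~46 ms figure follows directly from the front-end parameters stated above (22050 Hz sample rate, hop of 256, 4x downsampling); a quick sanity check:

```python
SR, HOP, DOWNSAMPLE = 22050, 256, 4

mel_frame_ms = HOP / SR * 1000                 # ~11.61 ms per mel frame
encoder_frame_ms = mel_frame_ms * DOWNSAMPLE   # ~46.44 ms per encoder frame
print(round(encoder_frame_ms, 2))  # -> 46.44
```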

Full architecture details: [docs/architecture.md](https://github.com/MattGroho/tab-hero/blob/main/docs/architecture.md)
### Tokenization

Each note is a 4-token quad: `[TIME_DELTA] [FRET_COMBINATION] [MODIFIER] [DURATION]`
| Range | Type | Count | Description |
|---|---|---|---|
| 0 | PAD | 1 | Padding |
| 1 | BOS | 1 | Beginning of sequence |
| 2 | EOS | 1 | End of sequence |
| 3–503 | TIME_DELTA | 501 | Time since previous note (10 ms bins, 0–5000 ms) |
| 504–630 | FRET | 127 | All non-empty subsets of 7 frets |
| 631–638 | MODIFIER | 8 | HOPO / TAP / Star Power combinations |
| 639–739 | DURATION | 101 | Sustain length (50 ms bins, 0–5000 ms) |
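As a hedged illustration of how one quad maps back to a note, the sketch below inverts the ID ranges in the table above. The function name, the bitmask encoding of fret subsets, and the bit order are assumptions for illustration; the repo's `ChartTokenizer` is the authoritative implementation.

```python
def decode_quad(t_delta_id, fret_id, modifier_id, duration_id):
    """Invert one [TIME_DELTA][FRET][MODIFIER][DURATION] quad (illustrative only)."""
    assert 3 <= t_delta_id <= 503 and 504 <= fret_id <= 630
    assert 631 <= modifier_id <= 638 and 639 <= duration_id <= 739
    time_delta_ms = (t_delta_id - 3) * 10      # 10 ms bins, 0-5000 ms
    fret_mask = fret_id - 504 + 1              # 1..127: a non-empty subset of 7 frets
    frets = [f for f in range(7) if fret_mask & (1 << f)]
    modifier = modifier_id - 631               # 8 HOPO / TAP / Star Power combinations
    duration_ms = (duration_id - 639) * 50     # 50 ms bins, 0-5000 ms
    return time_delta_ms, frets, modifier, duration_ms

print(decode_quad(13, 504, 631, 659))  # -> (100, [0], 0, 1000)
```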

### Conditioning

The model supports 4 difficulty levels (Easy / Medium / Hard / Expert) and 4 instrument types (lead / bass / rhythm / keys), passed as integer IDs at inference time.
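For readability, the integer IDs (matching the comments in the usage example's `generate` call) can be wrapped in small lookup tables. These dict names are illustrative, not part of the repo's API:

```python
# ID assignments follow the comments in the usage example below
DIFFICULTY_ID = {"easy": 0, "medium": 1, "hard": 2, "expert": 3}
INSTRUMENT_ID = {"lead": 0, "bass": 1, "rhythm": 2, "keys": 3}

print(DIFFICULTY_ID["expert"], INSTRUMENT_ID["bass"])  # -> 3 1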

## Usage

```python
import torch
from tab_hero.model.chart_transformer import ChartTransformer
from tab_hero.data.tokenizer import ChartTokenizer

tok = ChartTokenizer()

model = ChartTransformer(
    vocab_size=tok.vocab_size,
    audio_input_dim=128,
    encoder_dim=768,
    decoder_dim=768,
    n_decoder_layers=8,
    n_heads=12,
    ffn_dim=3072,
    max_seq_len=8192,
    dropout=0.1,
    audio_downsample=4,
    use_flash=True,
    use_rope=True,
)

ckpt = torch.load("best_model.pt", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# audio_mel: (1, n_frames, 128) mel spectrogram tensor
tokens = model.generate(
    audio_embeddings=audio_mel,
    difficulty_id=torch.tensor([3]),  # 0=Easy 1=Medium 2=Hard 3=Expert
    instrument_id=torch.tensor([0]),  # 0=lead 1=bass 2=rhythm 3=keys
    temperature=1.0,
    top_k=50,
    top_p=0.95,
)
```
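The `temperature` / `top_k` / `top_p` parameters follow the usual sampling conventions. Below is a hedged NumPy sketch of the filtering step; the model's actual `generate` implementation may differ in details such as tie-breaking:

```python
import numpy as np

def sample_probs(logits, temperature=1.0, top_k=50, top_p=0.95):
    # Temperature scaling sharpens (<1) or flattens (>1) the distribution
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Top-k: mask everything below the k-th largest logit
    kth = np.sort(logits)[-min(top_k, logits.size)]
    logits = np.where(logits < kth, -np.inf, logits)
    # Softmax over the surviving logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p (nucleus): keep the smallest prefix of tokens whose mass exceeds top_p
    order = np.argsort(probs)[::-1]
    beyond = order[np.cumsum(probs[order]) > top_p]
    probs[beyond[1:]] = 0.0  # always keep the token that crosses the threshold
    return probs / probs.sum()
```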

See [`notebooks/inference_demo.ipynb`](https://github.com/MattGroho/tab-hero/blob/main/notebooks/inference_demo.ipynb) for a full end-to-end example, including audio loading and chart export.
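The usage snippet above leaves `audio_mel` as a placeholder. As a rough, dependency-free sketch of how a `(n_frames, 128)` log-mel matrix could be computed with the documented front-end parameters (22050 Hz, 128 mels, hop=256), where `n_fft=1024`, the Hann window, and the `log1p` compression are assumptions (the repo likely uses librosa or torchaudio):

```python
import numpy as np

SR, N_MELS, HOP, N_FFT = 22050, 128, 256, 1024  # n_fft=1024 is an assumption

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    # Triangular filters spaced evenly on the mel scale
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for j in range(l, c):
            fb[i - 1, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):
            fb[i - 1, j] = (r - j) / max(r - c, 1)
    return fb

def mel_spectrogram(y, sr=SR, n_fft=N_FFT, hop=HOP, n_mels=N_MELS):
    # Frame and window the signal, then take the power spectrum of each frame
    n_frames = 1 + (len(y) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([y[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Project onto the mel filterbank and compress with log
    return np.log1p(power @ mel_filterbank(sr, n_fft, n_mels).T)  # (n_frames, n_mels)

# One second of noise -> roughly 83 frames of 128 mel bins
y = np.random.default_rng(0).standard_normal(SR).astype(np.float32)
mel = mel_spectrogram(y)
print(mel.shape)  # -> (83, 128)
```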

## Training

- **Optimizer**: AdamW (lr=1e-4, weight_decay=0.01, betas=(0.9, 0.95))
- **Scheduler**: Cosine annealing with 1000-step linear warmup
- **Batch size**: 16 (effective 32 with gradient accumulation)
- **Gradient clipping**: max norm 1.0
- **Early stopping**: patience 15 epochs
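A minimal sketch of the learning-rate schedule implied by the list above (linear warmup, then cosine decay). Decaying to zero over the full 195,030 steps is an assumption:

```python
import math

BASE_LR, WARMUP_STEPS, TOTAL_STEPS = 1e-4, 1000, 195_030

def lr_at(step):
    # Linear warmup from 0 to BASE_LR over the first 1000 steps
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    # Cosine annealing from BASE_LR toward 0 over the remaining steps
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))

print(lr_at(500), lr_at(1000))  # -> 5e-05 0.0001
```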

## Limitations

- Trained on a specific dataset of Clone Hero charts; quality varies by genre and playing style.
- Source separation (HTDemucs) is recommended for mixed audio but not required.
- Mel spectrograms are lossy; the model cannot recover audio from its inputs.
- Output requires post-processing via `SongExporter` to produce playable chart files.

## Repository

[https://github.com/MattGroho/tab-hero](https://github.com/MattGroho/tab-hero)