
TouchGrass - Preview Release

🎵 What is TouchGrass?

TouchGrass is a lightweight music AI assistant built by fine-tuning Qwen3.5 models with specialized music capabilities. This is a PREVIEW RELEASE containing the complete framework with untrained weights.

⚠️ Important: Untrained Preview

This repository contains code and configuration only - NO TRAINED WEIGHTS.

  • ❌ Models are NOT trained (LoRA adapters are randomly initialized)
  • ✅ All architecture, code, and configuration are complete
  • ✅ Ready for training immediately
  • 📊 Expected accuracy after training: 94-95% across modules

📦 Repository Structure

This project contains two model variants in separate folders:

TouchGrass-3B

  • Based on Qwen3.5-3B-Instruct
  • 3 billion parameters (200M trainable LoRA)
  • ~6GB VRAM on GPU; also runs CPU-only (slower)
  • Best for: prototyping, CPU inference, quick iteration

TouchGrass-7B

  • Based on Qwen3.5-7B-Instruct
  • 7 billion parameters (200M trainable LoRA)
  • GPU required, ~14GB VRAM minimum
  • Best for: production deployment, highest quality

🚀 Quick Start

1. Generate Training Data

from TouchGrass.data.music_qa_generator import MusicQAGenerator
from TouchGrass.data.chat_formatter import ChatFormatter

# Generate 10K synthetic samples
gen = MusicQAGenerator(seed=42)
dataset = gen.generate_dataset(num_samples=10000, output_path='data/music_qa.jsonl')

# Format for Qwen chat
fmt = ChatFormatter()
formatted = fmt.format_dataset(dataset)
train, val = fmt.create_splits(formatted, val_size=0.1)
fmt.save_dataset(train, 'data/train.jsonl')
fmt.save_dataset(val, 'data/val.jsonl')
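If you want to sanity-check the split logic before installing the framework, a seeded 90/10 split can be sketched in plain Python. This is a hypothetical stand-in, not the actual `ChatFormatter.create_splits` implementation:

```python
import random

def create_splits_sketch(samples, val_size=0.1, seed=42):
    """Hypothetical stand-in for ChatFormatter.create_splits."""
    rng = random.Random(seed)
    shuffled = samples[:]              # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_size)
    return shuffled[n_val:], shuffled[:n_val]   # (train, val)

samples = [{"messages": [{"role": "user", "content": f"Q{i}"}]} for i in range(100)]
train, val = create_splits_sketch(samples, val_size=0.1)
print(len(train), len(val))  # 90 10
```

Seeding the shuffle keeps the split reproducible across runs, which matters when comparing checkpoints.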

2. Train the Model

For 3B variant:

python train.py \
  --base_model Qwen/Qwen3.5-3B-Instruct \
  --train_data data/train.jsonl \
  --val_data data/val.jsonl \
  --output_dir checkpoints/touchgrass-3b \
  --lora_r 16 \
  --lora_alpha 32 \
  --batch_size 4 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-4 \
  --num_epochs 3 \
  --mixed_precision fp16

For 7B variant:

python train.py \
  --base_model Qwen/Qwen3.5-7B-Instruct \
  --train_data data/train.jsonl \
  --val_data data/val.jsonl \
  --output_dir checkpoints/touchgrass-7b \
  --lora_r 16 \
  --lora_alpha 32 \
  --batch_size 2 \
  --gradient_accumulation_steps 8 \
  --learning_rate 1e-4 \
  --num_epochs 3 \
  --mixed_precision bf16
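Note that both recipes above target the same effective batch size per optimizer step; the 7B variant just trades per-device batch for more gradient accumulation to fit in VRAM:

```python
# Effective batch size = per-device batch x gradient accumulation steps.
def effective_batch(batch_size, grad_accum):
    return batch_size * grad_accum

print(effective_batch(4, 4))  # 3B recipe -> 16
print(effective_batch(2, 8))  # 7B recipe -> 16 (same optimizer behavior, less VRAM)
```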

3. Run Tests

python tests/run_tests.py

4. Evaluate

python benchmarks/evaluate_music_modules.py --device cuda --d_model 2048  # for 3B
python benchmarks/evaluate_music_modules.py --device cuda --d_model 4096  # for 7B

🎯 Features

Five Specialized Music Modules

  1. Tab & Chord Generation 🎸

    • Guitar tablature generation and validation
    • Chord diagram creation
    • Multiple tuning support
    • Difficulty classification
  2. Music Theory Engine 🎹

    • Scale generation (all keys and modes)
    • Chord construction and Roman numeral analysis
    • Circle of fifths
    • Interval calculations
  3. Ear Training 👂

    • Interval identification (12 intervals)
    • Song references (Star Wars for P5, Jaws for m2, etc.)
    • Solfege exercises
    • Quiz generation
  4. EQ Adapter 😌

    • Frustration detection
    • 4-way emotion classification
    • Context-aware simplification
    • Encouragement templates
  5. Song Writing Assistant ✍️

    • Chord progressions by mood/genre
    • Lyric generation with rhyme schemes
    • Hook creation
    • Production advice
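For a feel of what the Music Theory Engine's scale generation covers, here is a minimal, self-contained sketch of major-scale construction from the whole/half-step pattern. This is plain Python for illustration, not the module's actual implementation:

```python
# Semitone steps of the major (Ionian) scale: W W H W W W H
MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]
CHROMATIC = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def major_scale(root):
    i = CHROMATIC.index(root)
    notes = [root]
    for step in MAJOR_STEPS[:-1]:   # the final step just returns to the octave
        i = (i + step) % 12
        notes.append(CHROMATIC[i])
    return notes

print(major_scale("C"))  # ['C', 'D', 'E', 'F', 'G', 'A', 'B']
print(major_scale("G"))  # ['G', 'A', 'B', 'C', 'D', 'E', 'F#']
```

The same step-pattern approach generalizes to the other modes by rotating `MAJOR_STEPS` (a real implementation would also handle enharmonic spelling, e.g. Gb vs F#).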

Music Tokenizer Extension

Adds 21+ music-specific tokens to Qwen's vocabulary:

  • Domain tokens: [GUITAR], [PIANO], [DRUMS], [VOCALS], [THEORY], [PRODUCTION]
  • Emotion tokens: [FRUSTRATED], [CONFUSED], [EXCITED], [CONFIDENT]
  • Difficulty tokens: [EASY], [MEDIUM], [HARD]
  • Function tokens: [TAB], [CHORD], [SCALE], [INTERVAL], [PROGRESSION]
  • EQ tokens: [SIMPLIFY], [ENCOURAGE]
  • Music notation: All note names and chord types
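The bracketed tokens listed above can be collected into one list for registration with the tokenizer. The registration calls below are shown as comments because they assume a loaded Hugging Face tokenizer and model:

```python
MUSIC_TOKENS = (
    ["[GUITAR]", "[PIANO]", "[DRUMS]", "[VOCALS]", "[THEORY]", "[PRODUCTION]"]  # domain
    + ["[FRUSTRATED]", "[CONFUSED]", "[EXCITED]", "[CONFIDENT]"]                # emotion
    + ["[EASY]", "[MEDIUM]", "[HARD]"]                                          # difficulty
    + ["[TAB]", "[CHORD]", "[SCALE]", "[INTERVAL]", "[PROGRESSION]"]            # function
    + ["[SIMPLIFY]", "[ENCOURAGE]"]                                             # EQ
)
# With a Hugging Face tokenizer, registration would typically look like:
#   tokenizer.add_tokens(MUSIC_TOKENS, special_tokens=True)
#   model.resize_token_embeddings(len(tokenizer))
print(len(MUSIC_TOKENS))  # 20 bracketed tokens; note names and chord types take the total past 21
```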

Six Music Domains Covered

  • Guitar & Bass
  • Piano & Keys
  • Drums & Percussion
  • Vocals & Singing
  • Music Theory & Composition
  • DJ & Production

📊 Expected Performance

After training on 10K samples for 3 epochs:

Module         3B      7B
Tab & Chord    95.0%   96.0%
Music Theory   98.5%   99.0%
Ear Training   97.5%   98.0%
EQ Adapter     92.0%   93.0%
Songwriting    88.0%   90.0%
Overall        94.2%   95.2%
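The Overall row is the unweighted mean of the five module scores, which is easy to verify:

```python
scores = {
    "3B": [95.0, 98.5, 97.5, 92.0, 88.0],
    "7B": [96.0, 99.0, 98.0, 93.0, 90.0],
}
for variant, s in scores.items():
    print(variant, round(sum(s) / len(s), 1))  # 3B 94.2, 7B 95.2
```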

πŸ—οΈ Architecture

TouchGrass/
├── configs/              # Model configurations
├── tokenizer/            # Music tokenizer extension
├── models/               # 5 specialized music modules
├── data/                 # Dataset generation & formatting
├── training/             # LoRA training pipeline
├── inference/            # Unified inference
├── benchmarks/           # Evaluation scripts
├── tests/                # Comprehensive test suite
├── configuration_touchgrass.py   # HF config
├── tokenization_touchgrass.py    # HF tokenizer
├── ollama_3b_modelfile   # Ollama config (3B)
└── ollama_7b_modelfile   # Ollama config (7B)

🧪 Testing

# All tests
python tests/run_tests.py

# With coverage
python tests/run_tests.py --coverage

# Specific module
pytest tests/test_music_theory_module.py -v

Test Coverage: 50+ unit tests covering all modules, data pipeline, and training components.

🔧 Configuration

LoRA Settings

  • Rank (r): 16 (recommended range: 8-32)
  • Alpha: 32 (typically 2×r)
  • Target modules: q_proj, k_proj, v_proj, o_proj
  • Dropout: 0.1
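The trainable-parameter cost of these settings is easy to reason about: each targeted matrix gains two low-rank factors. The example below uses a hypothetical d_model of 2048 purely for illustration:

```python
def lora_params_per_matrix(r, d_in, d_out):
    # LoRA adds two factors per target matrix: A (d_in x r) and B (r x d_out)
    return r * d_in + r * d_out

# e.g. one q_proj in a hypothetical d_model=2048 layer, at the recommended r=16:
print(lora_params_per_matrix(16, 2048, 2048))  # 65536
```

Multiply by the four target modules and the number of layers to get the total trainable count; raising r scales it linearly.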

Training Hyperparameters

  • 3B: lr=2e-4, batch=4, grad_accum=4
  • 7B: lr=1e-4, batch=2, grad_accum=8
  • Epochs: 3
  • Mixed precision: fp16 (widely supported) or bf16 (Ampere and newer NVIDIA GPUs)

Loss Weights

  • LM loss: 1.0
  • EQ loss: 0.1
  • Music module loss: 0.05
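These weights combine the per-module losses into a single training objective, keeping next-token prediction dominant while the auxiliary heads contribute small gradients:

```python
LOSS_WEIGHTS = {"lm": 1.0, "eq": 0.1, "music": 0.05}

def total_loss(lm_loss, eq_loss, music_loss, w=LOSS_WEIGHTS):
    # Weighted sum: the LM objective dominates; EQ and music heads are auxiliary.
    return w["lm"] * lm_loss + w["eq"] * eq_loss + w["music"] * music_loss

print(total_loss(2.0, 1.0, 1.0))  # 2.15
```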

💻 Hardware Requirements

Training

  • 3B: 6GB+ GPU VRAM (RTX 3060 12GB recommended)
  • 7B: 14GB+ GPU VRAM (RTX 3090/4090 24GB recommended)
  • CPU training possible but very slow (not recommended for 7B)

Inference

  • 3B: 4GB+ GPU VRAM or CPU (slower)
  • 7B: 8GB+ GPU VRAM
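As a rough rule of thumb, the training floors above follow from storing fp16/bf16 weights alone at ~2 bytes per parameter (activations, optimizer state, and KV cache come on top); quantized inference can go lower, which is how the 3B fits in 4GB:

```python
def weight_memory_gb(params_billion, bytes_per_param=2):  # 2 bytes for fp16/bf16
    # Weights only -- real usage adds activations, optimizer state, KV cache.
    return params_billion * bytes_per_param

print(weight_memory_gb(3))  # 6  -> matches the 3B training floor above
print(weight_memory_gb(7))  # 14 -> matches the 7B training floor
```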

🤝 Contributing

This is a preview release. Contributions welcome:

  1. Improve synthetic data quality
  2. Add more music domains (world music, jazz, etc.)
  3. Enhance module implementations
  4. Add more tests and benchmarks
  5. Improve documentation

📄 License

MIT License - see LICENSE file.

πŸ™ Acknowledgments

  • Base model: Qwen3.5 by Alibaba Cloud
  • HuggingFace Transformers & PEFT libraries
  • Music theory: Traditional Western harmony principles

📞 Support

  • Issues: GitHub Issues
  • Discussions: GitHub Discussions
  • Documentation: See module docstrings and README.md

Made with ❀️ for musicians everywhere.

Touch Grass - because even AI needs to remember to make music, not just talk about it.

🔗 Quick Links