# 🥁 Drum Sample Extractor
Extract individual drum samples (kick, snare, hi-hat, etc.) from any audio file. The pipeline isolates the drums, detects individual hits, separates overlapping sounds, clusters near-identical hits, and exports the best representative from each cluster.
## Pipeline Architecture

```
song.mp3
   │
   ▼  [1] HTDemucs v4 (fine-tuned) ─── stem separation
drums.wav
   │
   ▼  [2] Multi-band onset detection ─── librosa (backtracking, 3-band)
hits/hit_001.wav, hit_002.wav, ...
   │
   ▼  [3] Spectral band decomposition ─── separate overlapping kick+snare+hihat
hits_separated/{kick/, snare/, hihat/}
   │
   ▼  [4] Feature embeddings + clustering ─── librosa (58-dim) or CLAP (512-dim)
   │      Auto-K via silhouette score
   │
   ▼  [5] Best representative selection ─── 60% centroid-proximity + 40% energy
   │
   ▼  [6] Optional: weighted synthesis ─── peak-aligned averaging across cluster
   │
   ▼  EXPORT
samples/kick_0__best.wav               # best real sample per cluster
synthesized/kick_0__synthesized.wav    # synthetic "ideal" version
manifest.json                          # metadata for all clusters
```
## Quick Start

```bash
pip install demucs librosa soundfile scikit-learn numpy torch transformers
python drum_extractor.py song.mp3 -o ./my_samples
```
## Usage

```bash
# Basic - extract from any audio file
python drum_extractor.py song.mp3 -o ./samples

# CPU-only (no GPU required for any stage)
python drum_extractor.py song.wav -o ./samples --no-gpu

# Use CLAP embeddings for semantic clustering (slower but more accurate)
python drum_extractor.py song.wav -o ./samples --clap

# Skip overlap separation (faster, but simultaneous hits stay merged)
python drum_extractor.py song.wav -o ./samples --no-separate

# Skip synthesis (only export real samples, no averaging)
python drum_extractor.py song.wav -o ./samples --no-synthesize

# Tune detection sensitivity
python drum_extractor.py song.wav -o ./samples \
    --min-hit-dur 0.05 \
    --max-hit-dur 1.0 \
    --energy-threshold -35
```
## Output Structure

```
output_dir/
├── drums_stem.wav            # Isolated drum track from Demucs
├── all_hits/                 # Every detected hit (intermediate)
│   ├── hit_0000_kick_0.500s.wav
│   ├── hit_0001_snare_1.000s.wav
│   └── ...
├── samples/                  # Best representative per cluster
│   ├── kick_0__best.wav
│   ├── snare_0__best.wav
│   ├── hihat_closed_0__best.wav
│   └── ...
├── synthesized/              # Synthesized "ideal" samples
│   ├── kick_0__synthesized.wav
│   └── ...
└── manifest.json             # Full metadata
```
## How Each Stage Works
### Stage 1: Drum Stem Extraction
Uses HTDemucs v4 fine-tuned (`htdemucs_ft`), the current state of the art for music source separation, at 8.4 dB SDR on drums (MUSDB18-HQ). Falls back to `htdemucs` if the fine-tuned variant is unavailable.
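For reference, the same separation can be driven through the standard Demucs CLI. A minimal sketch (paths and fallback handling inside `drum_extractor.py` may differ):

```python
import subprocess

# Isolate only the drum stem with the fine-tuned HTDemucs v4 model.
# With Demucs' default layout the stem lands under <out>/<model>/<track>/drums.wav
subprocess.run(
    ["demucs", "--two-stems", "drums", "-n", "htdemucs_ft",
     "-o", "demucs_out", "song.mp3"],
    check=True,
)
```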
### Stage 2: Onset Detection
Multi-band onset detection using librosa:
- Low band (20–250 Hz): catches kicks
- Mid band (250–4000 Hz): catches snares and toms
- High band (4000 Hz and up): catches cymbals and hi-hats
Each band is normalized independently, then combined via element-wise max. Backtracking snaps onsets to the true attack start.
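A minimal sketch of this stage using librosa's onset utilities (the band edges mirror the list above; the per-band normalization and default hop settings are assumptions, not the tool's exact code):

```python
import numpy as np
import librosa

def multiband_onsets(y, sr):
    """Detect drum onsets from a combined 3-band onset envelope (illustrative)."""
    bands = [(20, 250), (250, 4000), (4000, sr / 2)]
    envelopes = []
    for lo, hi in bands:
        # Restrict the mel filterbank so each envelope only "hears" one band
        env = librosa.onset.onset_strength(y=y, sr=sr, fmin=lo, fmax=hi)
        # Normalize per band so quiet bands (e.g. hi-hats) still contribute
        envelopes.append(env / (env.max() + 1e-9))
    combined = np.max(np.stack(envelopes), axis=0)   # element-wise max across bands
    # backtrack=True snaps each onset back to the preceding minimum (attack start)
    frames = librosa.onset.onset_detect(onset_envelope=combined, sr=sr, backtrack=True)
    return librosa.frames_to_time(frames, sr=sr)
```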
### Stage 3: Spectral Classification & Overlap Separation
Each hit is classified by a spectral decision tree:
- Kick: >50% low-band energy, centroid < 800 Hz
- Snare: >40% mid-band energy, high ZCR, centroid > 1000 Hz
- Hi-hat (closed/open): >35% high-band energy, centroid > 4000 Hz
- Cymbal/Tom/Percussion: remaining combinations
When two bands both carry >15% of peak energy, the hit is split into separate sub-hits (one per band).
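In code, that decision tree amounts to comparing per-band energy fractions against a couple of spectral statistics. A sketch, where the percentage and centroid thresholds come from the list above but the ZCR cutoff and exact band edges are assumptions:

```python
import numpy as np
import librosa

def classify_hit(y, sr):
    """Rough spectral classification of a single hit (thresholds illustrative)."""
    S = np.abs(librosa.stft(y)) ** 2
    freqs = librosa.fft_frequencies(sr=sr)
    total = S.sum() + 1e-12
    low = S[freqs < 250].sum() / total
    mid = S[(freqs >= 250) & (freqs < 4000)].sum() / total
    high = S[freqs >= 4000].sum() / total
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    zcr = librosa.feature.zero_crossing_rate(y).mean()

    if low > 0.50 and centroid < 800:
        return "kick"
    if mid > 0.40 and zcr > 0.1 and centroid > 1000:   # 0.1 ZCR cutoff is assumed
        return "snare"
    if high > 0.35 and centroid > 4000:
        return "hihat"
    return "percussion"
```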
### Stage 4: Embedding & Clustering
- Default (librosa, 58-dim): MFCCs (mean + std), spectral centroid/bandwidth/rolloff/contrast/flatness, ZCR, RMS, onset envelope shape, duration. Z-score normalized.
- Optional (CLAP, 512-dim): `laion/larger_clap_general`, semantic audio embeddings via Contrastive Language-Audio Pretraining. Better at distinguishing subtly different drum types but slower.
Clustering is hierarchical: first group by rough spectral label, then sub-cluster within each group using KMeans with auto-K selection via silhouette score.
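The auto-K step is standard scikit-learn: within each spectral group, fit KMeans over a range of K and keep the K with the highest silhouette score. A sketch, with `k_max` and the small-group fallback as assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_auto_k(X, k_max=8):
    """Sub-cluster one spectral group, picking K by silhouette score (sketch).

    X: (n_hits, n_features) z-scored feature matrix.
    """
    if len(X) < 3:
        return np.zeros(len(X), dtype=int)   # too few hits to sub-cluster
    best_labels, best_score = np.zeros(len(X), dtype=int), -1.0
    for k in range(2, min(k_max, len(X) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
```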
### Stage 5: Best Representative Selection
Each cluster's "best" hit is selected by a weighted score:
- 60% representativeness: closest to cluster centroid in MFCC space
- 40% energy: higher RMS = cleaner transient with less bleed
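A sketch of that weighted score (how the two terms are rescaled onto a common 0–1 range is an assumption):

```python
import numpy as np

def pick_best(embeddings, rms_values):
    """Score cluster members: 60% centroid proximity, 40% RMS energy (sketch)."""
    embeddings = np.asarray(embeddings)        # (n, d) MFCC-space features
    rms = np.asarray(rms_values, dtype=float)  # (n,) per-hit RMS
    centroid = embeddings.mean(axis=0)
    dist = np.linalg.norm(embeddings - centroid, axis=1)
    rep = 1.0 - dist / (dist.max() + 1e-9)     # closer to centroid -> higher score
    energy = rms / (rms.max() + 1e-9)
    return int(np.argmax(0.6 * rep + 0.4 * energy))
```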
### Stage 6: Synthesis (Optional)
Creates an "ideal" sample by peak-aligned weighted averaging:
- Align all cluster members to their peak transient
- Normalize amplitudes
- Weighted average (best hit gets 2× weight)

This reduces random noise/bleed while preserving the shared transient character.
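A sketch of the averaging step (the pre-peak padding and truncate-to-shortest strategy are assumptions):

```python
import numpy as np

def synthesize(hits, best_index, pre_peak=256):
    """Peak-align cluster members and average them, weighting the best hit 2x (sketch)."""
    aligned = []
    for y in hits:
        peak = int(np.argmax(np.abs(y)))
        seg = y[max(0, peak - pre_peak):]          # keep a short run-up before the transient
        seg = seg / (np.max(np.abs(seg)) + 1e-9)   # normalize amplitude
        aligned.append(seg)
    length = min(len(a) for a in aligned)          # truncate to the shortest member
    stack = np.stack([a[:length] for a in aligned])
    weights = np.ones(len(stack))
    weights[best_index] = 2.0                      # best real sample counts double
    out = np.average(stack, axis=0, weights=weights)
    return out / (np.max(np.abs(out)) + 1e-9)
```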
## Upgrade Paths
For higher quality on specific stages, these drop-in replacements are available:

| Stage | Current | Upgrade | Benefit |
|---|---|---|---|
| 3 (overlap separation) | Spectral bands | AudioSep | Text-queried separation ("kick drum"), 10.5 dB SDRi |
| 3 (overlap separation) | Spectral bands | SAM Audio | Diffusion-based + temporal span prompts (gated, Meta license) |
| 4 (clustering) | librosa features | CLAP embeddings (`--clap`) | Semantic similarity, better cross-genre generalization |
| 2 (onset detection) | librosa | madmom RNNOnsetProcessor | 0.89 F1 on ENST-drums (needs Python ≤3.10) |
## Requirements

```
demucs>=4.0
librosa>=0.10
soundfile
scikit-learn
numpy
torch
transformers   # only needed with --clap
```
## License
MIT