# 🥁 Drum Sample Extractor
Extract individual drum samples (kick, snare, hi-hat, etc.) from any audio file. The pipeline isolates the drum stem, detects individual hits, separates overlapping sounds, clusters similar hits, and exports the best representative from each cluster.
## Pipeline Architecture
```
song.mp3
   │
   ▼  [1] HTDemucs v4 (fine-tuned)        ◄── stem separation
drums.wav
   │
   ▼  [2] Multi-band onset detection      ◄── librosa (backtracking, 3-band)
hits/hit_001.wav, hit_002.wav, ...
   │
   ▼  [3] Spectral band decomposition     ◄── separate overlapping kick+snare+hihat
hits_separated/{kick/, snare/, hihat/}
   │
   ▼  [4] Feature embeddings + clustering ◄── librosa (58-dim) or CLAP (512-dim)
   │      Auto-K via silhouette score
   │
   ▼  [5] Best representative selection   ◄── 60% centroid-proximity + 40% energy
   │
   ▼  [6] Optional: weighted synthesis    ◄── peak-aligned averaging across cluster
   │
   ▼  EXPORT
samples/kick_0__best.wav              # best real sample per cluster
synthesized/kick_0__synthesized.wav   # synthetic "ideal" version
manifest.json                         # metadata for all clusters
```
## Quick Start
```bash
pip install demucs librosa soundfile scikit-learn numpy torch transformers
python drum_extractor.py song.mp3 -o ./my_samples
```
## Usage
```bash
# Basic - extract from any audio file
python drum_extractor.py song.mp3 -o ./samples
# CPU-only (no GPU required for any stage)
python drum_extractor.py song.wav -o ./samples --no-gpu
# Use CLAP embeddings for semantic clustering (slower but more accurate)
python drum_extractor.py song.wav -o ./samples --clap
# Skip overlap separation (faster, but simultaneous hits stay merged)
python drum_extractor.py song.wav -o ./samples --no-separate
# Skip synthesis (only export real samples, no averaging)
python drum_extractor.py song.wav -o ./samples --no-synthesize
# Tune detection sensitivity
python drum_extractor.py song.wav -o ./samples \
--min-hit-dur 0.05 \
--max-hit-dur 1.0 \
--energy-threshold -35
```
## Output Structure
```
output_dir/
├── drums_stem.wav                  # Isolated drum track from Demucs
├── all_hits/                       # Every detected hit (intermediate)
│   ├── hit_0000_kick_0.500s.wav
│   ├── hit_0001_snare_1.000s.wav
│   └── ...
├── samples/                        # Best representative per cluster
│   ├── kick_0__best.wav
│   ├── snare_0__best.wav
│   ├── hihat_closed_0__best.wav
│   └── ...
├── synthesized/                    # Synthesized "ideal" samples
│   ├── kick_0__synthesized.wav
│   └── ...
└── manifest.json                   # Full metadata
```
## How Each Stage Works
### Stage 1: Drum Stem Extraction
Uses [HTDemucs v4 fine-tuned](https://github.com/facebookresearch/demucs) (`htdemucs_ft`), the current SOTA for music source separation at **8.4 dB SDR** on drums (MUSDB18-HQ). Falls back to `htdemucs` if the fine-tuned variant is unavailable.
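A minimal sketch of this stage with the demucs Python API (assuming demucs 4.x; the mono-to-stereo handling and helper name are illustrative, not the script's actual code):

```python
import torch
import torchaudio
from demucs.apply import apply_model
from demucs.pretrained import get_model

def extract_drum_stem(path: str, device: str = "cpu") -> tuple[torch.Tensor, int]:
    """Separate the drum stem from a full mix with HTDemucs."""
    try:
        model = get_model("htdemucs_ft")   # fine-tuned variant
    except Exception:
        model = get_model("htdemucs")      # fallback, as noted above
    model.to(device).eval()

    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, model.samplerate)
    if wav.shape[0] == 1:                  # Demucs expects stereo input
        wav = wav.repeat(2, 1)

    # (batch, channels, time) in -> (batch, sources, channels, time) out
    sources = apply_model(model, wav[None], device=device, split=True)[0]
    return sources[model.sources.index("drums")], model.samplerate
```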
### Stage 2: Onset Detection
Multi-band onset detection using librosa:
- **Low band** (20β250 Hz): catches kicks
- **Mid band** (250β4000 Hz): catches snares and toms
- **High band** (4000+ Hz): catches cymbals and hi-hats
Each band is normalized independently, then combined via element-wise max. Backtracking snaps onsets to the true attack start.
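A simplified sketch of that scheme, using the band edges listed above (STFT parameters and the normalization epsilon are illustrative):

```python
import numpy as np
import librosa

def multiband_onsets(y: np.ndarray, sr: int) -> np.ndarray:
    """Onset times from a 3-band onset envelope combined via element-wise max."""
    S = np.abs(librosa.stft(y))
    freqs = librosa.fft_frequencies(sr=sr)
    envelopes = []
    for lo, hi in [(20, 250), (250, 4000), (4000, sr / 2)]:
        band = S[(freqs >= lo) & (freqs < hi)]
        env = librosa.onset.onset_strength(S=librosa.amplitude_to_db(band), sr=sr)
        envelopes.append(env / (env.max() + 1e-9))      # per-band normalization
    combined = np.max(envelopes, axis=0)                # element-wise max
    frames = librosa.onset.onset_detect(
        onset_envelope=combined, sr=sr, backtrack=True  # snap to attack start
    )
    return librosa.frames_to_time(frames, sr=sr)
```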
### Stage 3: Spectral Classification & Overlap Separation
Each hit is classified by a spectral decision tree:
- **Kick**: >50% low-band energy, centroid < 800 Hz
- **Snare**: >40% mid-band energy, high ZCR, centroid > 1000 Hz
- **Hi-hat (closed/open)**: >35% high-band energy, centroid > 4000 Hz
- **Cymbal/Tom/Percussion**: remaining combinations
When two or more bands each carry >15% of the hit's peak energy, the hit is split into separate sub-hits (one per band).
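A sketch of the classification rules above (the ZCR cutoff and the fall-through label are illustrative assumptions):

```python
import numpy as np
import librosa

def classify_hit(y: np.ndarray, sr: int) -> str:
    """Label one hit from its band-energy fractions, centroid, and ZCR."""
    S = np.abs(librosa.stft(y)) ** 2
    freqs = librosa.fft_frequencies(sr=sr)
    total = S.sum() + 1e-12

    def band_frac(lo: float, hi: float) -> float:
        return S[(freqs >= lo) & (freqs < hi)].sum() / total

    low, mid, high = band_frac(20, 250), band_frac(250, 4000), band_frac(4000, sr / 2)
    centroid = float(librosa.feature.spectral_centroid(y=y, sr=sr).mean())
    zcr = float(librosa.feature.zero_crossing_rate(y).mean())

    if low > 0.50 and centroid < 800:
        return "kick"
    if mid > 0.40 and centroid > 1000 and zcr > 0.05:   # ZCR cutoff is illustrative
        return "snare"
    if high > 0.35 and centroid > 4000:
        return "hihat"
    return "percussion"   # cymbals/toms/other land here
```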
### Stage 4: Embedding & Clustering
**Default (librosa, 58-dim)**: MFCCs (mean + std), spectral centroid/bandwidth/rolloff/contrast/flatness, ZCR, RMS, onset-envelope shape, and duration, all z-score normalized.
**Optional (CLAP, 512-dim)**: `laion/larger_clap_general`, semantic audio embeddings via Contrastive Language-Audio Pretraining. Better at distinguishing subtly different drum types but slower.
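A minimal sketch of the CLAP path via Hugging Face `transformers` (CLAP models expect 48 kHz audio; the helper name is an assumption):

```python
import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/larger_clap_general")
processor = ClapProcessor.from_pretrained("laion/larger_clap_general")

def clap_embed(hit: np.ndarray, sr: int = 48_000) -> torch.Tensor:
    """512-dim semantic embedding for one (48 kHz mono) drum hit."""
    inputs = processor(audios=hit, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        return model.get_audio_features(**inputs)[0]   # shape: (512,)
```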
Clustering is hierarchical: first group by rough spectral label, then sub-cluster within each group using KMeans with auto-K selection via silhouette score.
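The sub-clustering step might look like this (a sketch; `k_max` and the random seed are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def cluster_auto_k(features: np.ndarray, k_max: int = 8) -> np.ndarray:
    """KMeans within one spectral group, picking K by silhouette score."""
    X = StandardScaler().fit_transform(features)   # z-score normalization
    best_labels = np.zeros(len(X), dtype=int)      # single-cluster fallback
    best_score = -1.0
    for k in range(2, min(k_max, len(X) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
```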
### Stage 5: Best Representative Selection
Each cluster's "best" hit is selected by a weighted score (sketched below):
- **60% representativeness**: closest to cluster centroid in MFCC space
- **40% energy**: higher RMS = cleaner transient with less bleed
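A minimal sketch of that scoring:

```python
import numpy as np

def pick_best(embeddings: np.ndarray, rms: np.ndarray) -> int:
    """Index of the cluster member maximizing 0.6*proximity + 0.4*energy."""
    centroid = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    proximity = 1.0 - dists / (dists.max() + 1e-9)   # 1.0 = closest to centroid
    energy = rms / (rms.max() + 1e-9)                # 1.0 = loudest hit
    return int(np.argmax(0.6 * proximity + 0.4 * energy))
```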
### Stage 6: Synthesis (Optional)
Creates an "ideal" sample by peak-aligned weighted averaging (sketched below):
1. Align all cluster members to their peak transient
2. Normalize amplitudes
3. Weighted average (best hit gets 2× weight)
This reduces random noise and bleed while preserving the shared transient character.
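A condensed sketch of the procedure (helper name and epsilon are illustrative):

```python
import numpy as np

def synthesize(hits: list[np.ndarray], best_idx: int) -> np.ndarray:
    """Peak-align cluster members, normalize, and average (best hit at 2x weight)."""
    peaks = [int(np.argmax(np.abs(h))) for h in hits]
    lead = max(peaks)                                    # samples before each peak
    tail = max(len(h) - p for h, p in zip(hits, peaks))  # samples after each peak

    acc = np.zeros(lead + tail)
    total_weight = 0.0
    for i, (h, p) in enumerate(zip(hits, peaks)):
        aligned = np.zeros_like(acc)
        aligned[lead - p : lead - p + len(h)] = h / (np.max(np.abs(h)) + 1e-9)
        w = 2.0 if i == best_idx else 1.0                # best hit gets 2x weight
        acc += w * aligned
        total_weight += w
    return acc / total_weight
```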
## Upgrade Paths
For higher quality on specific stages, these drop-in replacements are available:
| Stage | Current | Upgrade | Benefit |
|-------|---------|---------|---------|
| 3 (overlap separation) | Spectral bands | [AudioSep](https://huggingface.co/spaces/Audio-AGI/AudioSep) | Text-queried separation ("kick drum"), 10.5 dB SDRi |
| 3 (overlap separation) | Spectral bands | [SAM Audio](https://huggingface.co/facebook/sam-audio-large) | Diffusion-based + temporal span prompts (gated, Meta license) |
| 4 (clustering) | librosa features | CLAP embeddings (`--clap`) | Semantic similarity, better cross-genre generalization |
| 2 (onset detection) | librosa | [madmom](https://github.com/CPJKU/madmom) RNNOnsetProcessor | 0.89 F1 on ENST-drums (needs Python ≤ 3.10) |
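For example, swapping stage 2 to madmom could look like this (standard madmom processors; not part of the current script, and the threshold is illustrative):

```python
from madmom.features.onsets import OnsetPeakPickingProcessor, RNNOnsetProcessor

# RNN activation over the drum stem, then peak-picking to onset times in seconds
activation = RNNOnsetProcessor()("drums_stem.wav")
onset_times = OnsetPeakPickingProcessor(fps=100, threshold=0.35)(activation)
```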
## Requirements
```
demucs>=4.0
librosa>=0.10
soundfile
scikit-learn
numpy
torch
transformers # only needed with --clap
```
## License
MIT