rikhoffbauer2
/

drum-sample-extractor

Model card Files Files and versions

xet

Community

rikhoffbauer2 commited on 27 days ago

Commit

b1ba411

verified ·

1 Parent(s): 3806379

Add README with architecture docs and upgrade paths

Browse files

Files changed (1) hide show

README.md +151 -0

README.md ADDED Viewed

	@@ -0,0 +1,151 @@

+# 🥁 Drum Sample Extractor
+Extract individual drum samples (kick, snare, hi-hat, etc.) from any audio file. The pipeline isolates drums, detects individual hits, separates overlapping sounds, clusters identical samples, and exports the best representative from each group.
+## Pipeline Architecture
+```
+song.mp3
+  │
+  ▼ [1] HTDemucs v4 (fine-tuned) ─── stem separation
+drums.wav
+  │
+  ▼ [2] Multi-band onset detection ─── librosa (backtracking, 3-band)
+hits/hit_001.wav, hit_002.wav, ...
+  │
+  ▼ [3] Spectral band decomposition ─── separate overlapping kick+snare+hihat
+hits_separated/{kick/, snare/, hihat/}
+  │
+  ▼ [4] Feature embeddings + clustering ─── librosa (58-dim) or CLAP (512-dim)
+  │     Auto-K via silhouette score
+  │
+  ▼ [5] Best representative selection ─── 60% centroid-proximity + 40% energy
+  │
+  ▼ [6] Optional: weighted synthesis ─── peak-aligned averaging across cluster
+  │
+  ▼ EXPORT
+samples/kick_0__best.wav          # best real sample per cluster
+synthesized/kick_0__synthesized.wav  # synthetic "ideal" version
+manifest.json                     # metadata for all clusters
+```
+## Quick Start
+```bash
+pip install demucs librosa soundfile scikit-learn numpy torch transformers
+python drum_extractor.py song.mp3 -o ./my_samples
+```
+## Usage
+```bash
+# Basic - extract from any audio file
+python drum_extractor.py song.mp3 -o ./samples
+# CPU-only (no GPU required for any stage)
+python drum_extractor.py song.wav -o ./samples --no-gpu
+# Use CLAP embeddings for semantic clustering (slower but more accurate)
+python drum_extractor.py song.wav -o ./samples --clap
+# Skip overlap separation (faster, but simultaneous hits stay merged)
+python drum_extractor.py song.wav -o ./samples --no-separate
+# Skip synthesis (only export real samples, no averaging)
+python drum_extractor.py song.wav -o ./samples --no-synthesize
+# Tune detection sensitivity
+python drum_extractor.py song.wav -o ./samples \
+    --min-hit-dur 0.05 \
+    --max-hit-dur 1.0 \
+    --energy-threshold -35
+```
+## Output Structure
+```
+output_dir/
+├── drums_stem.wav              # Isolated drum track from Demucs
+├── all_hits/                   # Every detected hit (intermediate)
+│   ├── hit_0000_kick_0.500s.wav
+│   ├── hit_0001_snare_1.000s.wav
+│   └── ...
+├── samples/                    # Best representative per cluster
+│   ├── kick_0__best.wav
+│   ├── snare_0__best.wav
+│   ├── hihat_closed_0__best.wav
+│   └── ...
+├── synthesized/                # Synthesized "ideal" samples
+│   ├── kick_0__synthesized.wav
+│   └── ...
+└── manifest.json               # Full metadata
+```
+## How Each Stage Works
+### Stage 1: Drum Stem Extraction
+Uses [HTDemucs v4 fine-tuned](https://github.com/facebookresearch/demucs) (`htdemucs_ft`) — the current SOTA for music source separation at **8.4 dB SDR** on drums (MUSDB18-HQ). Falls back to `htdemucs` if the fine-tuned variant is unavailable.
+### Stage 2: Onset Detection
+Multi-band onset detection using librosa:
+- **Low band** (20–250 Hz): catches kicks
+- **Mid band** (250–4000 Hz): catches snares and toms
+- **High band** (4000+ Hz): catches cymbals and hi-hats
+Each band is normalized independently, then combined via element-wise max. Backtracking snaps onsets to the true attack start.
+### Stage 3: Spectral Classification & Overlap Separation
+Each hit is classified by a spectral decision tree:
+- **Kick**: >50% low-band energy, centroid < 800 Hz
+- **Snare**: >40% mid-band energy, high ZCR, centroid > 1000 Hz
+- **Hi-hat (closed/open)**: >35% high-band energy, centroid > 4000 Hz
+- **Cymbal/Tom/Percussion**: remaining combinations
+When two bands both carry >15% of peak energy, the hit is split into separate sub-hits (one per band).
+### Stage 4: Embedding & Clustering
+**Default (librosa, 58-dim)**: MFCCs (mean+std), spectral centroid/bandwidth/rolloff/contrast/flatness, ZCR, RMS, onset envelope shape, duration. Z-score normalized.
+**Optional (CLAP, 512-dim)**: `laion/larger_clap_general` — semantic audio embeddings via Contrastive Language-Audio Pretraining. Better at distinguishing subtly different drum types but slower.
+Clustering is hierarchical: first group by rough spectral label, then sub-cluster within each group using KMeans with auto-K selection via silhouette score.
+### Stage 5: Best Representative Selection
+Each cluster's "best" hit is selected by a weighted score:
+- **60% representativeness**: closest to cluster centroid in MFCC space
+- **40% energy**: higher RMS = cleaner transient with less bleed
+### Stage 6: Synthesis (Optional)
+Creates an "ideal" sample by peak-aligned weighted averaging:
+1. Align all cluster members to their peak transient
+2. Normalize amplitudes
+3. Weighted average (best hit gets 2× weight)
+4. This reduces random noise/bleed while preserving the shared transient character
+## Upgrade Paths
+For higher quality on specific stages, these drop-in replacements are available:
+| Stage | Current | Upgrade | Benefit |
+|-------|---------|---------|---------|
+| 3 (overlap separation) | Spectral bands | [AudioSep](https://huggingface.co/spaces/Audio-AGI/AudioSep) | Text-queried separation ("kick drum"), 10.5 dB SDRi |
+| 3 (overlap separation) | Spectral bands | [SAM Audio](https://huggingface.co/facebook/sam-audio-large) | Diffusion-based + temporal span prompts (gated, Meta license) |
+| 4 (clustering) | librosa features | CLAP embeddings (`--clap`) | Semantic similarity, better cross-genre generalization |
+| 2 (onset detection) | librosa | [madmom](https://github.com/CPJKU/madmom) RNNOnsetProcessor | 0.89 F1 on ENST-drums (needs Python ≤3.10) |
+## Requirements
+```
+demucs>=4.0
+librosa>=0.10
+soundfile
+scikit-learn
+numpy
+torch
+transformers  # only needed with --clap
+```
+## License
+MIT