rikhoffbauer2 committed on
Commit b1ba411 · verified · 1 Parent(s): 3806379

Add README with architecture docs and upgrade paths

Files changed (1)
  1. README.md +151 -0
README.md ADDED
@@ -0,0 +1,151 @@
# 🥁 Drum Sample Extractor

Extract individual drum samples (kick, snare, hi-hat, etc.) from any audio file. The pipeline isolates the drum stem, detects individual hits, separates overlapping sounds, clusters similar hits, and exports the best representative from each cluster.

## Pipeline Architecture

```
song.mp3
   │
   ▼  [1] HTDemucs v4 (fine-tuned) ─── stem separation
drums.wav
   │
   ▼  [2] Multi-band onset detection ─── librosa (backtracking, 3-band)
hits/hit_001.wav, hit_002.wav, ...
   │
   ▼  [3] Spectral band decomposition ─── separate overlapping kick+snare+hihat
hits_separated/{kick/, snare/, hihat/}
   │
   ▼  [4] Feature embeddings + clustering ─── librosa (58-dim) or CLAP (512-dim)
   │      Auto-K via silhouette score
   │
   ▼  [5] Best representative selection ─── 60% centroid-proximity + 40% energy
   │
   ▼  [6] Optional: weighted synthesis ─── peak-aligned averaging across cluster
   │
   ▼  EXPORT
samples/kick_0__best.wav              # best real sample per cluster
synthesized/kick_0__synthesized.wav   # synthetic "ideal" version
manifest.json                         # metadata for all clusters
```

## Quick Start

```bash
pip install demucs librosa soundfile scikit-learn numpy torch transformers

python drum_extractor.py song.mp3 -o ./my_samples
```

## Usage

```bash
# Basic: extract from any audio file
python drum_extractor.py song.mp3 -o ./samples

# CPU-only (no GPU required for any stage)
python drum_extractor.py song.wav -o ./samples --no-gpu

# Use CLAP embeddings for semantic clustering (slower but more accurate)
python drum_extractor.py song.wav -o ./samples --clap

# Skip overlap separation (faster, but simultaneous hits stay merged)
python drum_extractor.py song.wav -o ./samples --no-separate

# Skip synthesis (only export real samples, no averaging)
python drum_extractor.py song.wav -o ./samples --no-synthesize

# Tune detection sensitivity
python drum_extractor.py song.wav -o ./samples \
    --min-hit-dur 0.05 \
    --max-hit-dur 1.0 \
    --energy-threshold -35
```

## Output Structure

```
output_dir/
├── drums_stem.wav                  # Isolated drum track from Demucs
├── all_hits/                       # Every detected hit (intermediate)
│   ├── hit_0000_kick_0.500s.wav
│   ├── hit_0001_snare_1.000s.wav
│   └── ...
├── samples/                        # Best representative per cluster
│   ├── kick_0__best.wav
│   ├── snare_0__best.wav
│   ├── hihat_closed_0__best.wav
│   └── ...
├── synthesized/                    # Synthesized "ideal" samples
│   ├── kick_0__synthesized.wav
│   └── ...
└── manifest.json                   # Full metadata
```

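The names under `all_hits/` appear to encode a running index, the spectral label, and the onset time in seconds. The sketch below parses that layout; the pattern is inferred only from the two example names above, and `HIT_NAME`, `HitInfo`, and `parse_hit_name` are hypothetical helpers, not part of `drum_extractor.py`.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred from the example names in the tree above:
#   hit_<index>_<label>_<onset seconds>s.wav
HIT_NAME = re.compile(r"^hit_(\d+)_(.+)_(\d+(?:\.\d+)?)s\.wav$")


class HitInfo(NamedTuple):
    index: int
    label: str
    onset_s: float


def parse_hit_name(filename: str) -> Optional[HitInfo]:
    """Return the parsed fields, or None if the name does not match."""
    m = HIT_NAME.match(filename)
    if m is None:
        return None
    return HitInfo(int(m.group(1)), m.group(2), float(m.group(3)))
```

Labels that contain underscores (for example `hihat_closed`) still parse, because the onset time is always taken from the last `_<number>s.wav` suffix.
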
## How Each Stage Works

### Stage 1: Drum Stem Extraction
Uses [HTDemucs v4 fine-tuned](https://github.com/facebookresearch/demucs) (`htdemucs_ft`), a state-of-the-art music source separation model that scores **8.4 dB SDR** on drums (MUSDB18-HQ). Falls back to `htdemucs` if the fine-tuned variant is unavailable.

### Stage 2: Onset Detection
Multi-band onset detection using librosa:
- **Low band** (20–250 Hz): catches kicks
- **Mid band** (250–4000 Hz): catches snares and toms
- **High band** (above 4000 Hz): catches cymbals and hi-hats

Each band is normalized independently, then combined via element-wise max. Backtracking snaps onsets to the true attack start.

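The combination step can be illustrated without any audio library. This is a minimal sketch that assumes each band already has a non-negative onset-strength envelope with one value per frame; `normalize` and `combine_bands` are illustrative names, not functions from `drum_extractor.py` or librosa.

```python
from typing import List, Sequence


def normalize(envelope: Sequence[float]) -> List[float]:
    """Scale an onset envelope so its peak is 1.0 (an all-zero band stays zero)."""
    peak = max(envelope, default=0.0)
    if peak <= 0.0:
        return [0.0 for _ in envelope]
    return [v / peak for v in envelope]


def combine_bands(*envelopes: Sequence[float]) -> List[float]:
    """Normalize each band independently, then take the per-frame maximum."""
    normalized = [normalize(env) for env in envelopes]
    return [max(frame) for frame in zip(*normalized)]


# Toy envelopes: the low band is far louder than the high band in absolute terms.
low = [0.0, 8.0, 2.0, 0.0]
mid = [0.0, 0.5, 0.0, 1.0]
high = [0.1, 0.0, 0.2, 0.0]
combined = combine_bands(low, mid, high)
```

Because every band is rescaled to its own peak before the maximum is taken, the quiet hi-hat activity in frames 0 and 2 still shows up in `combined` instead of being masked by the much louder kick.
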
### Stage 3: Spectral Classification & Overlap Separation
Each hit is classified by a spectral decision tree:
- **Kick**: >50% low-band energy, centroid < 800 Hz
- **Snare**: >40% mid-band energy, high ZCR, centroid > 1000 Hz
- **Hi-hat (closed/open)**: >35% high-band energy, centroid > 4000 Hz
- **Cymbal/Tom/Percussion**: remaining combinations

When two bands both carry >15% of peak energy, the hit is split into separate sub-hits (one per band).

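The rules above can be sketched as plain threshold checks. Several details are assumptions made for illustration: the order in which rules are tested, the `ZCR_HIGH` cutoff (the text only says "high ZCR"), the single `"other"` label for the remaining combinations, and the reading of "15% of peak energy" as relative to the loudest band.

```python
from typing import Dict, List

ZCR_HIGH = 0.1  # assumed cutoff; the README only says "high ZCR"


def classify_hit(low: float, mid: float, high: float,
                 centroid_hz: float, zcr: float) -> str:
    """Label a hit from band-energy fractions (0..1), spectral centroid, and ZCR."""
    if low > 0.50 and centroid_hz < 800:
        return "kick"
    if mid > 0.40 and zcr > ZCR_HIGH and centroid_hz > 1000:
        return "snare"
    if high > 0.35 and centroid_hz > 4000:
        return "hihat"
    return "other"  # cymbal/tom/percussion rules are not spelled out in the text


def overlapping_bands(band_peaks: Dict[str, float], ratio: float = 0.15) -> List[str]:
    """Return the bands whose peak energy exceeds `ratio` of the loudest band."""
    loudest = max(band_peaks.values())
    if loudest <= 0:
        return []
    return [name for name, energy in band_peaks.items() if energy > ratio * loudest]
```

Under this reading, a hit would be split into sub-hits whenever `overlapping_bands` returns two or more band names.
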
### Stage 4: Embedding & Clustering
**Default (librosa, 58-dim)**: MFCCs (mean+std), spectral centroid/bandwidth/rolloff/contrast/flatness, ZCR, RMS, onset envelope shape, duration. Z-score normalized.

**Optional (CLAP, 512-dim)**: `laion/larger_clap_general`, which provides semantic audio embeddings via Contrastive Language-Audio Pretraining. Better at distinguishing subtly different drum types, but slower.

Clustering is hierarchical: first group by rough spectral label, then sub-cluster within each group using KMeans with auto-K selection via silhouette score.

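The sub-clustering step within one label group might look like the following sketch, which uses scikit-learn's `KMeans`, `silhouette_score`, and `StandardScaler`. The function name `auto_kmeans`, the `k_max` limit, and the single-cluster fallback for tiny groups are assumptions, not the script's actual behavior.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def auto_kmeans(features: np.ndarray, k_max: int = 8, seed: int = 0):
    """Pick K in [2, k_max] by silhouette score; return (best_k, labels)."""
    X = StandardScaler().fit_transform(features)  # z-score each feature column
    n = len(X)
    best_k, best_score, best_labels = 1, -1.0, np.zeros(n, dtype=int)
    # silhouette_score needs between 2 and n - 1 clusters, so tiny groups
    # fall through the loop and stay in a single cluster.
    for k in range(2, min(k_max, n - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```

Calling `auto_kmeans(kick_features)` on a hypothetical `(n_hits, 58)` array would return the chosen number of kick variants and one cluster label per hit.
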
### Stage 5: Best Representative Selection
Each cluster's "best" hit is selected by a weighted score:
- **60% representativeness**: closest to cluster centroid in MFCC space
- **40% energy**: higher RMS suggests a cleaner transient with less bleed

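One way to turn the two criteria into a single score is sketched below. The 60/40 weights come from the text; how distance and RMS are rescaled to the 0..1 range is an assumption, and `pick_best` is a hypothetical helper name.

```python
import math
from typing import Sequence


def pick_best(embeddings: Sequence[Sequence[float]], rms: Sequence[float],
              w_centroid: float = 0.6, w_energy: float = 0.4) -> int:
    """Return the index of the highest-scoring hit within one cluster."""
    n, dim = len(embeddings), len(embeddings[0])
    centroid = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    dists = [math.dist(e, centroid) for e in embeddings]
    max_d, max_rms = max(dists), max(rms)
    scores = []
    for d, r in zip(dists, rms):
        # Assumed scaling: 1.0 at the centroid, 0.0 for the farthest member.
        proximity = 1.0 - d / max_d if max_d > 0 else 1.0
        # Assumed scaling: 1.0 for the loudest member of the cluster.
        energy = r / max_rms if max_rms > 0 else 0.0
        scores.append(w_centroid * proximity + w_energy * energy)
    return max(range(n), key=scores.__getitem__)
```

With these scalings, a very loud outlier cannot win on energy alone, since it forfeits the full 60% proximity share.
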
### Stage 6: Synthesis (Optional)
Creates an "ideal" sample by peak-aligned weighted averaging:
1. Align all cluster members to their peak transient
2. Normalize amplitudes
3. Weighted average (best hit gets 2× weight)

Averaging reduces random noise and bleed while preserving the shared transient character.

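The averaging steps can be sketched on plain lists of samples. This is an illustration under stated assumptions: the peak is taken as the largest absolute sample, hits are peak-normalized, and shorter hits are zero-padded. `peak_index` and `synthesize` are hypothetical names.

```python
from typing import List, Sequence


def peak_index(x: Sequence[float]) -> int:
    """Index of the largest absolute sample (taken here as the transient peak)."""
    return max(range(len(x)), key=lambda i: abs(x[i]))


def synthesize(hits: Sequence[Sequence[float]], best: int,
               best_weight: float = 2.0) -> List[float]:
    """Peak-align, peak-normalize, and weight-average the hits of one cluster."""
    peaks = [peak_index(h) for h in hits]
    pre = max(peaks)                                      # samples before the peak
    post = max(len(h) - p for h, p in zip(hits, peaks))   # samples from the peak on
    total = [0.0] * (pre + post)
    weight_sum = 0.0
    for i, (h, p) in enumerate(zip(hits, peaks)):
        amp = abs(h[p])
        if amp == 0.0:
            continue  # skip silent hits
        w = best_weight if i == best else 1.0
        offset = pre - p  # shift so every peak lands on index `pre`
        for j, v in enumerate(h):
            total[offset + j] += w * v / amp
        weight_sum += w
    return [v / weight_sum for v in total] if weight_sum else total
```

Because shorter hits are zero-padded, the tail of the result is attenuated wherever only some members contribute; a real implementation might instead divide by the per-sample weight actually present.
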
## Upgrade Paths

For higher quality on specific stages, these drop-in replacements are available:

| Stage | Current | Upgrade | Benefit |
|-------|---------|---------|---------|
| 3 (overlap separation) | Spectral bands | [AudioSep](https://huggingface.co/spaces/Audio-AGI/AudioSep) | Text-queried separation ("kick drum"), 10.5 dB SDRi |
| 3 (overlap separation) | Spectral bands | [SAM Audio](https://huggingface.co/facebook/sam-audio-large) | Diffusion-based + temporal span prompts (gated, Meta license) |
| 4 (clustering) | librosa features | CLAP embeddings (`--clap`) | Semantic similarity, better cross-genre generalization |
| 2 (onset detection) | librosa | [madmom](https://github.com/CPJKU/madmom) RNNOnsetProcessor | 0.89 F1 on ENST-drums (needs Python ≤3.10) |

## Requirements

```
demucs>=4.0
librosa>=0.10
soundfile
scikit-learn
numpy
torch
transformers  # only needed with --clap
```

## License

MIT