A lightweight CRNN (Convolutional Recurrent Neural Network) for binary drone audio detection. Runs in real time on a Raspberry Pi CM4 and produces a detection decision every 500 ms.
Trained on geronimobasso/drone-audio-detection-samples (180 320 clips, 16 kHz mono).
Three checkpoint variants are included:
| File | Augmentation | Recommended use |
|---|---|---|
| `drone_classifier_aug_mixed.pt` | Brown noise + real ESC-50 wind (50 / 50) | General purpose (recommended) |
| `drone_classifier_aug_pw10.pt` | Brown noise only | Best at extreme real-wind SNR (−5 dB) |
| `drone_classifier_baseline.pt` | None | Clean-audio reference baseline |
| Variant | Accuracy | Precision | Recall | F1 | ROC-AUC |
|---|---|---|---|---|---|
| aug_mixed (recommended) | 0.9993 | 0.9998 | 0.9995 | 0.9996 | 1.000 |
| aug_pw10 | 0.9989 | 0.9996 | 0.9992 | 0.9994 | 0.9999 |
| baseline | 0.9980 | 1.0000 | 0.9978 | 0.9989 | 0.9997 |
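F1 is the harmonic mean of precision and recall, so the F1 column can be re-derived from the two columns before it. The following is a minimal sketch of that check; the helper name `f1_score` is illustrative and does not come from the repo.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above
reported = {
    "aug_mixed": (0.9998, 0.9995),
    "aug_pw10": (0.9996, 0.9992),
    "baseline": (1.0000, 0.9978),
}

f1_by_variant = {
    name: round(f1_score(p, r), 4) for name, (p, r) in reported.items()
}
print(f1_by_variant)
```

Rounded to four decimals, the recomputed values agree with the F1 column.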
| SNR | Recall | F1 | Notes |
|---|---|---|---|
| +20 dB | 1.000 | 1.000 | |
| +10 dB | 0.996 | 0.998 | |
| +5 dB | 0.996 | 0.998 | |
| 0 dB | 0.996 | 0.998 | |
| −5 dB | 0.860 | 0.925 | wind 1.8× louder than drone |
The baseline model collapses at 0 dB SNR (F1 = 0.353); aug_mixed stays at F1 ≥ 0.998 down to 0 dB.
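The evaluation script is not shown in this card, so the snippet below is only a sketch of one common way to mix noise into a signal at a target SNR, assuming NumPy. The function `mix_at_snr` and the synthetic tone and white noise are stand-ins, not the repo's code or data. It also shows where the "1.8× louder" note comes from: at −5 dB the noise RMS amplitude is 10^(5/20), about 1.78 times the signal's.

```python
import numpy as np


def mix_at_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that 10*log10(P_signal / P_noise) == snr_db, then add it."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_signal / (10 ** (snr_db / 10))
    gain = np.sqrt(target_p_noise / p_noise)
    return signal + gain * noise


rng = np.random.default_rng(0)
t = np.arange(16_000) / 16_000
drone = 0.1 * np.sin(2 * np.pi * 200 * t)  # synthetic stand-in for a drone tone
wind = rng.standard_normal(16_000)         # white-noise stand-in for wind

mixed = mix_at_snr(drone, wind, snr_db=-5.0)
added_noise = mixed - drone
amplitude_ratio = np.sqrt(np.mean(added_noise ** 2) / np.mean(drone ** 2))
print(f"noise/signal RMS ratio at -5 dB: {amplitude_ratio:.2f}")
```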
Evaluation images are in the eval/ folder of this repo.
```python
import torch
import torch.nn as nn
import torchaudio
import torchaudio.transforms as T
from huggingface_hub import hf_hub_download

# ── 1. Download weights ──
ckpt_path = hf_hub_download(
    repo_id="AntoineNaccache/drone-audio-detector",
    filename="drone_classifier_aug_mixed.pt",
)

# ── 2. Define or import the model ──
# Option A: download model.py from the repo and place it next to your script,
#           then: from model import DroneClassifier, load_classifier
#
# Option B: inline definition (copy from model.py in this repo)
def _conv_block(in_ch, out_ch):
    # Two 3x3 conv + BatchNorm + ReLU stages; padding=1 keeps the spatial size
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False), nn.BatchNorm2d(out_ch), nn.ReLU(True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False), nn.BatchNorm2d(out_ch), nn.ReLU(True),
    )


class DroneClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = _conv_block(1, 32)
        self.pool1 = nn.MaxPool2d(2, 2)
        self.enc2 = _conv_block(32, 64)
        self.pool2 = nn.MaxPool2d(2, 2)
        self.enc3 = _conv_block(64, 128)
        self.pool3 = nn.MaxPool2d((2, 1), (2, 1))  # pools frequency only
        # GRU input size 1024 = 128 channels x 8 mel bins left after pooling
        self.gru = nn.GRU(1024, 128, num_layers=2, batch_first=True,
                          bidirectional=True, dropout=0.2)
        self.head = nn.Sequential(nn.Linear(256, 64), nn.ReLU(True),
                                  nn.Dropout(0.3), nn.Linear(64, 1))

    def forward(self, x):
        x = self.pool1(self.enc1(x))
        x = self.pool2(self.enc2(x))
        x = self.pool3(self.enc3(x))
        b, c, f, t = x.shape  # lowercase so `t` is not confused with the `T` alias
        # (B, C, F, T) -> (B, T, C*F): one feature vector per time step
        x, _ = self.gru(x.permute(0, 3, 1, 2).reshape(b, t, c * f))
        return self.head(x.mean(1))  # average over time, then classify


# ── 3. Load checkpoint ──
model = DroneClassifier()
# weights_only=False unpickles arbitrary objects: only load checkpoints you trust
state = torch.load(ckpt_path, map_location="cpu", weights_only=False)
if "model_state_dict" in state:
    state = state["model_state_dict"]
# strict=False ignores missing or unexpected keys in the state dict
model.load_state_dict(state, strict=False)
model.eval()

# ── 4. Prepare a 1-second audio chunk (16 kHz mono) ──
waveform, sr = torchaudio.load("drone.wav")
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)
waveform = waveform.mean(0, keepdim=True)[:, :16_000]  # mono, first 1 s

mel_transform = T.MelSpectrogram(
    sample_rate=16_000, n_fft=512, hop_length=160,
    n_mels=64, f_min=50, f_max=5_500,
)
log_mel = (T.AmplitudeToDB()(mel_transform(waveform)) + 40) / 40  # roughly in [-1, 1]
x = log_mel.unsqueeze(0)  # (1, 1, 64, T)

# ── 5. Infer ──
with torch.no_grad():
    prob = torch.sigmoid(model(x)).item()

print(f"Drone probability: {prob:.3f}")
print("DRONE DETECTED" if prob >= 0.5 else "No drone")
```
For streaming (live microphone → GPIO trigger) and full inference pipelines, see the source repository.
```
Input: log-mel spectrogram (B, 1, 64, T)      T ≈ 101 frames for 1 second
        │
SharedEncoder
  ├─ ConvBlock(1→32)    MaxPool(2×2)  → (B, 32, 32, T/2)
  ├─ ConvBlock(32→64)   MaxPool(2×2)  → (B, 64, 16, T/4)
  ├─ ConvBlock(64→128)  MaxPool(2×1)  → (B, 128, 8, T/4)
  ├─ reshape                          → (B, T/4, 1024)
  BiGRU(1024 → 256, 2 layers, bidirectional, dropout=0.2)
                                      → (B, T/4, 256)
        │
ClassifierHead
  GlobalAvgPool(time)                 → (B, 256)
  FC(256→64) → ReLU → Dropout(0.3) → FC(64→1) → logit
```
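The tensor shapes can be traced by hand from the pooling sizes used in the `DroneClassifier` code above. The sketch below does that arithmetic in plain Python; `trace_shapes` is a hypothetical helper, and it assumes PyTorch's default floor rounding in `MaxPool2d`.

```python
def trace_shapes(n_mels: int = 64, n_frames: int = 101):
    """Follow (channels, freq, time) through the three conv + pool stages."""
    # (stage name, output channels, pool size along freq, pool size along time)
    stages = [
        ("enc1 + pool1", 32, 2, 2),
        ("enc2 + pool2", 64, 2, 2),
        ("enc3 + pool3", 128, 2, 1),
    ]
    freq, time = n_mels, n_frames
    shapes = []
    for name, channels, pool_f, pool_t in stages:
        # MaxPool2d floors odd sizes by default (ceil_mode=False)
        freq, time = freq // pool_f, time // pool_t
        shapes.append((name, channels, freq, time))
    # The GRU sees one vector of channels * freq features per time step
    gru_input_size = shapes[-1][1] * shapes[-1][2]
    return shapes, gru_input_size, time


shapes, gru_input_size, seq_len = trace_shapes()
for name, channels, freq, time in shapes:
    print(f"{name}: (B, {channels}, {freq}, {time})")
print(f"GRU input: (B, {seq_len}, {gru_input_size})")
```

For a 1-second chunk of 101 frames, the time axis shrinks to 50 and then 25 steps, and the flattened feature size of 128 × 8 matches the GRU's input size of 1024.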
| Component | Parameters | Share |
|---|---|---|
| SharedEncoder – CNN | 286 880 | 19.3% |
| SharedEncoder – BiGRU | 1 182 720 | 79.6% |
| ClassifierHead | 16 513 | 1.1% |
| Total | 1 486 113 | 100% |
Checkpoint size: ~5.94 MB (FP32). Quantised INT8 ONNX: ~1.49 MB.
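The parameter counts in the table can be reproduced from the layer sizes in the model definition. This is a back-of-the-envelope sketch with hypothetical helper names; it counts learnable parameters only, so BatchNorm running statistics (stored as buffers in the checkpoint) are left out.

```python
def conv_block_params(in_ch: int, out_ch: int) -> int:
    # Two bias-free 3x3 convs, each followed by BatchNorm (weight + bias)
    conv1 = in_ch * out_ch * 9
    conv2 = out_ch * out_ch * 9
    batch_norms = 2 * (2 * out_ch)
    return conv1 + conv2 + batch_norms


def gru_params(input_size: int, hidden: int, num_layers: int,
               bidirectional: bool = True) -> int:
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        layer_in = input_size if layer == 0 else hidden * directions
        # 3 gates, each with input weights, recurrent weights and two bias vectors
        per_direction = 3 * hidden * (layer_in + hidden) + 2 * 3 * hidden
        total += directions * per_direction
    return total


def linear_params(in_features: int, out_features: int) -> int:
    return in_features * out_features + out_features


cnn = sum(conv_block_params(i, o) for i, o in [(1, 32), (32, 64), (64, 128)])
gru = gru_params(1024, 128, num_layers=2)
head = linear_params(256, 64) + linear_params(64, 1)
total = cnn + gru + head
fp32_megabytes = total * 4 / 1e6  # 4 bytes per FP32 parameter

print(cnn, gru, head, total, round(fp32_megabytes, 2))
```

The totals match the table, and four bytes per parameter gives roughly the 5.94 MB FP32 checkpoint size quoted above.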
| Parameter | Value |
|---|---|
| Sample rate | 16 kHz |
| FFT window | Hann, 512 samples (32 ms) |
| Hop length | 160 samples (10 ms) |
| Mel bins | 64 |
| Frequency range | 50 – 5 500 Hz |
| Normalisation | (AmplitudeToDB + 40) / 40 |
| Chunk duration | 1 second |
| Detection cadence | 500 ms (50% overlap) |
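The chunking implied by the last two rows can be sketched as a sliding window: 1-second windows advanced by 500 ms, each producing about 101 mel frames. The code below is an illustration in plain Python and not the repo's streaming pipeline; `iter_chunks` and `mel_frames` are hypothetical names, and the frame formula assumes torchaudio's default centre-padded STFT.

```python
SAMPLE_RATE = 16_000
CHUNK = SAMPLE_RATE       # 1 s window
HOP = SAMPLE_RATE // 2    # 500 ms step, i.e. 50% overlap


def iter_chunks(samples, chunk: int = CHUNK, hop: int = HOP):
    """Yield (start_index, window) for every full window in `samples`."""
    for start in range(0, len(samples) - chunk + 1, hop):
        yield start, samples[start:start + chunk]


def mel_frames(n_samples: int, hop_length: int = 160) -> int:
    # Frame count of an STFT with centre padding (center=True)
    return 1 + n_samples // hop_length


audio = [0.0] * (3 * SAMPLE_RATE)  # 3 s of silence as placeholder input
starts = [start for start, _ in iter_chunks(audio)]

print("window start times (s):", [s / SAMPLE_RATE for s in starts])
print("mel frames per window:", mel_frames(CHUNK))
```

Three seconds of audio yields five overlapping windows, so once the first second has been buffered a new decision is available every 500 ms.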
Training uses `BCEWithLogitsLoss(pos_weight=0.102)`, which compensates for the 10:1 drone-heavy imbalance of the geronimobasso/drone-audio-detection-samples distribution.

```bibtex
@misc{naccache2025droneclassifier,
  author    = {Antoine Naccache},
  title     = {DroneClassifier: Real-Time Drone Audio Detection with CRNN},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/AntoineNaccache/drone-audio-detector}
}
```