DroneClassifier — Real-Time Drone Audio Detection

A lightweight CRNN (Convolutional Recurrent Neural Network) for binary drone audio detection. Runs in real time on a Raspberry Pi CM4 and produces a detection decision every 500 ms.

Trained on geronimobasso/drone-audio-detection-samples — 180 320 clips, 16 kHz mono.

Model Variants

Three checkpoint variants are included:

File	Augmentation	Recommended use
`drone_classifier_aug_mixed.pt`	Brown noise + real ESC-50 wind (50 / 50)	General purpose (recommended)
`drone_classifier_aug_pw10.pt`	Brown noise only	Best at extreme real-wind SNR (−5 dB)
`drone_classifier_baseline.pt`	None	Clean-audio reference baseline

Performance

Clean test set (27 050 held-out samples)

Variant	Accuracy	Precision	Recall	F1	ROC-AUC
aug_mixed (recommended)	0.9993	0.9998	0.9995	0.9996	1.000
aug_pw10	0.9989	0.9996	0.9992	0.9994	0.9999
baseline	0.9980	1.0000	0.9978	0.9989	0.9997
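
As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall columns. The helper name `f1_score` is illustrative and not part of the repository.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the clean test set table
reported = {
    "aug_mixed": (0.9998, 0.9995),
    "aug_pw10": (0.9996, 0.9992),
    "baseline": (1.0000, 0.9978),
}

f1_by_variant = {name: f1_score(p, r) for name, (p, r) in reported.items()}
for name, f1 in f1_by_variant.items():
    print(f"{name}: F1 = {f1:.4f}")
```

The recomputed values agree with the F1 column to within rounding of the published four-decimal figures.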

Noise robustness of aug_mixed on real ESC-50 wind clips (test clips never seen during training)

SNR	Recall	F1	Notes
+20 dB	1.000	1.000
+10 dB	0.996	0.998
+5 dB	0.996	0.998
0 dB	0.996	0.998
−5 dB	0.860	0.925	wind 1.8× louder than drone

The baseline model collapses at 0 dB SNR (F1 = 0.353); aug_mixed stays at F1 ≥ 0.998 down to 0 dB.
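
The "wind 1.8× louder than drone" note follows from the decibel definition: an SNR of −5 dB corresponds to a noise-to-signal RMS amplitude ratio of 10^(5/20) ≈ 1.78. A minimal sketch of that arithmetic, and of the gain needed to mix a noise clip at a target SNR, follows. The function names are hypothetical, and the repository's own evaluation code may scale noise differently.

```python
import math


def noise_to_signal_ratio(snr_db: float) -> float:
    """RMS(noise) / RMS(signal) implied by an SNR given in dB."""
    return 10 ** (-snr_db / 20)


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so the mix has the requested SNR, then add it to `signal`."""
    noise_rms = rms(noise)
    if noise_rms == 0:
        raise ValueError("noise clip is silent; cannot scale it to a target SNR")
    gain = rms(signal) * noise_to_signal_ratio(snr_db) / noise_rms
    return [s + gain * n for s, n in zip(signal, noise)]


print(f"{noise_to_signal_ratio(-5):.2f}x")  # amplitude ratio at -5 dB
```

At 0 dB the noise and the drone have equal RMS, and at +20 dB the noise is ten times quieter in amplitude.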

Evaluation images are in the eval/ folder of this repo.

Usage

import torch
import torchaudio
import torchaudio.transforms as T
from huggingface_hub import hf_hub_download

# ── 1. Download weights ────────────────────────────────────────────────────
ckpt_path = hf_hub_download(
    repo_id="AntoineNaccache/drone-audio-detector",
    filename="drone_classifier_aug_mixed.pt",
)

# ── 2. Define or import the model ─────────────────────────────────────────
# Option A: download model.py from the repo and place it next to your script,
#           then:  from model import DroneClassifier, load_classifier
#
# Option B: inline definition (copy from model.py in this repo)
import torch.nn as nn

def _conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False), nn.BatchNorm2d(out_ch), nn.ReLU(True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False), nn.BatchNorm2d(out_ch), nn.ReLU(True),
    )

class DroneClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1  = _conv_block(1, 32);   self.pool1 = nn.MaxPool2d(2, 2)
        self.enc2  = _conv_block(32, 64);  self.pool2 = nn.MaxPool2d(2, 2)
        self.enc3  = _conv_block(64, 128); self.pool3 = nn.MaxPool2d((2, 1), (2, 1))
        self.gru   = nn.GRU(1024, 128, num_layers=2, batch_first=True,
                            bidirectional=True, dropout=0.2)
        self.head  = nn.Sequential(nn.Linear(256, 64), nn.ReLU(True),
                                   nn.Dropout(0.3), nn.Linear(64, 1))
    def forward(self, x):
        x = self.pool1(self.enc1(x))
        x = self.pool2(self.enc2(x))
        x = self.pool3(self.enc3(x))
        b, c, f, t = x.shape  # lowercase names avoid shadowing the torchaudio `T` alias
        x, _ = self.gru(x.permute(0, 3, 1, 2).reshape(b, t, c * f))
        return self.head(x.mean(1))

# ── 3. Load checkpoint ─────────────────────────────────────────────────────
model = DroneClassifier()
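# weights_only=False runs the full unpickler; only use it with checkpoints you trust.
state = torch.load(ckpt_path, map_location="cpu", weights_only=False)
if "model_state_dict" in state:
    state = state["model_state_dict"]
# strict=False silently skips mismatched keys, so report them rather than ignore them.
result = model.load_state_dict(state, strict=False)
if result.missing_keys or result.unexpected_keys:
    print("State-dict key mismatch:", result.missing_keys, result.unexpected_keys)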
model.eval()

# ── 4. Prepare a 1-second audio chunk (16 kHz mono) ───────────────────────
waveform, sr = torchaudio.load("drone.wav")
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)
waveform = waveform.mean(0, keepdim=True)[:, :16_000]   # mono, 1 s

mel_transform = T.MelSpectrogram(
    sample_rate=16_000, n_fft=512, hop_length=160,
    n_mels=64, f_min=50, f_max=5_500,
)
log_mel = (T.AmplitudeToDB()(mel_transform(waveform)) + 40) / 40  # ≈ [-1, 1]
x = log_mel.unsqueeze(0)   # (1, 1, 64, T)

# ── 5. Infer ───────────────────────────────────────────────────────────────
with torch.no_grad():
    prob = torch.sigmoid(model(x)).item()

print(f"Drone probability: {prob:.3f}")
print("DRONE DETECTED" if prob >= 0.5 else "No drone")

For streaming (live microphone → GPIO trigger) and full inference pipelines, see the source repository.
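
The repository's streaming code is not reproduced here. As a rough sketch of the stated cadence (a 1-second window scored every 500 ms), the generator below keeps the most recent second of audio in a ring buffer and emits one decision per 8 000 new samples. `score_fn` is a hypothetical callable standing in for steps 4 and 5 above, and GPIO triggering is omitted.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List, Tuple

SAMPLE_RATE = 16_000
WINDOW = SAMPLE_RATE        # 1 s analysis window
HOP = SAMPLE_RATE // 2      # new decision every 500 ms (50% overlap)


def stream_decisions(
    blocks: Iterable[List[float]],
    score_fn: Callable[[List[float]], float],
    threshold: float = 0.5,
) -> Iterator[Tuple[float, bool]]:
    """Yield (probability, is_drone) once per HOP samples after the buffer fills."""
    buffer = deque(maxlen=WINDOW)   # oldest samples fall out automatically
    since_last = 0
    for block in blocks:
        # Per-sample loop for clarity; a real pipeline would operate on arrays.
        for sample in block:
            buffer.append(sample)
            since_last += 1
            if len(buffer) == WINDOW and since_last >= HOP:
                since_last = 0
                prob = score_fn(list(buffer))
                yield prob, prob >= threshold
```

With 2.5 seconds of audio delivered in arbitrary block sizes, this yields decisions at 1.0 s, 1.5 s, 2.0 s and 2.5 s.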

Architecture

Input: log-mel spectrogram (B, 1, 64, T)   T ≈ 101 frames for 1 second
  │
SharedEncoder
  ├─ ConvBlock(1→32)   MaxPool(2×2)  →  (B, 32,  32, T/2)
  ├─ ConvBlock(32→64)  MaxPool(2×2)  →  (B, 64,  16, T/4)
  ├─ ConvBlock(64→128) MaxPool(2×1)  →  (B, 128,  8, T/4)
  └─ reshape → (B, T/4, 1024)
       BiGRU(1024 → 256, 2 layers, hidden 128 per direction, dropout=0.2)
       → (B, T/4, 256)
  │
ClassifierHead
  GlobalAvgPool(time) → (B, 256)
  FC(256→64) → ReLU → Dropout(0.3) → FC(64→1)  →  logit
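
The shape bookkeeping can be checked without PyTorch. The sketch below assumes what the inline model definition shows: 3×3 convolutions with padding 1 (which preserve frequency and time sizes) and `MaxPool2d` in its default floor mode. The function name `trace_shapes` is illustrative.

```python
def trace_shapes(n_mels: int = 64, n_frames: int = 101):
    """Return (stage, channels, freq_bins, frames) after each encoder stage."""
    stages = [
        # (name, output channels, pool factor on frequency, pool factor on time)
        ("enc1 + pool1", 32, 2, 2),
        ("enc2 + pool2", 64, 2, 2),
        ("enc3 + pool3", 128, 2, 1),
    ]
    freq, frames = n_mels, n_frames
    shapes = []
    for name, channels, pool_f, pool_t in stages:
        # The conv layers keep (freq, frames); pooling floors the division.
        freq, frames = freq // pool_f, frames // pool_t
        shapes.append((name, channels, freq, frames))
    return shapes


shapes = trace_shapes()
_, channels, freq, frames = shapes[-1]
gru_input_size = channels * freq   # features per time step fed to the GRU

for row in shapes:
    print(row)
print("GRU input size:", gru_input_size, "time steps:", frames)
```

Only the first two pooling layers reduce the time axis, so a 101-frame input reaches the GRU as 25 time steps of 1024 features each.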

Component	Parameters	Share
SharedEncoder — CNN	286 880	19.3%
SharedEncoder — BiGRU	1 182 720	79.6%
ClassifierHead	16 513	1.1%
Total	1 486 113

Checkpoint size: ~5.94 MB (FP32). Quantised INT8 ONNX: ~1.49 MB.
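
The parameter table can be reproduced by hand from the layer definitions. This is a sketch of the counting rules for bias-free 3×3 convolutions, `BatchNorm2d`, a bidirectional GRU and the linear head; the helper names are illustrative. BatchNorm running statistics are buffers and are not counted as parameters.

```python
def conv_block_params(in_ch: int, out_ch: int) -> int:
    conv1 = in_ch * out_ch * 3 * 3          # bias=False
    conv2 = out_ch * out_ch * 3 * 3         # bias=False
    batchnorm = 2 * (2 * out_ch)            # weight + bias, two BatchNorm2d layers
    return conv1 + conv2 + batchnorm


def gru_params(input_size: int, hidden: int, num_layers: int, bidirectional: bool = True) -> int:
    directions = 2 if bidirectional else 1
    total, layer_in = 0, input_size
    for _ in range(num_layers):
        # 3 gates: input-hidden and hidden-hidden weights, plus two bias vectors
        per_direction = 3 * hidden * (layer_in + hidden) + 2 * 3 * hidden
        total += directions * per_direction
        layer_in = hidden * directions      # next layer sees both directions
    return total


def linear_params(in_features: int, out_features: int) -> int:
    return in_features * out_features + out_features


cnn = conv_block_params(1, 32) + conv_block_params(32, 64) + conv_block_params(64, 128)
gru = gru_params(1024, 128, num_layers=2)
head = linear_params(256, 64) + linear_params(64, 1)
total = cnn + gru + head

fp32_mb = total * 4 / 1e6   # 4 bytes per FP32 weight, MB = 10^6 bytes
print(cnn, gru, head, total, f"{fp32_mb:.2f} MB")
```

The FP32 figure counts weights only; an actual checkpoint file carries some serialisation overhead and the BatchNorm buffers on top.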

Audio Front-End

Parameter	Value
Sample rate	16 kHz
FFT window	Hann, 512 samples (32 ms)
Hop length	160 samples (10 ms)
Mel bins	64
Frequency range	50 – 5 500 Hz
Normalisation	`(AmplitudeToDB + 40) / 40`
Chunk duration	1 second
Detection cadence	500 ms (50% overlap)
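
The derived quantities in this table follow from simple arithmetic. The sketch below assumes torchaudio's default `center=True` framing, under which a signal of N samples yields 1 + N // hop frames. The helper names are illustrative.

```python
SAMPLE_RATE = 16_000
N_FFT = 512
HOP_LENGTH = 160


def samples_to_ms(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    return 1000 * num_samples / sample_rate


def num_frames(num_samples: int, hop: int = HOP_LENGTH, n_fft: int = N_FFT, center: bool = True) -> int:
    """Spectrogram frame count; center=True pads n_fft // 2 samples on each side."""
    if center:
        return 1 + num_samples // hop
    return 1 + (num_samples - n_fft) // hop


def normalise_db(db: float) -> float:
    """The card's normalisation: maps -80 dB to -1, -40 dB to 0 and 0 dB to +1."""
    return (db + 40) / 40


print(samples_to_ms(N_FFT), "ms window")
print(samples_to_ms(HOP_LENGTH), "ms hop")
print(num_frames(SAMPLE_RATE), "frames per 1 s chunk")
```

Values above 0 dB map above +1, so the normalised range is only approximately [-1, 1].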

Training Details

Dataset: 180 320 clips, 16 kHz mono, stratified 70 / 15 / 15 split
Loss: BCEWithLogitsLoss(pos_weight=0.102), which down-weights the drone (positive) class to offset the roughly 10:1 drone-heavy imbalance
Optimiser: Adam, lr = 1e-3, weight decay = 1e-4, 30 epochs
Schedule: CosineAnnealingLR
Augmentation (aug_mixed): Brown noise (1/f²) or real ESC-50 wind clips, random SNR 0 – 20 dB, p = 0.5
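
Two of the settings above reduce to short formulas. The sketch below shows, in plain Python, how a `pos_weight` near 0.1 arises from a negatives-to-positives ratio, how it enters the binary cross-entropy, and what a cosine schedule does to the learning rate. The class counts are illustrative only, and `t_max=30` with `eta_min=0` are assumptions (the card does not state them; 0 is PyTorch's default `eta_min`).

```python
import math


def pos_weight_from_counts(n_positive: int, n_negative: int) -> float:
    """Common heuristic: weight positives by the negatives-to-positives ratio."""
    return n_negative / n_positive


def weighted_bce_with_logits(logit: float, target: float, pos_weight: float) -> float:
    """Per-sample loss: -[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]."""
    # Numerically stable log(sigmoid(x))
    if logit >= 0:
        log_sig = -math.log1p(math.exp(-logit))
    else:
        log_sig = logit - math.log1p(math.exp(logit))
    log_one_minus_sig = log_sig - logit   # log(1 - sigmoid(x)) = log(sigmoid(x)) - x
    return -(pos_weight * target * log_sig + (1 - target) * log_one_minus_sig)


def cosine_annealing_lr(epoch: int, base_lr: float = 1e-3, t_max: int = 30, eta_min: float = 0.0) -> float:
    """Closed-form cosine annealing without restarts."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


# Illustrative counts only (not the dataset's actual class totals)
print(round(pos_weight_from_counts(n_positive=980, n_negative=100), 3))
print([round(cosine_annealing_lr(e), 6) for e in (0, 15, 30)])
```

With a weight of 0.102, a misclassified drone clip contributes about a tenth of the loss of a misclassified non-drone clip, which keeps the majority class from dominating the gradient.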

Limitations

Evaluated only on the geronimobasso/drone-audio-detection-samples distribution
Not validated against helicopters, fixed-wing aircraft, or other rotary-wing craft
Recall drops to ~86 – 89% at −5 dB SNR, where wind is about 1.8× louder than the drone in amplitude
Designed for 16 kHz microphone input; other sample rates require resampling

Citation

@misc{naccache2025droneclassifier,
  author    = {Antoine Naccache},
  title     = {DroneClassifier: Real-Time Drone Audio Detection with CRNN},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/AntoineNaccache/drone-audio-detector}
}
