Coffee First Crack Detection
An Audio Spectrogram Transformer (AST) fine-tuned to detect first crack — the critical moment during coffee roasting when beans begin to pop — from raw audio.
Source code & training: github.com/syamaner/coffee-first-crack-detection
Original prototype series (the earlier end-to-end build that led to this rewrite):
Model Description
Fine-tuned from MIT/ast-finetuned-audioset-10-10-0.4593 with partial backbone freeze (~14M trainable / 72M frozen) for binary audio classification:
| Label | ID | Description |
|---|---|---|
no_first_crack |
0 | Background roast noise, no cracking |
first_crack |
1 | First crack popping/cracking sounds |
Feature extractor: ASTFeatureExtractor at 16 kHz mono, 128 mel bins, max_length=1024, mean=-4.2677, std=4.5689 (AudioSet calibration).
Intended Use
- Roasting automation: trigger events (reduce heat, start timer) when first crack is detected
- Roast logging: timestamp first crack onset for reproducibility
- MCP server integration: embedded in the
coffee-roastingroaster control system
Not intended for: other food processing sounds, non-coffee audio, commercial food safety systems.
How to Use
Python (transformers)
from transformers import ASTForAudioClassification, ASTFeatureExtractor
import torch, librosa
model = ASTForAudioClassification.from_pretrained("syamaner/coffee-first-crack-detection")
extractor = ASTFeatureExtractor.from_pretrained("syamaner/coffee-first-crack-detection")
model.eval()
audio, _ = librosa.load("roast.wav", sr=16000, mono=True)
inputs = extractor(audio.tolist(), sampling_rate=16000, return_tensors="pt")
with torch.inference_mode():
logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)
label = model.config.id2label[probs.argmax().item()]
print(f"{label}: {probs.max().item():.3f}")
Sliding window (long audio files)
from coffee_first_crack.inference import SlidingWindowInference
detector = SlidingWindowInference(
model_name_or_path="syamaner/coffee-first-crack-detection",
window_size=10.0,
overlap=0.7,
threshold=0.6,
min_pops=5,
)
events = detector.process_file("roast.wav")
for event in events:
print(f"First crack at {event.timestamp_str}")
Live microphone
from coffee_first_crack.inference import FirstCrackDetector
detector = FirstCrackDetector(
use_microphone=True,
model_name_or_path="syamaner/coffee-first-crack-detection",
)
detector.start()
# poll detector.is_first_crack() in your roast loop
ONNX (Raspberry Pi 5)
ONNX models are published on HuggingFace Hub under onnx/fp32/ and onnx/int8/:
- INT8 (recommended):
onnx/int8/model_quantized.onnx— 90MB, 2x faster, zero quality loss - FP32:
onnx/fp32/model.onnx— 345MB
import onnxruntime as rt
import numpy as np
from huggingface_hub import hf_hub_download
from transformers import ASTFeatureExtractor
import librosa
# Download INT8 model and preprocessor config from HF Hub
model_path = hf_hub_download(
repo_id="syamaner/coffee-first-crack-detection",
filename="onnx/int8/model_quantized.onnx",
)
extractor = ASTFeatureExtractor.from_pretrained(
"syamaner/coffee-first-crack-detection", subfolder="onnx/int8",
)
# Thread-limit for RPi5 (2 threads — leaves cores free for MCP server + UI)
sess_options = rt.SessionOptions()
sess_options.intra_op_num_threads = 2
sess_options.inter_op_num_threads = 1
sess = rt.InferenceSession(
model_path, sess_options=sess_options, providers=["CPUExecutionProvider"]
)
audio, _ = librosa.load("roast.wav", sr=16000, mono=True)
inputs = extractor([audio.tolist()], sampling_rate=16000, return_tensors="np")
logits = sess.run(None, {sess.get_inputs()[0].name: inputs["input_values"]})[0]
# Softmax for probabilities
exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
probs = exp_logits / exp_logits.sum(axis=-1, keepdims=True)
label_id = int(np.argmax(probs))
print(f"first_crack prob: {probs[0, 1]:.3f}")
Note: RPi5 requires adequate PSU (5V/5A recommended) and active cooling. Default is 2 ONNX threads to leave CPU headroom for MCP server and agent UI. See Hardware Requirements.
Training
| Parameter | Value |
|---|---|
| Base model | MIT/ast-finetuned-audioset-10-10-0.4593 |
| Freeze strategy | Last 2 transformer layers + layernorm unfrozen (~14M / 86M params) |
| Optimizer | AdamW |
| Learning rate | 5e-5 |
| Batch size | 8 |
| Epochs | 5 (early stop patience=3, best epoch 2) |
| Weight decay | 0.1 |
| Loss | Class-weighted CrossEntropyLoss |
| Augmentation | Random amplitude scaling (±30%) + Gaussian noise injection |
| Crop mode (train) | Random |
| Crop mode (eval) | Center |
Training hardware: Apple M3+ Mac (MPS). Dataset: 1,435 fixed 10s chunks from 21 recordings (9 legacy mic-1 + 6 amplified mic-2 + 3 mic-1-panama + 3 mic-2-panama).
Evaluation
baseline_v5 — 21 recordings (9 legacy + 6 amplified mic-2 + 3 mic-1-panama + 3 mic-2-panama), Apple M3 Mac (MPS).
| Metric | Test |
|---|---|
| Accuracy | 97.7% |
| F1 (macro) | 0.921 |
Precision (first_crack) |
87.2% |
Recall (first_crack) |
97.6% |
| ROC-AUC | 0.997 |
| Confusion matrix | 1 FN, 6 FP (303 test samples) |
Full dataset: 1,435 × 10s chunks (fixed sliding window), 922 / 210 / 303 train / val / test split (recording-level, no data leakage).
Full-file detection on test recordings (sliding window, threshold=0.6, min_pops=5):
| Recording | Mic | Ground Truth | Detected | Delta |
|---|---|---|---|---|
| mic1-panama-roast2 | mic-1 | 13:09 | 13:03 | -6s |
| mic2-brazil-roast3-amplified | mic-2 | 10:39 | 10:33 | -6s |
| mic2-panama-roast1 | mic-2 | 11:05 | 10:57 | -8s |
| roast-3-costarica-hermosa-hp-a | mic-1 | 07:19 | MISSED | — |
Limitations
- Dataset of 1,435 chunks from 21 roasts — generalisation to very different roasters/environments is uncertain
- Trained on Costa Rica Hermosa HP, Brazil, Brazil Santos, and Panama Hortigal Estate origins — other origins may vary
- Mic gain variation affects detection — older uncalibrated mic-2 recordings required amplification
- Microphone quality matters: model trained on two different microphones (FIFINE K669B condenser, Audio-Technica ATR2100x dynamic)
- No second crack detection — model is binary only
- Not validated in commercial roasting environments
- AST model (87M params) is too large for real-time (<500ms) inference on Raspberry Pi 5 — achieves ~2.07s per 10s window with INT8 quantization at 4 threads (with fan)
Hardware Requirements
| Platform | Inference | Latency (10s window) | Model Size | Notes |
|---|---|---|---|---|
| Apple M3+ Mac | PyTorch (MPS) | ~100ms | 345MB | Auto-detected device |
| Apple M3+ Mac | ONNX Runtime (CPU) | ~197ms (INT8) / ~375ms (FP32) | 90MB / 345MB | No GPU needed |
| NVIDIA RTX 4090 | PyTorch (CUDA) | ~30ms | 345MB | fp16/bf16, num_workers=4 |
| Raspberry Pi 5 (16GB) | ONNX Runtime (CPU) | ~2.45s (INT8, 2 threads) | 90MB | ⭐ Recommended Pi config |
Raspberry Pi 5 Notes
- Use
model_quantized.onnx(INT8, 90MB) — 2x faster than FP32 with zero quality loss - Recommended config: INT8, 2 threads, adequate PSU + active cooler → p50 = 2,452ms
- Why 2 threads: the Pi also runs an MCP server and agent UI — 2 ONNX threads leaves 2 cores free for those services
- Detection threshold: 0.90 (precision=0.952, recall=0.909, F1=0.930) — minimises false positives
- Power: adequate PSU (5V/5A recommended) required for multi-thread. Standard chargers (5V/3A) cause under-voltage crashes under load
- Cooling: active cooler recommended — sustained 2-thread load without fan reaches 77°C+ and triggers thermal throttling
- Threads: 2 threads with fan (2,452ms), 4 threads with fan (2,070ms), 1 thread on any PSU (4,441ms)
- Latency target: current AST model (87M params) does not meet the <500ms target on RPi5. Consider a lighter model for real-time edge use
- Install:
pip install -r requirements-pi.txtthenpip install torch --index-url https://download.pytorch.org/whl/cpu
Dataset
Training data: syamaner/coffee-first-crack-audio
10-second WAV chunks at 16 kHz mono, labelled first_crack / no_first_crack. Includes per-sample metadata: microphone, coffee origin, annotation source.
Citation
@misc{yamaner2025coffeefc,
author = {Yamaner, Sertan},
title = {Coffee First Crack Detection},
year = {2025},
url = {https://huggingface.co/syamaner/coffee-first-crack-detection}
}
- Downloads last month
- 253
Model tree for syamaner/coffee-first-crack-detection
Base model
MIT/ast-finetuned-audioset-10-10-0.4593Dataset used to train syamaner/coffee-first-crack-detection
Space using syamaner/coffee-first-crack-detection 1
Evaluation results
- Test Accuracy on Coffee First Crack Audiotest set self-reported0.977
- Test F1 (macro) on Coffee First Crack Audiotest set self-reported0.921
- Test Precision (first_crack) on Coffee First Crack Audiotest set self-reported0.872
- Test Recall (first_crack) on Coffee First Crack Audiotest set self-reported0.976
- Test ROC-AUC on Coffee First Crack Audiotest set self-reported0.997