---
title: MSD
emoji: 💻
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: let it work
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Near–Far Field Speech Analysis Tool (Gradio)

## Overview
This repository contains an interactive Gradio tool, hosted on Hugging Face Spaces, developed as part of the PRISM worklet
“Distant Acoustic Speech Investigations using Signal Processing and Neural Networks.”
The tool performs a frame-wise comparative analysis of near-field and far-field recordings of the same speech to study how microphone distance affects speech quality, spectral structure, and perceptually relevant features. The focus of this project is analysis and interpretation, not speech enhancement.
## Motivation
In real-world speech systems, microphones are often placed at a distance from the speaker. Compared to near-field recordings, far-field speech suffers from:
- Frequency-dependent attenuation
- Reverberation and temporal smearing
- Increased background noise
- Compression of cepstral and spectral features
Most speech recognition and enhancement models do not explicitly analyze how these effects appear at a frame level. This tool was built to visually and quantitatively understand distant speech degradation before attempting model-based solutions.
## What This Tool Does
The tool takes two recordings of the same speech content:
- A near-field (reference) audio file
- A far-field (target) audio file
It then performs the following steps:
- Temporal alignment of both signals
- Frame-wise segmentation with overlap
- Multi-domain feature extraction
- Frame-level similarity and degradation analysis
- Unsupervised clustering of acoustic features
- Feature–quality correlation analysis
- Interactive visualization and CSV export
## Signal Processing Pipeline

### 1. Signal Alignment
Near-field and far-field signals are temporally aligned using cross-correlation. This compensates for microphone delay and propagation differences and ensures that corresponding speech events are compared frame by frame.
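The alignment step can be sketched as follows. This is a minimal illustration of delay estimation via the cross-correlation peak, not the app's exact implementation; the function name and trimming strategy are assumptions.

```python
import numpy as np

def align_by_cross_correlation(near, far):
    """Estimate the delay of `far` relative to `near` from the
    cross-correlation peak, then trim both signals so that
    corresponding speech events line up sample by sample."""
    # Full cross-correlation; the peak index encodes the lag.
    corr = np.correlate(far, near, mode="full")
    lag = int(np.argmax(corr)) - (len(near) - 1)
    if lag > 0:        # far-field signal starts `lag` samples later
        far = far[lag:]
    elif lag < 0:      # near-field signal starts later
        near = near[-lag:]
    n = min(len(near), len(far))
    return near[:n], far[:n], lag
```

After trimming, both signals have equal length and a shared time origin, so frame *i* of one recording corresponds to frame *i* of the other.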
### 2. Frame Segmentation
The aligned signals are segmented into short-time overlapping frames using user-defined frame length and hop size. This enables localized analysis of acoustic degradation instead of global averages.
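A vectorized framing routine along these lines could implement this step (a sketch; the actual app may frame differently, e.g. via librosa):

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames of shape
    (n_frames, frame_len) using fancy indexing."""
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row i selects samples [i*hop, i*hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]
```

With a 25 ms frame length and a 10 ms hop at 16 kHz (400 and 160 samples), adjacent frames overlap by 60%, which is a common choice for short-time speech analysis.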
### 3. Multi-Domain Feature Extraction
For each frame, features are extracted across multiple acoustic domains:
**Time-domain features**
- RMS energy (loudness and dynamic range)
- Zero-crossing rate (signal texture)

**Frequency-domain features**
- Spectral centroid (brightness)
- Spectral flatness (tonal vs. noise-like character)
**Cepstral features**
- 13 MFCC coefficients representing the perceptual spectral envelope
**Band-wise spectral energy**
- Low-frequency band (≤ 2 kHz)
- Mid-frequency band (2–4 kHz)
- High-frequency band (> 4 kHz)
This multi-domain representation helps isolate attenuation, reverberation, and noise effects caused by distance.
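A numpy-only sketch of the per-frame features is given below. It covers the time-domain, frequency-domain, and band-energy measures; the MFCCs are omitted here (the app presumably computes them with a library such as librosa). Names and epsilon choices are illustrative assumptions.

```python
import numpy as np

def frame_features(frame, sr):
    """Compute illustrative per-frame features (MFCCs omitted)."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    power = spec ** 2
    total = power.sum() + 1e-12
    return {
        # Loudness / dynamic range.
        "rms": np.sqrt(np.mean(frame ** 2)),
        # Brightness: magnitude-weighted mean frequency.
        "centroid": (freqs * spec).sum() / (spec.sum() + 1e-12),
        # Tonal (near 0) vs. noise-like (near 1): geometric / arithmetic mean.
        "flatness": np.exp(np.mean(np.log(spec + 1e-12)))
                    / (np.mean(spec) + 1e-12),
        # Signal texture: fraction of sample pairs with a sign change.
        "zcr": np.mean(np.abs(np.diff(np.sign(frame))) > 0),
        # Normalized band energies used in the band-wise analysis.
        "low_band": power[freqs <= 2000].sum() / total,
        "mid_band": power[(freqs > 2000) & (freqs <= 4000)].sum() / total,
        "high_band": power[freqs > 4000].sum() / total,
    }
```

For a pure 1 kHz tone, for example, nearly all energy falls in the low band, the centroid sits at 1 kHz, and flatness is close to zero, matching the intuition behind each descriptor.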
### 4. Frame-wise Near–Far Comparison
Each near-field frame is compared with its aligned far-field frame using:
- Cosine similarity between feature vectors
- Spectral overlap between STFT magnitudes
- High-frequency energy loss (in dB)
These measures are combined into a match quality score, which indicates how closely far-field speech resembles near-field speech at each frame.
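One plausible way to combine the three measures is a weighted sum, sketched below. The weights, the dB-to-score mapping, and the function name are assumptions for illustration; the app's actual combination rule may differ.

```python
import numpy as np

def frame_match_quality(near_feat, far_feat, near_spec, far_spec, freqs,
                        w=(0.4, 0.4, 0.2)):
    """Combine cosine similarity, spectral overlap, and high-frequency
    energy loss into a single match-quality score in roughly [0, 1]."""
    # Cosine similarity between the per-frame feature vectors.
    cos = np.dot(near_feat, far_feat) / (
        np.linalg.norm(near_feat) * np.linalg.norm(far_feat) + 1e-12)
    # Spectral overlap between STFT magnitudes (per-bin min over max).
    overlap = np.minimum(near_spec, far_spec).sum() / (
        np.maximum(near_spec, far_spec).sum() + 1e-12)
    # High-frequency (> 4 kHz) energy loss in dB, squashed to (0, 1].
    hf = freqs > 4000
    loss_db = 10 * np.log10((near_spec[hf] ** 2).sum() + 1e-12) \
            - 10 * np.log10((far_spec[hf] ** 2).sum() + 1e-12)
    loss_score = 1.0 / (1.0 + max(loss_db, 0.0) / 10.0)
    return w[0] * cos + w[1] * overlap + w[2] * loss_score
```

Identical frames score 1.0; attenuated or smeared far-field frames pull the score down through the overlap and loss terms.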
### 5. Clustering and Feature Correlation
Unsupervised clustering (K-Means or Agglomerative) is applied independently to near-field and far-field features to explore separability.
In addition, correlation analysis is performed to study:
- How near-field and far-field features relate to each other
- Which features are most correlated with match quality
This helps identify features that are most sensitive to distance-induced degradation.
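The clustering and correlation steps could look roughly like this (a sketch using scikit-learn's K-Means; the function name, cluster count, and Pearson-based ranking are assumptions, and the agglomerative variant is omitted):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_correlate(near_feats, far_feats, quality, n_clusters=2):
    """Cluster near- and far-field feature matrices (n_frames, n_features)
    independently, then rank features by absolute Pearson correlation
    with the per-frame match-quality score."""
    near_labels = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=0).fit_predict(near_feats)
    far_labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=0).fit_predict(far_feats)
    # Correlation of each far-field feature with frame-level quality.
    corr = np.array([np.corrcoef(far_feats[:, j], quality)[0, 1]
                     for j in range(far_feats.shape[1])])
    ranking = np.argsort(-np.abs(corr))  # most quality-sensitive first
    return near_labels, far_labels, corr, ranking
```

Features at the top of `ranking` are the ones most sensitive to distance-induced degradation, which is exactly what the correlation heatmaps visualize.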
## Visualizations Provided
The Gradio interface includes the following visual outputs:
- Frame-wise similarity and degradation plots
- Spectral difference heatmaps
- Cluster scatter plots (near-field and far-field)
- Feature–quality overlay plots
- Feature correlation heatmaps
- Scatter matrices for inter-feature relationships
All results can be exported as CSV files for offline analysis.
## Key Observations from the Tool
Based on experiments conducted using the AMI Meeting Corpus:
- Far-field speech consistently loses low-frequency energy
- Mid-frequency bands often show reinforcement due to room reverberation
- High-frequency bands remain relatively stable but are noise-dominated
- MFCC trajectories in far-field speech are compressed, indicating a muffled spectral envelope
- Temporal structure is largely preserved, but perceived quality degrades
- Unsupervised clustering struggles due to overlapping feature distributions
These observations motivated the exploration of neural difference encoders in later stages of the project.
## Limitations
- Processing time is high for long audio files in CPU-only environments
- Clustering does not reliably separate near-field and far-field speech
- Some visualizations require domain knowledge to interpret correctly
## Intended Use
This tool is intended for:
- Academic analysis of distant speech degradation
- Feature-level inspection before model design
- Supporting research in far-field ASR and speech enhancement
It is not intended for use as a real-time or production-grade system.
## Dataset
Experiments were conducted using the AMI Meeting Corpus, specifically synchronized near-field headset and far-field microphone recordings.
## Acknowledgements
This project was developed as part of the Samsung PRISM Worklet Program at R. V. College of Engineering.
## License
This project is intended for academic and research use only.