---
title: MSD
emoji: 💻
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: let it work
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Near–Far Field Speech Analysis Tool (Gradio)

## Overview

This repository contains an interactive Hugging Face Gradio tool developed as part of the PRISM worklet **“Distant Acoustic Speech Investigations using Signal Processing and Neural Networks.”** The tool performs a **frame-wise comparative analysis of near-field and far-field speech recordings** to study how microphone distance affects speech quality, spectral structure, and perceptually relevant features.

The focus of this project is analysis and interpretation, not speech enhancement.

---

## Motivation

In real-world speech systems, microphones are often placed at a distance from the speaker. Compared to near-field recordings, far-field speech suffers from:

- Frequency-dependent attenuation
- Reverberation and temporal smearing
- Increased background noise
- Compression of cepstral and spectral features

Most speech recognition and enhancement models do not explicitly analyze how these effects appear at the frame level. This tool was built to **visually and quantitatively understand distant speech degradation** before attempting model-based solutions.

---

## What This Tool Does

The tool takes two recordings of the **same speech content**:

- A near-field (reference) audio file
- A far-field (target) audio file

It then performs the following steps:

1. Temporal alignment of both signals
2. Frame-wise segmentation with overlap
3. Multi-domain feature extraction
4. Frame-level similarity and degradation analysis
5. Unsupervised clustering of acoustic features
6. Feature–quality correlation analysis
7. Interactive visualization and CSV export

---

## Signal Processing Pipeline
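Before each stage is described in detail, the overall flow can be sketched in a simplified, numpy-only form. The helper names (`align_by_xcorr`, `frame_signal`, `rms_energy`) and the synthetic signals below are illustrative assumptions, not the actual code in `app.py`:

```python
import numpy as np

def align_by_xcorr(near, far):
    """Estimate the far-field delay via full cross-correlation and trim it."""
    corr = np.correlate(far, near, mode="full")
    lag = int(np.argmax(corr)) - (len(near) - 1)  # >0 means far lags near
    if lag > 0:
        far = far[lag:]
    elif lag < 0:
        near = near[-lag:]
    n = min(len(near), len(far))
    return near[:n], far[:n]

def frame_signal(x, frame_len=512, hop=256):
    """Slice a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def rms_energy(frames):
    """Time-domain RMS energy per frame."""
    return np.sqrt(np.mean(frames**2, axis=1))

# Synthetic demo: the "far" signal is a delayed, attenuated copy of "near".
rng = np.random.default_rng(0)
near = rng.standard_normal(8000)
far = np.concatenate([np.zeros(120), 0.5 * near])

near_a, far_a = align_by_xcorr(near, far)
rms_near = rms_energy(frame_signal(near_a))
rms_far = rms_energy(frame_signal(far_a))
print(np.median(rms_far / rms_near))  # ≈ 0.5 once the delay is removed
```

In the real pipeline the frame length and hop size are user-configurable, and the feature set is much richer than RMS alone, but the align → frame → extract → compare structure is the same.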
### 1. Signal Alignment

Near-field and far-field signals are temporally aligned using **cross-correlation**. This compensates for microphone delay and propagation differences and ensures that corresponding speech events are compared frame by frame.

---

### 2. Frame Segmentation

The aligned signals are segmented into short-time overlapping frames using user-defined frame length and hop size. This enables localized analysis of acoustic degradation instead of global averages.

---

### 3. Multi-Domain Feature Extraction

For each frame, features are extracted across multiple acoustic domains:

**Time-domain features**
- RMS energy (loudness and dynamic range)

**Frequency-domain features**
- Spectral centroid (brightness)
- Spectral flatness (tonal vs. noise-like)
- Zero-crossing rate (signal texture)

**Cepstral features**
- 13 MFCC coefficients representing the perceptual spectral envelope

**Band-wise spectral energy**
- Low-frequency band (≤ 2 kHz)
- Mid-frequency band (2–4 kHz)
- High-frequency band (> 4 kHz)

This multi-domain representation helps isolate attenuation, reverberation, and noise effects caused by distance.

---

### 4. Frame-wise Near–Far Comparison

Each near-field frame is compared with its aligned far-field frame using:

- Cosine similarity between feature vectors
- Spectral overlap between STFT magnitudes
- High-frequency energy loss (in dB)

These measures are combined into a **match quality score**, which indicates how closely far-field speech resembles near-field speech at each frame.

---

### 5. Clustering and Feature Correlation

Unsupervised clustering (K-Means or Agglomerative) is applied independently to near-field and far-field features to explore separability. In addition, correlation analysis is performed to study:

- How near-field and far-field features relate to each other
- Which features are most correlated with match quality

This helps identify the features that are most sensitive to distance-induced degradation.
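As a minimal illustration of the comparison and correlation steps, the sketch below computes a per-frame cosine-similarity quality score and then correlates each feature with it. The helper names, the synthetic feature matrices, and the degradation model (attenuation plus frame-dependent noise) are assumptions for demonstration, not the tool's actual implementation:

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (frames x features) matrices."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    return num / den

def pearson_with_quality(features, quality):
    """Pearson correlation of each feature column with the quality score."""
    f = features - features.mean(axis=0)
    q = quality - quality.mean()
    den = np.sqrt(np.sum(f**2, axis=0) * np.sum(q**2)) + 1e-12
    return (f.T @ q) / den

rng = np.random.default_rng(1)
near_feats = rng.standard_normal((100, 5))

# Far-field frames: attenuated near-field features plus noise that grows
# over time, mimicking increasingly degraded frames.
noise_level = np.linspace(0.0, 2.0, 100)
far_feats = 0.6 * near_feats + noise_level[:, None] * rng.standard_normal((100, 5))

quality = cosine_similarity(near_feats, far_feats)  # one score per frame
corr = pearson_with_quality(far_feats, quality)     # one value per feature
print(corr.shape)  # (5,)
```

Features with a strongly negative or positive correlation to the quality score are the ones most sensitive to distance-induced degradation, which is exactly what the correlation heatmaps in the interface visualize.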
---

## Visualizations Provided

The Gradio interface includes the following visual outputs:

- Frame-wise similarity and degradation plots
- Spectral difference heatmaps
- Cluster scatter plots (near-field and far-field)
- Feature–quality overlay plots
- Feature correlation heatmaps
- Scatter matrices for inter-feature relationships

All results can be exported as CSV files for offline analysis.

---

## Key Observations from the Tool

Based on experiments conducted using the AMI Meeting Corpus:

- Far-field speech consistently loses low-frequency energy
- Mid-frequency bands often show reinforcement due to room reverberation
- High-frequency bands remain relatively stable but are noise-dominated
- MFCCs in far-field speech are compressed, indicating muffled spectral envelopes
- Temporal structure is largely preserved, but quality degrades
- Unsupervised clustering struggles due to overlapping feature distributions

These observations motivated the exploration of neural difference encoders in later stages of the project.

---

## Limitations

- Processing time is high for long audio files in CPU-only environments
- Clustering does not reliably separate near-field and far-field speech
- Some visualizations require domain knowledge to interpret correctly

---

## Intended Use

This tool is intended for:

- Academic analysis of distant speech degradation
- Feature-level inspection before model design
- Supporting research in far-field ASR and speech enhancement

It is not intended to be used as a real-time or production-level system.

---

## Dataset

Experiments were conducted using the **AMI Meeting Corpus**, specifically its synchronized near-field headset and far-field microphone recordings.

---

## Acknowledgements

This project was developed as part of the **Samsung PRISM Worklet Program** at R. V. College of Engineering.

---

## License

This project is intended for academic and research use only.