---
title: MSD
emoji: 💻
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: let it work
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Near–Far Field Speech Analysis Tool (Gradio)

## Overview

This repository contains an interactive Hugging Face Gradio tool developed as part of the PRISM worklet
**“Distant Acoustic Speech Investigations using Signal Processing and Neural Networks.”**

The tool performs a **frame-wise comparative analysis of near-field and far-field speech recordings** to
study how microphone distance affects speech quality, spectral structure, and perceptually relevant
features. The focus of this project is analysis and interpretation, not speech enhancement.

---
## Motivation

In real-world speech systems, microphones are often placed at a distance from the speaker.
Compared to near-field recordings, far-field speech suffers from:

- Frequency-dependent attenuation
- Reverberation and temporal smearing
- Increased background noise
- Compression of cepstral and spectral features

Most speech recognition and enhancement models do not explicitly analyze how these effects appear
at the frame level. This tool was built to **visually and quantitatively understand distant speech
degradation** before attempting model-based solutions.

---
## What This Tool Does

The tool takes two recordings of the **same speech content**:

- A near-field (reference) audio file
- A far-field (target) audio file

It then performs the following steps:

1. Temporal alignment of both signals
2. Frame-wise segmentation with overlap
3. Multi-domain feature extraction
4. Frame-level similarity and degradation analysis
5. Unsupervised clustering of acoustic features
6. Feature–quality correlation analysis
7. Interactive visualization and CSV export

---
## Signal Processing Pipeline

### 1. Signal Alignment

Near-field and far-field signals are temporally aligned using **cross-correlation**.
This compensates for microphone delay and propagation differences and ensures that
corresponding speech events are compared frame by frame.
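The alignment step can be sketched in NumPy/SciPy roughly as follows. This is a minimal illustration, not the actual code from `app.py`; the function name and trimming strategy are assumptions.

```python
import numpy as np
from scipy.signal import correlate

def align_by_cross_correlation(near, far):
    """Estimate the lag of `far` relative to `near` via the
    cross-correlation peak, then trim both signals so that
    corresponding samples line up."""
    xcorr = correlate(far, near, mode="full")
    lag = int(np.argmax(xcorr)) - (len(near) - 1)
    if lag > 0:          # far-field recording starts later
        far = far[lag:]
    elif lag < 0:        # near-field recording starts later
        near = near[-lag:]
    n = min(len(near), len(far))
    return near[:n], far[:n], lag

# Synthetic check: a copy delayed by 50 samples should realign exactly.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
y = np.concatenate([np.zeros(50), x])
near_aligned, far_aligned, lag = align_by_cross_correlation(x, y)
```

For clean signals the correlation peak gives the delay directly; real far-field recordings may need band-limiting or normalization before correlation, which is omitted here.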
---
### 2. Frame Segmentation

The aligned signals are segmented into short-time overlapping frames using a
user-defined frame length and hop size. This enables localized analysis of
acoustic degradation instead of global averages.
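Overlapping segmentation of this kind can be done with a strided index trick in pure NumPy; the sketch below assumes a 1024-sample frame with a 512-sample hop, which may differ from the tool's defaults.

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Slice `x` into overlapping frames of `frame_len` samples,
    advancing by `hop` samples each time.
    Returns an array of shape (n_frames, frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

# 4096 samples with 50% overlap -> 7 frames of 1024 samples each.
frames = frame_signal(np.arange(4096, dtype=float), frame_len=1024, hop=512)
```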
---
### 3. Multi-Domain Feature Extraction

For each frame, features are extracted across multiple acoustic domains:

**Time-domain features**
- RMS energy (loudness and dynamic range)

**Frequency-domain features**
- Spectral centroid (brightness)
- Spectral flatness (tonal vs. noise-like)
- Zero-crossing rate (signal texture)

**Cepstral features**
- 13 MFCC coefficients representing the perceptual spectral envelope

**Band-wise spectral energy**
- Low-frequency band (≤ 2 kHz)
- Mid-frequency band (2–4 kHz)
- High-frequency band (> 4 kHz)

This multi-domain representation helps isolate attenuation, reverberation,
and noise effects caused by distance.
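Most of these per-frame features reduce to simple spectral sums. A NumPy-only sketch of the non-MFCC features is shown below (the actual tool would typically compute MFCCs with a library such as librosa; that part is omitted here):

```python
import numpy as np

def frame_features(frame, sr=16000):
    """Compute a subset of the per-frame features described above."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    power = spec ** 2
    rms = np.sqrt(np.mean(frame ** 2))                         # time domain
    centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-12)   # brightness
    flatness = (np.exp(np.mean(np.log(power + 1e-12)))
                / (np.mean(power) + 1e-12))                    # tonal vs. noisy
    zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)         # texture
    bands = {                                                  # band-wise energy
        "low":  power[freqs <= 2000].sum(),
        "mid":  power[(freqs > 2000) & (freqs <= 4000)].sum(),
        "high": power[freqs > 4000].sum(),
    }
    return rms, centroid, flatness, zcr, bands

# Sanity check with a pure 1 kHz tone: the centroid sits near 1 kHz,
# the energy lands in the low band, and flatness is near zero (tonal).
sr = 16000
t = np.arange(1024) / sr
rms, centroid, flatness, zcr, bands = frame_features(np.sin(2 * np.pi * 1000 * t), sr)
```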
---
### 4. Frame-wise Near–Far Comparison

Each near-field frame is compared with its aligned far-field frame using:

- Cosine similarity between feature vectors
- Spectral overlap between STFT magnitudes
- High-frequency energy loss (in dB)

These measures are combined into a **match quality score**, which indicates
how closely far-field speech resembles near-field speech at each frame.
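One way to combine the three measures is sketched below. The exact weighting and the mapping of dB loss to a score are assumptions for illustration; the tool's actual combination rule may differ.

```python
import numpy as np

def frame_match_quality(feat_near, feat_far, spec_near, spec_far,
                        freqs, hf_cut=4000.0):
    """Combine cosine similarity, spectral overlap, and high-frequency
    loss into one score in [0, 1] (equal weighting is an assumption)."""
    # 1. Cosine similarity between feature vectors.
    cos = np.dot(feat_near, feat_far) / (
        np.linalg.norm(feat_near) * np.linalg.norm(feat_far) + 1e-12)
    # 2. Spectral overlap between STFT magnitude frames.
    overlap = (np.minimum(spec_near, spec_far).sum()
               / (np.maximum(spec_near, spec_far).sum() + 1e-12))
    # 3. High-frequency energy loss in dB (positive = far field lost energy).
    hf = freqs >= hf_cut
    hf_loss_db = 10 * np.log10(
        (spec_near[hf] ** 2).sum() / ((spec_far[hf] ** 2).sum() + 1e-12) + 1e-12)
    # Map the loss to [0, 1]: 0 dB -> 1, >= 30 dB -> 0 (illustrative choice).
    loss_score = np.clip(1.0 - hf_loss_db / 30.0, 0.0, 1.0)
    return (cos + overlap + loss_score) / 3.0, cos, overlap, hf_loss_db

# Identical frames should score close to 1.0 with no high-frequency loss.
feat = np.array([1.0, 2.0, 3.0])
spec = np.ones(128)
freqs = np.linspace(0, 8000, 128)
score, cos, overlap, hf_loss = frame_match_quality(feat, feat, spec, spec, freqs)
```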
---
### 5. Clustering and Feature Correlation

Unsupervised clustering (K-Means or Agglomerative) is applied independently
to near-field and far-field features to explore separability.

In addition, correlation analysis is performed to study:

- How near-field and far-field features relate to each other
- Which features are most correlated with match quality

This helps identify features that are most sensitive to distance-induced degradation.
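The clustering and per-feature correlation steps follow the standard scikit-learn/NumPy pattern. The sketch below uses synthetic data standing in for per-frame feature matrices (rows = frames, columns = features); the shapes and the "degraded copy" model are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins for per-frame feature matrices.
rng = np.random.default_rng(42)
near_feats = rng.standard_normal((200, 4))
far_feats = near_feats * 0.6 + rng.standard_normal((200, 4)) * 0.2  # degraded copy

# Unsupervised clustering applied independently to each condition.
near_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(near_feats)
far_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(far_feats)

# Per-feature near/far correlation: which features survive distance best?
corr = np.array([np.corrcoef(near_feats[:, i], far_feats[:, i])[0, 1]
                 for i in range(near_feats.shape[1])])
```

Features whose near/far correlation stays high are insensitive to distance; features whose correlation collapses are the ones driving degradation.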
---
## Visualizations Provided

The Gradio interface includes the following visual outputs:

- Frame-wise similarity and degradation plots
- Spectral difference heatmaps
- Cluster scatter plots (near-field and far-field)
- Feature–quality overlay plots
- Feature correlation heatmaps
- Scatter matrices for inter-feature relationships

All results can be exported as CSV files for offline analysis.
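The CSV export is a straightforward pandas dump of the per-frame results; the column names below are illustrative, not the tool's actual schema.

```python
import pandas as pd

# Hypothetical per-frame results table (column names are assumptions).
results = pd.DataFrame({
    "frame": [0, 1, 2],
    "match_quality": [0.91, 0.74, 0.62],
    "hf_loss_db": [1.2, 4.8, 7.5],
})
csv_text = results.to_csv(index=False)  # or results.to_csv("results.csv", ...)
```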
---

## Key Observations from the Tool

Based on experiments conducted using the AMI Meeting Corpus:

- Far-field speech consistently loses low-frequency energy
- Mid-frequency bands often show reinforcement due to room reverberation
- High-frequency bands remain relatively stable but are noise-dominated
- MFCCs in far-field speech are compressed, indicating muffled spectral envelopes
- Temporal structure is largely preserved, but quality degrades
- Unsupervised clustering struggles due to overlapping feature distributions

These observations motivated the exploration of neural difference encoders
in later stages of the project.

---
## Limitations

- Processing time is high for long audio files in CPU-only environments
- Clustering does not reliably separate near-field and far-field speech
- Some visualizations require domain knowledge to interpret correctly

---
## Intended Use

This tool is intended for:

- Academic analysis of distant speech degradation
- Feature-level inspection before model design
- Supporting research in far-field ASR and speech enhancement

It is not intended to be used as a real-time or production-level system.

---
## Dataset

Experiments were conducted using the **AMI Meeting Corpus**, specifically its
synchronized near-field headset and far-field microphone recordings.

---
## Acknowledgements

This project was developed as part of the **Samsung PRISM Worklet Program**
at R. V. College of Engineering.

---
## License

This project is intended for academic and research use only.