---
title: MSD
emoji: 💻
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: Frame-wise near-far field speech analysis
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Near–Far Field Speech Analysis Tool (Gradio)
## Overview
This repository contains an interactive Hugging Face Gradio tool developed as part of the PRISM worklet
**“Distant Acoustic Speech Investigations using Signal Processing and Neural Networks.”**
The tool performs a **frame-wise comparative analysis of near-field and far-field speech recordings** to
study how microphone distance affects speech quality, spectral structure, and perceptually relevant
features. The focus of this project is analysis and interpretation, not speech enhancement.
---
## Motivation
In real-world speech systems, microphones are often placed at a distance from the speaker.
Compared to near-field recordings, far-field speech suffers from:
- Frequency-dependent attenuation
- Reverberation and temporal smearing
- Increased background noise
- Compression of cepstral and spectral features
Most speech recognition and enhancement models do not explicitly analyze how these effects appear
at a frame level. This tool was built to **visually and quantitatively understand distant speech
degradation** before attempting model-based solutions.
---
## What This Tool Does
The tool takes two recordings of the **same speech content**:
- A near-field (reference) audio file
- A far-field (target) audio file
It then performs the following steps:
1. Temporal alignment of both signals
2. Frame-wise segmentation with overlap
3. Multi-domain feature extraction
4. Frame-level similarity and degradation analysis
5. Unsupervised clustering of acoustic features
6. Feature–quality correlation analysis
7. Interactive visualization and CSV export
---
## Signal Processing Pipeline
### 1. Signal Alignment
Near-field and far-field signals are temporally aligned using **cross-correlation**.
This compensates for microphone delay and propagation differences and ensures that
corresponding speech events are compared frame by frame.
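
A minimal sketch of this step, assuming mono `numpy` signals at a shared sample rate (the function name and trimming strategy are illustrative, not the app's exact code):

```python
import numpy as np
from scipy.signal import correlate

def align_by_cross_correlation(near: np.ndarray, far: np.ndarray):
    """Shift the far-field signal so it lines up with the near-field reference."""
    # Full cross-correlation; the peak index gives the relative delay in samples.
    corr = correlate(far, near, mode="full")
    lag = int(np.argmax(corr)) - (len(near) - 1)

    # Compensate the delay by trimming the leading samples of the later signal.
    if lag > 0:
        far = far[lag:]
    elif lag < 0:
        near = near[-lag:]

    # Truncate both to a common length so frames stay paired.
    n = min(len(near), len(far))
    return near[:n], far[:n]
```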
---
### 2. Frame Segmentation
The aligned signals are segmented into short-time overlapping frames using
user-defined frame length and hop size. This enables localized analysis of
acoustic degradation instead of global averages.
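
One way to implement the segmentation, assuming sample-domain frame and hop parameters (`librosa.util.frame` provides an equivalent strided view):

```python
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int = 1024, hop: int = 512) -> np.ndarray:
    """Split a 1-D signal into overlapping frames of shape (n_frames, frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]
```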
---
### 3. Multi-Domain Feature Extraction
For each frame, features are extracted across multiple acoustic domains:
**Time-domain features**
- RMS energy (loudness and dynamic range)
- Zero-crossing rate (signal texture)
**Frequency-domain features**
- Spectral centroid (brightness)
- Spectral flatness (tonal vs. noise-like character)
**Cepstral features**
- 13 MFCC coefficients representing the perceptual spectral envelope
**Band-wise spectral energy**
- Low-frequency band (≤ 2 kHz)
- Mid-frequency band (2–4 kHz)
- High-frequency band (> 4 kHz)
This multi-domain representation helps isolate attenuation, reverberation,
and noise effects caused by distance.
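
The sketch below shows one way to compute this feature set for a single frame; the dictionary keys and epsilon guards are illustrative choices, not the app's actual field names:

```python
import numpy as np
import librosa

def frame_features(frame: np.ndarray, sr: int) -> dict:
    """Extract the multi-domain feature set for one frame (a sketch)."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    power = spec ** 2

    feats = {
        "rms": float(np.sqrt(np.mean(frame ** 2))),
        "zcr": float(np.mean(np.abs(np.diff(np.sign(frame))) > 0)),
        "centroid": float(np.sum(freqs * spec) / (np.sum(spec) + 1e-10)),
        # Flatness: geometric mean over arithmetic mean of the power spectrum.
        "flatness": float(np.exp(np.mean(np.log(power + 1e-10))) / (np.mean(power) + 1e-10)),
        "low_energy": float(np.sum(power[freqs <= 2000])),
        "mid_energy": float(np.sum(power[(freqs > 2000) & (freqs <= 4000)])),
        "high_energy": float(np.sum(power[freqs > 4000])),
    }
    # 13 MFCCs summarizing the perceptual spectral envelope.
    mfcc = librosa.feature.mfcc(y=frame.astype(np.float32), sr=sr,
                                n_mfcc=13, n_fft=len(frame))
    for i, c in enumerate(mfcc.mean(axis=1)):
        feats[f"mfcc_{i}"] = float(c)
    return feats
```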
---
### 4. Frame-wise Near–Far Comparison
Each near-field frame is compared with its aligned far-field frame using:
- Cosine similarity between feature vectors
- Spectral overlap between STFT magnitudes
- High-frequency energy loss (in dB)
These measures are combined into a **match quality score**, which indicates
how closely far-field speech resembles near-field speech at each frame.
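
A sketch of how these measures might be combined; the 0.5/0.3/0.2 weighting is an illustrative assumption, since the actual weights are a design choice of the app:

```python
import numpy as np

def frame_match(near_feat: np.ndarray, far_feat: np.ndarray,
                near_spec: np.ndarray, far_spec: np.ndarray,
                freqs: np.ndarray) -> dict:
    """Combine per-frame similarity measures into a single match quality score."""
    # Cosine similarity between the two feature vectors.
    cos = float(np.dot(near_feat, far_feat) /
                (np.linalg.norm(near_feat) * np.linalg.norm(far_feat) + 1e-10))

    # Spectral overlap: shared magnitude relative to the near-field reference.
    overlap = float(np.sum(np.minimum(near_spec, far_spec)) /
                    (np.sum(near_spec) + 1e-10))

    # High-frequency (> 4 kHz) energy loss in dB; positive means far-field is weaker.
    hf = freqs > 4000
    hf_loss_db = float(10 * np.log10((np.sum(near_spec[hf] ** 2) + 1e-10) /
                                     (np.sum(far_spec[hf] ** 2) + 1e-10)))

    # Illustrative weighting of the three measures into one score.
    quality = 0.5 * cos + 0.3 * overlap + 0.2 * np.exp(-max(hf_loss_db, 0.0) / 20.0)
    return {"cosine": cos, "overlap": overlap,
            "hf_loss_db": hf_loss_db, "quality": quality}
```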
---
### 5. Clustering and Feature Correlation
Unsupervised clustering (K-Means or Agglomerative) is applied independently
to near-field and far-field features to explore separability.
In addition, correlation analysis is performed to study:
- How near-field and far-field features relate to each other
- Which features are most correlated with match quality
This helps identify features that are most sensitive to distance-induced degradation.
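
A compact sketch of both analyses using scikit-learn and pandas; `k = 3` and the column layout of the feature DataFrames are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_and_correlate(near_df: pd.DataFrame, far_df: pd.DataFrame,
                          quality: np.ndarray, k: int = 3):
    """Cluster each condition independently; rank features by |corr| with quality."""
    labels = {}
    for name, df in [("near", near_df), ("far", far_df)]:
        X = StandardScaler().fit_transform(df.values)
        labels[name] = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

    # Pearson correlation of each far-field feature with the match quality score.
    corr = far_df.apply(lambda col: np.corrcoef(col, quality)[0, 1])
    return labels, corr.sort_values(key=np.abs, ascending=False)
```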
---
## Visualizations Provided
The Gradio interface includes the following visual outputs:
- Frame-wise similarity and degradation plots
- Spectral difference heatmaps
- Cluster scatter plots (near-field and far-field)
- Feature–quality overlay plots
- Feature correlation heatmaps
- Scatter matrices for inter-feature relationships
All results can be exported as CSV files for offline analysis.
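
For example, the CSV export and the simplest of the plots could look like the sketch below; the `quality` column name and file path are assumed for illustration (Gradio's `gr.Plot` accepts a Matplotlib figure):

```python
import pandas as pd
import matplotlib.pyplot as plt

def export_results(frames_df: pd.DataFrame, path: str = "frame_analysis.csv") -> str:
    """Write per-frame metrics to CSV and return the path for a Gradio File output."""
    frames_df.to_csv(path, index_label="frame")
    return path

def plot_quality(frames_df: pd.DataFrame):
    """Frame-wise match quality curve."""
    fig, ax = plt.subplots(figsize=(8, 3))
    ax.plot(frames_df.index, frames_df["quality"], lw=1)
    ax.set(xlabel="Frame index", ylabel="Match quality",
           title="Near-far match quality per frame")
    fig.tight_layout()
    return fig
```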
---
## Key Observations from the Tool
Based on experiments conducted using the AMI Meeting Corpus:
- Far-field speech consistently loses low-frequency energy
- Mid-frequency bands often show reinforcement due to room reverberation
- High-frequency bands remain relatively stable but are noise-dominated
- MFCCs in far-field speech are compressed, indicating muffled spectral envelopes
- Temporal structure is largely preserved, even as overall match quality degrades
- Unsupervised clustering struggles due to overlapping feature distributions
These observations motivated the exploration of neural difference encoders
in later stages of the project.
---
## Limitations
- Processing time is high for long audio files in CPU-only environments
- Clustering does not reliably separate near-field and far-field speech
- Some visualizations require domain knowledge to interpret correctly
---
## Intended Use
This tool is intended for:
- Academic analysis of distant speech degradation
- Feature-level inspection before model design
- Supporting research in far-field ASR and speech enhancement
It is not intended for real-time or production use.
---
## Dataset
Experiments were conducted using the **AMI Meeting Corpus**, specifically
synchronized near-field headset and far-field microphone recordings.
---
## Acknowledgements
This project was developed as part of the **Samsung PRISM Worklet Program**
at R. V. College of Engineering.
---
## License
This project is intended for academic and research use only.