---
title: MSD
emoji: 💻
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: let it work
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Near–Far Field Speech Analysis Tool (Gradio)
## Overview
This repository contains an interactive Hugging Face Gradio tool developed as part of the PRISM worklet
**“Distant Acoustic Speech Investigations using Signal Processing and Neural Networks.”**
The tool performs a **frame-wise comparative analysis of near-field and far-field speech recordings** to
study how microphone distance affects speech quality, spectral structure, and perceptually relevant
features. The focus of this project is analysis and interpretation, not speech enhancement.
---
## Motivation
In real-world speech systems, microphones are often placed at a distance from the speaker.
Compared to near-field recordings, far-field speech suffers from:
- Frequency-dependent attenuation
- Reverberation and temporal smearing
- Increased background noise
- Compression of cepstral and spectral features
Most speech recognition and enhancement models do not explicitly analyze how these effects appear
at a frame level. This tool was built to **visually and quantitatively understand distant speech
degradation** before attempting model-based solutions.
---
## What This Tool Does
The tool takes two recordings of the **same speech content**:
- A near-field (reference) audio file
- A far-field (target) audio file
It then performs the following steps:
1. Temporal alignment of both signals
2. Frame-wise segmentation with overlap
3. Multi-domain feature extraction
4. Frame-level similarity and degradation analysis
5. Unsupervised clustering of acoustic features
6. Feature–quality correlation analysis
7. Interactive visualization and CSV export
---
## Signal Processing Pipeline
### 1. Signal Alignment
Near-field and far-field signals are temporally aligned using **cross-correlation**.
This compensates for microphone delay and propagation differences and ensures that
corresponding speech events are compared frame by frame.
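The alignment step can be sketched with plain NumPy: a minimal cross-correlation delay estimate, assuming single-channel signals at the same sample rate. The function name and trimming strategy here are illustrative, not the app's exact implementation.

```python
import numpy as np

def align_by_cross_correlation(near, far):
    """Estimate the delay of the far-field signal relative to the
    near-field signal via cross-correlation, then trim both so that
    corresponding samples line up."""
    # Full-mode cross-correlation; the peak location encodes the lag
    # such that far[n] approximately matches near[n - lag].
    xcorr = np.correlate(far, near, mode="full")
    lag = int(np.argmax(xcorr)) - (len(near) - 1)
    if lag > 0:      # far-field recording starts later: drop its leading samples
        far = far[lag:]
    elif lag < 0:    # near-field recording starts later
        near = near[-lag:]
    n = min(len(near), len(far))
    return near[:n], far[:n]
```

For real recordings the correlation peak can be broadened by reverberation, which is one reason the tool compares frames rather than relying on sample-exact alignment.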
---
### 2. Frame Segmentation
The aligned signals are segmented into short-time overlapping frames using
user-defined frame length and hop size. This enables localized analysis of
acoustic degradation instead of global averages.
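A minimal framing sketch, assuming a 1-D NumPy signal; frame length and hop size here are placeholders for the user-defined values mentioned above.

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Split a 1-D signal into overlapping frames (one frame per row).
    Trailing samples that do not fill a complete frame are dropped."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
```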
---
### 3. Multi-Domain Feature Extraction
For each frame, features are extracted across multiple acoustic domains:
**Time-domain features**
- RMS energy (loudness and dynamic range)
- Zero-crossing rate (signal texture)
**Frequency-domain features**
- Spectral centroid (brightness)
- Spectral flatness (tonal vs. noise-like character)
**Cepstral features**
- 13 MFCC coefficients representing the perceptual spectral envelope
**Band-wise spectral energy**
- Low-frequency band (≤ 2 kHz)
- Mid-frequency band (2–4 kHz)
- High-frequency band (> 4 kHz)
This multi-domain representation helps isolate attenuation, reverberation,
and noise effects caused by distance.
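Most of these features reduce to short NumPy expressions over a frame and its magnitude spectrum. The sketch below covers the time-domain, frequency-domain, and band-energy features; MFCCs are omitted because they need a mel filterbank (e.g. librosa). Parameter values and the dictionary layout are illustrative assumptions.

```python
import numpy as np

def frame_features(frame, sr=16000):
    """Frame-level features: RMS energy, zero-crossing rate, spectral
    centroid, spectral flatness, and band energies (<=2 kHz, 2-4 kHz,
    >4 kHz)."""
    eps = 1e-10
    rms = np.sqrt(np.mean(frame ** 2))
    # Count sign changes; each crossing contributes |diff| = 2.
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # Centroid: magnitude-weighted mean frequency ("brightness").
    centroid = np.sum(freqs * mag) / (np.sum(mag) + eps)
    # Flatness: geometric / arithmetic mean; near 0 for tonal frames.
    flatness = np.exp(np.mean(np.log(mag + eps))) / (np.mean(mag) + eps)
    power = mag ** 2
    return {
        "rms": rms, "zcr": zcr, "centroid": centroid, "flatness": flatness,
        "band_low": power[freqs <= 2000].sum(),
        "band_mid": power[(freqs > 2000) & (freqs <= 4000)].sum(),
        "band_high": power[freqs > 4000].sum(),
    }
```

For a pure 1 kHz tone, the centroid sits near 1 kHz, the low band dominates, and flatness is close to zero, which is a quick sanity check on the definitions.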
---
### 4. Frame-wise Near–Far Comparison
Each near-field frame is compared with its aligned far-field frame using:
- Cosine similarity between feature vectors
- Spectral overlap between STFT magnitudes
- High-frequency energy loss (in dB)
These measures are combined into a **match quality score**, which indicates
how closely far-field speech resembles near-field speech at each frame.
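The three per-frame measures can be combined as sketched below. The weighting, the dB-to-score mapping for high-frequency loss, and the function signature are assumptions for illustration; the app's exact scoring formula may differ.

```python
import numpy as np

def frame_match_quality(near_feat, far_feat, near_mag, far_mag,
                        freqs, weights=(0.5, 0.3, 0.2)):
    """Combine cosine similarity, spectral overlap, and high-frequency
    energy loss into a single match-quality score in [0, 1]."""
    eps = 1e-10
    # 1) Cosine similarity between frame feature vectors.
    cos = np.dot(near_feat, far_feat) / (
        np.linalg.norm(near_feat) * np.linalg.norm(far_feat) + eps)
    # 2) Spectral overlap between STFT magnitudes (min/max ratio).
    overlap = np.sum(np.minimum(near_mag, far_mag)) / (
        np.sum(np.maximum(near_mag, far_mag)) + eps)
    # 3) High-frequency (>4 kHz) energy loss in dB, mapped so that
    #    0 dB loss -> 1.0 and larger losses shrink toward 0.
    hf = freqs > 4000
    loss_db = 10 * np.log10((np.sum(near_mag[hf] ** 2) + eps) /
                            (np.sum(far_mag[hf] ** 2) + eps))
    hf_term = 1.0 / (1.0 + max(loss_db, 0.0) / 10.0)
    w1, w2, w3 = weights
    return w1 * max(cos, 0.0) + w2 * overlap + w3 * hf_term
```

Identical near and far frames score 1.0, and attenuating the high band of the far-field spectrum lowers the score through both the overlap and high-frequency terms.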
---
### 5. Clustering and Feature Correlation
Unsupervised clustering (K-Means or Agglomerative) is applied independently
to near-field and far-field features to explore separability.
In addition, correlation analysis is performed to study:
- How near-field and far-field features relate to each other
- Which features are most correlated with match quality
This helps identify features that are most sensitive to distance-induced degradation.
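The feature-to-quality part of this analysis amounts to per-column Pearson correlations, sketched below with NumPy. The function name and the sort-by-absolute-correlation convention are assumptions; the app also supports K-Means and Agglomerative clustering (scikit-learn), which is not reproduced here.

```python
import numpy as np

def feature_quality_correlations(features, quality, names):
    """Pearson correlation of each feature column with the match-quality
    score. `features` is (n_frames, n_features). Returns (name, corr)
    pairs sorted by absolute correlation, most distance-sensitive first."""
    corrs = {
        name: float(np.corrcoef(features[:, j], quality)[0, 1])
        for j, name in enumerate(names)
    }
    return sorted(corrs.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Features that track the match-quality score closely rise to the top of the ranking, while uncorrelated ones fall to the bottom.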
---
## Visualizations Provided
The Gradio interface includes the following visual outputs:
- Frame-wise similarity and degradation plots
- Spectral difference heatmaps
- Cluster scatter plots (near-field and far-field)
- Feature–quality overlay plots
- Feature correlation heatmaps
- Scatter matrices for inter-feature relationships
All results can be exported as CSV files for offline analysis.
---
## Key Observations from the Tool
Based on experiments conducted using the AMI Meeting Corpus:
- Far-field speech consistently loses low-frequency energy
- Mid-frequency bands often show reinforcement due to room reverberation
- High-frequency bands remain relatively stable but are noise-dominated
- MFCCs in far-field speech are compressed, indicating muffled spectral envelopes
- Temporal structure is largely preserved, but quality degrades
- Unsupervised clustering struggles due to overlapping feature distributions
These observations motivated the exploration of neural difference encoders
in later stages of the project.
---
## Limitations
- Processing time is high for long audio files in CPU-only environments
- Clustering does not reliably separate near-field and far-field speech
- Some visualizations require domain knowledge to interpret correctly
---
## Intended Use
This tool is intended for:
- Academic analysis of distant speech degradation
- Feature-level inspection before model design
- Supporting research in far-field ASR and speech enhancement
It is not intended for real-time or production use.
---
## Dataset
Experiments were conducted using the **AMI Meeting Corpus**, specifically
synchronized near-field headset and far-field microphone recordings.
---
## Acknowledgements
This project was developed as part of the **Samsung PRISM Worklet Program**
at R. V. College of Engineering.
---
## License
This project is intended for academic and research use only.