---
title: MSD
emoji: 💻
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: let it work
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Near–Far Field Speech Analysis Tool (Gradio)

Overview

This repository contains an interactive Gradio tool, hosted as a Hugging Face Space, developed as part of the PRISM worklet
“Distant Acoustic Speech Investigations using Signal Processing and Neural Networks.”

The tool performs a frame-wise comparative analysis of near-field and far-field speech recordings to study how microphone distance affects speech quality, spectral structure, and perceptually relevant features. The focus of this project is analysis and interpretation, not speech enhancement.


Motivation

In real-world speech systems, microphones are often placed at a distance from the speaker. Compared to near-field recordings, far-field speech suffers from:

  • Frequency-dependent attenuation
  • Reverberation and temporal smearing
  • Increased background noise
  • Compression of cepstral and spectral features

Most speech recognition and enhancement models do not explicitly analyze how these effects appear at a frame level. This tool was built to visually and quantitatively understand distant speech degradation before attempting model-based solutions.


What This Tool Does

The tool takes two recordings of the same speech content:

  • A near-field (reference) audio file
  • A far-field (target) audio file

It then performs the following steps:

  1. Temporal alignment of both signals
  2. Frame-wise segmentation with overlap
  3. Multi-domain feature extraction
  4. Frame-level similarity and degradation analysis
  5. Unsupervised clustering of acoustic features
  6. Feature–quality correlation analysis
  7. Interactive visualization and CSV export

Signal Processing Pipeline

1. Signal Alignment

Near-field and far-field signals are temporally aligned using cross-correlation. This compensates for microphone delay and propagation differences and ensures that corresponding speech events are compared frame by frame.
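The Space's source is not shown here, so the following is only a minimal sketch of cross-correlation alignment using numpy; the function name `align_by_xcorr` and the trim-to-common-length behavior are assumptions, not the app's exact implementation:

```python
import numpy as np

def align_by_xcorr(near: np.ndarray, far: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Align two signals by the lag that maximizes their cross-correlation.

    Hypothetical helper: trims whichever signal leads, then crops both to a
    common length so frames can be compared one-to-one.
    """
    # Full cross-correlation; the peak index gives the relative lag of `far`
    # with respect to `near` (positive lag = far-field is delayed).
    corr = np.correlate(far, near, mode="full")
    lag = int(np.argmax(corr)) - (len(near) - 1)
    if lag > 0:
        far = far[lag:]        # far-field delayed: drop its leading samples
    elif lag < 0:
        near = near[-lag:]     # near-field delayed instead
    n = min(len(near), len(far))
    return near[:n], far[:n]
```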


2. Frame Segmentation

The aligned signals are segmented into short-time overlapping frames using user-defined frame length and hop size. This enables localized analysis of acoustic degradation instead of global averages.
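As an illustration (not the app's exact code), short-time framing can be done with a strided index matrix, assuming the frame length and hop are given in samples (e.g. 400 and 160 at 16 kHz for 25 ms frames with a 10 ms hop):

```python
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Slice x into overlapping frames of frame_len samples, advancing by hop.

    Returns an array of shape (n_frames, frame_len); trailing samples that do
    not fill a whole frame are dropped.
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row r of `idx` holds the sample indices of frame r.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]
```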


3. Multi-Domain Feature Extraction

For each frame, features are extracted across multiple acoustic domains:

Time-domain features

  • RMS energy (loudness and dynamic range)

Frequency-domain features

  • Spectral centroid (brightness)
  • Spectral flatness (tonal vs noise-like)
  • Zero-crossing rate (signal texture)

Cepstral features

  • 13 mel-frequency cepstral coefficients (MFCCs) representing the perceptual spectral envelope

Band-wise spectral energy

  • Low-frequency band (≤ 2 kHz)
  • Mid-frequency band (2–4 kHz)
  • High-frequency band (> 4 kHz)

This multi-domain representation helps isolate attenuation, reverberation, and noise effects caused by distance.
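A numpy-only sketch of the per-frame features listed above (MFCCs are omitted here, since in practice they would come from a library such as librosa; the function name, return layout, and epsilon guards are assumptions). The 2 kHz and 4 kHz band edges follow the split described above:

```python
import numpy as np

def frame_features(frame: np.ndarray, sr: int = 16000) -> dict:
    """Illustrative time-, frequency-, and band-energy features for one frame."""
    rms = np.sqrt(np.mean(frame ** 2))                       # loudness
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2       # signal texture
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    power = mag ** 2
    # Brightness: magnitude-weighted mean frequency.
    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
    # Tonal vs noise-like: geometric / arithmetic mean of the magnitude spectrum.
    flatness = np.exp(np.mean(np.log(mag + 1e-12))) / (np.mean(mag) + 1e-12)
    # Band-wise spectral energy at the README's 2 kHz / 4 kHz split.
    low = power[freqs <= 2000].sum()
    mid = power[(freqs > 2000) & (freqs <= 4000)].sum()
    high = power[freqs > 4000].sum()
    return {"rms": rms, "zcr": zcr, "centroid": centroid,
            "flatness": flatness, "low": low, "mid": mid, "high": high}
```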


4. Frame-wise Near–Far Comparison

Each near-field frame is compared with its aligned far-field frame using:

  • Cosine similarity between feature vectors
  • Spectral overlap between STFT magnitudes
  • High-frequency energy loss (in dB)

These measures are combined into a match quality score, which indicates how closely far-field speech resembles near-field speech at each frame.
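The three measures can be sketched as below; note that the equal weighting used to combine them into `match_quality` is a hypothetical choice for illustration, not the Space's actual formula:

```python
import numpy as np

def frame_match(near_feat: np.ndarray, far_feat: np.ndarray,
                near_mag: np.ndarray, far_mag: np.ndarray,
                freqs: np.ndarray) -> dict:
    """Per-frame near/far comparison: cosine similarity of feature vectors,
    spectral overlap of STFT magnitudes, and high-frequency loss in dB."""
    # Cosine similarity between the two frames' feature vectors.
    cos = float(np.dot(near_feat, far_feat) /
                (np.linalg.norm(near_feat) * np.linalg.norm(far_feat) + 1e-12))
    # Spectral overlap: shared magnitude relative to the near-field reference.
    overlap = float(np.minimum(near_mag, far_mag).sum() / (near_mag.sum() + 1e-12))
    # High-frequency (> 4 kHz) energy loss of the far-field frame, in dB.
    hf = freqs > 4000
    hf_loss_db = float(10 * np.log10((near_mag[hf] ** 2).sum() /
                                     ((far_mag[hf] ** 2).sum() + 1e-12) + 1e-12))
    # Hypothetical equal-weight combination into a match-quality score.
    score = 0.5 * cos + 0.5 * overlap
    return {"cosine": cos, "overlap": overlap,
            "hf_loss_db": hf_loss_db, "match_quality": score}
```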


5. Clustering and Feature Correlation

Unsupervised clustering (K-Means or Agglomerative) is applied independently to near-field and far-field features to explore separability.

In addition, correlation analysis is performed to study:

  • How near-field and far-field features relate to each other
  • Which features are most correlated with match quality

This helps identify features that are most sensitive to distance-induced degradation.
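The clustering step would typically use scikit-learn's K-Means or Agglomerative implementations; to stay dependency-free, the sketch below covers only the feature–quality correlation step, ranking each feature column by its Pearson correlation with the match-quality score (function and column names are illustrative):

```python
import numpy as np

def quality_correlations(features: np.ndarray, match_quality: np.ndarray,
                         names: list) -> dict:
    """Pearson correlation of each feature column (features[:, j]) with the
    per-frame match-quality score; high |r| marks distance-sensitive features."""
    return {name: float(np.corrcoef(features[:, j], match_quality)[0, 1])
            for j, name in enumerate(names)}
```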


Visualizations Provided

The Gradio interface includes the following visual outputs:

  • Frame-wise similarity and degradation plots
  • Spectral difference heatmaps
  • Cluster scatter plots (near-field and far-field)
  • Feature–quality overlay plots
  • Feature correlation heatmaps
  • Scatter matrices for inter-feature relationships

All results can be exported as CSV files for offline analysis.
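The CSV export can be as simple as the stdlib sketch below (column names and the in-memory string return are assumptions; the app may write files directly):

```python
import csv
import io

def export_frames_csv(rows: list) -> str:
    """Serialize a list of per-frame result dicts to CSV text.

    Columns are taken from the keys of the first row; all rows are assumed
    to share the same keys.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```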


Key Observations from the Tool

Based on experiments conducted using the AMI Meeting Corpus:

  • Far-field speech consistently loses low-frequency energy
  • Mid-frequency bands often show reinforcement due to room reverberation
  • High-frequency bands remain relatively stable but are noise-dominated
  • MFCCs in far-field speech are compressed, indicating muffled spectral envelopes
  • Temporal structure is largely preserved, but quality degrades
  • Unsupervised clustering struggles due to overlapping feature distributions

These observations motivated the exploration of neural difference encoders in later stages of the project.


Limitations

  • Processing time is high for long audio files in CPU-only environments
  • Clustering does not reliably separate near-field and far-field speech
  • Some visualizations require domain knowledge to interpret correctly

Intended Use

This tool is intended for:

  • Academic analysis of distant speech degradation
  • Feature-level inspection before model design
  • Supporting research in far-field ASR and speech enhancement

It is not intended to be used as a real-time or production-level system.


Dataset

Experiments were conducted using the AMI Meeting Corpus, specifically synchronized near-field headset and far-field microphone recordings.


Acknowledgements

This project was developed as part of the Samsung PRISM Worklet Program at R. V. College of Engineering.


License

This project is intended for academic and research use only.