---
title: MSD
emoji: 💻
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: let it work
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Near–Far Field Speech Analysis Tool (Gradio)

Overview

This repository contains an interactive Gradio tool, hosted as a Hugging Face Space, developed as part of the PRISM worklet
“Distant Acoustic Speech Investigations using Signal Processing and Neural Networks.”

The tool performs a frame-wise comparative analysis of near-field and far-field speech recordings to study how microphone distance affects speech quality, spectral structure, and perceptually relevant features. The focus of this project is analysis and interpretation, not speech enhancement.


Motivation

In real-world speech systems, microphones are often placed at a distance from the speaker. Compared to near-field recordings, far-field speech suffers from:

  • Frequency-dependent attenuation
  • Reverberation and temporal smearing
  • Increased background noise
  • Compression of cepstral and spectral features

Most speech recognition and enhancement models do not explicitly analyze how these effects appear at a frame level. This tool was built to visually and quantitatively understand distant speech degradation before attempting model-based solutions.


What This Tool Does

The tool takes two recordings of the same speech content:

  • A near-field (reference) audio file
  • A far-field (target) audio file

It then performs the following steps:

  1. Temporal alignment of both signals
  2. Frame-wise segmentation with overlap
  3. Multi-domain feature extraction
  4. Frame-level similarity and degradation analysis
  5. Unsupervised clustering of acoustic features
  6. Feature–quality correlation analysis
  7. Interactive visualization and CSV export

Signal Processing Pipeline

1. Signal Alignment

Near-field and far-field signals are temporally aligned using cross-correlation. This compensates for microphone delay and propagation differences and ensures that corresponding speech events are compared frame by frame.
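The Space's source is not shown here, so the following is only a minimal sketch of cross-correlation alignment using numpy; the function name `align_by_xcorr` and the trim-to-common-length behavior are assumptions, not the app's exact implementation:

```python
import numpy as np

def align_by_xcorr(near: np.ndarray, far: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Align two signals by the lag that maximizes their cross-correlation.

    Hypothetical helper: trims whichever signal leads, then crops both to a
    common length so frames can be compared one-to-one.
    """
    # Full cross-correlation; the peak index gives the relative lag of `far`
    # with respect to `near` (positive lag = far-field is delayed).
    corr = np.correlate(far, near, mode="full")
    lag = int(np.argmax(corr)) - (len(near) - 1)
    if lag > 0:
        far = far[lag:]        # far-field delayed: drop its leading samples
    elif lag < 0:
        near = near[-lag:]     # near-field delayed instead
    n = min(len(near), len(far))
    return near[:n], far[:n]
```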


2. Frame Segmentation

The aligned signals are segmented into short-time overlapping frames using user-defined frame length and hop size. This enables localized analysis of acoustic degradation instead of global averages.
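As an illustration (not the app's exact code), short-time framing can be done with a strided index matrix, assuming the frame length and hop are given in samples (e.g. 400 and 160 at 16 kHz for 25 ms frames with a 10 ms hop):

```python
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Slice x into overlapping frames of frame_len samples, advancing by hop.

    Returns an array of shape (n_frames, frame_len); trailing samples that do
    not fill a whole frame are dropped.
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row r of `idx` holds the sample indices of frame r.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]
```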


3. Multi-Domain Feature Extraction

For each frame, features are extracted across multiple acoustic domains:

Time-domain features

  • RMS energy (loudness and dynamic range)

Frequency-domain features

  • Spectral centroid (brightness)
  • Spectral flatness (tonal vs noise-like)
  • Zero-crossing rate (signal texture)

Cepstral features

  • 13 mel-frequency cepstral coefficients (MFCCs) representing the perceptual spectral envelope

Band-wise spectral energy

  • Low-frequency band (≤ 2 kHz)
  • Mid-frequency band (2–4 kHz)
  • High-frequency band (> 4 kHz)

This multi-domain representation helps isolate attenuation, reverberation, and noise effects caused by distance.
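A numpy-only sketch of the per-frame features listed above (MFCCs are omitted here, since in practice they would come from a library such as librosa; the function name, return layout, and epsilon guards are assumptions). The 2 kHz and 4 kHz band edges follow the split described above:

```python
import numpy as np

def frame_features(frame: np.ndarray, sr: int = 16000) -> dict:
    """Illustrative time-, frequency-, and band-energy features for one frame."""
    rms = np.sqrt(np.mean(frame ** 2))                       # loudness
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2       # signal texture
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    power = mag ** 2
    # Brightness: magnitude-weighted mean frequency.
    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
    # Tonal vs noise-like: geometric / arithmetic mean of the magnitude spectrum.
    flatness = np.exp(np.mean(np.log(mag + 1e-12))) / (np.mean(mag) + 1e-12)
    # Band-wise spectral energy at the README's 2 kHz / 4 kHz split.
    low = power[freqs <= 2000].sum()
    mid = power[(freqs > 2000) & (freqs <= 4000)].sum()
    high = power[freqs > 4000].sum()
    return {"rms": rms, "zcr": zcr, "centroid": centroid,
            "flatness": flatness, "low": low, "mid": mid, "high": high}
```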


4. Frame-wise Near–Far Comparison

Each near-field frame is compared with its aligned far-field frame using:

  • Cosine similarity between feature vectors
  • Spectral overlap between STFT magnitudes
  • High-frequency energy loss (in dB)

These measures are combined into a match quality score, which indicates how closely far-field speech resembles near-field speech at each frame.
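The three measures can be sketched as below; note that the equal weighting used to combine them into `match_quality` is a hypothetical choice for illustration, not the Space's actual formula:

```python
import numpy as np

def frame_match(near_feat: np.ndarray, far_feat: np.ndarray,
                near_mag: np.ndarray, far_mag: np.ndarray,
                freqs: np.ndarray) -> dict:
    """Per-frame near/far comparison: cosine similarity of feature vectors,
    spectral overlap of STFT magnitudes, and high-frequency loss in dB."""
    # Cosine similarity between the two frames' feature vectors.
    cos = float(np.dot(near_feat, far_feat) /
                (np.linalg.norm(near_feat) * np.linalg.norm(far_feat) + 1e-12))
    # Spectral overlap: shared magnitude relative to the near-field reference.
    overlap = float(np.minimum(near_mag, far_mag).sum() / (near_mag.sum() + 1e-12))
    # High-frequency (> 4 kHz) energy loss of the far-field frame, in dB.
    hf = freqs > 4000
    hf_loss_db = float(10 * np.log10((near_mag[hf] ** 2).sum() /
                                     ((far_mag[hf] ** 2).sum() + 1e-12) + 1e-12))
    # Hypothetical equal-weight combination into a match-quality score.
    score = 0.5 * cos + 0.5 * overlap
    return {"cosine": cos, "overlap": overlap,
            "hf_loss_db": hf_loss_db, "match_quality": score}
```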


5. Clustering and Feature Correlation

Unsupervised clustering (K-Means or Agglomerative) is applied independently to near-field and far-field features to explore separability.

In addition, correlation analysis is performed to study:

  • How near-field and far-field features relate to each other
  • Which features are most correlated with match quality

This helps identify features that are most sensitive to distance-induced degradation.
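The clustering step would typically use scikit-learn's K-Means or Agglomerative implementations; to stay dependency-free, the sketch below covers only the feature–quality correlation step, ranking each feature column by its Pearson correlation with the match-quality score (function and column names are illustrative):

```python
import numpy as np

def quality_correlations(features: np.ndarray, match_quality: np.ndarray,
                         names: list) -> dict:
    """Pearson correlation of each feature column (features[:, j]) with the
    per-frame match-quality score; high |r| marks distance-sensitive features."""
    return {name: float(np.corrcoef(features[:, j], match_quality)[0, 1])
            for j, name in enumerate(names)}
```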


Visualizations Provided

The Gradio interface includes the following visual outputs:

  • Frame-wise similarity and degradation plots
  • Spectral difference heatmaps
  • Cluster scatter plots (near-field and far-field)
  • Feature–quality overlay plots
  • Feature correlation heatmaps
  • Scatter matrices for inter-feature relationships

All results can be exported as CSV files for offline analysis.
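The CSV export can be as simple as the stdlib sketch below (column names and the in-memory string return are assumptions; the app may write files directly):

```python
import csv
import io

def export_frames_csv(rows: list) -> str:
    """Serialize a list of per-frame result dicts to CSV text.

    Columns are taken from the keys of the first row; all rows are assumed
    to share the same keys.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```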


Key Observations from the Tool

Based on experiments conducted using the AMI Meeting Corpus:

  • Far-field speech consistently loses low-frequency energy
  • Mid-frequency bands often show reinforcement due to room reverberation
  • High-frequency bands remain relatively stable but are noise-dominated
  • MFCCs in far-field speech are compressed, indicating muffled spectral envelopes
  • Temporal structure is largely preserved, but quality degrades
  • Unsupervised clustering struggles due to overlapping feature distributions

These observations motivated the exploration of neural difference encoders in later stages of the project.


Limitations

  • Processing time is high for long audio files in CPU-only environments
  • Clustering does not reliably separate near-field and far-field speech
  • Some visualizations require domain knowledge to interpret correctly

Intended Use

This tool is intended for:

  • Academic analysis of distant speech degradation
  • Feature-level inspection before model design
  • Supporting research in far-field ASR and speech enhancement

It is not intended to be used as a real-time or production-level system.


Dataset

Experiments were conducted using the AMI Meeting Corpus, specifically synchronized near-field headset and far-field microphone recordings.


Acknowledgements

This project was developed as part of the Samsung PRISM Worklet Program at R. V. College of Engineering.


License

This project is intended for academic and research use only.