---
title: MSD
emoji: 💻
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: let it work
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Near–Far Field Speech Analysis Tool (Gradio)

## Overview

This repository contains an interactive Hugging Face Gradio tool developed as part of the PRISM worklet
**“Distant Acoustic Speech Investigations using Signal Processing and Neural Networks.”**

The tool performs a **frame-wise comparative analysis of near-field and far-field speech recordings** to
study how microphone distance affects speech quality, spectral structure, and perceptually relevant
features. The focus of this project is analysis and interpretation, not speech enhancement.

---
## Motivation

In real-world speech systems, microphones are often placed at a distance from the speaker.
Compared to near-field recordings, far-field speech suffers from:

- Frequency-dependent attenuation
- Reverberation and temporal smearing
- Increased background noise
- Compression of cepstral and spectral features

Most speech recognition and enhancement models do not explicitly analyze how these effects appear
at the frame level. This tool was built to **visually and quantitatively understand distant speech
degradation** before attempting model-based solutions.

---
## What This Tool Does

The tool takes two recordings of the **same speech content**:

- A near-field (reference) audio file
- A far-field (target) audio file

It then performs the following steps:

1. Temporal alignment of both signals
2. Frame-wise segmentation with overlap
3. Multi-domain feature extraction
4. Frame-level similarity and degradation analysis
5. Unsupervised clustering of acoustic features
6. Feature–quality correlation analysis
7. Interactive visualization and CSV export

---
## Signal Processing Pipeline

### 1. Signal Alignment

Near-field and far-field signals are temporally aligned using **cross-correlation**.
This compensates for microphone delay and propagation differences and ensures that
corresponding speech events are compared frame by frame.
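The alignment step can be sketched in NumPy/SciPy roughly as follows. This is a minimal illustration, not the actual code from `app.py`; the function name and trimming strategy are assumptions.

```python
import numpy as np
from scipy.signal import correlate

def align_by_cross_correlation(near, far):
    """Estimate the lag of `far` relative to `near` via the
    cross-correlation peak, then trim both signals so that
    corresponding samples line up."""
    xcorr = correlate(far, near, mode="full")
    lag = int(np.argmax(xcorr)) - (len(near) - 1)
    if lag > 0:          # far-field recording starts later
        far = far[lag:]
    elif lag < 0:        # near-field recording starts later
        near = near[-lag:]
    n = min(len(near), len(far))
    return near[:n], far[:n], lag

# Synthetic check: a copy delayed by 50 samples should realign exactly.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
y = np.concatenate([np.zeros(50), x])
near_aligned, far_aligned, lag = align_by_cross_correlation(x, y)
```

For clean signals the correlation peak gives the delay directly; real far-field recordings may need band-limiting or normalization before correlation, which is omitted here.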
---
### 2. Frame Segmentation

The aligned signals are segmented into short-time overlapping frames using a
user-defined frame length and hop size. This enables localized analysis of
acoustic degradation instead of global averages.
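Overlapping segmentation of this kind can be done with a strided index trick in pure NumPy; the sketch below assumes a 1024-sample frame with a 512-sample hop, which may differ from the tool's defaults.

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Slice `x` into overlapping frames of `frame_len` samples,
    advancing by `hop` samples each time.
    Returns an array of shape (n_frames, frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

# 4096 samples with 50% overlap -> 7 frames of 1024 samples each.
frames = frame_signal(np.arange(4096, dtype=float), frame_len=1024, hop=512)
```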
---
### 3. Multi-Domain Feature Extraction

For each frame, features are extracted across multiple acoustic domains:

**Time-domain features**
- RMS energy (loudness and dynamic range)

**Frequency-domain features**
- Spectral centroid (brightness)
- Spectral flatness (tonal vs. noise-like)
- Zero-crossing rate (signal texture)

**Cepstral features**
- 13 MFCC coefficients representing the perceptual spectral envelope

**Band-wise spectral energy**
- Low-frequency band (≤ 2 kHz)
- Mid-frequency band (2–4 kHz)
- High-frequency band (> 4 kHz)

This multi-domain representation helps isolate attenuation, reverberation,
and noise effects caused by distance.
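Most of these per-frame features reduce to simple spectral sums. A NumPy-only sketch of the non-MFCC features is shown below (the actual tool would typically compute MFCCs with a library such as librosa; that part is omitted here):

```python
import numpy as np

def frame_features(frame, sr=16000):
    """Compute a subset of the per-frame features described above."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    power = spec ** 2
    rms = np.sqrt(np.mean(frame ** 2))                         # time domain
    centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-12)   # brightness
    flatness = (np.exp(np.mean(np.log(power + 1e-12)))
                / (np.mean(power) + 1e-12))                    # tonal vs. noisy
    zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)         # texture
    bands = {                                                  # band-wise energy
        "low":  power[freqs <= 2000].sum(),
        "mid":  power[(freqs > 2000) & (freqs <= 4000)].sum(),
        "high": power[freqs > 4000].sum(),
    }
    return rms, centroid, flatness, zcr, bands

# Sanity check with a pure 1 kHz tone: the centroid sits near 1 kHz,
# the energy lands in the low band, and flatness is near zero (tonal).
sr = 16000
t = np.arange(1024) / sr
rms, centroid, flatness, zcr, bands = frame_features(np.sin(2 * np.pi * 1000 * t), sr)
```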
---
### 4. Frame-wise Near–Far Comparison

Each near-field frame is compared with its aligned far-field frame using:

- Cosine similarity between feature vectors
- Spectral overlap between STFT magnitudes
- High-frequency energy loss (in dB)

These measures are combined into a **match quality score**, which indicates
how closely far-field speech resembles near-field speech at each frame.
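One way to combine the three measures is sketched below. The exact weighting and the mapping of dB loss to a score are assumptions for illustration; the tool's actual combination rule may differ.

```python
import numpy as np

def frame_match_quality(feat_near, feat_far, spec_near, spec_far,
                        freqs, hf_cut=4000.0):
    """Combine cosine similarity, spectral overlap, and high-frequency
    loss into one score in [0, 1] (equal weighting is an assumption)."""
    # 1. Cosine similarity between feature vectors.
    cos = np.dot(feat_near, feat_far) / (
        np.linalg.norm(feat_near) * np.linalg.norm(feat_far) + 1e-12)
    # 2. Spectral overlap between STFT magnitude frames.
    overlap = (np.minimum(spec_near, spec_far).sum()
               / (np.maximum(spec_near, spec_far).sum() + 1e-12))
    # 3. High-frequency energy loss in dB (positive = far field lost energy).
    hf = freqs >= hf_cut
    hf_loss_db = 10 * np.log10(
        (spec_near[hf] ** 2).sum() / ((spec_far[hf] ** 2).sum() + 1e-12) + 1e-12)
    # Map the loss to [0, 1]: 0 dB -> 1, >= 30 dB -> 0 (illustrative choice).
    loss_score = np.clip(1.0 - hf_loss_db / 30.0, 0.0, 1.0)
    return (cos + overlap + loss_score) / 3.0, cos, overlap, hf_loss_db

# Identical frames should score close to 1.0 with no high-frequency loss.
feat = np.array([1.0, 2.0, 3.0])
spec = np.ones(128)
freqs = np.linspace(0, 8000, 128)
score, cos, overlap, hf_loss = frame_match_quality(feat, feat, spec, spec, freqs)
```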
---
### 5. Clustering and Feature Correlation

Unsupervised clustering (K-Means or Agglomerative) is applied independently
to near-field and far-field features to explore separability.

In addition, correlation analysis is performed to study:

- How near-field and far-field features relate to each other
- Which features are most correlated with match quality

This helps identify features that are most sensitive to distance-induced degradation.
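The clustering and per-feature correlation steps follow the standard scikit-learn/NumPy pattern. The sketch below uses synthetic data standing in for per-frame feature matrices (rows = frames, columns = features); the shapes and the "degraded copy" model are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins for per-frame feature matrices.
rng = np.random.default_rng(42)
near_feats = rng.standard_normal((200, 4))
far_feats = near_feats * 0.6 + rng.standard_normal((200, 4)) * 0.2  # degraded copy

# Unsupervised clustering applied independently to each condition.
near_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(near_feats)
far_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(far_feats)

# Per-feature near/far correlation: which features survive distance best?
corr = np.array([np.corrcoef(near_feats[:, i], far_feats[:, i])[0, 1]
                 for i in range(near_feats.shape[1])])
```

Features whose near/far correlation stays high are insensitive to distance; features whose correlation collapses are the ones driving degradation.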
---
## Visualizations Provided

The Gradio interface includes the following visual outputs:

- Frame-wise similarity and degradation plots
- Spectral difference heatmaps
- Cluster scatter plots (near-field and far-field)
- Feature–quality overlay plots
- Feature correlation heatmaps
- Scatter matrices for inter-feature relationships

All results can be exported as CSV files for offline analysis.
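The CSV export is a straightforward pandas dump of the per-frame results; the column names below are illustrative, not the tool's actual schema.

```python
import pandas as pd

# Hypothetical per-frame results table (column names are assumptions).
results = pd.DataFrame({
    "frame": [0, 1, 2],
    "match_quality": [0.91, 0.74, 0.62],
    "hf_loss_db": [1.2, 4.8, 7.5],
})
csv_text = results.to_csv(index=False)  # or results.to_csv("results.csv", ...)
```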
---

## Key Observations from the Tool

Based on experiments conducted using the AMI Meeting Corpus:

- Far-field speech consistently loses low-frequency energy
- Mid-frequency bands often show reinforcement due to room reverberation
- High-frequency bands remain relatively stable but are noise-dominated
- MFCCs in far-field speech are compressed, indicating muffled spectral envelopes
- Temporal structure is largely preserved, but quality degrades
- Unsupervised clustering struggles due to overlapping feature distributions

These observations motivated the exploration of neural difference encoders
in later stages of the project.

---
## Limitations

- Processing time is high for long audio files in CPU-only environments
- Clustering does not reliably separate near-field and far-field speech
- Some visualizations require domain knowledge to interpret correctly

---
## Intended Use

This tool is intended for:

- Academic analysis of distant speech degradation
- Feature-level inspection before model design
- Supporting research in far-field ASR and speech enhancement

It is not intended to be used as a real-time or production-level system.

---
## Dataset

Experiments were conducted using the **AMI Meeting Corpus**, specifically its
synchronized near-field headset and far-field microphone recordings.

---
## Acknowledgements

This project was developed as part of the **Samsung PRISM Worklet Program**
at R. V. College of Engineering.

---
## License

This project is intended for academic and research use only.