---
title: MSD
emoji: 💻
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: Frame-wise near-far field speech analysis
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Near–Far Field Speech Analysis Tool (Gradio)
## Overview
This repository contains an interactive Hugging Face Gradio tool developed as part of the PRISM worklet
**“Distant Acoustic Speech Investigations using Signal Processing and Neural Networks.”**
The tool performs a **frame-wise comparative analysis of near-field and far-field speech recordings** to
study how microphone distance affects speech quality, spectral structure, and perceptually relevant
features. The focus of this project is analysis and interpretation, not speech enhancement.
---
## Motivation
In real-world speech systems, microphones are often placed at a distance from the speaker.
Compared to near-field recordings, far-field speech suffers from:
- Frequency-dependent attenuation
- Reverberation and temporal smearing
- Increased background noise
- Compression of cepstral and spectral features
Most speech recognition and enhancement models do not explicitly analyze how these effects appear
at a frame level. This tool was built to **visually and quantitatively understand distant speech
degradation** before attempting model-based solutions.
---
## What This Tool Does
The tool takes two recordings of the **same speech content**:
- A near-field (reference) audio file
- A far-field (target) audio file
It then performs the following steps:
1. Temporal alignment of both signals
2. Frame-wise segmentation with overlap
3. Multi-domain feature extraction
4. Frame-level similarity and degradation analysis
5. Unsupervised clustering of acoustic features
6. Feature–quality correlation analysis
7. Interactive visualization and CSV export
---
## Signal Processing Pipeline
### 1. Signal Alignment
Near-field and far-field signals are temporally aligned using **cross-correlation**.
This compensates for microphone delay and propagation differences and ensures that
corresponding speech events are compared frame by frame.
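
A minimal sketch of this step, assuming mono `numpy` signals at a shared sample rate (the function name and trimming strategy are illustrative, not the app's exact code):

```python
import numpy as np
from scipy.signal import correlate

def align_by_cross_correlation(near: np.ndarray, far: np.ndarray):
    """Shift the far-field signal so it lines up with the near-field reference."""
    # Full cross-correlation; the peak index gives the relative delay in samples.
    corr = correlate(far, near, mode="full")
    lag = int(np.argmax(corr)) - (len(near) - 1)

    # Compensate the delay by trimming the leading samples of the later signal.
    if lag > 0:
        far = far[lag:]
    elif lag < 0:
        near = near[-lag:]

    # Truncate both to a common length so frames stay paired.
    n = min(len(near), len(far))
    return near[:n], far[:n]
```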
---
### 2. Frame Segmentation
The aligned signals are segmented into short-time overlapping frames using
user-defined frame length and hop size. This enables localized analysis of
acoustic degradation instead of global averages.
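
One way to implement the segmentation, assuming sample-domain frame and hop parameters (`librosa.util.frame` provides an equivalent strided view):

```python
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int = 1024, hop: int = 512) -> np.ndarray:
    """Split a 1-D signal into overlapping frames of shape (n_frames, frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]
```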
---
### 3. Multi-Domain Feature Extraction
For each frame, features are extracted across multiple acoustic domains:
**Time-domain features**
- RMS energy (loudness and dynamic range)
- Zero-crossing rate (signal texture)
**Frequency-domain features**
- Spectral centroid (brightness)
- Spectral flatness (tonal vs. noise-like character)
**Cepstral features**
- 13 MFCC coefficients representing the perceptual spectral envelope
**Band-wise spectral energy**
- Low-frequency band (≤ 2 kHz)
- Mid-frequency band (2–4 kHz)
- High-frequency band (> 4 kHz)
This multi-domain representation helps isolate attenuation, reverberation,
and noise effects caused by distance.
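
The sketch below shows one way to compute this feature set for a single frame; the dictionary keys and epsilon guards are illustrative choices, not the app's actual field names:

```python
import numpy as np
import librosa

def frame_features(frame: np.ndarray, sr: int) -> dict:
    """Extract the multi-domain feature set for one frame (a sketch)."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    power = spec ** 2

    feats = {
        "rms": float(np.sqrt(np.mean(frame ** 2))),
        "zcr": float(np.mean(np.abs(np.diff(np.sign(frame))) > 0)),
        "centroid": float(np.sum(freqs * spec) / (np.sum(spec) + 1e-10)),
        # Flatness: geometric mean over arithmetic mean of the power spectrum.
        "flatness": float(np.exp(np.mean(np.log(power + 1e-10))) / (np.mean(power) + 1e-10)),
        "low_energy": float(np.sum(power[freqs <= 2000])),
        "mid_energy": float(np.sum(power[(freqs > 2000) & (freqs <= 4000)])),
        "high_energy": float(np.sum(power[freqs > 4000])),
    }
    # 13 MFCCs summarizing the perceptual spectral envelope.
    mfcc = librosa.feature.mfcc(y=frame.astype(np.float32), sr=sr,
                                n_mfcc=13, n_fft=len(frame))
    for i, c in enumerate(mfcc.mean(axis=1)):
        feats[f"mfcc_{i}"] = float(c)
    return feats
```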
---
### 4. Frame-wise Near–Far Comparison
Each near-field frame is compared with its aligned far-field frame using:
- Cosine similarity between feature vectors
- Spectral overlap between STFT magnitudes
- High-frequency energy loss (in dB)
These measures are combined into a **match quality score**, which indicates
how closely far-field speech resembles near-field speech at each frame.
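
A sketch of how these measures might be combined; the 0.5/0.3/0.2 weighting is an illustrative assumption, since the actual weights are a design choice of the app:

```python
import numpy as np

def frame_match(near_feat: np.ndarray, far_feat: np.ndarray,
                near_spec: np.ndarray, far_spec: np.ndarray,
                freqs: np.ndarray) -> dict:
    """Combine per-frame similarity measures into a single match quality score."""
    # Cosine similarity between the two feature vectors.
    cos = float(np.dot(near_feat, far_feat) /
                (np.linalg.norm(near_feat) * np.linalg.norm(far_feat) + 1e-10))

    # Spectral overlap: shared magnitude relative to the near-field reference.
    overlap = float(np.sum(np.minimum(near_spec, far_spec)) /
                    (np.sum(near_spec) + 1e-10))

    # High-frequency (> 4 kHz) energy loss in dB; positive means far-field is weaker.
    hf = freqs > 4000
    hf_loss_db = float(10 * np.log10((np.sum(near_spec[hf] ** 2) + 1e-10) /
                                     (np.sum(far_spec[hf] ** 2) + 1e-10)))

    # Illustrative weighting of the three measures into one score.
    quality = 0.5 * cos + 0.3 * overlap + 0.2 * np.exp(-max(hf_loss_db, 0.0) / 20.0)
    return {"cosine": cos, "overlap": overlap,
            "hf_loss_db": hf_loss_db, "quality": quality}
```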
---
### 5. Clustering and Feature Correlation
Unsupervised clustering (K-Means or Agglomerative) is applied independently
to near-field and far-field features to explore separability.
In addition, correlation analysis is performed to study:
- How near-field and far-field features relate to each other
- Which features are most correlated with match quality
This helps identify features that are most sensitive to distance-induced degradation.
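
A compact sketch of both analyses using scikit-learn and pandas; `k = 3` and the column layout of the feature DataFrames are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_and_correlate(near_df: pd.DataFrame, far_df: pd.DataFrame,
                          quality: np.ndarray, k: int = 3):
    """Cluster each condition independently; rank features by |corr| with quality."""
    labels = {}
    for name, df in [("near", near_df), ("far", far_df)]:
        X = StandardScaler().fit_transform(df.values)
        labels[name] = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

    # Pearson correlation of each far-field feature with the match quality score.
    corr = far_df.apply(lambda col: np.corrcoef(col, quality)[0, 1])
    return labels, corr.sort_values(key=np.abs, ascending=False)
```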
---
## Visualizations Provided
The Gradio interface includes the following visual outputs:
- Frame-wise similarity and degradation plots
- Spectral difference heatmaps
- Cluster scatter plots (near-field and far-field)
- Feature–quality overlay plots
- Feature correlation heatmaps
- Scatter matrices for inter-feature relationships
All results can be exported as CSV files for offline analysis.
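
For example, the CSV export and the simplest of the plots could look like the sketch below; the `quality` column name and file path are assumed for illustration (Gradio's `gr.Plot` accepts a Matplotlib figure):

```python
import pandas as pd
import matplotlib.pyplot as plt

def export_results(frames_df: pd.DataFrame, path: str = "frame_analysis.csv") -> str:
    """Write per-frame metrics to CSV and return the path for a Gradio File output."""
    frames_df.to_csv(path, index_label="frame")
    return path

def plot_quality(frames_df: pd.DataFrame):
    """Frame-wise match quality curve."""
    fig, ax = plt.subplots(figsize=(8, 3))
    ax.plot(frames_df.index, frames_df["quality"], lw=1)
    ax.set(xlabel="Frame index", ylabel="Match quality",
           title="Near-far match quality per frame")
    fig.tight_layout()
    return fig
```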
---
## Key Observations from the Tool
Based on experiments conducted using the AMI Meeting Corpus:
- Far-field speech consistently loses low-frequency energy
- Mid-frequency bands often show reinforcement due to room reverberation
- High-frequency bands remain relatively stable but are noise-dominated
- MFCCs in far-field speech are compressed, indicating muffled spectral envelopes
- Temporal structure is largely preserved, even as overall match quality degrades
- Unsupervised clustering struggles due to overlapping feature distributions
These observations motivated the exploration of neural difference encoders
in later stages of the project.
---
## Limitations
- Processing time is high for long audio files in CPU-only environments
- Clustering does not reliably separate near-field and far-field speech
- Some visualizations require domain knowledge to interpret correctly
---
## Intended Use
This tool is intended for:
- Academic analysis of distant speech degradation
- Feature-level inspection before model design
- Supporting research in far-field ASR and speech enhancement
It is not intended for real-time or production use.
---
## Dataset
Experiments were conducted using the **AMI Meeting Corpus**, specifically
synchronized near-field headset and far-field microphone recordings.
---
## Acknowledgements
This project was developed as part of the **Samsung PRISM Worklet Program**
at R. V. College of Engineering.
---
## License
This project is intended for academic and research use only.