Spaces: AdityaK007/MSD


AdityaK007 committed about 1 month ago

Commit a6b87be (verified) · 1 Parent(s): 85f67e6

Update README.md


Files changed (1)

README.md +180 -10

README.md CHANGED

@@ -1,13 +1,183 @@
 ---
-title: MSD
-emoji: 💻
-colorFrom: indigo
-colorTo: purple
-sdk: gradio
-sdk_version: 5.49.1
-app_file: app.py
-pinned: false
-short_description: let it work
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# Near–Far Field Speech Analysis Tool (Gradio)
+## Overview
+This repository contains an interactive Gradio tool, hosted as a Hugging Face Space, developed as part of the PRISM worklet
+**“Distant Acoustic Speech Investigations using Signal Processing and Neural Networks.”**
+The tool performs a **frame-wise comparative analysis of near-field and far-field speech recordings** to
+study how microphone distance affects speech quality, spectral structure, and perceptually relevant
+features. The focus of this project is analysis and interpretation, not speech enhancement.
+---
+## Motivation
+In real-world speech systems, microphones are often placed at a distance from the speaker.
+Compared to near-field recordings, far-field speech suffers from:
+- Frequency-dependent attenuation
+- Reverberation and temporal smearing
+- Increased background noise
+- Compression of cepstral and spectral features
+Most speech recognition and enhancement models do not explicitly expose how these effects appear
+at a frame level. This tool was built to **visually and quantitatively understand distant speech
+degradation** before attempting model-based solutions.
+---
+## What This Tool Does
+The tool takes two recordings of the **same speech content**:
+- A near-field (reference) audio file
+- A far-field (target) audio file
+It then performs the following steps:
+1. Temporal alignment of both signals
+2. Frame-wise segmentation with overlap
+3. Multi-domain feature extraction
+4. Frame-level similarity and degradation analysis
+5. Unsupervised clustering of acoustic features
+6. Feature–quality correlation analysis
+7. Interactive visualization and CSV export
+---
+## Signal Processing Pipeline
+### 1. Signal Alignment
+Near-field and far-field signals are temporally aligned using **cross-correlation**.
+This compensates for the time offset between the two microphones (acoustic propagation delay and recording offsets) and ensures that
+corresponding speech events are compared frame by frame.
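A minimal sketch of cross-correlation alignment, assuming both recordings are mono NumPy arrays at the same sample rate. The helpers `estimate_lag` and `align` are hypothetical names, not taken from the tool's source. `np.correlate` costs O(N·M), so for long recordings an FFT-based routine such as `scipy.signal.correlate(..., method="fft")` would be the more practical choice.

```python
import numpy as np


def estimate_lag(near: np.ndarray, far: np.ndarray) -> int:
    """Estimate how many samples `far` lags behind `near` (hypothetical helper).

    A positive result means speech events appear later in the far-field signal.
    """
    # In "full" mode, output index i corresponds to lag i - (len(near) - 1).
    corr = np.correlate(far, near, mode="full")
    return int(np.argmax(corr)) - (len(near) - 1)


def align(near: np.ndarray, far: np.ndarray):
    """Trim both signals so corresponding samples line up; returns (near, far, lag)."""
    lag = estimate_lag(near, far)
    if lag > 0:
        far = far[lag:]      # far-field starts late: drop its leading samples
    elif lag < 0:
        near = near[-lag:]   # near-field starts late: drop its leading samples
    n = min(len(near), len(far))
    return near[:n], far[:n], lag


# Example: the far-field signal is a delayed, attenuated copy of the near-field one
rng = np.random.default_rng(0)
near = rng.standard_normal(2000)
far = 0.5 * np.concatenate([np.zeros(37), near])[:2000]
near_a, far_a, lag = align(near, far)  # lag == 37
```

Taking the argmax of the raw correlation works well for broadband speech; real far-field recordings with strong reverberation may benefit from a normalized or phase-transform weighted correlation, which this sketch does not attempt.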
 ---
+### 2. Frame Segmentation
+The aligned signals are segmented into short-time overlapping frames using
+user-defined frame length and hop size. This enables localized analysis of
+acoustic degradation instead of global averages.
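One way to implement overlapping segmentation is to index the signal with a matrix of frame offsets. This sketch assumes the frame length and hop are given in samples and drops trailing samples that do not fill a whole frame; `frame_signal` is a hypothetical helper, and the tool itself may pad instead.

```python
import numpy as np


def frame_signal(x: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Split a 1-D signal into overlapping frames, shape (n_frames, frame_len).

    Hypothetical helper: trailing samples that do not fill a whole frame are dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    if len(x) < frame_len:
        return np.empty((0, frame_len), dtype=x.dtype)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row i holds the sample indices [i * hop, i * hop + frame_len).
    idx = hop * np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    return x[idx]


# Example: 10 samples, 4-sample frames, hop of 2 -> 50 % overlap
frames = frame_signal(np.arange(10), frame_len=4, hop=2)
# frames.shape == (4, 4); frames[1] is [2, 3, 4, 5]
```

For instance, 25 ms frames with a 10 ms hop at 16 kHz correspond to `frame_len=400` and `hop=160`; these values are illustrative, not the tool's defaults.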
 ---
+### 3. Multi-Domain Feature Extraction
+For each frame, features are extracted across multiple acoustic domains:
+**Time-domain features**
+- RMS energy (loudness and dynamic range)
+- Zero-crossing rate (signal texture)
+**Frequency-domain features**
+- Spectral centroid (brightness)
+- Spectral flatness (tonal vs. noise-like)
+**Cepstral features**
+- 13 MFCCs (mel-frequency cepstral coefficients) representing the perceptual spectral envelope
+**Band-wise spectral energy**
+- Low-frequency band (≤ 2 kHz)
+- Mid-frequency band (2–4 kHz)
+- High-frequency band (> 4 kHz)
+This multi-domain representation helps isolate attenuation, reverberation,
+and noise effects caused by distance.
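The per-frame features listed above can be approximated with plain NumPy, as sketched below. This is an illustration under assumptions rather than the tool's implementation: it uses a Hann-windowed FFT, computes spectral flatness on the power spectrum, and applies the band edges stated above (≤ 2 kHz, 2–4 kHz, > 4 kHz). MFCCs are omitted; one option would be `librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13, n_fft=frame_len, hop_length=hop)` on the aligned signal, although the README does not say which library the tool uses. `frame_features` is a hypothetical name.

```python
import numpy as np

EPS = 1e-12  # guards against log(0) and division by zero


def frame_features(frame: np.ndarray, sr: int) -> dict:
    """Per-frame features (sketch): RMS, ZCR, centroid, flatness, band energies in dB."""
    # Time domain
    rms = float(np.sqrt(np.mean(frame ** 2)))
    signs = np.signbit(frame)
    zcr = float(np.mean(signs[1:] != signs[:-1]))  # fraction of adjacent pairs that change sign

    # Frequency domain: magnitude spectrum of the Hann-windowed frame
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    power = mag ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    centroid = float(np.sum(freqs * mag) / (np.sum(mag) + EPS))
    # Flatness: geometric mean / arithmetic mean of the power spectrum
    # (about 0 for a pure tone, 1 for a perfectly flat spectrum)
    flatness = float(np.exp(np.mean(np.log(power + EPS))) / (np.mean(power) + EPS))

    feats = {"rms": rms, "zcr": zcr, "centroid_hz": centroid, "flatness": flatness}

    # Band-wise energy in dB, using the band edges listed in this README
    bands = {
        "low": freqs <= 2000,
        "mid": (freqs > 2000) & (freqs <= 4000),
        "high": freqs > 4000,
    }
    for name, mask in bands.items():
        feats[f"{name}_db"] = float(10 * np.log10(np.sum(power[mask]) + EPS))
    return feats


# Example: a 1 kHz tone sampled at 16 kHz, one 512-sample frame
sr = 16000
tone = np.sin(2 * np.pi * 1000 * np.arange(512) / sr)
tone_feats = frame_features(tone, sr)
```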
+---
+### 4. Frame-wise Near–Far Comparison
+Each near-field frame is compared with its aligned far-field frame using:
+- Cosine similarity between feature vectors
+- Spectral overlap between STFT magnitudes
+- High-frequency energy loss (in dB)
+These measures are combined into a **match quality score**, which indicates
+how closely far-field speech resembles near-field speech at each frame.
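The README does not define the three measures in detail or say how they are weighted, so the sketch below fills those gaps with stated assumptions: spectral overlap is taken as the ratio of summed element-wise minima to summed maxima of the two magnitude spectra, high-frequency loss uses the > 4 kHz band, and the match quality score is a hypothetical equal-weight average in which a loss of `max_loss_db` or more contributes zero. All function names are illustrative. Because RMS, centroid in Hz, and MFCCs live on very different scales, feature vectors would usually be standardized (for example z-scored per feature across frames) before the cosine similarity.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray, eps: float = 1e-12) -> float:
    """Cosine of the angle between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def spectral_overlap(mag_near: np.ndarray, mag_far: np.ndarray, eps: float = 1e-12) -> float:
    """Assumed definition: sum(min) / sum(max) of two magnitude spectra (1 = identical)."""
    num = np.sum(np.minimum(mag_near, mag_far))
    den = np.sum(np.maximum(mag_near, mag_far)) + eps
    return float(num / den)


def hf_energy_loss_db(mag_near, mag_far, freqs, cutoff_hz=4000.0, eps=1e-12) -> float:
    """Energy ratio near/far above `cutoff_hz`, in dB (positive = far-field lost energy)."""
    hf = freqs > cutoff_hz
    e_near = np.sum(mag_near[hf] ** 2)
    e_far = np.sum(mag_far[hf] ** 2)
    return float(10 * np.log10((e_near + eps) / (e_far + eps)))


def match_quality(cos_sim: float, overlap: float, hf_loss_db: float, max_loss_db: float = 30.0) -> float:
    """Hypothetical equal-weight combination; the tool's real weighting is not documented."""
    loss_term = 1.0 - float(np.clip(hf_loss_db / max_loss_db, 0.0, 1.0))
    return (cos_sim + overlap + loss_term) / 3.0


# Example: the far-field spectrum is 20 dB weaker above 4 kHz
freqs = np.linspace(0, 8000, 257)
mag_near = np.ones(257)
mag_far = mag_near.copy()
mag_far[freqs > 4000] *= 0.1
loss = hf_energy_loss_db(mag_near, mag_far, freqs)  # about 20 dB
```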
+---
+### 5. Clustering and Feature Correlation
+Unsupervised clustering (K-Means or agglomerative clustering) is applied independently
+to near-field and far-field features to explore how separable the frame-level feature distributions are.
+In addition, correlation analysis is performed to study:
+- How near-field and far-field features relate to each other
+- Which features are most correlated with match quality
+This helps identify features that are most sensitive to distance-induced degradation.
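The correlation part of this step reduces to Pearson coefficients between feature columns, as in the sketch below. It assumes features are stored as a frames × features array with one row per aligned frame; the helper names are hypothetical. For the clustering part, scikit-learn's `sklearn.cluster.KMeans` and `sklearn.cluster.AgglomerativeClustering` both provide `fit_predict`, though the README does not state which library the tool uses.

```python
import numpy as np


def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation; returns 0.0 if either input is constant (r is undefined)."""
    if np.std(a) == 0 or np.std(b) == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])


def near_far_correlation(near_feats: np.ndarray, far_feats: np.ndarray, names):
    """Correlation of each feature between aligned near-field and far-field frames."""
    return {name: pearson(near_feats[:, j], far_feats[:, j]) for j, name in enumerate(names)}


def rank_by_quality(feats: np.ndarray, quality: np.ndarray, names):
    """Features sorted by |r| with the per-frame match quality score, strongest first."""
    scores = [(name, pearson(feats[:, j], quality)) for j, name in enumerate(names)]
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Example with synthetic data: "a" and "c" track quality, "b" is unrelated noise
rng = np.random.default_rng(1)
quality = rng.standard_normal(200)
X = np.column_stack([
    2 * quality + 0.01 * rng.standard_normal(200),
    rng.standard_normal(200),
    -quality,
])
ranked = rank_by_quality(X, quality, ["a", "b", "c"])
```

Ranking by absolute correlation keeps features that drop as quality rises (negative r) alongside those that rise with it, which matters when looking for degradation-sensitive features.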
+---
+## Visualizations Provided
+The Gradio interface includes the following visual outputs:
+- Frame-wise similarity and degradation plots
+- Spectral difference heatmaps
+- Cluster scatter plots (near-field and far-field)
+- Feature–quality overlay plots
+- Feature correlation heatmaps
+- Scatter matrices for inter-feature relationships
+All results can be exported as CSV files for offline analysis.
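A spectral difference heatmap can be built from the level difference between the near-field and far-field STFT magnitudes, as sketched here. The frame size, hop, Hann window, and sign convention (positive values mean the far-field signal has less energy in that frame and bin) are assumptions, since the tool's own settings are not documented; `stft_mag` and `spectral_difference_db` are hypothetical names.

```python
import numpy as np


def stft_mag(x: np.ndarray, frame_len: int = 512, hop: int = 256) -> np.ndarray:
    """Hann-windowed magnitude STFT, shape (n_frames, frame_len // 2 + 1)."""
    if len(x) < frame_len:
        raise ValueError("signal shorter than one frame")
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = hop * np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    return np.abs(np.fft.rfft(x[idx] * np.hanning(frame_len), axis=1))


def spectral_difference_db(near, far, frame_len=512, hop=256, eps=1e-10):
    """Near minus far level in dB per (frame, bin); positive = far-field lost energy."""
    n = min(len(near), len(far))  # assumes the signals are already aligned
    s_near = stft_mag(near[:n], frame_len, hop)
    s_far = stft_mag(far[:n], frame_len, hop)
    return 20 * np.log10(s_near + eps) - 20 * np.log10(s_far + eps)


# Example: a far-field copy attenuated by half gives about +6 dB everywhere
rng = np.random.default_rng(2)
near = rng.standard_normal(4096)
D = spectral_difference_db(near, 0.5 * near)  # shape (15, 257)
```

The resulting array could be drawn with, for example, Matplotlib's `imshow(D.T, origin="lower", aspect="auto")` and written out with `np.savetxt("spectral_diff.csv", D, delimiter=",")`.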
+---
+## Key Observations from the Tool
+Based on experiments conducted using the AMI Meeting Corpus:
+- Far-field speech consistently loses low-frequency energy
+- Mid-frequency bands often show reinforcement due to room reverberation
+- High-frequency bands remain relatively stable but are noise-dominated
+- MFCCs in far-field speech are compressed, indicating muffled spectral envelopes
+- Temporal structure is largely preserved, but quality degrades
+- Unsupervised clustering struggles due to overlapping feature distributions
+These observations motivated the exploration of neural difference encoders
+in later stages of the project.
+---
+## Limitations
+- Processing time is high for long audio files in CPU-only environments
+- Clustering does not reliably separate near-field and far-field speech
+- Some visualizations require domain knowledge to interpret correctly
+---
+## Intended Use
+This tool is intended for:
+- Academic analysis of distant speech degradation
+- Feature-level inspection before model design
+- Supporting research in far-field ASR and speech enhancement
+It is not intended to be used as a real-time or production-level system.
+---
+## Dataset
+Experiments were conducted using the **AMI Meeting Corpus**, specifically
+synchronized near-field headset and far-field microphone recordings.
+---
+## Acknowledgements
+This project was developed as part of the **Samsung PRISM Worklet Program**
+at R. V. College of Engineering.
+---
+## License
+This project is intended for academic and research use only.