---
title: MSD
emoji: 💻
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: let it work
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Near–Far Field Speech Analysis Tool (Gradio)

## Overview

This repository contains an interactive analysis tool, built with Gradio and hosted as a Hugging Face Space, developed as part of the PRISM worklet  
**“Distant Acoustic Speech Investigations using Signal Processing and Neural Networks.”**

The tool performs a **frame-wise comparative analysis of near-field and far-field speech recordings** to
study how microphone distance affects speech quality, spectral structure, and perceptually relevant
features. The focus of this project is analysis and interpretation, not speech enhancement.

---

## Motivation

In real-world speech systems, microphones are often placed at a distance from the speaker.
Compared to near-field recordings, far-field speech suffers from:

- Frequency-dependent attenuation
- Reverberation and temporal smearing
- Increased background noise
- Compression of cepstral and spectral features

Most speech recognition and enhancement models do not explicitly analyze how these effects appear
at a frame level. This tool was built to **visually and quantitatively understand distant speech
degradation** before attempting model-based solutions.

---

## What This Tool Does

The tool takes two recordings of the **same speech content**:
- A near-field (reference) audio file
- A far-field (target) audio file

It then performs the following steps:

1. Temporal alignment of both signals
2. Frame-wise segmentation with overlap
3. Multi-domain feature extraction
4. Frame-level similarity and degradation analysis
5. Unsupervised clustering of acoustic features
6. Feature–quality correlation analysis
7. Interactive visualization and CSV export

---

## Signal Processing Pipeline

### 1. Signal Alignment

Near-field and far-field signals are temporally aligned using **cross-correlation**.
This compensates for microphone delay and propagation differences and ensures that
corresponding speech events are compared frame by frame.
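
As a minimal sketch of this step, the lag can be estimated with NumPy's `np.correlate` and both signals trimmed to the overlapping region. The function name and trimming strategy here are illustrative assumptions, not the tool's exact code:

```python
import numpy as np

def align_by_xcorr(near: np.ndarray, far: np.ndarray):
    """Estimate the lag of the far-field signal relative to the near-field
    signal via full cross-correlation, then trim both so samples line up."""
    corr = np.correlate(far, near, mode="full")
    # Shift the peak index so that lag 0 means "already aligned".
    lag = int(np.argmax(corr)) - (len(near) - 1)
    if lag > 0:        # far-field starts later: drop its leading samples
        far = far[lag:]
    elif lag < 0:      # near-field starts later
        near = near[-lag:]
    n = min(len(near), len(far))
    return near[:n], far[:n], lag
```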

---

### 2. Frame Segmentation

The aligned signals are segmented into short-time overlapping frames using
user-defined frame length and hop size. This enables localized analysis of
acoustic degradation instead of global averages.
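
The framing step can be sketched with a vectorized index matrix, assuming `frame_len` and `hop` are given in samples (the actual app may frame differently, e.g. with padding of the final partial frame):

```python
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Split a 1-D signal into overlapping frames of `frame_len` samples
    advanced by `hop` samples. Trailing samples that do not fill a
    complete frame are dropped."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]  # shape: (n_frames, frame_len)
```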

---

### 3. Multi-Domain Feature Extraction

For each frame, features are extracted across multiple acoustic domains:

**Time-domain features**
- RMS energy (loudness and dynamic range)

**Frequency-domain features**
- Spectral centroid (brightness)
- Spectral flatness (tonal vs noise-like)
- Zero-crossing rate (signal texture)

**Cepstral features**
- 13 MFCC coefficients representing the perceptual spectral envelope

**Band-wise spectral energy**
- Low-frequency band (≤ 2 kHz)
- Mid-frequency band (2–4 kHz)
- High-frequency band (> 4 kHz)

This multi-domain representation helps isolate attenuation, reverberation,
and noise effects caused by distance.
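
A plain-NumPy sketch of a per-frame extractor for the non-cepstral features above (band edges follow the list; MFCC extraction is omitted here and would come from a dedicated library such as librosa in the actual tool):

```python
import numpy as np

def frame_features(frame: np.ndarray, sr: int) -> dict:
    """Compute time-domain, frequency-domain, and band-energy features
    for a single frame sampled at `sr` Hz."""
    rms = np.sqrt(np.mean(frame ** 2))
    zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)

    mag = np.abs(np.fft.rfft(frame))
    power = mag ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)

    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
    # Flatness: geometric mean over arithmetic mean of the magnitude spectrum.
    flatness = np.exp(np.mean(np.log(mag + 1e-12))) / (np.mean(mag) + 1e-12)

    return {
        "rms": rms, "zcr": zcr, "centroid": centroid, "flatness": flatness,
        "low": power[freqs <= 2000].sum(),                      # <= 2 kHz
        "mid": power[(freqs > 2000) & (freqs <= 4000)].sum(),   # 2-4 kHz
        "high": power[freqs > 4000].sum(),                      # > 4 kHz
    }
```

For a pure 1 kHz tone, for instance, the centroid sits near 1 kHz, the low band dominates, and flatness is close to zero, as expected for a tonal frame.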

---

### 4. Frame-wise Near–Far Comparison

Each near-field frame is compared with its aligned far-field frame using:

- Cosine similarity between feature vectors
- Spectral overlap between STFT magnitudes
- High-frequency energy loss (in dB)

These measures are combined into a **match quality score**, which indicates
how closely far-field speech resembles near-field speech at each frame.
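
The combination step might look like the following sketch. The equal weighting of the two similarity terms and the definition of the high-frequency band as the top quarter of the spectrum are assumptions for illustration, not the tool's exact formula:

```python
import numpy as np

def frame_match(near_feat, far_feat, near_mag, far_mag):
    """Compare one aligned frame pair: cosine similarity of feature
    vectors, spectral overlap of STFT magnitudes, and high-frequency
    energy loss in dB (positive when the far-field frame lost energy)."""
    cos = np.dot(near_feat, far_feat) / (
        np.linalg.norm(near_feat) * np.linalg.norm(far_feat) + 1e-12)
    overlap = np.sum(np.minimum(near_mag, far_mag)) / (
        np.sum(np.maximum(near_mag, far_mag)) + 1e-12)
    n_hf = len(near_mag) // 4  # assumed "high-frequency" bin count
    hf_loss_db = 10 * np.log10(
        (near_mag[-n_hf:] ** 2).sum() / ((far_mag[-n_hf:] ** 2).sum() + 1e-12))
    score = 0.5 * cos + 0.5 * overlap  # hf_loss_db reported alongside
    return score, cos, overlap, hf_loss_db
```

Identical frames score 1.0 with 0 dB high-frequency loss; degraded far-field frames pull the score toward 0.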

---

### 5. Clustering and Feature Correlation

Unsupervised clustering (K-Means or Agglomerative) is applied independently
to near-field and far-field features to explore separability.

In addition, correlation analysis is performed to study:
- How near-field and far-field features relate to each other
- Which features are most correlated with match quality

This helps identify features that are most sensitive to distance-induced degradation.
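
The feature–quality correlation can be computed column-wise with Pearson correlation. This sketch assumes a feature matrix of shape `(n_frames, n_features)` and a per-frame `quality` vector; the clustering itself would typically use `sklearn.cluster.KMeans` or `AgglomerativeClustering` and is not shown:

```python
import numpy as np

def quality_correlations(feature_matrix: np.ndarray,
                         quality: np.ndarray) -> np.ndarray:
    """Pearson correlation of each feature column with the match-quality
    score, used to rank features by sensitivity to degradation."""
    fm = feature_matrix - feature_matrix.mean(axis=0)
    q = quality - quality.mean()
    num = fm.T @ q
    den = np.linalg.norm(fm, axis=0) * np.linalg.norm(q) + 1e-12
    return num / den  # one coefficient in [-1, 1] per feature
```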

---

## Visualizations Provided

The Gradio interface includes the following visual outputs:

- Frame-wise similarity and degradation plots
- Spectral difference heatmaps
- Cluster scatter plots (near-field and far-field)
- Feature–quality overlay plots
- Feature correlation heatmaps
- Scatter matrices for inter-feature relationships

All results can be exported as CSV files for offline analysis.

---

## Key Observations from the Tool

Based on experiments conducted using the AMI Meeting Corpus:

- Far-field speech consistently loses low-frequency energy
- Mid-frequency bands often show reinforcement due to room reverberation
- High-frequency bands remain relatively stable but are noise-dominated
- MFCCs in far-field speech are compressed, indicating muffled spectral envelopes
- Temporal structure is largely preserved, but quality degrades
- Unsupervised clustering struggles due to overlapping feature distributions

These observations motivated the exploration of neural difference encoders
in later stages of the project.

---

## Limitations

- Processing time is high for long audio files in CPU-only environments
- Clustering does not reliably separate near-field and far-field speech
- Some visualizations require domain knowledge to interpret correctly

---

## Intended Use

This tool is intended for:
- Academic analysis of distant speech degradation
- Feature-level inspection before model design
- Supporting research in far-field ASR and speech enhancement

It is not intended for use as a real-time or production system.

---

## Dataset

Experiments were conducted using the **AMI Meeting Corpus**, specifically
synchronized near-field headset and far-field microphone recordings.

---

## Acknowledgements

This project was developed as part of the **Samsung PRISM Worklet Program**
at R. V. College of Engineering.

---

## License

This project is intended for academic and research use only.