AdityaK007 committed · Commit a6b87be · verified · Parent(s): 85f67e6

Update README.md

Files changed (1): README.md (+180 −10)
---
title: MSD
emoji: 💻
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: let it work
---
# Near–Far Field Speech Analysis Tool (Gradio)

## Overview

This repository contains an interactive Hugging Face Gradio tool developed as part of the PRISM worklet
**“Distant Acoustic Speech Investigations using Signal Processing and Neural Networks.”**

The tool performs a **frame-wise comparative analysis of near-field and far-field speech recordings** to
study how microphone distance affects speech quality, spectral structure, and perceptually relevant
features. The focus of this project is analysis and interpretation, not speech enhancement.

---

## Motivation

In real-world speech systems, microphones are often placed at a distance from the speaker.
Compared to near-field recordings, far-field speech suffers from:

- Frequency-dependent attenuation
- Reverberation and temporal smearing
- Increased background noise
- Compression of cepstral and spectral features

Most speech recognition and enhancement models do not explicitly analyze how these effects appear
at the frame level. This tool was built to **visually and quantitatively understand distant-speech
degradation** before attempting model-based solutions.

---

## What This Tool Does

The tool takes two recordings of the **same speech content**:

- A near-field (reference) audio file
- A far-field (target) audio file

It then performs the following steps:

1. Temporal alignment of both signals
2. Frame-wise segmentation with overlap
3. Multi-domain feature extraction
4. Frame-level similarity and degradation analysis
5. Unsupervised clustering of acoustic features
6. Feature–quality correlation analysis
7. Interactive visualization and CSV export

---

## Signal Processing Pipeline

### 1. Signal Alignment

Near-field and far-field signals are temporally aligned using **cross-correlation**.
This compensates for microphone delay and propagation differences and ensures that
corresponding speech events are compared frame by frame.
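A minimal NumPy sketch of this alignment step (the function name and the synthetic signals below are illustrative, not taken from the app's code):

```python
import numpy as np

def align_by_xcorr(near, far):
    """Estimate the delay of the far-field signal via full cross-correlation,
    then trim both signals so corresponding samples line up."""
    corr = np.correlate(far, near, mode="full")
    lag = int(np.argmax(corr)) - (len(near) - 1)  # lag > 0: far starts late
    if lag > 0:
        far = far[lag:]
    elif lag < 0:
        near = near[-lag:]
    n = min(len(near), len(far))
    return near[:n], far[:n], lag

# Synthetic check: a delayed, attenuated copy of white noise.
rng = np.random.default_rng(0)
near_sig = rng.standard_normal(16000)
far_sig = np.concatenate([np.zeros(240), 0.5 * near_sig])
near_a, far_a, lag = align_by_xcorr(near_sig, far_sig)  # lag == 240
```

Alignment runs before framing so that each near-field frame has a far-field counterpart covering the same speech event.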
---

### 2. Frame Segmentation

The aligned signals are segmented into short-time overlapping frames using a
user-defined frame length and hop size. This enables localized analysis of
acoustic degradation instead of global averages.
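Segmentation with a user-defined frame length and hop can be sketched with an index matrix in plain NumPy (names are illustrative; a library helper such as `librosa.util.frame` would also work):

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames: row i covers samples
    [i * hop, i * hop + frame_len). Trailing samples that do not fill
    a whole frame are dropped."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = hop * np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    return x[idx]

# 1 s of audio at 16 kHz, 25 ms frames with a 10 ms hop -> 98 frames of 400 samples.
frames = frame_signal(np.arange(16000, dtype=float), frame_len=400, hop=160)
```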
 
 
---

### 3. Multi-Domain Feature Extraction

For each frame, features are extracted across multiple acoustic domains:

**Time-domain features**
- RMS energy (loudness and dynamic range)

**Frequency-domain features**
- Spectral centroid (brightness)
- Spectral flatness (tonal vs. noise-like character)
- Zero-crossing rate (signal texture)

**Cepstral features**
- 13 MFCCs representing the perceptual spectral envelope

**Band-wise spectral energy**
- Low-frequency band (≤ 2 kHz)
- Mid-frequency band (2–4 kHz)
- High-frequency band (> 4 kHz)

This multi-domain representation helps isolate the attenuation, reverberation,
and noise effects caused by distance.
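Most of these descriptors can be computed directly from a frame's magnitude spectrum. The numpy-only sketch below covers a few of them; the app presumably relies on a library such as librosa, and MFCCs are omitted here since they require a mel filterbank:

```python
import numpy as np

def frame_features(frame, sr=16000):
    """Compute a handful of per-frame descriptors from the windowed magnitude spectrum."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) + 1e-12
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return {
        "rms": float(np.sqrt(np.mean(frame ** 2))),                      # loudness
        "zcr": float(np.mean(np.abs(np.diff(np.sign(frame))) > 0)),      # texture
        "centroid": float(np.sum(freqs * spec) / np.sum(spec)),          # brightness
        "flatness": float(np.exp(np.mean(np.log(spec))) / np.mean(spec)),# tonal vs noisy
        "low_energy": float(np.sum(spec[freqs <= 2000] ** 2)),
        "mid_energy": float(np.sum(spec[(freqs > 2000) & (freqs <= 4000)] ** 2)),
        "high_energy": float(np.sum(spec[freqs > 4000] ** 2)),
    }

# A pure 1 kHz tone should be tonal (low flatness), with its energy in the low band.
sr = 16000
t = np.arange(400) / sr
f = frame_features(np.sin(2 * np.pi * 1000 * t), sr)
```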
---

### 4. Frame-wise Near–Far Comparison

Each near-field frame is compared with its aligned far-field frame using:

- Cosine similarity between feature vectors
- Spectral overlap between STFT magnitudes
- High-frequency energy loss (in dB)

These measures are combined into a **match quality score**, which indicates
how closely far-field speech resembles near-field speech at each frame.
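The three measures can be sketched as follows; the equal weighting in the match score is an illustrative assumption, not the tool's documented formula:

```python
import numpy as np

def compare_frame(near_feat, far_feat, near_spec, far_spec, freqs):
    """Per-frame comparison: feature cosine similarity, magnitude-spectrum
    overlap, and high-frequency energy loss in dB."""
    eps = 1e-12
    cos = float(np.dot(near_feat, far_feat) /
                (np.linalg.norm(near_feat) * np.linalg.norm(far_feat) + eps))
    overlap = float(np.sum(np.minimum(near_spec, far_spec)) /
                    (np.sum(np.maximum(near_spec, far_spec)) + eps))
    hf = freqs > 4000
    hf_loss_db = float(10 * np.log10((np.sum(near_spec[hf] ** 2) + eps) /
                                     (np.sum(far_spec[hf] ** 2) + eps)))
    match = 0.5 * cos + 0.5 * overlap  # assumed weighting, for illustration only
    return cos, overlap, hf_loss_db, match

# Far frame = near frame at half magnitude: identical spectral shape, ~6 dB HF loss.
freqs = np.fft.rfftfreq(400, d=1 / 16000)
near_spec = np.ones_like(freqs)
feat = np.array([1.0, 2.0, 3.0])
cos, ov, loss, match = compare_frame(feat, feat, near_spec, 0.5 * near_spec, freqs)
```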
---

### 5. Clustering and Feature Correlation

Unsupervised clustering (K-Means or agglomerative) is applied independently
to near-field and far-field features to explore their separability.

In addition, correlation analysis is performed to study:

- How near-field and far-field features relate to each other
- Which features are most correlated with match quality

This helps identify the features that are most sensitive to distance-induced degradation.
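The feature–quality correlation step can be sketched on synthetic per-frame data (the quality model below is invented for illustration; the clustering itself, typically done with scikit-learn, is not shown):

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames = 500

# Invented per-frame table: match quality drops as high-frequency loss grows,
# while RMS is unrelated to quality by construction.
hf_loss_db = rng.uniform(0.0, 12.0, n_frames)
match_quality = 1.0 - 0.06 * hf_loss_db + 0.02 * rng.standard_normal(n_frames)
rms = rng.uniform(0.05, 0.3, n_frames)

corr_loss = float(np.corrcoef(hf_loss_db, match_quality)[0, 1])
corr_rms = float(np.corrcoef(rms, match_quality)[0, 1])
# A strongly negative corr_loss flags HF loss as quality-sensitive;
# corr_rms stays near zero, flagging RMS as insensitive.
```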
---

## Visualizations Provided

The Gradio interface includes the following visual outputs:

- Frame-wise similarity and degradation plots
- Spectral-difference heatmaps
- Cluster scatter plots (near-field and far-field)
- Feature–quality overlay plots
- Feature correlation heatmaps
- Scatter matrices for inter-feature relationships

All results can be exported as CSV files for offline analysis.
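The CSV export step can be sketched with the standard library (column names are illustrative; the app's actual layout may differ):

```python
import csv
import io

def export_frame_metrics(rows, fieldnames):
    """Serialize per-frame metric dicts to CSV text, one row per frame."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = [
    {"frame": 0, "match_quality": 0.91, "hf_loss_db": 4.2},
    {"frame": 1, "match_quality": 0.87, "hf_loss_db": 5.0},
]
csv_text = export_frame_metrics(rows, ["frame", "match_quality", "hf_loss_db"])
```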
---

## Key Observations from the Tool

Based on experiments conducted using the AMI Meeting Corpus:

- Far-field speech consistently loses low-frequency energy
- Mid-frequency bands often show reinforcement due to room reverberation
- High-frequency bands remain relatively stable but are noise-dominated
- MFCCs in far-field speech are compressed, indicating muffled spectral envelopes
- Temporal structure is largely preserved, but quality degrades
- Unsupervised clustering struggles due to overlapping feature distributions

These observations motivated the exploration of neural difference encoders
in later stages of the project.

---

## Limitations

- Processing time is high for long audio files in CPU-only environments
- Clustering does not reliably separate near-field and far-field speech
- Some visualizations require domain knowledge to interpret correctly

---

## Intended Use

This tool is intended for:

- Academic analysis of distant-speech degradation
- Feature-level inspection before model design
- Supporting research in far-field ASR and speech enhancement

It is not intended for use as a real-time or production-grade system.

---

## Dataset

Experiments were conducted using the **AMI Meeting Corpus**, specifically its
synchronized near-field headset and far-field microphone recordings.

---

## Acknowledgements

This project was developed as part of the **Samsung PRISM Worklet Program**
at R. V. College of Engineering.

---

## License

This project is intended for academic and research use only.