juice500 committed
Commit d74d90c · 1 Parent(s): ea7c90b

Initial commit

README.md CHANGED
@@ -1,13 +1,64 @@
 ---
 title: Orthogonal Subspace
-emoji: 📉
-colorFrom: blue
-colorTo: red
+emoji: 🚀
+colorFrom: pink
+colorTo: pink
 sdk: gradio
-sdk_version: 6.11.0
+sdk_version: 6.10.0
 app_file: app.py
 pinned: false
 license: mit
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# Phonological representation demo based on orthogonal subspaces
+
+Interactive demo for [**Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces**](https://arxiv.org/abs/2603.12642).
+
+You can load an audio file, pick a time span and a learned phonological vector within the WavLM representation, and hear how adding that vector changes the resynthesized audio, alongside before-and-after spectrograms.
+
+| Resource | Link |
+|----------|------|
+| Full codebase | [github.com/juice500ml/phonetic-arithmetic](https://github.com/juice500ml/phonetic-arithmetic) |
+| Example audio / alignments | [LDC93S1](https://catalog.ldc.upenn.edu/LDC93S1W) (TIMIT single-utterance sample from LDC) |
+
+## Phonological vectors
+
+The UI exposes three vector families (for TIMIT and VoxAngeles):
+
+| Preset | Idea |
+|--------|------|
+| **Original** | Directions from the paper’s setup. |
+| **Unconstrained** | Center pooling only; no separate consonant/vowel subspaces. |
+| **Extended** | Unconstrained pooling, with positive and negative poles modeled as separate vectors. |
+
+## Run locally
+
+From this directory (`demos/orthogonal-subspace`):
+
+```bash
+pip install -r requirements.txt
+GRADIO_TEMP_DIR=$PWD/.gradio_tmp python app.py
+```
+
+Gradio will start a local URL; paths assume the working directory is the folder that contains `examples/` and `app.py`.
+
+## Reproducing phonological vectors
+
+Run from the **repository root** (`phonetic-arithmetic`), after you have the feature pickles and `dump_vectors.py` wired to your data. Replace `timit` with `voxangeles` if you want the other corpus.
+
+## Code for calculating contextual phonological vectors
+```bash
+dataset=timit # or voxangeles
+python3 dump_vectors.py \
+  --feat-path feats/timit-wavlm-large-24-center-featslice.pkl \
+  --output-path demos/orthogonal-subspace/examples/original-${dataset}.pkl \
+  --vector-type original --vector ctx
+python3 dump_vectors.py \
+  --feat-path feats/timit-wavlm-large-24-center-featslice.pkl \
+  --output-path demos/orthogonal-subspace/examples/unconstrained-${dataset}.pkl \
+  --vector-type full --vector ctx
+python3 dump_vectors.py \
+  --feat-path feats/timit-wavlm-large-24-center-featslice.pkl \
+  --output-path demos/orthogonal-subspace/examples/extended-${dataset}.pkl \
+  --vector-type extended --vector ctx
+```
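The vector files written by the commands in the README are read back by `_read_pkl` in `app.py`, which appears to expect a pickle holding a `"vectors"` dictionary keyed as `"<feature> (<position>)"`. A minimal sketch of that regrouping on synthetic data follows; the feature names `voiced` and `nasal`, the 4-dimensional vectors, and the helper name `group_by_feature` are illustrative assumptions, not part of the repository.

```python
import numpy as np

POSITIONS = ["-4", "-3", "-2", "-1", "0", "+1", "+2", "+3", "+4"]


def group_by_feature(vectors):
    """Regroup flat '<feature> (<position>)' keys into {feature: {position: vector}}."""
    # A feature is listed only if its centre-position vector "(0)" exists.
    feats = [key.split()[0] for key in vectors if "(0)" in key]
    return {
        feat: {loc: vectors.get(f"{feat} ({loc})") for loc in POSITIONS}
        for feat in feats
    }


# Synthetic stand-in for pickle.load(f)["vectors"]; real files hold model-sized vectors.
rng = np.random.default_rng(0)
flat = {
    "voiced (-1)": rng.normal(size=4),
    "voiced (0)": rng.normal(size=4),
    "voiced (+1)": rng.normal(size=4),
    "nasal (0)": rng.normal(size=4),
}

grouped = group_by_feature(flat)
print(sorted(grouped))                 # feature names found in the file
print(grouped["nasal"]["+1"] is None)  # positions without a vector come back as None
```

Positions that are absent from the file map to `None`, which is why the plotting code has to skip missing entries before stacking the vectors.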
app.py ADDED
@@ -0,0 +1,265 @@
+import pickle
+from pathlib import Path
+
+import librosa
+import numpy as np
+import gradio as gr
+
+import matplotlib
+matplotlib.use("Agg")
+
+import matplotlib.pyplot as plt
+from specplotter import SpecPlotter
+
+from transformers import Wav2Vec2FeatureExtractor, AutoModel
+import torch
+
+
+def cos_sim(XA, XB):
+    XA_norm = XA / np.linalg.norm(XA, axis=1, keepdims=True)
+    XB_norm = XB / np.linalg.norm(XB, axis=1, keepdims=True)
+    return (XA_norm @ XB_norm.T)
+
+
+def _read_pkl(path):
+    with open(path, "rb") as f:
+        vectors = pickle.load(f)["vectors"]
+    feats = [key.split()[0] for key in vectors.keys() if "(0)" in key]
+    return {
+        feat: {
+            loc: vectors.get(f"{feat} ({loc})")
+            for loc in ["-4", "-3", "-2", "-1", "0", "+1", "+2", "+3", "+4"]
+        }
+        for feat in feats
+    }
+
+def _read_alignment(fname):
+    data = []
+    with open(fname, "r") as f:
+        for line in f:
+            start, end, text = line.strip().split()
+            data.append({
+                "start": int(start),
+                "end": int(end),
+                "text": text,
+            })
+    return data
+
+print("Loading model...")
+processor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-large")
+ssl = AutoModel.from_pretrained("microsoft/wavlm-large")
+print("Model loaded!")
+
+print("Loading vectors...")
+PHON_VECTORS = {
+    "TIMIT (original)": _read_pkl("examples/original-timit.pkl"),
+    "TIMIT (unconstrained)": _read_pkl("examples/unconstrained-timit.pkl"),
+    "TIMIT (extended)": _read_pkl("examples/extended-timit.pkl"),
+    "VoxAngeles (original)": _read_pkl("examples/original-voxangeles.pkl"),
+    "VoxAngeles (unconstrained)": _read_pkl("examples/unconstrained-voxangeles.pkl"),
+    "VoxAngeles (extended)": _read_pkl("examples/extended-voxangeles.pkl"),
+}
+DEFAULT_KEY = next(iter(PHON_VECTORS.keys()))
+print("Vectors loaded!")
+
+EXAMPLE_AUDIO = Path("examples/LDC93S1.wav")
+EXAMPLE_PHN = _read_alignment("examples/LDC93S1.phn")
+with open("examples/LDC93S1.pkl", "rb") as f:
+    EXAMPLE_FEATS = pickle.load(f)
+
+
+def run_orthogonal_subspace(path, vector_type, features, context_size, similarity_range):
+    audio, _ = librosa.load(path, sr=16000, mono=True)
+    if Path(path).name == EXAMPLE_AUDIO.name:
+        feats = EXAMPLE_FEATS
+        alignments = EXAMPLE_PHN
+    else:
+        inputs = processor(
+            raw_speech=[audio],
+            sampling_rate=16000,
+            padding=False,
+            return_tensors="pt",
+        )
+        out = ssl(**inputs)
+        feats = out.last_hidden_state[0].detach().numpy()
+        alignments = []
+
+    keys, vectors = [], []
+    for f in features:
+        for i in ["-4", "-3", "-2", "-1", "0", "+1", "+2", "+3", "+4"][4-context_size:5+context_size]:
+            if (PHON_VECTORS[vector_type][f] is not None) and (PHON_VECTORS[vector_type][f][i] is not None):
+                keys.append(f"{f} ({i})")
+                vectors.append(PHON_VECTORS[vector_type][f][i])
+    vectors = np.stack(vectors)
+    sims = cos_sim(vectors, feats)
+
+    fig, ax = plt.subplots(1, figsize=(10, 2 + len(keys) // 5), constrained_layout=True)
+    ax.axis("off")
+
+    gs = fig.add_gridspec(
+        nrows=1 + len(keys), ncols=1,
+        height_ratios=[3] + [0.2] * len(keys)  # spectrogram taller than heatmaps
+    )
+
+    # Spectrogram plotting
+    ax_spec = fig.add_subplot(gs[0, 0])
+
+    sp = SpecPlotter()
+    sp.plot_spectrogram(audio, ax=ax_spec, show_annotation=False)
+    ax_spec.get_xaxis().set_visible(False)
+
+    for row in alignments:
+        start, end, label = row["start"] / 16000, row["end"] / 16000, row["text"]
+
+        ax_spec.axvline(start, color="black", linestyle="-", alpha=0.7)
+        ax_spec.axvline(end, color="black", linestyle="-", alpha=0.7)
+        ax_spec.add_patch(
+            plt.Rectangle(
+                (start, 7),
+                end - start,
+                1,
+                color="black",
+                alpha=0.4,
+                clip_on=False
+            )
+        )
+        ax_spec.text(
+            (start + end) / 2,
+            7.5,
+            label,
+            ha="center",
+            va="center",
+            color="white",
+            fontsize=9
+        )
+
+    x0, x1 = ax_spec.get_xlim()
+    ims = []
+    axes_hm = []
+    for i, (hm, lab) in enumerate(zip(sims, keys), start=1):
+        ax = fig.add_subplot(gs[i, 0], sharex=ax_spec)
+        axes_hm.append(ax)
+
+        hm = np.asarray(hm)
+        if hm.ndim == 1:
+            hm = hm[None, :]  # make it (1, T) so it looks like a single-row heatmap
+
+        # Use extent so the heatmap x-axis is in seconds (aligned with spectrogram)
+        im = ax.imshow(
+            hm,
+            origin="lower",
+            aspect="auto",
+            interpolation="nearest",
+            extent=[x0, x1, 0, 1],
+            vmin=-similarity_range,
+            vmax=+similarity_range,
+            cmap=plt.cm.PuOr,
+        )
+        ims.append(im)
+
+        for row in alignments:
+            start, end, label = row["start"] / 16000, row["end"] / 16000, row["text"]
+            ax.axvline(start, color="black", linestyle="-", alpha=0.7)
+            ax.axvline(end, color="black", linestyle="-", alpha=0.7)
+
+        ax.set_yticks([])
+        ax.tick_params(axis='x', length=0)
+
+        feat, loc = lab.split()
+        if loc == "(0)":
+            if context_size == 0:
+                label = f"[+{feat}]"
+            else:
+                label = f"[+{feat}] 0"
+        else:
+            label = loc[1:-1]
+        ax.set_ylabel(label, rotation=0, ha="right", va="center", fontweight="bold" if loc == "(0)" else "normal")
+        ax.yaxis.set_label_coords(-0.02, 0.5)
+
+        ax.spines["top"].set_visible(False)
+        ax.spines["right"].set_visible(False)
+        ax.spines["left"].set_visible(False)
+
+    # Only show x tick labels on the bottom-most axis
+    plt.setp(ax_spec.get_xticklabels(), visible=False)
+    for ax in axes_hm[:-1]:
+        plt.setp(ax.get_xticklabels(), visible=False)
+    axes_hm[-1].set_xlabel("Time [s]")
+
+    ax_spec.set_xlim(x0, x1)
+
+    cbar = fig.colorbar(ims[-1], ax=axes_hm, pad=0.01, fraction=0.03)
+    cbar.set_label("Cosine similarity")
+
+    return fig
+
+
+with gr.Blocks(title="Orthogonal Subspace Demo") as demo:
+    with gr.Row():
+        gr.Markdown("""
+        ## 🎙️ Orthogonal Subspace Demo
+
+        Demonstration for the paper [Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces](https://arxiv.org/abs/2603.12642).
+        This demo reproduces Figure 10: cosine similarity between frame-level S3M representations and position-dependent phonological vectors over time, illustrating how each relative phone position occupies a distinct orthogonal subspace.
+
+        Upload, record, or use the example audio, configure the parameters, and click **Run**.
+        """)
+
+    with gr.Row():
+        with gr.Column(scale=1):
+            audio = gr.Audio(
+                label="Input Audio",
+                type="filepath",
+                sources=["upload", "microphone"],
+                recording=True,
+                value=str(EXAMPLE_AUDIO),
+            )
+            gr.Markdown("""
+            ### Parameters
+            - **Vector extraction method**: How phonological vectors are estimated from S3M representations. The options differ in the dataset used and in how the vectors are calculated.
+            - **Phonological features**: Which phonological features to include in the plot. Deselect features to reduce clutter or isolate a single dimension of contrast.
+            - **Context size**: Number of relative phone positions. 0 = vectors from the current phone only; k = vectors from relative positions −k through +k. Larger values reveal how far phonological features extend beyond the current (or immediately adjacent) phones.
+            - **Cosine similarity range**: Symmetric limit of the cosine-similarity color scale (default ±0.4). Adjust to zoom in on fine-grained differences or to accommodate low-similarity outputs.
+            """)
+
+        with gr.Column(scale=1):
+            vector_dropdown = gr.Dropdown(
+                label="Vector extraction method",
+                choices=list(PHON_VECTORS.keys()),
+                value=DEFAULT_KEY,
+                interactive=True,
+            )
+            feature_checkbox = gr.CheckboxGroup(
+                choices=list(PHON_VECTORS[DEFAULT_KEY].keys()),
+                value=list(PHON_VECTORS[DEFAULT_KEY].keys()),
+                label="Phonological features",
+                show_select_all=True,
+                interactive=True,
+            )
+            context_size_slider = gr.Slider(label="Context size", value=2, minimum=0, maximum=4, step=1, interactive=True)
+            similarity_slider = gr.Slider(label="Cosine similarity range", value=0.4, minimum=0.1, maximum=1.0, step=0.01, interactive=True)
+            run_btn = gr.Button("▶ Run", variant="primary", scale=1)
+
+    with gr.Row():
+        plot = gr.Plot(
+            label="Output Spectrogram and Phonological Representations",
+            show_label=False,
+        )
+
+    # Connectors
+    vector_dropdown.change(
+        fn=lambda key: gr.CheckboxGroup(
+            choices=list(PHON_VECTORS[key].keys()),
+            value=list(PHON_VECTORS[key].keys()),
+        ),
+        inputs=vector_dropdown,
+        outputs=feature_checkbox,
+    )
+    run_btn.click(
+        fn=run_orthogonal_subspace,
+        inputs=[audio, vector_dropdown, feature_checkbox, context_size_slider, similarity_slider],
+        outputs=plot,
+    )
+
+if __name__ == "__main__":
+    demo.launch()
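The core computation in `run_orthogonal_subspace` is a row-normalized matrix product between the selected vectors and the frame-level features, after the position list has been narrowed by the context-size slider. The sketch below isolates those two steps on random data; the shapes (3 vectors, 50 frames, 8 dimensions) and the helper name `context_positions` are illustrative assumptions, not real WavLM features or repository code.

```python
import numpy as np

POSITIONS = ["-4", "-3", "-2", "-1", "0", "+1", "+2", "+3", "+4"]


def context_positions(context_size):
    """Positions -k..+k around the centre entry (index 4) of POSITIONS."""
    return POSITIONS[4 - context_size:5 + context_size]


def cos_sim(xa, xb):
    """Pairwise cosine similarity between rows of xa (K, D) and xb (T, D)."""
    xa_norm = xa / np.linalg.norm(xa, axis=1, keepdims=True)
    xb_norm = xb / np.linalg.norm(xb, axis=1, keepdims=True)
    return xa_norm @ xb_norm.T


rng = np.random.default_rng(0)
vectors = rng.normal(size=(3, 8))   # hypothetical phonological vectors
frames = rng.normal(size=(50, 8))   # hypothetical frame-level features

sims = cos_sim(vectors, frames)
print(context_positions(0))  # ['0']
print(context_positions(2))  # ['-2', '-1', '0', '+1', '+2']
print(sims.shape)            # (3, 50): one heatmap row per vector, one column per frame
```

Each row of `sims` corresponds to one heatmap strip in the figure, which is why the plot height grows with the number of selected features and positions.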
examples/LDC93S1.phn ADDED
@@ -0,0 +1,37 @@
+0 3050 h#
+3050 4559 sh
+4559 5723 ix
+5723 6642 hv
+6642 8772 eh
+8772 9190 dcl
+9190 10337 jh
+10337 11517 ih
+11517 12500 dcl
+12500 12640 d
+12640 14714 ah
+14714 15870 kcl
+15870 16334 k
+16334 18088 s
+18088 20417 ux
+20417 21199 q
+21199 22560 en
+22560 22920 gcl
+22920 23271 g
+23271 24229 r
+24229 25566 ix
+25566 27156 s
+27156 28064 ix
+28064 29660 w
+29660 31719 ao
+31719 33360 sh
+33360 33754 epi
+33754 34715 w
+34715 36080 ao
+36080 36326 dx
+36326 37556 axr
+37556 39561 ao
+39561 40313 l
+40313 42059 y
+42059 43479 ih
+43479 44586 axr
+44586 46720 h#
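The `.phn` boundaries are sample offsets, which `app.py` divides by 16000 to draw them on a time axis in seconds. The sketch below performs the same conversion on the first three rows and also maps each boundary to an approximate feature-frame index. The 320-sample (20 ms) frame hop is an assumption about the encoder that this commit does not state, and `parse_phn` and `FRAME_HOP` are illustrative names.

```python
SAMPLE_RATE = 16000  # matches the sr passed to librosa.load in app.py
FRAME_HOP = 320      # assumed hop in samples (20 ms); not stated in the commit

PHN_TEXT = """\
0 3050 h#
3050 4559 sh
4559 5723 ix
"""


def parse_phn(text):
    """Parse 'start end label' lines into dicts holding sample offsets."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        start, end, label = line.split()
        rows.append({"start": int(start), "end": int(end), "text": label})
    return rows


for row in parse_phn(PHN_TEXT):
    start_s = row["start"] / SAMPLE_RATE
    end_s = row["end"] / SAMPLE_RATE
    first_frame = row["start"] // FRAME_HOP
    print(f'{row["text"]:>3}  {start_s:.3f}-{end_s:.3f} s  first frame ~{first_frame}')
```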
examples/LDC93S1.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:126a2fb63e03f2567d9f67e3a795d89a52b5beb99cff4530d2543f039309c7ef
+size 594082
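The entry above is a Git LFS pointer rather than the pickle itself: the repository records only the object's SHA-256 digest and byte size, and the real file is fetched through LFS. As an illustration, a pointer like this can be parsed and compared against downloaded bytes. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the check runs on a synthetic byte string because the real `LDC93S1.pkl` is not available here.

```python
import hashlib


def parse_lfs_pointer(text):
    """Split a Git LFS pointer into its key/value fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Check that raw bytes have the size and SHA-256 recorded in the pointer."""
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


pointer = parse_lfs_pointer("""\
version https://git-lfs.github.com/spec/v1
oid sha256:126a2fb63e03f2567d9f67e3a795d89a52b5beb99cff4530d2543f039309c7ef
size 594082
""")
print(pointer["algo"], pointer["size"])

# Self-check on synthetic bytes, since the real object is not available here.
payload = b"example payload"
fake = {"size": len(payload), "digest": hashlib.sha256(payload).hexdigest()}
print(matches_pointer(payload, fake))
```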
examples/LDC93S1.wav ADDED
Binary file (93.6 kB)
examples/LDC93S1.wrd ADDED
@@ -0,0 +1,11 @@
+3050 5723 she
+5723 10337 had
+9190 11517 your
+11517 16334 dark
+16334 21199 suit
+21199 22560 in
+22560 28064 greasy
+28064 33360 wash
+33754 37556 water
+37556 40313 all
+40313 44586 year
examples/extended-timit.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:db61233d72c815ff302d3b5388a060ea72c118e52631f41133452e06b6ff6276
+size 1417220
examples/extended-voxangeles.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:308faea3193c79fd86fcc27d3d920e17305cd923e17c8f34e3d5dda06862cb95
+size 1566356
examples/original-timit.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8fc61747b6bd3f1ff0805e25f94a90e9f51597a8e4c189417d77ebebfb05e08a
+size 165972
examples/original-voxangeles.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4b6c3da67a53e5a91a18c08c8c39d16519f50e0d631e2edb6153b643fde20b44
+size 165977
examples/unconstrained-timit.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a5be47ef278bd08d1b85486b6cdaaaa70ef8f5eac478b01ed245798925d2f125
+size 708589
examples/unconstrained-voxangeles.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b6c587990f041b0bfa01ef42ca13d8d3004038fbbf5f19d64c0271159036df87
+size 708594
requirements.txt ADDED
@@ -0,0 +1,7 @@
+transformers
+torch
+librosa
+numpy
+gradio
+specplotter
+matplotlib