Spaces: juice500/orthogonal-subspace

juice500 committed 24 days ago

Commit d74d90c (1 parent: ea7c90b)

Initial commit

Files changed (13)

README.md +56 -5
app.py +265 -0
examples/LDC93S1.phn +37 -0
examples/LDC93S1.pkl +3 -0
examples/LDC93S1.wav +0 -0
examples/LDC93S1.wrd +11 -0
examples/extended-timit.pkl +3 -0
examples/extended-voxangeles.pkl +3 -0
examples/original-timit.pkl +3 -0
examples/original-voxangeles.pkl +3 -0
examples/unconstrained-timit.pkl +3 -0
examples/unconstrained-voxangeles.pkl +3 -0
requirements.txt +7 -0

README.md CHANGED

@@ -1,13 +1,64 @@
 ---
 title: Orthogonal Subspace
-emoji: 📉
-colorFrom: blue
-colorTo: red
+emoji: 🚀
+colorFrom: pink
+colorTo: pink
 sdk: gradio
-sdk_version: 6.11.0
+sdk_version: 6.10.0
 app_file: app.py
 pinned: false
 license: mit
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# Phonological representation demo based on orthogonal subspaces
+Interactive demo for [**Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces**](https://arxiv.org/abs/2603.12642).
+You can load an audio file, pick a time span and a learned phonological vector in the WavLM representation space, and hear how adding that vector changes the resynthesized audio, alongside before-and-after spectrograms.
+| Resource | Link |
+|----------|------|
+| Full codebase | [github.com/juice500ml/phonetic-arithmetic](https://github.com/juice500ml/phonetic-arithmetic) |
+| Example audio / alignments | [LDC93S1](https://catalog.ldc.upenn.edu/LDC93S1W) (TIMIT single-utterance sample from LDC) |
+## Phonological vectors
+The UI exposes three vector families (for TIMIT and VoxAngeles):
+| Preset | Idea |
+|--------|------|
+| **Original** | Directions from the paper’s setup. |
+| **Unconstrained** | Center pooling only; no separate consonant/vowel subspaces. |
+| **Extended** | Unconstrained pooling, with positive and negative poles modeled as separate vectors. |
+## Run locally
+From this directory (`demos/orthogonal-subspace`):
+```bash
+pip install -r requirements.txt
+GRADIO_TEMP_DIR=$PWD/.gradio_tmp python app.py
+```
+Gradio will start a local URL; paths assume the working directory is the folder that contains `examples/` and `app.py`.
+## Reproducing phonological vectors
+Run from the **repository root** (`phonetic-arithmetic`), after you have the feature pickles and `dump_vectors.py` wired to your data. Replace `timit` with `voxangeles` if you want the other corpus.
+## Code for calculating contextual phonological vectors
+```bash
+dataset=timit # or voxangeles
+python3 dump_vectors.py \
+    --feat-path feats/${dataset}-wavlm-large-24-center-featslice.pkl \
+    --output-path demos/orthogonal-subspace/examples/original-${dataset}.pkl \
+    --vector-type original --vector ctx
+python3 dump_vectors.py \
+    --feat-path feats/${dataset}-wavlm-large-24-center-featslice.pkl \
+    --output-path demos/orthogonal-subspace/examples/unconstrained-${dataset}.pkl \
+    --vector-type full --vector ctx
+python3 dump_vectors.py \
+    --feat-path feats/${dataset}-wavlm-large-24-center-featslice.pkl \
+    --output-path demos/orthogonal-subspace/examples/extended-${dataset}.pkl \
+    --vector-type extended --vector ctx
+```
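
Note on the dumped files: `app.py` in this commit reads each vector file as a pickle whose top-level `"vectors"` entry maps keys of the form `"<feature> (<position>)"` (for example `"<feature> (0)"` or `"<feature> (+1)"`) to arrays. The sketch below shows one way to check which features and positions a dump contains before pointing the demo at it. It uses a toy pickle; the feature names `voiced` and `nasal`, the 4-dimensional vectors, and the helper name `summarize_vectors` are made up for illustration and are not taken from the repository. As with any pickle, only load files you trust.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Relative phone positions that app.py looks up for every feature.
POSITIONS = ["-4", "-3", "-2", "-1", "0", "+1", "+2", "+3", "+4"]


def summarize_vectors(path):
    """Return {feature: [positions present]} for a dumped vector pickle."""
    with open(path, "rb") as f:
        vectors = pickle.load(f)["vectors"]
    # A feature is listed once, via its position-0 key, as in app.py.
    feats = [key.split()[0] for key in vectors if "(0)" in key]
    return {
        feat: [loc for loc in POSITIONS if f"{feat} ({loc})" in vectors]
        for feat in feats
    }


# Toy stand-in for a real dump (hypothetical feature names and sizes).
toy = {
    "vectors": {
        "voiced (-1)": np.ones(4),
        "voiced (0)": np.ones(4),
        "voiced (+1)": np.ones(4),
        "nasal (0)": np.ones(4),
    }
}

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy-vectors.pkl"
    toy_path.write_bytes(pickle.dumps(toy))
    summary = summarize_vectors(toy_path)

print(summary)
```

Positions missing from a dump simply do not appear in the summary, which mirrors how the demo skips vectors that are `None`.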

app.py ADDED

	@@ -0,0 +1,265 @@

+import pickle
+from pathlib import Path
+import librosa
+import numpy as np
+import gradio as gr
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+from specplotter import SpecPlotter
+from transformers import Wav2Vec2FeatureExtractor, AutoModel
+import torch
+def cos_sim(XA, XB):
+    XA_norm = XA / np.linalg.norm(XA, axis=1, keepdims=True)
+    XB_norm = XB / np.linalg.norm(XB, axis=1, keepdims=True)
+    return (XA_norm @ XB_norm.T)
+def _read_pkl(path):
+    with open(path, "rb") as f:
+        vectors = pickle.load(f)["vectors"]
+    feats = [key.split()[0] for key in vectors.keys() if "(0)" in key]
+    return {
+        feat: {
+            loc: vectors.get(f"{feat} ({loc})")
+            for loc in ["-4", "-3", "-2", "-1", "0", "+1", "+2", "+3", "+4"]
+        }
+        for feat in feats
+    }
+def _read_alignment(fname):
+    data = []
+    with open(fname, "r") as f:
+        for line in f:
+            start, end, text = line.strip().split()
+            data.append({
+                "start": int(start),
+                "end": int(end),
+                "text": text,
+            })
+    return data
+print("Loading model...")
+processor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-large")
+ssl = AutoModel.from_pretrained("microsoft/wavlm-large")
+print("Model loaded!")
+print("Loading vectors...")
+PHON_VECTORS = {
+    "TIMIT (original)": _read_pkl("examples/original-timit.pkl"),
+    "TIMIT (unconstrained)": _read_pkl("examples/unconstrained-timit.pkl"),
+    "TIMIT (extended)": _read_pkl("examples/extended-timit.pkl"),
+    "VoxAngeles (original)": _read_pkl("examples/original-voxangeles.pkl"),
+    "VoxAngeles (unconstrained)": _read_pkl("examples/unconstrained-voxangeles.pkl"),
+    "VoxAngeles (extended)": _read_pkl("examples/extended-voxangeles.pkl"),
+}
+DEFAULT_KEY = next(iter(PHON_VECTORS.keys()))
+print("Vectors loaded!")
+EXAMPLE_AUDIO = Path("examples/LDC93S1.wav")
+EXAMPLE_PHN = _read_alignment("examples/LDC93S1.phn")
+with open("examples/LDC93S1.pkl", "rb") as f:
+    EXAMPLE_FEATS = pickle.load(f)
+def run_orthogonal_subspace(path, vector_type, features, context_size, similarity_range):
+    audio, _ = librosa.load(path, sr=16000, mono=True)
+    if Path(path).name == EXAMPLE_AUDIO.name:
+        feats = EXAMPLE_FEATS
+        alignments = EXAMPLE_PHN
+    else:
+        inputs = processor(
+            raw_speech=[audio],
+            sampling_rate=16000,
+            padding=False,
+            return_tensors="pt",
+        )
+        with torch.no_grad():  # inference only; skip building the autograd graph
+            out = ssl(**inputs)
+        feats = out.last_hidden_state[0].numpy()
+        alignments = []
+    keys, vectors = [], []
+    for f in features:
+        for i in ["-4", "-3", "-2", "-1", "0", "+1", "+2", "+3", "+4"][4-context_size:5+context_size]:
+            if (PHON_VECTORS[vector_type][f] is not None) and (PHON_VECTORS[vector_type][f][i] is not None):
+                keys.append(f"{f} ({i})")
+                vectors.append(PHON_VECTORS[vector_type][f][i])
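+    if not vectors:
+        # np.stack fails on an empty list, e.g. when every feature is deselected
+        raise gr.Error("Select at least one phonological feature with available vectors.")
+    vectors = np.stack(vectors)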
+    sims = cos_sim(vectors, feats)
+    fig, ax = plt.subplots(1, figsize=(10, 2 + len(keys) // 5), constrained_layout=True)
+    ax.axis("off")
+    gs = fig.add_gridspec(
+        nrows=1 + len(keys), ncols=1,
+        height_ratios=[3] + [0.2] * len(keys)  # spectrogram taller than heatmaps
+    )
+    # Spectrogram plotting
+    ax_spec = fig.add_subplot(gs[0, 0])
+    sp = SpecPlotter()
+    sp.plot_spectrogram(audio, ax=ax_spec, show_annotation=False)
+    ax_spec.get_xaxis().set_visible(False)
+    for row in alignments:
+        start, end, label = row["start"] / 16000, row["end"] / 16000, row["text"]
+        ax_spec.axvline(start, color="black", linestyle="-", alpha=0.7)
+        ax_spec.axvline(end, color="black", linestyle="-", alpha=0.7)
+        ax_spec.add_patch(
+            plt.Rectangle(
+                (start, 7),
+                end - start,
+                1,
+                color="black",
+                alpha=0.4,
+                clip_on=False
+            )
+        )
+        ax_spec.text(
+            (start + end) / 2,
+            7.5,
+            label,
+            ha="center",
+            va="center",
+            color="white",
+            fontsize=9
+        )
+    x0, x1 = ax_spec.get_xlim()
+    ims = []
+    axes_hm = []
+    for i, (hm, lab) in enumerate(zip(sims, keys), start=1):
+        ax = fig.add_subplot(gs[i, 0], sharex=ax_spec)
+        axes_hm.append(ax)
+        hm = np.asarray(hm)
+        if hm.ndim == 1:
+            hm = hm[None, :]  # make it (1, T) so it looks like a single-row heatmap
+        # Use extent so the heatmap x-axis is in seconds (aligned with spectrogram)
+        im = ax.imshow(
+            hm,
+            origin="lower",
+            aspect="auto",
+            interpolation="nearest",
+            extent=[x0, x1, 0, 1],
+            vmin=-similarity_range,
+            vmax=+similarity_range,
+            cmap=plt.cm.PuOr,
+        )
+        ims.append(im)
+        for row in alignments:
+            start, end, label = row["start"] / 16000, row["end"] / 16000, row["text"]
+            ax.axvline(start, color="black", linestyle="-", alpha=0.7)
+            ax.axvline(end, color="black", linestyle="-", alpha=0.7)
+        ax.set_yticks([])
+        ax.tick_params(axis='x', length=0)
+        feat, loc = lab.split()
+        if loc == "(0)":
+            if context_size == 0:
+                label = f"[+{feat}]"
+            else:
+                label = f"[+{feat}] 0"
+        else:
+            label = loc[1:-1]
+        ax.set_ylabel(label, rotation=0, ha="right", va="center", fontweight="bold" if loc == "(0)" else "normal")
+        ax.yaxis.set_label_coords(-0.02, 0.5)
+        ax.spines["top"].set_visible(False)
+        ax.spines["right"].set_visible(False)
+        ax.spines["left"].set_visible(False)
+    # Only show x tick labels on the bottom-most axis
+    plt.setp(ax_spec.get_xticklabels(), visible=False)
+    for ax in axes_hm[:-1]:
+        plt.setp(ax.get_xticklabels(), visible=False)
+    axes_hm[-1].set_xlabel("Time [s]")
+    ax_spec.set_xlim(x0, x1)
+    cbar = fig.colorbar(ims[-1], ax=axes_hm, pad=0.01, fraction=0.03)
+    cbar.set_label("Cosine similarity")
+    return fig
+with gr.Blocks(title="Orthogonal Subspace Demo") as demo:
+    with gr.Row():
+        gr.Markdown("""
+## 🎙️ Orthogonal Subspace Demo
+Demonstration for the paper [Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces](https://arxiv.org/abs/2603.12642).
+This demo reproduces Figure 10: cosine similarity over time between frame-level representations from a self-supervised speech model (S3M) and position-dependent phonological vectors, illustrating how each relative phone position occupies a distinct orthogonal subspace.
+Upload, record, or use the example audio, configure the parameters, and click **Run**.
+""")
+    with gr.Row():
+        with gr.Column(scale=1):
+            audio = gr.Audio(
+                label="Input Audio",
+                type="filepath",
+                sources=["upload", "microphone"],
+                recording=True,
+                value=str(EXAMPLE_AUDIO),
+            )
+            gr.Markdown("""
+### Parameters
+- **Vector extraction method**: How phonological vectors are estimated from S3M representations. The options differ in the dataset used and in how the vectors are calculated.
+- **Phonological features**: Which phonological features to include in the plot. Deselect features to reduce clutter or isolate a single dimension of contrast.
+- **Context size**: Number of relative phone positions. 0 = vectors from current phone only; k = vectors from relative positions −k through +k. Larger values reveal how far phonological features extend beyond current (or immediately adjacent) phones.
+- **Cosine similarity range**: Symmetric limit r of the color scale, which spans -r to +r (default r = 0.4). Lower it to bring out fine-grained differences in low-similarity outputs; raise it if the colors saturate.
+""")
+        with gr.Column(scale=1):
+            vector_dropdown = gr.Dropdown(
+                label="Vector extraction method",
+                choices=list(PHON_VECTORS.keys()),
+                value=DEFAULT_KEY,
+                interactive=True,
+            )
+            feature_checkbox = gr.CheckboxGroup(
+                choices=list(PHON_VECTORS[DEFAULT_KEY].keys()),
+                value=list(PHON_VECTORS[DEFAULT_KEY].keys()),
+                label="Phonological features",
+                show_select_all=True,
+                interactive=True,
+            )
+            context_size_slider = gr.Slider(label="Context size", value=2, minimum=0, maximum=4, step=1, interactive=True)
+            similarity_slider = gr.Slider(label="Cosine similarity range", value=0.4, minimum=0.1, maximum=1.0, step=0.01, interactive=True)
+            run_btn = gr.Button("▶ Run", variant="primary", scale=1)
+    with gr.Row():
+        plot = gr.Plot(
+            label="Output Spectrogram and Phonological Representations",
+            show_label=False,
+        )
+    # Connectors
+    vector_dropdown.change(
+        fn=lambda key: gr.CheckboxGroup(
+            choices=list(PHON_VECTORS[key].keys()),
+            value=list(PHON_VECTORS[key].keys()),
+        ),
+        inputs=vector_dropdown,
+        outputs=feature_checkbox,
+    )
+    run_btn.click(
+        fn=run_orthogonal_subspace,
+        inputs=[audio, vector_dropdown, feature_checkbox, context_size_slider, similarity_slider],
+        outputs=plot,
+    )
+if __name__ == "__main__":
+    demo.launch()
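
Note on the core computation: stripped of plotting, `run_orthogonal_subspace` selects the vectors for positions -k through +k, stacks them, and takes the cosine similarity of each one against every frame. The sketch below repeats that on toy data so the shapes are easy to see. The random vectors, the 8-dimensional space, the 50 frames, and the helper name `positions_for` are assumptions for illustration; the real demo uses WavLM features and the dumped vectors.

```python
import numpy as np

POSITIONS = ["-4", "-3", "-2", "-1", "0", "+1", "+2", "+3", "+4"]


def cos_sim(xa, xb):
    """Pairwise cosine similarity between rows of xa (K, D) and xb (T, D)."""
    xa = xa / np.linalg.norm(xa, axis=1, keepdims=True)
    xb = xb / np.linalg.norm(xb, axis=1, keepdims=True)
    return xa @ xb.T


def positions_for(context_size):
    """Positions -k..+k, using the same slice as app.py."""
    return POSITIONS[4 - context_size : 5 + context_size]


rng = np.random.default_rng(0)
dim, n_frames = 8, 50

# Toy vectors for a single feature, one per relative position.
toy_vectors = {loc: rng.normal(size=dim) for loc in POSITIONS}

# Toy frame-level features; frame 10 is made to point along the position-0 vector.
frames = rng.normal(size=(n_frames, dim))
frames[10] = 3.0 * toy_vectors["0"]

selected = positions_for(2)
stacked = np.stack([toy_vectors[loc] for loc in selected])
sims = cos_sim(stacked, frames)  # one heatmap row per selected position

print(selected)
print(sims.shape)
```

Each row of `sims` corresponds to one heatmap strip in the figure, and cosine similarity ignores the scale of the frame, so the scaled frame 10 still scores 1 against the position-0 vector.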

examples/LDC93S1.phn ADDED

	@@ -0,0 +1,37 @@

+0 3050 h#
+3050 4559 sh
+4559 5723 ix
+5723 6642 hv
+6642 8772 eh
+8772 9190 dcl
+9190 10337 jh
+10337 11517 ih
+11517 12500 dcl
+12500 12640 d
+12640 14714 ah
+14714 15870 kcl
+15870 16334 k
+16334 18088 s
+18088 20417 ux
+20417 21199 q
+21199 22560 en
+22560 22920 gcl
+22920 23271 g
+23271 24229 r
+24229 25566 ix
+25566 27156 s
+27156 28064 ix
+28064 29660 w
+29660 31719 ao
+31719 33360 sh
+33360 33754 epi
+33754 34715 w
+34715 36080 ao
+36080 36326 dx
+36326 37556 axr
+37556 39561 ao
+39561 40313 l
+40313 42059 y
+42059 43479 ih
+43479 44586 axr
+44586 46720 h#
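
Note on units: the boundaries in this file are sample indices at 16 kHz, which is why `app.py` divides them by 16000 before drawing them on the time axis. The sketch below converts a few rows to seconds and to approximate feature-frame indices. The 320-sample (20 ms) hop is an assumption about the WavLM frame rate, and the frame mapping ignores the receptive-field offset of the convolutional front end, so treat the frame indices as approximate.

```python
SAMPLE_RATE = 16000   # Hz, matches the librosa.load call in app.py
FRAME_STRIDE = 320    # samples per feature frame (assumed 20 ms hop)

PHN_ROWS = """\
0 3050 h#
3050 4559 sh
4559 5723 ix
"""


def parse_phn(text):
    """Parse 'start end label' rows and add seconds and approximate frames."""
    rows = []
    for line in text.strip().splitlines():
        start, end, label = line.split()
        start, end = int(start), int(end)
        first_frame = start // FRAME_STRIDE
        last_frame = max(first_frame, (end - 1) // FRAME_STRIDE)
        rows.append({
            "text": label,
            "start_s": start / SAMPLE_RATE,
            "end_s": end / SAMPLE_RATE,
            "frames": (first_frame, last_frame),
        })
    return rows


rows = parse_phn(PHN_ROWS)
for row in rows:
    print(row)
```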

examples/LDC93S1.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:126a2fb63e03f2567d9f67e3a795d89a52b5beb99cff4530d2543f039309c7ef
+size 594082

examples/LDC93S1.wav ADDED

Binary file (93.6 kB).

examples/LDC93S1.wrd ADDED

	@@ -0,0 +1,11 @@

+3050 5723 she
+5723 10337 had
+9190 11517 your
+11517 16334 dark
+16334 21199 suit
+21199 22560 in
+22560 28064 greasy
+28064 33360 wash
+33754 37556 water
+37556 40313 all
+40313 44586 year
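
Note on combining the two alignment files: the word file uses the same sample-index convention as the phone file, so phones can be grouped under words by span containment. `app.py` in this commit only reads the `.phn` file, so the sketch below is an optional illustration rather than part of the demo, and the helper names are hypothetical. It uses the first rows of both files as they appear in this commit.

```python
PHN = """\
3050 4559 sh
4559 5723 ix
5723 6642 hv
6642 8772 eh
8772 9190 dcl
9190 10337 jh
10337 11517 ih
"""

WRD = """\
3050 5723 she
5723 10337 had
9190 11517 your
"""


def parse_alignment(text):
    """Parse 'start end label' rows into (start, end, label) tuples."""
    rows = []
    for line in text.strip().splitlines():
        start, end, label = line.split()
        rows.append((int(start), int(end), label))
    return rows


def phones_in_words(words, phones):
    """List, for each word, the phones whose span lies inside the word span."""
    return [
        (word, [p for ps, pe, p in phones if ps >= ws and pe <= we])
        for ws, we, word in words
    ]


grouped = phones_in_words(parse_alignment(WRD), parse_alignment(PHN))
for word, phone_labels in grouped:
    print(word, phone_labels)
```

The spans for "had" and "your" overlap in the word file, so the phone `jh` is listed under both words.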

examples/extended-timit.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:db61233d72c815ff302d3b5388a060ea72c118e52631f41133452e06b6ff6276
+size 1417220

examples/extended-voxangeles.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:308faea3193c79fd86fcc27d3d920e17305cd923e17c8f34e3d5dda06862cb95
+size 1566356

examples/original-timit.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8fc61747b6bd3f1ff0805e25f94a90e9f51597a8e4c189417d77ebebfb05e08a
+size 165972

examples/original-voxangeles.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4b6c3da67a53e5a91a18c08c8c39d16519f50e0d631e2edb6153b643fde20b44
+size 165977

examples/unconstrained-timit.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a5be47ef278bd08d1b85486b6cdaaaa70ef8f5eac478b01ed245798925d2f125
+size 708589

examples/unconstrained-voxangeles.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b6c587990f041b0bfa01ef42ca13d8d3004038fbbf5f19d64c0271159036df87
+size 708594

requirements.txt ADDED

	@@ -0,0 +1,7 @@

+transformers
+torch
+librosa
+numpy
+gradio
+specplotter
+matplotlib