Upload 3 files
app.py  CHANGED
@@ -9,6 +9,79 @@ import torch.nn as nn
print(f"APP STARTUP: {datetime.now()}")

+# =============================================================================
+# WHY SIGMOID INSTEAD OF SOFTMAX? - A DETAILED EXPLANATION
+# =============================================================================
+"""
+MULTI-LABEL vs MULTI-CLASS CLASSIFICATION
+==========================================
+
+Our stutter detection is a MULTI-LABEL problem:
+- A single 3-second audio chunk can have MULTIPLE stutters simultaneously
+- Example: Someone might have a "Block" AND a "SoundRep" in the same chunk
+- Each of the 5 stutter types is INDEPENDENT of the others
+
+SOFTMAX (❌ NOT suitable for us):
+---------------------------------
+- Used for MULTI-CLASS problems where classes are MUTUALLY EXCLUSIVE
+- Example: "Is this image a Cat OR a Dog?" (can't be both)
+- Formula: softmax(x_i) = exp(x_i) / sum(exp(x_j)) for all j
+- All probabilities MUST sum to 1.0
+- Problem: If we used softmax and got [0.7, 0.1, 0.1, 0.05, 0.05]:
+  - It would say "70% Prolongation" but FORCE other classes to be low
+  - We couldn't detect multiple stutters in one chunk!
+
+SIGMOID (✅ CORRECT for us):
+----------------------------
+- Used for MULTI-LABEL problems where classes are INDEPENDENT
+- Each class gets its own independent probability (0 to 1)
+- Formula: sigmoid(x) = 1 / (1 + exp(-x))
+- Probabilities DON'T need to sum to 1
+- Example output: [0.8, 0.7, 0.2, 0.1, 0.05]
+  - 80% chance of Prolongation
+  - 70% chance of Block
+  - Both can be detected simultaneously!
+
+THE TRAINING & INFERENCE FLOW:
+==============================
+
+TRAINING:
+---------
+1. Model outputs: LOGITS (raw scores from -∞ to +∞)
+   Example: [2.5, -3.0, 0.1, -1.5, -2.0]
+
+2. Loss Function: BCEWithLogitsLoss
+   - "WithLogits" means it applies Sigmoid INTERNALLY
+   - More numerically stable than separate Sigmoid + BCELoss
+   - Compares each prediction to each ground truth label independently
+
+INFERENCE (this file):
+----------------------
+1. Model outputs: LOGITS (same as training)
+   Example: [2.5, -3.0, 0.1, -1.5, -2.0]
+
+2. We manually apply Sigmoid to convert to probabilities:
+   probs = torch.sigmoid(logits)
+   Result: [0.92, 0.05, 0.52, 0.18, 0.12]
+
+3. Apply threshold (e.g., 0.5) to each probability:
+   - 0.92 > 0.5 → Prolongation DETECTED
+   - 0.05 < 0.5 → Block NOT detected
+   - 0.52 > 0.5 → SoundRep DETECTED
+   - etc.
+
+4. If NO stutters detected (all below threshold):
+   → Label the chunk as "Fluent"
+
+THRESHOLD EXPLAINED:
+====================
+- Default: 0.5 (theoretically neutral, since sigmoid(0) = 0.5)
+- Lower threshold (0.3-0.4): More SENSITIVE, catches more stutters, but more false positives
+- Higher threshold (0.6-0.7): More STRICT, fewer false positives, but might miss subtle stutters
+- The slider in the UI lets users adjust this based on their needs
+- SAME threshold is applied to ALL 5 classes (simplest approach)
+"""
+
class WaveLmStutterClassification(nn.Module):
    def __init__(self, num_labels=5):
        super().__init__()
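To make the softmax-vs-sigmoid contrast in the docstring concrete, here is a minimal sketch (not part of the commit) that runs both activations over the example logits quoted above; printed values are approximate.

import torch

# Example logits from the docstring: one 3-second chunk, 5 stutter classes.
logits = torch.tensor([2.5, -3.0, 0.1, -1.5, -2.0])

softmax_probs = torch.softmax(logits, dim=0)   # forced to sum to 1.0
sigmoid_probs = torch.sigmoid(logits)          # each class scored independently

print(softmax_probs.tolist())        # one dominant class, the rest suppressed
print(softmax_probs.sum().item())    # 1.0
print(sigmoid_probs.tolist())        # ~[0.92, 0.05, 0.52, 0.18, 0.12]
print(sigmoid_probs.sum().item())    # does not need to sum to 1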
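And a small sketch of the training-vs-inference asymmetry described above: BCEWithLogitsLoss consumes raw logits during training, while inference applies torch.sigmoid explicitly and thresholds each class independently. The multi-hot target below is illustrative only, not taken from the dataset.

import torch
import torch.nn as nn

logits = torch.tensor([[2.5, -3.0, 0.1, -1.5, -2.0]])   # (batch=1, 5 classes)
targets = torch.tensor([[1.0, 0.0, 1.0, 0.0, 0.0]])     # illustrative multi-hot labels

# TRAINING: BCEWithLogitsLoss applies sigmoid internally (numerically stable).
loss = nn.BCEWithLogitsLoss()(logits, targets)

# INFERENCE: apply sigmoid manually, then threshold each class on its own.
probs = torch.sigmoid(logits)[0]       # ~[0.92, 0.05, 0.52, 0.18, 0.12]
detected_mask = probs > 0.5            # per-class decision, not argmax
print(loss.item(), detected_mask.tolist())   # several classes can be True at once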
@@ -152,7 +225,7 @@ def analyze_audio(audio_input, threshold, progress=gr.Progress()):
        detected, _ = analyze_chunk(chunk, threshold)
        for l in detected:
            stutter_counts[l] += 1
-        timeline.append({"time": f"{start/sr:.1f}-{end/sr:.1f}s", "detected": detected or ["
+        timeline.append({"time": f"{start/sr:.1f}-{end/sr:.1f}s", "detected": detected or ["Fluent"]})

    progress(0.75, desc="🗣️ Transcribing with Whisper...")
    print("Running Whisper...")
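To illustrate the threshold slider and the ["Fluent"] fallback added in this hunk, a minimal decoding sketch under assumed names: only Prolongation, Block, and SoundRep are named in the docstring, so the last two labels below are placeholders.

import torch

# Placeholder label order; only the first three names appear in the docstring above.
LABELS = ["Prolongation", "Block", "SoundRep", "Label3", "Label4"]

def decode_chunk(probs, threshold):
    # Keep every class whose independent probability clears the threshold.
    detected = [LABELS[i] for i, p in enumerate(probs) if p > threshold]
    # If nothing clears the threshold, treat the chunk as fluent speech.
    return detected or ["Fluent"]

probs = torch.sigmoid(torch.tensor([2.5, -3.0, 0.1, -1.5, -2.0])).tolist()
print(decode_chunk(probs, 0.5))    # ['Prolongation', 'SoundRep']
print(decode_chunk(probs, 0.7))    # stricter: ['Prolongation']
print(decode_chunk(probs, 0.95))   # nothing detected: ['Fluent']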