| # Unknown Rejection: Approach Log | |
| > **Goal**: The system must NEVER give a wrong word prediction for a non-verbal child. | |
| > If the audio doesn't match any known word, predict `_unknown` so the parent can review. | |
| > Wrong predictions are worse than missing ones. | |
| --- | |
| ## Why this is hard | |
| HuBERT/Wav2Vec2 embeddings always produce _some_ cosine similarity score, even for random noise or unrelated speech. The model was never trained to say "I don't know." Every audio file gets a winner, even if it has nothing to do with the vocabulary. | |
| The fundamental challenge: distinguish "a known word spoken imperfectly" from "a completely unknown word/sound." | |
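A minimal sketch of the problem, using random vectors as stand-ins for real HuBERT embeddings. The 768-dimensional size, the word names, and the `cosine_scores` helper are illustrative assumptions, not taken from `app.py`. Picking the highest cosine similarity returns some word no matter what the input is:

```python
import numpy as np


def cosine_scores(query: np.ndarray, bank: dict) -> dict:
    """Cosine similarity between one query embedding and each word's mean embedding."""
    q = query / np.linalg.norm(query)
    return {word: float(q @ (vec / np.linalg.norm(vec))) for word, vec in bank.items()}


rng = np.random.default_rng(0)

# Hypothetical bank: three words, random stand-ins for mean embeddings.
bank = {word: rng.normal(size=768) for word in ("mayim", "aba", "od")}

# A query that is unrelated to every word in the bank.
unrelated_query = rng.normal(size=768)

scores = cosine_scores(unrelated_query, bank)
winner = max(scores, key=scores.get)
print(winner, round(scores[winner], 4))  # some word always comes out on top
```

Nothing in the ranking step can return "none of the above", so any rejection has to be a separate check layered on top of the scores.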
| --- | |
| ## Approach 1: `_unknown` bank category with synthetic sounds | |
| **Status: FAILED** | |
| Put white noise, sine tones, or other non-speech sounds in `Bank/_unknown/`. | |
| The idea: unknown audio ≈ noise → high similarity to noise samples. | |
| **Why it failed**: HuBERT speech embeddings live in a completely different region of embedding space from non-speech sounds. Speech vs. non-speech similarity is near zero regardless of content. A Hebrew word you've never seen still looks like speech, not like noise. | |
| **Lesson**: `_unknown` must contain real spoken words (e.g., English words the kids would never say), not synthetic audio. | |
| --- | |
| ## Approach 2: `_unknown` bank category with real (foreign) words | |
| **Status: Partially tried, inconclusive** | |
| Put English or other foreign-language words in `Bank/_unknown/`. | |
| The idea: if an unknown Hebrew word is spoken, it might score more similarly to the `_unknown` cluster than to any known Hebrew word. | |
| **Problem**: The `_unknown` cluster is inherently diverse (many different words/sounds) → low internal consistency → weak centroid → rarely wins against a focused known-word cluster, even for truly unknown input. HuBERT groups by phonetics, and any Hebrew syllable shares features with known words. | |
| **Current bank state**: No `_unknown` folder exists in Bank-12, Bank_New, or Bank-Noa. | |
| --- | |
| ## Approach 3: Gap-based rejection (min_gap between 1st and 2nd place) | |
| **Status: Currently active, works partially** | |
| In `/compute_similarities`, reject if `score_1st - score_2nd < min_gap`. | |
| Logic: a known word scores clearly higher than all others (large gap). An unknown word has no clear winner: scores are bunched together (small gap). | |
| **Implementation**: `app.py` lines 487-492, controlled by `unknown_min_gap` slider in UI. | |
| **Problems**: | |
| - Requires manual tuning; there is no principled default value | |
| - In `mean` mode, softmax (T=0.01) amplifies small differences, so even unknown audio can show a large gap between rank-1 and rank-2 | |
| - Doesn't account for the absolute level of scores (low but distinct ≠ match) | |
| - User doesn't know what value to set | |
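The gap rule can be sketched as below. The function name `reject_by_gap` and the example scores are hypothetical; the real check lives in `app.py` and may differ in detail. The third example shows the "low but distinct" weakness from the list above: a poor best score still passes if the runner-up is even worse.

```python
def reject_by_gap(scores, min_gap):
    """Return True if the best score fails to beat the runner-up by at least min_gap."""
    if min_gap <= 0 or len(scores) < 2:
        return False  # check disabled, or nothing to compare against
    ranked = sorted(scores, reverse=True)
    return (ranked[0] - ranked[1]) < min_gap


known_word = [0.82, 0.70, 0.66]        # clear winner
unknown_word = [0.61, 0.60, 0.59]      # scores bunched together
low_but_distinct = [0.20, 0.05, 0.04]  # weak match that still has a large gap

print(reject_by_gap(known_word, 0.03))
print(reject_by_gap(unknown_word, 0.03))
print(reject_by_gap(low_but_distinct, 0.03))
```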
| --- | |
| ## Approach 4: Calibration floor threshold | |
| **Status: Computed but NEVER CONNECTED (bug!) → Fixed in current version** | |
| `/extract_bank` computes two calibration thresholds from bank self-similarity: | |
| - `cosine_threshold`: 10th percentile of all pairwise cosine similarities, minus a 0.05 margin | |
| - `dtw_threshold`: 10th percentile of all pairwise DTW similarities, minus a 0.05 margin | |
| **The idea**: If two recordings of the same word have at least `cosine_threshold` similarity, then any valid input should score at least this high against its matching word. If even the best match scores below this floor, the input is unknown. | |
| **The bug**: These thresholds were sent from the frontend to the endpoint in the `unknown_threshold` and `dtw_calibration_threshold` fields, but the rejection logic in `compute_similarities` never read them. Only the gap check ran. | |
| **Fix applied**: Now Layer 1 checks cosine floor, Layer 2 checks DTW floor, Layer 3 is the gap check. | |
| **Limitation**: Calibration is only computed from words that have ≥ 2 recordings. Single-sample words contribute nothing. | |
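A minimal sketch of the cosine floor, assuming the pairwise similarities are taken between recordings of the same word (which is how the idea above reads). The function name `cosine_floor`, the bank layout (word mapped to an array of shape `(n_recordings, dim)`), and the random example data are assumptions; the actual `/extract_bank` code is not shown in this log.

```python
import itertools
from typing import Optional

import numpy as np


def cosine_floor(bank: dict, percentile: float = 10.0, margin: float = 0.05) -> Optional[float]:
    """10th percentile of within-word pairwise cosine similarities, minus a margin.

    Returns None when no word has at least two recordings.
    """
    sims = []
    for recordings in bank.values():
        if len(recordings) < 2:
            continue  # single-sample words contribute nothing
        unit = recordings / np.linalg.norm(recordings, axis=1, keepdims=True)
        for i, j in itertools.combinations(range(len(unit)), 2):
            sims.append(float(unit[i] @ unit[j]))
    if not sims:
        return None
    return float(np.percentile(sims, percentile)) - margin


rng = np.random.default_rng(0)
base = rng.normal(size=32)
example_bank = {
    # three noisy variants of the same underlying vector
    "word_a": np.stack([base + 0.1 * rng.normal(size=32) for _ in range(3)]),
    # one recording only: skipped by the calibration
    "word_b": rng.normal(size=(1, 32)),
}
floor = cosine_floor(example_bank)
print(floor)
```

Returning `None` rather than a default number keeps the "calibration unavailable" case explicit, so the caller can skip the floor check instead of applying a meaningless threshold.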
| --- | |
| ## Approach 5: Score spread / entropy check | |
| **Status: Considered, not implemented** | |
| Measure the standard deviation or entropy of the top-N scores. | |
| If all scores are very similar (low spread), reject as unknown. | |
| **Problem**: After softmax (T=0.01), the spread is always amplified, making this measure unreliable in `mean` mode. | |
| **Could work** in `dtw` or `hybrid` mode where raw DTW scores are used (no softmax). | |
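Since this approach was never implemented, the following is only a sketch of what such a check could look like on raw (pre-softmax) scores. Both function names and the `temperature=0.05` default are hypothetical choices, not values from the project.

```python
import numpy as np


def top_n_spread(raw_scores, top_n: int = 5) -> float:
    """Population standard deviation of the top_n highest raw scores."""
    top = np.sort(np.asarray(raw_scores, dtype=float))[::-1][:top_n]
    return float(top.std())


def normalized_entropy(raw_scores, temperature: float = 0.05) -> float:
    """Shannon entropy of softmax(raw / temperature), scaled to [0, 1].

    Values near 1.0 mean every candidate looks equally likely (no clear winner).
    """
    scores = np.asarray(raw_scores, dtype=float) / temperature
    if len(scores) < 2:
        return 0.0  # a single candidate has no uncertainty to measure
    probs = np.exp(scores - scores.max())  # subtract max for numerical stability
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    return float(entropy / np.log(len(probs)))


flat = [0.50, 0.50, 0.50, 0.50]    # no candidate stands out
peaked = [0.90, 0.50, 0.50, 0.50]  # one candidate stands out

print(top_n_spread(flat), top_n_spread(peaked))
print(normalized_entropy(flat), normalized_entropy(peaked))
```

A rejection rule would then compare either value against a threshold that still has to be chosen empirically.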
| --- | |
| ## Approach 6: Channel disagreement (ensemble) | |
| **Status: Partially available, not used for rejection** | |
| If HuBERT and Wav2Vec2 disagree on the top word, the prediction is uncertain. | |
| Already surfaced in the UI as a warning ("⚠ Models disagree"). | |
| **Could extend to**: if HuBERT winner ≠ Wav2Vec2 winner AND gap is small → reject to unknown. | |
| --- | |
| ## Approach 7: Raw cosine gap instead of softmax gap | |
| **Status: Fixed (current version)** | |
| The gap check was using the softmax-rescaled score (`r['score']`) with temperature=0.01. | |
| Dividing by such a small temperature multiplies every raw difference by 100 before the exponential, so raw cosine differences of only a few hundredths become large softmax gaps. This makes any `min_gap` threshold meaningless: gaps nearly always appear large. | |
| **Fix**: Compare `results[0]['mean_score'] - results[1]['mean_score']` (raw cosine, pre-softmax) instead of `results[0]['score'] - results[1]['score']` (softmax). | |
| **Expected values in raw cosine space**: | |
| - Known word, correct match: raw_gap ≈ 0.05–0.15 | |
| - Unknown word (no match): raw_gap ≈ 0.001–0.02 | |
| - Starting threshold: 0.03 | |
| **Slider range**: Changed from 0–0.5 (softmax space, useless) to 0–0.15 (raw cosine space, meaningful). | |
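The amplification can be reproduced with a few made-up scores. The `softmax` helper and the four raw values below are illustrative assumptions; only the temperature of 0.01 comes from this log.

```python
import numpy as np


def softmax(scores, temperature):
    """Temperature-scaled softmax over a 1-D list of scores."""
    scaled = np.asarray(scores, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()


raw = np.array([0.80, 0.78, 0.77, 0.76])  # hypothetical raw cosine scores
soft = softmax(raw, temperature=0.01)

raw_gap = raw[0] - raw[1]
soft_gap = soft[0] - soft[1]
print(f"raw gap:     {raw_gap:.3f}")
print(f"softmax gap: {soft_gap:.3f}")
```

With these values the raw gap is 0.02, which falls below the 0.03 starting threshold, while the softmax gap comes out at roughly 0.72. The same input is rejected in raw cosine space and looks like a confident match in softmax space.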
| --- | |
| ## Current Architecture (post-fix) | |
| ``` | |
| /compute_similarities rejection logic (3 layers, checked in order): | |
| Layer 1: Cosine floor (automatic, bank-calibrated) | |
|   if request.unknown_threshold is set AND best_raw_cosine < threshold: | |
|     → reject (score too low for any known word) | |
|   Requires: each word has ≥ 2 recordings in bank | |
| Layer 2: DTW floor (automatic, bank-calibrated) | |
|   if request.dtw_calibration_threshold is set AND best_dtw < threshold: | |
|     → reject (DTW score too low) | |
|   Requires: each word has ≥ 2 recordings in bank | |
| Layer 3: Raw cosine gap check (manual, user-controlled via slider) | |
|   raw_gap = results[0]['mean_score'] - results[1]['mean_score']  (pre-softmax) | |
|   if unknown_min_gap > 0 AND raw_gap < min_gap: | |
|     → reject (no clear winner in raw cosine space) | |
|   Start with min_gap = 0.03 | |
| ``` | |
| Layers 1+2 are automatic once bank has multi-recording words. Layer 3 needs manual tuning. | |
| --- | |
| ## Approach 8: Z-score automatic rejection (current Layer 3) | |
| **Status: Active (current version)** | |
| Replace the manual gap slider with a fully automatic statistical test. | |
| **Insight**: For a known word, the correct category scores much higher than all others, so the top score is a strong outlier (high z-score above the mean). For an unknown word, all categories score similarly, so the top score is barely above average (low z-score). | |
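| ```python | |
| import numpy as np | |
| # results is sorted best-first; 'mean_score' is the raw (pre-softmax) cosine score | |
| all_raw = np.array([r['mean_score'] for r in results]) | |
| mean_all = all_raw.mean() | |
| std_all = all_raw.std() | |
| # Guard against identical scores (std == 0): treat as "no clear winner" | |
| z_top = (all_raw[0] - mean_all) / std_all if std_all > 0 else 0.0 | |
| is_unknown = z_top < Z_THRESHOLD  # default 2.0; True means reject as unknown | |
| ``` | |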
| **Why it's automatic**: z-score is dimensionless and self-normalizing. No calibration data required. Works with 1 recording per word. Adapts to any bank size and vocabulary. | |
| **Expected values**: | |
| - Known word correctly identified: z ≈ 2.5–4.0 | |
| - Unknown word (no match): z ≈ 0.5–1.8 | |
| - Default threshold: 2.0 (tunable per-request via `unknown_z_threshold`) | |
| **Limitation**: Unreliable with < 4 words in the bank (not enough data points for a meaningful distribution). Falls back to raw gap check in that case. | |
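One more constraint is worth checking when choosing the minimum bank size. With the population standard deviation (the `np.std` default), the z-score of the top item among N scores can never exceed √(N−1). That is a general property of z-scores, not something specific to this project. The sketch below, with a hypothetical `top_z_score` helper and made-up scores, evaluates the best case (one score far above the rest, all others tied) for several bank sizes.

```python
import math

import numpy as np


def top_z_score(raw_scores) -> float:
    """z-score of the highest score relative to all scores (population std)."""
    scores = np.asarray(raw_scores, dtype=float)
    std = scores.std()
    if std == 0:
        return 0.0  # all scores identical: no winner at all
    return float((scores.max() - scores.mean()) / std)


# Best case for a winner: one score stands apart, every other score ties.
for n_words in (4, 5, 6, 10):
    best_case = [0.9] + [0.5] * (n_words - 1)
    z = top_z_score(best_case)
    bound = math.sqrt(n_words - 1)
    print(f"{n_words} words: best-case z = {z:.3f}, bound = {bound:.3f}")
```

Under this bound a 4-word bank tops out near 1.73, so the default threshold of 2.0 would reject every input there. The threshold only becomes reachable in practice from about 6 words upward, and the expected z ≈ 2.5–4.0 range needs a larger vocabulary still.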
| --- | |
| ## Approach 9: Dual-model agreement check (Layer 4) | |
| **Status: Active (current version)** | |
| Z-score alone fails when an unknown sound happens to phonetically match one known word: the fake winner creates a high z-score. But HuBERT and Wav2Vec2 have different architectures and biases, so if the unknown sound triggers a fake match in HuBERT, W2V often picks a different word. | |
| **Rule**: If HuBERT top word ≠ W2V top word → reject as unknown. Two independent models must agree for a prediction to be accepted. | |
| **Why it works**: Real known words produce a consistent phonetic signal that both models recognize. Unknown sounds that accidentally resemble one known word in one model's feature space rarely resemble the same word in the other model's space. | |
| **When it can fail**: If the unknown sound phonetically fools BOTH models into the same wrong word → both agree → passes. Rare but possible for sounds very similar to a known word. | |
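The agreement rule reduces to comparing two argmax results. In this sketch the helper names, the word labels, and the scores are all hypothetical.

```python
def top_word(scores: dict) -> str:
    """Word with the highest score in a word -> score mapping."""
    return max(scores, key=scores.get)


def models_agree(hubert_scores: dict, w2v_scores: dict) -> bool:
    """True only if both models rank the same word first."""
    return top_word(hubert_scores) == top_word(w2v_scores)


# Known word: both models pick "mayim".
hubert_known = {"mayim": 0.84, "aba": 0.61, "od": 0.58}
w2v_known = {"mayim": 0.79, "aba": 0.66, "od": 0.60}

# Unknown sound: each model is fooled toward a different word.
hubert_unknown = {"mayim": 0.63, "aba": 0.70, "od": 0.62}
w2v_unknown = {"mayim": 0.61, "aba": 0.60, "od": 0.68}

print(models_agree(hubert_known, w2v_known))
print(models_agree(hubert_unknown, w2v_unknown))
```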
| --- | |
| ## Current Architecture | |
| ``` | |
| /compute_similarities rejection logic (4 layers): | |
| Layer 1: Cosine floor (automatic, bank-calibrated) | |
|   if unknown_threshold is set AND best_raw_cosine < threshold → reject | |
|   Requires: each word has ≥ 2 recordings in bank | |
| Layer 2: DTW floor (automatic, bank-calibrated) | |
|   if dtw_calibration_threshold is set AND best_dtw < threshold → reject | |
|   Requires: each word has ≥ 2 recordings in bank | |
| Layer 3: Z-score (automatic, no calibration needed) | |
|   raw_scores = [r['mean_score'] for r in results] | |
|   z = (raw_scores[0] - mean(raw_scores)) / std(raw_scores) | |
|   if z < unknown_z_threshold (default 2.0) → reject | |
|   Fallback for < 4 words: raw cosine gap < 0.03 → reject | |
| Layer 4: Model agreement (automatic, only if W2V active) | |
|   if HuBERT top word ≠ W2V top word → reject | |
|   Runs independently; can reject even if Layers 1-3 passed. | |
| ``` | |
| Debug line shows: mean_score, cosine_floor, dtw_score, dtw_floor, z (HuBERT), w2v_z (W2V), z_threshold, agree (✓/✗), raw_gap. | |
| --- | |
| ## Approach 10: Per-word z-floor calibration (current Layer 3) | |
| **Status: Active (current version)** | |
| **Problem with global z-threshold**: Different words sit in different regions of embedding space. Words with many phonetically similar neighbors naturally produce lower z-scores even when correctly predicted. A single global threshold rejects valid predictions for "crowded" words and misses unknowns for "isolated" words. | |
| **Solution**: At bank load time, compute each word's specific z-floor from the bank's internal geometry: | |
| 1. Compute mean embedding for each word | |
| 2. For word W: measure cosine similarity of W's mean vs all other word means → distribution of "other scores" | |
| 3. Expected z-score = (1.0 - mean_other) / std_other | |
| 4. Apply reliability factor (0.60) to account for real new-recording variation vs bank-mean | |
| **At inference time**: The predicted word's own z-floor is used as the threshold instead of a global value. A word in a crowded neighborhood gets a lower threshold; an isolated word gets a higher one. | |
| **Requires**: ≥ 4 words in the bank (need meaningful distribution). Falls back to global threshold otherwise. | |
| **Result**: Correct predictions for "easy" and "hard" words both pass their respective floors; unknown sounds that don't match any word fail the floor for whichever word accidentally wins. | |
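The four calibration steps above could look roughly like this. The function name `per_word_z_floors`, the input layout (word mapped to its mean embedding), and the random example bank are assumptions; only the formula and the 0.60 reliability factor come from this log.

```python
import numpy as np


def per_word_z_floors(word_means: dict, reliability: float = 0.60, min_words: int = 4) -> dict:
    """Per-word z-floor: (1.0 - mean_other) / std_other * reliability.

    Returns an empty dict when the bank has fewer than min_words words,
    so the caller can fall back to the global threshold.
    """
    words = list(word_means)
    if len(words) < min_words:
        return {}
    # Unit-normalize the mean embeddings so dot products are cosine similarities.
    unit = np.stack([word_means[w] / np.linalg.norm(word_means[w]) for w in words])
    sims = unit @ unit.T
    floors = {}
    for i, word in enumerate(words):
        others = np.delete(sims[i], i)  # similarities to every other word mean
        std_other = others.std()
        if std_other == 0:
            continue  # degenerate geometry: leave this word on the global threshold
        floors[word] = float((1.0 - others.mean()) / std_other * reliability)
    return floors


rng = np.random.default_rng(1)
example_means = {f"word_{i}": rng.normal(size=16) for i in range(5)}
floors = per_word_z_floors(example_means)
for word, floor in floors.items():
    print(word, round(floor, 2))
```

At inference time the floor for the predicted word would be looked up with something like `floors.get(predicted_word, global_threshold)`, which gives the fallback behaviour described above.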
| --- | |
| ## Current Architecture | |
| ``` | |
| /compute_similarities rejection logic (4 layers): | |
| Layer 1: Cosine floor (bank self-calibrated, requires ≥ 2 recordings per word) | |
| Layer 2: DTW floor (bank self-calibrated, requires ≥ 2 recordings per word) | |
| Layer 3: Per-word z-score (computed at bank load from inter-word geometry, requires ≥ 4 words) | |
|   z_floor per word = (1.0 - mean_other_sims) / std_other_sims × 0.60 | |
|   Falls back to global z_threshold if per-word floor unavailable | |
|   Falls back to raw gap check (0.03) if < 4 words in bank | |
| Layer 4: Model agreement (if W2V active: HuBERT top ≠ W2V top → reject) | |
| ``` | |
| Debug line shows: mean_score, cosine_floor, z (HuBERT), thr (per-word or global, labeled), w2v_z, agree (✓/✗), gap. | |
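Put together, the four layers can be sketched as a single function that returns the name of the layer that rejected the input, or `None` if the prediction is accepted. Everything named here (`Candidate`, `rejection_reason`, the reason strings, the `make` helper) is hypothetical, and taking `best_dtw` as the maximum DTW score over all candidates is an assumption about what the real endpoint does.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

import numpy as np


@dataclass
class Candidate:
    word: str
    mean_score: float  # raw cosine similarity (pre-softmax)
    dtw_score: float


def rejection_reason(
    results: List[Candidate],  # sorted best-first by mean_score
    cosine_floor: Optional[float] = None,
    dtw_floor: Optional[float] = None,
    z_floors: Optional[Dict[str, float]] = None,
    z_threshold: float = 2.0,
    fallback_gap: float = 0.03,
    w2v_top_word: Optional[str] = None,
) -> Optional[str]:
    best = results[0]
    # Layer 1: cosine floor (skipped when no calibration is available)
    if cosine_floor is not None and best.mean_score < cosine_floor:
        return "cosine_floor"
    # Layer 2: DTW floor (assumption: best_dtw is the max over all candidates)
    best_dtw = max(r.dtw_score for r in results)
    if dtw_floor is not None and best_dtw < dtw_floor:
        return "dtw_floor"
    # Layer 3: per-word z-floor, else global threshold, else raw gap for small banks
    raw = np.array([r.mean_score for r in results])
    if len(raw) >= 4:
        std = raw.std()
        z = (raw[0] - raw.mean()) / std if std > 0 else 0.0
        threshold = (z_floors or {}).get(best.word, z_threshold)
        if z < threshold:
            return "z_score"
    elif len(raw) >= 2 and raw[0] - raw[1] < fallback_gap:
        return "raw_gap"
    # Layer 4: model agreement (only when a W2V result is supplied)
    if w2v_top_word is not None and w2v_top_word != best.word:
        return "model_disagreement"
    return None


def make(scores):
    """Build candidates w0, w1, ... reusing each score as both cosine and DTW."""
    return [Candidate(f"w{i}", s, s) for i, s in enumerate(scores)]


clear = make([0.9, 0.5, 0.5, 0.5, 0.5, 0.5])
flat = make([0.61, 0.60, 0.59, 0.58, 0.57, 0.56])
print(rejection_reason(clear, w2v_top_word="w0"))
print(rejection_reason(clear, w2v_top_word="w1"))
print(rejection_reason(flat))
```

Returning a reason string instead of a bare boolean makes it easy to surface which layer fired in the debug line.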
| 3. **Bank-specific tuning**: Calibration is computed per bank-load. If the bank changes (new words added, recordings replaced), you must reload the bank to update thresholds. | |
| 4. **No `_unknown` bank category yet**: A curated set of diverse Hebrew utterances not in the vocabulary could improve detection if included as `_unknown/` in the bank. Needs testing. | |
| --- | |
| ## Recommended settings (as of current fix) | |
| - **Mode**: `hybrid` (mean cosine pre-filter → DTW re-rank) | |
| - **min_gap slider**: Leave at 0 (disabled) and let the floor threshold handle rejection automatically | |
| - **Bank requirement**: Each word needs ≥ 2 recordings for calibration to activate | |
| - **If calibration is unavailable**: Enable min_gap slider with a value of ~0.05–0.10 | |