# Unknown Rejection: Approach Log

> **Goal**: The system must NEVER give a wrong word prediction for a non-verbal child.
> If the audio doesn't match any known word, predict `_unknown` so the parent can review.
> Wrong predictions are worse than missing ones.

---

## Why this is hard

HuBERT/Wav2Vec2 embeddings always produce _some_ cosine similarity score, even for random noise or unrelated speech. The model was never trained to say "I don't know." Every audio file gets a winner, even if it has nothing to do with the vocabulary.

The fundamental challenge: distinguish "a known word spoken imperfectly" from "a completely unknown word/sound."

---

## Approach 1: `_unknown` bank category with synthetic sounds

**Status: FAILED**

Put white noise, sine tones, or other non-speech sounds in `Bank/_unknown/`. The idea: unknown audio ≈ noise → high similarity to noise samples.

**Why it failed**: HuBERT speech embeddings live in a completely different region of embedding space from non-speech sounds. Speech vs. non-speech similarity is near zero regardless of content. A Hebrew word you've never seen still looks like speech, not like noise.

**Lesson**: `_unknown` must contain real spoken words (e.g., English words the kids would never say), not synthetic audio.

---

## Approach 2: `_unknown` bank category with real (foreign) words

**Status: Partially tried, inconclusive**

Put English or other foreign-language words in `Bank/_unknown/`. The idea: if an unknown Hebrew word is spoken, it might score more similarly to the `_unknown` cluster than to any known Hebrew word.

**Problem**: The `_unknown` cluster is inherently diverse (many different words/sounds) → low internal consistency → weak centroid → rarely wins against a focused known-word cluster, even for truly unknown input. HuBERT groups by phonetics, and any Hebrew syllable shares features with known words.

**Current bank state**: No `_unknown` folder exists in Bank-12, Bank_New, or Bank-Noa.


---

## Approach 3: Gap-based rejection (min_gap between 1st and 2nd place)

**Status: Currently active, works partially**

In `/compute_similarities`, reject if `score_1st - score_2nd < min_gap`.

Logic: a known word scores clearly higher than all others (large gap). An unknown word has no clear winner, so its scores are bunched together (small gap).

**Implementation**: `app.py` lines 487-492, controlled by `unknown_min_gap` slider in UI.

**Problems**:

- Requires manual tuning; there is no principled default value
- In `mean` mode, softmax (T=0.01) amplifies small differences, so even unknown audio can show a large gap between rank-1 and rank-2
- Doesn't account for the absolute level of scores (low but distinct ≠ match)
- User doesn't know what value to set

---

## Approach 4: Calibration floor threshold

**Status: Computed but NEVER CONNECTED (bug!); fixed in current version**

`/extract_bank` computes two calibration thresholds from bank self-similarity:

- `cosine_threshold`: 10th percentile of all pairwise cosine similarities − 0.05 margin
- `dtw_threshold`: 10th percentile of all pairwise DTW similarities − 0.05 margin

**The idea**: If two recordings of the same word have at least `cosine_threshold` similarity, then any valid input should score at least this high against its matching word. If even the best match scores below this floor, the input is unknown.

**The bug**: These thresholds were sent from frontend to the endpoint via `unknown_threshold` and `dtw_calibration_threshold` fields, but the rejection logic in `compute_similarities` never read them. Only the gap check ran.

**Fix applied**: Now Layer 1 checks cosine floor, Layer 2 checks DTW floor, Layer 3 is the gap check.

**Limitation**: Only computes calibration if words have ≥ 2 recordings. Single-sample words contribute nothing.

---

## Approach 5: Score spread / entropy check

**Status: Considered, not implemented**

Measure the standard deviation or entropy of the top-N scores.
If all scores are very similar (low spread), reject as unknown.

**Problem**: After softmax (T=0.01), the spread is always amplified, making this measure unreliable in `mean` mode.

**Could work** in `dtw` or `hybrid` mode where raw DTW scores are used (no softmax).

---

## Approach 6: Channel disagreement (ensemble)

**Status: Partially available, not used for rejection**

If HuBERT and Wav2Vec2 disagree on the top word, the prediction is uncertain. Already surfaced in the UI as a warning ("⚠ Models disagree").

**Could extend to**: if HuBERT winner ≠ Wav2Vec2 winner AND gap is small → reject to unknown.

---

## Approach 7: Raw cosine gap instead of softmax gap

**Status: Fixed (current version)**

The gap check was using the softmax-rescaled score (`r['score']`) with temperature=0.01. This temperature is so extreme that even a 0.001 raw cosine difference becomes a large softmax gap, making any `min_gap` threshold meaningless: gaps always appear large.

**Fix**: Compare `results[0]['mean_score'] - results[1]['mean_score']` (raw cosine, pre-softmax) instead of `results[0]['score'] - results[1]['score']` (softmax).

**Expected values in raw cosine space**:

- Known word, correct match: raw_gap ≈ 0.05–0.15
- Unknown word (no match): raw_gap ≈ 0.001–0.02
- Starting threshold: 0.03

**Slider range**: Changed from 0–0.5 (softmax space, useless) to 0–0.15 (raw cosine space, meaningful).

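
To see why the softmax gap was misleading, the sketch below compares the two gaps on made-up, nearly tied scores. The score values and the `softmax` helper are illustrative assumptions, not code from `app.py`; only the temperature (0.01) and the 0.03 starting threshold come from the notes above.

```python
import numpy as np

def softmax(scores, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    z = np.asarray(scores, dtype=float) / temperature
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical raw cosine scores for an unknown sound: no clear winner.
raw = [0.812, 0.808, 0.805, 0.790]
probs = softmax(raw, temperature=0.01)

raw_gap = raw[0] - raw[1]             # gap in raw cosine space
soft_gap = float(probs[0] - probs[1])  # gap after softmax rescaling

MIN_GAP = 0.03  # starting threshold in raw cosine space
rejected_by_raw_gap = raw_gap < MIN_GAP

print(f"raw_gap={raw_gap:.3f} soft_gap={soft_gap:.3f} rejected={rejected_by_raw_gap}")
```

Here the raw gap is 0.004, well under the 0.03 threshold, while the softmax gap comes out at roughly 0.14, which would have looked like a clear winner under the old check.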

---

## Current Architecture (post-fix)

```
/compute_similarities rejection logic (3 layers, checked in order):

Layer 1: Cosine floor (automatic, bank-calibrated)
  if request.unknown_threshold is set AND best_raw_cosine < threshold:
    → reject (score too low for any known word)
  Requires: each word has ≥ 2 recordings in bank

Layer 2: DTW floor (automatic, bank-calibrated)
  if request.dtw_calibration_threshold is set AND best_dtw < threshold:
    → reject (DTW score too low)
  Requires: each word has ≥ 2 recordings in bank

Layer 3: Raw cosine gap check (manual, user-controlled via slider)
  raw_gap = results[0]['mean_score'] - results[1]['mean_score']   ← pre-softmax
  if unknown_min_gap > 0 AND raw_gap < min_gap:
    → reject (no clear winner in raw cosine space)
  Start with min_gap = 0.03
```

Layers 1+2 are automatic once bank has multi-recording words. Layer 3 needs manual tuning.

---

## Approach 8: Z-score automatic rejection (current Layer 3)

**Status: Active (current version)**

Replace the manual gap slider with a fully automatic statistical test.

**Insight**: For a known word, the correct category scores much higher than all others → the top score is a strong outlier (high z-score above the mean). For an unknown word, all categories score similarly → the top score is barely above average (low z-score).

```python
import numpy as np

Z_THRESHOLD = 2.0  # default

# `results` comes from the endpoint, sorted best-first;
# 'mean_score' is the raw (pre-softmax) cosine score.
all_raw = [r['mean_score'] for r in results]
mean_all = np.mean(all_raw)
std_all = np.std(all_raw)
# Guard against identical scores (std == 0), which would divide by zero.
z_top = (all_raw[0] - mean_all) / std_all if std_all > 0 else 0.0
is_unknown = z_top < Z_THRESHOLD  # True → reject as unknown
```

**Why it's automatic**: z-score is dimensionless and self-normalizing. No calibration data required. Works with 1 recording per word. Adapts to any bank size and vocabulary.

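
One property worth checking before relying on a fixed threshold: because `np.std` defaults to the population standard deviation, the top z-score among n values can never exceed √(n−1) (Samuelson's inequality). The sketch below computes that best case for a few bank sizes. The helpers `top_z` and `max_top_z` and the 1.0/0.0 scores are illustrative assumptions, not part of `app.py`.

```python
import numpy as np

def top_z(raw_scores):
    """z-score of the highest score, using the population std (np.std default)."""
    scores = np.asarray(raw_scores, dtype=float)
    std = scores.std()
    if std == 0:
        return 0.0  # all scores identical: no outlier at all
    return float((scores.max() - scores.mean()) / std)

def max_top_z(n_words):
    """Best case for a bank of n_words: one perfect match, every other word at 0."""
    return top_z([1.0] + [0.0] * (n_words - 1))

for n in (4, 5, 8, 17):
    # The two printed values should agree: achieved best case vs. sqrt(n - 1).
    print(n, round(max_top_z(n), 3), round(float(np.sqrt(n - 1)), 3))
```

Under this bound a 4-word bank tops out near 1.73, so a threshold of 2.0 would reject every input at that size, and reaching z ≈ 2.5 needs at least 8 words in the bank.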

**Expected values**:

- Known word correctly identified: z ≈ 2.5–4.0
- Unknown word (no match): z ≈ 0.5–1.8
- Default threshold: 2.0 (tunable per-request via `unknown_z_threshold`)

**Limitation**: Unreliable with < 4 words in the bank (not enough data points for a meaningful distribution). Falls back to raw gap check in that case.

---

## Approach 9: Dual-model agreement check (Layer 4)

**Status: Active (current version)**

Z-score alone fails when an unknown sound happens to phonetically match one known word: the fake winner creates a high z-score. But HuBERT and Wav2Vec2 (W2V) have different architectures and biases, so if the unknown sound triggers a fake match in HuBERT, W2V often picks a different word.

**Rule**: If HuBERT top word ≠ W2V top word → reject as unknown. Two independent models must agree for a prediction to be accepted.

**Why it works**: Real known words produce a consistent phonetic signal that both models recognize. Unknown sounds that accidentally resemble one known word in one model's feature space rarely resemble the same word in the other model's space.

**When it can fail**: If the unknown sound phonetically fools BOTH models into the same wrong word → both agree → passes. Rare but possible for sounds very similar to a known word.

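
A minimal sketch of the agreement rule, assuming each model returns a best-first list of dicts. The `'word'` key, the helper names, and the sample words are hypothetical; only `'mean_score'` appears in the notes above.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical result shape: {'word': str, 'mean_score': float}, sorted best-first.
Result = Mapping[str, object]

def top_word(results: Sequence[Result]) -> Optional[str]:
    """Return the best-ranked word, or None if there are no results."""
    return str(results[0]["word"]) if results else None

def models_agree(hubert_results: Sequence[Result],
                 w2v_results: Optional[Sequence[Result]]) -> bool:
    """Accept only if both models pick the same top word.

    If W2V is inactive (None), the check is skipped and passes.
    """
    if w2v_results is None:
        return True
    hubert_top = top_word(hubert_results)
    return hubert_top is not None and hubert_top == top_word(w2v_results)

hubert = [{"word": "mayim", "mean_score": 0.84}, {"word": "aba", "mean_score": 0.71}]
w2v_same = [{"word": "mayim", "mean_score": 0.80}, {"word": "aba", "mean_score": 0.69}]
w2v_diff = [{"word": "aba", "mean_score": 0.77}, {"word": "mayim", "mean_score": 0.75}]

print(models_agree(hubert, w2v_same), models_agree(hubert, w2v_diff))
```

Treating an empty HuBERT result list as disagreement is a design choice in this sketch: with no winner at all, rejecting is the safer outcome given the stated goal.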

---

## Current Architecture

```
/compute_similarities rejection logic (4 layers):

Layer 1: Cosine floor (automatic, bank-calibrated)
  if unknown_threshold is set AND best_raw_cosine < threshold → reject
  Requires: each word has ≥ 2 recordings in bank

Layer 2: DTW floor (automatic, bank-calibrated)
  if dtw_calibration_threshold is set AND best_dtw < threshold → reject
  Requires: each word has ≥ 2 recordings in bank

Layer 3: Z-score (automatic, no calibration needed)
  raw_scores = [r['mean_score'] for r in results]
  z = (raw_scores[0] - mean(raw_scores)) / std(raw_scores)
  if z < unknown_z_threshold (default 2.0) → reject
  Fallback for < 4 words: raw cosine gap < 0.03 → reject

Layer 4: Model agreement (automatic, only if W2V active)
  if HuBERT top word ≠ W2V top word → reject
  Runs independently; can reject even if Layers 1-3 passed.
```

Debug line shows: mean_score, cosine_floor, dtw_score, dtw_floor, z (HuBERT), w2v_z (W2V), z_threshold, agree (✓/✗), raw_gap.

---

## Approach 10: Per-word z-floor calibration (current Layer 3)

**Status: Active (current version)**

**Problem with global z-threshold**: Different words sit in different regions of embedding space. Words with many phonetically similar neighbors naturally produce lower z-scores even when correctly predicted. A single global threshold rejects valid predictions for "crowded" words and misses unknowns for "isolated" words.

**Solution**: At bank load time, compute each word's specific z-floor from the bank's internal geometry:

1. Compute mean embedding for each word
2. For word W: measure cosine similarity of W's mean vs all other word means → distribution of "other scores"
3. Expected z-score = (1.0 - mean_other) / std_other
4. Apply reliability factor (0.60) to account for real new-recording variation vs bank-mean

**At inference time**: The predicted word's own z-floor is used as the threshold instead of a global value. A word in a crowded neighborhood gets a lower threshold; an isolated word gets a higher one.

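
The four calibration steps above can be sketched as follows. This is an illustration under assumptions, not the code in `app.py`: the function name, the dict-of-arrays bank layout, and the toy 3-dimensional embeddings are all hypothetical; the formula, the 0.60 reliability factor, and the 4-word minimum come from the notes.

```python
import numpy as np

RELIABILITY = 0.60  # reliability factor from the notes
MIN_WORDS = 4       # below this, fall back to the global threshold

def per_word_z_floors(bank, reliability=RELIABILITY):
    """Map each word to its z-floor.

    bank: dict of word -> array of shape (n_recordings, dim).
    Returns an empty dict when the bank has fewer than MIN_WORDS words.
    """
    words = sorted(bank)
    if len(words) < MIN_WORDS:
        return {}
    # Step 1: mean embedding per word, normalized so dot product = cosine.
    means = np.stack([np.asarray(bank[w], dtype=float).mean(axis=0) for w in words])
    unit = means / np.linalg.norm(means, axis=1, keepdims=True)
    sims = unit @ unit.T
    floors = {}
    for i, word in enumerate(words):
        others = np.delete(sims[i], i)  # Step 2: similarity to every other word mean
        std_other = others.std()
        if std_other == 0:
            continue  # floor undefined; caller falls back to the global threshold
        # Steps 3 and 4: expected z-score, scaled by the reliability factor.
        floors[word] = float(reliability * (1.0 - others.mean()) / std_other)
    return floors

# Toy bank: a, b, c are close neighbors ("crowded"); d is far away ("isolated").
bank = {
    "a": [[1.0, 0.0, 0.0]],
    "b": [[1.0, 0.2, 0.0]],
    "c": [[1.0, 0.0, 0.2]],
    "d": [[0.0, 1.0, 1.0]],
}
floors = per_word_z_floors(bank)
print({w: round(f, 2) for w, f in floors.items()})
```

In this toy bank the isolated word `d` ends up with a much higher floor than the crowded word `a`, matching the behavior described above.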

**Requires**: ≥ 4 words in the bank (need meaningful distribution). Falls back to global threshold otherwise.

**Result**: Correct predictions for "easy" and "hard" words both pass their respective floors; unknown sounds that don't match any word fail the floor for whichever word accidentally wins.

---

## Current Architecture

```
/compute_similarities rejection logic (4 layers):

Layer 1: Cosine floor (bank self-calibrated, requires ≥ 2 recordings per word)

Layer 2: DTW floor (bank self-calibrated, requires ≥ 2 recordings per word)

Layer 3: Per-word z-score (computed at bank load from inter-word geometry, requires ≥ 4 words)
  z_floor per word = (1.0 - mean_other_sims) / std_other_sims × 0.60
  Falls back to global z_threshold if per-word floor unavailable
  Falls back to raw gap check (0.03) if < 4 words in bank

Layer 4: Model agreement (if W2V active: HuBERT top ≠ W2V top → reject)
```

Debug line shows: mean_score, cosine_floor, z (HuBERT), thr (per-word or global, labeled), w2v_z, agree (✓/✗), gap.

3. **Bank-specific tuning**: Calibration is computed per bank-load. If the bank changes (new words added, recordings replaced), you must reload the bank to update thresholds.
4. **No `_unknown` bank category yet**: A curated set of diverse Hebrew utterances not in the vocabulary could improve detection if included as `_unknown/` in the bank. Needs testing.

---

## Recommended settings (as of current fix)

- **Mode**: `hybrid` (mean cosine pre-filter → DTW re-rank)
- **min_gap slider**: Leave at 0 (disabled) and let the floor threshold handle rejection automatically
- **Bank requirement**: Each word needs ≥ 2 recordings for calibration to activate
- **If calibration is unavailable**: Enable min_gap slider with a value of ~0.05–0.10
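
For reference, the four layers in the final Current Architecture section could be combined roughly as below. This is a sketch under assumptions: the `Thresholds` dataclass, the `rejection_reason` function, and the reason strings are hypothetical names, and the real endpoint may order or wire these checks differently. The default values (2.0, 0.03, 4 words) are the ones quoted in this log.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

import numpy as np

@dataclass
class Thresholds:
    cosine_floor: Optional[float] = None  # Layer 1, None = not calibrated
    dtw_floor: Optional[float] = None     # Layer 2, None = not calibrated
    z_threshold: float = 2.0              # Layer 3 global fallback
    min_gap: float = 0.03                 # Layer 3 fallback for small banks
    min_words_for_z: int = 4

def rejection_reason(raw_scores: Sequence[float],
                     best_dtw: Optional[float],
                     thr: Thresholds,
                     word_z_floor: Optional[float] = None,
                     models_agree: Optional[bool] = None) -> Optional[str]:
    """Return the name of the layer that rejects, or None if the input is accepted.

    raw_scores: raw cosine scores per word, sorted best-first.
    models_agree: None means W2V is inactive and Layer 4 is skipped.
    """
    scores = np.asarray(raw_scores, dtype=float)
    if thr.cosine_floor is not None and scores[0] < thr.cosine_floor:
        return "cosine_floor"
    if thr.dtw_floor is not None and best_dtw is not None and best_dtw < thr.dtw_floor:
        return "dtw_floor"
    if len(scores) >= thr.min_words_for_z:
        std = scores.std()
        z = (scores[0] - scores.mean()) / std if std > 0 else 0.0
        # Prefer the predicted word's own floor; otherwise use the global value.
        floor = word_z_floor if word_z_floor is not None else thr.z_threshold
        if z < floor:
            return "z_score"
    elif len(scores) >= 2 and scores[0] - scores[1] < thr.min_gap:
        return "raw_gap"
    if models_agree is False:
        return "model_disagreement"
    return None

clear_winner = [0.9] + [0.5] * 7
print(rejection_reason(clear_winner, best_dtw=None, thr=Thresholds(), models_agree=True))
```

Returning a reason string instead of a bare boolean is a choice made for this sketch; it lines up with the debug line, where knowing which layer fired is what makes threshold tuning possible.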