# Unknown Rejection: Approach Log
> **Goal**: The system must NEVER give a wrong word prediction for a non-verbal child.
> If the audio doesn't match any known word, predict `_unknown` so the parent can review.
> Wrong predictions are worse than missing ones.
---
## Why this is hard
HuBERT/Wav2Vec2 embeddings always produce _some_ cosine similarity score, even for random noise or unrelated speech. The model was never trained to say "I don't know." Every audio file gets a winner, even if it has nothing to do with the vocabulary.
The fundamental challenge: distinguish "a known word spoken imperfectly" from "a completely unknown word/sound."
---
## Approach 1: `_unknown` bank category with synthetic sounds
**Status: FAILED**
Put white noise, sine tones, or other non-speech sounds in `Bank/_unknown/`.
The idea: unknown audio ≈ noise → high similarity to noise samples.
**Why it failed**: HuBERT speech embeddings live in a completely different region of embedding space from non-speech sounds. Speech vs. non-speech similarity is near zero regardless of content. A Hebrew word you've never seen still looks like speech, not like noise.
**Lesson**: `_unknown` must contain real spoken words (e.g., English words the kids would never say), not synthetic audio.
---
## Approach 2: `_unknown` bank category with real (foreign) words
**Status: Partially tried, inconclusive**
Put English or other foreign-language words in `Bank/_unknown/`.
The idea: if an unknown Hebrew word is spoken, it might score more similarly to the `_unknown` cluster than to any known Hebrew word.
**Problem**: The `_unknown` cluster is inherently diverse (many different words/sounds) → low internal consistency → weak centroid → rarely wins against a focused known-word cluster, even for truly unknown input. HuBERT groups by phonetics, and any Hebrew syllable shares features with known words.
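The "weak centroid" effect can be illustrated with synthetic vectors. This is only a geometric sketch: the vectors below are random stand-ins, not HuBERT embeddings, and `dim`, `n`, and the noise level are arbitrary assumptions. It shows that the mean of many unrelated unit vectors is short (the members disagree about direction), while the mean of several takes of one word stays close to unit length.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 64, 20  # arbitrary sizes chosen for the illustration


def unit(v):
    """Scale vectors to unit length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


# Focused cluster: 20 noisy takes of one "word" direction.
base = unit(rng.normal(size=dim))
focused = unit(base + 0.05 * rng.normal(size=(n, dim)))

# Diverse cluster: 20 unrelated directions, like a mixed `_unknown` folder.
diverse = unit(rng.normal(size=(n, dim)))

# The centroid norm measures internal consistency (1.0 = all identical).
focused_norm = float(np.linalg.norm(focused.mean(axis=0)))
diverse_norm = float(np.linalg.norm(diverse.mean(axis=0)))
print(f"focused centroid norm: {focused_norm:.2f}")
print(f"diverse centroid norm: {diverse_norm:.2f}")
```

A short centroid means no single direction represents the category well, which is consistent with the `_unknown` cluster rarely beating a focused word cluster.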
**Current bank state**: No `_unknown` folder exists in Bank-12, Bank_New, or Bank-Noa.
---
## Approach 3: Gap-based rejection (min_gap between 1st and 2nd place)
**Status: Currently active, works partially**
In `/compute_similarities`, reject if `score_1st - score_2nd < min_gap`.
Logic: a known word scores clearly higher than all others (large gap). An unknown word has no clear winner: its scores are bunched together (small gap).
**Implementation**: `app.py` lines 487-492, controlled by `unknown_min_gap` slider in UI.
**Problems**:
- Requires manual tuning, with no principled default value
- In `mean` mode, softmax (T=0.01) amplifies small differences, so even unknown audio can show a large gap between rank-1 and rank-2
- Doesn't account for the absolute level of scores (a low but distinct top score is not a match)
- User doesn't know what value to set
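A minimal sketch of the gap rule, assuming scores arrive as a word-to-score mapping. The function name `reject_by_gap`, the example words, and the score values are hypothetical and are not taken from `app.py`.

```python
def reject_by_gap(scores, min_gap):
    """Return True when the top two scores are too close to call.

    scores:  dict mapping word -> similarity score (hypothetical shape)
    min_gap: required margin between 1st and 2nd place; <= 0 disables the check
    """
    if min_gap <= 0 or len(scores) < 2:
        return False
    ranked = sorted(scores.values(), reverse=True)
    return (ranked[0] - ranked[1]) < min_gap


# Made-up scores: a clear winner vs. a bunched-together result.
known = {"mama": 0.82, "aba": 0.70, "mayim": 0.66}
unknown = {"mama": 0.61, "aba": 0.60, "mayim": 0.59}

print(reject_by_gap(known, min_gap=0.03))    # False: gap of about 0.12
print(reject_by_gap(unknown, min_gap=0.03))  # True: gap of about 0.01
```

Note that the rule looks only at the difference between the top two scores, which is why it cannot catch a result whose best score is low in absolute terms.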
---
## Approach 4: Calibration floor threshold
**Status: Computed but NEVER CONNECTED (bug!). Fixed in current version**
`/extract_bank` computes two calibration thresholds from bank self-similarity:
- `cosine_threshold`: 10th percentile of all pairwise cosine similarities, minus a 0.05 margin
- `dtw_threshold`: 10th percentile of all pairwise DTW similarities, minus a 0.05 margin
**The idea**: If two recordings of the same word have at least `cosine_threshold` similarity, then any valid input should score at least this high against its matching word. If even the best match scores below this floor, the input is unknown.
**The bug**: These thresholds were sent from frontend to the endpoint via `unknown_threshold` and `dtw_calibration_threshold` fields, but the rejection logic in `compute_similarities` never read them. Only the gap check ran.
**Fix applied**: Now Layer 1 checks cosine floor, Layer 2 checks DTW floor, Layer 3 is the gap check.
**Limitation**: Only computes calibration if words have ≥ 2 recordings. Single-sample words contribute nothing.
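The cosine floor could be computed along these lines. This is a sketch under one assumption that the text implies but does not state outright: that "pairwise" means pairs of recordings of the same word. The function names and the `bank` layout (word mapped to a list of embedding vectors) are hypothetical.

```python
from itertools import combinations

import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def cosine_floor(bank, percentile=10.0, margin=0.05):
    """Return the calibration floor, or None if it cannot be computed.

    bank: dict mapping word -> list of 1-D embedding arrays (assumed layout).
    Only words with at least two recordings contribute pairs.
    """
    sims = [
        cosine(a, b)
        for recordings in bank.values()
        for a, b in combinations(recordings, 2)
    ]
    if not sims:
        return None  # no word has >= 2 recordings
    return float(np.percentile(sims, percentile)) - margin


# Synthetic bank: one word with three takes, one word with a single take.
rng = np.random.default_rng(0)
base = rng.normal(size=16)
bank = {
    "word_a": [base + 0.1 * rng.normal(size=16) for _ in range(3)],
    "word_b": [rng.normal(size=16)],
}
floor = cosine_floor(bank)
single_only = cosine_floor({"word_b": bank["word_b"]})
print(floor, single_only)
```

With only single-recording words the function returns `None`, which corresponds to the case where calibration is unavailable and another rejection rule has to take over.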
---
## Approach 5: Score spread / entropy check
**Status: Considered, not implemented**
Measure the standard deviation or entropy of the top-N scores.
If all scores are very similar (low spread), reject as unknown.
**Problem**: After softmax (T=0.01), the spread is always amplified, making this measure unreliable in `mean` mode.
**Could work** in `dtw` or `hybrid` mode where raw DTW scores are used (no softmax).
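Since this approach was not implemented, the following is only a sketch of what a spread check on raw (pre-softmax) scores might look like. The function names, `n`, and `min_spread` are hypothetical, and the score lists are made up. An entropy variant would additionally need a choice of how to turn raw scores into a distribution, which is left out here.

```python
import numpy as np


def top_n_spread(raw_scores, n=5):
    """Standard deviation of the n highest raw scores."""
    top = np.sort(np.asarray(raw_scores, dtype=float))[::-1][:n]
    return float(np.std(top))


def reject_by_spread(raw_scores, min_spread, n=5):
    """Reject when the top-n scores are nearly indistinguishable."""
    return top_n_spread(raw_scores, n) < min_spread


peaked = [0.82, 0.66, 0.62, 0.60, 0.58]  # one score stands out
flat = [0.61, 0.60, 0.60, 0.59, 0.59]    # no score stands out

print(top_n_spread(peaked), top_n_spread(flat))
print(reject_by_spread(peaked, min_spread=0.02))  # False
print(reject_by_spread(flat, min_spread=0.02))    # True
```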
---
## Approach 6: Channel disagreement (ensemble)
**Status: Partially available, not used for rejection**
If HuBERT and Wav2Vec2 disagree on the top word, the prediction is uncertain.
Already surfaced in the UI as a warning ("⚠ Models disagree").
**Could extend to**: if HuBERT winner ≠ Wav2Vec2 winner AND gap is small → reject to unknown.
---
## Approach 7: Raw cosine gap instead of softmax gap
**Status: Fixed (current version)**
The gap check was using the softmax-rescaled score (`r['score']`) with temperature=0.01.
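This temperature is so extreme that small raw cosine differences are magnified exponentially. For a standard softmax, the probability ratio between the top two entries is exp(Δ/T), so a raw difference of only 0.02 already gives a ratio of about 7.4, and 0.05 gives about 148. Gaps therefore almost always appear large, which makes any `min_gap` threshold in softmax space meaningless.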
**Fix**: Compare `results[0]['mean_score'] - results[1]['mean_score']` (raw cosine, pre-softmax) instead of `results[0]['score'] - results[1]['score']` (softmax).
**Expected values in raw cosine space**:
- Known word, correct match: raw_gap ≈ 0.05–0.15
- Unknown word (no match): raw_gap ≈ 0.001–0.02
- Starting threshold: 0.03
**Slider range**: Changed from 0–0.5 (softmax space, useless) to 0–0.15 (raw cosine space, meaningful).
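The amplification is easy to reproduce. The sketch below assumes a standard temperature softmax over the raw scores. The exact rescaling in `app.py` is not shown in this log, and the score values are made up.

```python
import numpy as np


def softmax(scores, temperature):
    """Temperature softmax with the usual max-subtraction for stability."""
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


# Made-up raw cosine scores for an "unknown" input: top two differ by 0.02.
raw = np.array([0.62, 0.60, 0.59, 0.58])
probs = softmax(raw, temperature=0.01)

raw_gap = float(raw[0] - raw[1])
softmax_gap = float(probs[0] - probs[1])
print(f"raw gap:     {raw_gap:.3f}")
print(f"softmax gap: {softmax_gap:.3f}")
```

A raw gap of 0.02, which falls in the "unknown" range above, turns into a softmax gap of roughly 0.7, so a threshold on the softmax gap cannot separate known from unknown inputs.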
---
## Current Architecture (post-fix)
```
/compute_similarities rejection logic (3 layers, checked in order):
Layer 1: Cosine floor (automatic, bank-calibrated)
    if request.unknown_threshold is set AND best_raw_cosine < threshold:
        → reject (score too low for any known word)
    Requires: each word has ≥ 2 recordings in bank
Layer 2: DTW floor (automatic, bank-calibrated)
    if request.dtw_calibration_threshold is set AND best_dtw < threshold:
        → reject (DTW score too low)
    Requires: each word has ≥ 2 recordings in bank
Layer 3: Raw cosine gap check (manual, user-controlled via slider)
    raw_gap = results[0]['mean_score'] - results[1]['mean_score']  ← pre-softmax
    if unknown_min_gap > 0 AND raw_gap < min_gap:
        → reject (no clear winner in raw cosine space)
    Start with min_gap = 0.03
```
Layers 1+2 are automatic once bank has multi-recording words. Layer 3 needs manual tuning.
---
## Approach 8: Z-score automatic rejection (current Layer 3)
**Status: Active (current version)**
Replace the manual gap slider with a fully automatic statistical test.
**Insight**: For a known word, the correct category scores much higher than all others → the top score is a strong outlier (high z-score above the mean). For an unknown word, all categories score similarly → the top score is barely above average (low z-score).
```python
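import numpy as np

Z_THRESHOLD = 2.0  # default; tunable per request

# `results` is assumed sorted best-first, as in the gap check
all_raw = np.array([r['mean_score'] for r in results])  # raw cosine scores
mean_all = all_raw.mean()
std_all = all_raw.std()
# Identical scores give std == 0: no winner stands out, so treat as z = 0
z_top = (all_raw[0] - mean_all) / std_all if std_all > 0 else 0.0
reject_as_unknown = z_top < Z_THRESHOLD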
```
**Why it's automatic**: z-score is dimensionless and self-normalizing. No calibration data required. Works with 1 recording per word. Adapts to any bank size and vocabulary.
**Expected values**:
- Known word correctly identified: z ≈ 2.5–4.0
- Unknown word (no match): z ≈ 0.5–1.8
- Default threshold: 2.0 (tunable per-request via `unknown_z_threshold`)
**Limitation**: Unreliable with < 4 words in the bank (not enough data points for a meaningful distribution). Falls back to raw gap check in that case.
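The small-bank limitation has a concrete arithmetic form. Assuming the z-score uses `np.std` with its default population formula (as in the snippet above), the top z-score among N scores can never exceed sqrt(N − 1), a bound reached only when all other scores are equal. The sketch below checks this numerically. The helper `top_z` is a hypothetical name.

```python
import numpy as np


def top_z(raw_scores):
    """z-score of the highest score, using population std (np.std default)."""
    s = np.asarray(raw_scores, dtype=float)
    std = s.std()
    return 0.0 if std == 0 else float((s.max() - s.mean()) / std)


# Best possible case for each bank size: one perfect winner, all others tied.
for n in (4, 5, 6, 12):
    best_case = [1.0] + [0.0] * (n - 1)
    print(n, round(top_z(best_case), 3), round(float(np.sqrt(n - 1)), 3))
```

With 4 words the ceiling is about 1.73 and with 5 words it is exactly 2.0, so under this assumption a threshold of 2.0 only becomes meaningfully reachable from about 6 words upward, and the `z ≈ 2.5–4.0` range requires a larger vocabulary still (z = 2.5 needs at least 8 words).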
---
## Approach 9: Dual-model agreement check (Layer 4)
**Status: Active (current version)**
Z-score alone fails when an unknown sound happens to phonetically match one known word: the fake winner creates a high z-score. But HuBERT and Wav2Vec2 have different architectures and biases, so if the unknown sound triggers a fake match in HuBERT, W2V often picks a different word.
**Rule**: If HuBERT top word ≠ W2V top word → reject as unknown. Two independent models must agree for a prediction to be accepted.
**Why it works**: Real known words produce a consistent phonetic signal that both models recognize. Unknown sounds that accidentally resemble one known word in one model's feature space rarely resemble the same word in the other model's space.
**When it can fail**: If the unknown sound phonetically fools BOTH models into the same wrong word → both agree → passes. Rare but possible for sounds very similar to a known word.
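The agreement rule is small enough to sketch directly. The result shape (a best-first list of dicts with a `'word'` key) and the function name are assumptions made for illustration, as are the example words.

```python
def models_agree(hubert_results, w2v_results):
    """Return True if both models pick the same top word.

    Each argument is a list of dicts sorted best-first, each with a 'word'
    key (assumed shape). If W2V is inactive (None or empty list), the check
    is skipped and counts as agreement.
    """
    if not w2v_results:
        return True
    return hubert_results[0]["word"] == w2v_results[0]["word"]


hubert = [{"word": "mama"}, {"word": "aba"}]
w2v_same = [{"word": "mama"}, {"word": "mayim"}]
w2v_other = [{"word": "aba"}, {"word": "mama"}]

print(models_agree(hubert, w2v_same))   # True  -> prediction may be accepted
print(models_agree(hubert, w2v_other))  # False -> reject as unknown
print(models_agree(hubert, None))       # True  -> W2V inactive, check skipped
```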
---
## Current Architecture
```
/compute_similarities rejection logic (4 layers):
Layer 1: Cosine floor (automatic, bank-calibrated)
    if unknown_threshold is set AND best_raw_cosine < threshold → reject
    Requires: each word has ≥ 2 recordings in bank
Layer 2: DTW floor (automatic, bank-calibrated)
    if dtw_calibration_threshold is set AND best_dtw < threshold → reject
    Requires: each word has ≥ 2 recordings in bank
Layer 3: Z-score (automatic, no calibration needed)
    raw_scores = [r['mean_score'] for r in results]
    z = (raw_scores[0] - mean(raw_scores)) / std(raw_scores)
    if z < unknown_z_threshold (default 2.0) → reject
    Fallback for < 4 words: raw cosine gap < 0.03 → reject
Layer 4: Model agreement (automatic, only if W2V active)
    if HuBERT top word ≠ W2V top word → reject
    Runs independently; can reject even if Layers 1-3 passed.
```
Debug line shows: mean_score, cosine_floor, dtw_score, dtw_floor, z (HuBERT), w2v_z (W2V), z_threshold, agree (✓/✗), raw_gap.
---
## Approach 10: Per-word z-floor calibration (current Layer 3)
**Status: Active (current version)**
**Problem with global z-threshold**: Different words sit in different regions of embedding space. Words with many phonetically similar neighbors naturally produce lower z-scores even when correctly predicted. A single global threshold rejects valid predictions for "crowded" words and misses unknowns for "isolated" words.
**Solution**: At bank load time, compute each word's specific z-floor from the bank's internal geometry:
1. Compute mean embedding for each word
2. For word W: measure cosine similarity of W's mean vs all other word means → distribution of "other scores"
3. Expected z-score = (1.0 - mean_other) / std_other
4. Apply reliability factor (0.60) to account for real new-recording variation vs bank-mean
**At inference time**: The predicted word's own z-floor is used as the threshold instead of a global value. A word in a crowded neighborhood gets a lower threshold; an isolated word gets a higher one.
**Requires**: ≥ 4 words in the bank (need meaningful distribution). Falls back to global threshold otherwise.
**Result**: Correct predictions for "easy" and "hard" words both pass their respective floors; unknown sounds that don't match any word fail the floor for whichever word accidentally wins.
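The four calibration steps above can be sketched as follows. The function name, the `bank` layout (word mapped to a list of embedding vectors), and the synthetic data are assumptions; only the formula and the 0.60 reliability factor come from the steps listed.

```python
import numpy as np


def per_word_z_floors(bank, reliability=0.60, min_words=4):
    """Compute a z-floor for each word from inter-word geometry.

    bank: dict mapping word -> list of 1-D embedding arrays (assumed layout).
    Returns {} when the bank has fewer than `min_words` words, so the caller
    can fall back to the global threshold.
    """
    words = list(bank)
    if len(words) < min_words:
        return {}

    # Step 1: mean embedding per word, normalized so dot product = cosine.
    means = np.stack([np.mean(bank[w], axis=0) for w in words]).astype(float)
    means = means / np.linalg.norm(means, axis=1, keepdims=True)
    sims = means @ means.T

    floors = {}
    for i, word in enumerate(words):
        # Step 2: similarities of this word's mean to every other word's mean.
        others = np.delete(sims[i], i)
        std_other = others.std()
        if std_other == 0:
            continue  # degenerate geometry: leave this word on the global threshold
        # Steps 3-4: expected z-score, scaled by the reliability factor.
        floors[word] = float((1.0 - others.mean()) / std_other * reliability)
    return floors


# Synthetic bank: 5 words, 2 recordings each.
rng = np.random.default_rng(0)
bank = {f"word_{i}": [rng.normal(size=32) for _ in range(2)] for i in range(5)}
floors = per_word_z_floors(bank)
small_bank_floors = per_word_z_floors({w: bank[w] for w in list(bank)[:3]})
print(floors)
print(small_bank_floors)  # {} -> fall back to the global threshold
```

At inference time the caller would look up `floors.get(predicted_word, global_threshold)` and compare the observed z-score against it.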
---
## Current Architecture
```
/compute_similarities rejection logic (4 layers):
Layer 1: Cosine floor (bank self-calibrated, requires ≥ 2 recordings per word)
Layer 2: DTW floor (bank self-calibrated, requires ≥ 2 recordings per word)
Layer 3: Per-word z-score (computed at bank load from inter-word geometry, requires ≥ 4 words)
    z_floor per word = (1.0 - mean_other_sims) / std_other_sims × 0.60
    Falls back to global z_threshold if per-word floor unavailable
    Falls back to raw gap check (0.03) if < 4 words in bank
Layer 4: Model agreement (if W2V active: HuBERT top ≠ W2V top → reject)
```
Debug line shows: mean_score, cosine_floor, z (HuBERT), thr (per-word or global, labeled), w2v_z, agree (✓/✗), gap.
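Put together, the four layers might be chained as in the sketch below. This is not the code from `app.py`: the function name, parameter names, result shape (best-first lists of dicts with `'word'` and `'mean_score'`), and example scores are all assumptions made for illustration.

```python
import numpy as np


def should_reject(hubert, w2v=None, cosine_floor=None, dtw_floor=None,
                  best_dtw=None, z_floors=None, z_threshold=2.0,
                  fallback_gap=0.03):
    """Return (rejected, reason) for one query, checking layers in order."""
    raw = np.array([r["mean_score"] for r in hubert], dtype=float)

    # Layer 1: cosine floor (only if calibration produced a threshold).
    if cosine_floor is not None and raw[0] < cosine_floor:
        return True, "cosine_floor"

    # Layer 2: DTW floor (only if a DTW score and threshold are available).
    if dtw_floor is not None and best_dtw is not None and best_dtw < dtw_floor:
        return True, "dtw_floor"

    # Layer 3: per-word z-floor, else global threshold, else raw gap.
    if len(raw) >= 4:
        std = raw.std()
        z = (raw[0] - raw.mean()) / std if std > 0 else 0.0
        threshold = (z_floors or {}).get(hubert[0]["word"], z_threshold)
        if z < threshold:
            return True, "z_score"
    elif len(raw) >= 2 and raw[0] - raw[1] < fallback_gap:
        return True, "raw_gap"

    # Layer 4: model agreement (only if W2V results are present).
    if w2v and w2v[0]["word"] != hubert[0]["word"]:
        return True, "model_disagreement"

    return False, "accepted"


def make(scores):
    """Build a best-first result list from made-up scores."""
    return [{"word": f"w{i}", "mean_score": s} for i, s in enumerate(scores)]


known = make([0.85, 0.61, 0.60, 0.60, 0.59, 0.58])
unknown = make([0.62, 0.61, 0.60, 0.60, 0.59, 0.58])

print(should_reject(known, w2v=known))               # (False, 'accepted')
print(should_reject(unknown))                        # (True, 'z_score')
print(should_reject(known, w2v=unknown[1:]))         # (True, 'model_disagreement')
print(should_reject(known, cosine_floor=0.90))       # (True, 'cosine_floor')
```

Returning the reason alongside the decision makes it straightforward to populate a debug line like the one described above.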
3. **Bank-specific tuning**: Calibration is computed per bank-load. If the bank changes (new words added, recordings replaced), you must reload the bank to update thresholds.
4. **No `_unknown` bank category yet**: A curated set of diverse Hebrew utterances not in the vocabulary could improve detection if included as `_unknown/` in the bank. Needs testing.
---
## Recommended settings (as of current fix)
- **Mode**: `hybrid` (mean cosine pre-filter → DTW re-rank)
- **min_gap slider**: Leave at 0 (disabled) and let the floor threshold handle rejection automatically
- **Bank requirement**: Each word needs ≥ 2 recordings for calibration to activate
- **If calibration is unavailable**: Enable min_gap slider with a value of ~0.05–0.10