# Unknown Rejection — Approach Log

> **Goal**: The system must NEVER give a wrong word prediction for a non-verbal child.  
> If the audio doesn't match any known word, predict `_unknown` so the parent can review.  
> Wrong predictions are worse than missing ones.

---

## Why this is hard

HuBERT/Wav2Vec2 embeddings always produce _some_ cosine similarity score — even for random noise or unrelated speech. The model was never trained to say "I don't know." Every audio file gets a winner, even if it has nothing to do with the vocabulary.

The fundamental challenge: distinguish "a known word spoken imperfectly" from "a completely unknown word/sound."

---

## Approach 1: `_unknown` bank category with synthetic sounds
**Status: FAILED**

Put white noise, sine tones, or other non-speech sounds in `Bank/_unknown/`.  
The idea: unknown audio ≈ noise → high similarity to noise samples.

**Why it failed**: HuBERT speech embeddings live in a completely different region of embedding space from non-speech sounds. Speech vs. non-speech similarity is near zero regardless of content. A Hebrew word you've never seen still looks like speech, not like noise.

**Lesson**: `_unknown` must contain real spoken words (e.g., English words the kids would never say), not synthetic audio.

---

## Approach 2: `_unknown` bank category with real (foreign) words
**Status: Partially tried, inconclusive**

Put English or other foreign-language words in `Bank/_unknown/`.  
The idea: if an unknown Hebrew word is spoken, it might score more similarly to the `_unknown` cluster than to any known Hebrew word.

**Problem**: The `_unknown` cluster is inherently diverse (many different words/sounds) → low internal consistency → weak centroid → rarely wins against a focused known-word cluster, even for truly unknown input. HuBERT groups by phonetics, and any Hebrew syllable shares features with known words.

**Current bank state**: No `_unknown` folder exists in Bank-12, Bank_New, or Bank-Noa.

---

## Approach 3: Gap-based rejection (min_gap between 1st and 2nd place)
**Status: Currently active, works partially**

In `/compute_similarities`, reject if `score_1st - score_2nd < min_gap`.  
Logic: a known word scores clearly higher than all others (large gap). An unknown word has no clear winner — scores are bunched together (small gap).

**Implementation**: `app.py` lines 487-492, controlled by `unknown_min_gap` slider in UI.

**Problems**:
- Requires manual tuning — no principled default value
- In `mean` mode, softmax (T=0.01) amplifies small differences, so even unknown audio can show a large gap between rank-1 and rank-2
- Doesn't account for the absolute level of scores (low but distinct ≠ match)
- User doesn't know what value to set
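
A minimal sketch of the gap check, to make the failure mode concrete. The helper name `reject_by_gap`, the example words, and all score values are made up for illustration; this is not the code in `app.py`.

```python
def reject_by_gap(scores, min_gap):
    """Return True when the best score does not lead the runner-up by min_gap.

    `scores` maps word -> similarity. Hypothetical helper for illustration.
    """
    if min_gap <= 0 or len(scores) < 2:
        return False  # check disabled, or nothing to compare against
    ranked = sorted(scores.values(), reverse=True)
    return (ranked[0] - ranked[1]) < min_gap


# Made-up scores: one clear winner, one bunched set, one low-but-distinct set.
clear = {"mayim": 0.82, "aba": 0.61, "ima": 0.58}
bunched = {"mayim": 0.64, "aba": 0.63, "ima": 0.62}
low_but_distinct = {"mayim": 0.30, "aba": 0.10, "ima": 0.08}

for name, scores in [("clear", clear), ("bunched", bunched), ("low", low_but_distinct)]:
    print(name, reject_by_gap(scores, min_gap=0.03))
```

The third case is accepted even though its best score is low, which is the "absolute level" problem listed above: a gap check alone cannot tell a weak match from a strong one.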

---

## Approach 4: Calibration floor threshold
**Status: Computed but NEVER CONNECTED (bug!) — Fixed in current version**

`/extract_bank` computes two calibration thresholds from bank self-similarity:
- `cosine_threshold`: 10th percentile of pairwise cosine similarities between recordings of the same word, minus a 0.05 margin
- `dtw_threshold`: 10th percentile of pairwise DTW similarities between recordings of the same word, minus a 0.05 margin

**The idea**: If two recordings of the same word have at least `cosine_threshold` similarity, then any valid input should score at least this high against its matching word. If even the best match scores below this floor, the input is unknown.

**The bug**: These thresholds were sent from frontend to the endpoint via `unknown_threshold` and `dtw_calibration_threshold` fields, but the rejection logic in `compute_similarities` never read them. Only the gap check ran.

**Fix applied**: Now Layer 1 checks cosine floor, Layer 2 checks DTW floor, Layer 3 is the gap check.

**Limitation**: Only computes calibration if words have ≥ 2 recordings. Single-sample words contribute nothing.
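
The calibration could be sketched roughly as follows. This assumes the pairs are taken within each word (which is what the ≥ 2 recordings limitation suggests) and that the bank is available as a dict of embedding vectors; `calibrate_cosine_floor` and the toy bank are hypothetical, not the `/extract_bank` implementation.

```python
import itertools
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def calibrate_cosine_floor(bank, percentile=10.0, margin=0.05):
    """bank maps word -> list of 1-D embedding vectors (one per recording).

    Returns None when no word has at least two recordings.
    """
    sims = []
    for recordings in bank.values():
        # Words with a single recording yield no pairs and contribute nothing.
        for a, b in itertools.combinations(recordings, 2):
            sims.append(cosine(a, b))
    if not sims:
        return None
    return float(np.percentile(sims, percentile)) - margin


# Toy bank with 2-D vectors standing in for real embeddings.
bank = {
    "word_a": [np.array([1.0, 0.0]), np.array([1.0, 0.0])],
    "word_b": [np.array([0.0, 1.0])],  # single recording: no pairs
}
print(calibrate_cosine_floor(bank))
```

Returning `None` when there are no pairs lets the caller skip the floor check entirely instead of applying a meaningless threshold.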

---

## Approach 5: Score spread / entropy check
**Status: Considered, not implemented**

Measure the standard deviation or entropy of the top-N scores.  
If all scores are very similar (low spread), reject as unknown.

**Problem**: After softmax (T=0.01), the spread is always amplified, making this measure unreliable in `mean` mode.

**Could work** in `dtw` or `hybrid` mode where raw DTW scores are used (no softmax).
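
A possible shape for the spread check, applied to raw scores rather than softmax output. The function names, the top-N size, and the `min_spread` value are all assumptions for illustration; no threshold for this check has been measured.

```python
import numpy as np


def top_n_spread(raw_scores, n=5):
    """Standard deviation of the n highest raw scores."""
    top = np.sort(np.asarray(raw_scores, dtype=float))[::-1][:n]
    return float(top.std())


def reject_by_spread(raw_scores, min_spread, n=5):
    """Reject when the top-n scores are bunched together (low spread)."""
    return top_n_spread(raw_scores, n) < min_spread


# Made-up raw scores (no softmax applied).
clear = [0.82, 0.61, 0.58, 0.55, 0.52]
bunched = [0.64, 0.63, 0.62, 0.62, 0.61]

print(reject_by_spread(clear, min_spread=0.03))
print(reject_by_spread(bunched, min_spread=0.03))
```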

---

## Approach 6: Channel disagreement (ensemble)
**Status: Partially available, not used for rejection**

If HuBERT and Wav2Vec2 disagree on the top word, the prediction is uncertain.  
Already surfaced in the UI as a warning ("⚠ Models disagree").

**Could extend to**: if HuBERT winner ≠ Wav2Vec2 winner AND gap is small → reject to unknown.
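
One way the extended rule might look. It assumes each model returns a word-to-raw-score mapping with at least two words, and that the gap is measured on the HuBERT scores; the function name and sample values are hypothetical.

```python
def reject_on_uncertain_disagreement(hubert_scores, w2v_scores, min_gap):
    """Reject only when the models disagree AND the HuBERT gap is small."""
    hubert_ranked = sorted(hubert_scores.items(), key=lambda kv: kv[1], reverse=True)
    w2v_top = max(w2v_scores, key=w2v_scores.get)
    disagree = hubert_ranked[0][0] != w2v_top
    gap = hubert_ranked[0][1] - hubert_ranked[1][1]
    return disagree and gap < min_gap


# Made-up scores: the models pick different words and HuBERT's lead is thin.
hubert = {"mayim": 0.64, "aba": 0.63, "ima": 0.50}
w2v = {"mayim": 0.55, "aba": 0.60, "ima": 0.40}
print(reject_on_uncertain_disagreement(hubert, w2v, min_gap=0.03))
```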

---

## Approach 7: Raw cosine gap instead of softmax gap
**Status: Fixed (current version)**

The gap check was using the softmax-rescaled score (`r['score']`) with temperature=0.01.  
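Dividing by T=0.01 multiplies every raw difference by 100 before exponentiation, so a raw cosine difference of only 0.02 becomes a probability ratio of e² ≈ 7.4 between rank 1 and rank 2. Gaps almost always appear large, which makes any `min_gap` threshold in softmax space meaningless.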

**Fix**: Compare `results[0]['mean_score'] - results[1]['mean_score']` (raw cosine, pre-softmax) instead of `results[0]['score'] - results[1]['score']` (softmax).

**Expected values in raw cosine space**:
- Known word, correct match: raw_gap ≈ 0.05–0.15
- Unknown word (no match): raw_gap ≈ 0.001–0.02
- Starting threshold: 0.03

**Slider range**: Changed from 0–0.5 (softmax space, useless) to 0–0.15 (raw cosine space, meaningful).
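
The amplification is easy to reproduce in isolation. The sketch below assumes the usual softmax form, exp(s/T) normalized over all words, and uses made-up scores for an unknown sound; the actual rescaling code in `app.py` may differ in detail.

```python
import numpy as np


def softmax(scores, temperature):
    """Standard temperature softmax over a list of scores."""
    logits = np.asarray(scores, dtype=float) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


raw = [0.52, 0.50, 0.50]  # made-up raw cosine scores, bunched together
soft = softmax(raw, temperature=0.01)

raw_gap = raw[0] - raw[1]
soft_gap = float(soft[0] - soft[1])
print(f"raw gap = {raw_gap:.3f}, softmax gap = {soft_gap:.3f}")
```

Here the raw gap sits below the 0.03 starting threshold, while the softmax gap comes out at roughly 0.68, above the top of the old 0–0.5 slider range.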

---

## Current Architecture (post-fix)

```
/compute_similarities rejection logic (3 layers, checked in order):

Layer 1 — Cosine floor (automatic, bank-calibrated)
  if request.unknown_threshold is set AND best_raw_cosine < threshold:
    → reject (score too low for any known word)
  Requires: each word has ≥ 2 recordings in bank

Layer 2 — DTW floor (automatic, bank-calibrated)
  if request.dtw_calibration_threshold is set AND best_dtw < threshold:
    → reject (DTW score too low)
  Requires: each word has ≥ 2 recordings in bank

Layer 3 — Raw cosine gap check (manual, user-controlled via slider)
  raw_gap = results[0]['mean_score'] - results[1]['mean_score']  ← pre-softmax
  if unknown_min_gap > 0 AND raw_gap < min_gap:
    → reject (no clear winner in raw cosine space)
  Start with min_gap = 0.03
```

Layers 1+2 are automatic once bank has multi-recording words. Layer 3 needs manual tuning.
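
As a rough sketch, the three layers can be expressed as one ordered check. The function name, argument names, and return values here are hypothetical; only the order of the checks and the "threshold is set" conditions come from the description above.

```python
def reject_reason(best_cosine, best_dtw, raw_gap,
                  cosine_floor=None, dtw_floor=None, min_gap=0.0):
    """Return the name of the first layer that rejects, or None to accept."""
    # Layer 1: cosine floor, only when a calibrated threshold was supplied.
    if cosine_floor is not None and best_cosine < cosine_floor:
        return "cosine_floor"
    # Layer 2: DTW floor, only when a threshold and a DTW score exist.
    if dtw_floor is not None and best_dtw is not None and best_dtw < dtw_floor:
        return "dtw_floor"
    # Layer 3: raw cosine gap, disabled when min_gap is 0.
    if min_gap > 0 and raw_gap < min_gap:
        return "raw_gap"
    return None


# Made-up values: passes both floors but has no clear winner.
print(reject_reason(best_cosine=0.70, best_dtw=0.65, raw_gap=0.01,
                    cosine_floor=0.55, dtw_floor=0.50, min_gap=0.03))
```

Returning the layer name rather than a bare boolean makes it straightforward to show in the debug line which check fired.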

---

## Approach 8: Z-score automatic rejection (current Layer 3)
**Status: Active (current version); now the global fallback for the per-word floor in Approach 10**

Replace the manual gap slider with a fully automatic statistical test.

**Insight**: For a known word, the correct category scores much higher than all others → the top score is a strong outlier (high z-score above the mean). For an unknown word, all categories score similarly → the top score is barely above average (low z-score).

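```python
import numpy as np

Z_THRESHOLD = 2.0  # default; tunable per request

# `results` is sorted best first; `mean_score` is the raw cosine score.
all_raw = np.array([r['mean_score'] for r in results])
mean_all = all_raw.mean()
std_all = all_raw.std()
# Guard against identical scores (std == 0): treat as "no clear winner".
z_top = (all_raw[0] - mean_all) / std_all if std_all > 0 else 0.0

is_unknown = z_top < Z_THRESHOLD  # True -> reject as unknown
```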

**Why it's automatic**: z-score is dimensionless and self-normalizing. No calibration data required. Works with 1 recording per word. Adapts to any bank size and vocabulary.

**Expected values**:
- Known word correctly identified: z ≈ 2.5–4.0
- Unknown word (no match): z ≈ 0.5–1.8
- Default threshold: 2.0 (tunable per-request via `unknown_z_threshold`)

**Limitation**: Unreliable with < 4 words in the bank (not enough data points for a meaningful distribution). Falls back to raw gap check in that case.
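
The bank-size limitation has a hard mathematical side that is worth checking. When the z-score is computed with the population standard deviation over all n scores including the top one (as `np.std` does by default), the top z-score can never exceed √(n−1). The sketch below evaluates that ceiling using the most favorable possible scores; `z_top` is a hypothetical helper that mirrors the formula in this section.

```python
import numpy as np


def z_top(raw_scores):
    """z-score of the highest score relative to all scores (population std)."""
    scores = np.asarray(raw_scores, dtype=float)
    std = scores.std()  # ddof=0, the np.std default
    return float((scores.max() - scores.mean()) / std) if std > 0 else 0.0


for n in (4, 5, 6, 10):
    # Best case for the winner: one perfect match, every other word at zero.
    best_case = [1.0] + [0.0] * (n - 1)
    print(n, round(z_top(best_case), 3), round(float(np.sqrt(n - 1)), 3))
```

With the default threshold of 2.0, a 4-word bank tops out near 1.73 and therefore always fails the z-test, and a 5-word bank reaches 2.0 only in this idealized case. Under these assumptions the z-test can realistically pass only from about 6 words upward, and the expected range of 2.5–4.0 needs a larger bank still.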

---

## Approach 9: Dual-model agreement check (Layer 4)
**Status: Active (current version)**

Z-score alone fails when an unknown sound happens to phonetically match one known word — the fake winner creates a high z-score. But HuBERT and Wav2Vec2 have different architectures and biases, so if the unknown sound triggers a fake match in HuBERT, W2V often picks a different word.

**Rule**: If HuBERT top word ≠ W2V top word → reject as unknown. Two independent models must agree for a prediction to be accepted.

**Why it works**: Real known words produce a consistent phonetic signal that both models recognize. Unknown sounds that accidentally resemble one known word in one model's feature space rarely resemble the same word in the other model's space.

**When it can fail**: If the unknown sound phonetically fools BOTH models into the same wrong word → both agree → passes. Rare but possible for sounds very similar to a known word.
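
A minimal sketch of the agreement rule, assuming each model yields a word-to-score mapping and that the check is skipped when Wav2Vec2 is not active. The function name and the sample scores are hypothetical.

```python
def rejected_by_disagreement(hubert_scores, w2v_scores=None):
    """Reject when both models are active and their top words differ."""
    if not w2v_scores:
        return False  # W2V inactive: this layer does not run
    hubert_top = max(hubert_scores, key=hubert_scores.get)
    w2v_top = max(w2v_scores, key=w2v_scores.get)
    return hubert_top != w2v_top


# Made-up scores: HuBERT has a confident-looking winner, W2V picks another word.
hubert = {"mayim": 0.81, "aba": 0.55, "ima": 0.52}
w2v = {"mayim": 0.58, "aba": 0.49, "ima": 0.66}
print(rejected_by_disagreement(hubert, w2v))
```

Note that this rule ignores how confident either model is, so it rejects on any disagreement, including cases where HuBERT alone would have produced a high z-score.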

---

## Current Architecture (after Approaches 8 and 9)

```
/compute_similarities rejection logic (4 layers):

Layer 1 — Cosine floor (automatic, bank-calibrated)
  if unknown_threshold is set AND best_raw_cosine < threshold → reject
  Requires: each word has ≥ 2 recordings in bank

Layer 2 — DTW floor (automatic, bank-calibrated)
  if dtw_calibration_threshold is set AND best_dtw < threshold → reject
  Requires: each word has ≥ 2 recordings in bank

Layer 3 — Z-score (automatic, no calibration needed)
  raw_scores = [r['mean_score'] for r in results]
  z = (raw_scores[0] - mean(raw_scores)) / std(raw_scores)
  if z < unknown_z_threshold (default 2.0) → reject
  Fallback for < 4 words: raw cosine gap < 0.03 → reject

Layer 4 — Model agreement (automatic, only if W2V active)
  if HuBERT top word ≠ W2V top word → reject
  Runs independently; can reject even if Layers 1-3 passed.
```

Debug line shows: mean_score, cosine_floor, dtw_score, dtw_floor, z (HuBERT), w2v_z (W2V), z_threshold, agree (✓/✗), raw_gap.

---

## Approach 10: Per-word z-floor calibration (current Layer 3)
**Status: Active (current version)**

**Problem with global z-threshold**: Different words sit in different regions of embedding space. Words with many phonetically similar neighbors naturally produce lower z-scores even when correctly predicted. A single global threshold rejects valid predictions for "crowded" words and misses unknowns for "isolated" words.

**Solution**: At bank load time, compute each word's specific z-floor from the bank's internal geometry:
1. Compute mean embedding for each word
2. For word W: measure cosine similarity of W's mean vs all other word means → distribution of "other scores"
3. Expected z-score = (1.0 - mean_other) / std_other
4. Apply reliability factor (0.60) to account for real new-recording variation vs bank-mean

**At inference time**: The predicted word's own z-floor is used as the threshold instead of a global value. A word in a crowded neighborhood gets a lower threshold; an isolated word gets a higher one.

**Requires**: ≥ 4 words in the bank (need meaningful distribution). Falls back to global threshold otherwise.

**Result**: Correct predictions for "easy" and "hard" words both pass their respective floors; unknown sounds that don't match any word fail the floor for whichever word accidentally wins.
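
The four calibration steps could be sketched like this. The bank layout (word mapped to an array of recording embeddings), the function name, and the random toy data are assumptions; only the formula and the 0.60 reliability factor come from the steps above.

```python
import numpy as np


def per_word_z_floors(bank, reliability=0.60, min_words=4):
    """bank maps word -> array of shape (n_recordings, dim).

    Returns word -> z-floor, or an empty dict when the bank is too small.
    """
    if len(bank) < min_words:
        return {}  # caller falls back to the global threshold
    words = list(bank)
    # Step 1: mean embedding per word.
    means = np.stack([np.asarray(bank[w], dtype=float).mean(axis=0) for w in words])
    unit = means / np.linalg.norm(means, axis=1, keepdims=True)
    sims = unit @ unit.T  # Step 2: cosine similarity between word means
    floors = {}
    for i, word in enumerate(words):
        others = np.delete(sims[i], i)  # similarities to every other word
        std_other = others.std()
        if std_other > 0:
            # Steps 3 and 4: expected z-score scaled by the reliability factor.
            floors[word] = float((1.0 - others.mean()) / std_other * reliability)
    return floors


# Toy bank: 5 words, 3 recordings each, random 16-D vectors as stand-ins.
rng = np.random.default_rng(0)
toy_bank = {f"word_{i}": rng.normal(size=(3, 16)) for i in range(5)}
for word, floor in per_word_z_floors(toy_bank).items():
    print(word, round(floor, 2))
```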

---

## Current Architecture (after Approach 10)

```
/compute_similarities rejection logic (4 layers):

Layer 1 — Cosine floor (bank self-calibrated, requires ≥ 2 recordings per word)
Layer 2 — DTW floor (bank self-calibrated, requires ≥ 2 recordings per word)
Layer 3 — Per-word z-score (computed at bank load from inter-word geometry, requires ≥ 4 words)
  z_floor per word = (1.0 - mean_other_sims) / std_other_sims × 0.60
  Falls back to global z_threshold if per-word floor unavailable
  Falls back to raw gap check (0.03) if < 4 words in bank
Layer 4 — Model agreement (if W2V active: HuBERT top ≠ W2V top → reject)
```

Debug line shows: mean_score, cosine_floor, z (HuBERT), thr (per-word or global, labeled), w2v_z, agree (✓/✗), gap.
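
The Layer 3 fallback chain might be organized along these lines. This is a sketch under assumptions: the function name, the `(word, score)` input format, and the returned label are hypothetical, and the label simply mirrors the "per-word or global" tag in the debug line.

```python
import numpy as np


def layer3_decision(ranked, z_floors, global_z=2.0, fallback_gap=0.03, min_words=4):
    """ranked: list of (word, raw_cosine), best first, at least two entries.

    Returns (rejected, threshold_label).
    """
    scores = np.array([s for _, s in ranked], dtype=float)
    if len(scores) < min_words:
        # Too few words for a meaningful z-score: use the raw gap check.
        return bool(scores[0] - scores[1] < fallback_gap), "gap"
    std = scores.std()
    z = (scores[0] - scores.mean()) / std if std > 0 else 0.0
    top_word = ranked[0][0]
    if top_word in z_floors:
        return bool(z < z_floors[top_word]), "per-word"
    return bool(z < global_z), "global"


# Made-up example: 8 words, one clear winner, per-word floor available.
ranked = [("mayim", 0.90)] + [(f"w{i}", 0.50) for i in range(7)]
print(layer3_decision(ranked, z_floors={"mayim": 2.0}))
```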

3. **Bank-specific tuning**: Calibration is computed per bank-load. If the bank changes (new words added, recordings replaced), you must reload the bank to update thresholds.

4. **No `_unknown` bank category yet**: A curated set of diverse Hebrew utterances not in the vocabulary could improve detection if included as `_unknown/` in the bank. Needs testing.

---

## Recommended settings (as of current fix)

- **Mode**: `hybrid` (mean cosine pre-filter → DTW re-rank)
- **min_gap slider**: Leave at 0 (disabled) and let the floor threshold handle rejection automatically
- **Bank requirement**: Each word needs ≥ 2 recordings for calibration to activate
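- **If calibration is unavailable**: Enable the min_gap slider, starting at 0.03 in raw cosine space (see Approach 7) and raising it toward ~0.05–0.10 if wrong predictions still get through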