RonenShilchikov

Unknown Rejection: Approach Log

Goal: The system must NEVER give a wrong word prediction for a non-verbal child.
If the audio doesn't match any known word, predict _unknown so the parent can review.
Wrong predictions are worse than missing ones.


Why this is hard

HuBERT/Wav2Vec2 embeddings always produce some cosine similarity score, even for random noise or unrelated speech. The model was never trained to say "I don't know." Every audio file gets a winner, even if it has nothing to do with the vocabulary.

The fundamental challenge: distinguish "a known word spoken imperfectly" from "a completely unknown word/sound."
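
A minimal sketch of the problem, using random vectors as stand-ins for real embeddings (the 768-dimensional size and the word names are assumptions for illustration only). Picking the highest cosine similarity returns a label for any input at all:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Hypothetical bank: one mean embedding per known word (random stand-ins).
bank = {word: rng.normal(size=768) for word in ["mama", "water", "more"]}

# An input unrelated to the bank still gets a "winner".
query = rng.normal(size=768)
scores = {word: cosine(query, emb) for word, emb in bank.items()}
winner = max(scores, key=scores.get)
print(winner, scores)
```

The winner here carries no meaning, which is why every approach below adds an explicit rejection test on top of the ranking.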


Approach 1: _unknown bank category with synthetic sounds

Status: FAILED

Put white noise, sine tones, or other non-speech sounds in Bank/_unknown/.
The idea: unknown audio ≈ noise → high similarity to noise samples.

Why it failed: HuBERT speech embeddings live in a completely different region of embedding space from non-speech sounds. Speech vs. non-speech similarity is near zero regardless of content. A Hebrew word you've never seen still looks like speech, not like noise.

Lesson: _unknown must contain real spoken words (e.g., English words the kids would never say), not synthetic audio.


Approach 2: _unknown bank category with real (foreign) words

Status: Partially tried, inconclusive

Put English or other foreign-language words in Bank/_unknown/.
The idea: if an unknown Hebrew word is spoken, it might score more similarly to the _unknown cluster than to any known Hebrew word.

Problem: The _unknown cluster is inherently diverse (many different words/sounds) → low internal consistency → weak centroid → rarely wins against a focused known-word cluster, even for truly unknown input. HuBERT groups by phonetics, and any Hebrew syllable shares features with known words.
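
The weak-centroid effect can be illustrated with synthetic vectors. This is a sketch under stated assumptions: random 64-dimensional vectors stand in for embeddings, and the noise level 0.3 is arbitrary. It compares how well the members of a focused cluster and of a diverse cluster match their own centroid:

```python
import numpy as np

def mean_cosine_to_centroid(vectors):
    """Average cosine similarity between each vector and the cluster centroid."""
    v = np.asarray(vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    centroid = v.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    return float((v @ centroid).mean())

rng = np.random.default_rng(0)
dim = 64  # illustrative size, not the real embedding dimension

# Focused cluster: one "word" plus small recording-to-recording variation.
base = rng.normal(size=dim)
focused = base + 0.3 * rng.normal(size=(20, dim))

# Diverse cluster: 20 unrelated directions, standing in for many different words.
diverse = rng.normal(size=(20, dim))

focused_consistency = mean_cosine_to_centroid(focused)
diverse_consistency = mean_cosine_to_centroid(diverse)
print(focused_consistency, diverse_consistency)
```

In this toy setup the diverse centroid matches even its own members poorly, so it is unlikely to outscore a focused word cluster for new input.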

Current bank state: No _unknown folder exists in Bank-12, Bank_New, or Bank-Noa.


Approach 3: Gap-based rejection (min_gap between 1st and 2nd place)

Status: Currently active, works partially

In /compute_similarities, reject if score_1st - score_2nd < min_gap.
Logic: a known word scores clearly higher than all others (large gap). An unknown word has no clear winner: scores are bunched together (small gap).

Implementation: app.py lines 487-492, controlled by unknown_min_gap slider in UI.

Problems:

  • Requires manual tuning; there is no principled default value
  • In mean mode, softmax (T=0.01) amplifies small differences, so even unknown audio can show a large gap between rank-1 and rank-2
  • Doesn't account for the absolute level of scores (low but distinct ≠ match)
  • User doesn't know what value to set
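
For reference, the gap check itself is small. The sketch below is not the code in app.py; the function name and the example scores are hypothetical:

```python
def reject_by_gap(scores, min_gap):
    """Return True when the top two scores are closer than min_gap.

    scores: similarity scores, one per candidate word (any order).
    min_gap: a value of 0 disables the check, mirroring the slider.
    """
    if min_gap <= 0 or len(scores) < 2:
        return False
    first, second = sorted(scores, reverse=True)[:2]
    return (first - second) < min_gap

clear_winner = [0.82, 0.61, 0.58]  # illustrative values, not measured
bunched = [0.64, 0.63, 0.62]

print(reject_by_gap(clear_winner, 0.05))  # accepted
print(reject_by_gap(bunched, 0.05))       # rejected
```

Note that the check only looks at the difference between the top two scores, which is exactly why it ignores their absolute level.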

Approach 4: Calibration floor threshold

Status: Computed but NEVER CONNECTED (bug!). Fixed in current version

/extract_bank computes two calibration thresholds from bank self-similarity:

  • cosine_threshold: 10th percentile of all pairwise cosine similarities, minus a 0.05 margin
  • dtw_threshold: 10th percentile of all pairwise DTW similarities, minus a 0.05 margin

The idea: If two recordings of the same word have at least cosine_threshold similarity, then any valid input should score at least this high against its matching word. If even the best match scores below this floor, the input is unknown.

The bug: These thresholds were sent from the frontend to the endpoint via the unknown_threshold and dtw_calibration_threshold fields, but the rejection logic in compute_similarities never read them. Only the gap check ran.

Fix applied: Now Layer 1 checks cosine floor, Layer 2 checks DTW floor, Layer 3 is the gap check.

Limitation: Calibration is computed only from words with ≥ 2 recordings. Single-sample words contribute nothing.
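
A sketch of how such a floor could be computed. It assumes the pairs are taken between recordings of the same word, which is what the limitation above suggests; the function name and the bank layout (word → list of embeddings) are hypothetical, not the actual /extract_bank code:

```python
from itertools import combinations

import numpy as np

def cosine_floor(bank, percentile=10.0, margin=0.05):
    """Percentile of same-word pairwise cosine similarities, minus a margin.

    bank: dict mapping word -> list of 1-D embedding arrays.
    Returns None when no word has at least two recordings.
    """
    sims = []
    for recordings in bank.values():
        # Words with a single recording yield no pairs and contribute nothing.
        for a, b in combinations(recordings, 2):
            sims.append(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    if not sims:
        return None
    return float(np.percentile(sims, percentile) - margin)

single = {"mama": [np.array([1.0, 0.0])]}
paired = {"mama": [np.array([1.0, 0.0]), np.array([1.0, 0.0])]}
print(cosine_floor(single), cosine_floor(paired))
```

Returning None for an uncalibrated bank lets the caller skip the floor layer instead of comparing against a made-up value.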


Approach 5: Score spread / entropy check

Status: Considered, not implemented

Measure the standard deviation or entropy of the top-N scores.
If all scores are very similar (low spread), reject as unknown.

Problem: After softmax (T=0.01), the spread is always amplified, making this measure unreliable in mean mode.

Could work in dtw or hybrid mode where raw DTW scores are used (no softmax).
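
A possible shape for the spread variant, operating on raw (pre-softmax) scores. This was never implemented, so the function names, the top-N size, and the min_spread parameter are all hypothetical:

```python
import numpy as np

def top_n_spread(raw_scores, n=5):
    """Standard deviation of the n highest raw scores."""
    top = np.sort(np.asarray(raw_scores, dtype=float))[::-1][:n]
    return float(top.std())

def reject_by_spread(raw_scores, min_spread, n=5):
    """Reject when the top-n scores are bunched together."""
    return top_n_spread(raw_scores, n) < min_spread

bunched = [0.61, 0.60, 0.62, 0.59, 0.60]   # illustrative values
distinct = [0.85, 0.55, 0.50, 0.52, 0.48]

print(reject_by_spread(bunched, 0.05), reject_by_spread(distinct, 0.05))
```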


Approach 6: Channel disagreement (ensemble)

Status: Partially available, not used for rejection

If HuBERT and Wav2Vec2 disagree on the top word, the prediction is uncertain.
Already surfaced in the UI as a warning ("⚠ Models disagree").

Could extend to: if HuBERT winner ≠ Wav2Vec2 winner AND gap is small → reject as unknown.
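
A sketch of that extension. The function name, the score dictionaries, and the 0.03 gap are illustrative assumptions; the gap is measured on the HuBERT scores here, which is one of several reasonable choices:

```python
def reject_on_disagreement(hubert_scores, w2v_scores, min_gap=0.03):
    """Reject when the two models disagree AND HuBERT's own lead is small.

    hubert_scores, w2v_scores: dict mapping word -> raw score for one input.
    """
    ranked = sorted(hubert_scores, key=hubert_scores.get, reverse=True)
    w2v_top = max(w2v_scores, key=w2v_scores.get)
    if ranked[0] == w2v_top:
        return False  # models agree: nothing to reject here
    if len(ranked) < 2:
        return True   # disagreement with no runner-up to measure a gap against
    gap = hubert_scores[ranked[0]] - hubert_scores[ranked[1]]
    return gap < min_gap

hubert = {"mama": 0.62, "water": 0.61, "more": 0.40}
w2v_disagrees = {"mama": 0.50, "water": 0.58, "more": 0.41}
w2v_agrees = {"mama": 0.70, "water": 0.55, "more": 0.41}

print(reject_on_disagreement(hubert, w2v_disagrees))  # small gap + disagreement
print(reject_on_disagreement(hubert, w2v_agrees))
```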


Approach 7: Raw cosine gap instead of softmax gap

Status: Fixed (current version)

The gap check was using the softmax-rescaled score (r['score']) with temperature=0.01.
This temperature is so extreme that even a 0.001 raw cosine difference becomes a large softmax gap, making any min_gap threshold meaningless: gaps always appear large.

Fix: Compare results[0]['mean_score'] - results[1]['mean_score'] (raw cosine, pre-softmax) instead of results[0]['score'] - results[1]['score'] (softmax).

Expected values in raw cosine space:

  • Known word, correct match: raw_gap ≈ 0.05–0.15
  • Unknown word (no match): raw_gap ≈ 0.001–0.02
  • Starting threshold: 0.03

Slider range: Changed from 0–0.5 (softmax space, useless) to 0–0.15 (raw cosine space, meaningful).
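
The amplification is easy to reproduce. The sketch below uses a generic temperature softmax (not the exact code in app.py) on three illustrative raw scores whose top gap of 0.02 sits inside the "unknown" range above:

```python
import numpy as np

def softmax(scores, temperature):
    """Numerically stable softmax of scores / temperature."""
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

raw = np.array([0.62, 0.60, 0.59])   # illustrative raw cosine scores
soft = softmax(raw, temperature=0.01)

raw_gap = raw[0] - raw[1]     # 0.02 in raw cosine space
soft_gap = soft[0] - soft[1]  # roughly 0.73 after softmax
print(raw_gap, soft_gap)
```

A raw gap that should be rejected at the 0.03 threshold turns into a softmax gap that would pass almost any setting of the old 0–0.5 slider.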


Current Architecture (post-fix)

/compute_similarities rejection logic (3 layers, checked in order):

Layer 1: Cosine floor (automatic, bank-calibrated)
  if request.unknown_threshold is set AND best_raw_cosine < threshold:
    → reject (score too low for any known word)
  Requires: each word has ≥ 2 recordings in bank

Layer 2: DTW floor (automatic, bank-calibrated)
  if request.dtw_calibration_threshold is set AND best_dtw < threshold:
    → reject (DTW score too low)
  Requires: each word has ≥ 2 recordings in bank

Layer 3: Raw cosine gap check (manual, user-controlled via slider)
  raw_gap = results[0]['mean_score'] - results[1]['mean_score']  ← pre-softmax
  if unknown_min_gap > 0 AND raw_gap < min_gap:
    → reject (no clear winner in raw cosine space)
  Start with min_gap = 0.03

Layers 1+2 are automatic once bank has multi-recording words. Layer 3 needs manual tuning.


Approach 8: Z-score automatic rejection (current Layer 3)

Status: Active (current version)

Replace the manual gap slider with a fully automatic statistical test.

Insight: For a known word, the correct category scores much higher than all others → the top score is a strong outlier (high z-score above the mean). For an unknown word, all categories score similarly → the top score is barely above average (low z-score).

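import numpy as np

Z_THRESHOLD = 2.0  # default; tunable per request via unknown_z_threshold

# results is assumed to be sorted best-first, so all_raw[0] is the top score
all_raw = [r['mean_score'] for r in results]   # raw cosine scores
mean_all = np.mean(all_raw)
std_all  = np.std(all_raw)
# Guard against identical scores (std == 0), which would divide by zero
z_top = (all_raw[0] - mean_all) / std_all if std_all > 0 else 0.0

is_unknown = z_top < Z_THRESHOLD   # True → reject as unknown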

Why it's automatic: z-score is dimensionless and self-normalizing. No calibration data required. Works with 1 recording per word. Adapts to any bank size and vocabulary.

Expected values:

  • Known word correctly identified: z ≈ 2.5–4.0
  • Unknown word (no match): z ≈ 0.5–1.8
  • Default threshold: 2.0 (tunable per-request via unknown_z_threshold)

Limitation: Unreliable with < 4 words in the bank (not enough data points for a meaningful distribution). Falls back to raw gap check in that case.
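
The z-score test and its small-bank fallback can be combined in one function. This is a sketch with a hypothetical name and signature, using the defaults stated above (threshold 2.0, fallback gap 0.03, minimum of 4 words):

```python
import numpy as np

def reject_by_z(raw_scores, z_threshold=2.0, min_words=4, fallback_gap=0.03):
    """Return True when the input should be rejected as unknown.

    raw_scores: raw cosine scores, one per word in the bank (any order).
    """
    scores = np.sort(np.asarray(raw_scores, dtype=float))[::-1]
    if len(scores) < 2:
        return False  # nothing to compare against
    if len(scores) < min_words:
        # Too few points for a distribution: fall back to the raw gap check.
        return bool(scores[0] - scores[1] < fallback_gap)
    std = scores.std()
    if std == 0:
        return True  # all scores identical: no winner at all
    z_top = (scores[0] - scores.mean()) / std
    return bool(z_top < z_threshold)

known = [0.80, 0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.53, 0.47, 0.50]
unknown = [0.55, 0.54, 0.53, 0.52, 0.51, 0.50, 0.49, 0.48, 0.47, 0.46]
print(reject_by_z(known), reject_by_z(unknown))
```

One property worth knowing when choosing the threshold: with the population standard deviation that np.std computes by default, the top z-score among N scores can never exceed sqrt(N - 1). A threshold of 2.0 therefore cannot be exceeded with fewer than 6 words in the bank, however clear the winner is, so small banks may need a lower threshold or the gap fallback.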


Approach 9: Dual-model agreement check (Layer 4)

Status: Active (current version)

Z-score alone fails when an unknown sound happens to phonetically match one known word: the fake winner creates a high z-score. But HuBERT and Wav2Vec2 have different architectures and biases, so if the unknown sound triggers a fake match in HuBERT, W2V often picks a different word.

Rule: If HuBERT top word ≠ W2V top word → reject as unknown. Two independent models must agree for a prediction to be accepted.

Why it works: Real known words produce a consistent phonetic signal that both models recognize. Unknown sounds that accidentally resemble one known word in one model's feature space rarely resemble the same word in the other model's space.

When it can fail: If the unknown sound phonetically fools BOTH models into the same wrong word → both agree → passes. Rare but possible for sounds very similar to a known word.


Current Architecture

/compute_similarities rejection logic (4 layers):

Layer 1: Cosine floor (automatic, bank-calibrated)
  if unknown_threshold is set AND best_raw_cosine < threshold → reject
  Requires: each word has ≥ 2 recordings in bank

Layer 2: DTW floor (automatic, bank-calibrated)
  if dtw_calibration_threshold is set AND best_dtw < threshold → reject
  Requires: each word has ≥ 2 recordings in bank

Layer 3: Z-score (automatic, no calibration needed)
  raw_scores = [r['mean_score'] for r in results]
  z = (raw_scores[0] - mean(raw_scores)) / std(raw_scores)
  if z < unknown_z_threshold (default 2.0) → reject
  Fallback for < 4 words: raw cosine gap < 0.03 → reject

Layer 4: Model agreement (automatic, only if W2V active)
  if HuBERT top word ≠ W2V top word → reject
  Runs independently; can reject even if Layers 1-3 passed.

Debug line shows: mean_score, cosine_floor, dtw_score, dtw_floor, z (HuBERT), w2v_z (W2V), z_threshold, agree (✓/✗), raw_gap.
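
Put together, the four layers could look like the sketch below. This is not the code in app.py: the function name, its parameters, and the returned reason strings are hypothetical, and the result dicts only carry the fields named in this log.

```python
import numpy as np

UNKNOWN = "_unknown"

def decide(results, best_dtw=None, w2v_top=None,
           cosine_floor=None, dtw_floor=None, z_threshold=2.0):
    """Apply the four rejection layers in order.

    results: list of {'word': str, 'mean_score': float}, sorted best-first.
    Returns (label, reason); reason is None when the prediction is accepted.
    """
    raw = np.array([r["mean_score"] for r in results], dtype=float)
    top_word = results[0]["word"]

    # Layer 1: cosine floor (skipped when the bank is not calibrated).
    if cosine_floor is not None and raw[0] < cosine_floor:
        return UNKNOWN, "cosine_floor"

    # Layer 2: DTW floor.
    if dtw_floor is not None and best_dtw is not None and best_dtw < dtw_floor:
        return UNKNOWN, "dtw_floor"

    # Layer 3: z-score, or the raw gap for banks with fewer than 4 words.
    if len(raw) >= 4:
        std = raw.std()
        z = (raw[0] - raw.mean()) / std if std > 0 else 0.0
        if z < z_threshold:
            return UNKNOWN, "z_score"
    elif len(raw) >= 2 and raw[0] - raw[1] < 0.03:
        return UNKNOWN, "raw_gap"

    # Layer 4: model agreement (only when a Wav2Vec2 result is supplied).
    if w2v_top is not None and w2v_top != top_word:
        return UNKNOWN, "model_disagreement"

    return top_word, None

scores = [0.80, 0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.53, 0.47, 0.50]
results = sorted(
    ({"word": f"word_{i}", "mean_score": s} for i, s in enumerate(scores)),
    key=lambda r: r["mean_score"], reverse=True,
)
print(decide(results))
print(decide(results, w2v_top="word_3"))
```

Returning the reason alongside the label makes it straightforward to surface which layer fired in the debug line.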


Approach 10: Per-word z-floor calibration (current Layer 3)

Status: Active (current version)

Problem with global z-threshold: Different words sit in different regions of embedding space. Words with many phonetically similar neighbors naturally produce lower z-scores even when correctly predicted. A single global threshold rejects valid predictions for "crowded" words and misses unknowns for "isolated" words.

Solution: At bank load time, compute each word's specific z-floor from the bank's internal geometry:

  1. Compute mean embedding for each word
  2. For word W: measure cosine similarity of W's mean vs all other word means → distribution of "other scores"
  3. Expected z-score = (1.0 - mean_other) / std_other
  4. Apply a reliability factor (0.60) to account for the fact that a real new recording varies more than the bank mean does

At inference time: The predicted word's own z-floor is used as the threshold instead of a global value. A word in a crowded neighborhood gets a lower threshold; an isolated word gets a higher one.

Requires: ≥ 4 words in the bank (needed for a meaningful distribution). Falls back to global threshold otherwise.

Result: Correct predictions for "easy" and "hard" words both pass their respective floors; unknown sounds that don't match any word fail the floor for whichever word accidentally wins.
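
The four calibration steps can be sketched as follows. The function name and the bank layout (word → mean embedding) are assumptions; the formula and the 0.60 factor are the ones listed above:

```python
import numpy as np

def per_word_z_floors(word_means, reliability=0.60, min_words=4):
    """Compute a z-floor per word from the bank's inter-word geometry.

    word_means: dict mapping word -> mean embedding (1-D array).
    Returns {} when the bank has fewer than min_words words, so the
    caller can fall back to the global threshold.
    """
    words = list(word_means)
    if len(words) < min_words:
        return {}
    m = np.stack([np.asarray(word_means[w], dtype=float) for w in words])
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = m @ m.T  # cosine similarity between every pair of word means

    floors = {}
    for i, word in enumerate(words):
        other = np.delete(sims[i], i)  # similarities to all OTHER words
        std_other = other.std()
        if std_other == 0:
            continue  # degenerate geometry: leave this word to the global threshold
        expected_z = (1.0 - other.mean()) / std_other
        floors[word] = float(expected_z * reliability)
    return floors

rng = np.random.default_rng(0)
bank_means = {f"word_{i}": rng.normal(size=32) for i in range(6)}  # random stand-ins
floors = per_word_z_floors(bank_means)
print(floors)
```

At inference time the lookup would then be something like floors.get(predicted_word, global_threshold), which also covers words that were skipped during calibration.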


Current Architecture

/compute_similarities rejection logic (4 layers):

Layer 1: Cosine floor (bank self-calibrated, requires ≥ 2 recordings per word)
Layer 2: DTW floor (bank self-calibrated, requires ≥ 2 recordings per word)
Layer 3: Per-word z-score (computed at bank load from inter-word geometry, requires ≥ 4 words)
  z_floor per word = (1.0 - mean_other_sims) / std_other_sims × 0.60
  Falls back to global z_threshold if per-word floor unavailable
  Falls back to raw gap check (0.03) if < 4 words in bank
Layer 4: Model agreement (if W2V active: HuBERT top ≠ W2V top → reject)

Debug line shows: mean_score, cosine_floor, z (HuBERT), thr (per-word or global, labeled), w2v_z, agree (✓/✗), gap.

  1. Bank-specific tuning: Calibration is computed per bank-load. If the bank changes (new words added, recordings replaced), you must reload the bank to update thresholds.

  2. No _unknown bank category yet: A curated set of diverse Hebrew utterances not in the vocabulary could improve detection if included as _unknown/ in the bank. Needs testing.


Recommended settings (as of current fix)

  • Mode: hybrid (mean cosine pre-filter → DTW re-rank)
  • min_gap slider: Leave at 0 (disabled) and let the floor threshold handle rejection automatically
  • Bank requirement: Each word needs ≥ 2 recordings for calibration to activate
  • If calibration is unavailable: Enable min_gap slider with a value of ~0.05–0.10