
RonenShilchikov

Restructure: move Python backend into backend/ directory

423bed8 about 2 months ago


	# Unknown Rejection — Approach Log

	> Goal: The system must NEVER give a wrong word prediction for a non-verbal child.
	> If the audio doesn't match any known word, predict `_unknown` so the parent can review.
	> Wrong predictions are worse than missing ones.

	---

	## Why this is hard

	HuBERT/Wav2Vec2 embeddings always produce _some_ cosine similarity score — even for random noise or unrelated speech. The model was never trained to say "I don't know." Every audio file gets a winner, even if it has nothing to do with the vocabulary.

	The fundamental challenge: distinguish "a known word spoken imperfectly" from "a completely unknown word/sound."
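
The "always a winner" behaviour shows up even in a toy nearest-centroid matcher. This is a minimal sketch, not the project's code: `cosine` and `best_match` are hypothetical helpers, the vocabulary is made up, and 3-dimensional vectors stand in for real HuBERT embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_match(query, centroids):
    """Return (word, score) for the most similar centroid. It never abstains."""
    scores = {word: cosine(query, vec) for word, vec in centroids.items()}
    word = max(scores, key=scores.get)
    return word, scores[word]

# Toy word centroids (hypothetical vocabulary and values)
centroids = {
    "mayim": [1.0, 0.1, 0.0],
    "ima":   [0.0, 1.0, 0.1],
    "aba":   [0.1, 0.0, 1.0],
}

# A query that is not close to any single word still gets a label.
word, score = best_match([0.6, 0.5, 0.4], centroids)
print(word, round(score, 3))
```

Any rejection rule therefore has to be layered on top of the ranking; the ranking itself cannot say "none of these".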

	---

	## Approach 1: `_unknown` bank category with synthetic sounds
	Status: FAILED

	Put white noise, sine tones, or other non-speech sounds in `Bank/_unknown/`.
	The idea: unknown audio ≈ noise → high similarity to noise samples.

	Why it failed: HuBERT speech embeddings live in a completely different region of embedding space from non-speech sounds. Speech vs. non-speech similarity is near zero regardless of content. A Hebrew word you've never seen still looks like speech, not like noise.

	Lesson: `_unknown` must contain real spoken words (e.g., English words the kids would never say), not synthetic audio.

	---

	## Approach 2: `_unknown` bank category with real (foreign) words
	Status: Partially tried, inconclusive

	Put English or other foreign-language words in `Bank/_unknown/`.
	The idea: if an unknown Hebrew word is spoken, it might score more similarly to the `_unknown` cluster than to any known Hebrew word.

	Problem: The `_unknown` cluster is inherently diverse (many different words/sounds) → low internal consistency → weak centroid → rarely wins against a focused known-word cluster, even for truly unknown input. HuBERT groups by phonetics, and any Hebrew syllable shares features with known words.
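
The weak-centroid effect can be illustrated with synthetic vectors. This is a sketch under strong assumptions: random Gaussian directions stand in for embeddings, which real HuBERT features are not, so only the qualitative effect carries over. `internal_consistency` is a hypothetical measure (mean cosine between each member and its cluster centroid).

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def internal_consistency(vectors):
    """Mean cosine similarity between each member and the cluster centroid."""
    c = centroid(vectors)
    return sum(cosine(v, c) for v in vectors) / len(vectors)

rng = random.Random(0)
DIM, N = 16, 20

# "Known word": many recordings scattered tightly around one direction.
base = normalize([rng.gauss(0, 1) for _ in range(DIM)])
focused = [normalize([b + rng.gauss(0, 0.1) for b in base]) for _ in range(N)]

# "_unknown": many unrelated words, each pointing in its own direction.
diverse = [normalize([rng.gauss(0, 1) for _ in range(DIM)]) for _ in range(N)]

focused_consistency = internal_consistency(focused)
diverse_consistency = internal_consistency(diverse)
print(f"focused: {focused_consistency:.2f}, diverse: {diverse_consistency:.2f}")
```

The diverse cluster's members sit far from their own centroid, so that centroid is a poor representative of any particular unknown input.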

	Current bank state: No `_unknown` folder exists in Bank-12, Bank_New, or Bank-Noa.

	---

	## Approach 3: Gap-based rejection (min_gap between 1st and 2nd place)
	Status: Currently active, works partially

	In `/compute_similarities`, reject if `score_1st - score_2nd < min_gap`.
	Logic: a known word scores clearly higher than all others (large gap). An unknown word has no clear winner — scores are bunched together (small gap).

	Implementation: `app.py` lines 487-492, controlled by `unknown_min_gap` slider in UI.

	Problems:
	- Requires manual tuning — no principled default value
	- In `mean` mode, softmax (T=0.01) amplifies small differences, so even unknown audio can show a large gap between rank-1 and rank-2
	- Doesn't account for the absolute level of scores (low but distinct ≠ match)
	- User doesn't know what value to set
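
A minimal sketch of the gap rule, including the blind spot about absolute score level. `reject_by_gap` is a hypothetical helper, not the code in `app.py`, and the example scores are invented.

```python
def reject_by_gap(scores, min_gap):
    """Return True if the top two scores are closer than min_gap.

    scores: one similarity score per word, in any order.
    min_gap <= 0 disables the check, like leaving the slider at 0.
    """
    if min_gap <= 0 or len(scores) < 2:
        return False
    first, second = sorted(scores, reverse=True)[:2]
    return (first - second) < min_gap

known = [0.82, 0.70, 0.66]           # clear winner
unknown = [0.61, 0.60, 0.59]         # bunched scores
low_but_distinct = [0.30, 0.15, 0.10]  # weak match that still has a large gap

print(reject_by_gap(known, 0.03))             # accepted
print(reject_by_gap(unknown, 0.03))           # rejected
print(reject_by_gap(low_but_distinct, 0.03))  # accepted, although the best score is poor
```

The third case is the reason the gap check alone is not enough: it only looks at relative separation.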

	---

	## Approach 4: Calibration floor threshold
	Status: Computed but NEVER CONNECTED (bug!) — Fixed in current version

	`/extract_bank` computes two calibration thresholds from bank self-similarity:
	- `cosine_threshold`: 10th percentile of all pairwise cosine similarities − 0.05 margin
	- `dtw_threshold`: 10th percentile of all pairwise DTW similarities − 0.05 margin

	The idea: If two recordings of the same word have at least `cosine_threshold` similarity, then any valid input should score at least this high against its matching word. If even the best match scores below this floor, the input is unknown.

	The bug: These thresholds were sent from frontend to the endpoint via `unknown_threshold` and `dtw_calibration_threshold` fields, but the rejection logic in `compute_similarities` never read them. Only the gap check ran.

	Fix applied: Now Layer 1 checks cosine floor, Layer 2 checks DTW floor, Layer 3 is the gap check.

	Limitation: Only computes calibration if words have ≥ 2 recordings. Single-sample words contribute nothing.
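
The cosine floor could be computed along these lines. This is a sketch, not the `/extract_bank` implementation: it assumes the pairs are taken between recordings of the same word (which is what the ≥ 2 recordings requirement suggests), and `cosine_floor` and the `bank` layout are hypothetical.

```python
import math
from itertools import combinations
from statistics import quantiles

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cosine_floor(bank, percentile=10, margin=0.05):
    """Nth percentile of same-word pairwise cosine similarities, minus a margin.

    bank: dict mapping word -> list of embedding vectors.
    Returns None when no word has at least two recordings.
    """
    sims = [
        cosine(a, b)
        for recordings in bank.values()
        for a, b in combinations(recordings, 2)  # single-recording words yield no pairs
    ]
    if not sims:
        return None
    if len(sims) == 1:  # statistics.quantiles needs at least two data points
        return sims[0] - margin
    # n=100 gives 99 cut points; index (percentile - 1) is the requested percentile.
    return quantiles(sims, n=100, method="inclusive")[percentile - 1] - margin

bank = {
    "mayim": [[1.0, 0.0], [1.0, 0.1]],
    "ima":   [[0.0, 1.0], [0.1, 1.0]],
    "aba":   [[1.0, 1.0]],  # one recording: contributes nothing
}
floor = cosine_floor(bank)
print(floor)
```

The DTW floor would follow the same pattern with a DTW similarity in place of `cosine`.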

	---

	## Approach 5: Score spread / entropy check
	Status: Considered, not implemented

	Measure the standard deviation or entropy of the top-N scores.
	If all scores are very similar (low spread), reject as unknown.

	Problem: After softmax (T=0.01), the spread is always amplified, making this measure unreliable in `mean` mode.

	Could work in `dtw` or `hybrid` mode where raw DTW scores are used (no softmax).
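
Since this approach was never implemented, the following is only a sketch of what a spread check on raw (non-softmax) scores might look like. `score_spread`, `is_low_spread`, and the `min_spread` value of 0.02 are all hypothetical and untuned.

```python
from statistics import pstdev

def score_spread(scores, top_n=5):
    """Population standard deviation of the top-N raw scores."""
    top = sorted(scores, reverse=True)[:top_n]
    return pstdev(top)

def is_low_spread(scores, min_spread=0.02, top_n=5):
    """True when the top-N scores are bunched together (candidate for _unknown)."""
    return score_spread(scores, top_n) < min_spread

known = [0.82, 0.64, 0.61, 0.60, 0.58]    # one score stands out
unknown = [0.61, 0.60, 0.60, 0.59, 0.58]  # everything looks alike

print(round(score_spread(known), 4), is_low_spread(known))
print(round(score_spread(unknown), 4), is_low_spread(unknown))
```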

	---

	## Approach 6: Channel disagreement (ensemble)
	Status: Partially available, not used for rejection

	If HuBERT and Wav2Vec2 disagree on the top word, the prediction is uncertain.
	Already surfaced in the UI as a warning ("⚠ Models disagree").

	Could extend to: if HuBERT winner ≠ Wav2Vec2 winner AND gap is small → reject to unknown.
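
The proposed extension could look roughly like this. It is a sketch only: `top_and_gap` and `should_reject` are hypothetical names, and measuring the gap on the HuBERT scores (rather than on W2V or a combination) is an assumption.

```python
def top_and_gap(scores):
    """Return (best word, gap between best and second-best score).

    scores: dict mapping word -> raw cosine score.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_word, top_score = ranked[0]
    gap = top_score - ranked[1][1] if len(ranked) > 1 else float("inf")
    return top_word, gap

def should_reject(hubert_scores, w2v_scores, min_gap=0.03):
    """Reject only when the models disagree AND HuBERT has no clear winner."""
    hubert_top, hubert_gap = top_and_gap(hubert_scores)
    w2v_top, _ = top_and_gap(w2v_scores)
    return hubert_top != w2v_top and hubert_gap < min_gap

hubert_close = {"mayim": 0.62, "ima": 0.61, "aba": 0.55}
hubert_clear = {"mayim": 0.80, "ima": 0.61, "aba": 0.55}
w2v_same = {"mayim": 0.70, "ima": 0.60, "aba": 0.50}
w2v_diff = {"mayim": 0.60, "ima": 0.66, "aba": 0.50}

print(should_reject(hubert_close, w2v_same))  # models agree
print(should_reject(hubert_close, w2v_diff))  # disagree and small gap
print(should_reject(hubert_clear, w2v_diff))  # disagree but HuBERT is confident
```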

	---

	## Approach 7: Raw cosine gap instead of softmax gap
	Status: Fixed (current version)

	The gap check was using the softmax-rescaled score (`r['score']`) with temperature=0.01.
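	This temperature multiplies every raw difference by 100 before exponentiation, so two scores only 0.02 apart in raw cosine end up with a softmax ratio of e² ≈ 7.4. Gaps typical of unknown audio therefore already look large, which makes any `min_gap` threshold meaningless.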

	Fix: Compare `results[0]['mean_score'] - results[1]['mean_score']` (raw cosine, pre-softmax) instead of `results[0]['score'] - results[1]['score']` (softmax).

	Expected values in raw cosine space:
	- Known word, correct match: raw_gap ≈ 0.05–0.15
	- Unknown word (no match): raw_gap ≈ 0.001–0.02
	- Starting threshold: 0.03

	Slider range: Changed from 0–0.5 (softmax space, useless) to 0–0.15 (raw cosine space, meaningful).
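
The amplification is easy to reproduce. This sketch assumes softmax is applied across the per-word scores with T=0.01; how the real endpoint applies it is not shown in this log, and the three scores are invented.

```python
import math

def softmax(scores, temperature):
    """Numerically stable softmax with temperature scaling."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Raw cosine scores with a gap in the "unknown" range (0.02).
raw = [0.62, 0.60, 0.58]
soft = softmax(raw, temperature=0.01)

raw_gap = raw[0] - raw[1]
soft_gap = soft[0] - soft[1]
print(f"raw gap: {raw_gap:.3f}, softmax gap: {soft_gap:.3f}")
```

Here a raw gap of 0.02 turns into a softmax gap of roughly 0.75, beyond the top of the old 0–0.5 slider, while the same input falls below the 0.03 starting threshold in raw cosine space.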

	---

	## Current Architecture (post-fix)

	```
	/compute_similarities rejection logic (3 layers, checked in order):

	Layer 1: Cosine floor (automatic, bank-calibrated)
	    if request.unknown_threshold is set AND best_raw_cosine < threshold:
	        → reject (score too low for any known word)
	    Requires: each word has ≥ 2 recordings in bank

	Layer 2: DTW floor (automatic, bank-calibrated)
	    if request.dtw_calibration_threshold is set AND best_dtw < threshold:
	        → reject (DTW score too low)
	    Requires: each word has ≥ 2 recordings in bank

	Layer 3: Raw cosine gap check (manual, user-controlled via slider)
	    raw_gap = results[0]['mean_score'] - results[1]['mean_score']  ← pre-softmax
	    if unknown_min_gap > 0 AND raw_gap < min_gap:
	        → reject (no clear winner in raw cosine space)
	    Start with min_gap = 0.03
	```

	Layers 1+2 are automatic once the bank has multi-recording words. Layer 3 needs manual tuning.

	---

	## Approach 8: Z-score automatic rejection (current Layer 3)
	Status: Active (current version)

	Replace the manual gap slider with a fully automatic statistical test.

	Insight: For a known word, the correct category scores much higher than all others → the top score is a strong outlier (high z-score above the mean). For an unknown word, all categories score similarly → the top score is barely above average (low z-score).

	```python
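	import numpy as np

	Z_THRESHOLD = 2.0  # default; tunable per request

	# results is sorted best-first; mean_score is the raw (pre-softmax) cosine
	all_raw = np.array([r['mean_score'] for r in results])
	mean_all = all_raw.mean()
	std_all = all_raw.std()  # population std (ddof=0)
	# Identical scores give std == 0; treat that as "no outlier" instead of dividing by zero
	z_top = (all_raw[0] - mean_all) / std_all if std_all > 0 else 0.0

	is_unknown = z_top < Z_THRESHOLD  # True → reject as unknown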
	```

	Why it's automatic: z-score is dimensionless and self-normalizing. No calibration data required. Works with 1 recording per word. Adapts to bank size and vocabulary, subject to the small-bank limitation below.

	Expected values:
	- Known word correctly identified: z ≈ 2.5–4.0
	- Unknown word (no match): z ≈ 0.5–1.8
	- Default threshold: 2.0 (tunable per-request via `unknown_z_threshold`)

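	Limitation: Unreliable with < 4 words in the bank (not enough data points for a meaningful distribution). Falls back to raw gap check in that case. Banks just above that cutoff are also affected: with N scores and the population standard deviation (the `np.std` default), the top z-score cannot exceed √(N−1), so a 4-word bank tops out at about 1.73 and can never reach the default threshold of 2.0.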
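
A self-contained sketch of the z-score test with its small-bank fallback. `top_z_score` and `reject_unknown` are hypothetical names, the score lists are invented, and the population standard deviation is assumed to match the `np.std` default.

```python
import math
from statistics import mean, pstdev

def top_z_score(raw_scores):
    """z-score of the highest score relative to all scores (population std)."""
    sigma = pstdev(raw_scores)
    if sigma == 0:
        return 0.0  # identical scores: nothing stands out
    return (max(raw_scores) - mean(raw_scores)) / sigma

def reject_unknown(raw_scores, z_threshold=2.0, min_words=4, fallback_gap=0.03):
    """True → predict _unknown. Uses the raw gap when the bank is too small."""
    if len(raw_scores) < min_words:
        ranked = sorted(raw_scores, reverse=True)
        return len(ranked) >= 2 and ranked[0] - ranked[1] < fallback_gap
    return top_z_score(raw_scores) < z_threshold

known = [0.82, 0.63, 0.62, 0.61, 0.60, 0.60, 0.60, 0.59, 0.58, 0.57]
unknown = [0.63, 0.62, 0.62, 0.61, 0.60, 0.60, 0.60, 0.59, 0.58, 0.57]
print(round(top_z_score(known), 2), reject_unknown(known))
print(round(top_z_score(unknown), 2), reject_unknown(unknown))

# A 4-word bank with a perfect match: z is capped at sqrt(N - 1), about 1.73,
# so the default threshold of 2.0 rejects it anyway.
tiny_bank = [0.95, 0.10, 0.10, 0.10]
print(round(top_z_score(tiny_bank), 2), reject_unknown(tiny_bank))
```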

	---

	## Approach 9: Dual-model agreement check (Layer 4)
	Status: Active (current version)

	Z-score alone fails when an unknown sound happens to phonetically match one known word — the fake winner creates a high z-score. But HuBERT and Wav2Vec2 have different architectures and biases, so if the unknown sound triggers a fake match in HuBERT, W2V often picks a different word.

	Rule: If HuBERT top word ≠ W2V top word → reject as unknown. Two independent models must agree for a prediction to be accepted.

	Why it works: Real known words produce a consistent phonetic signal that both models recognize. Unknown sounds that accidentally resemble one known word in one model's feature space rarely resemble the same word in the other model's space.

	When it can fail: If the unknown sound phonetically fools BOTH models into the same wrong word → both agree → passes. Rare but possible for sounds very similar to a known word.

	---

	## Current Architecture

	```
	/compute_similarities rejection logic (4 layers):

	Layer 1: Cosine floor (automatic, bank-calibrated)
	    if unknown_threshold is set AND best_raw_cosine < threshold → reject
	    Requires: each word has ≥ 2 recordings in bank

	Layer 2: DTW floor (automatic, bank-calibrated)
	    if dtw_calibration_threshold is set AND best_dtw < threshold → reject
	    Requires: each word has ≥ 2 recordings in bank

	Layer 3: Z-score (automatic, no calibration needed)
	    raw_scores = [r['mean_score'] for r in results]
	    z = (raw_scores[0] - mean(raw_scores)) / std(raw_scores)
	    if z < unknown_z_threshold (default 2.0) → reject
	    Fallback for < 4 words: raw cosine gap < 0.03 → reject

	Layer 4: Model agreement (automatic, only if W2V active)
	    if HuBERT top word ≠ W2V top word → reject
	    Runs independently; can reject even if Layers 1-3 passed.
	```

	Debug line shows: mean_score, cosine_floor, dtw_score, dtw_floor, z (HuBERT), w2v_z (W2V), z_threshold, agree (✓/✗), raw_gap.
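
Put together, the four layers might be wired up as below. This is a sketch of the control flow only: the `Evidence` container, `rejection_reason`, and the returned layer names are hypothetical, and the real endpoint works on request fields such as `unknown_threshold` and `dtw_calibration_threshold` instead of plain arguments.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional

@dataclass
class Evidence:
    """Hypothetical bundle of the values the rejection layers look at."""
    raw_scores: List[float]         # HuBERT raw cosine per word, sorted best-first
    hubert_top: str = ""
    w2v_top: Optional[str] = None   # None when W2V is inactive
    best_dtw: Optional[float] = None

def rejection_reason(ev, cosine_floor=None, dtw_floor=None,
                     z_threshold=2.0, fallback_gap=0.03):
    """Return the name of the first layer that rejects, or None to accept."""
    best = ev.raw_scores[0]
    # Layer 1: cosine floor (only when calibration is available)
    if cosine_floor is not None and best < cosine_floor:
        return "cosine_floor"
    # Layer 2: DTW floor (only when calibration and a DTW score are available)
    if dtw_floor is not None and ev.best_dtw is not None and ev.best_dtw < dtw_floor:
        return "dtw_floor"
    # Layer 3: z-score, or raw gap for banks with fewer than 4 words
    if len(ev.raw_scores) >= 4:
        sigma = pstdev(ev.raw_scores)
        z = (best - mean(ev.raw_scores)) / sigma if sigma > 0 else 0.0
        if z < z_threshold:
            return "z_score"
    elif len(ev.raw_scores) >= 2 and best - ev.raw_scores[1] < fallback_gap:
        return "raw_gap"
    # Layer 4: both models must pick the same word
    if ev.w2v_top is not None and ev.w2v_top != ev.hubert_top:
        return "model_disagreement"
    return None

known = [0.82, 0.63, 0.62, 0.61, 0.60, 0.60, 0.60, 0.59, 0.58, 0.57]
unknown = [0.63, 0.62, 0.62, 0.61, 0.60, 0.60, 0.60, 0.59, 0.58, 0.57]

print(rejection_reason(Evidence(known, "mayim", "mayim"), cosine_floor=0.70))
print(rejection_reason(Evidence(unknown, "mayim", "mayim")))
print(rejection_reason(Evidence(known, "mayim", "ima")))
```

Returning the layer name instead of a bare boolean makes it easier to see in the debug output which check fired.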

	---

	## Approach 10: Per-word z-floor calibration (current Layer 3)
	Status: Active (current version)

	Problem with global z-threshold: Different words sit in different regions of embedding space. Words with many phonetically similar neighbors naturally produce lower z-scores even when correctly predicted. A single global threshold rejects valid predictions for "crowded" words and misses unknowns for "isolated" words.

	Solution: At bank load time, compute each word's specific z-floor from the bank's internal geometry:
	1. Compute mean embedding for each word
	2. For word W: measure cosine similarity of W's mean vs all other word means → distribution of "other scores"
	3. Expected z-score = (1.0 - mean_other) / std_other, where 1.0 is the cosine similarity of W's mean with itself (the score of a perfect match)
	4. Apply reliability factor (0.60) to account for real new-recording variation vs bank-mean

	At inference time: The predicted word's own z-floor is used as the threshold instead of a global value. A word in a crowded neighborhood gets a lower threshold; an isolated word gets a higher one.

	Requires: ≥ 4 words in the bank (need meaningful distribution). Falls back to global threshold otherwise.

	Result: Correct predictions for "easy" and "hard" words both pass their respective floors; unknown sounds that don't match any word fail the floor for whichever word accidentally wins.
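
The four calibration steps can be sketched as follows. This is an illustration under assumptions, not the bank-loading code: `per_word_z_floors` and the `bank` layout are hypothetical, the toy 3-dimensional vectors are invented, and the population standard deviation is assumed for `std_other`.

```python
import math
from statistics import mean, pstdev

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def centroid(vectors):
    """Element-wise mean embedding of one word's recordings."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def per_word_z_floors(bank, reliability=0.60, min_words=4):
    """Return {word: z_floor}; empty when the bank has too few words.

    bank: dict mapping word -> list of embedding vectors.
    """
    if len(bank) < min_words:
        return {}  # caller falls back to the global threshold
    means = {word: centroid(recs) for word, recs in bank.items()}  # step 1
    floors = {}
    for word, vec in means.items():
        # Step 2: similarity of this word's mean to every other word's mean
        others = [cosine(vec, other) for w, other in means.items() if w != word]
        sigma = pstdev(others)
        if sigma == 0:
            continue  # degenerate geometry: leave this word on the global threshold
        expected_z = (1.0 - mean(others)) / sigma   # step 3
        floors[word] = reliability * expected_z     # step 4
    return floors

bank = {
    "mayim": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]],
    "ima":   [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
    "aba":   [[0.1, 0.0, 1.0], [0.0, 0.1, 0.9]],
    "od":    [[0.7, 0.7, 0.1], [0.6, 0.8, 0.0]],
}
floors = per_word_z_floors(bank)
for word, floor in floors.items():
    print(word, round(floor, 2))
```

At inference time the predicted word's entry in `floors` would replace the global z-threshold, with the global value used for any word missing from the dict.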

	---

	## Current Architecture

	```
	/compute_similarities rejection logic (4 layers):

	Layer 1: Cosine floor (bank self-calibrated, requires ≥ 2 recordings per word)
	Layer 2: DTW floor (bank self-calibrated, requires ≥ 2 recordings per word)
	Layer 3: Per-word z-score (computed at bank load from inter-word geometry, requires ≥ 4 words)
	    z_floor per word = (1.0 - mean_other_sims) / std_other_sims × 0.60
	    Falls back to global z_threshold if per-word floor unavailable
	    Falls back to raw gap check (0.03) if < 4 words in bank
	Layer 4: Model agreement (if W2V active: HuBERT top ≠ W2V top → reject)
	```

	Debug line shows: mean_score, cosine_floor, z (HuBERT), thr (per-word or global, labeled), w2v_z, agree (✓/✗), gap.

	3. Bank-specific tuning: Calibration is computed per bank-load. If the bank changes (new words added, recordings replaced), you must reload the bank to update thresholds.

	4. No `_unknown` bank category yet: A curated set of diverse Hebrew utterances not in the vocabulary could improve detection if included as `_unknown/` in the bank. Needs testing.

	---

	## Recommended settings (as of current fix)

	- Mode: `hybrid` (mean cosine pre-filter → DTW re-rank)
	- min_gap slider: Leave at 0 (disabled) and let the floor threshold handle rejection automatically
	- Bank requirement: Each word needs ≥ 2 recordings for calibration to activate
	- If calibration is unavailable: Enable min_gap slider with a value of ~0.05–0.10