---
title: Lisper ZeroGPU
emoji: 🐍
colorFrom: green
colorTo: red
sdk: gradio
sdk_version: 5.29.1
app_file: app.py
pinned: false
hardware: zerogpu
short_description: Raw-audio lisp coaching on ZeroGPU.
---

# Lisper ZeroGPU

This Space is the server-side companion to the browser WebGPU demo.

- Browser demo target: `thomasjvu/lisper-gemma4-e2b-audio-onnx-q4f16`
- ZeroGPU default target: `thomasjvu/lisper-gemma4-e2b-audio-full`
- Purpose: provide a reliable server-side fallback for users whose browser or laptop cannot comfortably run the q4f16 ONNX model.

The live path uses the v18 acoustic gate by default; server-side Gemma generation is optional and disabled until endpoint stability is verified.

## Important Runtime Notes

ZeroGPU is for inference demos, not training. For hackathon submission, this Space should stay on the validated Gemma 4 E2B story. Larger Gemma variants are post-submission experiments and should not be linked as the submitted app.

Hardware note: this Space still requires Hugging Face ZeroGPU access on the owning account. If hardware remains `cpu-basic`, enable PRO/ZeroGPU access or select another GPU runtime before using it for a demo.

Status: this Space has been smoke-tested on `zero-a10g` with the fine-tuned E2B full model. Live analysis defaults to the v18 acoustic gate plus template coaching because the displayed class is anchored to that sidecar; optional Gemma generation can be re-enabled with `LISPER_ZERO_GPU_USE_GEMMA_GENERATION=1` after endpoint stability is verified.

Input handling:

- The app rejects silent, empty, too-short, or very low-energy recordings before calling Gemma. This prevents confident but fabricated coaching on empty microphone captures.
- The primary recorder bypasses Gradio's microphone component and captures raw microphone PCM through the browser Web Audio API, then encodes a small WAV payload client-side.
  This avoids the blank or silent recordings seen from Gradio's built-in microphone recorder on some browser/device combinations.
- The Gradio `Audio` component is kept as an upload-only fallback. If the browser recorder reports a near-zero peak/RMS, the problem lies in browser permissions or input capture, before the backend ever sees the file.
- Live classifications are gated before Gemma generation. The Space can return `rejected_audio` or `inconclusive` instead of forcing `clear`, `palatal`, or another class when the clip is silent, noisy, missing usable /s/ evidence, or classifier confidence is weak.
- Live analysis prefers the v18 ExtraTrees acoustic hint artifact in `acoustic_extratrees_v18.joblib`. In `auto` mode, a narrow KNN fallback can override only when the clip is extremely close to a known synthetic non-clear exemplar. If the acoustic artifact is missing, the app reports analysis unavailable instead of letting Gemma freely guess the class.

Set these Space variables/secrets:

- `LISPER_ZERO_GPU_MODEL_ID`: model repo to load. Defaults to `thomasjvu/lisper-gemma4-e2b-audio-full`.
- `LISPER_ZERO_GPU_ADAPTER_ID`: optional PEFT/LoRA adapter repo to load on top of `LISPER_ZERO_GPU_MODEL_ID`. Leave empty for merged full-model repos.
- `LISPER_ZERO_GPU_DTYPE`: `float16`, `bfloat16`, or `float32`. Defaults to `float16`.
- `LISPER_ZERO_GPU_AUDIO_DTYPE`: optional override for Gemma audio features. Adapter deployments default to `bfloat16`.
- `LISPER_ZERO_GPU_LOAD_IN_4BIT`: defaults to `1` when `LISPER_ZERO_GPU_ADAPTER_ID` is set, otherwise `0`.
- `LISPER_ZERO_GPU_ACOUSTIC_HINT`: defaults to `1`. Set to `0` only when intentionally testing direct Gemma audio classification.
- `LISPER_ZERO_GPU_ACOUSTIC_MODEL`: `auto`, `extratrees`, or `knn`. Defaults to `auto`, which uses v18 ExtraTrees when `acoustic_extratrees_v18.joblib` is present and allows only distance-gated KNN synthetic-exemplar overrides.
- `LISPER_ZERO_GPU_KNN_OVERRIDE_MAX_DISTANCE`: defaults to `0.25`.
  Lower values are safer; higher values make the KNN synthetic-exemplar override more aggressive.
- `LISPER_ZERO_GPU_KNN_OVERRIDE_MIN_CONFIDENCE`: defaults to `0.90`.
- `LISPER_ZERO_GPU_LIVE_CLEAR_MIN_CONFIDENCE`: defaults to `0.85`.
- `LISPER_ZERO_GPU_LIVE_NONCLEAR_MIN_CONFIDENCE`: defaults to `0.55`.
- `LISPER_ZERO_GPU_MIN_SIBILANT_FRAME_RATIO`: defaults to `0.015`; increase this to reject more clips without enough /s/ or /z/ evidence.
- `LISPER_ZERO_GPU_ALIGN_AUDIO_TOKENS`: defaults to `0` for adapter deployments and `1` for merged-model deployments.
- `LISPER_ZERO_GPU_MAX_SEQ_LENGTH`: defaults to `2048`.
- `LISPER_ZERO_GPU_SIZE`: `large` or `xlarge`. Defaults to `large`.
- `LISPER_ZERO_GPU_MAX_NEW_TOKENS`: defaults to `96`.
- `LISPER_ZERO_GPU_EAGER_LOAD`: defaults to `0`. Keep model loading inside the Analyze GPU call so the public page and recorder stay responsive.
- `LISPER_ZERO_GPU_USE_GEMMA_GENERATION`: defaults to `0`. Keep disabled for the reliable live demo path; enable only when intentionally testing server-side Gemma generation.
- `HF_TOKEN`: required if the selected model repo is private or gated.

## Model Lineage

The currently validated Lisper fine-tune is Gemma 4 E2B. The E4B LoRA run completed training, but the Kaggle full-model merge failed due to local disk pressure after training, and it has not passed the same held-out eval/publish-verdict gate. Keep E4B out of the hackathon submission path.

For hackathon claims, keep this distinction precise:

- The browser artifact is the E2B q4f16 ONNX/WebGPU package.
- The quality gate was the v18 hybrid acoustic+Gemma evaluation path.
- This ZeroGPU Space is the server fallback for the same E2B submission story, not a separate submitted model.
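
Several of the Space variables listed earlier have defaults that depend on whether `LISPER_ZERO_GPU_ADAPTER_ID` is set. The sketch below shows one way those adapter-dependent defaults could be resolved. It is not taken from `app.py`: the function name `resolve_settings`, the returned dictionary keys, the accepted truthy strings, and the choice of `None` for an unset audio dtype on merged deployments are all assumptions made for illustration.

```python
from typing import Mapping, Optional

# Default model repo as documented in this README.
DEFAULT_MODEL_ID = "thomasjvu/lisper-gemma4-e2b-audio-full"
ALLOWED_DTYPES = {"float16", "bfloat16", "float32"}
# Assumption: the README only documents "1" and "0"; the extra spellings are a convenience.
TRUE_VALUES = {"1", "true", "yes", "on"}


def _text(env: Mapping[str, str], name: str) -> str:
    """Return the stripped value of a variable, or "" when unset."""
    return env.get(name, "").strip()


def _flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a 0/1 style flag, falling back to `default` when unset or empty."""
    raw = _text(env, name).lower()
    return default if not raw else raw in TRUE_VALUES


def resolve_settings(env: Mapping[str, str]) -> dict:
    """Hypothetical resolver for the adapter-dependent defaults described above."""
    adapter_id = _text(env, "LISPER_ZERO_GPU_ADAPTER_ID")
    has_adapter = bool(adapter_id)  # an empty value means a merged full-model repo

    dtype = _text(env, "LISPER_ZERO_GPU_DTYPE") or "float16"
    if dtype not in ALLOWED_DTYPES:
        raise ValueError(f"unsupported LISPER_ZERO_GPU_DTYPE: {dtype!r}")

    # Adapter deployments default to bfloat16 audio features. For merged
    # deployments the README gives no default, so None (no override) is assumed.
    audio_dtype: Optional[str] = _text(env, "LISPER_ZERO_GPU_AUDIO_DTYPE") or (
        "bfloat16" if has_adapter else None
    )

    return {
        "model_id": _text(env, "LISPER_ZERO_GPU_MODEL_ID") or DEFAULT_MODEL_ID,
        "adapter_id": adapter_id or None,
        "dtype": dtype,
        "audio_dtype": audio_dtype,
        # 4-bit loading defaults on only when an adapter is configured.
        "load_in_4bit": _flag(env, "LISPER_ZERO_GPU_LOAD_IN_4BIT", has_adapter),
        # Audio-token alignment defaults off for adapters, on for merged models.
        "align_audio_tokens": _flag(
            env, "LISPER_ZERO_GPU_ALIGN_AUDIO_TOKENS", not has_adapter
        ),
        "use_gemma_generation": _flag(
            env, "LISPER_ZERO_GPU_USE_GEMMA_GENERATION", False
        ),
    }
```

Calling `resolve_settings(os.environ)` would read the live Space configuration. Taking a plain mapping rather than reading `os.environ` directly keeps the resolver easy to exercise with a hand-built dictionary, including explicit overrides such as setting `LISPER_ZERO_GPU_LOAD_IN_4BIT` to `0` on an adapter deployment.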