title: Lisper ZeroGPU
emoji: 🐍
colorFrom: green
colorTo: red
sdk: gradio
sdk_version: 5.29.1
app_file: app.py
pinned: false
hardware: zerogpu
short_description: Raw-audio lisp coaching on ZeroGPU.
Lisper ZeroGPU
This Space is the server-side companion to the browser WebGPU demo.
- Browser demo target: thomasjvu/lisper-gemma4-e2b-audio-onnx-q4f16
- ZeroGPU default target: thomasjvu/lisper-gemma4-e2b-audio-full
- Purpose: provide a reliable server-side fallback for users whose browser or laptop cannot comfortably run the q4f16 ONNX model.
The live path uses the v18 acoustic gate by default; server-side Gemma generation is optional and disabled until endpoint stability is verified.
Important Runtime Notes
ZeroGPU is for inference demos, not training. For hackathon submission, this Space should stay on the validated Gemma 4 E2B story. Larger Gemma variants are post-submission experiments and should not be linked as the submitted app.
Hardware note: this Space still requires Hugging Face ZeroGPU access on the owning account. If hardware remains cpu-basic, enable PRO/ZeroGPU access or select another GPU runtime before using it for a demo.
Status: this Space has been smoke-tested on zero-a10g with the fine-tuned E2B full model. Live analysis defaults to the v18 acoustic gate plus template coaching because the displayed class is anchored to that sidecar; optional Gemma generation can be re-enabled with LISPER_ZERO_GPU_USE_GEMMA_GENERATION=1 after endpoint stability is verified.
Input handling:
- The app rejects silent, empty, too-short, or very low-energy recordings before calling Gemma. This prevents confident but unfounded coaching on empty microphone captures.
- The primary recorder bypasses Gradio's microphone component and captures raw microphone PCM through the browser Web Audio API, then encodes a small WAV payload client-side. This avoids the blank/silent recordings seen from Gradio's built-in microphone recorder on some browser/device combinations.
- The Gradio Audio component is kept as upload-only fallback. If the browser recorder reports a near-zero peak/RMS, the issue is browser permission/input capture before the backend sees the file.
- Live classifications are gated before Gemma generation. The Space can return rejected_audio or inconclusive instead of forcing clear, palatal, or another class when the clip is silent, noisy, missing usable /s/ evidence, or classifier confidence is weak.
- Live analysis prefers the v18 ExtraTrees acoustic hint artifact in acoustic_extratrees_v18.joblib. In auto mode, a narrow KNN fallback can override only when the clip is extremely close to a known synthetic non-clear exemplar. If the acoustic artifact is missing, the app reports analysis unavailable instead of letting Gemma freely guess the class.
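The pre-Gemma rejection of empty, too-short, and low-energy clips can be pictured with a minimal sketch like the one below. It is not the Space's actual code: the function name precheck_audio and the three thresholds are hypothetical values chosen for illustration, and the real app may compute its checks differently.

```python
import math
from typing import Sequence

# Hypothetical thresholds for illustration only; the Space's real values
# are not stated in this README.
MIN_DURATION_S = 0.5
MIN_PEAK = 0.01
MIN_RMS = 0.003


def precheck_audio(samples: Sequence[float], sample_rate: int) -> str:
    """Return "ok" or "rejected_audio" for mono PCM scaled to [-1.0, 1.0]."""
    if sample_rate <= 0 or len(samples) == 0:
        return "rejected_audio"  # empty capture
    if len(samples) / sample_rate < MIN_DURATION_S:
        return "rejected_audio"  # too short to analyze
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if peak < MIN_PEAK or rms < MIN_RMS:
        return "rejected_audio"  # silent or very low energy
    return "ok"
```

A clip that fails this kind of check never reaches the classifier or Gemma, which is why a near-zero peak/RMS points at microphone permission or input capture rather than at the model.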
Set these Space variables/secrets:
- LISPER_ZERO_GPU_MODEL_ID: model repo to load. Defaults to thomasjvu/lisper-gemma4-e2b-audio-full.
- LISPER_ZERO_GPU_ADAPTER_ID: optional PEFT/LoRA adapter repo to load on top of LISPER_ZERO_GPU_MODEL_ID. Leave empty for merged full-model repos.
- LISPER_ZERO_GPU_DTYPE: float16, bfloat16, or float32. Defaults to float16.
- LISPER_ZERO_GPU_AUDIO_DTYPE: optional override for Gemma audio features. Adapter deployments default to bfloat16.
- LISPER_ZERO_GPU_LOAD_IN_4BIT: defaults to 1 when LISPER_ZERO_GPU_ADAPTER_ID is set, otherwise 0.
- LISPER_ZERO_GPU_ACOUSTIC_HINT: defaults to 1. Set to 0 only when intentionally testing direct Gemma audio classification.
- LISPER_ZERO_GPU_ACOUSTIC_MODEL: auto, extratrees, or knn. Defaults to auto, which uses v18 ExtraTrees when acoustic_extratrees_v18.joblib is present and allows only distance-gated KNN synthetic-exemplar overrides.
- LISPER_ZERO_GPU_KNN_OVERRIDE_MAX_DISTANCE: defaults to 0.25. Lower values are safer; higher values make the KNN synthetic-exemplar override more aggressive.
- LISPER_ZERO_GPU_KNN_OVERRIDE_MIN_CONFIDENCE: defaults to 0.90.
- LISPER_ZERO_GPU_LIVE_CLEAR_MIN_CONFIDENCE: defaults to 0.85.
- LISPER_ZERO_GPU_LIVE_NONCLEAR_MIN_CONFIDENCE: defaults to 0.55.
- LISPER_ZERO_GPU_MIN_SIBILANT_FRAME_RATIO: defaults to 0.015; increase this to reject more clips without enough /s/ or /z/ evidence.
- LISPER_ZERO_GPU_ALIGN_AUDIO_TOKENS: defaults to 0 for adapter deployments and 1 for merged-model deployments.
- LISPER_ZERO_GPU_MAX_SEQ_LENGTH: defaults to 2048.
- LISPER_ZERO_GPU_SIZE: large or xlarge. Defaults to large.
- LISPER_ZERO_GPU_MAX_NEW_TOKENS: defaults to 96.
- LISPER_ZERO_GPU_EAGER_LOAD: defaults to 0. Keep model loading inside the Analyze GPU call so the public page and recorder stay responsive.
- LISPER_ZERO_GPU_USE_GEMMA_GENERATION: defaults to 0. Keep disabled for the reliable live demo path; enable only when intentionally testing server-side Gemma generation.
- HF_TOKEN: required if the selected model repo is private or gated.
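To make the adapter-dependent defaults and the confidence thresholds concrete, here is a rough sketch of how a subset of these variables could be read and applied. The variable names and default values come from the list above; everything else is an assumption for illustration, including the helper names (load_settings, gate_label, knn_override_allowed), the rule that only the literal value 1 enables a flag, the use of inclusive comparisons, and the mapping of weak confidence to inconclusive.

```python
import os
from typing import Mapping, Optional

PREFIX = "LISPER_ZERO_GPU_"


def _get(env: Mapping[str, str], name: str, default: str) -> str:
    """Return the trimmed value of PREFIX + name, or default if unset or blank."""
    return env.get(PREFIX + name, "").strip() or default


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read a subset of the Space variables with their documented defaults."""
    env = os.environ if env is None else env
    has_adapter = bool(env.get(PREFIX + "ADAPTER_ID", "").strip())
    return {
        "model_id": _get(env, "MODEL_ID", "thomasjvu/lisper-gemma4-e2b-audio-full"),
        "dtype": _get(env, "DTYPE", "float16"),
        # These two defaults flip depending on whether an adapter repo is set.
        "load_in_4bit": _get(env, "LOAD_IN_4BIT", "1" if has_adapter else "0") == "1",
        "align_audio_tokens": _get(env, "ALIGN_AUDIO_TOKENS", "0" if has_adapter else "1") == "1",
        "clear_min_conf": float(_get(env, "LIVE_CLEAR_MIN_CONFIDENCE", "0.85")),
        "nonclear_min_conf": float(_get(env, "LIVE_NONCLEAR_MIN_CONFIDENCE", "0.55")),
        "knn_max_distance": float(_get(env, "KNN_OVERRIDE_MAX_DISTANCE", "0.25")),
        "knn_min_conf": float(_get(env, "KNN_OVERRIDE_MIN_CONFIDENCE", "0.90")),
    }


def gate_label(label: str, confidence: float, settings: dict) -> str:
    """Keep the label only if it clears its class-specific threshold (assumed rule)."""
    key = "clear_min_conf" if label == "clear" else "nonclear_min_conf"
    return label if confidence >= settings[key] else "inconclusive"


def knn_override_allowed(label: str, distance: float, confidence: float, settings: dict) -> bool:
    """Allow a KNN override only for a close, confident, non-clear match (assumed rule)."""
    return (
        label != "clear"
        and distance <= settings["knn_max_distance"]
        and confidence >= settings["knn_min_conf"]
    )
```

With no variables set, a clear prediction at 0.80 confidence would come back as inconclusive under this sketch, while a palatal prediction at 0.60 would pass, since the documented threshold for clear (0.85) is higher than the one for non-clear classes (0.55).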
Model Lineage
The currently validated Lisper fine-tune is Gemma 4 E2B. The E4B LoRA run completed training, but the Kaggle full-model merge failed due to local disk pressure after training, and the model has not passed the same held-out eval/publish-verdict gate. Keep E4B out of the hackathon submission path.
For hackathon claims, keep this distinction precise:
- The browser artifact is the E2B q4f16 ONNX/WebGPU package.
- The quality gate was the v18 hybrid acoustic+Gemma evaluation path.
- This ZeroGPU Space is the server fallback for the same E2B submission story, not a separate submitted model.