---
license: apache-2.0
library_name: nemo
tags:
- onnx
- nemo
- speech-commands
- wake-word-spotting
datasets:
- HashNuke/tincan-wakewords-data
metrics:
- accuracy
model-index:
- name: TinCan Speech Commands Model
  results:
  - task:
      type: audio-classification
      name: Speech command recognition
    dataset:
      name: TinCan Speech Commands validation set
      type: tincan-speech-commands-validation
    metrics:
    - type: loss
      name: Validation loss
      value: 0.1493
    - type: accuracy
      name: Validation micro top-1 accuracy
      value: 95.28
    - type: accuracy
      name: Validation macro accuracy
      value: 94.61
---

# TinCan Speech Commands Model

A compact English speech-command recognition model for the TinCan app. It recognizes 47 short command classes and is designed for small-footprint command recognition where cloud ASR is unnecessary or undesirable. The exported ONNX artifact is under 400 KB, making it practical for local-first applications, prototypes, and edge deployments.

The 47 classes are:

* 12 custom words
* 35 words from the Google Speech Commands dataset v2

## Highlights

- 47-class English command recognizer
- ONNX export for portable inference
- Small model artifact: `model.onnx` is approximately 378 KB
- Based on NVIDIA NeMo's MatchboxNet command-recognition model family

## Base Model

This model uses NVIDIA NeMo's `commandrecognition_en_matchboxnet3x2x64_v2` MatchboxNet command-recognition architecture.

Base model reference: [`commandrecognition_en_matchboxnet3x2x64_v2`](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/commandrecognition_en_matchboxnet3x2x64_v2)

## Metrics

These metrics describe the currently exported `model.onnx` artifact.


| Metric | Value |
|---|---:|
| Validation loss | 0.1493 |
| Validation micro top-1 accuracy | 95.28% |
| Validation macro accuracy | 94.61% |

## Supported Commands

Custom TinCan commands:

`astra`, `bali`, `boston`, `capri`, `delhi`, `dublin`, `frisco`, `monaco`, `oslo`, `paris`, `seatown`, `tokyo`

Google Speech Commands labels:

`yes`, `no`, `up`, `down`, `left`, `right`, `on`, `off`, `stop`, `go`, `zero`, `one`, `two`, `three`, `four`, `five`, `six`, `seven`, `eight`, `nine`, `bed`, `bird`, `cat`, `dog`, `happy`, `house`, `marvin`, `sheila`, `tree`, `wow`, `backward`, `forward`, `follow`, `learn`, `visual`

## Inference Notes

The model outputs logits over the 47 labels listed in `labels.json`. Take the index of the highest logit and use it to look up the predicted command label.

## Training Provenance

| Field | Value |
|---|---|
| Model name | `commandrecognition_en_matchboxnet3x2x64_v2` |
| Export format | ONNX |
| Epochs | 10 |
| Batch size | 32 |

## Limitations

- This is a closed-vocabulary command recognizer, not a general speech-to-text model.
- The model is intended for English short-command recognition.
- Validation metrics may not fully predict performance with every microphone, speaker, accent, room, or noise condition.
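
## Example: Decoding Logits

The label lookup described in the Inference Notes can be sketched in plain Python. This is a minimal illustration, not the app's actual code: `decode_logits` is a hypothetical helper, and it assumes `labels.json` is a JSON array of label strings ordered by output index, which this card does not confirm. Running `model.onnx` itself and preparing the audio features are out of scope here.

```python
import json
import math


def decode_logits(logits, labels):
    """Return (label, probability) for the highest-scoring class.

    Hypothetical helper: `labels` is assumed to be ordered by output index.
    """
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    # Argmax: index of the largest logit.
    best = max(range(len(logits)), key=lambda i: logits[i])
    # Numerically stable softmax: subtract the max logit before exponentiating.
    peak = logits[best]
    total = sum(math.exp(x - peak) for x in logits)
    # The winning class contributes exp(0) = 1 to the numerator.
    return labels[best], 1.0 / total


# Toy example with three classes; the real model emits 47 logits.
# The JSON string stands in for the assumed contents of labels.json.
labels = json.loads('["astra", "bali", "boston"]')
label, prob = decode_logits([0.2, 3.1, -1.0], labels)
print(label, round(prob, 3))
```

The softmax probability is optional: the argmax alone gives the predicted label, but a probability can be useful for rejecting low-confidence predictions with a threshold chosen for the deployment.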