---
license: apache-2.0
library_name: nemo
tags:
- onnx
- nemo
- speech-commands
- wake-word-spotting
datasets:
- HashNuke/tincan-wakewords-data
metrics:
- accuracy
model-index:
- name: TinCan Speech Commands Model
  results:
  - task:
      type: audio-classification
      name: Speech command recognition
    dataset:
      name: TinCan Speech Commands validation set
      type: tincan-speech-commands-validation
    metrics:
    - type: loss
      name: Validation loss
      value: 0.1493
    - type: accuracy
      name: Validation micro top-1 accuracy
      value: 95.28
    - type: accuracy
      name: Validation macro accuracy
      value: 94.61
---

# TinCan Speech Commands Model

A compact English speech-command recognition model for the TinCan app.

This model recognizes 47 short command classes and is designed for small-footprint command recognition where cloud ASR is unnecessary or undesirable. The exported ONNX artifact is under 400 KB, making it practical for local-first applications, prototypes, and edge deployments.

The 47 classes consist of:

* 12 custom words
* 35 words from the Google Speech Commands dataset v2

## Highlights

- 47-class English command recognizer
- ONNX export for portable inference
- Small model artifact: `model.onnx` is approximately 378 KB
- Based on NVIDIA NeMo's MatchboxNet command-recognition model family

## Base Model

This model uses NVIDIA NeMo's `commandrecognition_en_matchboxnet3x2x64_v2` MatchboxNet command-recognition architecture.

Base model reference: [`commandrecognition_en_matchboxnet3x2x64_v2`](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/commandrecognition_en_matchboxnet3x2x64_v2)

## Metrics

These metrics describe the currently exported `model.onnx` artifact.

| Metric | Value |
|---|---:|
| Validation loss | 0.1493 |
| Validation micro top-1 accuracy | 95.28% |
| Validation macro accuracy | 94.61% |
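
The card does not state exactly how the two accuracy figures were computed. As a rough illustration, assuming the usual definitions (micro accuracy is the fraction of all samples classified correctly, macro accuracy is the unweighted mean of per-class accuracy), the difference can be sketched as follows. The function names and the toy labels are hypothetical and not part of the training code.

```python
from collections import defaultdict


def micro_accuracy(y_true, y_pred):
    """Fraction of all samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_accuracy(y_true, y_pred):
    """Unweighted mean of per-class accuracy, so every class counts equally."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    per_class = [hits[c] / totals[c] for c in totals]
    return sum(per_class) / len(per_class)


# Toy data: "yes" has four samples (all correct), "oslo" has two (one correct).
y_true = ["yes", "yes", "yes", "yes", "oslo", "oslo"]
y_pred = ["yes", "yes", "yes", "yes", "oslo", "paris"]

micro = micro_accuracy(y_true, y_pred)  # 5 correct out of 6 samples
macro = macro_accuracy(y_true, y_pred)  # mean of 1.0 ("yes") and 0.5 ("oslo")
```

Under these definitions, a macro value below the micro value is consistent with smaller classes being recognized somewhat less reliably than larger ones, although the card does not report per-class results.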

## Supported Commands

Custom TinCan commands:

`astra`, `bali`, `boston`, `capri`, `delhi`, `dublin`, `frisco`, `monaco`, `oslo`, `paris`, `seatown`, `tokyo`

Google Speech Commands labels:

`yes`, `no`, `up`, `down`, `left`, `right`, `on`, `off`, `stop`, `go`, `zero`, `one`, `two`, `three`, `four`, `five`, `six`, `seven`, `eight`, `nine`, `bed`, `bird`, `cat`, `dog`, `happy`, `house`, `marvin`, `sheila`, `tree`, `wow`, `backward`, `forward`, `follow`, `learn`, `visual`

## Inference Notes

The model outputs logits over the 47 labels listed in `labels.json`. Use the output index to look up the predicted command label.
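
The lookup step can be sketched as below. This is a minimal example, assuming `labels.json` is a JSON array of label strings ordered by output index; the card does not specify the file layout, so adjust the loader if it differs. The helper names (`load_labels`, `softmax`, `decode_prediction`) are hypothetical, and the logits are expected to be a flat list of 47 floats taken from the model output for one utterance.

```python
import json
import math


def load_labels(path="labels.json"):
    # Assumption: the file holds a JSON array of strings, one per output index.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(logits, labels):
    """Return (label, probability) for the highest-scoring output index."""
    if len(logits) != len(labels):
        raise ValueError(
            f"expected {len(labels)} logits, got {len(logits)}"
        )
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]
```

For instance, `decode_prediction([0.1, 2.0, -1.0], ["yes", "no", "up"])` selects `"no"`, since index 1 carries the largest logit. The softmax is optional for picking the label, but the resulting probability can serve as a confidence score for rejecting weak detections.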

## Training Provenance

| Field | Value |
|---|---|
| Model name | `commandrecognition_en_matchboxnet3x2x64_v2` |
| Export format | ONNX |
| Epochs | 10 |
| Batch size | 32 |

## Limitations

- This is a closed-vocabulary command recognizer, not a general speech-to-text model.
- The model is intended for English short-command recognition.
- Validation metrics may not fully predict performance with every microphone, speaker, accent, room, or noise condition.