---
library_name: transformers
pipeline_tag: audio-classification
tags:
- audio
- audio-classification
- keyword-spotting
- kws
- wav2vec2
- pytorch
- onnx
- sagemaker
- streaming-inference
- realtime
datasets:
- google/speech_commands
base_model:
- facebook/wav2vec2-base
license: other
language: en
---
# Model Card for hf-kws (Wav2Vec2 Keyword Spotting)
<!-- Provide a quick summary of what the model is/does. -->
A compact, end‑to‑end pipeline for training, evaluating, and deploying a **Wav2Vec2‑based keyword spotting (KWS)** model on **Google Speech Commands v2**. The repository includes offline and **real‑time streaming inference**, **ONNX export**, and **AWS SageMaker** deployment scripts.
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
This project fine‑tunes a Wav2Vec2 audio classifier (e.g., `facebook/wav2vec2-base`) for keyword spotting on **Speech Commands v2** using Hugging Face `transformers`/`datasets`. It supports microphone streaming with sliding‑window smoothing, file‑based inference, saved JSON metrics/plots, and a minimal **SageMaker** stack (train, realtime/serverless deploy, batch transform).
- **Developed by:** Amirhossein Yousefiramandi (GitHub: [@amirhossein-yousefi](https://github.com/amirhossein-yousefi))
- **Model type:** Audio Classification (Keyword Spotting) — Wav2Vec2 backbone with classification head
- **Language(s) (NLP):** English
- **License:** No explicit repository license file found (verify before redistribution)
- **Finetuned from model:** `facebook/wav2vec2-base` (16 kHz)
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/amirhossein-yousefi/keyword-spotting
- **Paper:** Warden, P. (2018). *Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition*. arXiv:1804.03209.
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
- On‑device or edge keyword detection for small command sets (e.g., “yes/no/up/down/stop/go”).
- Real‑time wake word / trigger prototypes via the included streaming inference script.
- Batch scoring of short audio clips for command presence via CLI or SageMaker Batch Transform.
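For the SageMaker path mentioned above, a minimal real-time deployment could look like the sketch below. The S3 path, IAM role, and container versions are placeholders and must match your account and an available Hugging Face Deep Learning Container; the repository's own deployment scripts are the authoritative reference.

```python
from sagemaker.huggingface import HuggingFaceModel

# All values below are placeholders; adjust to your account and an available HF DLC.
hf_model = HuggingFaceModel(
    model_data="s3://<your-bucket>/kws_w2v2/model.tar.gz",  # packaged fine-tuned checkpoint
    role="<your-sagemaker-execution-role-arn>",
    transformers_version="4.37",   # must match an available Hugging Face inference container
    pytorch_version="2.1",
    py_version="py310",
    env={"HF_TASK": "audio-classification"},  # tells the inference toolkit which pipeline to build
)
predictor = hf_model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```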
### Downstream Use
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
- Fine‑tune on custom keyword lists or languages (swap dataset, keep pipeline).
- Distillation/quantization for mobile deployment (roadmap mentions TFLite/CoreML).
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
- Open‑vocabulary ASR or general transcription.
- Long‑form audio or multi‑speaker diarization.
- Safety‑critical activation (e.g., medical/industrial controls) without rigorous evaluation and fail‑safes.
- Always‑on surveillance scenarios without clear user consent and privacy controls.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
- **Language & domain bias:** Trained on **English**, one‑second command words—limited transfer to other languages, accents, far‑field mics, or noisy environments without adaptation.
- **Vocabulary constraints:** Detects from a fixed label set; out‑of‑vocabulary words may map to “unknown” or be misclassified.
- **Data licensing:** Ensure **CC‑BY‑4.0** attribution when redistributing models trained on Speech Commands.
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. Evaluate on target devices/microphones; add noise augmentation and tune detection thresholds for deployment context.
## Usage with Hugging Face `transformers` (Recommended)
```python
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification, pipeline

model_id = "Amirhossein75/Keyword-Spotting"

# Option A: quick pipeline
clf = pipeline("audio-classification", model=model_id)
print(clf("path/to/1sec_16kHz.wav"))

# Option B: manual pre/post-processing
import soundfile as sf
import torch

fe = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModelForAudioClassification.from_pretrained(model_id)
wave, sr = sf.read("path/to/1sec_16kHz.wav")
inputs = fe(wave, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred_id = int(logits.argmax(-1))
print(model.config.id2label[pred_id])
```
**Note:** prefer `AutoFeatureExtractor` here; this audio-classification checkpoint ships a feature extractor but no tokenizer, so `AutoProcessor` may not load cleanly.
## How to Get Started with the Model
Use the code below to get started with the model.
```bash
# clone and install
git clone https://github.com/amirhossein-yousefi/keyword-spotting
cd keyword-spotting
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
# train (example)
python -m src.train \
  --checkpoint facebook/wav2vec2-base \
  --output_dir ./checkpoints/kws_w2v2 \
  --num_train_epochs 8 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16
# single-file inference
python -m src.infer --model_dir ./checkpoints/kws_w2v2 --wav_path /path/to/your.wav --top_k 5
# streaming (microphone)
python -m src.stream_infer --model_dir ./checkpoints/kws_w2v2
# evaluate
python -m src.evaluate_fn --model_dir ./checkpoints/kws_w2v2
```
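The repository also includes ONNX export. As a generic, framework-level sketch (not necessarily the repo's own export script), the fine-tuned classifier can be exported with `torch.onnx.export`; the paths and opset below are illustrative:

```python
import torch
from transformers import AutoModelForAudioClassification

model = AutoModelForAudioClassification.from_pretrained("./checkpoints/kws_w2v2")
model.config.return_dict = False  # return plain tuples so tracing/export is straightforward
model.eval()

dummy = torch.randn(1, 16000)  # one second of 16 kHz audio
torch.onnx.export(
    model,
    (dummy,),
    "kws_w2v2.onnx",
    input_names=["input_values"],
    output_names=["logits"],
    dynamic_axes={"input_values": {0: "batch", 1: "samples"}, "logits": {0: "batch"}},
    opset_version=14,
)
```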
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
- **Dataset:** Google **Speech Commands v2** (1‑second WAVs, 16 kHz; English; CC‑BY‑4.0). Typical label set includes “yes/no, up/down, left/right, on/off, stop/go,” plus auxiliary words and silence/unknown classes.
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
#### Preprocessing
- Resampled/processed at **16 kHz**.
- Augmentations: **time‑shift, noise, random gain**.
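As an illustration of those augmentations (not the repository's exact implementation), a waveform-level version might look like:

```python
import numpy as np

def augment(wave: np.ndarray, sr: int = 16000, rng=np.random.default_rng()) -> np.ndarray:
    """Illustrative time-shift + noise + random-gain augmentation for 1 s clips."""
    # Time shift: roll the waveform by up to +/-100 ms
    shift = int(rng.integers(-sr // 10, sr // 10))
    wave = np.roll(wave, shift)
    # Additive white noise at a small random amplitude
    wave = wave + rng.uniform(0.001, 0.01) * rng.standard_normal(len(wave))
    # Random gain between -6 dB and +6 dB
    gain_db = rng.uniform(-6.0, 6.0)
    wave = wave * (10.0 ** (gain_db / 20.0))
    return np.clip(wave, -1.0, 1.0).astype(np.float32)
```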
#### Training Hyperparameters
- **Training regime:** fp32 (example; adjust as needed)
- **Backbone:** `facebook/wav2vec2-base` (audio classification head).
- **Epochs (example):** 8
- **Batch size (example):** 16 train / 16 eval
- **Framework:** PyTorch + Hugging Face `transformers`
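The example hyperparameters above roughly correspond to the following `TrainingArguments` sketch; fields not stated in this card (e.g., the learning rate) are hypothetical:

```python
from transformers import TrainingArguments

# Illustrative only; mirrors the example hyperparameters above, not the repo's exact config.
args = TrainingArguments(
    output_dir="./checkpoints/kws_w2v2",
    num_train_epochs=8,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    fp16=False,          # fp32 training regime as listed above
    learning_rate=3e-5,  # hypothetical; not specified in this card
)
```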
#### Speeds, Sizes, Times
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
- **Example environment:** Single **NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB)**, PyTorch **2.8.0+cu129**, CUDA **12.9**.
- **Reported training runtime:** ~**3,446.3 s** for the default run (see repository logs/README).
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
- Speech Commands v2 **test split**.
#### Factors
- Evaluate by **keyword**, **speaker**, **noise type/level**, and **device/mic** to assess robustness.
#### Metrics
- **Accuracy**, **F1**, **Precision**, **Recall**, and **Cross‑entropy loss**; plus runtime and throughput.
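A Trainer-style metrics function consistent with the weighted scores reported below could look like this (illustrative sketch, not necessarily the repository's `evaluate_fn`):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Weighted precision/recall/F1 plus accuracy, matching the columns in the results table."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="weighted")
    return {"accuracy": accuracy_score(labels, preds), "precision": precision, "recall": recall, "f1": f1}
```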
### Results
Below are the aggregated metrics at **epoch 10**.
| Split | Accuracy | F1 (weighted) | Precision (weighted) | Recall (weighted) | Loss | Runtime (s) | Samples/s | Steps/s |
|------:|:--------:|:-------------:|:---------------------:|:-----------------:|:----:|:-----------:|:---------:|:-------:|
| **Validation** | 97.13% | 97.14% | 97.17% | 97.13% | 0.123 | 9.29 | 1074.9 | 33.60 |
| **Test** | 96.79% | 96.79% | 96.81% | 96.79% | 0.137 | 9.99 | 1101.97 | 34.446 |
#### Summary
The pipeline reproduces standard Wav2Vec2 KWS performance on Speech Commands; tailor thresholds and augmentations for deployment.
## Model Examination
<!-- Relevant interpretability work for the model goes here -->
- Inspect per‑class confusion matrices and score distributions from saved metrics to identify false‑positive/negative patterns.
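For example, with predictions and labels collected from the test split, a row-normalized confusion matrix can be plotted as in the sketch below (scikit-learn based; variable names are hypothetical):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

def plot_confusion(y_true, y_pred, label_names):
    """Row-normalized confusion matrix; bright off-diagonal cells reveal confusable keywords."""
    disp = ConfusionMatrixDisplay.from_predictions(
        y_true, y_pred, display_labels=label_names, normalize="true", xticks_rotation=90
    )
    disp.figure_.tight_layout()
    plt.show()
```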
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** Single NVIDIA GeForce RTX 3080 Ti Laptop GPU
- **Hours used:** ~0.96 h (example run)
## Technical Specifications
### Model Architecture and Objective
- **Architecture:** Wav2Vec2 (self‑supervised acoustic encoder) + classification head for KWS.
### Compute Infrastructure
#### Hardware
- Example: **NVIDIA RTX 3080 Ti Laptop**, 16 GB VRAM.
#### Software
- **PyTorch 2.8.0+cu129**, CUDA driver **12.9**; Hugging Face `transformers`/`datasets`.
## Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**BibTeX (Dataset):**
```bibtex
@article{warden2018speechcommands,
author = {Warden, Pete},
title = {Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition},
journal= {arXiv e-prints},
eprint = {1804.03209},
year = {2018},
month = apr,
url = {https://arxiv.org/abs/1804.03209}
}
```
**APA (Dataset):**
Warden, P. (2018). *Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition*. arXiv:1804.03209.
## Glossary
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
- **KWS:** Keyword Spotting — detecting a small set of pre‑registered words in short audio clips.
- **Streaming inference:** Frame‑by‑frame scoring with smoothing over a sliding window.
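A minimal sketch of that smoothing logic, assuming per-frame class probabilities are already available (the repository's `stream_infer` implementation may differ):

```python
from collections import deque
import numpy as np

class ScoreSmoother:
    """Average per-frame class probabilities over a sliding window; fire only above a threshold."""
    def __init__(self, num_frames: int = 5, threshold: float = 0.85):
        self.window = deque(maxlen=num_frames)
        self.threshold = threshold

    def update(self, frame_probs: np.ndarray):
        self.window.append(frame_probs)
        mean_probs = np.mean(self.window, axis=0)
        best = int(mean_probs.argmax())
        score = float(mean_probs[best])
        # Return the detected label id only when the smoothed score clears the threshold.
        return (best, score) if score >= self.threshold else (None, score)
```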
## More Information
- Speech Commands dataset card: https://huggingface.co/datasets/google/speech_commands
- Wav2Vec2 model docs: https://huggingface.co/docs/transformers/en/model_doc/wav2vec2
## Model Card Authors
- Amirhossein Yousefiramandi
## Model Card Contact
Please open a GitHub Issue in this repository with questions or requests.