Update README.md
README.md CHANGED

@@ -25,14 +25,14 @@ tags:
 ---
 # Audio Language Classifier (SNAC backbone, Common Voice 17.0)
 First iteration of a lightweight (7M parameter) model for detecting language from a speech audio. Code is available at [GitHub](https://github.com/surus-lat/audio-language-classification)
-
-
+
+In short:
+- Identification of spoken language in audio (10 languages)
 - Backbone: SNAC (hubertsiuzdak/snac_24khz) with attention pooling
-- Dataset: Mozilla Common Voice 17.0 (streaming)
+- Dataset used: Mozilla Common Voice 17.0 (streaming)
 - Sample rate: 24 kHz; Max audio length: 10 s (pad/trim)
 - Mixed precision: FP16
 - Best validation accuracy: 0.5016
-- Test accuracy: 0.3830
 
 Supported languages (labels):
 - en, es, fr, de, it, pt, ru, zh-CN, ja, ar

@@ -47,8 +47,8 @@ Out-of-scope:
 Data:
 - Source: Mozilla Common Voice 17.0 (streaming; per-language subset).
 - License: CC-0 (check dataset card for details).
-- Splits: Official validation/test splits used (use_official_splits: true).
--
+- Splits: Official validation/test splits used (use_official_splits: true). Parquet branch to handle the large sizes
+- Percent slice per split used during training: 25%.
 
 Model architecture:
 - Backbone: SNAC encoder (pretrained).

@@ -72,7 +72,7 @@ Training setup:
 - Label smoothing: 0.1
 - Max grad norm: 1.0
 - Seed: 42
-- Hardware:
+- Hardware: 1x RTX3090; FP16 enabled
 
 Preprocessing:
 - Mono waveform at 24 kHz; pad/trim to 10 s.
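
For reference, the preprocessing line in the README (mono waveform at 24 kHz, pad/trim to 10 s) could look roughly like the sketch below. It is not taken from the repository: the names `to_mono` and `pad_or_trim` are hypothetical, the downmix by channel averaging and the zero padding on the right are assumptions, and resampling to 24 kHz is left out (the input is assumed to be at 24 kHz already). Plain Python lists are used only to keep the example dependency-free; a real pipeline would most likely operate on tensors.

```python
from typing import List, Sequence

SAMPLE_RATE = 24_000  # Hz, as stated in the README
MAX_SECONDS = 10      # s, as stated in the README
TARGET_LEN = SAMPLE_RATE * MAX_SECONDS  # 240,000 samples per clip


def to_mono(channels: Sequence[Sequence[float]]) -> List[float]:
    """Average equally long channels into one channel (assumed downmix)."""
    if not channels:
        return []
    n = len(channels)
    # zip(*channels) yields one tuple per time step, one value per channel.
    return [sum(frame) / n for frame in zip(*channels)]


def pad_or_trim(samples: Sequence[float],
                target_len: int = TARGET_LEN) -> List[float]:
    """Trim to target_len, or zero-pad on the right (padding side assumed)."""
    clipped = list(samples[:target_len])
    return clipped + [0.0] * (target_len - len(clipped))


if __name__ == "__main__":
    stereo = [[0.2] * 1000, [0.4] * 1000]  # hypothetical 2-channel clip
    clip = pad_or_trim(to_mono(stereo))
    print(len(clip))
```

With these assumptions every clip reaches the model as exactly 240,000 samples, which is what lets batches be stacked without a custom collate step.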