Commit 6e0631b (verified) by marianbasti · 1 parent: a55bd45

Update README.md

Files changed (1)
  1. README.md +7 -7
README.md CHANGED
@@ -25,14 +25,14 @@ tags:
 ---
 # Audio Language Classifier (SNAC backbone, Common Voice 17.0)
 First iteration of a lightweight (7M parameter) model for detecting language from a speech audio. Code is available at [GitHub](https://github.com/surus-lat/audio-language-classification)
-Summary:
-- Task: Spoken language identification (10 languages)
+
+In short:
+- Identification of spoken language in audio (10 languages)
 - Backbone: SNAC (hubertsiuzdak/snac_24khz) with attention pooling
-- Dataset: Mozilla Common Voice 17.0 (streaming)
+- Dataset used: Mozilla Common Voice 17.0 (streaming)
 - Sample rate: 24 kHz; Max audio length: 10 s (pad/trim)
 - Mixed precision: FP16
 - Best validation accuracy: 0.5016
-- Test accuracy: 0.3830
 
 Supported languages (labels):
 - en, es, fr, de, it, pt, ru, zh-CN, ja, ar
@@ -47,8 +47,8 @@ Out-of-scope:
 Data:
 - Source: Mozilla Common Voice 17.0 (streaming; per-language subset).
 - License: CC-0 (check dataset card for details).
-- Splits: Official validation/test splits used (use_official_splits: true).
-- Optional percent slice per split used during training: 25%.
+- Splits: Official validation/test splits used (use_official_splits: true). Parquet branch to handle the large sizes
+- Percent slice per split used during training: 25%.
 
 Model architecture:
 - Backbone: SNAC encoder (pretrained).
@@ -72,7 +72,7 @@ Training setup:
 - Label smoothing: 0.1
 - Max grad norm: 1.0
 - Seed: 42
-- Hardware: CUDA if available; FP16 enabled
+- Hardware: 1x RTX3090; FP16 enabled
 
 Preprocessing:
 - Mono waveform at 24 kHz; pad/trim to 10 s.
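
Note for readers of this diff: the README's preprocessing line ("Mono waveform at 24 kHz; pad/trim to 10 s") could look roughly like the sketch below. This is not the repository's code. The helper names `to_mono` and `pad_or_trim` are hypothetical, and the choices to average channels, zero-pad on the right, and cut the tail are assumptions, since the README does not say how padding or trimming is done. Resampling to 24 kHz is not shown.

```python
from typing import List, Sequence

SAMPLE_RATE = 24_000                      # Hz, as stated in the README
MAX_SECONDS = 10                          # max audio length, as stated in the README
TARGET_LEN = SAMPLE_RATE * MAX_SECONDS    # 240_000 samples per clip


def to_mono(channels: Sequence[Sequence[float]]) -> List[float]:
    """Average equally long channels into one mono waveform (assumed downmix)."""
    if not channels:
        return []
    n = len(channels[0])
    if any(len(ch) != n for ch in channels):
        raise ValueError("all channels must have the same length")
    return [sum(frame) / len(channels) for frame in zip(*channels)]


def pad_or_trim(wave: Sequence[float], target_len: int = TARGET_LEN) -> List[float]:
    """Cut the tail or right-pad with zeros so the result has target_len samples."""
    clipped = list(wave[:target_len])
    return clipped + [0.0] * (target_len - len(clipped))
```

With these assumptions every clip reaches the model as exactly 240,000 samples, so batches have a fixed shape regardless of the original recording length.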