marianbasti/audio-language-classification

Audio Classification


marianbasti committed on Oct 21

Commit 6e0631b · verified · 1 Parent(s): a55bd45

Update README.md

Files changed (1)

README.md +7 -7


@@ -25,14 +25,14 @@ tags:
 ---
 # Audio Language Classifier (SNAC backbone, Common Voice 17.0)
 First iteration of a lightweight (7M parameter) model for detecting language from a speech audio. Code is available at [GitHub](https://github.com/surus-lat/audio-language-classification)
-Summary:
-- Task: Spoken language identification (10 languages)
 - Backbone: SNAC (hubertsiuzdak/snac_24khz) with attention pooling
-- Dataset: Mozilla Common Voice 17.0 (streaming)
 - Sample rate: 24 kHz; Max audio length: 10 s (pad/trim)
 - Mixed precision: FP16
 - Best validation accuracy: 0.5016
-- Test accuracy: 0.3830
 Supported languages (labels):
 - en, es, fr, de, it, pt, ru, zh-CN, ja, ar
@@ -47,8 +47,8 @@ Out-of-scope:
 Data:
 - Source: Mozilla Common Voice 17.0 (streaming; per-language subset).
 - License: CC-0 (check dataset card for details).
-- Splits: Official validation/test splits used (use_official_splits: true).
-- Optional percent slice per split used during training: 25%.
 Model architecture:
 - Backbone: SNAC encoder (pretrained).
@@ -72,7 +72,7 @@ Training setup:
 - Label smoothing: 0.1
 - Max grad norm: 1.0
 - Seed: 42
-- Hardware: CUDA if available; FP16 enabled
 Preprocessing:
 - Mono waveform at 24 kHz; pad/trim to 10 s.

 ---
 # Audio Language Classifier (SNAC backbone, Common Voice 17.0)
 First iteration of a lightweight (7M parameter) model for detecting language from a speech audio. Code is available at [GitHub](https://github.com/surus-lat/audio-language-classification)
+In short:
+- Identification of spoken language in audio (10 languages)
 - Backbone: SNAC (hubertsiuzdak/snac_24khz) with attention pooling
+- Dataset used: Mozilla Common Voice 17.0 (streaming)
 - Sample rate: 24 kHz; Max audio length: 10 s (pad/trim)
 - Mixed precision: FP16
 - Best validation accuracy: 0.5016
 Supported languages (labels):
 - en, es, fr, de, it, pt, ru, zh-CN, ja, ar
 Data:
 - Source: Mozilla Common Voice 17.0 (streaming; per-language subset).
 - License: CC-0 (check dataset card for details).
+- Splits: Official validation/test splits used (use_official_splits: true). Parquet branch to handle the large sizes
+- Percent slice per split used during training: 25%.
 Model architecture:
 - Backbone: SNAC encoder (pretrained).
 - Label smoothing: 0.1
 - Max grad norm: 1.0
 - Seed: 42
+- Hardware: 1x RTX3090; FP16 enabled
 Preprocessing:
 - Mono waveform at 24 kHz; pad/trim to 10 s.
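
The preprocessing bullet above (mono, 24 kHz, pad/trim to 10 s) could be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the helper names `to_mono` and `pad_or_trim` are hypothetical, the input is assumed to be already resampled to 24 kHz, and right-side zero padding and trimming from the end are assumptions, since the card does not say which side is padded or trimmed.

```python
TARGET_SR = 24_000                    # sample rate stated in the card
MAX_SECONDS = 10                      # max audio length stated in the card
TARGET_LEN = TARGET_SR * MAX_SECONDS  # 240,000 samples per clip


def to_mono(channels):
    """Average equal-length channels into a single mono waveform.

    `channels` is a list of per-channel sample lists (hypothetical layout).
    """
    n = len(channels)
    return [sum(frame) / n for frame in zip(*channels)]


def pad_or_trim(samples, target_len=TARGET_LEN):
    """Return exactly `target_len` samples.

    Assumption: long clips are cut at the end, short clips are
    right-padded with zeros (silence).
    """
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0.0] * (target_len - len(samples))


if __name__ == "__main__":
    stereo = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]  # toy 3-sample stereo clip
    clip = pad_or_trim(to_mono(stereo))
    print(len(clip))
```

A fixed clip length keeps every batch the same shape, which is what allows the backbone and the pooling layer to run on stacked tensors without per-sample masking logic in the simplest case.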