Audio Classification
Safetensors
wav2vec2-bert
5roop committed on
Commit
416ef89
1 Parent(s): 777f975

Update README.md

  1. README.md +49 -2
README.md CHANGED
@@ -18,9 +18,47 @@ metrics:

  # Frame classification for filled pauses

  This model classifies individual 20 ms frames of audio based on the
  presence of filled pauses ("eee", "errm", ...).

  # Training data

@@ -28,6 +66,16 @@ The model was trained on human-annotated Slovenian speech corpus
  [ROG-Artur](http://hdl.handle.net/11356/1992). Recordings from the train split were segmented into
  chunks of at most 30 s.

  # Evaluation

  Although the output of the model is a series of 0s and 1s, one per 20 ms frame,
@@ -182,5 +230,4 @@ print(ds["intervals"][0])



- # Citation
- Coming soon.

  # Frame classification for filled pauses

+ ## Paper
+ ```bibtex
+ @inproceedings{ljubesic-etal-2025-identifying,
+ title = "Identifying Filled Pauses in Speech Across South and {W}est {S}lavic Languages",
+ author = "Ljube{\v{s}}i{\'c}, Nikola and
+ Porupski, Ivan and
+ Rupnik, Peter",
+ editor = "Piskorski, Jakub and
+ P{\v{r}}ib{\'a}{\v{n}}, Pavel and
+ Nakov, Preslav and
+ Yangarber, Roman and
+ Marcinczuk, Michal",
+ booktitle = "Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)",
+ month = jul,
+ year = "2025",
+ address = "Vienna, Austria",
+ publisher = "Association for Computational Linguistics",
+ url = "https://aclanthology.org/2025.bsnlp-1.1/",
+ doi = "10.18653/v1/2025.bsnlp-1.1",
+ pages = "1--8",
+ ISBN = "978-1-959429-57-9",
+ abstract = "Filled pauses are among the most common paralinguistic features of speech, yet they are mainly omitted from transcripts. We propose a transformer-based approach for detecting filled pauses directly from the speech signal, fine-tuned on Slovenian and evaluated across South and West Slavic languages. Our results show that speech transformers achieve excellent performance in detecting filled pauses when evaluated in the in-language scenario. We further evaluate cross-lingual capabilities of the model on two closely related South Slavic languages (Croatian and Serbian) and two less closely related West Slavic languages (Czech and Polish). Our results reveal strong cross-lingual generalization capabilities of the model, with only minor performance drops. Moreover, error analysis reveals that the model outperforms human annotators in recall and F1 score, while trailing slightly in precision. In addition to evaluating the capabilities of speech transformers for filled pause detection across Slavic languages, we release new multilingual test datasets and make our fine-tuned model publicly available to support further research and applications in spoken language processing."
+ }
+ ```
+ ## Model Details
+
  This model classifies individual 20 ms frames of audio based on the
  presence of filled pauses ("eee", "errm", ...).
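
To make the frame-level output concrete, here is a minimal sketch of how a sequence of per-frame labels could be turned into time intervals. It assumes only what is stated above (one 0/1 label per 20 ms frame); the function name `frames_to_intervals` is hypothetical and this is not necessarily the post-processing used by the authors.

```python
from itertools import groupby

FRAME_SECONDS = 0.02  # 20 ms per frame, as stated in the model description


def frames_to_intervals(frame_labels, frame_seconds=FRAME_SECONDS):
    """Merge runs of consecutive 1-frames into (start, end) times in seconds.

    Hypothetical helper for illustration; not part of the released model.
    """
    intervals = []
    position = 0  # index of the first frame of the current run
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        if label == 1:
            start = round(position * frame_seconds, 3)
            end = round((position + length) * frame_seconds, 3)
            intervals.append((start, end))
        position += length
    return intervals


if __name__ == "__main__":
    predictions = [0, 0, 1, 1, 1, 0, 1, 0]
    print(frames_to_intervals(predictions))  # [(0.04, 0.1), (0.12, 0.14)]
```

Each run of 1s becomes one candidate filled-pause interval, with the start and end derived from the frame index multiplied by the frame length.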

+ ### Model Description
+
+ - **Developed by:** Peter Rupnik, Nikola Ljubešić, Darinka Verdonik, Simona Majhenič
+ - **Funded by:** MEZZANINE project
+ - **Model type:** Wav2Vec2Bert for Audio Frame Classification
+ - **Language(s) (NLP):** Slovenian; trained and tested on [ROG-Artur](http://hdl.handle.net/11356/1992)
+ - **Finetuned from model:** facebook/w2v-bert-2.0
+

  # Training data

@@ -28,6 +66,16 @@ The model was trained on human-annotated Slovenian speech corpus
  [ROG-Artur](http://hdl.handle.net/11356/1992). Recordings from the train split were segmented into
  chunks of at most 30 s.
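
As an illustration of the 30 s limit, the sketch below computes sample offsets for consecutive chunks of a recording. The README does not say how chunk boundaries were chosen, so fixed-length windows are an assumption here, as are the 16 kHz sampling rate and the helper name `chunk_bounds`.

```python
SAMPLE_RATE = 16_000      # assumed sampling rate in Hz (not stated in the README)
MAX_CHUNK_SECONDS = 30.0  # upper bound on chunk length, as stated above


def chunk_bounds(num_samples, sample_rate=SAMPLE_RATE, max_seconds=MAX_CHUNK_SECONDS):
    """Return (start, end) sample offsets of consecutive chunks.

    Hypothetical helper: every chunk is at most `max_seconds` long and the
    last chunk holds whatever remains of the recording.
    """
    max_len = int(max_seconds * sample_rate)
    return [
        (start, min(start + max_len, num_samples))
        for start in range(0, num_samples, max_len)
    ]


if __name__ == "__main__":
    seventy_seconds = 70 * SAMPLE_RATE
    print(chunk_bounds(seventy_seconds))
    # [(0, 480000), (480000, 960000), (960000, 1120000)]
```

A real pipeline might prefer to cut at pauses or annotation boundaries so that no filled pause is split across two chunks; the fixed windows here only show the length constraint.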

+ ## Training Details
+
+ | hyperparameter       | value |
+ | -------------------- | ----- |
+ | learning rate        | 3e-5  |
+ | effective batch size | 16    |
+ | num train epochs     | 20    |
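
The hyperparameters above could be expressed as keyword arguments along the following lines. This is a sketch under assumptions: the README does not say which training loop was used or how the effective batch size of 16 was split, so the per-device batch size of 4 and the single device are hypothetical. The dictionary keys follow the parameter names of `transformers.TrainingArguments`.

```python
# Values taken from the hyperparameter table.
LEARNING_RATE = 3e-5
EFFECTIVE_BATCH_SIZE = 16
NUM_TRAIN_EPOCHS = 20

# Hypothetical split of the effective batch size; not stated in the README.
PER_DEVICE_BATCH_SIZE = 4
NUM_DEVICES = 1


def training_kwargs():
    """Build keyword arguments whose product of batch terms equals the effective batch size."""
    per_step = PER_DEVICE_BATCH_SIZE * NUM_DEVICES
    accumulation_steps, remainder = divmod(EFFECTIVE_BATCH_SIZE, per_step)
    if remainder:
        raise ValueError("effective batch size must be a multiple of per-step batch size")
    return {
        "learning_rate": LEARNING_RATE,
        "per_device_train_batch_size": PER_DEVICE_BATCH_SIZE,
        "gradient_accumulation_steps": accumulation_steps,
        "num_train_epochs": NUM_TRAIN_EPOCHS,
    }


if __name__ == "__main__":
    print(training_kwargs())
```

The effective batch size is the per-device batch size multiplied by the number of devices and the number of gradient accumulation steps, so other splits (for example 8 × 2) would satisfy the table equally well.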
+
+
+
  # Evaluation

  Although the output of the model is a series of 0s and 1s, one per 20 ms frame,
@@ -182,5 +230,4 @@ print(ds["intervals"][0])



+