classla/wav2vecbert2-filledPause

5roop committed on Apr 10, 2025

Commit 327b583 (verified) · 1 parent: 8941911

Update metrics

README.md +55 -24


@@ -24,32 +24,55 @@ te test split of the same dataset.
 Although the model outputs a series of 0s and 1s, each describing a 20 ms frame, the evaluation was done on
 event level; spans of consecutive outputs 1 were bundled together into one event. When the true and predicted
-events partially overlap, this is counted as a true positive.
 ## Evaluation on ROG corpus
-In evaluation, we only evaluate positive events, i.e.
 ```
-              precision    recall  f1-score   support
-           1      0.907     0.987     0.946      1834
 ```
-## Evaluation on ParlaSpeech [HR](https://huggingface.co/datasets/classla/ParlaSpeech-HR) and [RS](https://huggingface.co/datasets/classla/ParlaSpeech-RS) corpora
-Evaluation on 800 human-annotated instances  ParlaSpeech-HR and ParlaSpeech-RS produced the following metrics:
-```
-Performance on RS:
-Classification report for human vs model on event level:
-              precision    recall  f1-score   support
-           1       0.95      0.99      0.97       542
-Performance on HR:
-Classification report for human vs model on event level:
-              precision    recall  f1-score   support
-           1       0.93      0.98      0.95       531
 ```
 The metrics reported are on event level, which means that if true and
 predicted filled pauses at least partially overlap, we count them as a
@@ -81,19 +104,25 @@ ds = Dataset.from_dict(
 def frames_to_intervals(
-    frames: list[int], drop_short=True, drop_initial=True, short_cutoff_s=0.08
 ) -> list[tuple[float]]:
     """Transforms a list of ones or zeros, corresponding to annotations on frame
     levels, to a list of intervals ([start second, end second]).
-    Allows for additional filtering on duration (false positives are often short)
-    and start times (false positives starting at 0.0 are often an artifact of
-    poor segmentation).
     :param list[int] frames: Input frame labels
-    :param bool drop_short: Drop everything shorter than short_cutoff_s, defaults to True
     :param bool drop_initial: Drop predictions starting at 0.0, defaults to True
-    :param float short_cutoff_s: Duration in seconds of shortest allowable prediction, defaults to 0.08
     :return list[tuple[float]]: List of intervals [start_s, end_s]
     """
     from itertools import pairwise
@@ -115,13 +144,15 @@ def frames_to_intervals(
             results.append(
                 (
                     round(ndf.loc[si, "time_s"], 3),
-                    round(ndf.loc[ei - 1, "time_s"], 3),
                 )
             )
     if drop_short and (len(results) > 0):
         results = [i for i in results if (i[1] - i[0] >= short_cutoff_s)]
     if drop_initial and (len(results) > 0):
         results = [i for i in results if i[0] != 0.0]
     return results

 Although the model outputs a series of 0s and 1s, each describing a 20 ms frame, the evaluation was done on
 event level; spans of consecutive outputs 1 were bundled together into one event. When the true and predicted
+events partially overlap, this is counted as a true positive. We report precision, recall, and F1 score for the positive class.
+We observed several failure modes of the automatic inference process and designed post-processing steps to mitigate them.
+Some false positives were caused by improper audio segmentation, which is why dropping predictions that start at the start of the audio or
+end at the end of the audio can be beneficial. Another failure mode is predicting very short events, which is why
+very short predictions can be safely discarded.
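
The event-level scoring described above can be sketched as follows. This is a minimal illustration and not the evaluation script used for the reported numbers: the function names `overlaps` and `event_level_scores` are hypothetical, and the handling of one event overlapping several others (each event is simply checked for any overlap) is an assumption, since the README does not specify it.

```python
def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """Return True if intervals a and b (start_s, end_s) share any time span."""
    return a[0] < b[1] and b[0] < a[1]


def event_level_scores(
    true_events: list[tuple[float, float]],
    pred_events: list[tuple[float, float]],
) -> tuple[float, float, float]:
    """Precision, recall and F1 of the positive class on event level."""
    # A predicted event counts as correct if it overlaps any true event.
    correct_pred = sum(
        any(overlaps(p, t) for t in true_events) for p in pred_events
    )
    # A true event counts as found if any predicted event overlaps it.
    found_true = sum(
        any(overlaps(t, p) for p in pred_events) for t in true_events
    )
    precision = correct_pred / len(pred_events) if pred_events else 0.0
    recall = found_true / len(true_events) if true_events else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Two annotated filled pauses; one is detected, one prediction is spurious.
true_events = [(1.0, 1.5), (3.0, 3.4)]
pred_events = [(1.2, 1.6), (5.0, 5.2)]
print(event_level_scores(true_events, pred_events))
```

Under this reading, a prediction that only grazes an annotated filled pause still counts as a hit, which is why the post-processing filters mostly trade a little recall for precision.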
 ## Evaluation on ROG corpus
 ```
+| postprocessing               |   recall |   precision |    F1 |
+|:-----------------------------|---------:|------------:|------:|
+| none                         |    0.981 |       0.955 | 0.968 |
+| drop_short                   |    0.981 |       0.957 | 0.969 |
+| drop_short_initial_and_final |    0.964 |       0.966 | 0.965 |
+| drop_short_and_initial       |    0.964 |       0.966 | 0.965 |
+| drop_initial                 |    0.964 |       0.963 | 0.963 |
 ```
+## Evaluation on ParlaSpeech corpora
+For every language in the [ParlaSpeech collection](https://huggingface.co/collections/classla/parlaspeech-670923f23ab185f413d40795),
+400 instances were sampled and annotated by human annotators.
+Evaluation on the human-annotated instances produced the following metrics:
+```
+| lang   | postprocessing               |   recall |   precision |    F1 |
+|:-------|:-----------------------------|---------:|------------:|------:|
+| CZ     | drop_short_initial_and_final |    0.889 |       0.859 | 0.874 |
+| CZ     | drop_short_and_initial       |    0.889 |       0.859 | 0.874 |
+| CZ     | drop_short                   |    0.905 |       0.833 | 0.868 |
+| CZ     | drop_initial                 |    0.889 |       0.846 | 0.867 |
+| CZ     | raw                          |    0.905 |       0.814 | 0.857 |
+| HR     | drop_short_initial_and_final |    0.94  |       0.887 | 0.913 |
+| HR     | drop_short_and_initial       |    0.94  |       0.887 | 0.913 |
+| HR     | drop_short                   |    0.94  |       0.884 | 0.911 |
+| HR     | drop_initial                 |    0.94  |       0.875 | 0.906 |
+| HR     | raw                          |    0.94  |       0.872 | 0.905 |
+| PL     | drop_short                   |    0.906 |       0.947 | 0.926 |
+| PL     | drop_short_initial_and_final |    0.903 |       0.947 | 0.924 |
+| PL     | drop_short_and_initial       |    0.903 |       0.947 | 0.924 |
+| PL     | raw                          |    0.91  |       0.924 | 0.917 |
+| PL     | drop_initial                 |    0.908 |       0.924 | 0.916 |
+| RS     | drop_short                   |    0.966 |       0.915 | 0.94  |
+| RS     | drop_short_initial_and_final |    0.966 |       0.915 | 0.94  |
+| RS     | drop_short_and_initial       |    0.966 |       0.915 | 0.94  |
+| RS     | drop_initial                 |    0.974 |       0.9   | 0.936 |
+| RS     | raw                          |    0.974 |       0.9   | 0.936 |
 ```
 The metrics reported are on event level, which means that if true and
 predicted filled pauses at least partially overlap, we count them as a
 def frames_to_intervals(
+    frames: list[int],
+    drop_short=True,
+    drop_initial=True,
+    drop_final=False,
+    short_cutoff_s=0.08,
 ) -> list[tuple[float]]:
     """Transforms a list of ones or zeros, corresponding to annotations on frame
     levels, to a list of intervals ([start second, end second]).
+    Allows for additional filtering on duration (false positives are often
+    short) and start times (false positives starting at 0.0 are often an
+    artifact of poor segmentation).
     :param list[int] frames: Input frame labels
+    :param bool drop_short: Drop everything shorter than short_cutoff_s,
+        defaults to True
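     :param bool drop_initial: Drop predictions starting at 0.0, defaults to True
+    :param bool drop_final: Drop predictions ending at the end of the audio
+        (0.02 s * len(frames)), defaults to False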
+    :param float short_cutoff_s: Duration in seconds of shortest allowable
+        prediction, defaults to 0.08
     :return list[tuple[float]]: List of intervals [start_s, end_s]
     """
     from itertools import pairwise
             results.append(
                 (
                     round(ndf.loc[si, "time_s"], 3),
+                    round(ndf.loc[ei, "time_s"], 3),
                 )
             )
     if drop_short and (len(results) > 0):
         results = [i for i in results if (i[1] - i[0] >= short_cutoff_s)]
     if drop_initial and (len(results) > 0):
         results = [i for i in results if i[0] != 0.0]
+    if drop_final and (len(results) > 0):
+        # Round the audio end time like the interval bounds to avoid float mismatches
+        results = [i for i in results if i[1] != round(0.02 * len(frames), 3)]
     return results
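
Because the diff only shows fragments of `frames_to_intervals`, the sketch below illustrates the same idea in a self-contained form. It is not the README's implementation: the name `frames_to_intervals_sketch` is hypothetical, it uses `itertools.groupby` instead of the pandas-based logic, and it assumes that frame `k` covers the time span from `k * 0.02` s to `(k + 1) * 0.02` s.

```python
from itertools import groupby

FRAME_S = 0.02  # frame duration of 20 ms, as stated in the model card


def frames_to_intervals_sketch(
    frames: list[int],
    drop_short: bool = True,
    drop_initial: bool = True,
    drop_final: bool = False,
    short_cutoff_s: float = 0.08,
) -> list[tuple[float, float]]:
    """Turn frame-level 0/1 labels into (start_s, end_s) intervals."""
    intervals = []
    position = 0  # index of the first frame of the current run
    for label, run in groupby(frames):
        length = len(list(run))
        if label == 1:
            start_s = round(position * FRAME_S, 3)
            end_s = round((position + length) * FRAME_S, 3)
            intervals.append((start_s, end_s))
        position += length
    # Round the audio end the same way as the interval bounds so that the
    # equality check below is not defeated by floating-point error.
    audio_end_s = round(len(frames) * FRAME_S, 3)
    if drop_short:
        intervals = [
            i for i in intervals if round(i[1] - i[0], 3) >= short_cutoff_s
        ]
    if drop_initial:
        intervals = [i for i in intervals if i[0] != 0.0]
    if drop_final:
        intervals = [i for i in intervals if i[1] != audio_end_s]
    return intervals


frames = [1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1]
print(frames_to_intervals_sketch(frames))
print(frames_to_intervals_sketch(frames, drop_short=False, drop_initial=False, drop_final=True))
```

Rounding both the interval bounds and the audio end to three decimals keeps the `drop_final` comparison stable; a bare `0.02 * len(frames)` can differ from the rounded end time in the last bits.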