---
base_model:
- facebook/w2v-bert-2.0
datasets:
- classla/ParlaSpeech-RS
- classla/ParlaSpeech-HR
- classla/Mici_Princ
language:
- sl
- hr
- sr
library_name: transformers
license: cc-by-sa-4.0
metrics:
- accuracy
pipeline_tag: audio-classification
---

# Model Card for Wav2Vec2BertPrimaryStressAudioFrameClassifier

This model annotates primary stress in spoken words, classifying each 20 ms audio frame as stressed or unstressed.
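
Because predictions are made per frame, a frame index maps directly to a time offset. A minimal sketch of this correspondence (the 0.02 s frame step matches the example code further below; `frame_to_time` is an illustrative helper, not part of the model API):

```python
FRAME_S = 0.02  # each prediction covers one 20 ms frame


def frame_to_time(frame_index: int) -> float:
    """Start time (in seconds) of the given frame's 20 ms window."""
    return frame_index * FRAME_S


print(frame_to_time(17))  # frame 17 starts at 0.34 s
```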

## Model Details

### Model Description

- **Developed by:** [Peter Rupnik](https://huggingface.co/5roop), [Nikola Ljubešić](https://huggingface.co/nljubesi), [Ivan Porupski](https://huggingface.co/porupski)
- **Model type:** Audio frame classifier
- **Language(s):** Croatian, Slovenian, Serbian, and the Chakavian variant of Croatian
- **License:** Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
- **Paper:** If you use this model, please cite the following paper:

```bibtex
@inproceedings{ljubesic2025identifying,
  title     = {Identifying Primary Stress Across Related Languages and Dialects with Transformer-based Speech Encoder Models},
  author    = {Ljubešić, Nikola and Porupski, Ivan and Rupnik, Peter},
  booktitle = {Proceedings of Interspeech 2025},
  year      = {2025},
  note      = {Accepted at Interspeech 2025}
}
```
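
As a quick sanity check, the checkpoint can be loaded and its classification head inspected. This is a minimal sketch assuming the standard `transformers` API; the two frame classes (unstressed vs. stressed) match the two logits per frame shown in the example further below, though the label names stored in the config may be generic:

```python
from transformers import AutoConfig

# Inspect the frame classification head of the released checkpoint.
config = AutoConfig.from_pretrained("classla/Wav2Vec2BertPrimaryStressAudioFrameClassifier")
print(config.num_labels)  # expected: 2 (unstressed vs. stressed frames)
print(config.id2label)    # names may be generic, e.g. LABEL_0 / LABEL_1
```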

### Training data

The model was trained on the training split of the [ParlaStress-HR dataset](http://hdl.handle.net/11356/2038).

### Evaluation results

The test splits of the [ParlaStress-HR dataset](http://hdl.handle.net/11356/2038) were used for evaluation.

| test language                   | accuracy (%) |
| ------------------------------- | ------------ |
| Croatian                        | 99.1         |
| Serbian                         | 99.3         |
| Chakavian (variant of Croatian) | 88.9         |
| Slovenian                       | 89.0         |

### Direct Use

The model is intended for data-driven analyses of primary stress position. So far it has been shown to work well on four datasets in three languages.
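
For such analyses, the predicted stress interval usually has to be related to the word it occurs in. The helper below is a hypothetical post-processing step (not part of this repository) that expresses an interval, as produced by `frames_to_intervals` in the example below, as a relative position within the word:

```python
def relative_stress_position(interval: tuple[float, float], word_duration_s: float) -> float:
    """Midpoint of the predicted stress interval as a fraction of word duration.

    Values near 0 indicate word-initial stress, values near 1 word-final stress.
    """
    start_s, end_s = interval
    return ((start_s + end_s) / 2) / word_duration_s


print(relative_stress_position((0.34, 0.40), 0.8))  # 0.4625, i.e. near the middle
```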

## Example use

```python
from itertools import pairwise

import pandas as pd
import torch
from datasets import Audio, Dataset
from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "classla/Wav2Vec2BertPrimaryStressAudioFrameClassifier"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device)

# Path to the audio file containing the word to be annotated:
f = "wavs/word.wav"


def frames_to_intervals(frames: list[int]) -> list[tuple[float, float]] | None:
    """Convert a sequence of per-frame labels into (start_s, end_s) intervals."""
    results = []
    ndf = pd.DataFrame(
        data={
            "time_s": [0.020 * i for i in range(len(frames))],
            "frames": frames,
        }
    )
    # Indices where the predicted label changes (the first index is always included):
    indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values.tolist()
    # Append a sentinel index so that a stressed run reaching the end of the
    # word is closed as well:
    indices_of_change.append(len(frames))
    for si, ei in pairwise(indices_of_change):
        # Skip unstressed runs, keep stressed ones:
        if ndf.loc[si : ei - 1, "frames"].mode()[0] != 0:
            results.append(
                (round(ndf.loc[si, "time_s"], 3), round(ndf.loc[ei - 1, "time_s"], 3))
            )
    if not results:
        return None
    # Post-processing: if multiple stressed regions were found, keep only the longest:
    results = sorted(results, key=lambda t: t[1] - t[0], reverse=True)
    return results[0:1]


def evaluator(chunks):
    sampling_rate = chunks["audio"][0]["sampling_rate"]
    with torch.no_grad():
        inputs = feature_extractor(
            [i["array"] for i in chunks["audio"]],
            return_tensors="pt",
            sampling_rate=sampling_rate,
        ).to(device)
        logits = model(**inputs).logits
    y_pred_raw = logits.cpu().numpy()
    y_pred = y_pred_raw.argmax(axis=-1)
    primary_stress = [frames_to_intervals(i) for i in y_pred]
    return {
        "y_pred": y_pred,
        "y_pred_logits": y_pred_raw,
        "primary_stress": primary_stress,
    }


# Create a dataset with a single instance and map the evaluator function over it:
ds = Dataset.from_dict({"audio": [f]}).cast_column("audio", Audio(16000, mono=True))
ds = ds.map(evaluator, batched=True, batch_size=1)  # Adjust batch size to your hardware

print(ds["y_pred"][0])
# Outputs: [0, 0, 1, 1, 1, 1, 1, ...]
print(ds["y_pred_logits"][0])
# Outputs:
# [[ 0.89419061, -0.77746612],
#  [ 0.44213724, -0.34862748],
#  [-0.08605709,  0.13012762],
#  ...
print(ds["primary_stress"][0])
# Outputs: [[0.34, 0.4]]
```
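
The same pipeline scales to several recordings at once; pass multiple paths and raise the batch size. The file names below are placeholders:

```python
# Batch annotation of several words; the paths are placeholders.
files = ["wavs/word1.wav", "wavs/word2.wav", "wavs/word3.wav"]
ds_batch = Dataset.from_dict({"audio": files}).cast_column("audio", Audio(16000, mono=True))
ds_batch = ds_batch.map(evaluator, batched=True, batch_size=2)
for path, interval in zip(files, ds_batch["primary_stress"]):
    print(path, interval)
```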

## Training Details

### Training Data

The model was trained on 10,443 manually annotated multisyllabic words from [ParlaSpeech-HR](https://huggingface.co/datasets/classla/ParlaSpeech-HR).

### Training Procedure

#### Training Hyperparameters

- Learning rate: 1e-5
- Batch size: 32
- Number of epochs: 20
- Weight decay: 0.01
- Gradient accumulation steps: 1
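
Expressed with the Hugging Face `Trainer` API, these hyperparameters would look roughly as follows. This is a sketch only: `output_dir` and every setting not listed above are assumptions, not the exact training configuration:

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters above in TrainingArguments form;
# output_dir and all unlisted settings are assumptions.
training_args = TrainingArguments(
    output_dir="w2v-bert-primary-stress",  # hypothetical
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    num_train_epochs=20,
    weight_decay=0.01,
    gradient_accumulation_steps=1,
)
```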