README.md DELETED
@@ -1,192 +0,0 @@
- ---
- base_model:
- - facebook/wav2vec2-large
- - facebook/wav2vec2-large-960h
- - facebook/wav2vec2-large-lv60
- - facebook/wav2vec2-large-xlsr-53
- - facebook/wav2vec2-xls-r-300m
- - facebook/hubert-large-ll60k
- - facebook/hubert-base-ls960
- - facebook/hubert-xlarge-ll60k
- - facebook/hubert-xlarge-ls960-ft
- - microsoft/wavlm-large
- - microsoft/wavlm-base-plus
- - microsoft/wavlm-base-plus-sv
- tags:
- - self-supervised-learning
- - pronunciation-assessment
- - speech
- - wav2vec2
- - hubert
- - wavlm
- - ctc
- - regression
- - feature-extraction
- datasets:
- - openslr/speechocean762
- metrics:
- - pearsonr
- ---
-
- # SSL-FT-PRON: Fine-tuned SSL Models for Automatic Pronunciation Assessment (APA)
-
- A collection of fine-tuned **Self-Supervised Learning (SSL)** speech models (Wav2Vec2.0, HuBERT, WavLM) for **Automatic Pronunciation Assessment (APA)**.
- Three fine-tuning strategies are provided per backbone:
-
- - **CTC**: ASR-style head trained with the CTC loss
- - **Freeze**: CNN feature extractor frozen; the rest of the model is fine-tuned
- - **General**: no CTC head; the whole encoder, including the CNN feature extractor, is fine-tuned together with a regression head that predicts the APA scores
-
- > **Important:** This Hub repository is a *collection*. Each model lives in a **subdirectory**.
- > Load a model by passing the repo id `haeylee/ssl_ft_pron` and the model's sub-path as `subfolder`, e.g. `subfolder="wav2vec2/general/02_wav2vec2-large-960h"`. The full sub-path cannot be used as a repo id, because Hub repo ids have the form `namespace/name`.
-
- ---
-
- ## Model Details
-
- - **Developed by:** Haeyoung Lee (haeylee)
- - **Affiliation (paper):** Seoul National University, SNU Spoken Language Processing Lab
- - **Model type:** SSL speech encoders fine-tuned for APA (CTC / General / Freeze)
- - **Language(s):** English (evaluated on Speechocean762)
- - **Fine-tuned from:** see the `base_model` list in the front matter
-
- ### Model Sources
- - **Code:** https://github.com/hy310/ssl_finetuning
- - **Paper:** *Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment* (APSIPA ASC 2024)
-
- ---
-
- ## Uses
- - Research and prototyping for **pronunciation scoring** and **representation analysis** (e.g., PCA on hidden states).
- - Feature extraction for downstream APA tasks.
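The representation-analysis use above can be sketched with NumPy, assuming hidden states have already been extracted into an array of shape `(batch, frames, dim)` (for example from `last_hidden_state` of one of the encoders) together with a frame-level mask. Both functions are hypothetical helpers, not code from the repository. The mask has to be at the encoder's frame rate, not the sample-level `attention_mask` returned by the processor.

```python
import numpy as np


def masked_mean_pool(hidden: np.ndarray, frame_mask: np.ndarray) -> np.ndarray:
    """Average (batch, frames, dim) hidden states over frames where frame_mask == 1.

    Hypothetical helper; frame_mask is assumed to be frame-level, shape (batch, frames).
    """
    weights = frame_mask[..., None].astype(hidden.dtype)  # (batch, frames, 1)
    counts = np.clip(weights.sum(axis=1), 1e-9, None)      # (batch, 1), avoids division by zero
    return (hidden * weights).sum(axis=1) / counts         # (batch, dim)


def pca_project(features: np.ndarray, k: int):
    """Project (n, dim) features onto their top-k principal components via SVD.

    Returns the projected features (n, k) and the explained-variance ratio of those k components.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    explained = variance / variance.sum()
    return centered @ vt[:k].T, explained[:k]


# Random stand-ins for encoder outputs; replace with real hidden states.
rng = np.random.default_rng(0)
pooled = masked_mean_pool(rng.normal(size=(8, 50, 16)), np.ones((8, 50)))
projected, ratio = pca_project(pooled, k=2)
print(projected.shape)  # (8, 2)
```

Pooling first gives one vector per utterance, which is convenient for utterance-level scores; frame-level PCA would skip the pooling step.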
- ---
-
- ## Bias, Risks, and Limitations
- - Trained and evaluated on **Speechocean762** (English read aloud by L2 speakers). Generalization to other languages and speaking styles is not guaranteed.
- - APA targets are subjective human scores; calibrate on your own domain and monitor performance across speaker subgroups.
-
- **Recommendation:** Validate on in-domain data; report uncertainty and subgroup metrics.
-
- ---
-
- ## How to Get Started
-
- ### Load a CTC model (with CTC head)
- ~~~python
- from transformers import AutoModelForCTC, AutoProcessor
-
- # Hub repo ids have the form "namespace/name"; the model directory is passed as `subfolder`
- repo_id = "haeylee/ssl_ft_pron"
- subfolder = "wav2vec2/ctc/01_wav2vec2-large"
- model = AutoModelForCTC.from_pretrained(repo_id, subfolder=subfolder)
- processor = AutoProcessor.from_pretrained(repo_id, subfolder=subfolder)
- ~~~
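For the CTC checkpoints, greedy decoding takes the highest-scoring token per frame, merges consecutive repeats, and then drops the blank token. The pure-Python sketch below only illustrates that rule and is not code from the repository. In transformers, the `Wav2Vec2ForCTC` family uses the tokenizer's pad token as the blank and `processor.batch_decode` applies the same collapse, provided the checkpoint ships a tokenizer (this card does not say whether it does). The logits and `blank_id=0` below are made-up assumptions.

```python
from itertools import groupby
from typing import List, Sequence


def argmax_per_frame(logits: Sequence[Sequence[float]]) -> List[int]:
    """Index of the highest-scoring token for each frame (first index wins on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def ctc_greedy_collapse(frame_ids: Sequence[int], blank_id: int) -> List[int]:
    """Merge consecutive duplicate IDs, then remove the CTC blank."""
    return [token for token, _ in groupby(frame_ids) if token != blank_id]


# Hypothetical per-frame logits over a 4-token vocabulary where id 0 is the blank.
logits = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.8, 0.05, 0.05],
    [0.9, 0.05, 0.03, 0.02],
    [0.1, 0.1, 0.1, 0.7],
]
ids = argmax_per_frame(logits)               # [1, 1, 0, 3]
print(ctc_greedy_collapse(ids, blank_id=0))  # [1, 3]
```

Merging happens before blank removal, so a token repeated across a blank (e.g. `[3, 0, 3]`) survives as two tokens.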
-
- ### Load a General / Freeze model (no CTC head)
- ~~~python
- from transformers import AutoProcessor, Wav2Vec2Model, HubertModel, WavLMModel
-
- repo_id = "haeylee/ssl_ft_pron"  # each model is a subfolder of this repo
-
- # Wav2Vec2 (General)
- subfolder = "wav2vec2/general/01_wav2vec2-large"
- model = Wav2Vec2Model.from_pretrained(repo_id, subfolder=subfolder)
- processor = AutoProcessor.from_pretrained(repo_id, subfolder=subfolder)
-
- # HuBERT (Freeze)
- # subfolder = "hubert/freeze/06_hubert-large-ll60k"
- # model = HubertModel.from_pretrained(repo_id, subfolder=subfolder)
- # processor = AutoProcessor.from_pretrained(repo_id, subfolder=subfolder)
-
- # WavLM (General)
- # subfolder = "wavlm/general/10_wavlm-large"
- # model = WavLMModel.from_pretrained(repo_id, subfolder=subfolder)
- # processor = AutoProcessor.from_pretrained(repo_id, subfolder=subfolder)
- ~~~
-
- **Summary:**
- - **CTC:** `AutoModelForCTC.from_pretrained(...)`
- - **General/Freeze:** `Wav2Vec2Model` / `HubertModel` / `WavLMModel` `.from_pretrained(...)`
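The collection paths in this card (e.g. `haeylee/ssl_ft_pron/wav2vec2/general/01_wav2vec2-large`) combine a Hub repo id with a subdirectory, whereas `from_pretrained` expects a repo id of the form `namespace/name` plus a separate `subfolder` argument. A small hypothetical helper, not part of the repository, can split such a path:

```python
from typing import Tuple


def split_collection_path(path: str) -> Tuple[str, str]:
    """Split 'namespace/repo/sub/dir' into ('namespace/repo', 'sub/dir').

    Hypothetical helper: assumes the first two components are the Hub repo id.
    """
    parts = [p for p in path.strip("/").split("/") if p]
    if len(parts) < 3:
        raise ValueError(f"expected 'namespace/repo/subfolder', got {path!r}")
    return "/".join(parts[:2]), "/".join(parts[2:])


repo_id, subfolder = split_collection_path(
    "haeylee/ssl_ft_pron/wav2vec2/general/01_wav2vec2-large"
)
print(repo_id)    # haeylee/ssl_ft_pron
print(subfolder)  # wav2vec2/general/01_wav2vec2-large
```

The two values can then be passed as `Wav2Vec2Model.from_pretrained(repo_id, subfolder=subfolder)`.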
-
- ---
-
- ## Training Details
-
- ### Training Data
- - **Dataset:** [Speechocean762](https://openslr.org/101/)
- - **Preprocessing:** We used `preprocess_dataset.py` (see the GitHub repo) to convert raw audio/labels into Hugging Face `datasets` format.
-
- **Expected processed layout:**
- ~~~text
- /your/data/path/speechocean762/
- └── preprocess/
-     ├── speechocean_train_ds/
-     └── speechocean_test_ds/
- ~~~
-
- ### Training Procedure
-
- #### Preprocessing
- ~~~bash
- # Adjust paths inside the script or via CLI args
- python preprocess_dataset.py \
- --data_root /your/data/path/speechocean762 \
- --out_dir /your/data/path/speechocean762/preprocess
- ~~~
-
- #### General (no CTC head)
- Loads the encoder with `Wav2Vec2Model`, `HubertModel`, or `WavLMModel` `.from_pretrained(...)` and trains a regression head to predict the four APA scores (Accuracy, Fluency, Prosody, Total).
- ~~~bash
- python train/baseline.py \
- --model_name facebook/hubert-xlarge-ls960-ft \
- --batch_size 4 \
- --learning_rate 1e-5 \
- --num_train_epochs 30
- ~~~
-
- #### Freeze (feature extractor frozen)
- Same as **General**, but the CNN feature extractor is frozen.
- ~~~bash
- python train/freeze.py \
- --model_name facebook/hubert-xlarge-ls960-ft \
- --freeze_feature_extractor \
- --batch_size 4 \
- --learning_rate 1e-5 \
- --num_train_epochs 30
- ~~~
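The CNN feature extractor that the Freeze strategy keeps fixed is a stack of unpadded 1-D convolutions that downsamples the 16 kHz waveform into encoder frames. A minimal sketch of the resulting frame count, assuming the default `conv_kernel` and `conv_stride` values of transformers' `Wav2Vec2Config` (the HuBERT and WavLM configs use the same defaults); check a checkpoint's `config.json` if it overrides them. With these defaults the overall stride is 320 samples, i.e. one frame roughly every 20 ms.

```python
from typing import Sequence

# Assumed defaults of transformers' Wav2Vec2Config (shared by HubertConfig and WavLMConfig);
# override these if a checkpoint's config.json lists different conv_kernel / conv_stride.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(
    num_samples: int,
    kernels: Sequence[int] = CONV_KERNEL,
    strides: Sequence[int] = CONV_STRIDE,
) -> int:
    """Frames produced by a stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


print(num_output_frames(16_000))  # 1 s at 16 kHz -> 49 frames
```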
-
- #### CTC (ASR-style head)
- Uses `AutoModelForCTC.from_pretrained(...)` for CTC training.
- ~~~bash
- python train/ctc.py \
- --model_name facebook/wav2vec2-large \
- --batch_size 4 \
- --learning_rate 1e-5 \
- --num_train_epochs 30
- ~~~
-
- **Artifacts saved:** `model.safetensors`, `trainer_state.json`, `training_args.bin`, logs, and checkpoints (per run: `args.json`, `trainer_args.json`).
-
- ---
-
- ## Evaluation
-
- ### Testing Data, Factors & Metrics
- - **Test set:** Speechocean762 (held-out split prepared by `preprocess_dataset.py`)
- - **Factors:** Backbone (Wav2Vec2 / HuBERT / WavLM) × strategy (CTC / General / Freeze)
- - **Metric:** `pearsonr` (Pearson correlation coefficient, PCC) for Accuracy, Fluency, Prosody, and Total.
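The metric can be sketched in plain Python: the Pearson correlation between predicted and human scores, computed separately for each of the four aspects. The aspect key names and the dictionary layout are assumptions for illustration, not the repository's evaluation code; `scipy.stats.pearsonr` returns the same coefficient (together with a p-value) if SciPy is available.

```python
import math
from typing import Dict, Sequence

# The four aspects this card reports PCC for (key names are illustrative assumptions).
ASPECTS = ("accuracy", "fluency", "prosody", "total")


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("PCC is undefined when one sequence is constant")
    return cov / (norm_x * norm_y)


def pcc_per_aspect(
    preds: Dict[str, Sequence[float]], labels: Dict[str, Sequence[float]]
) -> Dict[str, float]:
    """PCC between predicted and human scores for each aspect."""
    return {name: pearson(preds[name], labels[name]) for name in ASPECTS}


# Toy values, not results from the paper.
print(round(pearson([1, 2, 3, 4], [1, 3, 2, 4]), 3))  # 0.8
```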
- ---
-
- ## Citation
- ~~~bibtex
- @inproceedings{lee2024analysis,
- title={Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment},
- author={Lee, Haeyoung and Kim, Sunhee and Chung, Minhwa},
- booktitle={2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
- pages={1--6},
- year={2024},
- organization={IEEE}
- }
- ~~~
-
- ---
-
- ## Authors & Contact
- - **Author:** Haeyoung Lee (haeylee)
- - **Email:** haeylee@snu.ac.kr
- - **Issues/Requests:** https://github.com/hy310/ssl_finetuning
wav2vec2/freeze/{02_wav2vec2-large-960h → 02_wav2vec2-large-960-h}/all_results.json RENAMED
File without changes
wav2vec2/freeze/{02_wav2vec2-large-960h → 02_wav2vec2-large-960-h}/args.json RENAMED
File without changes
wav2vec2/freeze/{02_wav2vec2-large-960h → 02_wav2vec2-large-960-h}/eval_results.json RENAMED
File without changes
wav2vec2/freeze/{02_wav2vec2-large-960h → 02_wav2vec2-large-960-h}/finetuned_pytorch_model.bin RENAMED
File without changes
wav2vec2/freeze/{02_wav2vec2-large-960h → 02_wav2vec2-large-960-h}/model.safetensors RENAMED
File without changes
wav2vec2/freeze/{02_wav2vec2-large-960h → 02_wav2vec2-large-960-h}/model_weights.pt RENAMED
File without changes
wav2vec2/freeze/{02_wav2vec2-large-960h → 02_wav2vec2-large-960-h}/preprocessor_config.json RENAMED
File without changes
wav2vec2/freeze/{02_wav2vec2-large-960h → 02_wav2vec2-large-960-h}/train_results.json RENAMED
File without changes
wav2vec2/freeze/{02_wav2vec2-large-960h → 02_wav2vec2-large-960-h}/trainer_args.json RENAMED
File without changes
wav2vec2/freeze/{02_wav2vec2-large-960h → 02_wav2vec2-large-960-h}/trainer_state.json RENAMED
File without changes
wav2vec2/freeze/{02_wav2vec2-large-960h → 02_wav2vec2-large-960-h}/training_args.bin RENAMED
File without changes