haeylee committed on
Commit b63eb5b · verified · 1 Parent(s): d225510

Update README.md

Files changed (1)
  1. README.md +36 -17
README.md CHANGED
@@ -30,12 +30,12 @@ metrics:
30
 
31
  # SSL-FT-PRON: Fine-tuned SSL Models for Automatic Pronunciation Assessment (APA)
32
 
33
- A collection of fine-tuned **Self-Supervised Learning (SSL)** speech checkpoints (Wav2Vec2.0, HuBERT, WavLM) for **Automatic Pronunciation Assessment (APA)**.
34
  Three strategies are provided per backbone:
35
 
36
  - **CTC**: ASR-style head trained with CTC
37
  - **Freeze**: CNN feature extractor frozen; rest is fine-tuned
38
- - **General**: no CTC head; a lightweight regression head predicts four APA scores (Accuracy, Fluency, Prosody, Total)
39
 
40
  > **Important:** This Hub repository is a *collection*. Each model lives in a **subdirectory**.
41
  > Load with the full sub-path, e.g. `haeylee/ssl_ft_pron/wav2vec2/general/02_wav2vec2-large-960h`.
@@ -46,9 +46,8 @@ Three strategies are provided per backbone:
46
 
47
  - **Developed by:** Haeyoung Lee (haeylee)
48
  - **Affiliation (paper):** Seoul National University, SNU Spoken Language Processing Lab
49
- - **Model type:** SSL speech encoders fine-tuned for APA (CTC / General / Freeze variants)
50
  - **Language(s):** English (evaluated on Speechocean762)
51
- - **License:** *TBD by author*
52
  - **Finetuned from:** See `base_model` list above
53
 
54
  ### Model Sources
@@ -57,27 +56,15 @@ Three strategies are provided per backbone:
57
 
58
  ---
59
 
60
- ## Uses
61
-
62
- ### Direct Use
63
  - Research and prototyping for **pronunciation scoring** and **feature analysis** on read English speech.
64
  - As encoders for downstream APA tasks, analytics, or visualization (e.g., PCA of hidden states).
65
-
66
- ### Downstream Use
67
- - Integrate APA scores into CALL (Computer-Assisted Language Learning) or assessment tools.
68
- - Use CTC variants for ASR-aligned pipelines; use General/Freeze variants for score regression.
69
-
70
- ### Out-of-Scope Use
71
- - Non-English targets without adaptation.
72
- - High-stakes assessment without proper validation, calibration, and fairness checks.
73
-
74
  ---
75
 
76
  ## Bias, Risks, and Limitations
77
 
78
  - Trained/evaluated on **Speechocean762** (read English speech by L2 speakers). May not generalize to spontaneous speech, other accents/languages, or noisy conditions.
79
  - APA involves subjective human judgments; ensure careful calibration and validation on your domain.
80
- - Consider privacy/consent when handling speech data.
81
 
82
  **Recommendation:** Validate on in-domain data and monitor subgroup performance.
83
 
@@ -92,4 +79,36 @@ from transformers import AutoModelForCTC, AutoProcessor
92
  ckpt = "haeylee/ssl_ft_pron/wav2vec2/ctc/01_wav2vec2-large" # pick your subdir
93
  model = AutoModelForCTC.from_pretrained(ckpt)
94
  processor = AutoProcessor.from_pretrained(ckpt)
95
 
 
30
 
31
  # SSL-FT-PRON: Fine-tuned SSL Models for Automatic Pronunciation Assessment (APA)
32
 
33
+ A collection of fine-tuned **Self-Supervised Learning (SSL)** speech models (Wav2Vec2.0, HuBERT, WavLM) for **Automatic Pronunciation Assessment (APA)**.
34
  Three strategies are provided per backbone:
35
 
36
  - **CTC**: ASR-style head trained with CTC
37
  - **Freeze**: CNN feature extractor frozen; rest is fine-tuned
38
+ - **General**: no CTC head
39
 
40
  > **Important:** This Hub repository is a *collection*. Each model lives in a **subdirectory**.
41
  > Load with the full sub-path, e.g. `haeylee/ssl_ft_pron/wav2vec2/general/02_wav2vec2-large-960h`.
 
46
 
47
  - **Developed by:** Haeyoung Lee (haeylee)
48
  - **Affiliation (paper):** Seoul National University, SNU Spoken Language Processing Lab
49
+ - **Model type:** SSL speech encoders fine-tuned for APA (CTC / General / Freeze)
50
  - **Language(s):** English (evaluated on Speechocean762)
 
51
  - **Finetuned from:** See `base_model` list above
52
 
53
  ### Model Sources
 
56
 
57
  ---
58
 
59
+ ### Use
 
 
60
  - Research and prototyping for **pronunciation scoring** and **feature analysis** on read English speech.
61
  - As encoders for downstream APA tasks, analytics, or visualization (e.g., PCA of hidden states).
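As one illustration of the visualization use case above, the following is a minimal sketch of PCA over mean-pooled hidden states. It uses random arrays in place of real encoder outputs; the helper names `mean_pool` and `pca_project`, the hidden size of 1024, and the choice of mean pooling are assumptions for illustration, not something this repository prescribes.

```python
import numpy as np


def mean_pool(hidden_states: np.ndarray) -> np.ndarray:
    """Average a (frames, dim) matrix over time to get one (dim,) vector."""
    return hidden_states.mean(axis=0)


def pca_project(features: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project (n_samples, dim) features onto their top principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Stand-in data: 8 utterances of varying length, hidden size 1024 (assumed).
rng = np.random.default_rng(0)
utterances = [rng.normal(size=(int(rng.integers(50, 120)), 1024)) for _ in range(8)]

pooled = np.stack([mean_pool(u) for u in utterances])  # shape (8, 1024)
projected = pca_project(pooled)                        # shape (8, 2)
```

With a real model, each array would presumably come from the encoder output (for example `last_hidden_state` for one utterance) converted to NumPy before pooling.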
62
  ---
63
 
64
  ## Bias, Risks, and Limitations
65
 
66
  - Trained/evaluated on **Speechocean762** (read English speech by L2 speakers). May not generalize to spontaneous speech, other accents/languages, or noisy conditions.
67
  - APA involves subjective human judgments; ensure careful calibration and validation on your domain.
 
68
 
69
  **Recommendation:** Validate on in-domain data and monitor subgroup performance.
70
 
 
79
  ckpt = "haeylee/ssl_ft_pron/wav2vec2/ctc/01_wav2vec2-large" # pick your subdir
80
  model = AutoModelForCTC.from_pretrained(ckpt)
81
  processor = AutoProcessor.from_pretrained(ckpt)
82
+ ```
83
+
84
+ ### B) General / Freeze models (no CTC head)
85
+ ```python
86
+ from transformers import AutoProcessor, Wav2Vec2Model, HubertModel, WavLMModel
87
+
88
+ # Wav2Vec2 example (General)
89
+ ckpt = "haeylee/ssl_ft_pron/wav2vec2/general/01_wav2vec2-large"
90
+ model = Wav2Vec2Model.from_pretrained(ckpt)
91
+ processor = AutoProcessor.from_pretrained(ckpt)
92
+
93
+ # HuBERT example (Freeze)
94
+ # ckpt = "haeylee/ssl_ft_pron/hubert/freeze/06_hubert-large-ll60k"
95
+ # model = HubertModel.from_pretrained(ckpt)
96
+ # processor = AutoProcessor.from_pretrained(ckpt)
97
+
98
+ # WavLM example (General)
99
+ # ckpt = "haeylee/ssl_ft_pron/wavlm/general/10_wavlm-large"
100
+ # model = WavLMModel.from_pretrained(ckpt)
101
+ # processor = AutoProcessor.from_pretrained(ckpt)
102
+ ```
103
+ ### Summary
104
+ - CTC: `AutoModelForCTC.from_pretrained(...)`
105
+ - General/Freeze: `Wav2Vec2Model`, `HubertModel`, or `WavLMModel` with `.from_pretrained(...)`
106
+
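Hub repository IDs take the form `namespace/name`, so depending on the installed `transformers` and `huggingface_hub` versions, a longer path such as the ones above may be rejected as a repo ID. A hedged workaround, assuming `from_pretrained` accepts the `subfolder` keyword in your version, is to split the path into a repo ID and a subfolder. `split_hub_path` below is a hypothetical helper, not part of this repository.

```python
def split_hub_path(full_path: str) -> tuple[str, str]:
    """Split 'namespace/name/sub/dir' into ('namespace/name', 'sub/dir')."""
    parts = [p for p in full_path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected at least 'namespace/name', got {full_path!r}")
    return "/".join(parts[:2]), "/".join(parts[2:])


repo_id, subfolder = split_hub_path(
    "haeylee/ssl_ft_pron/wav2vec2/ctc/01_wav2vec2-large"
)

# Assumed usage (requires network access and the transformers package):
# from transformers import AutoModelForCTC
# model = AutoModelForCTC.from_pretrained(repo_id, subfolder=subfolder)
```

The same split should apply to the General/Freeze checkpoints, with the backbone class swapped in for `AutoModelForCTC`.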
107
+ ## Training Details
108
+
109
+ ### Training Data
110
+ - **Dataset:** [Speechocean762](https://openslr.org/101/)
111
+ - **Preprocessing:** Use `preprocess_dataset.py` (in the repo) to convert raw audio/labels into Hugging Face `datasets` format.
112
+
113
+ Expected processed layout:
114