---
library_name: transformers
license: cc-by-4.0
datasets:
- DataLabX/ScreenTalk-XS
language:
- en
metrics:
- wer
base_model:
- openai/whisper-small
---


# πŸ“Œ ScreenTalk-xs: Fine-Tuned Whisper Model for Movie & TV Audio

## πŸ“œ Model Details
- **Model Name**: ScreenTalk-xs
- **Developed by**: DataLabX
- **Finetuned from**: [`openai/whisper-small`](https://huggingface.co/openai/whisper-small)
- **Language(s)**: English
- **License**: CC-BY-4.0
- **Repository**: [Hugging Face Model Hub](https://huggingface.co/fj11/ScreenTalk-xs)

## πŸ“Œ Model Description
ScreenTalk-xs is a fine-tuned version of OpenAI's `whisper-small` model, optimized for **speech-to-text transcription** of **movie & TV show audio**. The model is trained specifically to **improve ASR (Automatic Speech Recognition) performance** in dialogue-heavy scenarios.

### πŸ”Ή Key Features
- πŸ“Ί **Optimized for movie & TV dialogue**
- 🎀 **Robust to noisy environments**
- πŸ” **Improved handling of long-form speech**
- πŸš€ **Parameter-efficient fine-tuning via LoRA**

---

## πŸš€ Uses
### βœ… Direct Use
- **Speech-to-text transcription** for movies, TV shows, and general spoken audio.
- **Automatic subtitling & captioning** for multimedia content.
- **Voice-enabled applications** such as AI assistants & transcription services.

### πŸ”Ή Downstream Use
- Can be used for **improving ASR models** in entertainment, media, and accessibility applications.

### ❌ Out-of-Scope Use
- Not optimized for **real-time streaming ASR**.
- May not generalize well to **heavily accented speech** outside its training dataset.

---

## πŸ›  Training Details
### πŸ“Œ Training Data
The model was fine-tuned using the **ScreenTalk-XS dataset**, a collection of transcribed movie & TV audio.
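For reference, the dataset can be loaded directly from the Hub with the `datasets` library. A minimal sketch, assuming a standard `train` split with audio and transcript columns:

```python
from datasets import load_dataset

# Load the fine-tuning corpus (the split name and column layout are assumptions).
ds = load_dataset("DataLabX/ScreenTalk-XS", split="train")

# Inspect one example: typically an audio dict (array + sampling rate) plus its transcript.
print(ds[0])
```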

### πŸ“Œ Training Hyperparameters
| **Hyperparameter** | **Value** |
|-------------------|---------|
| Learning Rate | `5e-5` |
| Batch Size | `6` |
| Gradient Accumulation | `4` |
| Epochs | `5` |
| LoRA Rank (`r`) | `4` |
| Optimizer | AdamW |
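These settings map directly onto `Seq2SeqTrainingArguments`. A hedged sketch of the corresponding configuration; values not listed in the table (output path, mixed precision, evaluation cadence) are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./screentalk-xs",    # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=6,
    gradient_accumulation_steps=4,   # effective batch size of 24
    num_train_epochs=5,
    fp16=True,                       # assumed: mixed precision on a T4 GPU
    eval_strategy="epoch",           # matches the per-epoch results reported below
    predict_with_generate=True,      # generate transcripts so WER can be computed
)
```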

### πŸ“Œ Training Procedure
- **Fine-tuned with LoRA**, updating only a small set of low-rank adapter weights to keep memory consumption low.
- **Evaluated on a held-out set after each epoch** to monitor WER (Word Error Rate).
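A minimal sketch of what the LoRA setup might look like with the `peft` library. Only the rank `r=4` comes from the table above; the target modules, scaling factor, and dropout are illustrative assumptions:

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=4,                                  # rank, as stated in the hyperparameter table
    lora_alpha=16,                        # assumed scaling factor
    target_modules=["q_proj", "v_proj"],  # a common choice for Whisper attention layers
    lora_dropout=0.05,                    # assumed
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights is trainable
```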

---

## πŸ“Š Evaluation

### πŸ“Œ Training Results
| **Epoch** | **Training Loss** | **Validation Loss** | **WER (%)** |
|-----------|-------------------|---------------------|-------------|
| **1**     | 0.502400          | 0.333292            | 20.870653   |
| **2**     | 0.244200          | 0.327987            | 20.580875   |
| **3**     | 0.523600          | 0.325907            | 21.924394   |
| **4**     | 0.445500          | 0.326386            | 20.508430   |
| **5**     | 0.285700          | 0.327116            | 20.752107   |

- **Best Model:** `Epoch 4`, achieving **WER = 20.51%**  
- **WER rises again at epoch 5**, suggesting the onset of overfitting.

### πŸ“Œ Test Results
| **Model** | **WER (%)** |
|-----------|-------------|
| **Whisper-small (baseline)** | **30.00%** |
| **ScreenTalk-xs (fine-tuned)** | **27.00%** βœ… |

### πŸ” Key Observations
- **Fine-tuning reduced WER from 30.00% β†’ 27.00%** 🎯
- **Achieved a 10% relative improvement in ASR accuracy.**
- **Tested on the ScreenTalk-XS dataset.**
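The card does not spell out the exact WER setup, but the figures above can be reproduced with the `evaluate` package (which wraps `jiwer`). A small self-contained example:

```python
import evaluate

wer_metric = evaluate.load("wer")

# Toy example: one substitution ("jumped" vs "jumps") and one deletion ("over")
# across six reference words gives WER = 2/6 β‰ˆ 33.33%.
predictions = ["the quick brown fox jumped"]
references = ["the quick brown fox jumps over"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")
```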

---

## πŸ–₯️ Technical Specifications
### πŸ“Œ Model Architecture
- Based on **Whisper-small**, a transformer-based sequence-to-sequence ASR model.
- Fine-tuned using **LoRA** to reduce memory footprint.

### πŸ“Œ Hardware & Compute Infrastructure
- **Training Hardware:** T4 (16GB) GPU  
- **Training Time:** ~5 hours  
- **Training Environment:** PyTorch + Transformers (Hugging Face)  

---

## πŸ“– How to Use
You can use this model for **speech-to-text transcription** with `pipeline`:
```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="fj11/ScreenTalk-xs",
    device=0  # run on the first GPU; omit (or use device=-1) to run on CPU
)

result = pipe("path/to/audio.wav")
print(result["text"])
```
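Since the model targets long-form dialogue, longer recordings can be transcribed in chunks, reusing the `pipe` object defined above (the file path is a placeholder):

```python
# Chunked inference sidesteps Whisper's 30-second input window on long files.
result = pipe(
    "path/to/full_episode.wav",  # placeholder path
    chunk_length_s=30,           # split the audio into 30-second windows
    return_timestamps=True,      # also return segment-level timestamps
)
print(result["text"])
```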

---

## πŸ“œ Citation
If you use this model, please cite:
```bibtex
@misc{DataLabX2025ScreenTalkXS,
  author = {DataLabX},
  title = {ScreenTalk-xs: ASR Model Fine-Tuned on Movie & TV Audio},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/DataLabX/ScreenTalk-xs}
}
```