---
library_name: transformers
license: cc-by-4.0
datasets:
- DataLabX/ScreenTalk-XS
language:
- en
metrics:
- wer
base_model:
- openai/whisper-small
---
# ScreenTalk-xs: Fine-Tuned Whisper Model for Movie & TV Audio
## Model Details
- **Model Name**: ScreenTalk-xs
- **Developed by**: DataLabX
- **Finetuned from**: [`openai/whisper-small`](https://huggingface.co/openai/whisper-small)
- **Language(s)**: English
- **License**: CC-BY-4.0
- **Repository**: [Hugging Face Model Hub](https://huggingface.co/fj11/ScreenTalk-xs)
## Model Description
ScreenTalk-xs is a fine-tuned version of OpenAI's `whisper-small` model, optimized for **speech-to-text transcription** of **movie and TV show audio**. It is trained to **improve ASR (Automatic Speech Recognition) performance** in dialogue-heavy scenarios.
### Key Features
- **Optimized for movie & TV dialogue**
- **Robust to noisy environments**
- **Improved handling of long-form speech**
- **Efficient inference with LoRA fine-tuning**
---
## Uses
### Direct Use
- **Speech-to-text transcription** for movies, TV shows, and general spoken audio.
- **Automatic subtitling & captioning** for multimedia content.
- **Voice-enabled applications** such as AI assistants and transcription services.
### Downstream Use
- Can be used to **improve ASR systems** in entertainment, media, and accessibility applications.
### Out-of-Scope Use
- Not optimized for **real-time streaming ASR**.
- May not generalize well to **heavily accented speech** outside its training data.
---
## Training Details
### Training Data
The model was fine-tuned on the **ScreenTalk-XS dataset**, a collection of transcribed movie and TV audio.
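The dataset is available on the Hugging Face Hub and can be loaded with the `datasets` library; a minimal sketch (check the dataset card for the actual split layout):

```python
from datasets import load_dataset

# Load ScreenTalk-XS; each example pairs an audio clip with its transcript.
ds = load_dataset("DataLabX/ScreenTalk-XS")
print(ds)
```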
### Training Hyperparameters
| **Hyperparameter** | **Value** |
|-------------------|---------|
| Learning Rate | `5e-5` |
| Batch Size | `6` |
| Gradient Accumulation | `4` |
| Epochs | `5` |
| LoRA Rank (`r`) | `4` |
| Optimizer | AdamW |
### Training Procedure
- **Fine-tuned with LoRA** to cut the trainable-parameter count and memory use.
- **Evaluated on a held-out validation set** after each epoch to monitor WER (Word Error Rate).
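A minimal sketch of what such a LoRA setup might look like with the `peft` library; only `r = 4` comes from the table above, while `lora_alpha`, `lora_dropout`, and the target modules are assumptions:

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the base model that ScreenTalk-xs was fine-tuned from.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=4,                                  # LoRA rank (from the table above)
    lora_alpha=16,                        # assumed scaling factor
    target_modules=["q_proj", "v_proj"],  # common choice for Whisper attention
    lora_dropout=0.05,                    # assumed
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable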
---
## Evaluation
### Training Results
| **Epoch** | **Training Loss** | **Validation Loss** | **WER (%)** |
|-----------|-----------------|-----------------|-------------|
| **1** | 0.502400 | 0.333292 | 20.870653 |
| **2** | 0.244200 | 0.327987 | 20.580875 |
| **3** | 0.523600 | 0.325907 | 21.924394 |
| **4** | 0.445500 | 0.326386 | 20.508430 |
| **5** | 0.285700 | 0.327116 | 20.752107 |
- **Best model:** epoch 4, achieving **WER = 20.51%**.
- **Performance degrades after epoch 4**, suggesting the model starts to overfit.
### Test Results
| **Model** | **WER (%)** |
|-----------|-------------|
| Whisper-small (baseline) | 30.00 |
| ScreenTalk-xs (fine-tuned) | **27.00** |
### Key Observations
- **Fine-tuning reduced WER from 30.00% to 27.00%** on the ScreenTalk-XS test set.
- This corresponds to a **10% relative improvement** in ASR accuracy.
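For reference, WER can be computed with the `evaluate` library; a minimal sketch (the example strings below are made up):

```python
import evaluate

wer_metric = evaluate.load("wer")

predictions = ["hello world", "good night moon"]  # hypothetical model outputs
references = ["hello duck", "good night moon"]    # hypothetical ground truth

# WER = (substitutions + insertions + deletions) / number of reference words
print(wer_metric.compute(predictions=predictions, references=references))
```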
---
## Technical Specifications
### Model Architecture
- Based on **Whisper-small**, a transformer-based sequence-to-sequence ASR model.
- Fine-tuned using **LoRA** to reduce memory footprint.
### Hardware & Compute Infrastructure
- **Training Hardware:** T4 (16GB) GPU
- **Training Time:** ~5 hours
- **Training Environment:** PyTorch + Transformers (Hugging Face)
---
## How to Use
You can use this model for **speech-to-text transcription** with the `pipeline` API:
```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="fj11/ScreenTalk-xs",
    device=0,  # GPU index; use device=-1 (or omit it) to run on CPU
)

result = pipe("path/to/audio.wav")
print(result["text"])
```
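For long recordings (e.g. a full episode or film), chunked inference with timestamps tends to work better and is convenient for subtitling; a minimal sketch:

```python
from transformers import pipeline

# Split long audio into 30-second windows and return per-segment timestamps.
pipe = pipeline(
    "automatic-speech-recognition",
    model="fj11/ScreenTalk-xs",
    chunk_length_s=30,
    return_timestamps=True,
)

result = pipe("path/to/episode.wav")
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```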
---
## Citation
If you use this model, please cite:
```bibtex
@misc{DataLabX2025ScreenTalkXS,
author = {DataLabX},
title = {ScreenTalk-xs: ASR Model Fine-Tuned on Movie & TV Audio},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/DataLabX/ScreenTalk-xs}
}
```