---
library_name: transformers
license: cc-by-4.0
datasets:
- DataLabX/ScreenTalk-XS
language:
- en
metrics:
- wer
base_model:
- openai/whisper-small
---

# ScreenTalk-xs: Fine-Tuned Whisper Model for Movie & TV Audio

## Model Details

- **Model Name**: ScreenTalk-xs
- **Developed by**: DataLabX
- **Fine-tuned from**: [`openai/whisper-small`](https://huggingface.co/openai/whisper-small)
- **Language(s)**: English
- **License**: CC-BY-4.0
- **Repository**: [Hugging Face Model Hub](https://huggingface.co/fj11/ScreenTalk-xs)

## Model Description

ScreenTalk-xs is a fine-tuned version of OpenAI's `whisper-small` model, optimized for **speech-to-text transcription** of **movie and TV show audio**. It is trained to **improve ASR (Automatic Speech Recognition) performance** in dialogue-heavy scenarios.

### Key Features

- **Optimized for movie & TV dialogue**
- **Robust to noisy audio**
- **Improved handling of long-form speech**
- **Memory-efficient LoRA fine-tuning**

---

## Uses

### Direct Use

- **Speech-to-text transcription** for movies, TV shows, and general spoken audio.
- **Automatic subtitling & captioning** for multimedia content (see the timestamp sketch below).
- **Voice-enabled applications** such as AI assistants & transcription services.
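
For subtitling and captioning, the same pipeline used in the How to Use section can return segment-level timestamps. A minimal sketch, assuming a local `path/to/audio.wav`; `chunk_length_s` and `return_timestamps` are standard options of the Transformers ASR pipeline, not features specific to this model:

```python
from transformers import pipeline

# ASR pipeline with long-form chunking and segment timestamps for captions.
pipe = pipeline(
    "automatic-speech-recognition",
    model="fj11/ScreenTalk-xs",
    chunk_length_s=30,       # split long audio into ~30 s windows
    return_timestamps=True,  # return (start, end) times per segment
)

result = pipe("path/to/audio.wav")
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start} --> {end}] {chunk['text']}")
```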

### Downstream Use

- Can be used for **improving ASR models** in entertainment, media, and accessibility applications.

### Out-of-Scope Use

- Not optimized for **real-time streaming ASR**.
- May not generalize well to **heavily accented speech** outside its training dataset.

---

## Training Details

### Training Data

The model was fine-tuned using the **ScreenTalk-XS dataset**, a collection of transcribed movie & TV audio.
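
The dataset can be pulled directly from the Hub for inspection. A minimal sketch; the `train` split name and the column layout are assumptions here, so check the dataset card for the actual schema:

```python
from datasets import load_dataset

# Dataset ID from this card's metadata; the split name is an assumption.
ds = load_dataset("DataLabX/ScreenTalk-XS", split="train")

print(ds)     # inspect features, e.g. audio and transcription columns
print(ds[0])  # peek at a single example
```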

### Training Hyperparameters

| **Hyperparameter** | **Value** |
|--------------------|-----------|
| Learning Rate | `5e-5` |
| Batch Size | `6` |
| Gradient Accumulation Steps | `4` |
| Epochs | `5` |
| LoRA Rank (`r`) | `4` |
| Optimizer | AdamW |
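
For reference, these settings map naturally onto a PEFT + Transformers setup. The sketch below is illustrative only: the LoRA `target_modules`, `lora_alpha`, dropout, and every argument not listed in the table are assumptions, not the exact configuration used for this model.

```python
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainingArguments

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# LoRA rank r=4 as in the table; the remaining LoRA settings are assumptions.
lora_config = LoraConfig(
    r=4,
    lora_alpha=16,                        # assumption
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections
    lora_dropout=0.05,                    # assumption
)
model = get_peft_model(model, lora_config)

training_args = Seq2SeqTrainingArguments(
    output_dir="./screentalk-xs",  # assumption
    learning_rate=5e-5,
    per_device_train_batch_size=6,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    optim="adamw_torch",           # AdamW, as in the table
)
```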

### Training Procedure

- **Fine-tuned with LoRA** to reduce memory consumption while maintaining quality.
- **Evaluated on a held-out set** after each epoch to monitor WER (Word Error Rate); a WER sketch follows this list.
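
WER can be computed with the Hugging Face `evaluate` library. A minimal sketch with toy strings (the exact evaluation script for this model is not published):

```python
import evaluate

# WER = (substitutions + insertions + deletions) / number of reference words
wer_metric = evaluate.load("wer")

references = ["we should head back before dark"]
predictions = ["we should head back before the dark"]

wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {100 * wer:.2f}%")  # one insertion over six reference words ~ 16.67%
```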

---

## Evaluation

### Training Results

| **Epoch** | **Training Loss** | **Validation Loss** | **WER (%)** |
|-----------|-------------------|---------------------|-------------|
| **1** | 0.502400 | 0.333292 | 20.870653 |
| **2** | 0.244200 | 0.327987 | 20.580875 |
| **3** | 0.523600 | 0.325907 | 21.924394 |
| **4** | 0.445500 | 0.326386 | 20.508430 |
| **5** | 0.285700 | 0.327116 | 20.752107 |

- **Best checkpoint:** epoch 4, with **WER = 20.51%**.
- **WER rises again at epoch 5**, suggesting overfitting beyond epoch 4.

### Test Results

| **Model** | **WER (%)** |
|-----------|-------------|
| Whisper-small (baseline) | 30.00 |
| **ScreenTalk-xs (fine-tuned)** | **27.00** |

### Key Observations

- **Fine-tuning reduced WER from 30.00% to 27.00%.**
- That is a **10% relative reduction in WER**: (30.00 - 27.00) / 30.00 = 0.10.
- **Tested on the ScreenTalk-XS dataset.**

---

## Technical Specifications

### Model Architecture

- Based on **Whisper-small**, a Transformer-based sequence-to-sequence ASR model.
- Fine-tuned using **LoRA** to reduce the training memory footprint.

### Hardware & Compute Infrastructure

- **Training hardware:** NVIDIA T4 (16 GB) GPU
- **Training time:** ~5 hours
- **Training environment:** PyTorch + Transformers (Hugging Face)

---

## How to Use

You can use this model for **speech-to-text transcription** with the `pipeline` API:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as an ASR pipeline.
pipe = pipeline(
    "automatic-speech-recognition",
    model="fj11/ScreenTalk-xs",
    device=0,  # GPU index; use device=-1 (or omit) to run on CPU
)

result = pipe("path/to/audio.wav")
print(result["text"])
```
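
If you need more control than the pipeline offers (for example, custom decoding options), the processor/model API also works. A minimal sketch, assuming the repository ships merged full-model weights plus processor files, and using `librosa` (not a stated dependency of this card) to load 16 kHz mono audio:

```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("fj11/ScreenTalk-xs")
model = WhisperForConditionalGeneration.from_pretrained("fj11/ScreenTalk-xs")

# Whisper expects 16 kHz mono input features.
speech, _ = librosa.load("path/to/audio.wav", sr=16000, mono=True)

inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(inputs.input_features)

print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```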

---

## Citation

If you use this model, please cite:

```bibtex
@misc{DataLabX2025ScreenTalkXS,
  author    = {DataLabX},
  title     = {ScreenTalk-xs: ASR Model Fine-Tuned on Movie \& TV Audio},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/DataLabX/ScreenTalk-xs}
}
```