---
library_name: transformers
license: cc-by-4.0
datasets:
- DataLabX/ScreenTalk-XS
language:
- en
metrics:
- wer
base_model:
- openai/whisper-small
---
# 📌 ScreenTalk-xs: Fine-Tuned Whisper Model for Movie & TV Audio
## 📜 Model Details
- **Model Name**: ScreenTalk-xs
- **Developed by**: DataLabX
- **Finetuned from**: [`openai/whisper-small`](https://huggingface.co/openai/whisper-small)
- **Language(s)**: English
- **License**: CC-BY-4.0
- **Repository**: [Hugging Face Model Hub](https://huggingface.co/fj11/ScreenTalk-xs)
## 📌 Model Description
ScreenTalk-xs is a fine-tuned version of OpenAI's `whisper-small` model, optimized for **speech-to-text transcription** on **movies & TV show audio**. This model is specifically trained to **improve ASR (Automatic Speech Recognition) performance** in dialogue-heavy scenarios.
### 🔹 Key Features
- 📺 **Optimized for movie & TV dialogues**
- 🎤 **Robust to noisy environments**
- 🔍 **Improved handling of long-form speech**
- 🚀 **Memory-efficient fine-tuning with LoRA**
---
## 🚀 Uses
### ✅ Direct Use
- **Speech-to-text transcription** for movies, TV shows, and general spoken audio.
- **Automatic subtitling & captioning** for multimedia content.
- **Voice-enabled applications** such as AI assistants & transcription services.
### 🔹 Downstream Use
- Can be used for **improving ASR models** in entertainment, media, and accessibility applications.
### ❌ Out-of-Scope Use
- Not optimized for **real-time streaming ASR**.
- May not generalize well to **heavily accented speech** outside its training dataset.
---
## 🛠 Training Details
### 📌 Training Data
The model was fine-tuned using the **ScreenTalk-XS dataset**, a collection of transcribed movie & TV audio.
### 📌 Training Hyperparameters
| **Hyperparameter** | **Value** |
|-------------------|---------|
| Learning Rate | `5e-5` |
| Batch Size | `6` |
| Gradient Accumulation | `4` |
| Epochs | `5` |
| LoRA Rank (`r`) | `4` |
| Optimizer | AdamW |
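Note that with gradient accumulation, the optimizer effectively sees a larger batch than the per-device value. A quick sketch of that arithmetic, using the values from the table above:

```python
# Gradient accumulation sums gradients from several small forward/backward
# passes before a single optimizer step, so the effective batch size is:
per_device_batch_size = 6
gradient_accumulation_steps = 4

effective_batch_size = per_device_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 24
```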
### 📌 Training Procedure
- **Fine-tuned with LoRA** to reduce memory consumption while preserving transcription quality.
- **Evaluation on a held-out test set** to monitor WER (Word Error Rate).
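LoRA replaces a full weight update with a trainable low-rank product. The sketch below (assuming whisper-small's hidden size of 768; `alpha` is illustrative, not a value reported here) shows why rank `r = 4` cuts trainable parameters to roughly 1% of a single weight matrix:

```python
import numpy as np

# LoRA (Low-Rank Adaptation): the frozen weight W is adapted as
# W + (alpha / r) * B @ A, where only the small matrices A and B are trained.
d, r, alpha = 768, 4, 8          # d: hidden size of whisper-small; alpha illustrative

rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))      # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01
B = np.zeros((d, r))             # B starts at zero, so the initial update is a no-op

W_adapted = W + (alpha / r) * (B @ A)

# Trainable parameters drop from d*d to 2*d*r:
full_params = d * d              # 589_824
lora_params = d * r + r * d      # 6_144
print(lora_params / full_params) # roughly 1% of the full matrix
```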
---
## 📊 Evaluation
### 📌 Training Results
| **Epoch** | **Training Loss** | **Validation Loss** | **WER (%)** |
|-----------|-----------------|-----------------|-------------|
| **1** | 0.502400 | 0.333292 | 20.870653 |
| **2** | 0.244200 | 0.327987 | 20.580875 |
| **3** | 0.523600 | 0.325907 | 21.924394 |
| **4** | 0.445500 | 0.326386 | 20.508430 |
| **5** | 0.285700 | 0.327116 | 20.752107 |
- **Best Model:** `Epoch 4`, achieving **WER = 20.51%**
- **Model performance degrades after epoch 4**, suggesting overfitting.
### 📌 Test Results
| **Model** | **WER (%)** |
|-----------|-------------|
| **Whisper-small (baseline)** | **30.00%** |
| **ScreenTalk-xs (fine-tuned)** | **27.00%** ✅ |
### **πŸ” Key Observations**
- **Fine-tuning reduced WER from 30.00% β†’ 27.00%** 🎯
- **Achieved a 10% relative improvement in ASR accuracy.**
- **Tested on the ScreenTalk-XS dataset.**
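WER is the word-level edit distance divided by the number of reference words. A minimal stdlib-only sketch of the metric's definition (the actual evaluation presumably used a library such as `evaluate` or `jiwer`; this is just for illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") plus one deletion ("the") over 6 words:
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 0.333...

# The relative improvement cited above: (30 - 27) / 30 = 10%
relative_improvement = (30.0 - 27.0) / 30.0
print(relative_improvement)  # 0.1
```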
---
## 🖥️ Technical Specifications
### 📌 Model Architecture
- Based on **Whisper-small**, a transformer-based sequence-to-sequence ASR model.
- Fine-tuned using **LoRA** to reduce memory footprint.
### 📌 Hardware & Compute Infrastructure
- **Training Hardware:** T4 (16GB) GPU
- **Training Time:** ~5 hours
- **Training Environment:** PyTorch + Transformers (Hugging Face)
---
## 📖 How to Use
You can use this model for **speech-to-text transcription** with `pipeline`:
```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="fj11/ScreenTalk-xs",
    device=0,  # run on GPU; use device=-1 for CPU
)

result = pipe("path/to/audio.wav")
print(result["text"])
```
---
## 📜 Citation
If you use this model, please cite:
```bibtex
@misc{DataLabX2025ScreenTalkXS,
  author    = {DataLabX},
  title     = {ScreenTalk-xs: ASR Model Fine-Tuned on Movie & TV Audio},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/DataLabX/ScreenTalk-xs}
}
```