---
library_name: transformers
license: cc-by-4.0
datasets:
- DataLabX/ScreenTalk-XS
language:
- en
metrics:
- wer
base_model:
- openai/whisper-small
---

# 📌 ScreenTalk-xs: Fine-Tuned Whisper Model for Movie & TV Audio

## 📜 Model Details

- **Model Name**: ScreenTalk-xs
- **Developed by**: DataLabX
- **Finetuned from**: [`openai/whisper-small`](https://huggingface.co/openai/whisper-small)
- **Language(s)**: English
- **License**: CC-BY-4.0
- **Repository**: [Hugging Face Model Hub](https://huggingface.co/fj11/ScreenTalk-xs)

## 📌 Model Description

ScreenTalk-xs is a fine-tuned version of OpenAI's `whisper-small` model, optimized for **speech-to-text transcription** of **movie & TV show audio**. It is trained to **improve ASR (Automatic Speech Recognition) performance** in dialogue-heavy scenarios.

### 🔹 Key Features

- 📺 **Optimized for movie & TV dialogue**
- 🎤 **Robust to noisy environments**
- 🔍 **Improved handling of long-form speech**
- 🚀 **Memory-efficient fine-tuning with LoRA**

---

## 🚀 Uses

### ✅ Direct Use

- **Speech-to-text transcription** for movies, TV shows, and general spoken audio.
- **Automatic subtitling & captioning** for multimedia content.
- **Voice-enabled applications** such as AI assistants & transcription services.

### 🔹 Downstream Use

- Can serve as a starting point for **further ASR fine-tuning** in entertainment, media, and accessibility applications.

### ❌ Out-of-Scope Use

- Not optimized for **real-time streaming ASR**.
- May not generalize well to **heavily accented speech** outside its training data.

---

## 🛠 Training Details

### 📌 Training Data

The model was fine-tuned on the **ScreenTalk-XS dataset**, a collection of transcribed movie & TV audio.

### 📌 Training Hyperparameters

| **Hyperparameter**    | **Value** |
|-----------------------|-----------|
| Learning Rate         | `5e-5`    |
| Batch Size            | `6`       |
| Gradient Accumulation | `4`       |
| Epochs                | `5`       |
| LoRA Rank (`r`)       | `4`       |
| Optimizer             | AdamW     |

### 📌 Training Procedure

- **Fine-tuned with LoRA**, so only low-rank adapter weights were updated, reducing memory consumption.
- **Evaluated on a held-out validation set** after each epoch to monitor WER (Word Error Rate).

---

## 📊 Evaluation

### 📌 Training Results

| **Epoch** | **Training Loss** | **Validation Loss** | **WER (%)** |
|-----------|-------------------|---------------------|-------------|
| **1**     | 0.502400          | 0.333292            | 20.870653   |
| **2**     | 0.244200          | 0.327987            | 20.580875   |
| **3**     | 0.523600          | 0.325907            | 21.924394   |
| **4**     | 0.445500          | 0.326386            | 20.508430   |
| **5**     | 0.285700          | 0.327116            | 20.752107   |

- **Best checkpoint:** epoch 4, with **WER = 20.51%**.
- **Validation WER worsens again at epoch 5**, suggesting the model begins to overfit.

### 📌 Test Results

| **Model**                      | **WER (%)** |
|--------------------------------|-------------|
| **Whisper-small (baseline)**   | 30.00       |
| **ScreenTalk-xs (fine-tuned)** | **27.00** ✅ |

### 🔍 Key Observations

- **Fine-tuning reduced WER from 30.00% to 27.00%** 🎯
- **That is a 10% relative reduction in WER.**
- **Measured on the ScreenTalk-XS test set** (a sketch for reproducing this evaluation appears at the end of this card).

---

## 🖥️ Technical Specifications

### 📌 Model Architecture

- Based on **Whisper-small**, a transformer-based sequence-to-sequence ASR model.
- Fine-tuned using **LoRA** to reduce the memory footprint; a configuration sketch follows below.
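To make the setup concrete, the sketch below shows how `whisper-small` can be wrapped with a LoRA adapter via the `peft` library, using the hyperparameters from the table above. This is a minimal sketch, not the exact training script: `lora_alpha`, `lora_dropout`, the `target_modules` selection, `fp16`, and `output_dir` are assumptions not stated in this card.

```python
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainingArguments
from peft import LoraConfig, get_peft_model

# Load the base model; LoRA keeps its original weights frozen.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=4,                                  # LoRA rank, as in the table above
    lora_alpha=16,                        # assumed; not stated in this card
    target_modules=["q_proj", "v_proj"],  # assumed; a common choice for Whisper
    lora_dropout=0.05,                    # assumed
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable

# Remaining hyperparameters from the table; the default optimizer is AdamW.
training_args = Seq2SeqTrainingArguments(
    output_dir="./screentalk-xs",  # placeholder
    learning_rate=5e-5,
    per_device_train_batch_size=6,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    fp16=True,                     # assumed; typical on a T4 GPU
)
```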
### 📌 Hardware & Compute Infrastructure

- **Training Hardware:** NVIDIA T4 (16 GB) GPU
- **Training Time:** ~5 hours
- **Training Environment:** PyTorch + Transformers (Hugging Face)

---

## 📖 How to Use

You can use this model for **speech-to-text transcription** with the `pipeline` API:

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="fj11/ScreenTalk-xs",
    device=0,           # first GPU; use device=-1 to run on CPU
    chunk_length_s=30,  # optional: split long-form audio into 30-second chunks
)

result = pipe("path/to/audio.wav")
print(result["text"])
```

---

## 📜 Citation

If you use this model, please cite:

```bibtex
@misc{DataLabX2025ScreenTalkXS,
  author    = {DataLabX},
  title     = {ScreenTalk-xs: ASR Model Fine-Tuned on Movie & TV Audio},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/DataLabX/ScreenTalk-xs}
}
```
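---

## 📊 Reproducing the Evaluation

The test-set WER reported above can be recomputed with the `evaluate` library. This is a minimal sketch under stated assumptions: the `test` split name and the `audio`/`text` column names of `DataLabX/ScreenTalk-XS` are guesses about the dataset layout, and text normalization is reduced to lowercasing.

```python
import evaluate
from datasets import load_dataset
from transformers import pipeline

# WER metric (backed by jiwer) and the fine-tuned model.
wer_metric = evaluate.load("wer")
pipe = pipeline("automatic-speech-recognition", model="fj11/ScreenTalk-xs", device=0)

# Split and column names are assumptions about the dataset layout.
dataset = load_dataset("DataLabX/ScreenTalk-XS", split="test")

predictions = [pipe(sample["audio"])["text"].lower() for sample in dataset]
references = [sample["text"].lower() for sample in dataset]

print(f"WER: {wer_metric.compute(predictions=predictions, references=references):.2%}")
```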