---
library_name: transformers
license: cc-by-4.0
datasets:
- DataLabX/ScreenTalk-XS
language:
- en
metrics:
- wer
base_model:
- openai/whisper-small
---

# 📌 ScreenTalk-xs: Fine-Tuned Whisper Model for Movie & TV Audio

## 📜 Model Details

- **Model Name**: ScreenTalk-xs
- **Developed by**: DataLabX
- **Finetuned from**: [`openai/whisper-small`](https://huggingface.co/openai/whisper-small)
- **Language(s)**: English
- **License**: CC-BY-4.0
- **Repository**: [Hugging Face Model Hub](https://huggingface.co/fj11/ScreenTalk-xs)

## 📌 Model Description

ScreenTalk-xs is a fine-tuned version of OpenAI's `whisper-small` model, optimized for **speech-to-text transcription** of **movie & TV show audio**. It is trained to **improve ASR (Automatic Speech Recognition) performance** in dialogue-heavy scenarios.

### 🔹 Key Features

- 📺 **Optimized for movie & TV dialogue**
- 🎤 **Robust to noisy environments**
- 🔍 **Improved handling of long-form speech**
- 🚀 **Memory-efficient fine-tuning with LoRA**

---

## 🚀 Uses

### ✅ Direct Use

- **Speech-to-text transcription** for movies, TV shows, and general spoken audio.
- **Automatic subtitling & captioning** for multimedia content.
- **Voice-enabled applications** such as AI assistants & transcription services.

### 🔹 Downstream Use

- Can serve as a starting point for **further ASR fine-tuning** in entertainment, media, and accessibility applications.

### ❌ Out-of-Scope Use

- Not optimized for **real-time streaming ASR**.
- May not generalize well to **heavily accented speech** outside its training data.

---

## 🛠 Training Details

### 📌 Training Data

The model was fine-tuned on the **ScreenTalk-XS dataset**, a collection of transcribed movie & TV audio.

### 📌 Training Hyperparameters

| **Hyperparameter**    | **Value** |
|-----------------------|-----------|
| Learning Rate         | `5e-5`    |
| Batch Size            | `6`       |
| Gradient Accumulation | `4`       |
| Epochs                | `5`       |
| LoRA Rank (`r`)       | `4`       |
| Optimizer             | AdamW     |

### 📌 Training Procedure

- **Fine-tuned with LoRA**, so only low-rank adapter weights were updated, reducing memory consumption.
- **Evaluated on a held-out validation set** after each epoch to monitor WER (Word Error Rate).

---

## 📊 Evaluation

### 📌 Training Results

| **Epoch** | **Training Loss** | **Validation Loss** | **WER (%)** |
|-----------|-------------------|---------------------|-------------|
| **1**     | 0.502400          | 0.333292            | 20.870653   |
| **2**     | 0.244200          | 0.327987            | 20.580875   |
| **3**     | 0.523600          | 0.325907            | 21.924394   |
| **4**     | 0.445500          | 0.326386            | 20.508430   |
| **5**     | 0.285700          | 0.327116            | 20.752107   |

- **Best checkpoint:** epoch 4, with **WER = 20.51%**.
- **Validation WER worsens again at epoch 5**, suggesting the model begins to overfit.

### 📌 Test Results

| **Model**                      | **WER (%)** |
|--------------------------------|-------------|
| **Whisper-small (baseline)**   | 30.00       |
| **ScreenTalk-xs (fine-tuned)** | **27.00** ✅ |

### 🔍 Key Observations

- **Fine-tuning reduced WER from 30.00% to 27.00%** 🎯
- **That is a 10% relative reduction in WER.**
- **Measured on the ScreenTalk-XS test set** (a sketch for reproducing this evaluation appears at the end of this card).

---

## 🖥️ Technical Specifications

### 📌 Model Architecture

- Based on **Whisper-small**, a transformer-based sequence-to-sequence ASR model.
- Fine-tuned using **LoRA** to reduce the memory footprint; a configuration sketch follows below.
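To make the setup concrete, the sketch below shows how `whisper-small` can be wrapped with a LoRA adapter via the `peft` library, using the hyperparameters from the table above. This is a minimal sketch, not the exact training script: `lora_alpha`, `lora_dropout`, the `target_modules` selection, `fp16`, and `output_dir` are assumptions not stated in this card.

```python
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainingArguments
from peft import LoraConfig, get_peft_model

# Load the base model; LoRA keeps its original weights frozen.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=4,                                  # LoRA rank, as in the table above
    lora_alpha=16,                        # assumed; not stated in this card
    target_modules=["q_proj", "v_proj"],  # assumed; a common choice for Whisper
    lora_dropout=0.05,                    # assumed
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable

# Remaining hyperparameters from the table; the default optimizer is AdamW.
training_args = Seq2SeqTrainingArguments(
    output_dir="./screentalk-xs",  # placeholder
    learning_rate=5e-5,
    per_device_train_batch_size=6,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    fp16=True,                     # assumed; typical on a T4 GPU
)
```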
### 📌 Hardware & Compute Infrastructure

- **Training Hardware:** NVIDIA T4 (16 GB) GPU
- **Training Time:** ~5 hours
- **Training Environment:** PyTorch + Transformers (Hugging Face)

---

## 📖 How to Use

You can use this model for **speech-to-text transcription** with the `pipeline` API:

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="fj11/ScreenTalk-xs",
    device=0,           # first GPU; use device=-1 to run on CPU
    chunk_length_s=30,  # optional: split long-form audio into 30-second chunks
)

result = pipe("path/to/audio.wav")
print(result["text"])
```

---

## 📜 Citation

If you use this model, please cite:

```bibtex
@misc{DataLabX2025ScreenTalkXS,
  author    = {DataLabX},
  title     = {ScreenTalk-xs: ASR Model Fine-Tuned on Movie & TV Audio},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/DataLabX/ScreenTalk-xs}
}
```
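---

## 📊 Reproducing the Evaluation

The test-set WER reported above can be recomputed with the `evaluate` library. This is a minimal sketch under stated assumptions: the `test` split name and the `audio`/`text` column names of `DataLabX/ScreenTalk-XS` are guesses about the dataset layout, and text normalization is reduced to lowercasing.

```python
import evaluate
from datasets import load_dataset
from transformers import pipeline

# WER metric (backed by jiwer) and the fine-tuned model.
wer_metric = evaluate.load("wer")
pipe = pipeline("automatic-speech-recognition", model="fj11/ScreenTalk-xs", device=0)

# Split and column names are assumptions about the dataset layout.
dataset = load_dataset("DataLabX/ScreenTalk-XS", split="test")

predictions = [pipe(sample["audio"])["text"].lower() for sample in dataset]
references = [sample["text"].lower() for sample in dataset]

print(f"WER: {wer_metric.compute(predictions=predictions, references=references):.2%}")
```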