---
library_name: transformers
license: cc-by-4.0
datasets:
- DataLabX/ScreenTalk-XS
language:
- en
metrics:
- wer
base_model:
- openai/whisper-small
---

# ScreenTalk-xs: Fine-Tuned Whisper Model for Movie & TV Audio

## Model Details

- **Model Name**: ScreenTalk-xs
- **Developed by**: DataLabX
- **Fine-tuned from**: [`openai/whisper-small`](https://huggingface.co/openai/whisper-small)
- **Language(s)**: English
- **License**: CC-BY-4.0
- **Repository**: [Hugging Face Model Hub](https://huggingface.co/fj11/ScreenTalk-xs)

## Model Description

ScreenTalk-xs is a fine-tuned version of OpenAI's `whisper-small` model, optimized for **speech-to-text transcription** of **movie and TV show audio**. It is trained to **improve ASR (Automatic Speech Recognition) performance** in dialogue-heavy scenarios.

### Key Features

- **Optimized for movie & TV dialogue**
- **Robust to noisy audio**
- **Improved handling of long-form speech**
- **Memory-efficient LoRA fine-tuning**

---

## Uses

### Direct Use

- **Speech-to-text transcription** for movies, TV shows, and general spoken audio.
- **Automatic subtitling & captioning** for multimedia content (see the timestamp sketch below).
- **Voice-enabled applications** such as AI assistants & transcription services.
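
For subtitling and captioning, the same pipeline used in the How to Use section can return segment-level timestamps. A minimal sketch, assuming a local `path/to/audio.wav`; `chunk_length_s` and `return_timestamps` are standard options of the Transformers ASR pipeline, not features specific to this model:

```python
from transformers import pipeline

# ASR pipeline with long-form chunking and segment timestamps for captions.
pipe = pipeline(
    "automatic-speech-recognition",
    model="fj11/ScreenTalk-xs",
    chunk_length_s=30,       # split long audio into ~30 s windows
    return_timestamps=True,  # return (start, end) times per segment
)

result = pipe("path/to/audio.wav")
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start} --> {end}] {chunk['text']}")
```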

### Downstream Use

- Can be used for **improving ASR models** in entertainment, media, and accessibility applications.

### Out-of-Scope Use

- Not optimized for **real-time streaming ASR**.
- May not generalize well to **heavily accented speech** outside its training dataset.

---

## Training Details

### Training Data

The model was fine-tuned using the **ScreenTalk-XS dataset**, a collection of transcribed movie & TV audio.
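
The dataset can be pulled directly from the Hub for inspection. A minimal sketch; the `train` split name and the column layout are assumptions here, so check the dataset card for the actual schema:

```python
from datasets import load_dataset

# Dataset ID from this card's metadata; the split name is an assumption.
ds = load_dataset("DataLabX/ScreenTalk-XS", split="train")

print(ds)     # inspect features, e.g. audio and transcription columns
print(ds[0])  # peek at a single example
```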

### Training Hyperparameters

| **Hyperparameter** | **Value** |
|--------------------|-----------|
| Learning Rate | `5e-5` |
| Batch Size | `6` |
| Gradient Accumulation Steps | `4` |
| Epochs | `5` |
| LoRA Rank (`r`) | `4` |
| Optimizer | AdamW |
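
For reference, these settings map naturally onto a PEFT + Transformers setup. The sketch below is illustrative only: the LoRA `target_modules`, `lora_alpha`, dropout, and every argument not listed in the table are assumptions, not the exact configuration used for this model.

```python
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainingArguments

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# LoRA rank r=4 as in the table; the remaining LoRA settings are assumptions.
lora_config = LoraConfig(
    r=4,
    lora_alpha=16,                        # assumption
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections
    lora_dropout=0.05,                    # assumption
)
model = get_peft_model(model, lora_config)

training_args = Seq2SeqTrainingArguments(
    output_dir="./screentalk-xs",  # assumption
    learning_rate=5e-5,
    per_device_train_batch_size=6,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    optim="adamw_torch",           # AdamW, as in the table
)
```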

### Training Procedure

- **Fine-tuned with LoRA** to reduce memory consumption while maintaining quality.
- **Evaluated on a held-out set** after each epoch to monitor WER (Word Error Rate); a WER sketch follows this list.
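
WER can be computed with the Hugging Face `evaluate` library. A minimal sketch with toy strings (the exact evaluation script for this model is not published):

```python
import evaluate

# WER = (substitutions + insertions + deletions) / number of reference words
wer_metric = evaluate.load("wer")

references = ["we should head back before dark"]
predictions = ["we should head back before the dark"]

wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {100 * wer:.2f}%")  # one insertion over six reference words ~ 16.67%
```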

---

## Evaluation

### Training Results

| **Epoch** | **Training Loss** | **Validation Loss** | **WER (%)** |
|-----------|-------------------|---------------------|-------------|
| **1** | 0.502400 | 0.333292 | 20.870653 |
| **2** | 0.244200 | 0.327987 | 20.580875 |
| **3** | 0.523600 | 0.325907 | 21.924394 |
| **4** | 0.445500 | 0.326386 | 20.508430 |
| **5** | 0.285700 | 0.327116 | 20.752107 |

- **Best checkpoint:** epoch 4, with **WER = 20.51%**.
- **WER rises again at epoch 5**, suggesting overfitting beyond epoch 4.

### Test Results

| **Model** | **WER (%)** |
|-----------|-------------|
| Whisper-small (baseline) | 30.00 |
| **ScreenTalk-xs (fine-tuned)** | **27.00** |

### Key Observations

- **Fine-tuning reduced WER from 30.00% to 27.00%.**
- That is a **10% relative reduction in WER**: (30.00 - 27.00) / 30.00 = 0.10.
- **Tested on the ScreenTalk-XS dataset.**

---

## Technical Specifications

### Model Architecture

- Based on **Whisper-small**, a Transformer-based sequence-to-sequence ASR model.
- Fine-tuned using **LoRA** to reduce the training memory footprint.

### Hardware & Compute Infrastructure

- **Training hardware:** NVIDIA T4 (16 GB) GPU
- **Training time:** ~5 hours
- **Training environment:** PyTorch + Transformers (Hugging Face)

---

## How to Use

You can use this model for **speech-to-text transcription** with the `pipeline` API:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as an ASR pipeline.
pipe = pipeline(
    "automatic-speech-recognition",
    model="fj11/ScreenTalk-xs",
    device=0,  # GPU index; use device=-1 (or omit) to run on CPU
)

result = pipe("path/to/audio.wav")
print(result["text"])
```
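
If you need more control than the pipeline offers (for example, custom decoding options), the processor/model API also works. A minimal sketch, assuming the repository ships merged full-model weights plus processor files, and using `librosa` (not a stated dependency of this card) to load 16 kHz mono audio:

```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("fj11/ScreenTalk-xs")
model = WhisperForConditionalGeneration.from_pretrained("fj11/ScreenTalk-xs")

# Whisper expects 16 kHz mono input features.
speech, _ = librosa.load("path/to/audio.wav", sr=16000, mono=True)

inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(inputs.input_features)

print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```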

---

## Citation

If you use this model, please cite:

```bibtex
@misc{DataLabX2025ScreenTalkXS,
  author    = {DataLabX},
  title     = {ScreenTalk-xs: ASR Model Fine-Tuned on Movie \& TV Audio},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/DataLabX/ScreenTalk-xs}
}
```