---
title: Text Extraction Summarization
emoji: 📈
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.18.0
app_file: app.py
pinned: false
short_description: Text extraction and summarization
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Speech and Video Transcription & Summarization App

This project is a Python-based application that provides an interactive Gradio interface for extracting text from audio and video files using OpenAI's Whisper model and summarizing the extracted text using an Arabic BART summarization model.

Table of Contents

Introduction

Features

Requirements

Installation & Setup

Usage

Code Explanation

1. Basic Setup

2. Model Loading

3. Helper Functions

4. Example Audio Processing

5. User Interface

Notes

License

Introduction

This application provides an end-to-end solution for speech-to-text conversion from audio or video files, leveraging ASR (Automatic Speech Recognition) technology. The extracted text can then be summarized using a pre-trained BART summarization model for Arabic. The system supports both audio and video file processing, breaking down large files into smaller segments for efficient processing.

Features

Extract text from audio and video: Supports MP4, AVI, MOV, MKV formats for video, and WAV, MP3 for audio.

Handles large files: Automatically splits long audio files into 30-second segments for smoother processing.

Summarizes extracted text: Uses a fine-tuned BART model to generate concise summaries.

Interactive UI: Built with Gradio, providing a simple drag-and-drop interface.

Requirements

Python: 3.10 or later (required by Gradio 5, the SDK version pinned above)

Required Libraries:

gradio

torch

transformers

moviepy

librosa

soundfile

numpy

re (built-in)

os (built-in)

GPU Support: If available, the system will use CUDA for faster processing.

Installation & Setup

Install Python: Ensure you have Python 3.10 or later installed.

Install required dependencies: Run:

pip install gradio torch transformers moviepy librosa soundfile numpy

Additional Requirements:

To process video files, install FFmpeg from the official site (https://ffmpeg.org).

Ensure an internet connection for downloading models on first run.

Usage

Run the application:

python app.py

This launches the Gradio interface and prints a local URL; a public URL is generated only when launch() is called with share=True.

Using the UI:

Test Example: A sample audio file is provided; click the "Try Example ⚡" button to test it.

Upload File: Drag and drop an audio (WAV, MP3) or video (MP4, AVI, MOV, MKV) file.

Extract Text: Click "Extract Text" after uploading to convert speech to text.

Summarize Text: Once the text is extracted, click "Summarize" to generate a concise summary.

Code Explanation

1. Basic Setup

Detect GPU availability: Uses CUDA if available.

import torch
device = "cuda" if torch.cuda.is_available() else "cpu"

2. Model Loading

ASR Model: Uses Whisper-medium from OpenAI.

Summarization Model: Loads a fine-tuned BART model for Arabic.

from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
pipe = pipeline("automatic-speech-recognition", model="openai/whisper-medium", device=0 if device=="cuda" else -1)
bart_model = AutoModelForSeq2SeqLM.from_pretrained("ahmedabdo/arabic-summarizer-bart")
bart_tokenizer = AutoTokenizer.from_pretrained("ahmedabdo/arabic-summarizer-bart")

3. Helper Functions

Text Cleaning: Collapses every run of whitespace (spaces, tabs, newlines) into a single space and trims both ends.

import re
def clean_text(text):
    return re.sub(r'\s+', ' ', text).strip()

Audio/Video Processing:

Extracts audio from video files.

Splits long audio into 30-second segments.

Uses Whisper ASR to transcribe speech into text.

def convert_audio_to_text(uploaded_file):
    ...
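
The body of convert_audio_to_text is elided above, so the following is only a minimal sketch of the segmentation step it describes, not the app's actual code. The helper names (is_video_file, split_into_segments, transcribe_in_segments) and the transcribe callback are hypothetical; the video extensions are taken from the Features list.

```python
import os

# Assumption: video extensions copied from the Features list above.
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv"}
SEGMENT_SECONDS = 30  # segment length stated in this README


def is_video_file(path):
    """Return True when the extension marks the file as a video."""
    return os.path.splitext(path)[1].lower() in VIDEO_EXTENSIONS


def split_into_segments(samples, sample_rate, segment_seconds=SEGMENT_SECONDS):
    """Slice a sequence of audio samples into consecutive fixed-length segments.

    Works on any sliceable sequence (a list or a NumPy array). The last
    segment is shorter when the audio length is not a multiple of the step.
    """
    if sample_rate <= 0 or segment_seconds <= 0:
        raise ValueError("sample_rate and segment_seconds must be positive")
    step = max(1, int(sample_rate * segment_seconds))
    return [samples[start:start + step] for start in range(0, len(samples), step)]


def transcribe_in_segments(samples, sample_rate, transcribe):
    """Transcribe each segment with `transcribe` (hypothetical callback) and join the text."""
    texts = [transcribe(segment).strip()
             for segment in split_into_segments(samples, sample_rate)]
    return " ".join(text for text in texts if text)
```

In the real app the callback could plausibly wrap the Whisper pipeline from Model Loading, for example lambda seg: pipe({"raw": seg, "sampling_rate": sample_rate})["text"]; that wiring is an assumption and is not shown in the source.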

Text Summarization: Uses BART to generate summaries.

def summarize_text(text):
    ...
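
Since the body of summarize_text is also elided, here is one way such a function might look with the Transformers seq2seq API. It is a sketch under assumptions: the name summarize_with_model, the explicit model and tokenizer arguments, the 1024-token input limit, the 150-token summary limit, and beam search with 4 beams are all illustrative choices, not values taken from the app.

```python
def summarize_with_model(text, model, tokenizer,
                         max_input_tokens=1024, max_summary_tokens=150):
    """Summarize `text` with a seq2seq model such as BART (hypothetical helper)."""
    text = text.strip()
    if not text:
        return ""  # nothing to summarize

    # Tokenize and truncate so the input fits the assumed context window.
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=max_input_tokens)

    # Generate the summary token ids; the beam settings are illustrative.
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_summary_tokens,
        num_beams=4,
        early_stopping=True,
    )

    # Decode the first (and only) sequence back into plain text.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

With the objects from Model Loading, a call would look like summarize_with_model(extracted, bart_model, bart_tokenizer). Passing the model and tokenizer explicitly keeps the function easy to test with stand-ins.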

4. Example Audio Processing

A sample MP3 file is provided for testing.

The process_example_audio function checks that the file exists before transcribing it.

EXAMPLE_AUDIO_PATH = "AUDIO-2025-02-24-22-10-37.mp3"

def process_example_audio():
    if not os.path.exists(EXAMPLE_AUDIO_PATH):
        return "⛔ Example file not found!"
    return convert_audio_to_text(EXAMPLE_AUDIO_PATH)

5. User Interface

Gradio UI Components:

Audio preview & Example button

File upload section

Buttons for text extraction and summarization

Textboxes to display results

Button Callbacks: Link UI buttons to processing functions.

with gr.Blocks() as demo:
    ...
    extract_btn.click(convert_audio_to_text, inputs=file_input, outputs=extracted_text)
    summarize_btn.click(summarize_text, inputs=extracted_text, outputs=summary_output)
    example_btn.click(process_example_audio, outputs=extracted_text)
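
The layout inside gr.Blocks() is elided above. The sketch below shows one plausible arrangement of the components listed in this section; the labels, the build_demo wrapper, and its parameters are assumptions made for illustration, while the variable names for buttons and textboxes follow the callbacks shown above.

```python
def build_demo(extract_fn, summarize_fn, example_fn, example_audio_path=None):
    """Assemble a Blocks UI around three callbacks (hypothetical wrapper)."""
    import gradio as gr  # imported lazily so this module loads without Gradio

    with gr.Blocks() as demo:
        gr.Markdown("# Speech and Video Transcription & Summarization")

        # Audio preview and example button.
        with gr.Row():
            gr.Audio(value=example_audio_path, label="Example audio", interactive=False)
            example_btn = gr.Button("Try Example ⚡")

        # File upload section; the callback receives the uploaded file's path.
        file_input = gr.File(label="Upload audio or video", type="filepath")

        # Buttons and textboxes for extraction and summarization.
        extract_btn = gr.Button("Extract Text")
        extracted_text = gr.Textbox(label="Extracted text", lines=8)
        summarize_btn = gr.Button("Summarize")
        summary_output = gr.Textbox(label="Summary", lines=4)

        # Wire each button to its processing function.
        extract_btn.click(extract_fn, inputs=file_input, outputs=extracted_text)
        summarize_btn.click(summarize_fn, inputs=extracted_text, outputs=summary_output)
        example_btn.click(example_fn, outputs=extracted_text)

    return demo
```

A call such as build_demo(convert_audio_to_text, summarize_text, process_example_audio, EXAMPLE_AUDIO_PATH) would return a demo object ready for launch().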

Launch the App: Runs the Gradio interface.

if __name__ == "__main__":
    demo.launch()

Notes

Processing Speed: Large files take longer due to segmentation and ASR processing.

Video Files: Ensure FFmpeg is installed for proper audio extraction.

Resources: Large models like Whisper and BART run on the CPU as a fallback but are much faster with GPU acceleration.