---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# AI-Powered Audio & Video to Text Conversion

## Project Description
This project converts audio and video content into text using AI, and lets users interact with the extracted text through question answering, summarization, and text-to-speech.

## Introduction
With the growing volume of digital content, users often struggle to retrieve information from audio and video files efficiently. This project uses AI to transcribe content into text, summarize key points, and provide an interactive Q&A system. Users can also listen to the summarized text if they prefer.

## Project Objectives
1. **Enhance the speed of information retrieval** by enabling quick searches within extracted text.
2. **Provide an interactive experience** where users can ask questions about the content and receive instant answers.
3. **Support both Arabic and English** to accommodate a wider user base.
4. **Use AI to analyze and comprehend content**, enabling effective extraction of, and interaction with, information.

## Features
- **Speech-to-Text**: Extracts text from audio and video files.
- **Text Summarization**: Generates concise summaries of the extracted text.
- **Question Answering**: Answers user queries based on the extracted text.
- **Text-to-Speech**: Converts text into speech for an auditory experience.
- **Multilingual Support**: Works in both Arabic and English.

## Technology Stack
### 1. Programming Language
- **Python**: The core language for the project, chosen for its extensive AI and data-processing libraries.

### 2. Libraries & Frameworks
- **Gradio**: Provides the interactive web UI for uploading files and working with the extracted text.
- **Hugging Face Transformers**: Supplies pre-trained models for automatic speech recognition, summarization, translation, and question answering.
- **MoviePy**: Extracts audio from video files for further processing.
- **Librosa & SoundFile**: Handle audio processing, including loading, resampling, and segmenting audio clips.
- **gTTS (Google Text-to-Speech)**: Converts text into spoken words.
- **LangDetect**: Detects the language of the extracted text so the appropriate models are applied.
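The stack above corresponds to a requirements file along these lines. This is a sketch using the standard PyPI package names, since the project's actual `requirements.txt` is not shown here:

```text
gradio
transformers
torch
moviepy
librosa
soundfile
gTTS
langdetect
```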

## Code Breakdown
### 1. **Setting Up the Environment**
The script starts by checking whether a GPU (CUDA) is available, which significantly speeds up model inference.
```python
import torch

# Run models on the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
```
### 2. **Loading AI Models**
Several models handle the different functionalities:
- **Whisper (Speech Recognition)**: Converts audio into text.
- **BART (Summarization)**: Generates concise summaries.
- **Helsinki-NLP (Translation)**: Translates between Arabic and English.
- **BERT (Question Answering)**: Finds answers in the extracted text.
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-medium", device=0 if device == "cuda" else -1)
bart_model = AutoModelForSeq2SeqLM.from_pretrained("ahmedabdo/arabic-summarizer-bart")
bart_tokenizer = AutoTokenizer.from_pretrained("ahmedabdo/arabic-summarizer-bart")
translate_ar_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-ar-en")
translate_en_to_ar = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ar")
qa_pipeline = pipeline("question-answering", model="deepset/bert-base-cased-squad2", tokenizer="deepset/bert-base-cased-squad2")
```
### 3. **Audio Processing**
Handles both audio and video files. If a video is uploaded, its audio track is extracted and converted into a compatible format before transcription.
```python
from moviepy.video.io.VideoFileClip import VideoFileClip
import librosa

def convert_audio_to_text(uploaded_file):
    if not uploaded_file:
        return "⛔ Please upload a file first"

    input_path = uploaded_file if isinstance(uploaded_file, str) else uploaded_file.name
    output_path = "/tmp/processed.wav"

    # Extract the audio track from video files; plain audio files are used as-is.
    if input_path.split('.')[-1].lower() in ['mp4', 'avi', 'mov', 'mkv']:
        VideoFileClip(input_path).audio.write_audiofile(output_path, codec='pcm_s16le')
    else:
        output_path = input_path

    # Resample to 16 kHz (the rate Whisper expects) and transcribe.
    audio, rate = librosa.load(output_path, sr=16000)
    return pipe({"raw": audio, "sampling_rate": rate})["text"]
```
### 4. **Summarization**
Summarizes the extracted text using the pre-trained BART model.
```python
def summarize_text(text):
    # The model must live on the same device as the inputs before generating.
    model = bart_model.to(device)
    inputs = bart_tokenizer(text, return_tensors="pt", max_length=1024, truncation=True).to(device)
    summary_ids = model.generate(inputs.input_ids, max_length=150, num_beams=4, early_stopping=True)
    return bart_tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```
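Note that the tokenizer truncates input at 1024 tokens, so very long transcripts lose their tail. One workaround, sketched below rather than taken from the project code (the 1000-character window and 100-character overlap are illustrative assumptions), is to split the text into overlapping chunks, summarize each, and join the partial summaries:

```python
def chunk_text(text, size=1000, overlap=100):
    """Split text into character windows with a small overlap,
    breaking at whitespace where possible so words stay intact."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)
    return chunks

def summarize_long(text, summarize, size=1000, overlap=100):
    """Summarize arbitrarily long text chunk by chunk.
    `summarize` is the single-chunk summarizer, e.g. summarize_text above."""
    return " ".join(summarize(chunk) for chunk in chunk_text(text, size, overlap))
```

Each chunk then fits comfortably inside the model's 1024-token limit, at the cost of some coherence across chunk boundaries.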
### 5. **Question Answering**
Users can ask a question about the extracted text, and the system finds the most relevant answer.
```python
def answer_question(text, question):
    # This flow assumes Arabic input: context and question are translated to
    # English for the SQuAD2 model, and the best answer is translated back.
    translated_context = translate_ar_to_en(text)[0]['translation_text']
    translated_question = translate_ar_to_en(question)[0]['translation_text']
    results = qa_pipeline({'question': translated_question, 'context': translated_context}, top_k=3)
    best_result = max(results, key=lambda res: res['score'])
    return translate_en_to_ar(best_result['answer'])[0]['translation_text']
```
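Since the flow above translates Arabic to English unconditionally, English input would be fed to the wrong translation model. A cheap guard, sketched here as an illustrative alternative to the project's LangDetect dependency, counts Arabic-block characters to decide whether the translation round-trip is needed:

```python
def is_arabic(text):
    """Heuristic: treat text as Arabic if letters from the Unicode
    Arabic block outnumber ASCII letters."""
    arabic = sum('\u0600' <= ch <= '\u06FF' for ch in text)
    latin = sum(ch.isascii() and ch.isalpha() for ch in text)
    return arabic > latin
```

When `is_arabic(text)` is false, the question and context could be passed to `qa_pipeline` directly, skipping the translation steps entirely.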
### 6. **Text-to-Speech**
Converts text into an audio file using gTTS.
```python
from gtts import gTTS
from langdetect import detect

def text_to_speech(text):
    tts = gTTS(text=text, lang='ar' if detect(text) == 'ar' else 'en', slow=False)
    output = "/tmp/tts.mp3"  # gTTS writes MP3 data, so use a matching extension
    tts.save(output)
    return output
```
## Installation & Setup
### Prerequisites
- Python 3.8+
- The pip package manager
- GPU (optional but recommended for better performance)

### Installation Steps
1. Clone the repository:
```sh
git clone https://github.com/your-repo/ai-audio-video-text.git
cd ai-audio-video-text
```
2. Install dependencies:
```sh
pip install -r requirements.txt
```
3. Run the application:
```sh
python main.py
```
## Contributing
Contributions are welcome! Feel free to fork the repository and submit pull requests.

## License
This project is licensed under the MIT License.

## Contact
For inquiries, contact Shathah2030@gmail.com.