Codex committed on
Commit
6a9bc08
·
1 Parent(s): 78bc895

Deploy text summarization app

Files changed (3)
  1. README.md +16 -10
  2. requirements.txt +15 -34
  3. src/streamlit_app.py +692 -38
README.md CHANGED
@@ -1,19 +1,25 @@
  ---
  title: Text Summarization
- emoji: 🚀
- colorFrom: red
- colorTo: red
  sdk: docker
  app_port: 8501
- tags:
- - streamlit
  pinned: false
- short_description: Summarize Text From PDF, YouTube, Website
  ---

- # Welcome to Streamlit!

- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:

- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
- forums](https://discuss.streamlit.io).
  ---
  title: Text Summarization
+ emoji: 📝
+ colorFrom: blue
+ colorTo: indigo
  sdk: docker
  app_port: 8501
  pinned: false
+ license: mit
+ short_description: Summarize YouTube videos, webpages, and uploaded documents with LangChain and Groq.
  ---

+ # Text Summarization

+ This Space runs a Streamlit app for summarizing:

+ - YouTube videos
+ - website URLs
+ - uploaded PDF, TXT, MD, CSV, and DOCX files
+
+ ## Required Secret
+
+ Add this secret in the Space settings:
+
+ - `GROQ_API_KEY`
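A minimal sketch of how an app could fail fast when this secret is missing. The helper name `require_secret` is hypothetical and not part of this commit, which reads the key with `os.getenv("GROQ_API_KEY", "")`.

```python
import os


def require_secret(name: str, env=None) -> str:
    """Return a non-empty secret from the environment or raise a clear error.

    `require_secret` is an illustrative helper, not part of the commit.
    """
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if not value:
        raise RuntimeError(f"Missing required secret: {name}")
    return value
```

Passing a plain dict as `env` keeps the check easy to exercise without touching the real process environment.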
requirements.txt CHANGED
@@ -1,35 +1,16 @@
- altair
- pandas
- streamlit
- langchain
- python-dotenv
- ipykernel
- langchain-community
- pypdf
- bs4
- arxiv
- pymupdf
- wikipedia
- langchain-text-splitters
- langchain-openai
- chromadb
- sentence_transformers
- langchain_huggingface
- faiss-cpu
- langchain_chroma
- duckdb
- pandas
- openai
- langchain-groq
- duckduckgo_search==5.3.1b1
- pymupdf
- arxiv
- wikipedia
- mysql-connector-python
- SQLAlchemy
  validators==0.28.1
- youtube_transcript_api
- unstructured
- pytube
- numexpr
- huggingface_hub
+ streamlit>=1.44.0
+ python-dotenv>=1.0.1
  validators==0.28.1
+ requests>=2.32.0
+ bs4>=0.0.2
+ pypdf>=6.0.0
+
+ langchain>=1.2.15
+ langchain-community>=0.4.1
+ langchain-classic>=1.0.4
+ langchain-groq>=1.1.2
+ langchain-text-splitters>=1.1.2
+
+ youtube-transcript-api>=1.2.4
+ unstructured>=0.22.22
+ pytube>=15.0.0
src/streamlit_app.py CHANGED
@@ -1,40 +1,694 @@
- import altair as alt
- import numpy as np
- import pandas as pd
  import streamlit as st

- """
- # Welcome to Streamlit!
-
- Edit `/streamlit_app.py` to customize this app to your heart's desire :heart:.
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
- forums](https://discuss.streamlit.io).
-
- In the meantime, below is an example of what you can do with just a few lines of code:
- """
-
- num_points = st.slider("Number of points in spiral", 1, 10000, 1100)
- num_turns = st.slider("Number of turns in spiral", 1, 300, 31)
-
- indices = np.linspace(0, 1, num_points)
- theta = 2 * np.pi * num_turns * indices
- radius = indices
-
- x = radius * np.cos(theta)
- y = radius * np.sin(theta)
-
- df = pd.DataFrame({
-     "x": x,
-     "y": y,
-     "idx": indices,
-     "rand": np.random.randn(num_points),
- })
-
- st.altair_chart(alt.Chart(df, height=700, width=700)
-     .mark_point(filled=True)
-     .encode(
-         x=alt.X("x", axis=None),
-         y=alt.Y("y", axis=None),
-         color=alt.Color("idx", legend=None, scale=alt.Scale()),
-         size=alt.Size("rand", legend=None, scale=alt.Scale(range=[1, 150])),
-     ))
 
+ import os
+ from io import BytesIO
+ from urllib.parse import urlparse
+ from xml.etree import ElementTree as ET
+ from zipfile import ZipFile
+
+ import requests
  import streamlit as st
+ import validators
+ from bs4 import BeautifulSoup
+ from dotenv import load_dotenv
+ from langchain_classic.chains.summarize import load_summarize_chain
+ from langchain_community.document_loaders import UnstructuredURLLoader, YoutubeLoader
+ from langchain_core.documents import Document
+ from langchain_core.prompts import PromptTemplate
+ from langchain_groq import ChatGroq
+ from langchain_text_splitters import RecursiveCharacterTextSplitter
+ from pypdf import PdfReader
+ from requests import RequestException
+ from youtube_transcript_api import YouTubeTranscriptApi
+
+
+ load_dotenv()
+
+ SAMPLE_YOUTUBE_URL = "https://youtu.be/ocBh08fjIfU"
+ LANGUAGE_OPTIONS = ["Original", "English", "Arabic", "French", "Bahasa Malay"]
+ LANGUAGE_CODE_MAP = {
+     "English": "en",
+     "Arabic": "ar",
+     "French": "fr",
+     "Bahasa Malay": "ms",
+ }
+ LANGUAGE_LABEL_MAP = {
+     "English": "English",
+     "Arabic": "Arabic",
+     "French": "French",
+     "Bahasa Malay": "Bahasa Melayu",
+ }
+
+ st.set_page_config(page_title="Summarize Text From PDF, YouTube, Website", page_icon="📝")
+ st.title("📝 Summarize Text From PDF, YouTube, Website")
+ st.subheader("Summarize URL")
+
+ st.markdown(
+     """
+     <style>
+     .source-section-label {
+         font-size: 1rem;
+         font-weight: 600;
+         margin-top: 0.35rem;
+         margin-bottom: 0.3rem;
+     }
+     </style>
+     """,
+     unsafe_allow_html=True,
+ )
+
+ groq_api_key = os.getenv("GROQ_API_KEY", "")
+
+ if "url_input" not in st.session_state:
+     st.session_state.url_input = ""
+ if "summary_word_limit" not in st.session_state:
+     st.session_state.summary_word_limit = 400
+ if "youtube_transcript_text" not in st.session_state:
+     st.session_state.youtube_transcript_text = ""
+ if "youtube_transcript_name" not in st.session_state:
+     st.session_state.youtube_transcript_name = "youtube_transcript.txt"
+ if "youtube_transcript_source_url" not in st.session_state:
+     st.session_state.youtube_transcript_source_url = ""
+ if "youtube_transcript_language_label" not in st.session_state:
+     st.session_state.youtube_transcript_language_label = "Original"
+
+ summary_language = "Original"
+ transcript_language = "Original"
+
+ with st.sidebar:
+     st.header("Options")
+     input_source_mode = st.radio(
+         "Content source",
+         options=["URL", "Upload documents", "Both"],
+         index=0,
+         help="Choose which source the app should use for summarization.",
+     )
+     summary_word_limit = st.slider(
+         "Summary word limit",
+         min_value=100,
+         max_value=1500,
+         step=50,
+         key="summary_word_limit",
+         help="Increase or decrease the target length of the summary.",
+     )
+     # summary_language = st.selectbox(
+     #     "Summary language",
+     #     options=LANGUAGE_OPTIONS,
+     #     index=0,
+     #     help="Choose the language for the generated summary. `Original` keeps the source language when possible.",
+     # )
+     # transcript_language = st.selectbox(
+     #     "Transcript language",
+     #     options=LANGUAGE_OPTIONS,
+     #     index=0,
+     #     help="Choose the language used for YouTube transcript fetching/export. `Original` keeps the available source transcript language.",
+     # )
+     selected_chain_type = st.radio(
+         "Summarization method",
+         options=["auto", "stuff", "map_reduce", "refine"],
+         index=0,
+         help="`auto` picks the best method based on content size and will upgrade if a simpler method is not a good fit.",
+     )
+     st.caption(
+         "`stuff` is fastest for short content, `map_reduce` is safer for long content, "
+         "and `refine` is useful when building a summary progressively across chunks."
+     )
+     st.caption(f"Sample YouTube URL: `{SAMPLE_YOUTUBE_URL}`")
+     if st.button("Use sample YouTube URL"):
+         st.session_state.url_input = SAMPLE_YOUTUBE_URL
+
+ generic_url = ""
+ uploaded_files = []
+
+ if input_source_mode in {"URL", "Both"}:
+     st.markdown('<div class="source-section-label">Summarize URL</div>', unsafe_allow_html=True)
+     generic_url = st.text_input(
+         "URL",
+         key="url_input",
+         label_visibility="collapsed",
+         placeholder=f"Paste a YouTube or website URL, or try {SAMPLE_YOUTUBE_URL}",
+         help="Enter the full YouTube or website URL you want to summarize.",
+     )
+
+ if input_source_mode in {"Upload documents", "Both"}:
+     st.markdown('<div class="source-section-label">Upload documents</div>', unsafe_allow_html=True)
+     uploaded_files = st.file_uploader(
+         "Upload documents",
+         type=["pdf", "txt", "md", "csv", "docx"],
+         accept_multiple_files=True,
+         label_visibility="collapsed",
+         help="Upload one or more documents. Supported formats: PDF, TXT, MD, CSV, DOCX.",
+     )
+     if uploaded_files:
+         st.caption(
+             "Uploaded files: " + ", ".join(uploaded_file.name for uploaded_file in uploaded_files)
+         )
+
+ llm = ChatGroq(model="llama-3.1-8b-instant", groq_api_key=groq_api_key)
+
+ REQUEST_HEADERS = {
+     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36",
+     "Accept-Language": "en-US,en;q=0.9",
+     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
+     "Referer": "https://www.google.com/",
+ }
+
+
+ def _is_youtube_url(url: str) -> bool:
+     host = urlparse(url).netloc.lower()
+     return "youtube.com" in host or "youtu.be" in host
+
+
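The substring test in `_is_youtube_url` also accepts hosts such as `notyoutube.com`. A stricter variant is sketched here, assuming that only `youtube.com`, its subdomains, and `youtu.be` should match; `is_youtube_host` and `YOUTUBE_DOMAINS` are hypothetical names that do not appear in the commit.

```python
from urllib.parse import urlparse

YOUTUBE_DOMAINS = ("youtube.com", "youtu.be")  # assumed allow-list


def is_youtube_host(url: str) -> bool:
    """Match the exact domain or any subdomain, never a bare substring."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in YOUTUBE_DOMAINS)
```

Using `hostname` rather than `netloc` also ignores any port or credentials embedded in the URL.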
+ def _summary_language_instruction(selected_language: str) -> str:
+     if selected_language == "Original":
+         return "Write the summary in the original language of the source content. If the source is mixed-language, use the dominant language."
+     return f"Write the summary in {LANGUAGE_LABEL_MAP[selected_language]}."
+
+
+ def _translation_language_instruction(selected_language: str) -> str:
+     if selected_language == "Original":
+         return "Keep the text in its original language."
+     return f"Translate the text into {LANGUAGE_LABEL_MAP[selected_language]}."
+
+
+ def _get_summary_prompts(word_limit: int, selected_language: str) -> dict[str, PromptTemplate]:
+     language_instruction = _summary_language_instruction(selected_language)
+     stuff_prompt = PromptTemplate(
+         template=(
+             f"Provide a clear summary of the following content in about {word_limit} words.\n"
+             "Focus on the main ideas, important details, and conclusions.\n"
+             f"{language_instruction}\n"
+             "Content:\n{text}"
+         ),
+         input_variables=["text"],
+     )
+     map_prompt = PromptTemplate(
+         template=(
+             "Write a concise summary of the following section.\n"
+             f"{language_instruction}\n"
+             "Content:\n{text}"
+         ),
+         input_variables=["text"],
+     )
+     combine_prompt = PromptTemplate(
+         template=(
+             f"Combine the following partial summaries into a final summary in about {word_limit} words.\n"
+             "Keep the result coherent, non-repetitive, and focused on the most important points.\n"
+             f"{language_instruction}\n"
+             "Partial summaries:\n{text}"
+         ),
+         input_variables=["text"],
+     )
+     refine_question_prompt = PromptTemplate(
+         template=(
+             f"Provide an initial summary of the following content in about {word_limit} words.\n"
+             f"{language_instruction}\n"
+             "Content:\n{text}"
+         ),
+         input_variables=["text"],
+     )
+     refine_prompt = PromptTemplate(
+         template=(
+             f"We already have an existing summary:\n{{existing_answer}}\n\n"
+             "Refine it using the additional content below.\n"
+             f"Keep the final summary close to {word_limit} words, avoid repetition, and preserve the most important details.\n"
+             f"{language_instruction}\n"
+             "Additional content:\n{text}"
+         ),
+         input_variables=["existing_answer", "text"],
+     )
+     return {
+         "stuff": stuff_prompt,
+         "map": map_prompt,
+         "combine": combine_prompt,
+         "refine_question": refine_question_prompt,
+         "refine": refine_prompt,
+     }
+
+
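The prompt builders mix f-strings with plain strings, so `{word_limit}` is filled in immediately while `{text}` and the doubled `{{existing_answer}}` survive as template placeholders. A small stdlib-only sketch of that behavior follows. It uses `str.format` as a stand-in for the template rendering step, which is an assumption made to keep the example dependency-free.

```python
word_limit = 400

template = (
    f"We already have an existing summary:\n{{existing_answer}}\n\n"
    f"Keep the final summary close to {word_limit} words.\n"
    "Additional content:\n{text}"
)

# After f-string evaluation only the two placeholders remain in the template.
rendered = template.format(existing_answer="Old summary.", text="New chunk.")
```

One consequence of this pattern is that any literal braces inside an interpolated value would be treated as placeholders later, so interpolated strings are best kept brace-free.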
+ def _extract_summary_text(result) -> str:
+     if isinstance(result, dict):
+         return result.get("output_text") or result.get("text") or str(result)
+     return str(result)
+
+
+ def _translate_documents_with_llm(docs: list[Document], target_language: str) -> list[Document]:
+     if target_language == "Original":
+         return docs
+
+     translation_prompt = PromptTemplate(
+         template=(
+             f"{_translation_language_instruction(target_language)}\n"
+             "Preserve the meaning faithfully. Do not summarize. Return only the translated text.\n"
+             "Text:\n{text}"
+         ),
+         input_variables=["text"],
+     )
+     translation_chain = load_summarize_chain(
+         llm,
+         chain_type="stuff",
+         prompt=translation_prompt,
+     )
+     splitter = RecursiveCharacterTextSplitter(chunk_size=2500, chunk_overlap=200)
+     translated_docs: list[Document] = []
+
+     for doc in docs:
+         chunks = splitter.split_documents([doc])
+         translated_chunks = []
+         for chunk in chunks:
+             translated_text = _extract_summary_text(
+                 translation_chain.invoke({"input_documents": [chunk]})
+             )
+             translated_chunks.append(translated_text.strip())
+
+         translated_docs.append(
+             Document(
+                 page_content="\n\n".join(part for part in translated_chunks if part),
+                 metadata={
+                     **doc.metadata,
+                     "translated_to": target_language,
+                 },
+             )
+         )
+
+     return translated_docs
+
+
+ def _resolve_transcript(video_id: str, selected_language: str):
+     api = YouTubeTranscriptApi()
+     transcript_list = api.list(video_id)
+     available_transcripts = list(transcript_list)
+
+     if selected_language == "Original":
+         if not available_transcripts:
+             raise ValueError("No transcript is available for this video.")
+         return available_transcripts[0], "Original"
+
+     if not available_transcripts:
+         raise ValueError("No transcript is available for this video.")
+
+     target_language_code = LANGUAGE_CODE_MAP[selected_language]
+     try:
+         return transcript_list.find_transcript([target_language_code]), selected_language
+     except Exception:
+         for base_transcript in available_transcripts:
+             if not base_transcript.is_translatable:
+                 continue
+             try:
+                 return base_transcript.translate(target_language_code), selected_language
+             except Exception:
+                 continue
+
+     available_languages = ", ".join(
+         sorted(
+             {
+                 f"{transcript.language} ({transcript.language_code})"
+                 for transcript in available_transcripts
+             }
+         )
+     )
+     raise ValueError(
+         f"Could not provide transcript in {selected_language}. "
+         f"Available transcript languages: {available_languages}"
+     )
+
+
+ def _load_youtube_documents(url: str, selected_language: str) -> list[Document]:
+     video_id = YoutubeLoader.extract_video_id(url)
+     should_translate_with_llm = False
+     try:
+         transcript, transcript_language_label = _resolve_transcript(video_id, selected_language)
+     except ValueError:
+         if selected_language == "Original":
+             raise
+         transcript, transcript_language_label = _resolve_transcript(video_id, "Original")
+         should_translate_with_llm = True
+
+     fetched_transcript = transcript.fetch()
+     transcript_text = " ".join(snippet.text.strip() for snippet in fetched_transcript if snippet.text.strip())
+     if not transcript_text:
+         raise ValueError("No transcript text could be extracted from this video.")
+
+     docs = [
+         Document(
+             page_content=transcript_text,
+             metadata={
+                 "source": url,
+                 "video_id": video_id,
+                 "language": fetched_transcript.language,
+                 "language_code": fetched_transcript.language_code,
+                 "is_generated": fetched_transcript.is_generated,
+                 "transcript_language_label": transcript_language_label,
+             },
+         )
+     ]
+
+     if should_translate_with_llm:
+         docs = _translate_documents_with_llm(docs, selected_language)
+         for doc in docs:
+             doc.metadata["transcript_language_label"] = f"{selected_language} (LLM translated)"
+
+     return docs
+
+
+ def _make_transcript_filename(url: str) -> str:
+     video_id = YoutubeLoader.extract_video_id(url)
+     return f"youtube_transcript_{video_id}.txt"
+
+
+ def _store_youtube_transcript(url: str, docs: list[Document]) -> None:
+     st.session_state.youtube_transcript_text = "\n\n".join(
+         doc.page_content for doc in docs if doc.page_content.strip()
+     )
+     st.session_state.youtube_transcript_name = _make_transcript_filename(url)
+     st.session_state.youtube_transcript_source_url = url
+     st.session_state.youtube_transcript_language_label = docs[0].metadata.get(
+         "transcript_language_label",
+         docs[0].metadata.get("language", "Original"),
+     )
+
+
+ def _has_meaningful_content(docs: list[Document], min_chars: int = 300) -> bool:
+     combined_text = " ".join(doc.page_content.strip() for doc in docs if doc.page_content.strip())
+     return len(combined_text) >= min_chars
+
+
+ def _extract_text_from_html(html: str) -> str:
+     soup = BeautifulSoup(html, "html.parser")
+
+     for tag in soup(["script", "style", "noscript", "svg"]):
+         tag.decompose()
+
+     meta_description = ""
+     meta_tag = soup.find("meta", attrs={"name": "description"})
+     if meta_tag and meta_tag.get("content"):
+         meta_description = meta_tag["content"].strip()
+
+     main_candidates = soup.select("main, article, [role='main'], .content, .article-body")
+     text_parts = []
+
+     for candidate in main_candidates:
+         candidate_text = " ".join(candidate.stripped_strings)
+         if len(candidate_text) > 200:
+             text_parts.append(candidate_text)
+
+     if not text_parts:
+         body_text = " ".join(soup.stripped_strings)
+         if body_text:
+             text_parts.append(body_text)
+
+     if meta_description:
+         text_parts.insert(0, meta_description)
+
+     return "\n\n".join(dict.fromkeys(part for part in text_parts if part))
+
+
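`dict.fromkeys` serves in `_extract_text_from_html` as an order-preserving de-duplication step, which matters when several selectors match the same element. A small illustration with made-up strings:

```python
parts = ["Meta description", "Article body", "", "Article body", "Footer"]

# Drop empty parts, then de-duplicate while keeping first-seen order.
unique_parts = list(dict.fromkeys(part for part in parts if part))
joined = "\n\n".join(unique_parts)
```

This only removes exact duplicates. If one matched element is nested inside another (for example an `article` inside `main`), the two extracted strings differ, so the inner text would still appear twice.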
+ def _load_web_documents(url: str) -> list[Document]:
+     try:
+         loader = UnstructuredURLLoader(
+             urls=[url],
+             ssl_verify=False,
+             headers=REQUEST_HEADERS,
+         )
+         docs = loader.load()
+         if _has_meaningful_content(docs):
+             return docs
+     except Exception as loader_error:
+         last_error = loader_error
+     else:
+         last_error = ValueError("Primary URL loader returned too little readable content.")
+
+     session = requests.Session()
+
+     for candidate_url in [url, url.rstrip("/")]:
+         if not candidate_url:
+             continue
+
+         try:
+             response = session.get(
+                 candidate_url,
+                 headers=REQUEST_HEADERS,
+                 timeout=20,
+                 verify=False,
+                 allow_redirects=True,
+             )
+             response.encoding = response.encoding or response.apparent_encoding or "utf-8"
+
+             if not response.text.strip():
+                 continue
+
+             text = _extract_text_from_html(response.text)
+             if not text or len(text) < 300:
+                 continue
+
+             soup = BeautifulSoup(response.text, "html.parser")
+             title = soup.title.string.strip() if soup.title and soup.title.string else candidate_url
+             st.info("Primary URL loader failed or returned too little content. Used HTML fallback extraction instead.")
+             return [
+                 Document(
+                     page_content=text,
+                     metadata={
+                         "source": candidate_url,
+                         "title": title,
+                         "http_status": response.status_code,
+                     },
+                 )
+             ]
+         except RequestException as request_error:
+             last_error = request_error
+
+     raise ValueError(
+         f"Could not load readable text from the URL. Last loader error: {last_error}"
+     )
+
+
+ def _load_uploaded_documents(files) -> list[Document]:
+     docs: list[Document] = []
+
+     for uploaded_file in files:
+         file_name = uploaded_file.name
+         extension = os.path.splitext(file_name)[1].lower()
+         file_bytes = uploaded_file.getvalue()
+
+         if extension == ".pdf":
+             reader = PdfReader(BytesIO(file_bytes))
+             pages = []
+             for page_number, page in enumerate(reader.pages, start=1):
+                 page_text = (page.extract_text() or "").strip()
+                 if page_text:
+                     pages.append(
+                         Document(
+                             page_content=page_text,
+                             metadata={
+                                 "source": file_name,
+                                 "page": page_number,
+                                 "type": "uploaded_file",
+                             },
+                         )
+                     )
+             docs.extend(pages)
+             continue
+
+         if extension in {".txt", ".md", ".csv"}:
+             text = file_bytes.decode("utf-8", errors="ignore").strip()
+             if text:
+                 docs.append(
+                     Document(
+                         page_content=text,
+                         metadata={"source": file_name, "type": "uploaded_file"},
+                     )
+                 )
+             continue
+
+         if extension == ".docx":
+             with ZipFile(BytesIO(file_bytes)) as docx_zip:
+                 document_xml = docx_zip.read("word/document.xml")
+             root = ET.fromstring(document_xml)
+             namespace = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}
+             paragraphs = []
+             for paragraph in root.findall(".//w:p", namespace):
+                 texts = [
+                     node.text
+                     for node in paragraph.findall(".//w:t", namespace)
+                     if node.text
+                 ]
+                 paragraph_text = "".join(texts).strip()
+                 if paragraph_text:
+                     paragraphs.append(paragraph_text)
+
+             text = "\n\n".join(paragraphs).strip()
+             if text:
+                 docs.append(
+                     Document(
+                         page_content=text,
+                         metadata={"source": file_name, "type": "uploaded_file"},
+                     )
+                 )
+             continue
+
+         raise ValueError(f"Unsupported file type: {file_name}")
+
+     return docs
+
+
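The DOCX branch reads `word/document.xml` straight from the archive instead of depending on a DOCX library. A self-contained sketch of the same idea follows, using a tiny in-memory archive for demonstration. The helper names `docx_paragraphs` and `make_sample_docx` and the sample XML are illustrative only, and a real DOCX file contains more parts than this.

```python
from io import BytesIO
from xml.etree import ElementTree as ET
from zipfile import ZipFile

W_NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}


def docx_paragraphs(file_bytes: bytes) -> list[str]:
    """Return non-empty paragraph texts from a DOCX byte string."""
    with ZipFile(BytesIO(file_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.findall(".//w:p", W_NS):
        # A paragraph's text is split across runs; join the w:t nodes in order.
        text = "".join(node.text or "" for node in paragraph.findall(".//w:t", W_NS)).strip()
        if text:
            paragraphs.append(text)
    return paragraphs


def make_sample_docx() -> bytes:
    """Build a minimal archive holding only word/document.xml (demo only)."""
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        "<w:body>"
        "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = BytesIO()
    with ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()
```

Text in headers, footers, and footnotes is stored in other archive parts, so this approach would not pick it up.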
+ def _build_chain(selected_chain_type: str):
+     prompts = _get_summary_prompts(summary_word_limit, summary_language)
+     if selected_chain_type == "stuff":
+         return load_summarize_chain(llm, chain_type="stuff", prompt=prompts["stuff"])
+     if selected_chain_type == "map_reduce":
+         return load_summarize_chain(
+             llm,
+             chain_type="map_reduce",
+             map_prompt=prompts["map"],
+             combine_prompt=prompts["combine"],
+         )
+     return load_summarize_chain(
+         llm,
+         chain_type="refine",
+         question_prompt=prompts["refine_question"],
+         refine_prompt=prompts["refine"],
+     )
+
+
+ def _prepare_summary_documents(docs: list[Document], selected_chain_type: str) -> list[Document]:
+     splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
+     split_docs = splitter.split_documents(docs)
+
+     if selected_chain_type == "stuff":
+         return split_docs[:3]
+     if selected_chain_type == "refine":
+         return split_docs[:10]
+     return split_docs[:8]
+
+
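The slices in `_prepare_summary_documents` cap how much text reaches the model: `stuff` sees at most 3 chunks, `map_reduce` 8, and `refine` 10, and anything beyond that is dropped without notice. A sketch of that cap, with the hypothetical names `CHUNK_CAPS` and `cap_chunks`, that also reports how many chunks were left out:

```python
CHUNK_CAPS = {"stuff": 3, "refine": 10, "map_reduce": 8}  # values from the commit


def cap_chunks(chunks: list[str], chain_type: str) -> tuple[list[str], int]:
    """Return the kept chunks and the number of chunks that were dropped."""
    # Unknown chain types fall back to the map_reduce cap, as in the commit.
    limit = CHUNK_CAPS.get(chain_type, CHUNK_CAPS["map_reduce"])
    kept = chunks[:limit]
    return kept, len(chunks) - len(kept)
```

Surfacing the dropped count to the user, for example with `st.warning`, would make the truncation visible for long inputs.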
+ def _choose_effective_chain_type(requested_chain_type: str, docs: list[Document]) -> tuple[str, str | None]:
+     splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
+     split_docs = splitter.split_documents(docs)
+     chunk_count = len(split_docs)
+     total_chars = sum(len(doc.page_content) for doc in split_docs)
+
+     if chunk_count <= 3 and total_chars <= 6000:
+         recommended = "stuff"
+     elif chunk_count <= 10:
+         recommended = "refine"
+     else:
+         recommended = "map_reduce"
+
+     if requested_chain_type == "auto":
+         return recommended, f"Auto-selected `{recommended}` based on content size."
+
+     if requested_chain_type == "stuff" and recommended != "stuff":
+         return recommended, f"Switched from `stuff` to `{recommended}` because the content is too large for a reliable single-pass summary."
+
+     if requested_chain_type == "refine" and chunk_count > 12:
+         return "map_reduce", "Switched from `refine` to `map_reduce` because the content is large enough that map-reduce is more reliable."
+
+     return requested_chain_type, None
+
+
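The selection rule in `_choose_effective_chain_type` can be read as a pure function of the chunk count and total character count. A sketch that mirrors the thresholds in the commit; `recommend_chain_type` is a hypothetical name:

```python
def recommend_chain_type(chunk_count: int, total_chars: int) -> str:
    """Mirror the size thresholds used for the `auto` mode."""
    if chunk_count <= 3 and total_chars <= 6000:
        return "stuff"
    if chunk_count <= 10:
        return "refine"
    return "map_reduce"
```

In the commit, an explicit request for `map_reduce` is always honored, `stuff` is upgraded whenever the recommendation differs, and `refine` switches to `map_reduce` only above 12 chunks.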
+ if input_source_mode in {"URL", "Both"} and _is_youtube_url(generic_url):
+     st.video(generic_url)
+
+     transcript_col, export_col = st.columns(2)
+     with transcript_col:
+         if st.button("Fetch transcript"):
+             if not generic_url.strip():
+                 st.error("Please enter a YouTube URL.")
+             elif not validators.url(generic_url):
+                 st.error("Please enter a valid YouTube URL.")
+             else:
+                 try:
+                     with st.spinner("Loading transcript..."):
+                         docs = _load_youtube_documents(generic_url, transcript_language)
+                     if not docs:
+                         st.error("No transcript could be extracted from the provided YouTube video.")
+                     else:
+                         _store_youtube_transcript(generic_url, docs)
+                         st.success(
+                             f"Transcript ready for export in {st.session_state.youtube_transcript_language_label}."
+                         )
+                 except Exception as transcript_err:
+                     st.error(f"Failed to load YouTube transcript: {transcript_err}")
+     with export_col:
+         if (
+             st.session_state.youtube_transcript_text
+             and st.session_state.youtube_transcript_source_url == generic_url
+         ):
+             st.caption(f"Prepared transcript: `{st.session_state.youtube_transcript_language_label}`")
+             st.download_button(
+                 "Export transcript",
+                 data=st.session_state.youtube_transcript_text,
+                 file_name=st.session_state.youtube_transcript_name,
+                 mime="text/plain",
+             )
+
+
+ if st.button("Summarize content"):
+     if not groq_api_key.strip():
+         st.error("`GROQ_API_KEY` is not set. Add it as a secret in the Space settings.")
+     elif input_source_mode == "URL" and not generic_url.strip():
+         st.error("Content source is `URL`, so please provide a URL.")
+     elif input_source_mode == "Upload documents" and not uploaded_files:
+         st.error("Content source is `Upload documents`, so please upload at least one file.")
+     elif input_source_mode == "Both" and (not generic_url.strip() or not uploaded_files):
+         st.error("Content source is `Both`, so please provide a URL and upload at least one file.")
+     elif generic_url.strip() and not validators.url(generic_url):
+         st.error("Please enter a valid URL when using the URL field.")
+     else:
+         try:
+             with st.spinner("Summarizing..."):
+                 docs: list[Document] = []
+
+                 if input_source_mode in {"URL", "Both"} and generic_url.strip():
+                     if _is_youtube_url(generic_url):
+                         try:
+                             url_docs = _load_youtube_documents(generic_url, transcript_language)
+                             _store_youtube_transcript(generic_url, url_docs)
+                         except Exception as load_err:
+                             st.error(f"Failed to load YouTube transcript: {load_err}")
+                             st.stop()
+                     else:
+                         st.session_state.youtube_transcript_text = ""
+                         st.session_state.youtube_transcript_name = "youtube_transcript.txt"
+                         st.session_state.youtube_transcript_source_url = ""
+                         try:
+                             url_docs = _load_web_documents(generic_url)
+                         except Exception as load_err:
+                             st.error(f"Failed to fetch URL content: {load_err}")
+                             st.stop()
+
+                     docs.extend(url_docs)
+                 else:
+                     st.session_state.youtube_transcript_text = ""
+                     st.session_state.youtube_transcript_name = "youtube_transcript.txt"
+                     st.session_state.youtube_transcript_source_url = ""
+
+                 if input_source_mode in {"Upload documents", "Both"} and uploaded_files:
+                     try:
+                         uploaded_docs = _load_uploaded_documents(uploaded_files)
+                     except Exception as load_err:
+                         st.error(f"Failed to read uploaded document(s): {load_err}")
+                         st.stop()
+                     docs.extend(uploaded_docs)
+
+                 if input_source_mode == "Both" and generic_url.strip() and uploaded_files:
+                     st.info("Summarizing combined content from the URL and uploaded documents.")
+
+                 if not docs:
+                     st.error("No content could be extracted from the selected source.")
+                     st.stop()
+
+                 effective_chain_type, chain_message = _choose_effective_chain_type(
+                     selected_chain_type,
+                     docs,
+                 )
+                 if chain_message:
+                     st.info(chain_message)
+
+                 docs_for_summary = _prepare_summary_documents(docs, effective_chain_type)
+                 chain = _build_chain(effective_chain_type)
+                 output_summary = _extract_summary_text(
+                     chain.invoke({"input_documents": docs_for_summary})
+                 )

+                 st.success(output_summary)
+         except Exception as e:
+             st.error(f"Summarization failed: {e}")