{ "cells": [ { "cell_type": "markdown", "id": "1d259f9c", "metadata": { "papermill": { "duration": 0.007523, "end_time": "2025-04-21T04:41:26.513351", "exception": false, "start_time": "2025-04-21T04:41:26.505828", "status": "completed" }, "tags": [] }, "source": [ "# AI-Powered News Digest Dynamo 📰✨\n", "## Tame the News Flood with AI!\n", "## [Hugging Face Space Demo of AI-Powered News Digest Dynamo 🔗🤗](https://huggingface.co/spaces/Kakaarot/AI-Powered-News-Digest-Dynamo)" ] }, { "cell_type": "markdown", "id": "67d8177b", "metadata": { "papermill": { "duration": 0.005606, "end_time": "2025-04-21T04:41:26.525238", "exception": false, "start_time": "2025-04-21T04:41:26.519632", "status": "completed" }, "tags": [] }, "source": [ "## Motivation\n", "In an era of information overload, staying updated with relevant news is challenging. This AI-Powered News Summarization System fetches, analyzes, and summarizes recent news articles on user-specified topics, delivering concise, structured insights. By leveraging advanced Gen AI, it empowers users to stay informed efficiently, addressing the real-world problem of navigating vast news streams." ] }, { "cell_type": "markdown", "id": "34304120", "metadata": { "papermill": { "duration": 0.005746, "end_time": "2025-04-21T04:41:26.538408", "exception": false, "start_time": "2025-04-21T04:41:26.532662", "status": "completed" }, "tags": [] }, "source": [ "### What is it?\n", "\n", "My capstone project is an AI-powered News Summarization System that fetches and analyzes recent news articles on user-specified topics, then generates concise, structured summaries. 
\n", "\n", "### Problem Statement\n", "\n", "This project addresses the problem of information overload by using advanced AI techniques to extract and present key information from multiple news sources in a unified, digestible format.\n", "\n", "### System Workflow\n", "The system:\n", "- Fetches relevant news articles via NewsAPI based on user topics.\n", "- Processes and chunks articles for embedding.\n", "- Creates vector representations using SentenceTransformer.\n", "- Builds a searchable FAISS index for semantic retrieval.\n", "- Retrieves the most relevant information for a query.\n", "- Generates a structured JSON summary with key points and sources.\n", "\n", "### GenAI Capabilities Demonstrated\n", "In my Kaggle notebook, I demonstrated the following GenAI capabilities:\n", "\n", "1. **Retrieval Augmented Generation (RAG)** - The system implements a complete RAG pipeline that first retrieves relevant information from the news articles before generating summaries. This ensures the summaries are based on actual, recent news content rather than the model's internal knowledge.\n", "\n", "2. **Document Understanding** - The system analyzes multiple news articles, extracts their key information, and synthesizes a coherent summary that captures the most important points across sources. The code demonstrates sophisticated document comprehension capabilities through the Gemini model.\n", "\n", "3. **Structured Output/JSON mode** - I implemented controlled generation by using Gemini's response_mime_type=\"application/json\" parameter and a defined JSON schema to enforce a consistent output structure. The system produces well-formatted JSON with predefined fields for topic, summary points, and sources.\n", "\n", "4. 
**Embeddings** - The project uses SentenceTransformer to create dense vector embeddings of text chunks, enabling semantic understanding and comparison of content.\n", "\n", "5. **Vector Search/Vector Store** - I implemented a vector database using FAISS to index and efficiently search through text embeddings, allowing the system to find the most semantically relevant information based on user queries." ] }, { "cell_type": "code", "execution_count": 1, "id": "4ec4f7e4", "metadata": { "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5", "execution": { "iopub.execute_input": "2025-04-21T04:41:26.551425Z", "iopub.status.busy": "2025-04-21T04:41:26.551117Z", "iopub.status.idle": "2025-04-21T04:41:28.723662Z", "shell.execute_reply": "2025-04-21T04:41:28.722602Z" }, "papermill": { "duration": 2.181015, "end_time": "2025-04-21T04:41:28.725424", "exception": false, "start_time": "2025-04-21T04:41:26.544409", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "# This Python 3 environment comes with many helpful analytics libraries installed\n", "# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\n", "# For example, here's several helpful packages to load\n", "\n", "import numpy as np # linear algebra\n", "import pandas as pd # data processing, CSV file I/O (e.g. 
pd.read_csv)\n", "\n", "# Input data files are available in the read-only \"../input/\" directory\n", "# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n", "\n", "import os\n", "for dirname, _, filenames in os.walk('/kaggle/input'):\n", " for filename in filenames:\n", " print(os.path.join(dirname, filename))\n", " \n", "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"\" # Disable GPU to avoid CUDA conflicts\n", "\n", "# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using \"Save & Run All\" \n", "# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session" ] }, { "cell_type": "markdown", "id": "7862395d", "metadata": { "papermill": { "duration": 0.00562, "end_time": "2025-04-21T04:41:28.737079", "exception": false, "start_time": "2025-04-21T04:41:28.731459", "status": "completed" }, "tags": [] }, "source": [ "This notebook demonstrates a comprehensive news summarization system using Retrieval Augmented Generation (RAG), Document Understanding, and Structured Output capabilities. The system fetches recent news articles based on user topics, processes them using embeddings and vector search, and generates concise, structured summaries." ] }, { "cell_type": "markdown", "id": "d9a651c0", "metadata": { "papermill": { "duration": 0.005517, "end_time": "2025-04-21T04:41:28.748451", "exception": false, "start_time": "2025-04-21T04:41:28.742934", "status": "completed" }, "tags": [] }, "source": [ "## 1. 
Setup and Installation" ] }, { "cell_type": "markdown", "id": "e2481cd3", "metadata": { "papermill": { "duration": 0.005354, "end_time": "2025-04-21T04:41:28.759588", "exception": false, "start_time": "2025-04-21T04:41:28.754234", "status": "completed" }, "tags": [] }, "source": [ "Installs libraries:\n", "- `newsapi-python`: Fetches news articles.\n", "- `sentence-transformers`: Generates text embeddings.\n", "- `faiss-cpu`: Enables vector search.\n", "- `google-generativeai`: Powers Gemini for summarization." ] }, { "cell_type": "code", "execution_count": 2, "id": "7e39a0dc", "metadata": { "execution": { "iopub.execute_input": "2025-04-21T04:41:28.772261Z", "iopub.status.busy": "2025-04-21T04:41:28.771814Z", "iopub.status.idle": "2025-04-21T04:43:15.856098Z", "shell.execute_reply": "2025-04-21T04:43:15.854685Z" }, "papermill": { "duration": 107.093257, "end_time": "2025-04-21T04:43:15.858526", "exception": false, "start_time": "2025-04-21T04:41:28.765269", "status": "completed" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m30.7/30.7 MB\u001b[0m \u001b[31m48.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m363.4/363.4 MB\u001b[0m \u001b[31m4.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m664.8/664.8 MB\u001b[0m \u001b[31m1.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m211.5/211.5 MB\u001b[0m \u001b[31m2.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m56.3/56.3 MB\u001b[0m \u001b[31m28.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m 
\u001b[32m127.9/127.9 MB\u001b[0m \u001b[31m11.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m207.5/207.5 MB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m21.1/21.1 MB\u001b[0m \u001b[31m31.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", "\u001b[?25h\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\r\n", "pylibcugraph-cu12 24.12.0 requires pylibraft-cu12==24.12.*, but you have pylibraft-cu12 25.2.0 which is incompatible.\r\n", "pylibcugraph-cu12 24.12.0 requires rmm-cu12==24.12.*, but you have rmm-cu12 25.2.0 which is incompatible.\u001b[0m\u001b[31m\r\n", "\u001b[0m" ] } ], "source": [ "!pip install newsapi-python sentence-transformers faiss-cpu google-generativeai --quiet\n" ] }, { "cell_type": "markdown", "id": "a197c5b6", "metadata": { "papermill": { "duration": 0.034548, "end_time": "2025-04-21T04:43:15.929006", "exception": false, "start_time": "2025-04-21T04:43:15.894458", "status": "completed" }, "tags": [] }, "source": [ "let's import the required libraries and configure our API keys:" ] }, { "cell_type": "code", "execution_count": 3, "id": "80c7f50d", "metadata": { "execution": { "iopub.execute_input": "2025-04-21T04:43:16.108612Z", "iopub.status.busy": "2025-04-21T04:43:16.108198Z", "iopub.status.idle": "2025-04-21T04:43:56.712868Z", "shell.execute_reply": "2025-04-21T04:43:56.712107Z" }, "papermill": { "duration": 40.750363, "end_time": "2025-04-21T04:43:56.714492", "exception": false, "start_time": "2025-04-21T04:43:15.964129", "status": "completed" }, "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-04-21 04:43:39.120223: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] 
Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n", "E0000 00:00:1745210619.398532 13 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", "E0000 00:00:1745210619.474587 13 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n" ] } ], "source": [ "import os\n", "import json\n", "import faiss\n", "import numpy as np\n", "import google.generativeai as genai\n", "from newsapi import NewsApiClient\n", "from sentence_transformers import SentenceTransformer\n", "from typing import List, Dict, Any, Optional\n", "\n", "# For Kaggle environment, use Kaggle secrets\n", "from kaggle_secrets import UserSecretsClient\n", "user_secrets = UserSecretsClient()\n", "secret_value_0 = user_secrets.get_secret(\"GOOGLE_API_KEY\")\n", "secret_value_1 = user_secrets.get_secret(\"NEWS_API_KEY\")\n", "\n", "GOOGLE_API_KEY = secret_value_0\n", "NEWS_API_KEY = secret_value_1\n", "\n", "# Validate API keys\n", "if not NEWS_API_KEY:\n", " raise ValueError(\"Please set the NEWS_API_KEY in Kaggle secrets.\")\n", "if not GOOGLE_API_KEY:\n", " raise ValueError(\"Please set the GOOGLE_API_KEY in Kaggle secrets.\")\n", "\n", "# Configure Google Generative AI\n", "genai.configure(api_key=GOOGLE_API_KEY)\n", "\n", "# Constants\n", "EMBEDDING_MODEL_NAME = 'all-MiniLM-L6-v2' # Lightweight, efficient embedding model\n", "LLM_MODEL_NAME = 'gemini-1.5-flash' # Efficient Gemini model\n", "MAX_ARTICLES_TO_FETCH = 10 # Max news articles to retrieve initially\n", "MAX_ARTICLES_TO_PROCESS = 5 # Max articles to use for context\n", "CHUNK_SIZE = 500 # Characters per text chunk\n", "TOP_K_CHUNKS = 3 # Number of relevant chunks to retrieve\n" ] }, { "cell_type": "markdown", "id": 
"fb50d536", "metadata": { "papermill": { "duration": 0.033945, "end_time": "2025-04-21T04:43:56.783313", "exception": false, "start_time": "2025-04-21T04:43:56.749368", "status": "completed" }, "tags": [] }, "source": [ "## 2. News Retrieval Function\n", "\n", "The first component of our system fetches news articles related to a user's topic of interest:" ] }, { "cell_type": "code", "execution_count": 4, "id": "cf873c1e", "metadata": { "execution": { "iopub.execute_input": "2025-04-21T04:43:56.856380Z", "iopub.status.busy": "2025-04-21T04:43:56.855098Z", "iopub.status.idle": "2025-04-21T04:43:56.863197Z", "shell.execute_reply": "2025-04-21T04:43:56.862110Z" }, "papermill": { "duration": 0.046252, "end_time": "2025-04-21T04:43:56.864900", "exception": false, "start_time": "2025-04-21T04:43:56.818648", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "def fetch_news(topic: str) -> List[Dict[str, Any]]:\n", " \"\"\"Fetches recent news articles for a given topic using NewsAPI.\"\"\"\n", " print(f\"Fetching news for topic: {topic}...\")\n", " try:\n", " newsapi = NewsApiClient(api_key=NEWS_API_KEY)\n", " # Fetch articles related to the topic\n", " news_response = newsapi.get_everything(\n", " q=topic,\n", " language='en',\n", " sort_by='relevancy', # Can also use 'publishedAt' for recency\n", " page_size=MAX_ARTICLES_TO_FETCH\n", " )\n", "\n", " articles = news_response.get('articles', [])\n", " # Filter out articles with no content\n", " valid_articles = [\n", " {\n", " \"title\": article.get(\"title\"),\n", " \"content\": article.get(\"content\") or article.get(\"description\", \"\"),\n", " \"url\": article.get(\"url\")\n", " }\n", " for article in articles if article.get(\"content\") or article.get(\"description\")\n", " ]\n", " print(f\"Fetched {len(valid_articles)} valid articles.\")\n", " # Limit the number of articles to process further\n", " return valid_articles[:MAX_ARTICLES_TO_PROCESS]\n", " except Exception as e:\n", " print(f\"Error 
fetching news: {e}\")\n", " return []\n" ] }, { "cell_type": "markdown", "id": "d51ce4d8", "metadata": { "papermill": { "duration": 0.034714, "end_time": "2025-04-21T04:43:56.934158", "exception": false, "start_time": "2025-04-21T04:43:56.899444", "status": "completed" }, "tags": [] }, "source": [ "## 3. Text Processing Functions" ] }, { "cell_type": "markdown", "id": "983cc16d", "metadata": { "papermill": { "duration": 0.033888, "end_time": "2025-04-21T04:43:57.002418", "exception": false, "start_time": "2025-04-21T04:43:56.968530", "status": "completed" }, "tags": [] }, "source": [ "Next, we need functions to process the text for our RAG system:" ] }, { "cell_type": "code", "execution_count": 5, "id": "434a6447", "metadata": { "execution": { "iopub.execute_input": "2025-04-21T04:43:57.072198Z", "iopub.status.busy": "2025-04-21T04:43:57.071846Z", "iopub.status.idle": "2025-04-21T04:43:57.081202Z", "shell.execute_reply": "2025-04-21T04:43:57.080172Z" }, "papermill": { "duration": 0.046113, "end_time": "2025-04-21T04:43:57.082691", "exception": false, "start_time": "2025-04-21T04:43:57.036578", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "def chunk_text(text: str, size: int) -> List[str]:\n", " \"\"\"Splits text into chunks of approximate size, trying to preserve sentence boundaries.\"\"\"\n", " chunks = []\n", " start = 0\n", " while start < len(text):\n", " end = start + size\n", " # Try to end chunk at a sentence boundary if possible\n", " pos = text.rfind('.', start, end + 50) # Look a bit beyond chunk size\n", " if pos != -1 and pos > start + size // 2: # Ensure boundary is reasonably close\n", " end = pos + 1\n", " chunks.append(text[start:end].strip())\n", " start = end\n", " return [chunk for chunk in chunks if chunk] # Remove empty chunks\n", "\n", "def build_vector_store(articles: List[Dict[str, Any]], model: SentenceTransformer):\n", " \"\"\"Creates embeddings and builds an in-memory FAISS index.\"\"\"\n", " print(\"Building vector 
store...\")\n", " all_chunks = []\n", " metadata = [] # Store corresponding article info for each chunk\n", "\n", " for i, article in enumerate(articles):\n", " if article['content']:\n", " chunks = chunk_text(article['content'], CHUNK_SIZE)\n", " for chunk in chunks:\n", " all_chunks.append(chunk)\n", " metadata.append({\"article_index\": i, \"url\": article['url'], \"title\": article['title']})\n", "\n", " if not all_chunks:\n", " print(\"No text content found to build vector store.\")\n", " return None, [], []\n", "\n", " print(f\"Generated {len(all_chunks)} chunks from {len(articles)} articles.\")\n", " print(\"Creating embeddings...\")\n", " embeddings = model.encode(all_chunks, show_progress_bar=True)\n", "\n", " dimension = embeddings.shape[1]\n", " index = faiss.IndexFlatL2(dimension) # Using L2 distance for similarity\n", " index.add(np.array(embeddings).astype('float32')) # Add embeddings to FAISS index\n", "\n", " print(\"Vector store built successfully.\")\n", " return index, all_chunks, metadata\n" ] }, { "cell_type": "markdown", "id": "cb2cf577", "metadata": { "papermill": { "duration": 0.034127, "end_time": "2025-04-21T04:43:57.151356", "exception": false, "start_time": "2025-04-21T04:43:57.117229", "status": "completed" }, "tags": [] }, "source": [ "## 4. 
Retrieval Function (RAG)" ] }, { "cell_type": "markdown", "id": "29b7ce50", "metadata": { "papermill": { "duration": 0.033972, "end_time": "2025-04-21T04:43:57.219874", "exception": false, "start_time": "2025-04-21T04:43:57.185902", "status": "completed" }, "tags": [] }, "source": [ "This function performs the retrieval part of RAG, finding the most relevant chunks for a query:" ] }, { "cell_type": "code", "execution_count": 6, "id": "c3abde3d", "metadata": { "execution": { "iopub.execute_input": "2025-04-21T04:43:57.290168Z", "iopub.status.busy": "2025-04-21T04:43:57.289484Z", "iopub.status.idle": "2025-04-21T04:43:57.296824Z", "shell.execute_reply": "2025-04-21T04:43:57.296020Z" }, "papermill": { "duration": 0.044209, "end_time": "2025-04-21T04:43:57.298290", "exception": false, "start_time": "2025-04-21T04:43:57.254081", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "def retrieve_context(query: str, index: faiss.Index, chunks: List[str], metadata: List[Dict], model: SentenceTransformer, top_k: int) -> str:\n", " \"\"\"Retrieves the most relevant text chunks based on the query.\"\"\"\n", " if index is None or index.ntotal == 0:\n", " return \"No relevant context found (vector store is empty).\"\n", "\n", " print(f\"Retrieving top {top_k} relevant chunks for query: '{query}'...\")\n", " query_embedding = model.encode([query], show_progress_bar=False)\n", " query_embedding_np = np.array(query_embedding).astype('float32')\n", "\n", " # Search the index\n", " distances, indices = index.search(query_embedding_np, top_k)\n", "\n", " # Gather the relevant chunks and source info\n", " context_parts = []\n", " seen_urls = set()\n", " for idx in indices[0]:\n", " if 0 <= idx < len(chunks): # Check for valid index\n", " chunk = chunks[idx]\n", " meta = metadata[idx]\n", " source_info = f\"(Source: {meta['url']})\"\n", " if meta['url'] not in seen_urls:\n", " # Add title for first mention of a source URL\n", " context_parts.append(f\"From '{meta['title']}':\\n{chunk}\\n{source_info}\")\n", " seen_urls.add(meta['url'])\n", " else:\n", " # Avoid repeating title for subsequent chunks from same source\n", " context_parts.append(f\"{chunk}\\n{source_info}\")\n", "\n", " if not context_parts:\n", " return \"No relevant context found matching the query.\"\n", "\n", " print(\"Retrieved relevant context.\")\n", " return \"\\n\\n\".join(context_parts)\n" ] }, { "cell_type": "markdown", "id": "e7d64cca", "metadata": { "papermill": { "duration": 0.033939, "end_time": "2025-04-21T04:43:57.367231", "exception": false, "start_time": "2025-04-21T04:43:57.333292", "status": "completed" }, "tags": [] }, "source": [ "## 5. Summary Generation Function (Document Understanding + Structured Output)" ] }, { "cell_type": "markdown", "id": "bb401c32", "metadata": { "papermill": { "duration": 0.03634, "end_time": "2025-04-21T04:43:57.438043", "exception": false, "start_time": "2025-04-21T04:43:57.401703", "status": "completed" }, "tags": [] }, "source": [ "This function uses the LLM to understand the retrieved documents and generate a structured summary:" ] }, { "cell_type": "code", "execution_count": 7, "id": "c082b8bc", "metadata": { "execution": { "iopub.execute_input": "2025-04-21T04:43:57.508641Z", "iopub.status.busy": "2025-04-21T04:43:57.508296Z", "iopub.status.idle": "2025-04-21T04:43:57.516407Z", "shell.execute_reply": "2025-04-21T04:43:57.515409Z" }, "papermill": { "duration": 0.045708, "end_time": "2025-04-21T04:43:57.518305", "exception": false, "start_time": "2025-04-21T04:43:57.472597", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "def generate_structured_summary(context: str, topic: str) -> Optional[Dict[str, Any]]:\n", " \"\"\"Generates a summary using an LLM with structured output request.\"\"\"\n", " print(\"Generating structured summary with LLM...\")\n", " model = genai.GenerativeModel(LLM_MODEL_NAME)\n", "\n", " # Define the desired JSON 
structure\n", " json_schema = {\n", " \"type\": \"object\",\n", " \"properties\": {\n", " \"topic\": {\"type\": \"string\", \"description\": \"The main topic summarized.\"},\n", " \"summary_points\": {\n", " \"type\": \"array\",\n", " \"items\": {\"type\": \"string\"},\n", " \"description\": \"A list of key points summarizing the context.\"\n", " },\n", " \"mentioned_sources\": {\n", " \"type\": \"array\",\n", " \"items\": {\"type\": \"string\", \"format\": \"uri\"},\n", " \"description\": \"List of unique source URLs mentioned in the context provided.\"\n", " }\n", " },\n", " \"required\": [\"topic\", \"summary_points\", \"mentioned_sources\"]\n", " }\n", "\n", " prompt = f\"\"\"\n", " Based ONLY on the following retrieved context about '{topic}', provide a concise summary.\n", " Extract the key points and list the unique source URLs mentioned.\n", " Respond ONLY with a valid JSON object matching the following schema:\n", "\n", " Schema:\n", " {json.dumps(json_schema, indent=2)}\n", "\n", " Retrieved Context:\n", " ---\n", " {context}\n", " ---\n", "\n", " JSON Output:\n", " \"\"\"\n", "\n", " try:\n", " response = model.generate_content(\n", " prompt,\n", " generation_config=genai.types.GenerationConfig(\n", " response_mime_type=\"application/json\" # Request JSON output\n", " )\n", " )\n", " # The API should return validated JSON directly when using response_mime_type\n", " summary_json = json.loads(response.text)\n", " print(\"LLM generation successful.\")\n", " return summary_json\n", " except Exception as e:\n", " print(f\"Error during LLM generation or JSON parsing: {e}\")\n", " # Attempt to see if there's partial text in case of error\n", " try:\n", " print(f\"LLM Raw Response Text (if available): {response.text}\")\n", " except Exception:\n", " pass # response may be unbound if generate_content itself failed\n", " return None\n" ] }, { "cell_type": "markdown", "id": "4f6c9958", "metadata": { "papermill": { "duration": 0.046196, "end_time": "2025-04-21T04:43:57.600401", "exception": false, "start_time": 
"2025-04-21T04:43:57.554205", "status": "completed" }, "tags": [] }, "source": [ "## 6. Main Execution Function" ] }, { "cell_type": "markdown", "id": "4e0bc6fb", "metadata": { "papermill": { "duration": 0.034731, "end_time": "2025-04-21T04:43:57.669896", "exception": false, "start_time": "2025-04-21T04:43:57.635165", "status": "completed" }, "tags": [] }, "source": [ "Now, let's put everything together in a main function:" ] }, { "cell_type": "code", "execution_count": 8, "id": "8ea56020", "metadata": { "execution": { "iopub.execute_input": "2025-04-21T04:43:57.741520Z", "iopub.status.busy": "2025-04-21T04:43:57.741132Z", "iopub.status.idle": "2025-04-21T04:44:07.169504Z", "shell.execute_reply": "2025-04-21T04:44:07.168351Z" }, "papermill": { "duration": 9.466481, "end_time": "2025-04-21T04:44:07.171100", "exception": false, "start_time": "2025-04-21T04:43:57.704619", "status": "completed" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading embedding model...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "774a04f759fd44bab64421792efa5ae2", "version_major": 2, "version_minor": 0 }, "text/plain": [ "modules.json: 0%| | 0.00/349 [00:00