	## 📈 Detailed Working of the Web Research Agent

	The Web Research Agent follows a modular, multi-step pipeline that transforms a user query into a reliable, summarized report. It leverages Google Gemini for natural language processing, Google Custom Search for information retrieval, and BeautifulSoup for scraping.

	---

	### 🔹 Step-by-Step Flow

	#### 1. User Input
	- User enters a natural language research query in the Streamlit interface.
	- Example: "India US trade deal 2025"

	#### 2. Query Analyzer (query_analyzer.py)
	- The Gemini 2.5 Pro model extracts structured metadata from the input query.
	- Returns:
	  - Intent (e.g., news, opinion, analysis)
	  - Keywords (extracted for search)
	  - Information Types (e.g., statistics, policy summaries)
	  - Time Range (e.g., "last year")
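The four fields above map naturally onto a small record type. The following is a minimal sketch, not the project's actual `query_analyzer.py`: the `QueryAnalysis` class, the prompt wording, and the `llm` callable (any function that takes a prompt string and returns the model's text reply, for example a thin wrapper around Gemini) are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class QueryAnalysis:
    """Hypothetical container for the metadata the analyzer returns."""
    intent: str
    keywords: List[str] = field(default_factory=list)
    information_types: List[str] = field(default_factory=list)
    time_range: str = ""


# Assumed prompt wording; the real prompt is not shown in the source.
PROMPT_TEMPLATE = (
    "Analyze the research query below and reply with JSON only, using the keys "
    '"intent", "keywords", "information_types", and "time_range".\n\n'
    "Query: {query}"
)


def analyze_query(query: str, llm: Callable[[str], str]) -> QueryAnalysis:
    """Ask the model for structured metadata and parse its JSON reply."""
    raw = llm(PROMPT_TEMPLATE.format(query=query)).strip()
    # Models often wrap JSON in a Markdown fence, so keep only the outermost object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("model reply did not contain a JSON object")
    data = json.loads(raw[start : end + 1])
    return QueryAnalysis(
        intent=str(data.get("intent", "")),
        keywords=[str(k) for k in data.get("keywords", [])],
        information_types=[str(t) for t in data.get("information_types", [])],
        time_range=str(data.get("time_range", "")),
    )
```

Passing the model in as a plain callable keeps the parsing logic testable without an API key.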

	#### 3. Google Search Tool (search_tool.py)
	- Uses `GOOGLE_CSE_API_KEY` and `GOOGLE_CSE_CX` to query the Google Custom Search API.
	- Pulls the top `n` results (default = 15) based on relevance to the query keywords.
	- Returns a list of dictionaries with:
	  - Title
	  - URL
	  - Snippet
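One detail worth spelling out: the Custom Search JSON API returns at most 10 results per request, so a default of 15 needs a second request with `start=11`. The sketch below shows one way to page through results and map them to the title/URL/snippet dictionaries. The function names and the injectable `get_json` parameter are illustrative assumptions, not the contents of `search_tool.py`.

```python
from typing import Callable, Dict, Iterator, List, Tuple

CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"
MAX_PER_REQUEST = 10  # the API caps `num` at 10 results per call


def page_plan(total: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, num) pairs covering `total` results; `start` is 1-based."""
    start = 1
    while total > 0:
        num = min(total, MAX_PER_REQUEST)
        yield start, num
        start += num
        total -= num


def default_get_json(url: str, params: Dict[str, object]) -> dict:
    """Fetch JSON with Requests (imported lazily so paging works without it)."""
    import requests

    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    return response.json()


def google_search(
    query: str,
    api_key: str,
    cx: str,
    n: int = 15,
    get_json: Callable[[str, Dict[str, object]], dict] = default_get_json,
) -> List[Dict[str, str]]:
    """Return up to `n` results as dicts with title, url, and snippet."""
    results: List[Dict[str, str]] = []
    for start, num in page_plan(n):
        params = {"key": api_key, "cx": cx, "q": query, "num": num, "start": start}
        items = get_json(CSE_ENDPOINT, params).get("items", [])
        for item in items:
            results.append(
                {
                    "title": item.get("title", ""),
                    "url": item.get("link", ""),
                    "snippet": item.get("snippet", ""),
                }
            )
        if len(items) < num:
            break  # the API ran out of results early
    return results[:n]
```

In the application, `api_key` and `cx` would come from `GOOGLE_CSE_API_KEY` and `GOOGLE_CSE_CX`.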

	#### 4. Web Scraper Tool (scraper_tool.py)
	- Visits each URL and extracts the text of readable `<p>` tags using BeautifulSoup.
	- Clips long text to 5000 characters to keep the LLM input manageable.
	- Returns:
	  - Page content
	  - URL
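The extract-then-clip step can be sketched as follows. The project uses BeautifulSoup; this sketch swaps in the standard library's `html.parser` so it runs without third-party packages, and the class and function names are hypothetical. With BeautifulSoup, the equivalent extraction would collect `p.get_text()` for each tag returned by `soup.find_all("p")`. Fetching the page itself is left out here.

```python
from html.parser import HTMLParser
from typing import Dict, List

MAX_CHARS = 5000  # clip limit taken from the notes above


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[str] = []
        self._buffer: List[str] = []
        self._in_p = False

    def _flush(self) -> None:
        # Join the collected fragments and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._in_p:  # a new <p> implicitly closes an open one
                self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)

    def finish(self) -> None:
        """Flush a trailing paragraph that was never closed."""
        if self._in_p:
            self._flush()
            self._in_p = False


def extract_page_text(html: str, url: str, max_chars: int = MAX_CHARS) -> Dict[str, str]:
    """Return the page's paragraph text, clipped to `max_chars`, with its URL."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.finish()
    content = "\n".join(parser.paragraphs)[:max_chars]
    return {"url": url, "content": content}
```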

	#### 5. Content Analyzer (content_analyzer.py)
	- Uses Gemini 2.5 Pro to summarize the scraped content.
	- Adds metadata such as:
	  - Summary (in bullet points)
	  - Content Type (e.g., "news report")
	  - Relevance Rating (high, medium, low)
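As a rough illustration of how a model reply could be turned into these three fields, here is a sketch that assumes the model is asked to answer in JSON. The `ArticleSummary` class, the prompt text, the `llm` callable, and the choice to fall back to "low" for an unrecognized rating are all assumptions, not behavior confirmed by `content_analyzer.py`.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

VALID_RELEVANCE = ("high", "medium", "low")


@dataclass
class ArticleSummary:
    """Hypothetical record for one analyzed article."""
    url: str
    summary: List[str]
    content_type: str
    relevance: str


def build_prompt(query: str, content: str) -> str:
    # Assumed prompt wording; the real prompt is not shown in the source.
    return (
        "Summarize the article for the research query and reply with JSON only, "
        'using the keys "summary" (bullet points), "content_type", and '
        '"relevance" (high, medium, or low).\n\n'
        f"Query: {query}\n\nArticle:\n{content}"
    )


def summarize_article(
    query: str, url: str, content: str, llm: Callable[[str], str]
) -> ArticleSummary:
    """Summarize one scraped page and normalize the model's metadata."""
    data = json.loads(llm(build_prompt(query, content)))

    # Accept either a JSON list of bullets or one newline-separated string.
    raw_summary = data.get("summary", [])
    if isinstance(raw_summary, str):
        raw_summary = raw_summary.splitlines()
    bullets = [str(line).strip().lstrip("-•* ").strip() for line in raw_summary]
    bullets = [b for b in bullets if b]

    relevance = str(data.get("relevance", "")).strip().lower()
    if relevance not in VALID_RELEVANCE:
        relevance = "low"  # assumption: treat unrecognized ratings as low

    return ArticleSummary(
        url=url,
        summary=bullets,
        content_type=str(data.get("content_type", "")),
        relevance=relevance,
    )
```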

	#### 6. Synthesizer (synthesizer.py)
	- Receives all article summaries and the original query.
	- Synthesizes content using Gemini with the following logic:
	  - Group similar insights across sources.
	  - Highlight contradictions.
	  - End with a unified "Final Takeaway".
	- Returns the final report in Markdown format.
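The synthesis logic above amounts to a prompt that carries the query, the per-article summaries, and the three instructions. A minimal sketch, assuming each summary is a dictionary with `url` and `summary` keys and that `llm` is any callable wrapping the Gemini call (both are illustrative choices, as is the fallback message for an empty input):

```python
from typing import Callable, Dict, List


def build_synthesis_prompt(query: str, summaries: List[Dict[str, str]]) -> str:
    """Combine the query and article summaries into one instruction prompt."""
    sources = "\n\n".join(
        f"Source {i} ({s['url']}):\n{s['summary']}"
        for i, s in enumerate(summaries, start=1)
    )
    return (
        f"Research query: {query}\n\n"
        f"Article summaries:\n{sources}\n\n"
        "Write a Markdown report that:\n"
        "1. Groups similar insights across sources.\n"
        "2. Highlights contradictions between sources.\n"
        '3. Ends with a section titled "Final Takeaway".'
    )


def synthesize(
    query: str, summaries: List[Dict[str, str]], llm: Callable[[str], str]
) -> str:
    """Return the final Markdown report for the query."""
    if not summaries:
        # Assumed fallback; the source does not describe the empty case.
        return "No usable sources were found for this query."
    return llm(build_synthesis_prompt(query, summaries)).strip()
```

Numbering the sources in the prompt makes it easier for the model to attribute contradictions to specific articles.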

	#### 7. Streamlit Output (app.py)
	- Displays the following to the user:
	  - Sidebar: Query analysis & article summaries.
	  - Main view: Top links and final synthesized report.

	---

	### 🛠️ Tech Stack & Tools

	| Component         | Technology/Library         |
	|-------------------|----------------------------|
	| UI                | Streamlit                  |
	| LLM               | Google Gemini 2.5 Pro      |
	| Search API        | Google Custom Search (CSE) |
	| Scraper           | BeautifulSoup, Requests    |
	| Config Management | Python-dotenv              |

	---

	### 🚀 Example End-to-End Flow

	Query: "Electric vehicle subsidies in Europe 2024"

	Result:
	- 15 relevant articles scraped
	- 12 summaries processed
	- A synthesized final Markdown report covering policy trends, contradictions in subsidy effectiveness, and a closing insight.
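To show how the stages chain together, here is a hedged orchestration sketch. Each stage is passed in as a callable, so the function does not depend on the real modules. The stage signatures, the dictionary keys, and the rule of skipping pages with no extracted text are assumptions of this sketch; the source does not say how the 15 scraped articles become 12 summaries.

```python
from typing import Callable, Dict, List


def run_pipeline(
    query: str,
    analyze: Callable[[str], Dict[str, object]],
    search: Callable[[List[str]], List[Dict[str, str]]],
    scrape: Callable[[str], Dict[str, str]],
    summarize: Callable[[str, Dict[str, str]], Dict[str, str]],
    synthesize: Callable[[str, List[Dict[str, str]]], str],
) -> Dict[str, object]:
    """Run query analysis, search, scraping, summarization, and synthesis in order."""
    analysis = analyze(query)
    results = search(analysis["keywords"])
    pages = [scrape(result["url"]) for result in results]
    # Assumption: pages with no extracted text are not sent to the LLM.
    summaries = [summarize(query, page) for page in pages if page["content"].strip()]
    report = synthesize(query, summaries)
    return {
        "analysis": analysis,
        "results": results,
        "summaries": summaries,
        "report": report,
    }
```

The returned dictionary holds everything the Streamlit view needs: the analysis and summaries for the sidebar, and the links and report for the main view.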