Web_Scrapping_Agent / docs /design_notes.md
ShivanshCodex's picture
Upload 47 files
f085180 verified

A newer version of the Streamlit SDK is available: 1.55.0

Upgrade

πŸ“ˆ Detailed Working of the Web Research Agent

The Web Research Agent follows a modular, multi-step pipeline that transforms a user query into a reliable, summarized report. It leverages Google Gemini for natural language processing, Google Custom Search for information retrieval, and BeautifulSoup for scraping.


πŸ”Ή Step-by-Step Flow

1. User Input

  • User enters a natural language research query in the Streamlit interface.
  • Example: "India US trade deal 2025"

2. Query Analyzer (query_analyzer.py)

  • Gemini 2.5 Pro model is used to extract structured metadata from the input query.
  • Returns:
    • Intent (e.g., news, opinion, analysis)
    • Keywords (extracted for search)
    • Information Types (e.g., statistics, policy summaries)
    • Time Range (e.g., "last year")

3. Google Search Tool (search_tool.py)

  • Uses the GOOGLE_CSE_API_KEY and GOOGLE_CSE_CX to perform a search via Google Custom Search API.
  • Pulls top n results (default = 15) based on relevance to query keywords.
  • Returns list of dictionaries with:
    • Title
    • URL
    • Snippet

4. Web Scraper Tool (scraper_tool.py)

  • Visits each URL and extracts readable <p> tags using BeautifulSoup.
  • Clips long text to 5000 characters for optimal LLM processing.
  • Returns:
    • Page content
    • URL

5. Content Analyzer (content_analyzer.py)

  • Uses Gemini 2.5 to summarize scraped content.
  • Adds metadata such as:
    • Summary (in bullet points)
    • Content Type (e.g., "news report")
    • Relevance Rating (high, medium, low)

6. Synthesizer (synthesizer.py)

  • Receives all article summaries and the original query.
  • Synthesizes content using Gemini with the following logic:
    • Group similar insights across sources.
    • Highlight contradictions.
    • End with a unified "Final Takeaway".
  • Returns the final report in Markdown format.

7. Streamlit Output (app.py)

  • Displays the following to the user:
    • Sidebar: Query analysis & article summaries.
    • Main view: Top links and final synthesized report.

πŸ› οΈ Tech Stack & Tools

Component Technology/Library
UI Streamlit
LLM Google Gemini 2.5 Pro
Search API Google Custom Search (CSE)
Scraper BeautifulSoup, Requests
Config Management Python-dotenv

πŸš€ Example End-to-End Flow

Query: "Electric vehicle subsidies in Europe 2024"

Result:

  • 15 relevant articles scraped
  • 12 summaries processed
  • Synthesized final markdown with policy trends, contradictions in subsidy effectiveness, and a closing insight.