Spaces:

ShivanshCodex
/

Web_Scrapping_Agent

Runtime error

File size: 2,840 Bytes

f085180

 ## 📈 Detailed Working of the Web Research Agent

The Web Research Agent follows a modular, multi-step pipeline that transforms a user query into a reliable, summarized report. It leverages Google Gemini for natural language processing, Google Custom Search for information retrieval, and BeautifulSoup for scraping.

---

### 🔹 Step-by-Step Flow

#### 1. **User Input**
- User enters a natural language research query in the Streamlit interface.
- Example: "India US trade deal 2025"

#### 2. **Query Analyzer (query_analyzer.py)**

- Gemini 2.5 Pro model is used to extract structured metadata from the input query.

- Returns:

  - **Intent** (e.g., news, opinion, analysis)

  - **Keywords** (extracted for search)

  - **Information Types** (e.g., statistics, policy summaries)

  - **Time Range** (e.g., "last year")



#### 3. **Google Search Tool (search_tool.py)**
- Uses the `GOOGLE_CSE_API_KEY` and `GOOGLE_CSE_CX` to perform a search via Google Custom Search API.
- Pulls top `n` results (default = 15) based on relevance to query keywords.
- Returns list of dictionaries with:
  - Title
  - URL
  - Snippet

#### 4. **Web Scraper Tool (scraper_tool.py)**

- Visits each URL and extracts readable `<p>` tags using BeautifulSoup.

- Clips long text to 5000 characters for optimal LLM processing.

- Returns:

  - Page content

  - URL



#### 5. **Content Analyzer (content_analyzer.py)**
- Uses Gemini 2.5 to summarize scraped content.
- Adds metadata such as:
  - **Summary** (in bullet points)
  - **Content Type** (e.g., "news report")
  - **Relevance Rating** (high, medium, low)

#### 6. **Synthesizer (synthesizer.py)**
- Receives all article summaries and the original query.
- Synthesizes content using Gemini with the following logic:
  - Group similar insights across sources.
  - Highlight contradictions.
  - End with a unified **"Final Takeaway"**.
- Returns the final report in Markdown format.

#### 7. **Streamlit Output (app.py)**
- Displays the following to the user:
  - Sidebar: Query analysis & article summaries.
  - Main view: Top links and final synthesized report.

---

### 🛠️ Tech Stack & Tools

| Component         | Technology/Library         |
|------------------|----------------------------|
| UI               | Streamlit                  |
| LLM              | Google Gemini 2.5 Pro      |
| Search API       | Google Custom Search (CSE) |
| Scraper          | BeautifulSoup, Requests    |
| Config Management| Python-dotenv              |

---

### 🚀 Example End-to-End Flow

**Query:** "Electric vehicle subsidies in Europe 2024"

**Result:**
- 15 relevant articles scraped
- 12 summaries processed
- Synthesized final markdown with policy trends, contradictions in subsidy effectiveness, and a closing insight.