## 📈 Detailed Working of the Web Research Agent

The Web Research Agent runs a modular, multi-step pipeline that turns a user query into a summarized report. It uses Google Gemini for query analysis, summarization, and synthesis; the Google Custom Search API for information retrieval; and BeautifulSoup for scraping.

---

### 🔹 Step-by-Step Flow

#### 1. **User Input**
- User enters a natural language research query in the Streamlit interface.
- Example: "India US trade deal 2025"

#### 2. **Query Analyzer (query_analyzer.py)**

- The Gemini 2.5 Pro model extracts structured metadata from the input query.
- Returns:
  - **Intent** (e.g., news, opinion, analysis)
  - **Keywords** (extracted for search)
  - **Information Types** (e.g., statistics, policy summaries)
  - **Time Range** (e.g., "last year")
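
One way to hold the analyzer's output is a small dataclass filled from a JSON reply. This is a minimal sketch, assuming the model is prompted to answer in JSON; the key names (`intent`, `keywords`, `information_types`, `time_range`) and the names `QueryAnalysis` and `parse_analysis` are hypothetical and not taken from `query_analyzer.py`.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QueryAnalysis:
    """Structured metadata for one research query (hypothetical schema)."""
    intent: str = "unknown"
    keywords: List[str] = field(default_factory=list)
    information_types: List[str] = field(default_factory=list)
    time_range: Optional[str] = None


def parse_analysis(raw_reply: str) -> QueryAnalysis:
    """Parse the model's JSON reply, falling back to defaults if it is malformed."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return QueryAnalysis()
    if not isinstance(data, dict):
        return QueryAnalysis()
    return QueryAnalysis(
        intent=str(data.get("intent") or "unknown"),
        keywords=[str(k) for k in (data.get("keywords") or [])],
        information_types=[str(t) for t in (data.get("information_types") or [])],
        time_range=data.get("time_range"),
    )
```

Falling back to an empty `QueryAnalysis` keeps the rest of the pipeline running when the model returns something other than valid JSON; a stricter design could raise instead.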



#### 3. **Google Search Tool (search_tool.py)**
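- Uses the `GOOGLE_CSE_API_KEY` and `GOOGLE_CSE_CX` settings to query the Google Custom Search API.
- Retrieves the top `n` results (default: 15) for the query keywords. The Custom Search JSON API returns at most 10 results per request, so 15 results require a second, paginated request.
- Returns a list of dictionaries, each with: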
  - Title
  - URL
  - Snippet
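
The search step can be sketched with the standard library alone. The endpoint and the `key`, `cx`, `q`, `num`, and `start` parameters belong to the Custom Search JSON API; the function names below are hypothetical, reading the credentials from environment variables is an assumption, and `search_tool.py` may be organized differently. Because `num` is capped at 10 per call, the sketch splits a request for 15 results into two pages.

```python
import json
import os
import urllib.parse
import urllib.request

CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"
MAX_PER_REQUEST = 10  # the API caps `num` at 10 results per call


def build_page_params(query, n=15):
    """Split a request for n results into pages of at most 10 results."""
    pages = []
    start = 1  # `start` is a 1-based index of the first result on the page
    while start <= n:
        num = min(MAX_PER_REQUEST, n - start + 1)
        pages.append({"q": query, "num": num, "start": start})
        start += num
    return pages


def parse_items(payload):
    """Reduce a raw API response to title/url/snippet dictionaries."""
    return [
        {
            "title": item.get("title", ""),
            "url": item.get("link", ""),
            "snippet": item.get("snippet", ""),
        }
        for item in payload.get("items", [])
    ]


def search(query, n=15):
    """Run the paged search; requires network access and valid credentials."""
    results = []
    for page in build_page_params(query, n):
        params = dict(
            page,
            key=os.environ["GOOGLE_CSE_API_KEY"],  # assumed to be set in the environment
            cx=os.environ["GOOGLE_CSE_CX"],
        )
        url = CSE_ENDPOINT + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url, timeout=10) as resp:
            results.extend(parse_items(json.load(resp)))
    return results[:n]
```

Keeping `build_page_params` and `parse_items` free of network calls makes them easy to check in isolation; only `search` touches the API.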

#### 4. **Web Scraper Tool (scraper_tool.py)**

- Visits each URL and extracts the text of readable `<p>` tags using BeautifulSoup.
- Clips long text to 5000 characters to keep the LLM input short.
- Returns:
  - Page content
  - URL
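
The extract-and-clip logic can be illustrated as follows. Note that this sketch swaps BeautifulSoup for the standard library's `html.parser` so that it runs without third-party packages; with BeautifulSoup, the extraction would typically be a loop over `soup.find_all("p")`. The names `ParagraphExtractor` and `extract_page` are hypothetical, and fetching the HTML (for example with Requests) is left out.

```python
from html.parser import HTMLParser

MAX_CHARS = 5000  # clip length stated in the step above


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._current = []
        self.paragraphs = []

    def flush(self):
        """Store the paragraph collected so far, with whitespace collapsed."""
        text = " ".join("".join(self._current).split())
        if text:
            self.paragraphs.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # tolerate a previous <p> that was never closed
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._current.append(data)


def extract_page(url, html):
    """Return the page's paragraph text, clipped to MAX_CHARS, with its URL."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up a trailing unclosed paragraph
    content = "\n".join(parser.paragraphs)[:MAX_CHARS]
    return {"url": url, "content": content}
```

Text inside inline tags such as `<b>` within a paragraph is kept, while headings, navigation, and other non-paragraph elements are ignored.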



#### 5. **Content Analyzer (content_analyzer.py)**
- Uses Gemini 2.5 Pro to summarize the scraped content of each page.
- Adds metadata such as:
  - **Summary** (in bullet points)
  - **Content Type** (e.g., "news report")
  - **Relevance Rating** (high, medium, low)
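
The metadata above can be validated before it reaches the synthesizer. The following is a sketch, assuming the model's reply has already been parsed into a dictionary with `summary`, `content_type`, and `relevance` keys; those key names, `ArticleSummary`, `normalize_summary`, and the fallback to `"low"` for unrecognized ratings are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

ALLOWED_RELEVANCE = ("high", "medium", "low")


@dataclass
class ArticleSummary:
    url: str
    bullets: List[str]
    content_type: str
    relevance: str


def normalize_summary(url: str, reply: Dict[str, Any]) -> ArticleSummary:
    """Clean one parsed model reply into a predictable record."""
    raw_summary = reply.get("summary") or ""
    if isinstance(raw_summary, str):
        lines = raw_summary.splitlines()
    else:
        lines = [str(item) for item in raw_summary]
    # Strip leading bullet markers ("-", "*") and drop empty lines.
    bullets = [line.strip().lstrip("-* ").strip() for line in lines]
    bullets = [b for b in bullets if b]

    relevance = str(reply.get("relevance", "")).strip().lower()
    if relevance not in ALLOWED_RELEVANCE:
        relevance = "low"  # assumption: treat unknown ratings conservatively

    content_type = str(reply.get("content_type") or "unknown").strip()
    return ArticleSummary(url, bullets, content_type, relevance)
```

Normalizing the rating to one of three fixed strings lets later steps sort or filter summaries without handling free-form model output.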

#### 6. **Synthesizer (synthesizer.py)**
- Receives all article summaries and the original query.
- Synthesizes content using Gemini with the following logic:
  - Group similar insights across sources.
  - Highlight contradictions.
  - End with a unified **"Final Takeaway"**.
- Returns the final report in Markdown format.
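
The synthesis logic listed above could be expressed as a prompt builder along these lines. This is an illustrative sketch, not the prompt used in `synthesizer.py`: the function name, the prompt wording, the dictionary keys, and the choice to list the most relevant sources first are all assumptions.

```python
from typing import Dict, List

RELEVANCE_ORDER = {"high": 0, "medium": 1, "low": 2}


def build_synthesis_prompt(query: str, summaries: List[Dict[str, str]]) -> str:
    """Assemble one prompt from the original query and all article summaries."""
    # Most relevant sources first; sorted() is stable, so ties keep input order.
    ordered = sorted(
        summaries,
        key=lambda s: RELEVANCE_ORDER.get(s.get("relevance", "low"), 2),
    )
    sources = []
    for index, item in enumerate(ordered, start=1):
        header = f"[{index}] {item['url']} (relevance: {item.get('relevance', 'low')})"
        sources.append(header + "\n" + item["summary"])

    instructions = (
        "Write a Markdown report that answers the research query.\n"
        "1. Group similar insights across sources.\n"
        "2. Highlight contradictions between sources.\n"
        '3. End with a section titled "Final Takeaway".'
    )
    return (
        f"Research query: {query}\n\n{instructions}\n\nSources:\n\n"
        + "\n\n".join(sources)
    )
```

The returned string would then be sent to Gemini, and the model's Markdown reply used as the final report.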

#### 7. **Streamlit Output (app.py)**
- Displays the following to the user:
  - Sidebar: Query analysis & article summaries.
  - Main view: Top links and final synthesized report.
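
Taken together, steps 2 through 6 form a linear pipeline whose result the Streamlit app renders. The sketch below shows one possible way to wire the stages, with each stage passed in as a plain function so it can be replaced or tested separately. `run_pipeline`, its parameter names, and the decision to skip pages that fail to load or yield no text are assumptions, not a description of `app.py`.

```python
from typing import Callable, Dict, List


def run_pipeline(
    query: str,
    analyze: Callable[[str], Dict],
    search: Callable[[List[str]], List[Dict]],
    scrape: Callable[[str], Dict],
    summarize: Callable[[Dict], Dict],
    synthesize: Callable[[str, List[Dict]], str],
) -> Dict:
    """Run the stages in order and return everything the UI needs to display."""
    analysis = analyze(query)
    links = search(analysis["keywords"])

    summaries = []
    for link in links:
        try:
            page = scrape(link["url"])
        except Exception:
            continue  # assumption: a page that fails to load is skipped
        if not page.get("content"):
            continue  # nothing readable was extracted from this page
        summaries.append(summarize(page))

    report = synthesize(query, summaries)
    return {
        "analysis": analysis,    # sidebar: query analysis
        "summaries": summaries,  # sidebar: article summaries
        "links": links,          # main view: top links
        "report": report,        # main view: final synthesized report
    }
```

Under this design, the number of summaries can be smaller than the number of links whenever some pages cannot be scraped.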

---

### 🛠️ Tech Stack & Tools

| Component         | Technology/Library         |
|-------------------|----------------------------|
| UI                | Streamlit                  |
| LLM               | Google Gemini 2.5 Pro      |
| Search API        | Google Custom Search (CSE) |
| Scraper           | BeautifulSoup, Requests    |
| Config Management | python-dotenv              |

---

### 🚀 Example End-to-End Flow

**Query:** "Electric vehicle subsidies in Europe 2024"

**Result:**
- 15 relevant articles scraped
- 12 summaries processed
- Final Markdown report synthesized, covering policy trends, contradictions in subsidy effectiveness, and a closing insight.