---
title: Web Scraping with Playwright + RAG
emoji: 🕷️
colorFrom: red
colorTo: red
sdk: docker
app_file: app.py
app_port: 8501
tags:
  - streamlit
  - playwright
  - rag
  - flan-t5
  - web-scraping
pinned: true
short_description: Playwright RAG using FLAN-T5-small
---

# Web Scraping + RAG Chatbot

This Streamlit application combines web scraping with Retrieval-Augmented Generation (RAG) in a chatbot. It scrapes text from a URL you specify, indexes it with FAISS, and answers questions about that content with the Hugging Face model `google/flan-t5-small`.

## Features
- **Web Scraping**: Uses Playwright to extract text from websites in headless Chromium.
- **RAG Pipeline**: Indexes scraped content with `sentence-transformers/all-MiniLM-L6-v2` and FAISS, then answers questions using `google/flan-t5-small`.
- **Interactive UI**: Built with Streamlit, offering modes for scraping, chatting, and viewing app details.
- **Dockerized**: Runs in a containerized environment, optimized for Hugging Face Spaces or local deployment.
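
The scraping step can be sketched as follows. This is a minimal illustration, not the code in `app.py`: the function names `scrape_text` and `clean_text`, the 30-second timeout, and the choice to read the text of `<body>` are all assumptions made for the example.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse runs of spaces/tabs and drop empty lines from scraped text."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def scrape_text(url: str, timeout_ms: int = 30_000) -> str:
    """Return the visible text of a page rendered in headless Chromium.

    Hypothetical helper: the real app may wait for other events or
    select different elements.
    """
    # Imported lazily so clean_text() stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=timeout_ms, wait_until="domcontentloaded")
            return clean_text(page.inner_text("body"))
        finally:
            # Always release the browser, even if navigation fails.
            browser.close()
```

Calling `scrape_text("https://example.com")` would return the page text with whitespace normalized, ready to be split into chunks for indexing.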

## Tech Stack
- **Python**: 3.10
- **Web Scraping**: Playwright (`playwright==1.48.0`)
- **RAG**: LangChain (`langchain==0.3.27`), FAISS (`faiss-cpu==1.7.4`), Hugging Face Transformers (`transformers==4.44.2`)
- **Frontend**: Streamlit (`streamlit==1.32.0`)
- **Container**: Docker (`python:3.10-slim` base image)

## Setup Instructions

### Local Development
1. **Clone the Repository**:
   ```bash
   git clone <your-repo-url>
   cd <your-repo-name>
   ```

2. **Build and Run with Docker**:
   ```bash
   docker build --no-cache -t web-scraping-rag .
   docker run -p 8501:8501 web-scraping-rag
   ```

3. **Access the App**:
   - Open http://localhost:8501 in your browser.
   - Enter a URL (e.g., https://example.com) to scrape and ask questions about the content.

4. **Check Logs**:
   ```bash
   docker exec -it <container-id> cat /app/cache/app.log
   ```
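
Because the container can take a while to start, it may help to wait for the app before opening the browser. The sketch below polls Streamlit's health endpoint. The `/_stcore/health` path is not mentioned in this README and is assumed here from recent Streamlit versions; the function names, retry count, and delay are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def health_url(base_url: str) -> str:
    """Build the health-check URL for a Streamlit server (assumed path)."""
    return base_url.rstrip("/") + "/_stcore/health"


def http_ok(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(
    base_url: str = "http://localhost:8501",
    attempts: int = 30,
    delay_s: float = 2.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll the health endpoint until it responds or attempts run out."""
    url = health_url(base_url)
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)  # back off before the next try
    return False
```

`wait_until_healthy()` returns `True` once the server answers and `False` if it never does within the allotted attempts, in which case the log file is the next place to look.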



### Deploy to Hugging Face Spaces
1. **Create a Space**:
   - Go to [Hugging Face Spaces](https://huggingface.co/spaces).
   - Create a new Space with the Docker template.

2. **Push Code**:
   ```bash
   git add app.py Dockerfile requirements.txt README.md
   git commit -m "Deploy Playwright-based web scraping RAG app"
   git push
   ```

3. **Configure Space**:
   - Ensure at least 4GB RAM and 2 CPU cores.
   - Set the Space to public or private as needed.
   - Monitor build logs for errors.

4. **Access the App**:
   - Visit `https://<your-username>-<space-name>.hf.space`.
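
The 4GB RAM and 2 CPU core recommendation can be checked from inside a running Linux container with a short script. This is a rough sketch under stated assumptions: the helper names are hypothetical, and `os.cpu_count()` and `os.sysconf` report what the process sees, which inside a container may be the host's totals rather than the limits applied to the Space.

```python
import os

MIN_CPUS = 2
MIN_RAM_BYTES = 4 * 1024**3  # 4 GB, per the recommendation above


def meets_minimum(cpus: int, ram_bytes: int) -> bool:
    """Return True if both the CPU and RAM recommendations are satisfied."""
    return cpus >= MIN_CPUS and ram_bytes >= MIN_RAM_BYTES


def detect_resources() -> tuple[int, int]:
    """Return (cpu_count, physical_ram_bytes) as seen by this process (Linux only)."""
    cpus = os.cpu_count() or 1
    ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return cpus, ram
```

For example, `meets_minimum(*detect_resources())` gives a quick yes/no answer. Treat a `False` result as a hint rather than proof, since reported memory is often slightly below the nominal size.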



## Usage
1. **Web Scraping Mode**:
   - Enter a valid URL (e.g., https://example.com).
   - Click "Scrape Website" to extract and index content.
   - View scraped content in the expandable text area.

2. **Chat with Content Mode**:
   - Ask questions about the scraped content via the chat input.
   - The app retrieves relevant chunks using FAISS and generates answers with FLAN-T5.

3. **About Mode**:
   - Learn about the app’s tech stack and functionality.
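
The retrieve-then-generate flow behind the chat mode can be sketched as below. This is an illustration under assumptions, not the app's implementation: it calls `faiss` and `sentence-transformers` directly rather than through LangChain, and the chunk size, overlap, `k`, prompt wording, and all function names are hypothetical.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    chunks = []
    for start in range(0, len(text), size - overlap):
        piece = text[start:start + size]
        if piece.strip():
            chunks.append(piece)
    return chunks


def build_prompt(question: str, contexts: list[str]) -> str:
    """Combine retrieved chunks and the question into one prompt (illustrative wording)."""
    context = "\n\n".join(contexts)
    return (
        "Answer the question using the context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def build_index(chunks: list[str]):
    """Embed chunks with all-MiniLM-L6-v2 and store them in a flat L2 FAISS index."""
    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    vectors = np.asarray(model.encode(chunks), dtype="float32")
    index = faiss.IndexFlatL2(vectors.shape[1])
    index.add(vectors)
    return model, index


def retrieve(question: str, chunks: list[str], model, index, k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are nearest to the question."""
    import numpy as np

    query = np.asarray(model.encode([question]), dtype="float32")
    _, ids = index.search(query, min(k, len(chunks)))
    return [chunks[i] for i in ids[0] if i != -1]


def answer(question: str, chunks: list[str], k: int = 3) -> str:
    """Retrieve context for the question and generate an answer with FLAN-T5."""
    from transformers import pipeline

    model, index = build_index(chunks)
    contexts = retrieve(question, chunks, model, index, k)
    generator = pipeline("text2text-generation", model="google/flan-t5-small")
    result = generator(build_prompt(question, contexts), max_new_tokens=128)
    return result[0]["generated_text"]
```

A real app would build the index once per scrape and reuse it across questions. `answer` rebuilds it on each call only to keep the sketch short.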



## Dependencies
See `requirements.txt` for the full list. Key dependencies:

- `streamlit==1.32.0`
- `playwright==1.48.0`
- `transformers==4.44.2`
- `sentence-transformers==3.1.1`
- `langchain==0.3.27`
- `faiss-cpu==1.7.4`
- `torch==2.2.0`
- `tokenizers==0.19.1`
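
To confirm that an environment matches these pins, the installed versions can be compared against them with the standard library. This is a small sketch; the `PINS` mapping is copied from the list above, and the helper names are made up for illustration.

```python
from importlib import metadata
from typing import Callable, Optional

PINS = {
    "streamlit": "1.32.0",
    "playwright": "1.48.0",
    "transformers": "4.44.2",
    "sentence-transformers": "3.1.1",
    "langchain": "0.3.27",
    "faiss-cpu": "1.7.4",
    "torch": "2.2.0",
    "tokenizers": "0.19.1",
}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    pins: dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> dict[str, Optional[str]]:
    """Map each package whose installed version differs from its pin to what was found."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        # Some wheels report a local suffix such as "2.2.0+cpu"; compare the public part.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = found
    return mismatches
```

Running `find_mismatches(PINS)` inside the container returns an empty dict when every pinned package is present at the expected version.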

## Troubleshooting
- **Build Errors**: Check Hugging Face Spaces build logs or local Docker build output.
- **Scraping Failures**: Verify the URL is accessible and not blocked by CAPTCHAs. Check `/app/cache/app.log`.
- **Model Loading Issues**: Ensure `transformers==4.44.2` and `tokenizers==0.19.1` are installed correctly.
- **Resource Limits**: Confirm at least 4GB RAM and 2 CPU cores in Hugging Face Spaces settings.
- **Logs**: Run `docker exec -it <container-id> cat /app/cache/app.log` to diagnose issues.
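
When a scrape fails, a quick check outside the app can show whether the URL is well formed and reachable at all. The following sketch uses only the standard library. The function names and the `User-Agent` header are assumptions for the example, and a successful plain HTTP response does not rule out a CAPTCHA or other bot protection in the browser.

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse


def is_http_url(url: str) -> bool:
    """Return True for absolute http(s) URLs that include a host."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def probe(url: str, timeout_s: float = 10.0) -> str:
    """Return a short status string describing whether `url` responds."""
    if not is_http_url(url):
        return "invalid URL"
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 403 or 404).
        return f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        return f"unreachable: {exc}"
```

A result such as `HTTP 403` suggests the site is rejecting automated clients, while `unreachable` points to DNS or network problems rather than the scraper itself.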

## License
MIT License