---
title: Web Scraping with Playwright + RAG
emoji: 🕷️
colorFrom: red
colorTo: red
sdk: docker
app_file: app.py
app_port: 8501
tags:
  - streamlit
  - playwright
  - rag
  - flan-t5
  - web-scraping
pinned: true
short_description: Playwright RAG using FLAN-T5-small
---

# Web Scraping + RAG Chatbot

This Streamlit application combines web scraping with Retrieval-Augmented Generation (RAG) to answer questions about a web page. It scrapes content from a URL you specify, indexes it with FAISS, and answers questions about that content using a Hugging Face model (`google/flan-t5-small`).

## Features
- **Web Scraping**: Uses Playwright to extract text from websites in headless Chromium.
- **RAG Pipeline**: Indexes scraped content with `sentence-transformers/all-MiniLM-L6-v2` and FAISS, then answers questions using `google/flan-t5-small`.
- **Interactive UI**: Built with Streamlit, offering modes for scraping, chatting, and viewing app details.
- **Dockerized**: Runs in a containerized environment, optimized for Hugging Face Spaces or local deployment.
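
As a rough illustration of the scraping feature, the sketch below loads a page in headless Chromium through Playwright's synchronous API and returns the visible body text with whitespace collapsed. The helper names `scrape_text` and `normalize_text`, the 30-second timeout, and the `domcontentloaded` wait condition are assumptions made for this example; the logic in `app.py` may differ.

```python
import re


def normalize_text(raw: str) -> str:
    """Collapse runs of whitespace so the text is easier to chunk and index."""
    return re.sub(r"\s+", " ", raw).strip()


def scrape_text(url: str, timeout_ms: int = 30_000) -> str:
    """Return the visible text of a page's <body> (hypothetical helper)."""
    # Imported lazily so this module still loads where Playwright is missing.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Assumed wait condition: stop once the DOM has been parsed.
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            return normalize_text(page.inner_text("body"))
        finally:
            browser.close()
```

Calling `scrape_text("https://example.com")` requires the Chromium browser to be installed for Playwright, which the Docker image is expected to handle.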

## Tech Stack
- **Python**: 3.10
- **Web Scraping**: Playwright (`playwright==1.48.0`)
- **RAG**: LangChain (`langchain==0.3.27`), FAISS (`faiss-cpu==1.7.4`), Hugging Face Transformers (`transformers==4.44.2`)
- **Frontend**: Streamlit (`streamlit==1.32.0`)
- **Container**: Docker (`python:3.10-slim` base image)

## Setup Instructions

### Local Development
1. **Clone the Repository**:
   ```bash
   git clone <your-repo-url>
   cd <your-repo-name>
   ```

2. **Build and Run with Docker**:
   ```bash
   docker build --no-cache -t web-scraping-rag .
   docker run -p 8501:8501 web-scraping-rag
   ```

3. **Access the App**:
   - Open http://localhost:8501 in your browser.
   - Enter a URL (e.g., https://example.com) to scrape and ask questions about the content.

4. **Check Logs**:
   ```bash
   docker exec -it <container-id> cat /app/cache/app.log
   ```


### Deploy to Hugging Face Spaces
1. **Create a Space**:
   - Go to Hugging Face Spaces.
   - Create a new Space with the Docker template.

2. **Push Code**:
   ```bash
   git add app.py Dockerfile requirements.txt README.md
   git commit -m "Deploy Playwright-based web scraping RAG app"
   git push
   ```

3. **Configure Space**:
   - Ensure at least 4 GB RAM and 2 CPU cores.
   - Set the Space to public or private as needed.
   - Monitor build logs for errors.

4. **Access the App**:
   - Visit `https://<your-username>-<space-name>.hf.space`.


## Usage

### Web Scraping Mode
- Enter a valid URL (e.g., https://example.com).
- Click "Scrape Website" to extract and index content.
- View scraped content in the expandable text area.
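
What counts as a "valid URL" is not spelled out here, so the following is only a minimal sketch of one reasonable check: accept absolute `http` or `https` URLs that include a host. The function name `is_valid_url` is hypothetical, and the app may apply different rules.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host name."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

A check like this rejects inputs such as `example.com` (no scheme) before a browser is ever launched, which keeps error messages quick and specific.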


### Chat with Content Mode
- Ask questions about the scraped content via the chat input.
- The app retrieves relevant chunks using FAISS and generates answers with FLAN-T5.
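
The retrieve-then-generate flow described above can be sketched with the standard library alone. This is a toy stand-in, not the app's pipeline: word-count vectors replace the `all-MiniLM-L6-v2` embeddings, a sorted cosine-similarity scan replaces the FAISS index, and the prompt wording is an assumption. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a sentence encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question (brute-force scan)."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved chunks and the question into one prompt string."""
    context = "\n".join(context_chunks)
    return (
        "Answer the question using the context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

In the real app, the string returned by something like `build_prompt` would be passed to the FLAN-T5 model; only the retrieval and prompt-assembly steps are shown here.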


### About Mode
- Learn about the app’s tech stack and functionality.


## Dependencies
See `requirements.txt` for the full list. Key dependencies:

- `streamlit==1.32.0`
- `playwright==1.48.0`
- `transformers==4.44.2`
- `sentence-transformers==3.1.1`
- `langchain==0.3.27`
- `faiss-cpu==1.7.4`
- `torch==2.2.0`
- `tokenizers==0.19.1`

## Troubleshooting
- **Build Errors**: Check Hugging Face Spaces build logs or local Docker build output.
- **Scraping Failures**: Verify the URL is accessible and not blocked by CAPTCHAs. Check `/app/cache/app.log`.
- **Model Loading Issues**: Ensure `transformers==4.44.2` and `tokenizers==0.19.1` are installed correctly.
- **Resource Limits**: Confirm at least 4 GB RAM and 2 CPU cores in Hugging Face Spaces settings.
- **Logs**: Run `docker exec -it <container-id> cat /app/cache/app.log` to diagnose issues.
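
For the model-loading check, one way to confirm the pinned versions is to compare them against what is installed, using `importlib.metadata` from the standard library. This is a sketch under assumptions: the helper names `installed_version` and `find_mismatches` are hypothetical, and only the two pins named in the troubleshooting entry are checked.

```python
from importlib import metadata
from typing import Optional

# Pins taken from the model-loading troubleshooting entry.
PINS = {
    "transformers": "4.44.2",
    "tokenizers": "0.19.1",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that is not met."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Running the script inside the container prints one line per mismatched or missing package and nothing when both pins are satisfied.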

## License
MIT License