---
title: Web Scraping with Playwright + RAG
emoji: 🕷️
colorFrom: red
colorTo: red
sdk: docker
app_file: app.py
app_port: 8501
tags:
- streamlit
- playwright
- rag
- flan-t5
- web-scraping
pinned: true
short_description: Playwright RAG using FLAN-T5-small
---

# Web Scraping + RAG Chatbot

This is a Streamlit-based web application that combines web scraping with Retrieval-Augmented Generation (RAG) to create a chatbot. It scrapes content from a specified URL, indexes it using FAISS, and answers questions about the content using a Hugging Face model (`google/flan-t5-small`).

## Features
- **Web Scraping**: Uses Playwright to extract text from websites in headless Chromium.
- **RAG Pipeline**: Indexes scraped content with `sentence-transformers/all-MiniLM-L6-v2` and FAISS, then answers questions using `google/flan-t5-small`.
- **Interactive UI**: Built with Streamlit, offering modes for scraping, chatting, and viewing app details.
- **Dockerized**: Runs in a containerized environment, optimized for Hugging Face Spaces or local deployment.

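The scraping step listed above can be sketched roughly as follows. This is a minimal illustration using Playwright's synchronous API, not the code in `app.py`; the function names `is_valid_url`, `clean_text`, and `scrape_text`, the 30-second timeout, and the choice to read the whole `body` element are all assumptions made for the example.

```python
import re
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def clean_text(raw: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()


def scrape_text(url: str, timeout_ms: int = 30_000) -> str:
    """Return the visible body text of `url` using headless Chromium."""
    if not is_valid_url(url):
        raise ValueError(f"Not an absolute http(s) URL: {url!r}")
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            # inner_text returns rendered text, so script/style content is skipped.
            return clean_text(page.inner_text("body"))
        finally:
            browser.close()
```

Calling `scrape_text("https://example.com")` would return the page text as one whitespace-normalized string, provided the Chromium browser binary has been installed for Playwright.
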
## Tech Stack
- **Python**: 3.10
- **Web Scraping**: Playwright (`playwright==1.48.0`)
- **RAG**: LangChain (`langchain==0.3.27`), FAISS (`faiss-cpu==1.7.4`), Hugging Face Transformers (`transformers==4.44.2`)
- **Frontend**: Streamlit (`streamlit==1.32.0`)
- **Container**: Docker (`python:3.10-slim` base image)

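To make the indexing half of the RAG stack concrete, here is a hedged sketch of chunking text and building a FAISS index over it. It calls `sentence-transformers` and `faiss` directly rather than going through LangChain, so it is a stand-in for whatever `app.py` actually does; `chunk_text`, `build_index`, and the chunk size and overlap values are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap slightly."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size].strip()
        if piece:
            chunks.append(piece)
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def build_index(chunks: list[str]):
    """Embed chunks and store them in a FAISS inner-product index."""
    # Lazy imports: both packages are heavy and only needed for indexing.
    import faiss
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    # Normalized vectors make inner product equal to cosine similarity.
    vectors = model.encode(chunks, normalize_embeddings=True, convert_to_numpy=True)
    vectors = vectors.astype("float32")
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return model, index
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk. A flat index is exhaustive, which is usually fine for the few hundred chunks a single page produces.
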
## Setup Instructions

### Local Development
1. **Clone the Repository**:
   ```bash
   git clone <your-repo-url>
   cd <your-repo-name>
   ```

2. **Build and Run with Docker**:
   ```bash
   docker build --no-cache -t web-scraping-rag .
   docker run -p 8501:8501 web-scraping-rag
   ```

3. **Access the App**:
   - Open http://localhost:8501 in your browser.
   - Enter a URL (e.g., https://example.com) to scrape and ask questions about the content.

4. **Check Logs**:
   ```bash
   docker exec -it <container-id> cat /app/cache/app.log
   ```

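Because the container can take a while to start, it may help to wait for the app before opening the browser. The sketch below polls Streamlit's `/_stcore/health` endpoint on the port mapped above; the helper name `wait_for_app` and its timeout defaults are assumptions, and the script is not part of this repository.

```python
import time
import urllib.error
import urllib.request


def wait_for_app(url: str = "http://localhost:8501/_stcore/health",
                 timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll `url` until it answers HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Container still starting, or the port is not mapped yet.
        time.sleep(interval)
    return False
```

Running `wait_for_app()` right after `docker run` returns `True` once the app responds and `False` if it never does, in which case the log file from step 4 is the next place to look.
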
### Deploy to Hugging Face Spaces
1. **Create a Space**:
   - Go to Hugging Face Spaces.
   - Create a new Space with the Docker template.

2. **Push Code**:
   ```bash
   git add app.py Dockerfile requirements.txt README.md
   git commit -m "Deploy Playwright-based web scraping RAG app"
   git push
   ```

3. **Configure Space**:
   - Ensure at least 4 GB RAM and 2 CPU cores.
   - Set the Space to public or private as needed.
   - Monitor build logs for errors.

4. **Access the App**:
   - Visit `https://<your-username>-<space-name>.hf.space`.

## Usage
1. **Web Scraping Mode**:
   - Enter a valid URL (e.g., https://example.com).
   - Click "Scrape Website" to extract and index content.
   - View scraped content in the expandable text area.

2. **Chat with Content Mode**:
   - Ask questions about the scraped content via the chat input.
   - The app retrieves relevant chunks using FAISS and generates answers with FLAN-T5.

3. **About Mode**:
   - Learn about the app’s tech stack and functionality.

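The retrieve-then-generate flow behind the chat mode might look roughly like the sketch below. It assumes a `SentenceTransformer` model and a FAISS index already built over the same `chunks` list; `build_prompt`, `answer`, the prompt wording, and the `k`, `max_chars`, and `max_new_tokens` values are illustrative choices, not taken from `app.py`.

```python
def build_prompt(question: str, contexts: list[str], max_chars: int = 1500) -> str:
    """Join retrieved chunks into a single instruction prompt for FLAN-T5."""
    # Truncate the context so the prompt stays near the model's input limit.
    context = "\n".join(contexts)[:max_chars]
    return (
        "Answer the question using only the context below.\n\n"
        f"Context: {context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer(question: str, chunks: list[str], model, index, k: int = 3) -> str:
    """Retrieve the k nearest chunks from a FAISS index and generate an answer.

    `model` is a SentenceTransformer and `index` a FAISS index over `chunks`.
    """
    from transformers import pipeline  # lazy import: pulls in PyTorch

    query = model.encode([question], normalize_embeddings=True).astype("float32")
    _, ids = index.search(query, min(k, len(chunks)))
    # FAISS pads missing results with -1, so filter those out.
    contexts = [chunks[i] for i in ids[0] if i != -1]
    generator = pipeline("text2text-generation", model="google/flan-t5-small")
    result = generator(build_prompt(question, contexts), max_new_tokens=64)
    return result[0]["generated_text"]
```

This version reloads the generation pipeline on every call for brevity. In a Streamlit app it would more plausibly be created once and cached, for example with `st.cache_resource`.
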
## Dependencies
See `requirements.txt` for the full list. Key dependencies:

- `streamlit==1.32.0`
- `playwright==1.48.0`
- `transformers==4.44.2`
- `sentence-transformers==3.1.1`
- `langchain==0.3.27`
- `faiss-cpu==1.7.4`
- `torch==2.2.0`
- `tokenizers==0.19.1`

## Troubleshooting
- **Build Errors**: Check Hugging Face Spaces build logs or local Docker build output.
- **Scraping Failures**: Verify the URL is accessible and not blocked by CAPTCHAs. Check `/app/cache/app.log`.
- **Model Loading Issues**: Ensure `transformers==4.44.2` and `tokenizers==0.19.1` are installed correctly.
- **Resource Limits**: Confirm at least 4 GB RAM and 2 CPU cores in Hugging Face Spaces settings.
- **Logs**: Run `docker exec -it <container-id> cat /app/cache/app.log` to diagnose issues.

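One way to confirm that the pinned versions are what actually got installed is to compare them against package metadata from inside the container. The following is a small sketch using only the standard library; the `PINS` dictionary repeats the versions listed in this README, and `check_pins` is a hypothetical helper, not a script shipped with the app.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list in this README.
PINS = {
    "streamlit": "1.32.0",
    "playwright": "1.48.0",
    "transformers": "4.44.2",
    "sentence-transformers": "3.1.1",
    "langchain": "0.3.27",
    "faiss-cpu": "1.7.4",
    "torch": "2.2.0",
    "tokenizers": "0.19.1",
}


def check_pins(pins: dict[str, str]) -> dict[str, str]:
    """Return {package: problem} for every pin that is missing or mismatched."""
    problems = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if installed != expected:
            problems[name] = f"installed {installed}, expected {expected}"
    return problems
```

An empty result from `check_pins(PINS)` means every pin matches. Note that a CPU-only `torch` wheel may report a version with a local suffix such as `+cpu`, which this exact comparison would flag as a mismatch.
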
## License
MIT License