---
title: Web Scraping with Playwright + RAG
emoji: 🕷️
colorFrom: red
colorTo: red
sdk: docker
app_file: app.py
app_port: 8501
tags:
- streamlit
- playwright
- rag
- flan-t5
- web-scraping
pinned: true
short_description: Playwright RAG using FLAN-T5-small
---
# Web Scraping + RAG Chatbot
This is a Streamlit-based web application that combines web scraping with Retrieval-Augmented Generation (RAG) to create an intelligent chatbot. It scrapes content from a specified URL, indexes it using FAISS, and answers questions about the content using a Hugging Face model (`google/flan-t5-small`).
## Features
- **Web Scraping**: Uses Playwright to extract text from websites in headless Chromium.
- **RAG Pipeline**: Indexes scraped content with `sentence-transformers/all-MiniLM-L6-v2` and FAISS, then answers questions using `google/flan-t5-small`.
- **Interactive UI**: Built with Streamlit, offering modes for scraping, chatting, and viewing app details.
- **Dockerized**: Runs in a containerized environment, optimized for Hugging Face Spaces or local deployment.
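
The scraping feature above can be sketched roughly as follows. This is a minimal illustration, not the code in `app.py`: the helper names `is_http_url`, `normalize_text`, and `scrape_text`, the 30-second timeout, and the choice to read the whole `<body>` are all assumptions made for the example.

```python
import re
from urllib.parse import urlparse


def is_http_url(url: str) -> bool:
    """Return True only for absolute http(s) URLs with a host part."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def normalize_text(raw: str) -> str:
    """Collapse runs of whitespace so the text is easier to chunk later."""
    return re.sub(r"\s+", " ", raw).strip()


def scrape_text(url: str, timeout_ms: int = 30_000) -> str:
    """Fetch the visible text of a page with headless Chromium (hypothetical helper)."""
    if not is_http_url(url):
        raise ValueError(f"Not an http(s) URL: {url!r}")
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            return normalize_text(page.inner_text("body"))
        finally:
            browser.close()
```

Calling `scrape_text("https://example.com")` requires Playwright and its Chromium build to be installed, as they are in the Docker image.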
## Tech Stack
- **Python**: 3.10
- **Web Scraping**: Playwright (`playwright==1.48.0`)
- **RAG**: LangChain (`langchain==0.3.27`), FAISS (`faiss-cpu==1.7.4`), Hugging Face Transformers (`transformers==4.44.2`)
- **Frontend**: Streamlit (`streamlit==1.32.0`)
- **Container**: Docker (`python:3.10-slim` base image)
## Setup Instructions
### Local Development
1. **Clone the Repository**:
   ```bash
   git clone <your-repo-url>
   cd <your-repo-name>
   ```
2. **Build and Run with Docker**:
   ```bash
   docker build --no-cache -t web-scraping-rag .
   docker run -p 8501:8501 web-scraping-rag
   ```
3. **Access the App**:
   - Open http://localhost:8501 in your browser.
   - Enter a URL (e.g., https://example.com) to scrape and ask questions about the content.
4. **Check Logs**:
   ```bash
   docker exec -it <container-id> cat /app/cache/app.log
   ```
### Deploy to Hugging Face Spaces
1. **Create a Space**:
   - Go to Hugging Face Spaces.
   - Create a new Space with the Docker template.
2. **Push Code**:
   ```bash
   git add app.py Dockerfile requirements.txt README.md
   git commit -m "Deploy Playwright-based web scraping RAG app"
   git push
   ```
3. **Configure Space**:
   - Ensure at least 4 GB RAM and 2 CPU cores.
   - Set the Space to public or private as needed.
   - Monitor build logs for errors.
4. **Access the App**:
   - Visit `https://<your-username>-<space-name>.hf.space`.
## Usage
1. **Web Scraping Mode**:
   - Enter a valid URL (e.g., https://example.com).
   - Click "Scrape Website" to extract and index content.
   - View scraped content in the expandable text area.
2. **Chat with Content Mode**:
   - Ask questions about the scraped content via the chat input.
   - The app retrieves relevant chunks using FAISS and generates answers with FLAN-T5.
3. **About Mode**:
   - Learn about the app's tech stack and functionality.
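
The retrieve-then-generate flow of the chat mode can be illustrated with the sketch below. To stay dependency-free it swaps in a bag-of-words cosine score for the `all-MiniLM-L6-v2` embeddings and a brute-force ranking for the FAISS index, so it shows the shape of the pipeline rather than the app's actual retrieval quality. The function names, the prompt template, and `max_new_tokens=64` are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Toy 'embedding': lowercase token counts (stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by similarity to the question (brute force, stand-in for FAISS)."""
    q = bow(question)
    return sorted(chunks, key=lambda c: cosine(q, bow(c)), reverse=True)[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Hypothetical prompt template that puts retrieved text ahead of the question."""
    context = "\n".join(context_chunks)
    return (
        "Answer the question using only the context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer(question: str, chunks: list[str]) -> str:
    """Generate an answer with FLAN-T5 from the best-matching chunks."""
    # Imported lazily; downloads google/flan-t5-small on first use.
    from transformers import pipeline

    generator = pipeline("text2text-generation", model="google/flan-t5-small")
    prompt = build_prompt(question, top_k(question, chunks))
    return generator(prompt, max_new_tokens=64)[0]["generated_text"]
```

Only `answer` needs `transformers`; the retrieval and prompt helpers run on the standard library alone, which makes them easy to try on a few sample chunks before involving the model.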
## Dependencies
See `requirements.txt` for the full list. Key dependencies:
- `streamlit==1.32.0`
- `playwright==1.48.0`
- `transformers==4.44.2`
- `sentence-transformers==3.1.1`
- `langchain==0.3.27`
- `faiss-cpu==1.7.4`
- `torch==2.2.0`
- `tokenizers==0.19.1`
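
One way to confirm that an environment matches these pins is a short check built on the standard library's `importlib.metadata`. The script below is a sketch and not part of the repository; `check_pins` is a hypothetical helper, and it ignores local version suffixes such as `+cpu` that some `torch` wheels carry.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the key dependencies listed in this README.
PINS = {
    "streamlit": "1.32.0",
    "playwright": "1.48.0",
    "transformers": "4.44.2",
    "sentence-transformers": "3.1.1",
    "langchain": "0.3.27",
    "faiss-cpu": "1.7.4",
    "torch": "2.2.0",
    "tokenizers": "0.19.1",
}


def check_pins(pins, get_version=version):
    """Return {package: installed_version_or_None} for every mismatch."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        # Compare without a local suffix like "+cpu".
        if found is None or found.split("+", 1)[0] != wanted:
            problems[name] = found
    return problems


if __name__ == "__main__":
    for name, found in check_pins(PINS).items():
        print(f"{name}: expected {PINS[name]}, found {found or 'not installed'}")
```

An empty result means every listed package is installed at the pinned version.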
## Troubleshooting
- **Build Errors**: Check Hugging Face Spaces build logs or local Docker build output.
- **Scraping Failures**: Verify the URL is accessible and not blocked by CAPTCHAs. Check `/app/cache/app.log`.
- **Model Loading Issues**: Ensure `transformers==4.44.2` and `tokenizers==0.19.1` are installed correctly.
- **Resource Limits**: Confirm at least 4 GB RAM and 2 CPU cores in Hugging Face Spaces settings.
- **Logs**: Run `docker exec -it <container-id> cat /app/cache/app.log` to diagnose issues.
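
For scraping failures, a quick reachability probe outside the browser can help separate network problems from page-level blocking. The sketch below uses only `urllib` from the standard library; `probe`, its `User-Agent` value, and the 10-second timeout are assumptions for illustration. A successful response does not rule out a CAPTCHA, since a challenge page can still come back with HTTP 200.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 10.0, opener=urllib.request.urlopen) -> str:
    """Return a one-line summary of whether the URL answers over HTTP."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with opener(request, timeout=timeout) as response:
            return f"ok: HTTP {response.status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 403 or 429).
        return f"http error: {err.code}"
    except urllib.error.URLError as err:
        # DNS failure, refused connection, timeout, and similar.
        return f"unreachable: {err.reason}"


if __name__ == "__main__":
    print(probe("https://example.com"))
```

The `opener` parameter exists only so the function can be exercised without network access; in normal use it can be left at its default.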
## License
MIT License