---
title: Web Scraping with Selenium + RAG
emoji: 🕷️
colorFrom: red
colorTo: red
sdk: docker
app_file: app.py
app_port: 8501
tags:
  - streamlit
  - playwright
  - rag
  - flan-t5
  - web-scraping
pinned: true
short_description: Selenium RAG using FLAN-T5-small
---

# Web Scraping + RAG Chatbot

This is a Streamlit-based web application that combines web scraping with Retrieval-Augmented Generation (RAG) to create an intelligent chatbot. It scrapes content from a specified URL, indexes it using FAISS, and answers questions about the content using a Hugging Face model (`google/flan-t5-small`).

## Features

- **Web Scraping**: Uses Playwright to extract text from websites in headless Chromium.
- **RAG Pipeline**: Indexes scraped content with `sentence-transformers/all-MiniLM-L6-v2` and FAISS, then answers questions using `google/flan-t5-small`.
- **Interactive UI**: Built with Streamlit, offering modes for scraping, chatting, and viewing app details.
- **Dockerized**: Runs in a containerized environment, optimized for Hugging Face Spaces or local deployment.

## Tech Stack

- **Python**: 3.10
- **Web Scraping**: Playwright (`playwright==1.48.0`)
- **RAG**: LangChain (`langchain==0.3.27`), FAISS (`faiss-cpu==1.7.4`), Hugging Face Transformers (`transformers==4.44.2`)
- **Frontend**: Streamlit (`streamlit==1.32.0`)
- **Container**: Docker (`python:3.10-slim` base image)

## Setup Instructions

### Local Development

1. **Clone the Repository**:
   ```bash
   git clone
   cd
   ```
2. **Build and Run with Docker**:
   ```bash
   docker build --no-cache -t web-scraping-rag .
   docker run -p 8501:8501 web-scraping-rag
   ```
3. **Access the App**:
   - Open http://localhost:8501 in your browser.
   - Enter a URL (e.g., https://example.com) to scrape and ask questions about the content.
4. **Check Logs**:
   ```bash
   docker exec -it cat /app/cache/app.log
   ```

### Deploy to Hugging Face Spaces

1. **Create a Space**:
   - Go to Hugging Face Spaces.
   - Create a new Space with the Docker template.
2. **Push Code**:
   - `git add app.py Dockerfile requirements.txt README.md`
   - `git commit -m "Deploy Playwright-based web scraping RAG app"`
   - `git push`
3. **Configure Space**:
   - Ensure at least 4GB RAM and 2 CPU cores.
   - Set the Space to public or private as needed.
   - Monitor build logs for errors.
4. **Access the App**:
   - Visit https://-.hf.space.

## Usage

- **Web Scraping Mode**:
  - Enter a valid URL (e.g., https://example.com).
  - Click "Scrape Website" to extract and index content.
  - View scraped content in the expandable text area.
- **Chat with Content Mode**:
  - Ask questions about the scraped content via the chat input.
  - The app retrieves relevant chunks using FAISS and generates answers with FLAN-T5.
- **About Mode**: Learn about the app's tech stack and functionality.

## Dependencies

See `requirements.txt` for the full list. Key dependencies:

- `streamlit==1.32.0`
- `playwright==1.48.0`
- `transformers==4.44.2`
- `sentence-transformers==3.1.1`
- `langchain==0.3.27`
- `faiss-cpu==1.7.4`
- `torch==2.2.0`
- `tokenizers==0.19.1`

## Troubleshooting

- **Build Errors**: Check Hugging Face Spaces build logs or local Docker build output.
- **Scraping Failures**: Verify the URL is accessible and not blocked by CAPTCHAs. Check `/app/cache/app.log`.
- **Model Loading Issues**: Ensure `transformers==4.44.2` and `tokenizers==0.19.1` are installed correctly.
- **Resource Limits**: Confirm at least 4GB RAM and 2 CPU cores in Hugging Face Spaces settings.
- **Logs**: Run `docker exec -it cat /app/cache/app.log` to diagnose issues.

## License

MIT License
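
## Retrieval Sketch

Chat with Content Mode retrieves relevant chunks and passes them to the model. The block below is a minimal, dependency-free sketch of that retrieve-then-prompt flow, not the code in `app.py`. A bag-of-words cosine similarity stands in for the `sentence-transformers/all-MiniLM-L6-v2` embeddings and the FAISS index. The function names (`chunk_text`, `embed`, `retrieve`, `build_prompt`), the chunk sizes, and the prompt wording are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
        start += chunk_size - overlap
    return chunks


def embed(text):
    """Toy embedding: lowercase token counts (stand-in for a dense model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question (stand-in for FAISS search)."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Assemble a prompt for a text-to-text model; the wording is hypothetical."""
    context = "\n".join(context_chunks)
    return f"Answer using the context.\n\nContext:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    page_text = (
        "Playwright drives headless Chromium to scrape pages. "
        "FAISS stores dense vectors for similarity search. "
        "Streamlit renders the chat interface."
    )
    question = "How are pages scraped?"
    top_chunks = retrieve(question, chunk_text(page_text, chunk_size=8, overlap=2))
    print(build_prompt(question, top_chunks))
```

In the real stack described above, `embed` and `retrieve` would correspond to the embedding model and a FAISS similarity search, and the prompt would be passed to `google/flan-t5-small`. The overall shape (chunk, embed, rank, prompt) is the part this sketch is meant to convey.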