Spaces: muddasser / Webscrapping_Playwright


title: Web Scraping with Playwright + RAG
emoji: 🕷️
colorFrom: red
colorTo: red
sdk: docker
app_file: app.py
app_port: 8501
tags:
  - streamlit
  - playwright
  - rag
  - flan-t5
  - web-scraping
pinned: true
short_description: Playwright RAG using FLAN-T5-small

Web Scraping + RAG Chatbot

This Streamlit web application combines web scraping with Retrieval-Augmented Generation (RAG) to answer questions about a web page. It scrapes content from a URL you specify, indexes it with FAISS, and answers questions about that content using a Hugging Face model (google/flan-t5-small).

Features

Web Scraping: Uses Playwright to extract text from websites in headless Chromium.
RAG Pipeline: Indexes scraped content with sentence-transformers/all-MiniLM-L6-v2 and FAISS, then answers questions using google/flan-t5-small.
Interactive UI: Built with Streamlit, offering modes for scraping, chatting, and viewing app details.
Dockerized: Runs in a containerized environment, optimized for Hugging Face Spaces or local deployment.
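
The scraping feature listed above can be sketched roughly as follows. This is a minimal illustration using Playwright's synchronous API, not the code in app.py; the function names scrape_text and clean_text, the 30-second timeout, and the choice to read the page body's visible text are all assumptions made for the example.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse runs of spaces/tabs and drop empty lines from scraped text."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def scrape_text(url: str, timeout_ms: int = 30_000) -> str:
    """Return the visible body text of `url` using headless Chromium.

    Hypothetical helper: the real app may wait for different events
    or select different elements.
    """
    # Imported lazily so clean_text() stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            return clean_text(page.inner_text("body"))
        finally:
            browser.close()


# Example (requires Playwright and its Chromium build, plus network access):
# print(scrape_text("https://example.com")[:200])
```

Closing the browser in a finally block keeps a failed navigation from leaving a Chromium process running inside the container.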

Tech Stack

Python: 3.10
Web Scraping: Playwright (playwright==1.48.0)
RAG: LangChain (langchain==0.3.27), FAISS (faiss-cpu==1.7.4), Hugging Face Transformers (transformers==4.44.2)
Frontend: Streamlit (streamlit==1.32.0)
Container: Docker (python:3.10-slim base image)

Setup Instructions

Local Development

Clone the Repository:

git clone <your-repo-url>
cd <your-repo-name>

Build and Run with Docker:

docker build --no-cache -t web-scraping-rag .
docker run -p 8501:8501 web-scraping-rag

Access the App:

Open http://localhost:8501 in your browser. Enter a URL (e.g., https://example.com) to scrape and ask questions about the content.

Check Logs:

docker exec -it <container-id> cat /app/cache/app.log

Replace <container-id> with the ID or name shown by docker ps.

Deploy to Hugging Face Spaces

Create a Space:

Go to Hugging Face Spaces. Create a new Space with the Docker template.

Push Code:

git add app.py Dockerfile requirements.txt README.md
git commit -m "Deploy Playwright-based web scraping RAG app"
git push

Configure Space:

Ensure at least 4GB RAM and 2 CPU cores. Set the Space to public or private as needed. Monitor build logs for errors.
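
To see what the running container reports against the 4GB RAM and 2 CPU core guideline, a small check like the one below may help. It is a sketch that assumes a Linux container (os.sysconf is not available on Windows); the function names are hypothetical. Note that these calls can report the host's totals rather than the container's cgroup limits, so treat the result as a rough indication only.

```python
import os


def available_resources() -> tuple[int, float]:
    """Return (cpu_count, total_ram_gib) as reported by the OS (Linux/Unix only)."""
    cpus = os.cpu_count() or 1
    page_size = os.sysconf("SC_PAGE_SIZE")
    pages = os.sysconf("SC_PHYS_PAGES")
    return cpus, page_size * pages / 1024**3


def meets_minimum(cpus: int, ram_gib: float,
                  min_cpus: int = 2, min_ram_gib: float = 4.0) -> bool:
    """Compare reported resources with the minimums suggested in this README."""
    return cpus >= min_cpus and ram_gib >= min_ram_gib


# Example:
# cpus, ram = available_resources()
# print(cpus, round(ram, 1), meets_minimum(cpus, ram))
```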

Access the App:

Visit https://<username>-<space-name>.hf.space.

Usage

Web Scraping Mode:

Enter a valid URL (e.g., https://example.com). Click "Scrape Website" to extract and index content. View scraped content in the expandable text area.
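
What counts as a "valid URL" is not spelled out here. One plausible check, shown as a sketch with a hypothetical is_valid_url helper, is to accept only http and https URLs that include a host:

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Accept only http(s) URLs that include a host (an assumed rule)."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Under this rule, https://example.com passes, while example.com (no scheme) and ftp://example.com are rejected before any browser is launched.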

Chat with Content Mode:

Ask questions about the scraped content via the chat input. The app retrieves relevant chunks using FAISS and generates answers with FLAN-T5.
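
The retrieve-then-generate flow described above can be sketched as follows. This example calls sentence-transformers, FAISS, and Transformers directly rather than through LangChain, so it is a simplified stand-in for what app.py does, not a copy of it. The chunk size, overlap, value of k, prompt wording, and all function names are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    chunks, start, step = [], 0, chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def retrieve(chunks: list[str], question: str, k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question (cosine similarity)."""
    # Heavy dependencies are imported lazily so the pure helpers work without them.
    import faiss
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    vectors = model.encode(chunks, normalize_embeddings=True, convert_to_numpy=True)
    index = faiss.IndexFlatIP(vectors.shape[1])  # inner product on unit vectors
    index.add(vectors)
    query = model.encode([question], normalize_embeddings=True, convert_to_numpy=True)
    _, ids = index.search(query, min(k, len(chunks)))
    return [chunks[i] for i in ids[0] if i != -1]


def build_prompt(context_chunks: list[str], question: str) -> str:
    """Assemble a prompt from retrieved chunks (wording is an assumption)."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer(prompt: str, max_new_tokens: int = 128) -> str:
    """Generate an answer with google/flan-t5-small."""
    from transformers import pipeline

    generator = pipeline("text2text-generation", model="google/flan-t5-small")
    return generator(prompt, max_new_tokens=max_new_tokens)[0]["generated_text"]


# Example (downloads both models on first run):
# chunks = chunk_text(scraped_text)
# print(answer(build_prompt(retrieve(chunks, "What is this page about?"), "What is this page about?")))
```

Keeping k small keeps the prompt short, which matters for a model as small as FLAN-T5-small. A real app would also load the models once and reuse them rather than reloading on every question.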

About Mode:

Learn about the app’s tech stack and functionality.

Dependencies

See requirements.txt for the full list. Key dependencies:

streamlit==1.32.0
playwright==1.48.0
transformers==4.44.2
sentence-transformers==3.1.1
langchain==0.3.27
faiss-cpu==1.7.4
torch==2.2.0
tokenizers==0.19.1
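
Since several of these pins need to match (transformers and tokenizers in particular), it can be useful to compare the installed versions against them. The following is a sketch using only the standard library; the PINS dictionary copies the versions listed above, and the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Optional

# Versions copied from the key dependencies listed in this README.
PINS = {
    "streamlit": "1.32.0",
    "playwright": "1.48.0",
    "transformers": "4.44.2",
    "sentence-transformers": "3.1.1",
    "langchain": "0.3.27",
    "faiss-cpu": "1.7.4",
    "torch": "2.2.0",
    "tokenizers": "0.19.1",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(
    pins: dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> dict[str, tuple[str, Optional[str]]]:
    """Map each package whose installed version differs to (expected, actual)."""
    mismatches = {}
    for package, expected in pins.items():
        actual = lookup(package)
        if actual != expected:
            mismatches[package] = (expected, actual)
    return mismatches


# Example: print(find_mismatches(PINS) or "All pinned versions match.")
```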

Troubleshooting

Build Errors: Check the Hugging Face Spaces build logs or the local Docker build output.
Scraping Failures: Verify that the URL is accessible and not blocked by a CAPTCHA, then check /app/cache/app.log.
Model Loading Issues: Ensure transformers==4.44.2 and tokenizers==0.19.1 are installed correctly.
Resource Limits: Confirm at least 4GB RAM and 2 CPU cores in the Hugging Face Spaces settings.
Logs: Run docker exec -it <container-id> cat /app/cache/app.log to diagnose issues, replacing <container-id> with the ID shown by docker ps.
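
For context on the log file referenced above, here is one way an app could write to /app/cache/app.log with Python's standard logging module. It is a sketch only: the logger name, message format, and setup_logging helper are assumptions, and app.py may configure logging differently.

```python
import logging
from pathlib import Path


def setup_logging(log_path: str = "/app/cache/app.log") -> logging.Logger:
    """Create the log directory if needed and attach a file handler once."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger("webscraper_rag")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on Streamlit reruns
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Example:
# log = setup_logging()
# log.info("Scrape started for %s", "https://example.com")
```

The handler check matters because Streamlit re-executes the script on each interaction, which would otherwise add a new handler and duplicate every log line.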

License

MIT License