	---
	title: Web Scraping with Playwright + RAG
	emoji: 🕷️
	colorFrom: red
	colorTo: red
	sdk: docker
	app_file: app.py
	app_port: 8501
	tags:
	- streamlit
	- playwright
	- rag
	- flan-t5
	- web-scraping
	pinned: true
	short_description: Playwright RAG using FLAN-T5-small
	---

	# Web Scraping + RAG Chatbot

	This is a Streamlit-based web application that combines web scraping with Retrieval-Augmented Generation (RAG) to answer questions about a web page. It scrapes content from a specified URL, indexes it using FAISS, and answers questions about the content using a Hugging Face model (`google/flan-t5-small`).

	## Features
	- Web Scraping: Uses Playwright to extract text from websites in headless Chromium.
	- RAG Pipeline: Indexes scraped content with `sentence-transformers/all-MiniLM-L6-v2` and FAISS, then answers questions using `google/flan-t5-small`.
	- Interactive UI: Built with Streamlit, offering modes for scraping, chatting, and viewing app details.
	- Dockerized: Runs in a containerized environment, optimized for Hugging Face Spaces or local deployment.
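
As an illustration of the scraping step, here is a minimal sketch of loading a page in headless Chromium with Playwright's synchronous API. It is not the code in `app.py`: the helper names `scrape_text` and `normalize_whitespace`, the 30-second timeout, and the choice to read the inner text of `body` are all assumptions made for the example.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace so the text is easier to chunk and index."""
    return re.sub(r"\s+", " ", text).strip()


def scrape_text(url: str, timeout_ms: int = 30_000) -> str:
    """Load a page in headless Chromium and return its visible text.

    Hypothetical helper. Requires the `playwright` package and a Chromium
    build installed with `playwright install chromium`.
    """
    # Imported lazily so normalize_whitespace works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait for the DOM only; the timeout value is an assumed default.
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            return normalize_whitespace(page.inner_text("body"))
        finally:
            browser.close()
```

Under these assumptions, `scrape_text("https://example.com")` would return the page text as a single whitespace-normalized string, ready to be split into chunks.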

	## Tech Stack
	- Python: 3.10
	- Web Scraping: Playwright (`playwright==1.48.0`)
	- RAG: LangChain (`langchain==0.3.27`), FAISS (`faiss-cpu==1.7.4`), Hugging Face Transformers (`transformers==4.44.2`)
	- Frontend: Streamlit (`streamlit==1.32.0`)
	- Container: Docker (`python:3.10-slim` base image)

	## Setup Instructions

	### Local Development
	1. Clone the Repository:
	```bash
	git clone <your-repo-url>
	cd <your-repo-name>
	```
	2. Build and Run with Docker:
	```bash
	docker build --no-cache -t web-scraping-rag .
	docker run -p 8501:8501 web-scraping-rag
	```
	3. Access the App:
	   - Open http://localhost:8501 in your browser.
	   - Enter a URL (e.g., https://example.com) to scrape and ask questions about the content.
	4. Check Logs:
	```bash
	docker exec -it <container-id> cat /app/cache/app.log
	```

	### Deploy to Hugging Face Spaces
	1. Create a Space:
	   - Go to Hugging Face Spaces.
	   - Create a new Space with the Docker template.
	2. Push Code:
	```bash
	git add app.py Dockerfile requirements.txt README.md
	git commit -m "Deploy Playwright-based web scraping RAG app"
	git push
	```
	3. Configure Space:
	   - Ensure at least 4 GB RAM and 2 CPU cores.
	   - Set the Space to public or private as needed.
	   - Monitor build logs for errors.
	4. Access the App:
	   - Visit `https://<your-username>-<space-name>.hf.space`.



	## Usage

	### Web Scraping Mode
	- Enter a valid URL (e.g., https://example.com).
	- Click "Scrape Website" to extract and index content.
	- View scraped content in the expandable text area.
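
What counts as a "valid URL" is not spelled out here. One plausible check, sketched below with the standard library only, is to accept absolute `http` or `https` URLs that include a host. The function name `is_valid_url` is hypothetical, and the app's real validation may differ.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host name."""
    try:
        parsed = urlparse(url.strip())
    except ValueError:
        # urlparse raises ValueError for some malformed inputs (e.g. bad IPv6 literals).
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

A check like this rejects inputs such as `example.com` (no scheme) before a browser is ever launched, which keeps error messages fast and specific.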


	### Chat with Content Mode
	- Ask questions about the scraped content via the chat input.
	- The app retrieves relevant chunks using FAISS and generates answers with FLAN-T5.
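
The retrieve-then-generate flow can be sketched as follows. To stay dependency-free, the retrieval part swaps in a toy word-count embedding and a brute-force cosine ranking in place of the `all-MiniLM-L6-v2` embeddings and the FAISS index that the app uses, so it shows the idea rather than the app's implementation. All function names, the prompt wording, and `max_new_tokens=128` are assumptions.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question (brute force, no index)."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved chunks and the question into one prompt (assumed wording)."""
    context = "\n".join(context_chunks)
    return f"Answer the question using the context.\n\nContext:\n{context}\n\nQuestion: {question}"


def generate_answer(prompt: str) -> str:
    """Generate an answer with FLAN-T5. Requires `transformers` and a model download."""
    # Imported lazily so the retrieval helpers above run without transformers.
    from transformers import pipeline

    generator = pipeline("text2text-generation", model="google/flan-t5-small")
    return generator(prompt, max_new_tokens=128)[0]["generated_text"]
```

A real vector index matters once there are many chunks; for a single scraped page, a brute-force ranking like this one behaves the same way conceptually and is easier to debug.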


	### About Mode
	- Learn about the app’s tech stack and functionality.



	## Dependencies
	See `requirements.txt` for the full list. Key dependencies:
	- `streamlit==1.32.0`
	- `playwright==1.48.0`
	- `transformers==4.44.2`
	- `sentence-transformers==3.1.1`
	- `langchain==0.3.27`
	- `faiss-cpu==1.7.4`
	- `torch==2.2.0`
	- `tokenizers==0.19.1`

	## Troubleshooting
	- Build Errors: Check Hugging Face Spaces build logs or local Docker build output.
	- Scraping Failures: Verify the URL is accessible and not blocked by CAPTCHAs. Check `/app/cache/app.log`.
	- Model Loading Issues: Ensure `transformers==4.44.2` and `tokenizers==0.19.1` are installed correctly.
	- Resource Limits: Confirm at least 4 GB RAM and 2 CPU cores in Hugging Face Spaces settings.
	- Logs: Run `docker exec -it <container-id> cat /app/cache/app.log` to diagnose issues.
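
For model loading issues, one way to confirm that the pinned versions are the ones actually installed is a small check with `importlib.metadata`, run inside the container or virtual environment. This is a sketch, not part of the app: `find_mismatches` and `PINS` are hypothetical names, and only the two pins named above are included.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Pins copied from this README; extend with other entries from requirements.txt as needed.
PINS = {"transformers": "4.44.2", "tokenizers": "0.19.1"}


def find_mismatches(pins: dict[str, str]) -> dict[str, tuple[str, Optional[str]]]:
    """Map each package whose installed version differs from its pin to (expected, installed)."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

If the script prints nothing, both packages match their pins and the problem most likely lies elsewhere, for example in memory limits or a failed model download.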

	## License
	MIT License