muddasser/Webscrapping_Playwright

muddasser committed on Aug 28, 2025

Commit eb220be (verified) · 1 Parent(s): 27e4e42

Update README.md

README.md +105 -22

@@ -8,7 +8,7 @@ app_file: app.py
 app_port: 8501
 tags:
   - streamlit
-  - selenium
   - rag
   - flan-t5
   - web-scraping
@@ -16,33 +16,116 @@ pinned: true
 short_description: Selenium RAG using FLAN-T5-small
 ---
-# 🕷️ Web Scraping + RAG Chatbot
-This project combines **Selenium web scraping** with **Retrieval-Augmented Generation (RAG)** to build an intelligent chatbot that can extract information from websites and answer questions about the content.
-![Demo](https://img.shields.io/badge/Demo-Live%20Demo-blue)
-![Python](https://img.shields.io/badge/Python-3.10%2B-blue)
-![License](https://img.shields.io/badge/License-MIT-green)
-## ✨ Features
-- 🌐 **Web Scraping**: Extract content from dynamic websites using Selenium
-- 📚 **Vector Storage**: Index and retrieve content using FAISS embeddings
-- 🧠 **Question Answering**: Generate answers using FLAN-T5-small model
-- 🎨 **User-Friendly Interface**: Simple Streamlit UI for interaction
-- 🐳 **Dockerized**: Ready for deployment on Hugging Face Spaces
-## 🚀 Quick Start
-### Prerequisites
-- Python 3.10+
-- Docker (for containerized deployment)
-- Hugging Face account (for deployment)
-### Local Installation
-1. Clone the repository:
-```bash
-git clone https://huggingface.co/spaces/your-username/your-space-name
-cd your-space-name

 app_port: 8501
 tags:
   - streamlit
+  - playwright
   - rag
   - flan-t5
   - web-scraping
 short_description: Selenium RAG using FLAN-T5-small
 ---
+# Web Scraping + RAG Chatbot
+This is a Streamlit-based web application that combines web scraping with Retrieval-Augmented Generation (RAG) to create an intelligent chatbot. It scrapes content from a specified URL, indexes it using FAISS, and answers questions about the content using a Hugging Face model (`google/flan-t5-small`).
+## Features
+- **Web Scraping**: Uses Playwright to extract text from websites in headless Chromium.
+- **RAG Pipeline**: Indexes scraped content with `sentence-transformers/all-MiniLM-L6-v2` and FAISS, then answers questions using `google/flan-t5-small`.
+- **Interactive UI**: Built with Streamlit, offering modes for scraping, chatting, and viewing app details.
+- **Dockerized**: Runs in a containerized environment, optimized for Hugging Face Spaces or local deployment.
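The RAG Pipeline bullet above can be illustrated with a dependency-free sketch. This is not the app's code: per the README, the app embeds text with `sentence-transformers/all-MiniLM-L6-v2` and searches with FAISS, whereas this stand-in uses bag-of-words vectors and a brute-force cosine search so that it runs on the standard library alone. The names `chunk_text`, `embed`, `retrieve`, and `build_prompt`, the chunk size, and the prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 40) -> list[str]:
    """Split text into chunks of at most `chunk_size` words (size is an assumption)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text: str) -> Counter:
    """Stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question (brute force, not FAISS)."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a prompt for the generator model (wording is an assumption)."""
    context = "\n".join(context_chunks)
    return f"Answer the question using the context.\nContext: {context}\nQuestion: {question}"
```

In the real pipeline the retrieved chunks would be passed to `google/flan-t5-small` for answer generation; here `build_prompt` only shows how question and context might be combined.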
+## Tech Stack
+- **Python**: 3.10
+- **Web Scraping**: Playwright (`playwright==1.48.0`)
+- **RAG**: LangChain (`langchain==0.3.27`), FAISS (`faiss-cpu==1.7.4`), Hugging Face Transformers (`transformers==4.44.2`)
+- **Frontend**: Streamlit (`streamlit==1.32.0`)
+- **Container**: Docker (`python:3.10-slim` base image)
+## Setup Instructions
+### Local Development
+1. **Clone the Repository**:
+   ```bash
+   git clone <your-repo-url>
+   cd <your-repo-name>
+   ```
+2. **Build and Run with Docker**:
+   ```bash
+   docker build --no-cache -t web-scraping-rag .
+   docker run -p 8501:8501 web-scraping-rag
+   ```
+3. **Access the App**:
+   - Open http://localhost:8501 in your browser.
+   - Enter a URL (e.g., https://example.com) to scrape and ask questions about the content.
+4. **Check Logs**:
+   ```bash
+   docker exec -it <container-id> cat /app/cache/app.log
+   ```
+### Deploy to Hugging Face Spaces
+1. **Create a Space**:
+   - Go to Hugging Face Spaces.
+   - Create a new Space with the Docker template.
+2. **Push Code**:
+   ```bash
+   git add app.py Dockerfile requirements.txt README.md
+   git commit -m "Deploy Playwright-based web scraping RAG app"
+   git push
+   ```
+3. **Configure Space**:
+   - Ensure at least 4GB RAM and 2 CPU cores.
+   - Set the Space to public or private as needed.
+   - Monitor build logs for errors.
+4. **Access the App**:
+   - Visit `https://<your-username>-<space-name>.hf.space`.
+## Usage
+1. **Web Scraping Mode**:
+   - Enter a valid URL (e.g., https://example.com).
+   - Click "Scrape Website" to extract and index content.
+   - View scraped content in the expandable text area.
+2. **Chat with Content Mode**:
+   - Ask questions about the scraped content via the chat input.
+   - The app retrieves relevant chunks using FAISS and generates answers with FLAN-T5.
+3. **About Mode**:
+   - Learn about the app’s tech stack and functionality.
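Web Scraping Mode expects a valid URL. This README does not show how the app validates input, so the following is only a sketch using the standard library. `is_scrapable_url` is a hypothetical helper, and restricting input to absolute `http`/`https` URLs is an assumption.

```python
from urllib.parse import urlparse


def is_scrapable_url(url: str) -> bool:
    """Return True if `url` is an absolute http(s) URL with a host name."""
    try:
        parts = urlparse(url.strip())
        # Require both an http(s) scheme and a non-empty host.
        return parts.scheme in ("http", "https") and bool(parts.hostname)
    except ValueError:
        # urlparse raises ValueError for some malformed inputs (e.g. bad IPv6 literals).
        return False
```

A check like this only confirms the URL is well formed. It does not tell you whether the page is reachable or blocked by a CAPTCHA.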
+## Dependencies
+See `requirements.txt` for the full list. Key dependencies:
+- `streamlit==1.32.0`
+- `playwright==1.48.0`
+- `transformers==4.44.2`
+- `sentence-transformers==3.1.1`
+- `langchain==0.3.27`
+- `faiss-cpu==1.7.4`
+- `torch==2.2.0`
+- `tokenizers==0.19.1`
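Because the pins above are exact, a quick way to confirm an environment matches them is to compare each pin against the installed package metadata. The sketch below uses only `importlib.metadata` from the standard library. `parse_pins` and `find_mismatches` are hypothetical helpers, and the parser assumes simple `name==version` lines with no extras or environment markers.

```python
from importlib import metadata


def parse_pins(lines):
    """Parse `name==version` lines, skipping blanks, comments, and unpinned entries."""
    pins = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def _installed(name):
    """Return the installed version of `name`, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, installed_version=None):
    """Return {name: (wanted, found)} for packages whose installed version differs."""
    lookup = installed_version or _installed
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Inside the container, this could be run as `find_mismatches(parse_pins(open("requirements.txt")))`; an empty dict means every pinned package is installed at the expected version.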
+## Troubleshooting
+- **Build Errors**: Check Hugging Face Spaces build logs or local Docker build output.
+- **Scraping Failures**: Verify the URL is accessible and not blocked by CAPTCHAs. Check `/app/cache/app.log`.
+- **Model Loading Issues**: Ensure `transformers==4.44.2` and `tokenizers==0.19.1` are installed correctly.
+- **Resource Limits**: Confirm at least 4GB RAM and 2 CPU cores in Hugging Face Spaces settings.
+- **Logs**: Run `docker exec -it <container-id> cat /app/cache/app.log` to diagnose issues.
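When the log is long, it can help to look only at the most recent problem lines instead of dumping the whole file. The sketch below is one way to do that with the standard library. `tail_matching` is a hypothetical helper, and the `ERROR` keyword is an assumption, since the format of `/app/cache/app.log` is not documented here.

```python
from collections import deque
from pathlib import Path


def tail_matching(path, keyword="ERROR", limit=20):
    """Return the last `limit` lines of `path` that contain `keyword`."""
    # A bounded deque keeps only the most recent matches while streaming the file.
    matches = deque(maxlen=limit)
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if keyword in line:
                matches.append(line.rstrip("\n"))
    return list(matches)
```

For example, `tail_matching("/app/cache/app.log", limit=5)` run inside the container would return up to five of the latest matching lines.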
+## License
+MIT License