muddasser committed
Commit eb220be · verified · 1 Parent(s): 27e4e42

Update README.md

Files changed (1)
  1. README.md +105 -22
README.md CHANGED
@@ -8,7 +8,7 @@ app_file: app.py
  app_port: 8501
  tags:
  - streamlit
- - selenium
  - rag
  - flan-t5
  - web-scraping
@@ -16,33 +16,116 @@ pinned: true
  short_description: Selenium RAG using FLAN-T5-small
  ---

- # 🕷️ Web Scraping + RAG Chatbot

- This project combines **Selenium web scraping** with **Retrieval-Augmented Generation (RAG)** to build an intelligent chatbot that can extract information from websites and answer questions about the content.

- ![Demo](https://img.shields.io/badge/Demo-Live%20Demo-blue)
- ![Python](https://img.shields.io/badge/Python-3.10%2B-blue)
- ![License](https://img.shields.io/badge/License-MIT-green)

- ## Features

- - 🌐 **Web Scraping**: Extract content from dynamic websites using Selenium
- - 📚 **Vector Storage**: Index and retrieve content using FAISS embeddings
- - 🧠 **Question Answering**: Generate answers using FLAN-T5-small model
- - 🎨 **User-Friendly Interface**: Simple Streamlit UI for interaction
- - 🐳 **Dockerized**: Ready for deployment on Hugging Face Spaces

- ## 🚀 Quick Start

- ### Prerequisites

- - Python 3.10+
- - Docker (for containerized deployment)
- - Hugging Face account (for deployment)

- ### Local Installation

- 1. Clone the repository:
- ```bash
- git clone https://huggingface.co/spaces/your-username/your-space-name
- cd your-space-name

  app_port: 8501
  tags:
  - streamlit
+ - playwright
  - rag
  - flan-t5
  - web-scraping

  short_description: Selenium RAG using FLAN-T5-small
  ---

+ # Web Scraping + RAG Chatbot

+ This Streamlit application combines web scraping with Retrieval-Augmented Generation (RAG) to build a chatbot. It scrapes content from a user-supplied URL, indexes it with FAISS, and answers questions about that content using a Hugging Face model (`google/flan-t5-small`).

+ ## Features
+ - **Web Scraping**: Uses Playwright with headless Chromium to extract text from websites.
+ - **RAG Pipeline**: Indexes scraped content with `sentence-transformers/all-MiniLM-L6-v2` and FAISS, then answers questions using `google/flan-t5-small`.
+ - **Interactive UI**: Built with Streamlit, with modes for scraping, chatting, and viewing app details.
+ - **Dockerized**: Runs in a Docker container, suitable for Hugging Face Spaces or local deployment.
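
The scraping feature above could look roughly like the following. This is a minimal sketch, not the code in `app.py`: the function names `scrape_text` and `normalize_whitespace`, the 30-second timeout, and the choice to read the text of the `body` element are all assumptions made for illustration. It uses Playwright's synchronous Python API and imports it lazily, so the text-cleaning helper can be tried without a browser installed.

```python
import re


def normalize_whitespace(raw: str) -> str:
    """Collapse runs of spaces/tabs and drop empty lines from scraped text."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def scrape_text(url: str, timeout_ms: int = 30_000) -> str:
    """Return the visible text of a page rendered in headless Chromium.

    Hypothetical helper: the real app may wait for different events
    or select different elements.
    """
    # Imported lazily so normalize_whitespace works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=timeout_ms, wait_until="domcontentloaded")
            raw = page.inner_text("body")
        finally:
            # Always release the browser, even if navigation fails.
            browser.close()
    return normalize_whitespace(raw)
```

Calling `scrape_text("https://example.com")` requires the Chromium binary that `playwright install chromium` downloads.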

+ ## Tech Stack
+ - **Python**: 3.10
+ - **Web Scraping**: Playwright (`playwright==1.48.0`)
+ - **RAG**: LangChain (`langchain==0.3.27`), FAISS (`faiss-cpu==1.7.4`), Hugging Face Transformers (`transformers==4.44.2`)
+ - **Frontend**: Streamlit (`streamlit==1.32.0`)
+ - **Container**: Docker (`python:3.10-slim` base image)

+ ## Setup Instructions

+ ### Local Development
+ 1. **Clone the Repository**:
+    ```bash
+    git clone <your-repo-url>
+    cd <your-repo-name>
+    ```
+ 2. **Build and Run with Docker**:
+    ```bash
+    docker build --no-cache -t web-scraping-rag .
+    docker run -p 8501:8501 web-scraping-rag
+    ```
+ 3. **Access the App**:
+    - Open http://localhost:8501 in your browser.
+    - Enter a URL (e.g., https://example.com) to scrape, then ask questions about the content.
+ 4. **Check Logs**:
+    ```bash
+    docker exec -it <container-id> cat /app/cache/app.log
+    ```
+
+ ### Deploy to Hugging Face Spaces
+ 1. **Create a Space**:
+    - Go to Hugging Face Spaces.
+    - Create a new Space with the Docker template.
+ 2. **Push Code**:
+    ```bash
+    git add app.py Dockerfile requirements.txt README.md
+    git commit -m "Deploy Playwright-based web scraping RAG app"
+    git push
+    ```
+ 3. **Configure Space**:
+    - Ensure at least 4GB RAM and 2 CPU cores.
+    - Set the Space to public or private as needed.
+    - Monitor build logs for errors.
+ 4. **Access the App**:
+    - Visit `https://<your-username>-<space-name>.hf.space`.
+
+ ## Usage
+
+ ### Web Scraping Mode
+ - Enter a valid URL (e.g., https://example.com).
+ - Click "Scrape Website" to extract and index content.
+ - View scraped content in the expandable text area.
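
What counts as a "valid URL" is not spelled out in this README. One plausible check, sketched here under the assumption that only absolute `http` and `https` addresses with a host name should reach the scraper, uses the standard library's `urllib.parse`. The function name `is_scrapable_url` is hypothetical and does not come from `app.py`.

```python
from urllib.parse import urlparse


def is_scrapable_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host name."""
    try:
        parsed = urlparse(url.strip())
    except ValueError:
        # urlparse raises ValueError for malformed input such as an
        # unterminated IPv6 literal ("http://[::1").
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.hostname)
```

A check like this only validates the shape of the address. It does not tell you whether the site is reachable or blocks automated browsers.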
+
+ ### Chat with Content Mode
+ - Ask questions about the scraped content via the chat input.
+ - The app retrieves relevant chunks using FAISS and generates answers with FLAN-T5.
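
To make the retrieve-then-generate flow concrete, here is a toy version that runs with the standard library only. It is deliberately not the app's pipeline: word-count vectors with cosine similarity stand in for the `all-MiniLM-L6-v2` embeddings and the FAISS index, and the chunk sizes, function names, and prompt template are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str, chunk_size: int = 50, overlap: int = 10) -> list:
    """Split text into windows of chunk_size words that share overlap words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
        start += chunk_size - overlap
    return chunks


def bag_of_words(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def top_k_chunks(question: str, chunks: list, k: int = 2) -> list:
    """Stand-in for a FAISS search: rank chunks by similarity to the question."""
    query = bag_of_words(question)
    ranked = sorted(chunks, key=lambda chunk: cosine(query, bag_of_words(chunk)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: list) -> str:
    """Hypothetical prompt template for a text-to-text model such as FLAN-T5."""
    context = "\n".join(context_chunks)
    return (
        "Answer the question using the context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

In the real app, the string returned by `build_prompt` would be passed to the FLAN-T5 model. Because `google/flan-t5-small` is a small model with a limited input length, keeping `k` and the chunk size low is a reasonable default.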
+
+ ### About Mode
+ - Learn about the app’s tech stack and functionality.
+
+ ## Dependencies
+ See `requirements.txt` for the full list. Key dependencies:
+ - `streamlit==1.32.0`
+ - `playwright==1.48.0`
+ - `transformers==4.44.2`
+ - `sentence-transformers==3.1.1`
+ - `langchain==0.3.27`
+ - `faiss-cpu==1.7.4`
+ - `torch==2.2.0`
+ - `tokenizers==0.19.1`
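
When a pinned version is suspected of not matching what is installed, a short script can compare the two. The sketch below assumes the pins use the plain `name==version` form listed above. The helper names `parse_pins` and `find_mismatches` are hypothetical, and local build tags such as `+cpu` on the installed version are ignored on the assumption that they do not matter for this comparison.

```python
from importlib import metadata


def parse_pins(requirements_text: str) -> dict:
    """Map package name -> pinned version for lines of the form name==version."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins: dict) -> dict:
    """Return {name: (pinned, installed)} for missing or differently versioned packages."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            # Strip a local build tag such as "+cpu" before comparing.
            installed = metadata.version(name).split("+", 1)[0]
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

Inside the container, something like `find_mismatches(parse_pins(open("requirements.txt").read()))` returns an empty dictionary when every pin is satisfied.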
+
+ ## Troubleshooting
+ - **Build Errors**: Check the Hugging Face Spaces build logs or the local Docker build output.
+ - **Scraping Failures**: Verify that the URL is accessible and not blocked by a CAPTCHA, then check `/app/cache/app.log`.
+ - **Model Loading Issues**: Ensure `transformers==4.44.2` and `tokenizers==0.19.1` are installed correctly.
+ - **Resource Limits**: Confirm at least 4GB RAM and 2 CPU cores in the Hugging Face Spaces settings.
+ - **Logs**: Run `docker exec -it <container-id> cat /app/cache/app.log` to diagnose issues.
+
+ ## License
+ MIT License
+