---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# RAG Service: Grounded Q&A from Crawled Websites

## Overview

This repository contains a small-scale Retrieval-Augmented Generation (RAG) service that crawls a given website, indexes its content, and answers user questions strictly from the collected information. The system is built as a robust, practical demonstration of key RAG principles:

- **Polite web crawling:** respects `robots.txt` and stays within the starting domain.
- **Efficient indexing:** chunks text and builds a fast vector search index using FAISS.
- **Grounded generation:** answers are strictly supported by the retrieved content, with citations and clear refusals when the information is insufficient.
- **Observability:** tracks key performance metrics such as retrieval and generation latency.
- **Safety:** employs prompt hardening to prevent instruction hijacking and maintains content boundaries.

The service is deployed on a Hugging Face Space with an interactive Gradio UI, providing a simple three-step pipeline: Crawl ➡️ Index ➡️ Ask.
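The polite-crawling rules above (honoring `robots.txt` and staying on the starting domain) can be sketched with Python's standard `urllib.robotparser`. The helper names here are illustrative, not the repository's actual API:

```python
import urllib.robotparser
from urllib.parse import urlparse

def allowed_to_fetch(url: str, user_agent: str = "*") -> bool:
    """Consult the site's robots.txt before fetching (illustrative helper)."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()  # network call; treat failures as "do not crawl"
    except OSError:
        return False
    return rp.can_fetch(user_agent, url)

def same_domain(url: str, start_url: str) -> bool:
    """Enforce the domain boundary: only follow links on the starting host."""
    return urlparse(url).netloc == urlparse(start_url).netloc
```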
---

## Setup and Installation

This project is designed to run on a Hugging Face Space, but it can also be run locally.
1. **Clone the repository:**

   ```bash
   git clone https://github.com/your_username/your_repo.git
   cd your_repo
   ```

2. **Install dependencies:** the project relies on a few key libraries; install them using the provided `requirements.txt`:

   ```bash
   pip install -r requirements.txt
   ```
   The `requirements.txt` file is as follows:

   ```text
   gradio==4.44.0
   beautifulsoup4==4.12.3
   requests==2.31.0
   trafilatura==1.12.0
   sentence-transformers==2.7.0
   faiss-cpu==1.8.0
   transformers==4.44.0
   accelerate==0.33.0
   torch==2.4.0
   numpy==1.26.4
   ```

   Note: the models used (`all-MiniLM-L6-v2` and `flan-t5-base`) are downloaded automatically the first time you run the application.
3. **Run the application:**

   ```bash
   python app.py
   ```

   This launches the Gradio web interface, which you can open in your browser at the local URL printed to the console.
---

## How to Use

The application is structured as a simple three-tab workflow.
### 1. 🕷️ Crawl Website

- **Input:** enter a starting URL (e.g., https://docs.python.org/3/tutorial/).
- **Parameters:** adjust **Max Pages** and **Crawl Delay** (in seconds) to control the scope and politeness of the crawl.
- **Action:** click "🚀 Start Crawling". The system crawls the site, saving all extracted text to `./data/crawled_pages.json`, and displays a summary when the crawl completes.
### 2. 🗂️ Build Index

- **Input:** none required; this step uses the data from the crawl.
- **Action:** click "🔨 Build Index". The system loads the crawled data, chunks the text, generates embeddings, and builds a FAISS vector index. The index and metadata are saved to the `./index/` directory.
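The chunking performed in this step (800-character chunks with 100-character overlap, per the design table below) can be sketched as follows; the function name is illustrative:

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks with overlap (illustrative)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap  # each chunk starts 700 chars after the previous
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # final chunk reached the end of the text
    return chunks
```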
### 3. 💬 Ask Questions

- **Input:** enter a question related to the crawled content.
- **Parameters:** adjust **top-k** to change the number of most relevant chunks retrieved for grounding the answer.
- **Action:** click "🔍 Ask". The system will:
  - retrieve the most relevant text chunks,
  - form a grounded prompt,
  - generate an answer based on that prompt, and
  - display the answer, source URLs, snippets, and performance timings.
- **Refusals:** the system refuses to answer when the retrieved content is irrelevant or insufficient. For example, asking "What is the capital of France?" after crawling a Python documentation site results in a refusal.
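The retrieve-then-refuse behavior can be sketched as a pure function over the (score, text) pairs returned by the vector search. The 0.25 threshold comes from the design table below; the function name is illustrative:

```python
REFUSAL_THRESHOLD = 0.25  # similarity score below which the system refuses

def ground_or_refuse(scored_chunks, top_k=5, threshold=REFUSAL_THRESHOLD):
    """Build the grounding context from (score, text) pairs, or return None
    to signal a refusal when even the best match is below the threshold."""
    ranked = sorted(scored_chunks, key=lambda p: p[0], reverse=True)[:top_k]
    if not ranked or ranked[0][0] < threshold:
        return None  # caller renders a refusal message
    return "\n\n".join(text for _, text in ranked)
```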
---

## Architecture and Design Decisions

| Component | Technology Used | Rationale & Tradeoffs |
|---|---|---|
| Crawler | requests, BeautifulSoup4, trafilatura | requests for robust HTTP, trafilatura for high-quality content extraction (removes boilerplate), BeautifulSoup as a reliable fallback. |
| Embedding model | sentence-transformers/all-MiniLM-L6-v2 | Highly efficient and performant for its size; a good balance of speed, resource usage, and semantic quality. |
| Chunking strategy | Size: 800 chars, overlap: 100 chars | This size balances sufficient context for the LLM against the risk of retrieving irrelevant information; the overlap mitigates context loss where sentences are split across chunk boundaries. |
| Vector index | FAISS (IndexFlatIP) | Extremely fast and efficient for dense vector search, especially for a small in-memory index; no external database or complex setup required. |
| Generator (LLM) | google/flan-t5-base | A solid open-source text-to-text model small enough to run on a consumer-grade GPU or even CPU, making it suitable for a Hugging Face Space. |
| Interface/API | Gradio | A fast way to build an interactive UI while also exposing a REST API, fulfilling both CLI and API requirements. |
| Grounding & safety | Prompt engineering, similarity threshold (0.25) | Prompts are hardened to ignore instructions embedded in scraped text; a low relevance score automatically triggers a refusal, keeping answers truly grounded. |
| Observability | In-memory logging, np.percentile | Simple built-in logging captures query timings and computes p50/p95 metrics; sufficient for basic evaluation without external systems. |

**Limitations:** the system does not support JavaScript-heavy sites, binary file types (PDFs, images), or multi-domain crawling.
---

## Tooling & Prompts

- **LLM:** google/flan-t5-base
- **Embedding model:** sentence-transformers/all-MiniLM-L6-v2
- **Libraries:** requests, urllib, bs4, trafilatura, torch, transformers, sentence-transformers, faiss-cpu, gradio, numpy, accelerate
- **Prompt template:**
| 89 |
+
• You are a helpful assistant that answers questions STRICTLY based on the provided documents. Follow these rules:
|
| 90 |
+
• ...
|
| 91 |
+
•
|
| 92 |
+
• Documents:
|
| 93 |
+
• {context}
|
| 94 |
+
•
|
| 95 |
+
• Question: {query}
|
| 96 |
+
•
|
| 97 |
+
• Answer (based only on the documents above):
|
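Filling the template is a single `str.format` call; a minimal sketch (the elided rules are kept elided, and the helper name is illustrative):

```python
PROMPT_TEMPLATE = """You are a helpful assistant that answers questions STRICTLY based on the provided documents. Follow these rules:
...

Documents:
{context}

Question: {query}

Answer (based only on the documents above):"""

def build_prompt(context: str, query: str) -> str:
    """Insert retrieved context and the user question into the hardened template."""
    return PROMPT_TEMPLATE.format(context=context, query=query)
```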
---

## Evaluation & Examples

Here are example requests and responses demonstrating the service's functionality.
### Example 1: Answerable Query

Scenario: crawled a website for a fictional company called "StellarTech."

- **Question:** What are the core products of StellarTech?
- **Answer:** The core products of StellarTech are StellarFlow, a data visualization tool, and StellarSync, a cloud storage solution.
- **Sources:**
  - Source 1: https://stellartech.com/about (snippet about StellarTech's mission and product suite)
  - Source 2: https://stellartech.com/products/stellarflow (snippet describing StellarFlow as a data tool)
  - Source 3: https://stellartech.com/products/stellarsync (snippet describing StellarSync as a cloud storage solution)
- **Timings:** retrieval_ms: 12.34, generation_ms: 1245.67, total_ms: 1258.01
### Example 2: Unanswerable Query (Refusal)

- **Question:** What is the latest stock price for StellarTech?
- **Answer:** I couldn't find relevant information in the crawled content to answer this question. The closest match had a relevance score of 0.18, which is below the threshold.
- **Sources:**
  - Source 1: https://stellartech.com/contact (snippet with the company's contact information)
- **Timings:** retrieval_ms: 9.87, generation_ms: 12.11, total_ms: 21.98

This example demonstrates the grounding check: the system issues a clear refusal when the retrieved information does not support an answer, which is a key requirement of the project.
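The per-query timings shown above feed the p50/p95 observability metrics via `np.percentile`; the latency values below are illustrative:

```python
import numpy as np

# Per-query total latencies in milliseconds, as logged in memory (illustrative values).
total_ms = [1258.01, 21.98, 980.4, 1105.7, 33.2]

# p50 is the median; p95 highlights tail latency.
p50, p95 = np.percentile(total_ms, [50, 95])
```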