KavyaBansal committed on commit c301224 (verified; parent: 3d6c7fb)

Update README.md

Files changed (1): README.md (+104 −0)
README.md CHANGED

short_description: Crawls urls and the related pages
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# RAG Service: Grounded Q&A from Crawled Websites

## Overview

This repository contains a small-scale Retrieval-Augmented Generation (RAG) service designed to crawl a given website, index its content, and answer user questions strictly based on the collected information. The system is built as a robust, practical demonstration of key RAG principles:

- **Polite web crawling:** respects robots.txt and domain boundaries.
- **Efficient indexing:** chunks text and builds a fast vector search index using FAISS.
- **Grounded generation:** answers are strictly supported by the retrieved content, with citations and clear refusals when information is insufficient.
- **Observability:** tracks key performance metrics such as retrieval and generation latency.
- **Safety:** employs prompt hardening to prevent instruction hijacking and maintains content boundaries.

The service is deployed on a Hugging Face Space with an interactive Gradio UI, providing a simple three-step pipeline: Crawl ➡️ Index ➡️ Ask.
---

## Setup and Installation

This project is designed to run on a Hugging Face Space, but it can also be run locally.

1. **Clone the repository:**

   ```bash
   git clone https://github.com/your_username/your_repo.git
   cd your_repo
   ```

2. **Install dependencies:** the project relies on a few key libraries; install them using the requirements.txt file provided.

   ```bash
   pip install -r requirements.txt
   ```

   The requirements.txt file is as follows:

   ```text
   gradio==4.44.0
   beautifulsoup4==4.12.3
   requests==2.31.0
   trafilatura==1.12.0
   sentence-transformers==2.7.0
   faiss-cpu==1.8.0
   transformers==4.44.0
   accelerate==0.33.0
   torch==2.4.0
   numpy==1.26.4
   ```

   Note: the models used (all-MiniLM-L6-v2 and flan-t5-base) are downloaded automatically the first time you run the application.

3. **Run the application:**

   ```bash
   python app.py
   ```

   This launches the Gradio web interface, which you can access in your browser at the specified local URL.
---

## How to Use

The application is structured as a simple three-tab workflow.

### 1. 🕷️ Crawl Website

- **Input:** enter a starting URL (e.g., https://docs.python.org/3/tutorial/).
- **Parameters:** adjust Max Pages and Crawl Delay (in seconds) to control the scope and politeness of the crawl.
- **Action:** click "🚀 Start Crawling". The system crawls the site, saving all extracted text to ./data/crawled_pages.json, and displays a summary on completion.
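A minimal sketch of the two politeness checks the crawl step relies on, domain scoping and robots.txt rules. Function names here are illustrative, not the app's actual API; the standard library's `RobotFileParser` can be fed robots.txt text directly, so the rules are checked offline once fetched:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def in_scope(url: str, seed_url: str) -> bool:
    """Domain-boundary check: only follow links on the seed's host."""
    return urlparse(url).netloc == urlparse(seed_url).netloc

# Parse robots.txt rules from already-fetched text (no network needed here).
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

allowed = robots.can_fetch("*", "https://example.com/docs/intro")
blocked = robots.can_fetch("*", "https://example.com/private/keys")
print(allowed, blocked)  # True False
```

In the real crawler these checks sit in front of every `requests.get`, together with the user-configured delay between fetches.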
### 2. 🗂️ Build Index

- **Input:** none required; this step uses the data from the crawl.
- **Action:** click "🔨 Build Index". The system loads the crawled data, chunks the text, generates embeddings, and builds a FAISS vector index. The index and metadata are saved to the ./index/ directory.
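The chunking described in the design table (800-character windows with 100-character overlap) can be sketched as a simple sliding window; the real pipeline then embeds each chunk with all-MiniLM-L6-v2 and adds the vectors to a FAISS index. This helper is an illustrative sketch, not the app's exact code:

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-size sliding window; consecutive chunks share `overlap`
    characters so sentences cut at a boundary survive in one piece."""
    chunks = []
    start = 0
    step = size - overlap  # advance 700 chars per chunk by default
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks

chunks = chunk_text("x" * 1700)
print(len(chunks))  # 3 chunks: [0:800], [700:1500], [1400:1700]
```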
### 3. 💬 Ask Questions

- **Input:** enter a question related to the crawled content.
- **Parameters:** adjust top-k to change the number of most relevant chunks retrieved for grounding the answer.
- **Action:** click "🔍 Ask". The system will:
  1. retrieve relevant text chunks;
  2. form a grounded prompt;
  3. generate an answer from that prompt;
  4. display the answer, source URLs, snippets, and performance timings.
- **Refusals:** the system refuses to answer if the retrieved content is irrelevant or insufficient. For example, asking "What is the capital of France?" after crawling a Python documentation site results in a refusal.
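The grounding check behind the refusal behaviour can be sketched with plain NumPy: over L2-normalised embeddings, the inner-product score that FAISS's IndexFlatIP returns is cosine similarity, and a best score below the threshold (0.25 in this project) triggers a refusal. Names below are illustrative:

```python
import numpy as np

REFUSAL_THRESHOLD = 0.25  # from the design table; tune per corpus

def retrieve(query_vec: np.ndarray, index_vecs: np.ndarray, top_k: int = 3):
    """Inner-product search over unit vectors (what IndexFlatIP computes).
    Returns (scores, indices), best match first."""
    scores = index_vecs @ query_vec
    order = np.argsort(-scores)[:top_k]
    return scores[order], order

def should_refuse(top_scores: np.ndarray) -> bool:
    """Refuse when even the best chunk is not relevant enough."""
    return bool(top_scores.size == 0 or top_scores[0] < REFUSAL_THRESHOLD)
```

When `should_refuse` fires, the app skips generation entirely, which is why the refusal example later in this README shows a much smaller generation_ms.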
---

## Architecture and Design Decisions

| Component | Technology Used | Rationale & Tradeoffs |
| --- | --- | --- |
| Crawler | requests, BeautifulSoup4, trafilatura | requests for robust HTTP; trafilatura for high-quality content extraction (removes boilerplate); BeautifulSoup as a reliable fallback. |
| Embedding Model | sentence-transformers/all-MiniLM-L6-v2 | Highly efficient and performant for its size; a good balance of speed, resource usage, and semantic quality. |
| Chunking Strategy | Size: 800 chars, overlap: 100 chars | Balances sufficient context for the LLM against the risk of retrieving irrelevant information; the overlap prevents splitting sentences. |
| Vector Index | FAISS (IndexFlatIP) | Extremely fast and efficient for dense vector search, especially a small in-memory index; no external database or complex setup required. |
| Generator (LLM) | google/flan-t5-base | A solid open-source text-to-text model small enough to run on a consumer-grade GPU or even CPU, making it suitable for a Hugging Face Space. |
| Interface/API | Gradio | A fast way to build an interactive UI while also exposing a REST API (/api/predict), fulfilling both CLI and API requirements. |
| Grounding & Safety | Prompt engineering, similarity threshold (0.25) | Prompts are hardened to ignore instructions in scraped text; a low relevance score automatically triggers a refusal, keeping answers grounded. |
| Observability | In-memory logging, np.percentile | Simple built-in logging captures query timings and computes p50/p95 metrics; sufficient for basic evaluation without external systems. |
| Limitations | N/A | No support for JavaScript-heavy sites, binary file types (PDFs, images), or multi-domain crawling. |
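The observability row boils down to a short np.percentile computation over recorded per-query totals. A minimal sketch, assuming a flat in-memory list; the app's actual log structure may differ:

```python
import numpy as np

query_log: list[dict] = []  # one entry appended per answered query

def log_query(retrieval_ms: float, generation_ms: float) -> None:
    """Record one query's timings for later aggregation."""
    query_log.append({
        "retrieval_ms": retrieval_ms,
        "generation_ms": generation_ms,
        "total_ms": retrieval_ms + generation_ms,
    })

def latency_summary() -> dict:
    """p50/p95 of end-to-end latency over everything logged so far."""
    totals = np.array([q["total_ms"] for q in query_log])
    return {
        "p50_ms": float(np.percentile(totals, 50)),
        "p95_ms": float(np.percentile(totals, 95)),
    }
```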
---

## Tooling & Prompts

- **LLM:** google/flan-t5-base
- **Embedding model:** sentence-transformers/all-MiniLM-L6-v2
- **Libraries:** requests, urllib, bs4, trafilatura, torch, transformers, sentence-transformers, faiss-cpu, gradio, numpy, accelerate
- **Prompt template:**

  ```text
  You are a helpful assistant that answers questions STRICTLY based on the provided documents. Follow these rules:
  ...

  Documents:
  {context}

  Question: {query}

  Answer (based only on the documents above):
  ```
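Assembling the grounded prompt from the template is a small formatting step. The sketch below uses illustrative names, keeps the rule list abbreviated exactly as the template above elides it, and tags each retrieved chunk with a numbered source URL so the model can cite it:

```python
PROMPT_TEMPLATE = """You are a helpful assistant that answers questions STRICTLY based on the provided documents. Follow these rules:
...

Documents:
{context}

Question: {query}

Answer (based only on the documents above):"""

def build_prompt(query: str, chunks: list[tuple[str, str]]) -> str:
    """chunks: (source_url, chunk_text) pairs, best match first."""
    context = "\n\n".join(
        f"[Source {i}: {url}]\n{text}"
        for i, (url, text) in enumerate(chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, query=query)
```

Numbering the sources in the context is what lets the answer step map citations back to URLs in the UI.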
---

## Evaluation & Examples

Here are example requests and responses demonstrating the service's functionality.

### Example 1: Answerable Query

Scenario: crawled a website about a fictional company called "StellarTech."

- **Question:** What are the core products of StellarTech?
- **Answer:** The core products of StellarTech are StellarFlow, a data visualization tool, and StellarSync, a cloud storage solution.
- **Sources:**
  1. https://stellartech.com/about (snippet about StellarTech's mission and product suite)
  2. https://stellartech.com/products/stellarflow (snippet describing StellarFlow as a data tool)
  3. https://stellartech.com/products/stellarsync (snippet describing StellarSync as a cloud storage solution)
- **Timings:** retrieval_ms: 12.34, generation_ms: 1245.67, total_ms: 1258.01

### Example 2: Unanswerable Query (Refusal)

- **Question:** What is the latest stock price for StellarTech?
- **Answer:** I couldn't find relevant information in the crawled content to answer this question. The closest match had a relevance score of 0.18, which is below the threshold.
- **Sources:**
  1. https://stellartech.com/contact (snippet with the company's contact information)
- **Timings:** retrieval_ms: 9.87, generation_ms: 12.11, total_ms: 21.98

This example demonstrates the grounding check and a clear refusal when the retrieved information does not support an answer, a key requirement of the project.