KavyaBansal committed on commit c301224 (verified; parent: 3d6c7fb)

Update README.md

Files changed (1): README.md (+104 −0)
README.md CHANGED

short_description: Crawls urls and the related pages
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# RAG Service: Grounded Q&A from Crawled Websites

## Overview

This repository contains a small-scale Retrieval-Augmented Generation (RAG) service designed to crawl a given website, index its content, and answer user questions strictly based on the collected information. The system is built as a robust, practical demonstration of key RAG principles:

- **Polite web crawling:** respects robots.txt and domain boundaries.
- **Efficient indexing:** chunks text and builds a fast vector search index using FAISS.
- **Grounded generation:** answers are strictly supported by the retrieved content, with citations and clear refusals when information is insufficient.
- **Observability:** tracks key performance metrics such as retrieval and generation latency.
- **Safety:** employs prompt hardening to prevent instruction hijacking and maintains content boundaries.

The service is deployed on a Hugging Face Space with an interactive Gradio UI, providing a simple three-step pipeline: Crawl ➡️ Index ➡️ Ask.
---

## Setup and Installation

This project is designed to run on a Hugging Face Space, but it can also be run locally.

1. **Clone the repository:**

   ```bash
   git clone https://github.com/your_username/your_repo.git
   cd your_repo
   ```

2. **Install dependencies:** the project relies on a few key libraries; install them using the requirements.txt file provided.

   ```bash
   pip install -r requirements.txt
   ```

   The requirements.txt file is as follows:

   ```text
   gradio==4.44.0
   beautifulsoup4==4.12.3
   requests==2.31.0
   trafilatura==1.12.0
   sentence-transformers==2.7.0
   faiss-cpu==1.8.0
   transformers==4.44.0
   accelerate==0.33.0
   torch==2.4.0
   numpy==1.26.4
   ```

   Note: the models used (all-MiniLM-L6-v2 and flan-t5-base) are downloaded automatically the first time you run the application.

3. **Run the application:**

   ```bash
   python app.py
   ```

   This launches the Gradio web interface, which you can access in your browser at the specified local URL.
---

## How to Use

The application is structured as a simple three-tab workflow.

### 1. 🕷️ Crawl Website

- **Input:** enter a starting URL (e.g., https://docs.python.org/3/tutorial/).
- **Parameters:** adjust Max Pages and Crawl Delay (in seconds) to control the scope and politeness of the crawl.
- **Action:** click "🚀 Start Crawling". The system crawls the site, saving all extracted text to ./data/crawled_pages.json, and displays a summary on completion.
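A minimal sketch of the two politeness checks the crawl step relies on, domain scoping and robots.txt rules. Function names here are illustrative, not the app's actual API; the standard library's `RobotFileParser` can be fed robots.txt text directly, so the rules are checked offline once fetched:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def in_scope(url: str, seed_url: str) -> bool:
    """Domain-boundary check: only follow links on the seed's host."""
    return urlparse(url).netloc == urlparse(seed_url).netloc

# Parse robots.txt rules from already-fetched text (no network needed here).
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

allowed = robots.can_fetch("*", "https://example.com/docs/intro")
blocked = robots.can_fetch("*", "https://example.com/private/keys")
print(allowed, blocked)  # True False
```

In the real crawler these checks sit in front of every `requests.get`, together with the user-configured delay between fetches.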
### 2. 🗂️ Build Index

- **Input:** none required; this step uses the data from the crawl.
- **Action:** click "🔨 Build Index". The system loads the crawled data, chunks the text, generates embeddings, and builds a FAISS vector index. The index and metadata are saved to the ./index/ directory.
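The chunking described in the design table (800-character windows with 100-character overlap) can be sketched as a simple sliding window; the real pipeline then embeds each chunk with all-MiniLM-L6-v2 and adds the vectors to a FAISS index. This helper is an illustrative sketch, not the app's exact code:

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-size sliding window; consecutive chunks share `overlap`
    characters so sentences cut at a boundary survive in one piece."""
    chunks = []
    start = 0
    step = size - overlap  # advance 700 chars per chunk by default
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks

chunks = chunk_text("x" * 1700)
print(len(chunks))  # 3 chunks: [0:800], [700:1500], [1400:1700]
```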
### 3. 💬 Ask Questions

- **Input:** enter a question related to the crawled content.
- **Parameters:** adjust top-k to change the number of most relevant chunks retrieved for grounding the answer.
- **Action:** click "🔍 Ask". The system will:
  1. retrieve relevant text chunks;
  2. form a grounded prompt;
  3. generate an answer from that prompt;
  4. display the answer, source URLs, snippets, and performance timings.
- **Refusals:** the system refuses to answer if the retrieved content is irrelevant or insufficient. For example, asking "What is the capital of France?" after crawling a Python documentation site results in a refusal.
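The grounding check behind the refusal behaviour can be sketched with plain NumPy: over L2-normalised embeddings, the inner-product score that FAISS's IndexFlatIP returns is cosine similarity, and a best score below the threshold (0.25 in this project) triggers a refusal. Names below are illustrative:

```python
import numpy as np

REFUSAL_THRESHOLD = 0.25  # from the design table; tune per corpus

def retrieve(query_vec: np.ndarray, index_vecs: np.ndarray, top_k: int = 3):
    """Inner-product search over unit vectors (what IndexFlatIP computes).
    Returns (scores, indices), best match first."""
    scores = index_vecs @ query_vec
    order = np.argsort(-scores)[:top_k]
    return scores[order], order

def should_refuse(top_scores: np.ndarray) -> bool:
    """Refuse when even the best chunk is not relevant enough."""
    return bool(top_scores.size == 0 or top_scores[0] < REFUSAL_THRESHOLD)
```

When `should_refuse` fires, the app skips generation entirely, which is why the refusal example later in this README shows a much smaller generation_ms.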
---

## Architecture and Design Decisions

| Component | Technology Used | Rationale & Tradeoffs |
| --- | --- | --- |
| Crawler | requests, BeautifulSoup4, trafilatura | requests for robust HTTP; trafilatura for high-quality content extraction (removes boilerplate); BeautifulSoup as a reliable fallback. |
| Embedding Model | sentence-transformers/all-MiniLM-L6-v2 | Highly efficient and performant for its size; a good balance of speed, resource usage, and semantic quality. |
| Chunking Strategy | Size: 800 chars, overlap: 100 chars | Balances sufficient context for the LLM against the risk of retrieving irrelevant information; the overlap prevents splitting sentences. |
| Vector Index | FAISS (IndexFlatIP) | Extremely fast and efficient for dense vector search, especially a small in-memory index; no external database or complex setup required. |
| Generator (LLM) | google/flan-t5-base | A solid open-source text-to-text model small enough to run on a consumer-grade GPU or even CPU, making it suitable for a Hugging Face Space. |
| Interface/API | Gradio | A fast way to build an interactive UI while also exposing a REST API (/api/predict), fulfilling both CLI and API requirements. |
| Grounding & Safety | Prompt engineering, similarity threshold (0.25) | Prompts are hardened to ignore instructions in scraped text; a low relevance score automatically triggers a refusal, keeping answers grounded. |
| Observability | In-memory logging, np.percentile | Simple built-in logging captures query timings and computes p50/p95 metrics; sufficient for basic evaluation without external systems. |
| Limitations | N/A | No support for JavaScript-heavy sites, binary file types (PDFs, images), or multi-domain crawling. |
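The observability row boils down to a short np.percentile computation over recorded per-query totals. A minimal sketch, assuming a flat in-memory list; the app's actual log structure may differ:

```python
import numpy as np

query_log: list[dict] = []  # one entry appended per answered query

def log_query(retrieval_ms: float, generation_ms: float) -> None:
    """Record one query's timings for later aggregation."""
    query_log.append({
        "retrieval_ms": retrieval_ms,
        "generation_ms": generation_ms,
        "total_ms": retrieval_ms + generation_ms,
    })

def latency_summary() -> dict:
    """p50/p95 of end-to-end latency over everything logged so far."""
    totals = np.array([q["total_ms"] for q in query_log])
    return {
        "p50_ms": float(np.percentile(totals, 50)),
        "p95_ms": float(np.percentile(totals, 95)),
    }
```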
---

## Tooling & Prompts

- **LLM:** google/flan-t5-base
- **Embedding model:** sentence-transformers/all-MiniLM-L6-v2
- **Libraries:** requests, urllib, bs4, trafilatura, torch, transformers, sentence-transformers, faiss-cpu, gradio, numpy, accelerate
- **Prompt template:**

  ```text
  You are a helpful assistant that answers questions STRICTLY based on the provided documents. Follow these rules:
  ...

  Documents:
  {context}

  Question: {query}

  Answer (based only on the documents above):
  ```
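Assembling the grounded prompt from the template is a small formatting step. The sketch below uses illustrative names, keeps the rule list abbreviated exactly as the template above elides it, and tags each retrieved chunk with a numbered source URL so the model can cite it:

```python
PROMPT_TEMPLATE = """You are a helpful assistant that answers questions STRICTLY based on the provided documents. Follow these rules:
...

Documents:
{context}

Question: {query}

Answer (based only on the documents above):"""

def build_prompt(query: str, chunks: list[tuple[str, str]]) -> str:
    """chunks: (source_url, chunk_text) pairs, best match first."""
    context = "\n\n".join(
        f"[Source {i}: {url}]\n{text}"
        for i, (url, text) in enumerate(chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, query=query)
```

Numbering the sources in the context is what lets the answer step map citations back to URLs in the UI.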
---

## Evaluation & Examples

Here are example requests and responses demonstrating the service's functionality.

### Example 1: Answerable Query

Scenario: crawled a website about a fictional company called "StellarTech."

- **Question:** What are the core products of StellarTech?
- **Answer:** The core products of StellarTech are StellarFlow, a data visualization tool, and StellarSync, a cloud storage solution.
- **Sources:**
  1. https://stellartech.com/about (snippet about StellarTech's mission and product suite)
  2. https://stellartech.com/products/stellarflow (snippet describing StellarFlow as a data tool)
  3. https://stellartech.com/products/stellarsync (snippet describing StellarSync as a cloud storage solution)
- **Timings:** retrieval_ms: 12.34, generation_ms: 1245.67, total_ms: 1258.01

### Example 2: Unanswerable Query (Refusal)

- **Question:** What is the latest stock price for StellarTech?
- **Answer:** I couldn't find relevant information in the crawled content to answer this question. The closest match had a relevance score of 0.18, which is below the threshold.
- **Sources:**
  1. https://stellartech.com/contact (snippet with the company's contact information)
- **Timings:** retrieval_ms: 9.87, generation_ms: 12.11, total_ms: 21.98

This example demonstrates the grounding check and a clear refusal when the retrieved information does not support an answer, a key requirement of the project.