---
title: Crawler
emoji: 🕷️
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
short_description: Crawls URLs and their linked pages
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# RAG Service: Grounded Q&A from Crawled Websites

## Overview
This repository contains a small-scale Retrieval-Augmented Generation (RAG) service designed to crawl a given website, index its content, and answer user questions strictly based on the collected information. The system is built to be a robust, practical demonstration of key RAG principles, including:
- **Polite Web Crawling:** Respects robots.txt and domain boundaries.
- **Efficient Indexing:** Chunks text and builds a fast vector search index using FAISS.
- **Grounded Generation:** Answers are strictly supported by the retrieved content, with citations and clear refusals when information is insufficient.
- **Observability:** Tracks key performance metrics such as retrieval and generation latency.
- **Safety:** Employs prompt hardening to prevent instruction hijacking and maintains content boundaries.
The service is deployed on a Hugging Face Space with an interactive Gradio UI, providing a simple three-step pipeline: Crawl ➡️ Index ➡️ Ask.
________________________________________
## Setup and Installation
This project is designed to be run on a Hugging Face Space, but it can also be run locally.
1. **Clone the repository:**

   ```bash
   git clone https://github.com/your_username/your_repo.git
   cd your_repo
   ```
2. **Install dependencies:** The project relies on a few key libraries. Install them from the provided requirements.txt:

   ```bash
   pip install -r requirements.txt
   ```

   The requirements.txt file is as follows:

   ```text
   gradio==4.44.0
   beautifulsoup4==4.12.3
   requests==2.31.0
   trafilatura==1.12.0
   sentence-transformers==2.7.0
   faiss-cpu==1.8.0
   transformers==4.44.0
   accelerate==0.33.0
   torch==2.4.0
   numpy==1.26.4
   ```
   **Note:** The models used (all-MiniLM-L6-v2 and flan-t5-base) are downloaded automatically the first time you run the application; a pre-download snippet is shown after these steps.
3. **Run the application:**

   ```bash
   python app.py
   ```
This will launch the Gradio web interface, which you can access in your browser at the specified local URL.
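
Because the first launch blocks on those model downloads, you can optionally warm the Hugging Face cache beforehand. A one-off snippet (the model names are the ones this project uses; the snippet itself is not part of the repository):

```python
# Optional: pre-download both models so the first `python app.py` run
# doesn't stall on network fetches.
from sentence_transformers import SentenceTransformer
from transformers import pipeline

SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
pipeline("text2text-generation", model="google/flan-t5-base")
```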
________________________________________
## How to Use
The application is structured into a simple three-tab workflow.
### 1. 🕷️ Crawl Website
- **Input:** Enter a starting URL (e.g., https://docs.python.org/3/tutorial/).
- **Parameters:** Adjust Max Pages and Crawl Delay (in seconds) to control the scope and politeness of the crawl.
- **Action:** Click "🚀 Start Crawling". The system crawls from the starting URL, saves all extracted text to ./data/crawled_pages.json, and displays a summary upon completion. A minimal sketch of this crawl loop follows below.
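
For reference, here is a sketch of what a polite, single-domain crawl loop like this can look like. Function and variable names are illustrative assumptions, not the app's actual internals:

```python
# A sketch of a polite, single-domain crawl loop (illustrative names).
import time
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

import requests
import trafilatura
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=20, delay=1.0):
    parsed = urlparse(start_url)
    robots = RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()  # fetch and parse robots.txt once

    queue, seen, pages = deque([start_url]), {start_url}, []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # stay polite: skip disallowed paths
        resp = requests.get(url, timeout=10)
        text = trafilatura.extract(resp.text) or ""  # boilerplate removal
        if text:
            pages.append({"url": url, "text": text})
        # Enqueue links, but never leave the starting domain.
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == parsed.netloc and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay)  # crawl delay between requests
    return pages
```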
### 2. 🏗️ Build Index
- **Input:** No input required; this step uses the data from the crawl.
- **Action:** Click "🔨 Build Index". The system loads the crawled data, chunks the text, generates embeddings, and builds a FAISS vector index. The index and metadata are saved to the ./index/ directory. See the sketch after this list.
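
The indexing step follows a chunk → embed → index pattern. A minimal sketch, assuming the 800-character chunks with 100-character overlap and the normalized inner-product (cosine) FAISS index described in the architecture table below; function names and the metadata layout are illustrative:

```python
# A sketch of the chunk -> embed -> index step. Inner product on
# unit-normalized embeddings is equivalent to cosine similarity.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def chunk_text(text, size=800, overlap=100):
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_index(pages):
    chunks, meta = [], []
    for page in pages:
        for c in chunk_text(page["text"]):
            chunks.append(c)
            meta.append({"url": page["url"], "snippet": c[:200]})
    emb = model.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(emb.shape[1])  # exact inner-product search
    index.add(np.asarray(emb, dtype="float32"))
    return index, meta
```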
### 3. 💬 Ask Questions
- **Input:** Enter a question related to the crawled content.
- **Parameters:** Adjust top-k to change the number of most relevant chunks retrieved for grounding the answer.
- **Action:** Click "🔍 Ask". The system will:
  - Retrieve relevant text chunks.
  - Form a grounded prompt.
  - Generate an answer based on the prompt.
  - Display the answer, source URLs, snippets, and performance timings.
- **Refusals:** The system refuses to answer if the retrieved content is irrelevant or insufficient. For example, asking "What is the capital of France?" after crawling a Python documentation site results in a refusal. A sketch of this retrieve-check-generate flow follows below.
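
Putting the pieces together, the ask step can be sketched as retrieve → relevance check → generate. This reuses `model` plus the `index` and `meta` returned by `build_index` in the previous sketch; the 0.25 threshold matches the architecture table below, but the exact refusal logic in app.py may differ:

```python
# A sketch of the retrieve -> relevance check -> generate flow.
import numpy as np
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")
THRESHOLD = 0.25  # cosine-similarity floor below which we refuse

def ask(query, index, meta, top_k=5):
    q = model.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), top_k)
    if scores[0][0] < THRESHOLD:
        return "I couldn't find relevant information in the crawled content."
    context = "\n\n".join(
        f"[{meta[i]['url']}]\n{meta[i]['snippet']}" for i in ids[0] if i != -1
    )
    prompt = (
        "Answer STRICTLY from the documents below.\n\n"
        f"Documents:\n{context}\n\nQuestion: {query}\n\nAnswer:"
    )
    return generator(prompt, max_new_tokens=128)[0]["generated_text"]
```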
________________________________________
## Architecture and Design Decisions
| Component | Technology Used | Rationale & Tradeoffs |
| --- | --- | --- |
| Crawler | requests, BeautifulSoup4, trafilatura | requests for robust HTTP; trafilatura for high-quality content extraction (removes boilerplate); BeautifulSoup as a reliable fallback. |
| Embedding Model | sentence-transformers/all-MiniLM-L6-v2 | Highly efficient for its size; a good balance of speed, resource usage, and semantic quality. |
| Chunking Strategy | Size: 800 chars, overlap: 100 chars | Balances sufficient context for the LLM against the risk of retrieving irrelevant information; the overlap prevents splitting sentences across chunk boundaries. |
| Vector Index | FAISS (IndexFlatIP) | Extremely fast and efficient for dense vector search at this small, in-memory scale; no external database or complex setup required. |
| Generator (LLM) | google/flan-t5-base | A solid open-source text-to-text model small enough to run on a consumer-grade GPU or even CPU, making it suitable for a Hugging Face Space. |
| Interface/API | Gradio | A fast way to build an interactive UI while also exposing a REST API (/api/predict), fulfilling both UI and API requirements. |
| Grounding & Safety | Prompt engineering, similarity threshold (0.25) | Prompts are hardened to ignore instructions embedded in scraped text; a low relevance score automatically triggers a refusal, keeping answers grounded. |
| Observability | In-memory logging, np.percentile | Simple built-in logging captures query timings and computes p50/p95 metrics (see the sketch below); sufficient for basic evaluation without external systems. |
| Limitations | — | No support for JavaScript-heavy sites, binary file types (PDFs, images), or multi-domain crawling. |
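The Observability row needs nothing more than a list of per-query timings and `np.percentile`. A minimal sketch, reusing the `ask` function from the earlier sketch (names are illustrative):

```python
# In-memory timing log with p50/p95 latency computation.
import time
import numpy as np

timings = []  # one entry per answered query

def timed_ask(query, index, meta):
    t0 = time.perf_counter()
    answer = ask(query, index, meta)
    timings.append({"total_ms": (time.perf_counter() - t0) * 1000})
    return answer

def latency_report():
    totals = [t["total_ms"] for t in timings]
    return {
        "p50_ms": float(np.percentile(totals, 50)),
        "p95_ms": float(np.percentile(totals, 95)),
    }
```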
________________________________________
## Tooling & Prompts
- **LLM:** google/flan-t5-base
- **Embedding Model:** sentence-transformers/all-MiniLM-L6-v2
- **Libraries:** requests, urllib, bs4, trafilatura, torch, transformers, sentence-transformers, faiss-cpu, gradio, numpy, accelerate
- **Prompt Template:**

  ```text
  You are a helpful assistant that answers questions STRICTLY based on the provided documents. Follow these rules:
  ...

  Documents:
  {context}

  Question: {query}

  Answer (based only on the documents above):
  ```
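
The rules elided as `...` above are not reproduced in this README, so the sketch below only shows how the visible skeleton of the template might be filled in from retrieved chunks; `TEMPLATE`, `build_prompt`, and the source-tag format are illustrative assumptions:

```python
# Illustrative assembly of the prompt template. The "..." rules are kept
# elided here, exactly as in the README above.
TEMPLATE = (
    "You are a helpful assistant that answers questions STRICTLY based on "
    "the provided documents. Follow these rules:\n...\n\n"
    "Documents:\n{context}\n\nQuestion: {query}\n\n"
    "Answer (based only on the documents above):"
)

def build_prompt(chunks, query):
    # Tag each chunk with its source URL so the UI can show citations.
    context = "\n\n".join(
        f"[Source {i + 1}: {c['url']}]\n{c['snippet']}"
        for i, c in enumerate(chunks)
    )
    return TEMPLATE.format(context=context, query=query)
```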
________________________________________
## Evaluation & Examples
Here are example requests and responses demonstrating the service's functionality.
### Example 1: Answerable Query

Scenario: Crawled a website about a fictional company called "StellarTech."

- **Question:** What are the core products of StellarTech?
- **Answer:** The core products of StellarTech are StellarFlow, a data visualization tool, and StellarSync, a cloud storage solution.
- **Sources:**
  - Source 1: https://stellartech.com/about (snippet about StellarTech's mission and product suite)
  - Source 2: https://stellartech.com/products/stellarflow (snippet describing StellarFlow as a data tool)
  - Source 3: https://stellartech.com/products/stellarsync (snippet describing StellarSync as a cloud storage solution)
- **Timings:** retrieval_ms: 12.34, generation_ms: 1245.67, total_ms: 1258.01
### Example 2: Unanswerable Query (Refusal)

- **Question:** What is the latest stock price for StellarTech?
- **Answer:** I couldn't find relevant information in the crawled content to answer this question. The closest match had a relevance score of 0.18, which is below the threshold.
- **Sources:**
  - Source 1: https://stellartech.com/contact (snippet about the company's contact information)
- **Timings:** retrieval_ms: 9.87, generation_ms: 12.11, total_ms: 21.98
This example demonstrates the grounding check and a clear refusal when the retrieved information does not support an answer, which is a key requirement of the project.