---
title: Crawler
emoji: π
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
short_description: Crawls URLs and their related pages
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# RAG Service: Grounded Q&A from Crawled Websites

## Overview

This repository contains a small-scale Retrieval-Augmented Generation (RAG) service designed to crawl a given website, index its content, and answer user questions strictly based on the collected information. The system is built to be a robust, practical demonstration of key RAG principles, including:
- **Polite web crawling:** respects robots.txt and domain boundaries.
- **Efficient indexing:** chunks text and builds a fast vector search index using FAISS.
- **Grounded generation:** answers are strictly supported by the retrieved content, with citations and clear refusals when information is insufficient.
- **Observability:** tracks key performance metrics such as retrieval and generation latency.
- **Safety:** employs prompt hardening to prevent instruction hijacking and maintains content boundaries.

The service is deployed on a Hugging Face Space with an interactive Gradio UI, providing a simple three-step pipeline: Crawl → Index → Ask.
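The politeness rules above can be sketched with the standard library's `urllib.robotparser`. This is an illustrative sketch, not the actual `app.py` implementation; the function names and the `RAGCrawler` user agent are assumptions.

```python
# Sketch of the crawl-politeness checks described above (robots.txt and
# domain boundaries). Names here are illustrative, not the real app.py code.
from urllib import robotparser
from urllib.parse import urljoin, urlparse

def is_allowed(url: str, user_agent: str = "RAGCrawler") -> bool:
    """Consult the site's robots.txt before fetching a page."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(root, "/robots.txt"))
    try:
        rp.read()
    except OSError:
        return False  # be conservative if robots.txt is unreachable
    return rp.can_fetch(user_agent, url)

def same_domain(url: str, seed: str) -> bool:
    """Keep the crawl within the seed URL's domain boundary."""
    return urlparse(url).netloc == urlparse(seed).netloc
```

A crawler loop would typically call `same_domain` first (cheap, no network) and `is_allowed` only for in-domain links, then sleep for the configured crawl delay between fetches.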
---

## Setup and Installation

This project is designed to run on a Hugging Face Space, but it can also be run locally.
1. **Clone the repository:**

   ```bash
   git clone https://github.com/your_username/your_repo.git
   cd your_repo
   ```

2. **Install dependencies:** the project relies on a few key libraries; install them using the provided `requirements.txt`:

   ```bash
   pip install -r requirements.txt
   ```

   The `requirements.txt` file is as follows:

   ```
   gradio==4.44.0
   beautifulsoup4==4.12.3
   requests==2.31.0
   trafilatura==1.12.0
   sentence-transformers==2.7.0
   faiss-cpu==1.8.0
   transformers==4.44.0
   accelerate==0.33.0
   torch==2.4.0
   numpy==1.26.4
   ```

   Note: the models used (all-MiniLM-L6-v2 and flan-t5-base) are downloaded automatically the first time you run the application.
3. **Run the application:**

   ```bash
   python app.py
   ```

   This launches the Gradio web interface, which you can access in your browser at the printed local URL.
---

## How to Use

The application is structured as a simple three-tab workflow.
### 1. Crawl Website

- **Input:** enter a starting URL (e.g., https://docs.python.org/3/tutorial/).
- **Parameters:** adjust **Max Pages** and **Crawl Delay** (in seconds) to control the scope and politeness of the crawl.
- **Action:** click **Start Crawling**. The system crawls the site, saving all extracted text to `./data/crawled_pages.json`, and displays a summary of the crawl on completion.
### 2. Build Index

- **Input:** none required; this step uses the data from the crawl.
- **Action:** click **Build Index**. The system loads the crawled data, chunks the text, generates embeddings, and builds a FAISS vector index. The index and metadata are saved to the `./index/` directory.
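The Build Index step (chunk → embed → FAISS) can be sketched as below. Function names and file paths are illustrative rather than the actual `app.py` code; the chunk size, overlap, and model name come from the design table in this README.

```python
# Minimal sketch of the indexing pipeline described above. The 800-char
# chunks with 100-char overlap match the chunking strategy in this README.
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_index(pages: list[dict], out_path: str = "index/faiss.index"):
    """Embed all chunks and store them in a FAISS inner-product index."""
    import faiss  # heavy dependencies imported lazily to keep the sketch portable
    import numpy as np
    from sentence_transformers import SentenceTransformer

    chunks = [c for p in pages for c in chunk_text(p["text"])]
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    # Normalized embeddings make inner product equivalent to cosine similarity,
    # which is why IndexFlatIP works as a relevance score here.
    emb = model.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(np.asarray(emb, dtype="float32"))
    faiss.write_index(index, out_path)
    return index, chunks
```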
### 3. Ask Questions

- **Input:** enter a question related to the crawled content.
- **Parameters:** adjust **top-k** to change the number of most relevant chunks retrieved for grounding the answer.
- **Action:** click **Ask**. The system will:
  - retrieve relevant text chunks,
  - form a grounded prompt,
  - generate an answer based on the prompt, and
  - display the answer, source URLs, snippets, and performance timings.
- **Refusals:** the system refuses to answer if the retrieved content is irrelevant or insufficient. For example, asking "What is the capital of France?" after crawling a Python documentation site will result in a refusal.
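The refusal behavior described above comes down to a relevance-score check. Here is an illustrative version; the 0.25 similarity threshold appears in the design notes below, while the function itself is a sketch rather than the actual `app.py` code.

```python
# Grounding check: refuse when the best retrieved chunk scores below the
# similarity threshold (0.25 per this README's design notes).
import numpy as np

def answer_or_refuse(scores: np.ndarray, threshold: float = 0.25):
    """Return (best chunk index, best score), or (None, best score) to refuse."""
    best = float(scores.max()) if scores.size else 0.0
    if best < threshold:
        return None, best  # caller renders the refusal message
    return int(scores.argmax()), best
```

For instance, the "capital of France" question against a Python-docs index would produce only low similarity scores, so the first branch fires and the UI shows a refusal instead of a hallucinated answer.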
---

## Architecture and Design Decisions
| Component | Technology Used | Rationale & Tradeoffs |
|---|---|---|
| Crawler | requests, BeautifulSoup4, trafilatura | `requests` for robust HTTP; `trafilatura` for high-quality content extraction (removes boilerplate); BeautifulSoup as a reliable fallback. |
| Embedding model | sentence-transformers/all-MiniLM-L6-v2 | A highly efficient and performant model for its size; a good balance of speed, resource usage, and semantic quality. |
| Chunking strategy | Size: 800 chars, overlap: 100 chars | Balances sufficient context for the LLM against the risk of retrieving irrelevant information; overlap prevents splitting sentences across chunks. |
| Vector index | FAISS (IndexFlatIP) | Extremely fast and efficient dense vector search, especially for a small-scale, in-memory index; no external database or complex setup required. |
| Generator (LLM) | google/flan-t5-base | A solid, open-source text-to-text model small enough to run on a consumer-grade GPU or even CPU, making it suitable for a Hugging Face Space. |
| Interface/API | Gradio | A fast way to build an interactive UI while also exposing a REST API (/api/predict), fulfilling both CLI and API requirements. |
| Grounding & safety | Prompt engineering, similarity threshold (0.25) | Prompts are hardened to ignore instructions in scraped text; a low relevance score automatically triggers a refusal, ensuring answers are truly grounded. |
| Observability | In-memory logging, np.percentile | Simple, built-in logging captures query timings and calculates p50/p95 metrics; sufficient for basic evaluation without complex external systems. |
| Limitations | | No support for JavaScript-heavy sites, binary file types (PDFs, images), or multi-domain crawling. |
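The p50/p95 computation mentioned in the Observability row is a one-liner with `np.percentile`. The list of timings below is fabricated example data, not measurements from this service.

```python
# Sketch of the latency-percentile metrics described in the design table.
import numpy as np

retrieval_ms = [12.3, 9.8, 15.1, 11.0, 40.2]  # example: one entry per query
p50, p95 = np.percentile(retrieval_ms, [50, 95])
```

p50 (the median) summarizes typical latency, while p95 exposes tail latency, such as the occasional slow query hidden by an average.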
---

## Tooling & Prompts
- **LLM:** google/flan-t5-base
- **Embedding model:** sentence-transformers/all-MiniLM-L6-v2
- **Libraries:** requests, urllib, bs4, trafilatura, torch, transformers, sentence-transformers, faiss-cpu, gradio, numpy, accelerate
- **Prompt template:**

  ```
  You are a helpful assistant that answers questions STRICTLY based on the provided documents. Follow these rules:
  ...

  Documents:
  {context}

  Question: {query}

  Answer (based only on the documents above):
  ```
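Filling this template might look like the sketch below. The rules elided by `...` in the template are not reproduced, and `build_prompt` is illustrative rather than the actual `app.py` code.

```python
# Sketch of assembling the grounded prompt from retrieved chunks. The template
# string mirrors the (partially elided) template shown in this README.
TEMPLATE = (
    "You are a helpful assistant that answers questions STRICTLY based on "
    "the provided documents.\n\nDocuments:\n{context}\n\n"
    "Question: {query}\n\nAnswer (based only on the documents above):"
)

def build_prompt(chunks: list[str], query: str) -> str:
    """Number each retrieved chunk so the answer can cite its sources."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return TEMPLATE.format(context=context, query=query)
```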
---

## Evaluation & Examples

Here are example requests and responses demonstrating the service's functionality.
### Example 1: Answerable Query

**Scenario:** crawled a website about a fictional company called "StellarTech."

- **Question:** What are the core products of StellarTech?
- **Answer:** The core products of StellarTech are StellarFlow, a data visualization tool, and StellarSync, a cloud storage solution.
- **Sources:**
  - https://stellartech.com/about (snippet about StellarTech's mission and product suite)
  - https://stellartech.com/products/stellarflow (snippet describing StellarFlow as a data tool)
  - https://stellartech.com/products/stellarsync (snippet describing StellarSync as a cloud storage solution)
- **Timings:** retrieval_ms: 12.34, generation_ms: 1245.67, total_ms: 1258.01
### Example 2: Unanswerable Query (Refusal)

- **Question:** What is the latest stock price for StellarTech?
- **Answer:** I couldn't find relevant information in the crawled content to answer this question. The closest match had a relevance score of 0.18, which is below the threshold.
- **Sources:**
  - https://stellartech.com/contact (snippet about the company's contact information)
- **Timings:** retrieval_ms: 9.87, generation_ms: 12.11, total_ms: 21.98

This example demonstrates the grounding check and a clear refusal when the retrieved information does not support an answer, which is a key requirement of the project.