---
title: Crawler
emoji: 🕷️
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
short_description: Crawls URLs and their linked pages
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# RAG Service: Grounded Q&A from Crawled Websites

## Overview
This repository contains a small-scale Retrieval-Augmented Generation (RAG) service designed to crawl a given website, index its content, and answer user questions strictly based on the collected information. The system is built to be a robust, practical demonstration of key RAG principles, including:
- **Polite Web Crawling:** Respects robots.txt and domain boundaries.
- **Efficient Indexing:** Chunks text and builds a fast vector search index using FAISS.
- **Grounded Generation:** Answers are strictly supported by the retrieved content, with citations and clear refusals when information is insufficient.
- **Observability:** Tracks key performance metrics such as retrieval and generation latency.
- **Safety:** Employs prompt hardening to prevent instruction hijacking and maintains content boundaries.
The service is deployed on a Hugging Face Space with an interactive Gradio UI, providing a simple three-step pipeline: Crawl ➡️ Index ➡️ Ask.
________________________________________
## Setup and Installation
This project is designed to be run on a Hugging Face Space, but it can also be run locally.
1. **Clone the repository:**

   ```bash
   git clone https://github.com/your_username/your_repo.git
   cd your_repo
   ```
2. **Install dependencies:** The project relies on a few key libraries. Install them from the provided requirements.txt:

   ```bash
   pip install -r requirements.txt
   ```

   The requirements.txt file is as follows:

   ```text
   gradio==4.44.0
   beautifulsoup4==4.12.3
   requests==2.31.0
   trafilatura==1.12.0
   sentence-transformers==2.7.0
   faiss-cpu==1.8.0
   transformers==4.44.0
   accelerate==0.33.0
   torch==2.4.0
   numpy==1.26.4
   ```
   **Note:** The models used (all-MiniLM-L6-v2 and flan-t5-base) are downloaded automatically the first time you run the application; a pre-download snippet is shown after these steps.
3. **Run the application:**

   ```bash
   python app.py
   ```
This will launch the Gradio web interface, which you can access in your browser at the specified local URL.
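
Because the first launch blocks on those model downloads, you can optionally warm the Hugging Face cache beforehand. A one-off snippet (the model names are the ones this project uses; the snippet itself is not part of the repository):

```python
# Optional: pre-download both models so the first `python app.py` run
# doesn't stall on network fetches.
from sentence_transformers import SentenceTransformer
from transformers import pipeline

SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
pipeline("text2text-generation", model="google/flan-t5-base")
```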
________________________________________
## How to Use
The application is structured into a simple three-tab workflow.
### 1. 🕷️ Crawl Website
- **Input:** Enter a starting URL (e.g., https://docs.python.org/3/tutorial/).
- **Parameters:** Adjust Max Pages and Crawl Delay (in seconds) to control the scope and politeness of the crawl.
- **Action:** Click "🚀 Start Crawling". The system crawls from the starting URL, saves all extracted text to ./data/crawled_pages.json, and displays a summary upon completion. A minimal sketch of this crawl loop follows below.
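
For reference, here is a sketch of what a polite, single-domain crawl loop like this can look like. Function and variable names are illustrative assumptions, not the app's actual internals:

```python
# A sketch of a polite, single-domain crawl loop (illustrative names).
import time
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

import requests
import trafilatura
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=20, delay=1.0):
    parsed = urlparse(start_url)
    robots = RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()  # fetch and parse robots.txt once

    queue, seen, pages = deque([start_url]), {start_url}, []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # stay polite: skip disallowed paths
        resp = requests.get(url, timeout=10)
        text = trafilatura.extract(resp.text) or ""  # boilerplate removal
        if text:
            pages.append({"url": url, "text": text})
        # Enqueue links, but never leave the starting domain.
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == parsed.netloc and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay)  # crawl delay between requests
    return pages
```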
### 2. 🏗️ Build Index
- **Input:** No input required; this step uses the data from the crawl.
- **Action:** Click "🔨 Build Index". The system loads the crawled data, chunks the text, generates embeddings, and builds a FAISS vector index. The index and metadata are saved to the ./index/ directory. See the sketch after this list.
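
The indexing step follows a chunk → embed → index pattern. A minimal sketch, assuming the 800-character chunks with 100-character overlap and the normalized inner-product (cosine) FAISS index described in the architecture table below; function names and the metadata layout are illustrative:

```python
# A sketch of the chunk -> embed -> index step. Inner product on
# unit-normalized embeddings is equivalent to cosine similarity.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def chunk_text(text, size=800, overlap=100):
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_index(pages):
    chunks, meta = [], []
    for page in pages:
        for c in chunk_text(page["text"]):
            chunks.append(c)
            meta.append({"url": page["url"], "snippet": c[:200]})
    emb = model.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(emb.shape[1])  # exact inner-product search
    index.add(np.asarray(emb, dtype="float32"))
    return index, meta
```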
### 3. 💬 Ask Questions
- **Input:** Enter a question related to the crawled content.
- **Parameters:** Adjust top-k to change the number of most relevant chunks retrieved for grounding the answer.
- **Action:** Click "🔍 Ask". The system will:
  - Retrieve relevant text chunks.
  - Form a grounded prompt.
  - Generate an answer based on the prompt.
  - Display the answer, source URLs, snippets, and performance timings.
- **Refusals:** The system refuses to answer if the retrieved content is irrelevant or insufficient. For example, asking "What is the capital of France?" after crawling a Python documentation site results in a refusal. A sketch of this retrieve-check-generate flow follows below.
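
Putting the pieces together, the ask step can be sketched as retrieve → relevance check → generate. This reuses `model` plus the `index` and `meta` returned by `build_index` in the previous sketch; the 0.25 threshold matches the architecture table below, but the exact refusal logic in app.py may differ:

```python
# A sketch of the retrieve -> relevance check -> generate flow.
import numpy as np
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")
THRESHOLD = 0.25  # cosine-similarity floor below which we refuse

def ask(query, index, meta, top_k=5):
    q = model.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), top_k)
    if scores[0][0] < THRESHOLD:
        return "I couldn't find relevant information in the crawled content."
    context = "\n\n".join(
        f"[{meta[i]['url']}]\n{meta[i]['snippet']}" for i in ids[0] if i != -1
    )
    prompt = (
        "Answer STRICTLY from the documents below.\n\n"
        f"Documents:\n{context}\n\nQuestion: {query}\n\nAnswer:"
    )
    return generator(prompt, max_new_tokens=128)[0]["generated_text"]
```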
________________________________________
## Architecture and Design Decisions
| Component | Technology Used | Rationale & Tradeoffs |
| --- | --- | --- |
| Crawler | requests, BeautifulSoup4, trafilatura | requests for robust HTTP; trafilatura for high-quality content extraction (removes boilerplate); BeautifulSoup as a reliable fallback. |
| Embedding Model | sentence-transformers/all-MiniLM-L6-v2 | Highly efficient for its size; a good balance of speed, resource usage, and semantic quality. |
| Chunking Strategy | Size: 800 chars, overlap: 100 chars | Balances sufficient context for the LLM against the risk of retrieving irrelevant information; the overlap prevents splitting sentences across chunk boundaries. |
| Vector Index | FAISS (IndexFlatIP) | Extremely fast and efficient for dense vector search at this small, in-memory scale; no external database or complex setup required. |
| Generator (LLM) | google/flan-t5-base | A solid open-source text-to-text model small enough to run on a consumer-grade GPU or even CPU, making it suitable for a Hugging Face Space. |
| Interface/API | Gradio | A fast way to build an interactive UI while also exposing a REST API (/api/predict), fulfilling both UI and API requirements. |
| Grounding & Safety | Prompt engineering, similarity threshold (0.25) | Prompts are hardened to ignore instructions embedded in scraped text; a low relevance score automatically triggers a refusal, keeping answers grounded. |
| Observability | In-memory logging, np.percentile | Simple built-in logging captures query timings and computes p50/p95 metrics (see the sketch below); sufficient for basic evaluation without external systems. |
| Limitations | — | No support for JavaScript-heavy sites, binary file types (PDFs, images), or multi-domain crawling. |
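The Observability row needs nothing more than a list of per-query timings and `np.percentile`. A minimal sketch, reusing the `ask` function from the earlier sketch (names are illustrative):

```python
# In-memory timing log with p50/p95 latency computation.
import time
import numpy as np

timings = []  # one entry per answered query

def timed_ask(query, index, meta):
    t0 = time.perf_counter()
    answer = ask(query, index, meta)
    timings.append({"total_ms": (time.perf_counter() - t0) * 1000})
    return answer

def latency_report():
    totals = [t["total_ms"] for t in timings]
    return {
        "p50_ms": float(np.percentile(totals, 50)),
        "p95_ms": float(np.percentile(totals, 95)),
    }
```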
________________________________________
## Tooling & Prompts
- **LLM:** google/flan-t5-base
- **Embedding Model:** sentence-transformers/all-MiniLM-L6-v2
- **Libraries:** requests, urllib, bs4, trafilatura, torch, transformers, sentence-transformers, faiss-cpu, gradio, numpy, accelerate
- **Prompt Template:**

  ```text
  You are a helpful assistant that answers questions STRICTLY based on the provided documents. Follow these rules:
  ...

  Documents:
  {context}

  Question: {query}

  Answer (based only on the documents above):
  ```
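
The rules elided as `...` above are not reproduced in this README, so the sketch below only shows how the visible skeleton of the template might be filled in from retrieved chunks; `TEMPLATE`, `build_prompt`, and the source-tag format are illustrative assumptions:

```python
# Illustrative assembly of the prompt template. The "..." rules are kept
# elided here, exactly as in the README above.
TEMPLATE = (
    "You are a helpful assistant that answers questions STRICTLY based on "
    "the provided documents. Follow these rules:\n...\n\n"
    "Documents:\n{context}\n\nQuestion: {query}\n\n"
    "Answer (based only on the documents above):"
)

def build_prompt(chunks, query):
    # Tag each chunk with its source URL so the UI can show citations.
    context = "\n\n".join(
        f"[Source {i + 1}: {c['url']}]\n{c['snippet']}"
        for i, c in enumerate(chunks)
    )
    return TEMPLATE.format(context=context, query=query)
```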
________________________________________
## Evaluation & Examples
Here are example requests and responses demonstrating the service's functionality.
### Example 1: Answerable Query

Scenario: Crawled a website about a fictional company called "StellarTech."

- **Question:** What are the core products of StellarTech?
- **Answer:** The core products of StellarTech are StellarFlow, a data visualization tool, and StellarSync, a cloud storage solution.
- **Sources:**
  - Source 1: https://stellartech.com/about (snippet about StellarTech's mission and product suite)
  - Source 2: https://stellartech.com/products/stellarflow (snippet describing StellarFlow as a data tool)
  - Source 3: https://stellartech.com/products/stellarsync (snippet describing StellarSync as a cloud storage solution)
- **Timings:** retrieval_ms: 12.34, generation_ms: 1245.67, total_ms: 1258.01
### Example 2: Unanswerable Query (Refusal)

- **Question:** What is the latest stock price for StellarTech?
- **Answer:** I couldn't find relevant information in the crawled content to answer this question. The closest match had a relevance score of 0.18, which is below the threshold.
- **Sources:**
  - Source 1: https://stellartech.com/contact (snippet about the company's contact information)
- **Timings:** retrieval_ms: 9.87, generation_ms: 12.11, total_ms: 21.98
This example demonstrates the grounding check and a clear refusal when the retrieved information does not support an answer, which is a key requirement of the project.