---
title: Crawler
emoji: π
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
short_description: Crawls URLs and their related pages
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# RAG Service: Grounded Q&A from Crawled Websites

## Overview

This repository contains a small-scale Retrieval-Augmented Generation (RAG) service designed to crawl a given website, index its content, and answer user questions strictly based on the collected information. The system is built to be a robust, practical demonstration of key RAG principles, including:
- **Polite web crawling:** respects robots.txt and domain boundaries.
- **Efficient indexing:** chunks text and builds a fast vector search index using FAISS.
- **Grounded generation:** answers are strictly supported by the retrieved content, with citations and clear refusals when information is insufficient.
- **Observability:** tracks key performance metrics such as retrieval and generation latency.
- **Safety:** employs prompt hardening to prevent instruction hijacking and maintains content boundaries.

The service is deployed on a Hugging Face Space with an interactive Gradio UI, providing a simple three-step pipeline: Crawl → Index → Ask.
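The politeness rules above can be sketched with the standard library's `urllib.robotparser`. This is an illustrative sketch, not the actual `app.py` implementation; the function names and the `RAGCrawler` user agent are assumptions.

```python
# Sketch of the crawl-politeness checks described above (robots.txt and
# domain boundaries). Names here are illustrative, not the real app.py code.
from urllib import robotparser
from urllib.parse import urljoin, urlparse

def is_allowed(url: str, user_agent: str = "RAGCrawler") -> bool:
    """Consult the site's robots.txt before fetching a page."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(root, "/robots.txt"))
    try:
        rp.read()
    except OSError:
        return False  # be conservative if robots.txt is unreachable
    return rp.can_fetch(user_agent, url)

def same_domain(url: str, seed: str) -> bool:
    """Keep the crawl within the seed URL's domain boundary."""
    return urlparse(url).netloc == urlparse(seed).netloc
```

A crawler loop would typically call `same_domain` first (cheap, no network) and `is_allowed` only for in-domain links, then sleep for the configured crawl delay between fetches.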
---

## Setup and Installation

This project is designed to run on a Hugging Face Space, but it can also be run locally.
1. **Clone the repository:**

   ```bash
   git clone https://github.com/your_username/your_repo.git
   cd your_repo
   ```

2. **Install dependencies:** the project relies on a few key libraries; install them using the provided `requirements.txt`:

   ```bash
   pip install -r requirements.txt
   ```

   The `requirements.txt` file is as follows:

   ```
   gradio==4.44.0
   beautifulsoup4==4.12.3
   requests==2.31.0
   trafilatura==1.12.0
   sentence-transformers==2.7.0
   faiss-cpu==1.8.0
   transformers==4.44.0
   accelerate==0.33.0
   torch==2.4.0
   numpy==1.26.4
   ```

   Note: the models used (all-MiniLM-L6-v2 and flan-t5-base) are downloaded automatically the first time you run the application.
3. **Run the application:**

   ```bash
   python app.py
   ```

   This launches the Gradio web interface, which you can access in your browser at the printed local URL.
---

## How to Use

The application is structured as a simple three-tab workflow.
### 1. Crawl Website

- **Input:** enter a starting URL (e.g., https://docs.python.org/3/tutorial/).
- **Parameters:** adjust **Max Pages** and **Crawl Delay** (in seconds) to control the scope and politeness of the crawl.
- **Action:** click **Start Crawling**. The system crawls the site, saving all extracted text to `./data/crawled_pages.json`, and displays a summary of the crawl on completion.
### 2. Build Index

- **Input:** none required; this step uses the data from the crawl.
- **Action:** click **Build Index**. The system loads the crawled data, chunks the text, generates embeddings, and builds a FAISS vector index. The index and metadata are saved to the `./index/` directory.
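The Build Index step (chunk → embed → FAISS) can be sketched as below. Function names and file paths are illustrative rather than the actual `app.py` code; the chunk size, overlap, and model name come from the design table in this README.

```python
# Minimal sketch of the indexing pipeline described above. The 800-char
# chunks with 100-char overlap match the chunking strategy in this README.
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_index(pages: list[dict], out_path: str = "index/faiss.index"):
    """Embed all chunks and store them in a FAISS inner-product index."""
    import faiss  # heavy dependencies imported lazily to keep the sketch portable
    import numpy as np
    from sentence_transformers import SentenceTransformer

    chunks = [c for p in pages for c in chunk_text(p["text"])]
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    # Normalized embeddings make inner product equivalent to cosine similarity,
    # which is why IndexFlatIP works as a relevance score here.
    emb = model.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(np.asarray(emb, dtype="float32"))
    faiss.write_index(index, out_path)
    return index, chunks
```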
### 3. Ask Questions

- **Input:** enter a question related to the crawled content.
- **Parameters:** adjust **top-k** to change the number of most relevant chunks retrieved for grounding the answer.
- **Action:** click **Ask**. The system will:
  - retrieve relevant text chunks,
  - form a grounded prompt,
  - generate an answer based on the prompt, and
  - display the answer, source URLs, snippets, and performance timings.
- **Refusals:** the system refuses to answer if the retrieved content is irrelevant or insufficient. For example, asking "What is the capital of France?" after crawling a Python documentation site will result in a refusal.
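The refusal behavior described above comes down to a relevance-score check. Here is an illustrative version; the 0.25 similarity threshold appears in the design notes below, while the function itself is a sketch rather than the actual `app.py` code.

```python
# Grounding check: refuse when the best retrieved chunk scores below the
# similarity threshold (0.25 per this README's design notes).
import numpy as np

def answer_or_refuse(scores: np.ndarray, threshold: float = 0.25):
    """Return (best chunk index, best score), or (None, best score) to refuse."""
    best = float(scores.max()) if scores.size else 0.0
    if best < threshold:
        return None, best  # caller renders the refusal message
    return int(scores.argmax()), best
```

For instance, the "capital of France" question against a Python-docs index would produce only low similarity scores, so the first branch fires and the UI shows a refusal instead of a hallucinated answer.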
---

## Architecture and Design Decisions
| Component | Technology Used | Rationale & Tradeoffs |
|---|---|---|
| Crawler | requests, BeautifulSoup4, trafilatura | `requests` for robust HTTP; `trafilatura` for high-quality content extraction (removes boilerplate); BeautifulSoup as a reliable fallback. |
| Embedding model | sentence-transformers/all-MiniLM-L6-v2 | A highly efficient and performant model for its size; a good balance of speed, resource usage, and semantic quality. |
| Chunking strategy | Size: 800 chars, overlap: 100 chars | Balances sufficient context for the LLM against the risk of retrieving irrelevant information; overlap prevents splitting sentences across chunks. |
| Vector index | FAISS (IndexFlatIP) | Extremely fast and efficient dense vector search, especially for a small-scale, in-memory index; no external database or complex setup required. |
| Generator (LLM) | google/flan-t5-base | A solid, open-source text-to-text model small enough to run on a consumer-grade GPU or even CPU, making it suitable for a Hugging Face Space. |
| Interface/API | Gradio | A fast way to build an interactive UI while also exposing a REST API (/api/predict), fulfilling both CLI and API requirements. |
| Grounding & safety | Prompt engineering, similarity threshold (0.25) | Prompts are hardened to ignore instructions in scraped text; a low relevance score automatically triggers a refusal, ensuring answers are truly grounded. |
| Observability | In-memory logging, np.percentile | Simple, built-in logging captures query timings and calculates p50/p95 metrics; sufficient for basic evaluation without complex external systems. |
| Limitations | | No support for JavaScript-heavy sites, binary file types (PDFs, images), or multi-domain crawling. |
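The p50/p95 computation mentioned in the Observability row is a one-liner with `np.percentile`. The list of timings below is fabricated example data, not measurements from this service.

```python
# Sketch of the latency-percentile metrics described in the design table.
import numpy as np

retrieval_ms = [12.3, 9.8, 15.1, 11.0, 40.2]  # example: one entry per query
p50, p95 = np.percentile(retrieval_ms, [50, 95])
```

p50 (the median) summarizes typical latency, while p95 exposes tail latency, such as the occasional slow query hidden by an average.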
---

## Tooling & Prompts
- **LLM:** google/flan-t5-base
- **Embedding model:** sentence-transformers/all-MiniLM-L6-v2
- **Libraries:** requests, urllib, bs4, trafilatura, torch, transformers, sentence-transformers, faiss-cpu, gradio, numpy, accelerate
- **Prompt template:**

  ```
  You are a helpful assistant that answers questions STRICTLY based on the provided documents. Follow these rules:
  ...

  Documents:
  {context}

  Question: {query}

  Answer (based only on the documents above):
  ```
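Filling this template might look like the sketch below. The rules elided by `...` in the template are not reproduced, and `build_prompt` is illustrative rather than the actual `app.py` code.

```python
# Sketch of assembling the grounded prompt from retrieved chunks. The template
# string mirrors the (partially elided) template shown in this README.
TEMPLATE = (
    "You are a helpful assistant that answers questions STRICTLY based on "
    "the provided documents.\n\nDocuments:\n{context}\n\n"
    "Question: {query}\n\nAnswer (based only on the documents above):"
)

def build_prompt(chunks: list[str], query: str) -> str:
    """Number each retrieved chunk so the answer can cite its sources."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return TEMPLATE.format(context=context, query=query)
```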
---

## Evaluation & Examples

Here are example requests and responses demonstrating the service's functionality.
### Example 1: Answerable Query

**Scenario:** crawled a website about a fictional company called "StellarTech."

- **Question:** What are the core products of StellarTech?
- **Answer:** The core products of StellarTech are StellarFlow, a data visualization tool, and StellarSync, a cloud storage solution.
- **Sources:**
  - https://stellartech.com/about (snippet about StellarTech's mission and product suite)
  - https://stellartech.com/products/stellarflow (snippet describing StellarFlow as a data tool)
  - https://stellartech.com/products/stellarsync (snippet describing StellarSync as a cloud storage solution)
- **Timings:** retrieval_ms: 12.34, generation_ms: 1245.67, total_ms: 1258.01
### Example 2: Unanswerable Query (Refusal)

- **Question:** What is the latest stock price for StellarTech?
- **Answer:** I couldn't find relevant information in the crawled content to answer this question. The closest match had a relevance score of 0.18, which is below the threshold.
- **Sources:**
  - https://stellartech.com/contact (snippet about the company's contact information)
- **Timings:** retrieval_ms: 9.87, generation_ms: 12.11, total_ms: 21.98

This example demonstrates the grounding check and a clear refusal when the retrieved information does not support an answer, which is a key requirement of the project.