---
title: 80,000 Hours RAG Q&A
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---
# 🎯 80,000 Hours Career Advice Q&A
A Retrieval-Augmented Generation (RAG) system that answers career-related questions using content from [80,000 Hours](https://80000hours.org/), with validated citations.
## Features
- 🔍 **Semantic Search**: Retrieves relevant content from 80,000 Hours articles
- 🤖 **AI-Powered Answers**: Uses GPT-4o-mini to generate comprehensive responses
- ✅ **Citation Validation**: Automatically validates that quotes exist in the source material
- 📚 **Source Attribution**: Every answer includes validated citations with URLs
## How It Works
1. Your question is converted to a vector embedding
2. Relevant article chunks are retrieved from Qdrant vector database
3. GPT-4o-mini generates an answer with citations
4. Citations are validated against source material
5. You get an answer with verified quotes and source links
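The steps above can be sketched as a small retrieve-then-validate loop. This is an illustrative toy, not the real implementation: the actual app uses Qdrant vector search and GPT-4o-mini, while this sketch substitutes keyword overlap for retrieval and a substring check for validation, and all names and data here are hypothetical.

```python
def retrieve(question: str, chunks: list[dict], top_k: int = 2) -> list[dict]:
    """Rank chunks by naive keyword overlap with the question.

    The real system ranks by embedding similarity in Qdrant instead.
    """
    q_words = set(question.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c["text"].lower().split())),
        reverse=True,
    )
    return scored[:top_k]


def validate_citation(quote: str, sources: list[dict]) -> bool:
    """Check that a quoted passage actually appears in a retrieved chunk."""
    return any(quote in c["text"] for c in sources)


# Toy corpus standing in for the chunked 80,000 Hours articles.
chunks = [
    {"text": "Career capital means skills and connections.", "url": "https://80000hours.org/a"},
    {"text": "Unrelated text about something else.", "url": "https://80000hours.org/b"},
]
hits = retrieve("What is career capital?", chunks)
assert validate_citation("skills and connections", hits)
```

In the real pipeline the quote being validated comes from the model's generated answer, which is why the validation step matters: it catches quotes the model invented.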
## Configuration for Hugging Face Spaces
To deploy this app, you need to configure the following **Secrets** in your Space settings:
1. Go to your Space → Settings → Variables and Secrets
2. Add these secrets:
- `QDRANT_URL`: Your Qdrant cloud instance URL
- `QDRANT_API_KEY`: Your Qdrant API key
- `OPENAI_API_KEY`: Your OpenAI API key
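Spaces expose configured secrets to the app as environment variables, so `config.py` can read them with the standard library. A minimal sketch (the helper name `missing_secrets` is illustrative, not from the actual code):

```python
import os

# The three secrets listed above, as exposed in the Space's environment.
REQUIRED = ("QDRANT_URL", "QDRANT_API_KEY", "OPENAI_API_KEY")


def missing_secrets() -> list[str]:
    """Return the names of any required secrets not set in the environment."""
    return [name for name in REQUIRED if not os.environ.get(name)]
```

Failing fast on missing secrets at startup gives a clearer error than a cryptic connection failure deep inside the RAG pipeline.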
## Local Development
### Setup
1. Install dependencies:
```bash
pip install -r requirements.txt
```
2. Create a `.env` file with:
```
QDRANT_URL=your_url
QDRANT_API_KEY=your_key
OPENAI_API_KEY=your_key
```
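Loading that file is usually handled by a library such as `python-dotenv`, but what it does is simple enough to sketch with the standard library (this loader is a simplified stand-in: it handles `KEY=value` lines and `#` comments, with no quoting rules):

```python
import os


def load_env(path: str = ".env") -> None:
    """Minimal .env loader: set each KEY=value line as an environment
    variable, skipping blanks and '#' comments. Existing variables win."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```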
### First Time Setup (run in order):
1. **Extract articles** → `python extract_articles_cli.py`
- Scrapes 80,000 Hours articles from sitemap
- Only needed once (or to refresh content)
2. **Chunk articles** → `python chunk_articles_cli.py`
- Splits articles into semantic chunks
3. **Upload to Qdrant** → `python upload_to_qdrant_cli.py`
- Generates embeddings and uploads to vector DB
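The chunking step (step 2) can be illustrated with a toy sliding-window splitter. The real `chunk_articles_cli.py` produces semantic chunks, which is more involved; this sketch only shows the core idea of overlapping windows, and the `size`/`overlap` parameters are illustrative:

```python
def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping character windows.

    Overlap keeps sentences that straddle a boundary retrievable
    from at least one chunk.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```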
### Running Locally
**Web Interface:**
```bash
python app.py
```
**Command Line:**
```bash
python rag_chat.py "your question here"
python rag_chat.py "your question" --show-context
```
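The CLI shape above suggests argument parsing along these lines. This is a hypothetical reconstruction of what `rag_chat.py` might do with `argparse`, not the script's actual code:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser for a positional question plus an optional context flag."""
    parser = argparse.ArgumentParser(description="Ask a career question")
    parser.add_argument("question", help="the question to answer")
    parser.add_argument("--show-context", action="store_true",
                        help="also print the retrieved chunks")
    return parser


args = build_parser().parse_args(["What is career capital?", "--show-context"])
```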
## Project Structure
- `app.py` - Main Gradio web interface
- `rag_chat.py` - RAG logic and CLI interface
- `citation_validator.py` - Citation validation system
- `extract_articles_cli.py` - Article scraper
- `chunk_articles_cli.py` - Article chunking
- `upload_to_qdrant_cli.py` - Vector DB uploader
- `config.py` - Shared configuration
## Tech Stack
- **Frontend**: Gradio 5 (sdk_version 5.49.1)
- **LLM**: OpenAI GPT-4o-mini
- **Vector DB**: Qdrant Cloud
- **Embeddings**: sentence-transformers (all-MiniLM-L6-v2)
- **Citation Validation**: rapidfuzz for fuzzy matching
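The citation-validation idea is fuzzy substring matching: a quoted passage counts as valid if it appears approximately in a retrieved source, tolerating small transcription differences. The project uses rapidfuzz for this; the sketch below swaps in the standard library's `difflib` to show the same idea, with an illustrative threshold:

```python
from difflib import SequenceMatcher


def quote_matches(quote: str, source: str, threshold: float = 0.85) -> bool:
    """Fuzzy-check whether `quote` appears (approximately) in `source`.

    Slides a quote-length window over the source and keeps the best
    similarity ratio; rapidfuzz's partial matching does this far faster.
    """
    n = len(quote)
    if n == 0 or n > len(source):
        return False
    best = max(
        SequenceMatcher(None, quote, source[i:i + n]).ratio()
        for i in range(len(source) - n + 1)
    )
    return best >= threshold
```

A threshold near 1.0 rejects any deviation from the source; lowering it tolerates minor punctuation or casing changes at the risk of accepting paraphrases.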
## Credits
Content sourced from [80,000 Hours](https://80000hours.org/), a nonprofit that provides research and support to help people find careers that effectively tackle the world's most pressing problems. |