Commit 2132962 · Ryan committed
Parent(s): 7d34386

simplify changes
Files changed:
- .gitignore +3 -1
- CHANGES_SUMMARY.md +0 -143
- HF_SPACES_CHECKLIST.md +0 -140
- QUICK_START.md +0 -86
- hf_spaces_deploy/README.md +0 -102
- hf_spaces_deploy/app.py +0 -102
- hf_spaces_deploy/citation_validator.py +0 -338
- hf_spaces_deploy/config.py +0 -11
- hf_spaces_deploy/rag_chat.py +0 -181
- hf_spaces_deploy/requirements.txt +0 -10
- prepare_deployment.sh +0 -34
- upload_to_qdrant_cli.py +5 -1
.gitignore CHANGED

@@ -35,7 +35,7 @@ env/
 *.swo
 *~

-# Data files (don't commit
+# Data files (large, don't commit)
 articles.json
 article_chunks.jsonl
 validation_results.json
@@ -46,3 +46,5 @@ Thumbs.db

 # Gradio
 flagged/
+
+TODO.txt
CHANGES_SUMMARY.md DELETED

@@ -1,143 +0,0 @@
-# 🎯 Summary: Hugging Face Spaces Setup Complete
-
-## ✅ What Was Done
-
-Your project is now **fully configured** for deployment to Hugging Face Spaces!
-
-### Files Created/Modified
-
-1. **`app.py`** ✨ NEW
-   - Main Gradio interface optimized for HF Spaces
-   - Removed server configuration (HF handles this)
-   - Clean launch() call for HF environment
-
-2. **`README.md`** ✏️ UPDATED
-   - Added HF Spaces YAML frontmatter
-   - Included deployment instructions
-   - Added configuration guide for Secrets
-
-3. **`.gitignore`** ✨ NEW
-   - Excludes sensitive files (.env, data files)
-   - HF Spaces best practices
-
-4. **`requirements.txt`** ✏️ UPDATED
-   - Added torch dependency (needed by sentence-transformers)
-   - All dependencies verified for HF Spaces
-
-### Documentation Created
-
-5. **`DEPLOYMENT.md`** ✨ NEW
-   - Complete step-by-step deployment guide
-   - Troubleshooting section
-   - Cost breakdown
-
-6. **`HF_SPACES_CHECKLIST.md`** ✨ NEW
-   - Detailed checklist for deployment
-   - File exclusion list
-   - Common issues and solutions
-
-7. **`QUICK_START.md`** ✨ NEW
-   - 5-minute quick start guide
-   - TL;DR version for fast deployment
-   - Quick reference table
-
-8. **`CHANGES_SUMMARY.md`** ✨ NEW (this file)
-   - Overview of all changes made
-
-### Helper Scripts
-
-9. **`prepare_deployment.sh`** ✨ NEW
-   - Automated script to copy deployment files
-   - Already tested and working!
-   - Creates `hf_spaces_deploy/` directory
-
-### Ready-to-Deploy Files
-
-The `hf_spaces_deploy/` directory contains exactly what you need:
-```
-hf_spaces_deploy/
-├── app.py (3.3K)
-├── rag_chat.py (5.6K)
-├── citation_validator.py (13K)
-├── config.py (252B)
-├── requirements.txt (170B)
-└── README.md (2.9K)
-```
-
-## 🚀 Next Steps (Your Action Required)
-
-### Quick Deploy (5 minutes):
-
-1. **Create HF Space**: https://huggingface.co/spaces → "Create new Space"
-   - Choose Gradio SDK
-   - Use CPU basic (free)
-
-2. **Add Secrets** in Space Settings:
-   - `QDRANT_URL`
-   - `QDRANT_API_KEY`
-   - `OPENAI_API_KEY`
-
-3. **Upload files** from `hf_spaces_deploy/` folder
-
-4. **Done!** Your app will be live in 2-5 minutes
-
-### Detailed Instructions:
-- See `QUICK_START.md` for step-by-step guide
-- See `HF_SPACES_CHECKLIST.md` for complete checklist
-- See `DEPLOYMENT.md` for troubleshooting
-
-## 📁 What Stays Local
-
-These files are for local development only (NOT uploaded to HF Spaces):
-- `.env` - Your secrets (use HF Secrets instead)
-- `articles.json` - Source data (already in Qdrant)
-- `article_chunks.jsonl` - Chunked data (already in Qdrant)
-- `web_app.py` - Old version (replaced by app.py)
-- `*_cli.py` - Setup scripts (not needed in deployment)
-
-## ✨ Key Features of Your Deployment
-
-- ✅ **Free hosting** on HF Spaces CPU tier
-- ✅ **Secure** - API keys stored as Secrets
-- ✅ **Fast** - Optimized for Gradio 4.0+
-- ✅ **Professional** - Beautiful UI with Soft theme
-- ✅ **Validated citations** - Every quote is verified
-- ✅ **Easy updates** - Just git push to redeploy
-
-## 📊 Architecture
-
-```
-User Question
-      ↓
-[Gradio UI (app.py)]
-      ↓
-[RAG Logic (rag_chat.py)]
-      ↓
-[Qdrant Vector DB] ← Retrieve relevant chunks
-      ↓
-[OpenAI GPT-4o-mini] ← Generate answer with citations
-      ↓
-[Citation Validator] ← Verify quotes against sources
-      ↓
-[Formatted Response] → Display to user
-```
-
-## 📝 Notes
-
-- Environment variables work automatically on HF Spaces (no .env needed)
-- `load_dotenv()` gracefully handles missing .env file
-- All code is production-ready and tested
-- Deployment is reversible (just delete the Space)
-
-## 🤔 Questions?
-
-Refer to:
-1. `QUICK_START.md` - Fast deployment
-2. `HF_SPACES_CHECKLIST.md` - Detailed checklist
-3. `DEPLOYMENT.md` - Complete guide
-4. HF Spaces docs: https://huggingface.co/docs/hub/spaces
-
----
-
-**Your project is 100% ready for Hugging Face Spaces! 🚀**
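The notes above say HF Spaces injects Secrets as environment variables and that `load_dotenv()` degrades gracefully when no `.env` file exists. A minimal sketch of that pattern — the helper name `get_required_env` is ours for illustration, not from the repo:

```python
import os

def get_required_env(name: str) -> str:
    """Fetch a secret injected by HF Spaces (or loaded locally from .env).

    Fails fast with a clear message instead of surfacing later as a
    cryptic Qdrant/OpenAI authentication error.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Locally, python-dotenv would populate os.environ from .env first;
# on HF Spaces the Secrets are already present, so no .env is needed.
```

The same three names (`QDRANT_URL`, `QDRANT_API_KEY`, `OPENAI_API_KEY`) work unchanged in both environments, which is why the code needs no HF-specific branch.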
HF_SPACES_CHECKLIST.md DELETED

@@ -1,140 +0,0 @@
-# Hugging Face Spaces Deployment Checklist
-
-## ✅ Files Ready for Upload
-
-These are the ONLY files you need to upload to Hugging Face Spaces:
-
-- [ ] `app.py` - Main Gradio interface (✅ Created)
-- [ ] `rag_chat.py` - RAG logic
-- [ ] `citation_validator.py` - Citation validation
-- [ ] `config.py` - Configuration constants
-- [ ] `requirements.txt` - Python dependencies (✅ Updated)
-- [ ] `README.md` - Documentation with HF metadata (✅ Updated)
-
-## ❌ Files to EXCLUDE (Do NOT upload)
-
-- `.env` - Contains secrets (use HF Spaces Secrets instead)
-- `articles.json` - Large data file (not needed, data is in Qdrant)
-- `article_chunks.jsonl` - Large data file (not needed, data is in Qdrant)
-- `validation_results.json` - Runtime output file
-- `__pycache__/` - Python cache
-- `web_app.py` - Old version (replaced by app.py)
-- `extract_articles_cli.py` - Setup script (not needed for deployed app)
-- `chunk_articles_cli.py` - Setup script (not needed for deployed app)
-- `upload_to_qdrant_cli.py` - Setup script (not needed for deployed app)
-
-## 🔧 Pre-Deployment Steps
-
-### 1. Verify Data is in Qdrant
-```bash
-# Make sure you've already run these locally:
-python extract_articles_cli.py
-python chunk_articles_cli.py
-python upload_to_qdrant_cli.py
-```
-
-### 2. Test Locally (Optional)
-```bash
-# Set up virtual environment
-python -m venv venv
-source venv/bin/activate  # On Windows: venv\Scripts\activate
-
-# Install dependencies
-pip install -r requirements.txt
-
-# Run the app
-python app.py
-```
-
-### 3. Create Hugging Face Space
-1. Go to https://huggingface.co/spaces
-2. Click "Create new Space"
-3. Configure:
-   - **Space name**: Your choice (e.g., `80k-career-advisor`)
-   - **SDK**: Gradio
-   - **Hardware**: CPU basic (free tier is sufficient)
-   - **Visibility**: Public or Private
-
-### 4. Configure Secrets (CRITICAL!)
-In your Space Settings → Variables and Secrets, add:
-
-- **QDRANT_URL**: `https://your-cluster-url.aws.cloud.qdrant.io`
-- **QDRANT_API_KEY**: `your-qdrant-api-key`
-- **OPENAI_API_KEY**: `sk-...your-openai-key`
-
-### 5. Upload Files
-
-**Option A: Git**
-```bash
-git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
-cd YOUR_SPACE_NAME
-
-# Copy only the necessary files
-cp /home/ryan/Documents/80k_rag/app.py .
-cp /home/ryan/Documents/80k_rag/rag_chat.py .
-cp /home/ryan/Documents/80k_rag/citation_validator.py .
-cp /home/ryan/Documents/80k_rag/config.py .
-cp /home/ryan/Documents/80k_rag/requirements.txt .
-cp /home/ryan/Documents/80k_rag/README.md .
-
-git add .
-git commit -m "Initial deployment"
-git push
-```
-
-**Option B: Web Interface**
-1. Click "Files and versions" tab
-2. Click "Upload files"
-3. Drag and drop the 6 files listed above
-4. Click "Commit"
-
-### 6. Monitor Build
-- Watch the build logs in the App tab
-- Build typically takes 2-5 minutes
-- Look for any errors in dependencies or imports
-
-## 🚀 Post-Deployment
-
-### Testing
-1. Visit your Space URL: `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`
-2. Try the example questions
-3. Test with a custom question
-4. Verify citations are displaying correctly
-
-### Common Issues
-
-**Problem**: Build fails with "Module not found"
-- **Solution**: Check that all imports in `app.py`, `rag_chat.py`, and `citation_validator.py` are in `requirements.txt`
-
-**Problem**: Runtime error about missing API keys
-- **Solution**: Verify secrets are set correctly in Space Settings
-
-**Problem**: Slow responses
-- **Solution**: Consider upgrading to a better hardware tier
-
-**Problem**: "No relevant sources found"
-- **Solution**: Verify your Qdrant instance is accessible and contains data
-
-## 💰 Estimated Costs
-
-- **HF Space (CPU basic)**: Free
-- **OpenAI API**: ~$0.01-0.05 per query (GPT-4o-mini)
-- **Qdrant Cloud**: Free tier supports up to 1GB
-
-## 🔄 Updating Your Deployed App
-
-```bash
-# Make changes locally
-# Then push updates
-cd YOUR_SPACE_NAME
-git add .
-git commit -m "Update: description of changes"
-git push
-```
-
-## 📝 Notes
-
-- The app will save `validation_results.json` during runtime (this is fine, stored in Space's temporary storage)
-- Secrets in HF Spaces are injected as environment variables (compatible with your code)
-- The `.env` file is only for local development
QUICK_START.md DELETED

@@ -1,86 +0,0 @@
-# 🚀 Quick Start: Deploy to Hugging Face Spaces
-
-## TL;DR - 5 Minute Deploy
-
-### Step 1: Prepare Files (Already Done! ✅)
-```bash
-./prepare_deployment.sh
-```
-
-### Step 2: Create HF Space
-1. Go to https://huggingface.co/spaces
-2. Click **"Create new Space"**
-3. Settings:
-   - Space name: `80k-career-advisor` (or your choice)
-   - SDK: **Gradio**
-   - Hardware: **CPU basic** (free)
-   - Visibility: Public or Private
-4. Click **"Create Space"**
-
-### Step 3: Add Secrets (CRITICAL!)
-On your Space page → **Settings** → **Variables and Secrets**:
-
-| Name | Value |
-|------|-------|
-| `QDRANT_URL` | Your Qdrant instance URL |
-| `QDRANT_API_KEY` | Your Qdrant API key |
-| `OPENAI_API_KEY` | Your OpenAI API key |
-
-### Step 4: Upload Files
-
-**Easy Way (Web Upload):**
-1. Go to **Files and versions** tab
-2. Click **"Upload files"**
-3. Drag these 6 files from `hf_spaces_deploy/`:
-   - app.py
-   - rag_chat.py
-   - citation_validator.py
-   - config.py
-   - requirements.txt
-   - README.md
-4. Click **"Commit changes to main"**
-
-**Git Way:**
-```bash
-# Clone your new space
-git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
-cd YOUR_SPACE_NAME
-
-# Copy files
-cp ../80k_rag/hf_spaces_deploy/* .
-
-# Push
-git add .
-git commit -m "Initial deployment"
-git push
-```
-
-### Step 5: Wait & Test
-- Build takes 2-5 minutes
-- Your app will be live at: `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`
-- Test with example questions!
-
-## Troubleshooting
-
-| Problem | Solution |
-|---------|----------|
-| Build fails | Check build logs, verify requirements.txt |
-| "Module not found" | Ensure all dependencies in requirements.txt |
-| No API response | Verify secrets are set correctly |
-| "No relevant sources" | Check Qdrant instance is accessible |
-
-## Cost
-
-- **HF Space**: FREE (CPU basic tier)
-- **OpenAI**: ~$0.01-0.05 per query
-- **Qdrant**: FREE (up to 1GB)
-
-Total: Essentially free for moderate usage!
-
-## Need Help?
-
-See detailed guides:
-- `HF_SPACES_CHECKLIST.md` - Complete checklist
-- `DEPLOYMENT.md` - Detailed deployment guide
-- `README.md` - Full project documentation
hf_spaces_deploy/README.md DELETED

@@ -1,102 +0,0 @@
----
-title: 80,000 Hours RAG Q&A
-emoji: 🎯
-colorFrom: blue
-colorTo: purple
-sdk: gradio
-sdk_version: 4.0.0
-app_file: app.py
-pinned: false
----
-
-# 🎯 80,000 Hours Career Advice Q&A
-
-A Retrieval-Augmented Generation (RAG) system that answers career-related questions using content from [80,000 Hours](https://80000hours.org/), with validated citations.
-
-## Features
-
-- 🔍 **Semantic Search**: Retrieves relevant content from 80,000 Hours articles
-- 🤖 **AI-Powered Answers**: Uses GPT-4o-mini to generate comprehensive responses
-- ✅ **Citation Validation**: Automatically validates that quotes exist in source material
-- 🔗 **Source Attribution**: Every answer includes validated citations with URLs
-
-## How It Works
-
-1. Your question is converted to a vector embedding
-2. Relevant article chunks are retrieved from Qdrant vector database
-3. GPT-4o-mini generates an answer with citations
-4. Citations are validated against source material
-5. You get an answer with verified quotes and source links
-
-## Configuration for Hugging Face Spaces
-
-To deploy this app, you need to configure the following **Secrets** in your Space settings:
-
-1. Go to your Space → Settings → Variables and Secrets
-2. Add these secrets:
-   - `QDRANT_URL`: Your Qdrant cloud instance URL
-   - `QDRANT_API_KEY`: Your Qdrant API key
-   - `OPENAI_API_KEY`: Your OpenAI API key
-
-## Local Development
-
-### Setup
-
-1. Install dependencies:
-```bash
-pip install -r requirements.txt
-```
-
-2. Create `.env` file with:
-```
-QDRANT_URL=your_url
-QDRANT_API_KEY=your_key
-OPENAI_API_KEY=your_key
-```
-
-### First Time Setup (run in order):
-
-1. **Extract articles** → `python extract_articles_cli.py`
-   - Scrapes 80,000 Hours articles from sitemap
-   - Only needed once (or to refresh content)
-
-2. **Chunk articles** → `python chunk_articles_cli.py`
-   - Splits articles into semantic chunks
-
-3. **Upload to Qdrant** → `python upload_to_qdrant_cli.py`
-   - Generates embeddings and uploads to vector DB
-
-### Running Locally
-
-**Web Interface:**
-```bash
-python app.py
-```
-
-**Command Line:**
-```bash
-python rag_chat.py "your question here"
-python rag_chat.py "your question" --show-context
-```
-
-## Project Structure
-
-- `app.py` - Main Gradio web interface
-- `rag_chat.py` - RAG logic and CLI interface
-- `citation_validator.py` - Citation validation system
-- `extract_articles_cli.py` - Article scraper
-- `chunk_articles_cli.py` - Article chunking
-- `upload_to_qdrant_cli.py` - Vector DB uploader
-- `config.py` - Shared configuration
-
-## Tech Stack
-
-- **Frontend**: Gradio 4.0+
-- **LLM**: OpenAI GPT-4o-mini
-- **Vector DB**: Qdrant Cloud
-- **Embeddings**: sentence-transformers (all-MiniLM-L6-v2)
-- **Citation Validation**: rapidfuzz for fuzzy matching
-
-## Credits
-
-Content sourced from [80,000 Hours](https://80000hours.org/), a nonprofit that provides research and support to help people find careers that effectively tackle the world's most pressing problems.
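The "How It Works" steps in the deleted README hinge on vector similarity search. A toy, dependency-free sketch of the retrieval step (step 2): the real app uses sentence-transformers embeddings and Qdrant, which are replaced here by hand-made 3-dimensional vectors and plain cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, chunks, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]

# Toy "index": in the real system these vectors come from all-MiniLM-L6-v2
chunks = [
    {"text": "career planning advice", "vec": [1.0, 0.0, 0.1]},
    {"text": "global health problems", "vec": [0.0, 1.0, 0.2]},
    {"text": "how to plan your career", "vec": [0.9, 0.1, 0.0]},
]
top = retrieve([1.0, 0.0, 0.0], chunks, k=2)
```

Qdrant performs the same ranking server-side over the uploaded embeddings, so the app only sends a query vector and receives the top-k chunk payloads.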
hf_spaces_deploy/app.py DELETED

@@ -1,102 +0,0 @@
-import gradio as gr
-import os
-from rag_chat import ask
-
-def chat_interface(question: str, show_context: bool = False):
-    """Process question and return formatted response."""
-    if not question.strip():
-        return "Please enter a question.", ""
-
-    result = ask(question, show_context=show_context)
-
-    # Format main response
-    answer = result["answer"]
-
-    # Format citations
-    citations_text = ""
-    if result["citations"]:
-        citations_text += "\n\n---\n\n### 📚 Citations\n\n"
-        for i, citation in enumerate(result["citations"], 1):
-            citations_text += f"**[{i}]** {citation['title']}\n"
-            citations_text += f"> \"{citation['quote']}\"\n"
-            citations_text += f"🔗 [{citation['url']}]({citation['url']})\n\n"
-
-    # Add validation warnings if any
-    if result.get("validation_errors"):
-        citations_text += "\n⚠️ **Validation Warnings:**\n"
-        for error in result["validation_errors"]:
-            citations_text += f"- {error}\n"
-
-    # Add stats
-    if result["citations"]:
-        valid_count = len([c for c in result["citations"] if c.get("validated", True)])
-        total_count = len(result["citations"])
-        citations_text += f"\n✅ {valid_count}/{total_count} citations validated"
-
-    return answer, citations_text
-
-# Create Gradio interface
-with gr.Blocks(title="80,000 Hours Q&A", theme=gr.themes.Soft()) as demo:
-    gr.Markdown(
-        """
-        # 🎯 80,000 Hours Career Advice Q&A
-        Ask questions about career planning and get answers backed by citations from 80,000 Hours articles.
-
-        This RAG system retrieves relevant content from the 80,000 Hours knowledge base and generates answers with validated citations.
-        """
-    )
-
-    with gr.Row():
-        with gr.Column():
-            question_input = gr.Textbox(
-                label="Your Question",
-                placeholder="e.g., Should I plan my entire career?",
-                lines=2
-            )
-            show_context_checkbox = gr.Checkbox(
-                label="Show retrieved context (for debugging)",
-                value=False
-            )
-            submit_btn = gr.Button("Ask", variant="primary")
-
-    with gr.Row():
-        with gr.Column():
-            answer_output = gr.Textbox(
-                label="Answer",
-                lines=10,
-                show_copy_button=True
-            )
-
-        with gr.Column():
-            citations_output = gr.Markdown(
-                label="Citations & Sources"
-            )
-
-    # Event handlers
-    submit_btn.click(
-        fn=chat_interface,
-        inputs=[question_input, show_context_checkbox],
-        outputs=[answer_output, citations_output]
-    )
-
-    question_input.submit(
-        fn=chat_interface,
-        inputs=[question_input, show_context_checkbox],
-        outputs=[answer_output, citations_output]
-    )
-
-    # Example questions
-    gr.Examples(
-        examples=[
-            "Should I plan my entire career?",
-            "What career advice does 80k give?",
-            "How can I have more impact with my career?",
-            "What are the world's most pressing problems?",
-        ],
-        inputs=question_input
-    )
-
-if __name__ == "__main__":
-    # HF Spaces handles the server configuration
-    demo.launch()
hf_spaces_deploy/citation_validator.py DELETED

@@ -1,338 +0,0 @@
-"""Citation validation and formatting for RAG system.
-
-This module handles structured citations with validation to prevent hallucination.
-"""
-
-import json
-import time
-from typing import List, Dict, Any
-from urllib.parse import quote
-from openai import OpenAI
-from rapidfuzz import fuzz
-
-
-FUZZY_THRESHOLD = 95
-
-def create_highlighted_url(base_url: str, quote_text: str) -> str:
-    """Create a URL with text fragment that highlights the quoted text.
-
-    Uses the :~:text= URL fragment feature to scroll to and highlight text.
-
-    Args:
-        base_url: The base article URL
-        quote_text: The text to highlight
-
-    Returns:
-        URL with text fragment
-    """
-    # Take first ~100 characters of quote for the URL (browsers have limits)
-    # and clean up for URL encoding
-    text_fragment = quote_text[:100].strip()
-    encoded_text = quote(text_fragment)
-    return f"{base_url}#:~:text={encoded_text}"
-
-
-def normalize_text(text: str) -> str:
-    """Normalize text for comparison by handling whitespace and punctuation variants."""
-    # Normalize different dash types to standard hyphen
-    text = text.replace('–', '-')  # en-dash
-    text = text.replace('—', '-')  # em-dash
-    text = text.replace('−', '-')  # minus sign
-    # Normalize different apostrophe/quote types to standard ASCII
-    text = text.replace('’', "'")  # curly apostrophe
-    text = text.replace('‘', "'")  # left single quote
-    text = text.replace('“', '"')  # left double quote
-    text = text.replace('”', '"')  # right double quote
-    # Normalize whitespace
-    text = " ".join(text.split())
-    return text
-
-
-def validate_citation(quote: str, source_chunks: List[Any], source_id: int) -> Dict[str, Any]:
-    """Validate that a quote exists in the specified source chunk.
-
-    Args:
-        quote: The quoted text to validate
-        source_chunks: List of source chunks from Qdrant
-        source_id: 1-indexed source ID
-
-    Returns:
-        Dict with validation result and metadata
-    """
-    if source_id < 1 or source_id > len(source_chunks):
-        return {
-            "valid": False,
-            "quote": quote,
-            "source_id": source_id,
-            "reason": "Invalid source ID",
-            "source_text": None
-        }
-
-    quote_clean = normalize_text(quote).lower()
-
-    # Step 1: Check claimed source first (fast path)
-    source_text = normalize_text(source_chunks[source_id - 1].payload['text']).lower()
-    claimed_score = fuzz.partial_ratio(quote_clean, source_text)
-
-    if claimed_score >= FUZZY_THRESHOLD:
-        return {
-            "valid": True,
-            "quote": quote,
-            "source_id": source_id,
-            "title": source_chunks[source_id - 1].payload['title'],
-            "url": source_chunks[source_id - 1].payload['url'],
-            "similarity_score": claimed_score
-        }
-
-    for idx, chunk in enumerate(source_chunks, 1):
-        if idx == source_id:
-            continue  # Already checked
-        chunk_text = normalize_text(chunk.payload['text']).lower()
-        score = fuzz.partial_ratio(quote_clean, chunk_text)
-        if score >= FUZZY_THRESHOLD:
-            return {
-                "valid": True,
-                "quote": quote,
-                "source_id": idx,
-                "title": chunk.payload['title'],
-                "url": chunk.payload['url'],
-                "similarity_score": score,
-                "remapped": True,
-                "original_source_id": source_id
-            }
-
-    # Validation failed - report best score from claimed source
-    return {
-        "valid": False,
-        "quote": quote,
-        "source_id": source_id,
-        "reason": f"Quote not found in any source (claimed source: {claimed_score:.1f}% similarity)",
-        "source_text": source_chunks[source_id - 1].payload['text']
-    }
-
-
-def generate_answer_with_citations(
-    question: str,
-    context: str,
-    results: List[Any],
-    llm_model: str,
-    openai_api_key: str
-) -> Dict[str, Any]:
-    """Generate answer with structured citations using OpenAI.
-
-    Args:
-        question: User's question
-        context: Formatted context from source chunks
-        results: Source chunks from Qdrant
-        llm_model: OpenAI model name
-        openai_api_key: OpenAI API key
-
-    Returns:
-        Dict with answer and validated citations
-    """
-    client = OpenAI(api_key=openai_api_key)
-
-    system_prompt = """You are a helpful assistant that answers questions based on 80,000 Hours articles.
-
-You MUST return your response in valid JSON format with this exact structure:
-{
-  "answer": "Your conversational answer with inline citation markers like [1], [2]",
-  "citations": [
-    {
-      "citation_id": 1,
-      "source_id": 1,
-      "quote": "exact sentence or sentences from the source that support your claim"
-    }
-  ]
-}
-
-CITATION HARD RULES:
-1. Copy quotes EXACTLY as they appear in the provided context
-   - NO ellipses (...)
-   - NO paraphrasing
-   - NO punctuation changes
-   - Word-for-word, character-for-character accuracy required
-
-2. If the needed support is in two places, use TWO SEPARATE citation entries
-   - Do NOT combine quotes from different sources or different parts of text
-   - Each citation must contain a continuous, unmodified quote
-
-3. Use the CORRECT source_id from the provided list
-   - Source IDs are numbered [Source 1], [Source 2], etc. in the context
-   - Verify the source_id matches where you found the quote
-
-CRITICAL RULES FOR CITATIONS:
-- For EVERY claim (advice, fact, statistic, recommendation), add an inline citation [1], [2], etc.
-- For each citation, extract and quote the EXACT sentence(s) from the source that directly support your claim
-- Find the specific sentence(s) in the source that contain the relevant information
-- Each quote should be at least 20 characters and contain complete sentence(s)
-- Multiple consecutive sentences can be quoted if needed to fully support the claim
-
-WRITING STYLE:
-- Write concisely in a natural, conversational tone
-- You may paraphrase information in your answer, but always cite the source with exact quotes
-- You can add brief context/transitions without citations, but cite all substantive claims
-- If the sources don't fully answer the question, acknowledge that briefly
-- Only use information from the provided sources - don't add external knowledge
|
| 178 |
-
|
| 179 |
-
EXAMPLES:
|
| 180 |
-
|
| 181 |
-
Example 1 - Single claim:
|
| 182 |
-
{
|
| 183 |
-
"answer": "One of the most effective ways to build career capital is to work at a high-performing organization where you can learn from talented colleagues [1].",
|
| 184 |
-
"citations": [
|
| 185 |
-
{
|
| 186 |
-
"citation_id": 1,
|
| 187 |
-
"source_id": 2,
|
| 188 |
-
"quote": "Working at a high-performing organization is one of the fastest ways to build career capital because you learn from talented colleagues and develop strong professional networks."
|
| 189 |
-
}
|
| 190 |
-
]
|
| 191 |
-
}
|
| 192 |
-
|
| 193 |
-
Example 2 - Multiple claims:
|
| 194 |
-
{
|
| 195 |
-
"answer": "AI safety is considered one of the most pressing problems of our time [1]. Experts estimate that advanced AI could be developed within the next few decades [2], and there's a significant talent gap in the field [3]. This means your contributions could have an outsized impact.",
|
| 196 |
-
"citations": [
|
| 197 |
-
{
|
| 198 |
-
"citation_id": 1,
|
| 199 |
-
"source_id": 1,
|
| 200 |
-
"quote": "We believe that risks from artificial intelligence are one of the most pressing problems facing humanity today."
|
| 201 |
-
},
|
| 202 |
-
{
|
| 203 |
-
"citation_id": 2,
|
| 204 |
-
"source_id": 1,
|
| 205 |
-
"quote": "Many AI researchers believe there's a 10-50% chance of human-level AI being developed by 2050."
|
| 206 |
-
},
|
| 207 |
-
{
|
| 208 |
-
"citation_id": 3,
|
| 209 |
-
"source_id": 3,
|
| 210 |
-
"quote": "There are currently fewer than 300 people working full-time on technical AI safety research, despite the field's critical importance."
|
| 211 |
-
}
|
| 212 |
-
]
|
| 213 |
-
}"""
|
| 214 |
-
|
| 215 |
-
user_prompt = f"""Context from 80,000 Hours articles:
|
| 216 |
-
|
| 217 |
-
{context}
|
| 218 |
-
|
| 219 |
-
Question: {question}
|
| 220 |
-
|
| 221 |
-
Provide your answer in JSON format with exact quotes from the sources."""
|
| 222 |
-
|
| 223 |
-
llm_call_start = time.time()
|
| 224 |
-
response = client.chat.completions.create(
|
| 225 |
-
model=llm_model,
|
| 226 |
-
messages=[
|
| 227 |
-
{"role": "system", "content": system_prompt},
|
| 228 |
-
{"role": "user", "content": user_prompt}
|
| 229 |
-
],
|
| 230 |
-
response_format={"type": "json_object"}
|
| 231 |
-
)
|
| 232 |
-
print(f"[TIMING] OpenAI call: {(time.time() - llm_call_start)*1000:.2f}ms")
|
| 233 |
-
|
| 234 |
-
# Parse the JSON response
|
| 235 |
-
try:
|
| 236 |
-
result = json.loads(response.choices[0].message.content)
|
| 237 |
-
# Enforce strict shape: must have 'answer' (str) and 'citations' (list of dicts)
|
| 238 |
-
if not isinstance(result, dict) or 'answer' not in result or 'citations' not in result:
|
| 239 |
-
return {
|
| 240 |
-
"answer": response.choices[0].message.content,
|
| 241 |
-
"citations": [],
|
| 242 |
-
"validation_errors": ["Response JSON missing required keys 'answer' and/or 'citations'."]
|
| 243 |
-
}
|
| 244 |
-
if not isinstance(result['answer'], str) or not isinstance(result['citations'], list):
|
| 245 |
-
return {
|
| 246 |
-
"answer": response.choices[0].message.content,
|
| 247 |
-
"citations": [],
|
| 248 |
-
"validation_errors": ["Response JSON has incorrect types for 'answer' or 'citations'."]
|
| 249 |
-
}
|
| 250 |
-
answer = result.get("answer", "")
|
| 251 |
-
citations = result.get("citations", [])
|
| 252 |
-
except json.JSONDecodeError:
|
| 253 |
-
return {
|
| 254 |
-
"answer": response.choices[0].message.content,
|
| 255 |
-
"citations": [],
|
| 256 |
-
"validation_errors": ["Failed to parse JSON response"]
|
| 257 |
-
}
|
| 258 |
-
|
| 259 |
-
# Validate each citation
|
| 260 |
-
validation_start = time.time()
|
| 261 |
-
validated_citations = []
|
| 262 |
-
validation_errors = []
|
| 263 |
-
|
| 264 |
-
for citation in citations:
|
| 265 |
-
quote = citation.get("quote", "")
|
| 266 |
-
source_id = citation.get("source_id", 0)
|
| 267 |
-
citation_id = citation.get("citation_id", 0)
|
| 268 |
-
|
| 269 |
-
validation_result = validate_citation(quote, results, source_id)
|
| 270 |
-
|
| 271 |
-
if validation_result["valid"]:
|
| 272 |
-
# Create URL with text fragment to highlight the quote
|
| 273 |
-
highlighted_url = create_highlighted_url(
|
| 274 |
-
validation_result["url"],
|
| 275 |
-
quote
|
| 276 |
-
)
|
| 277 |
-
citation_entry = {
|
| 278 |
-
"citation_id": citation_id,
|
| 279 |
-
"source_id": validation_result["source_id"],
|
| 280 |
-
"quote": quote,
|
| 281 |
-
"title": validation_result["title"],
|
| 282 |
-
"url": highlighted_url,
|
| 283 |
-
"similarity_score": validation_result["similarity_score"]
|
| 284 |
-
}
|
| 285 |
-
if validation_result.get("remapped"):
|
| 286 |
-
citation_entry["remapped_from"] = validation_result["original_source_id"]
|
| 287 |
-
validated_citations.append(citation_entry)
|
| 288 |
-
else:
|
| 289 |
-
validation_errors.append({
|
| 290 |
-
"citation_id": citation_id,
|
| 291 |
-
"reason": validation_result['reason'],
|
| 292 |
-
"claimed_quote": quote,
|
| 293 |
-
"source_text": validation_result.get('source_text')
|
| 294 |
-
})
|
| 295 |
-
|
| 296 |
-
print(f"[TIMING] Validation: {(time.time() - validation_start)*1000:.2f}ms")
|
| 297 |
-
|
| 298 |
-
return {
|
| 299 |
-
"answer": answer,
|
| 300 |
-
"citations": validated_citations,
|
| 301 |
-
"validation_errors": validation_errors,
|
| 302 |
-
"total_citations": len(citations),
|
| 303 |
-
"valid_citations": len(validated_citations)
|
| 304 |
-
}
|
| 305 |
-
|
| 306 |
-
|
| 307 |
-
def format_citations_display(citations: List[Dict[str, Any]]) -> str:
|
| 308 |
-
"""Format validated citations in order with article title, URL, and quoted text.
|
| 309 |
-
|
| 310 |
-
Args:
|
| 311 |
-
citations: List of validated citation dicts
|
| 312 |
-
|
| 313 |
-
Returns:
|
| 314 |
-
Formatted string for display
|
| 315 |
-
"""
|
| 316 |
-
if not citations:
|
| 317 |
-
return "No citations available."
|
| 318 |
-
|
| 319 |
-
# Sort citations by citation_id to display in order
|
| 320 |
-
sorted_citations = sorted(citations, key=lambda x: x.get('citation_id', 0))
|
| 321 |
-
|
| 322 |
-
citation_parts = []
|
| 323 |
-
for cit in sorted_citations:
|
| 324 |
-
marker = f"[{cit['citation_id']}]"
|
| 325 |
-
score = cit.get('similarity_score', 100)
|
| 326 |
-
|
| 327 |
-
if cit.get('remapped_from'):
|
| 328 |
-
note = f" ({score:.1f}% match, remapped: source {cit['remapped_from']} β {cit['source_id']})"
|
| 329 |
-
else:
|
| 330 |
-
note = f" ({score:.1f}% match)"
|
| 331 |
-
|
| 332 |
-
citation_parts.append(
|
| 333 |
-
f"{marker} {cit['title']}{note}\n"
|
| 334 |
-
f" URL: {cit['url']}\n"
|
| 335 |
-
f" Quote: \"{cit['quote']}\"\n"
|
| 336 |
-
)
|
| 337 |
-
return "\n".join(citation_parts)
|
| 338 |
-
|
|
hf_spaces_deploy/config.py
DELETED
@@ -1,11 +0,0 @@
-"""Shared configuration constants for the 80k RAG system."""
-
-# Embedding model used across the system
-MODEL_NAME = 'all-MiniLM-L6-v2'
-
-# Qdrant collection name
-COLLECTION_NAME = "80k_articles"
-
-# Embedding dimension for the model
-EMBEDDING_DIM = 384
hf_spaces_deploy/rag_chat.py
DELETED
@@ -1,181 +0,0 @@
-import os
-import time
-from typing import Dict, Any
-from dotenv import load_dotenv
-from qdrant_client import QdrantClient
-from sentence_transformers import SentenceTransformer
-from citation_validator import generate_answer_with_citations, format_citations_display, normalize_text
-from config import MODEL_NAME, COLLECTION_NAME
-
-load_dotenv()
-
-LLM_MODEL = "gpt-4o-mini"
-SOURCE_COUNT = 10
-SCORE_THRESHOLD = 0.4
-
-def retrieve_context(question):
-    """Retrieve relevant chunks from Qdrant."""
-    start = time.time()
-
-    client = QdrantClient(
-        url=os.getenv("QDRANT_URL"),
-        api_key=os.getenv("QDRANT_API_KEY"),
-    )
-
-    model = SentenceTransformer(MODEL_NAME)
-    query_vector = model.encode(question).tolist()
-
-    results = client.query_points(
-        collection_name=COLLECTION_NAME,
-        query=query_vector,
-        limit=SOURCE_COUNT,
-        score_threshold=SCORE_THRESHOLD,
-    )
-    print(f"[TIMING] Retrieval: {(time.time() - start)*1000:.2f}ms")
-
-    return results.points
-
-def format_context(results):
-    """Format retrieved chunks into context string for LLM."""
-    context_parts = []
-    for i, hit in enumerate(results, 1):
-        context_parts.append(
-            f"[Source {i}]\n"
-            f"Title: {hit.payload['title']}\n"
-            f"URL: {hit.payload['url']}\n"
-            f"Content: {hit.payload['text']}\n"
-        )
-    return "\n---\n".join(context_parts)
-
-def ask(question: str, show_context: bool = False) -> Dict[str, Any]:
-    """Main RAG function: retrieve context and generate answer with validated citations."""
-    total_start = time.time()
-    print(f"Question: {question}\n")
-
-    # Retrieve relevant chunks
-    results = retrieve_context(question)
-
-    if not results:
-        print("No relevant sources found above the score threshold.")
-        return {
-            "question": question,
-            "answer": "No relevant information found in the knowledge base.",
-            "citations": [],
-            "sources": []
-        }
-
-    context = format_context(results)
-    print(f"[TIMING] First chunk ready: {(time.time() - total_start)*1000:.2f}ms")
-
-    if show_context:
-        print("=" * 80)
-        print("RETRIEVED CONTEXT:")
-        print("=" * 80)
-        print(context)
-        print("\n")
-
-    # Generate answer with citations
-    llm_start = time.time()
-    result = generate_answer_with_citations(
-        question=question,
-        context=context,
-        results=results,
-        llm_model=LLM_MODEL,
-        openai_api_key=os.getenv("OPENAI_API_KEY")
-    )
-
-    total_time = (time.time() - total_start) * 1000
-    print(f"[TIMING] Total: {total_time:.2f}ms ({total_time/1000:.2f}s)")
-
-    # Display answer
-    print("\n" + "=" * 80)
-    print("ANSWER:")
-    print("=" * 80)
-    print(result["answer"])
-    print("\n")
-
-    # Display citations
-    print("=" * 80)
-    print("CITATIONS (Verified Quotes):")
-    print("=" * 80)
-    print(format_citations_display(result["citations"]))
-
-    # Show validation stats
-    if result["validation_errors"]:
-        print("\n" + "=" * 80)
-        print("VALIDATION WARNINGS:")
-        print("=" * 80)
-        for error in result["validation_errors"]:
-            print(f"⚠ [Citation {error['citation_id']}] {error['reason']}")
-
-    print("\n" + "=" * 80)
-    print(f"Citation Stats: {result['valid_citations']}/{result['total_citations']} citations validated")
-    print("=" * 80)
-
-    # Save validation results to JSON
-    def normalize_dict(obj):
-        """Recursively normalize all strings in a dict/list structure."""
-        if isinstance(obj, dict):
-            return {k: normalize_dict(v) for k, v in obj.items()}
-        elif isinstance(obj, list):
-            return [normalize_dict(item) for item in obj]
-        elif isinstance(obj, str):
-            return normalize_text(obj)
-        return obj
-
-    validation_output = {
-        "question": question,
-        "answer": result["answer"],
-        "citations": result["citations"],
-        "validation_errors": result["validation_errors"],
-        "stats": {
-            "total_citations": result["total_citations"],
-            "valid_citations": result["valid_citations"],
-            "total_time_ms": total_time
-        },
-        "sources": [
-            {
-                "source_id": i,
-                "title": hit.payload['title'],
-                "url": hit.payload['url'],
-                "chunk_id": hit.payload.get('chunk_id'),
-                "text": hit.payload['text']
-            }
-            for i, hit in enumerate(results, 1)
-        ]
-    }
-
-    # Normalize all text in the output
-    validation_output = normalize_dict(validation_output)
-
-    import json
-    with open("validation_results.json", "w", encoding="utf-8") as f:
-        json.dump(validation_output, f, ensure_ascii=False, indent=2)
-    print("\n[INFO] Validation results saved to validation_results.json")
-
-    return {
-        "question": question,
-        "answer": result["answer"],
-        "citations": result["citations"],
-        "validation_errors": result["validation_errors"],
-        "sources": results
-    }
-
-def main():
-    import sys
-
-    # Default test query if no args provided
-    if len(sys.argv) < 2:
-        question = "Should I plan my entire career?"
-        show_context = False
-        print(f"[INFO] No query provided, using test query: '{question}'\n")
-    else:
-        show_context = "--show-context" in sys.argv
-        question_parts = [arg for arg in sys.argv[1:] if arg != "--show-context"]
-        question = " ".join(question_parts)
-
-    ask(question, show_context=show_context)
-
-if __name__ == "__main__":
-    main()
hf_spaces_deploy/requirements.txt
DELETED
@@ -1,10 +0,0 @@
-openai>=1.0.0
-qdrant-client>=1.7.0
-sentence-transformers>=2.2.0
-python-dotenv>=1.0.0
-beautifulsoup4>=4.12.0
-requests>=2.31.0
-gradio>=4.0.0
-rapidfuzz>=3.0.0
-torch>=2.0.0
prepare_deployment.sh
DELETED
@@ -1,34 +0,0 @@
-#!/bin/bash
-# Helper script to prepare files for Hugging Face Spaces deployment
-
-echo "📦 Preparing files for Hugging Face Spaces deployment..."
-echo ""
-
-# Create deployment directory
-DEPLOY_DIR="hf_spaces_deploy"
-rm -rf $DEPLOY_DIR
-mkdir -p $DEPLOY_DIR
-
-# Copy necessary files
-echo "Copying files..."
-cp app.py $DEPLOY_DIR/
-cp rag_chat.py $DEPLOY_DIR/
-cp citation_validator.py $DEPLOY_DIR/
-cp config.py $DEPLOY_DIR/
-cp requirements.txt $DEPLOY_DIR/
-cp README.md $DEPLOY_DIR/
-
-echo "✅ Files copied to $DEPLOY_DIR/"
-echo ""
-echo "Files ready for deployment:"
-ls -lh $DEPLOY_DIR/
-echo ""
-echo "📋 Next steps:"
-echo "1. Create your Hugging Face Space at https://huggingface.co/spaces"
-echo "2. Configure secrets (QDRANT_URL, QDRANT_API_KEY, OPENAI_API_KEY)"
-echo "3. Clone your space: git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE"
-echo "4. Copy files: cp hf_spaces_deploy/* YOUR_SPACE/"
-echo "5. Push: cd YOUR_SPACE && git add . && git commit -m 'Initial deployment' && git push"
-echo ""
-echo "📖 See HF_SPACES_CHECKLIST.md for detailed instructions"
upload_to_qdrant_cli.py
CHANGED
@@ -29,7 +29,11 @@ def load_embedding_model():
 def create_collection(client, collection_name=COLLECTION_NAME, embedding_dim=EMBEDDING_DIM):
     print(f"Creating collection '{collection_name}'...")
     try:
-        client.
+        if client.collection_exists(collection_name):
+            print(f"Collection '{collection_name}' exists. Deleting...")
+            client.delete_collection(collection_name)
+
+        client.create_collection(
             collection_name=collection_name,
             vectors_config=VectorParams(size=embedding_dim, distance=Distance.COSINE),
         )
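The hunk above changes `create_collection` to drop any existing collection before creating a fresh one, making the upload script safe to re-run. A minimal standalone sketch of that delete-then-create pattern is below; the `StubClient` is a hypothetical stand-in for the three `qdrant_client` methods the script calls (`collection_exists`, `delete_collection`, `create_collection`), so the logic can be exercised without a running Qdrant instance:

```python
class StubClient:
    """Hypothetical stub mimicking the slice of the Qdrant client API used by the script."""

    def __init__(self):
        self._collections = {}

    def collection_exists(self, name):
        return name in self._collections

    def delete_collection(self, name):
        self._collections.pop(name, None)

    def create_collection(self, collection_name, vectors_config=None):
        self._collections[collection_name] = vectors_config


def create_collection(client, collection_name="80k_articles", embedding_dim=384):
    # Mirrors the updated script: drop any stale collection, then create it fresh.
    if client.collection_exists(collection_name):
        print(f"Collection '{collection_name}' exists. Deleting...")
        client.delete_collection(collection_name)
    client.create_collection(
        collection_name=collection_name,
        vectors_config={"size": embedding_dim, "distance": "Cosine"},
    )


client = StubClient()
create_collection(client)  # first run: creates the collection
create_collection(client)  # second run: deletes, then recreates — no error
print(client.collection_exists("80k_articles"))
```

With the real `QdrantClient`, the same sequence avoids the "collection already exists" failure on repeated uploads.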