---
title: Web-Based RAG System
emoji: π
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---
# Web-Based RAG System
A production-ready Retrieval-Augmented Generation (RAG) system that combines PDF document processing and web search capabilities to provide intelligent answers to user queries.
## Table of Contents
- [Features](#features)
- [Tech Stack](#tech-stack)
- [Architecture](#architecture)
- [Installation](#installation)
- [Configuration](#configuration)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [API Endpoints](#api-endpoints)
- [Frontend Components](#frontend-components)
- [Contributing](#contributing)
- [License](#license)
## Features
- **Multi-Modal Query Processing**: Supports queries against both uploaded PDF documents and live web search
- **PDF Document Management**: Upload, store, and process PDF documents with advanced extraction techniques
- **OCR Support for Scanned PDFs**: Automatically extracts text from image-based/scanned PDFs using Tesseract OCR
- **Hybrid Search**: Combines PDF-based retrieval with web search for comprehensive answers
- **Confidence Scoring**: Provides confidence scores for generated responses
- **Vector Storage**: Efficient similarity search using ChromaDB vector database
- **Modern UI**: Responsive React-based frontend with intuitive user experience
- **RESTful API**: Well-documented API endpoints for easy integration
- **File Upload**: Drag-and-drop PDF upload functionality
- **Query Modes**: Different query modes (PDF-only, Web-only, Hybrid, Restricted)
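Conceptually, hybrid mode merges ranked results from the PDF retriever and the web searcher before they reach the LLM. The sketch below illustrates that merge step only; the function and field names are hypothetical, not the actual `retriever.py` API:

```python
# Illustrative sketch of hybrid result merging; names are hypothetical,
# not the project's actual retriever API.
def merge_results(pdf_hits, web_hits, top_k=5):
    """Interleave PDF and web hits by score, deduplicating by source id."""
    combined = sorted(pdf_hits + web_hits, key=lambda h: h["score"], reverse=True)
    seen, merged = set(), []
    for hit in combined:
        if hit["source"] not in seen:
            seen.add(hit["source"])
            merged.append(hit)
        if len(merged) == top_k:
            break
    return merged

pdf_hits = [{"source": "doc1#p3", "score": 0.91},
            {"source": "doc2#p1", "score": 0.74}]
web_hits = [{"source": "https://example.com/a", "score": 0.88}]
print([h["source"] for h in merge_results(pdf_hits, web_hits)])
# prints: ['doc1#p3', 'https://example.com/a', 'doc2#p1']
```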
## Tech Stack
### Backend
- **Framework**: FastAPI with Uvicorn ASGI server
- **Language**: Python 3.11+
- **Database**: ChromaDB (vector database)
- **Embeddings**: Sentence Transformers
- **HTTP Client**: aiohttp
- **PDF Processing**: PyPDF, pdfplumber, pdf2image, pytesseract
- **OCR**: Tesseract for scanned/image-based PDFs
- **LLM Integration**: Groq API
- **Environment Management**: python-dotenv
- **Data Validation**: Pydantic
### Frontend
- **Framework**: React 18+
- **Language**: TypeScript
- **Styling**: Tailwind CSS
- **Build Tool**: Vite
- **HTTP Client**: Axios
- **UI Components**: Custom-built with Lucide React icons
- **File Upload**: react-dropzone
- **Notifications**: react-hot-toast
## Architecture
The application follows a client-server architecture with a clear separation between frontend and backend:
```
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│     Frontend     │     │     Backend      │     │     External     │
│     (React)      │◄───►│    (FastAPI)     │◄───►│     Services     │
│                  │     │                  │     │                  │
│ • User Interface │     │ • API Gateway    │     │ • Groq API       │
│ • File Upload    │     │ • PDF Processor  │     │ • Web Search     │
│ • Query Input    │     │ • Embedding      │     │ • Vector Store   │
│ • Results Display│     │ • Retriever      │     │                  │
└──────────────────┘     │ • LLM Service    │     └──────────────────┘
                         └──────────────────┘
```
## Installation
### Prerequisites
- Python 3.11+
- Node.js 18+
- npm or yarn
- Git
- **Tesseract OCR** (for scanned PDF support):
- macOS: `brew install tesseract poppler`
- Ubuntu: `sudo apt-get install tesseract-ocr poppler-utils`
- Windows: Download from https://github.com/tesseract-ocr/tesseract
### Backend Setup
1. Clone the repository:
```bash
git clone https://github.com/YuvrajSinghBhadoria2/web_based_rag.git
cd web_based_rag/backend
```
2. Create a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Create a `.env` file in the backend directory based on `.env.bak`:
```bash
cp .env.bak .env
```
### Frontend Setup
1. Navigate to the frontend directory:
```bash
cd ../frontend
```
2. Install dependencies:
```bash
npm install
```
## Configuration
### Backend Environment Variables
Create a `.env` file in the backend directory with the following variables:
```env
GROQ_API_KEY=your_groq_api_key_here
SERPER_API_KEY=your_serper_api_key_here # Optional - for web search
TAVILY_API_KEY=your_tavily_api_key_here # Optional - for web search
CHROMA_DB_PATH=./storage/vector_db
UPLOAD_DIR=./storage/uploads
MODEL_NAME=llama3-70b-8192
TEMPERATURE=0.1
MAX_TOKENS=1000
TOP_P=1
STOP_TOKENS=["\n", "###"]
CORS_ORIGINS=["http://localhost:5173", "http://localhost:3000", "http://127.0.0.1:5173", "http://127.0.0.1:3000", "http://localhost:5175"]
```
Replace `your_groq_api_key_here` with your actual Groq API key. You can get one from [Groq Cloud](https://console.groq.com/keys).
For web search functionality, add a Serper or Tavily API key (optional; without one, hybrid mode falls back to PDF sources only).
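The backend reads these variables at startup. The actual `config.py` may use pydantic or python-dotenv; the dependency-free sketch below just shows how the typed values (floats, ints, JSON-encoded lists) can be parsed, with illustrative defaults:

```python
import json
import os

# Dependency-free sketch of reading the variables above; the project's
# config.py may instead use Pydantic settings with python-dotenv.
def load_settings(env=os.environ):
    return {
        "groq_api_key": env.get("GROQ_API_KEY", ""),
        "model_name": env.get("MODEL_NAME", "llama3-70b-8192"),
        "temperature": float(env.get("TEMPERATURE", "0.1")),
        "max_tokens": int(env.get("MAX_TOKENS", "1000")),
        # List-valued settings are stored as JSON arrays in .env
        "cors_origins": json.loads(env.get("CORS_ORIGINS", '["http://localhost:5173"]')),
    }

settings = load_settings({"GROQ_API_KEY": "gsk_demo", "TEMPERATURE": "0.2"})
print(settings["temperature"], settings["cors_origins"])
```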
## Usage
### Running the Backend
1. Make sure you're in the backend directory
2. Activate your virtual environment
3. Start the backend server:
```bash
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```
The backend will be available at `http://localhost:8000` with API documentation at `http://localhost:8000/api/docs`.
### Running the Frontend
1. Navigate to the frontend directory
2. Start the development server:
```bash
npm run dev
```
The frontend will be available at `http://localhost:5173`.
### Application Workflow
1. **Upload Documents**: Use the drag-and-drop interface to upload PDF documents
2. **Select Query Mode**: Choose between PDF-only, Web-only, Hybrid, or Restricted modes
3. **Enter Query**: Type your question in the query input
4. **Get Response**: Receive an AI-generated answer with confidence score and source citations
5. **Review Sources**: View the documents and web pages that contributed to the response
### OCR for Scanned PDFs
The system automatically detects and processes scanned/image-based PDFs using Tesseract OCR:
- If a PDF contains selectable text, it uses the native text extraction
- If no text is found, it automatically applies OCR to extract text from images
- Works with scanned documents, image-only PDFs, and documents with mixed content
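The fallback decision above can be sketched as follows. The extractor callables are injected so the logic stays library-agnostic; the real `pdf_processor.py` wires in PyPDF/pdfplumber for native text and pytesseract for OCR, and the `min_chars` threshold here is an assumption:

```python
# Sketch of the text-first, OCR-fallback decision described above.
# Extractors are passed in so the logic is library-agnostic.
def extract_text(pdf_path, native_extract, ocr_extract, min_chars=20):
    """Use native text extraction; fall back to OCR if too little text is found."""
    text = native_extract(pdf_path) or ""
    if len(text.strip()) >= min_chars:
        return text, "native"
    return ocr_extract(pdf_path), "ocr"

# Stub extractors simulating a scanned (image-only) PDF:
text, method = extract_text(
    "scanned.pdf",
    native_extract=lambda p: "",            # no selectable text layer
    ocr_extract=lambda p: "OCR-recovered text",
)
print(method)  # prints: ocr
```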
## Project Structure
```
web_based_rag/
├── backend/
│   ├── app/
│   │   ├── api/
│   │   │   └── v1/
│   │   │       └── routes/
│   │   │           ├── documents.py     # Document management endpoints
│   │   │           ├── health.py        # Health check endpoint
│   │   │           ├── query.py         # Query processing endpoints
│   │   │           └── upload.py        # File upload endpoints
│   │   ├── core/                        # Core utilities and configurations
│   │   ├── models/
│   │   │   └── schemas.py               # Pydantic models and schemas
│   │   ├── services/
│   │   │   ├── confidence.py            # Confidence scoring service
│   │   │   ├── embeddings.py            # Embedding generation service
│   │   │   ├── enhanced_llm.py          # Enhanced LLM service
│   │   │   ├── llm_service.py           # LLM integration service
│   │   │   ├── pdf_processor.py         # PDF processing service
│   │   │   ├── prompt_guard.py          # Prompt safety service
│   │   │   ├── retriever.py             # Information retrieval service
│   │   │   ├── vector_store.py          # Vector database operations
│   │   │   └── web_search.py            # Web search service
│   │   ├── utils/
│   │   │   ├── chunking.py              # Text chunking utilities
│   │   │   └── rate_limiter.py          # Rate limiting utilities
│   │   ├── config.py                    # Configuration settings
│   │   └── main.py                      # Application entry point
│   ├── storage/
│   │   ├── uploads/                     # Uploaded PDF files
│   │   ├── vector_db/                   # Vector database files
│   │   └── documents.json               # Document metadata
│   ├── requirements.txt                 # Python dependencies
│   ├── Dockerfile                       # Docker configuration
│   └── .env.bak                         # Environment variables template
└── frontend/
    ├── src/
    │   ├── components/
    │   │   ├── common/                  # Reusable UI components
    │   │   ├── documents/               # Document-related components
    │   │   ├── layout/                  # Layout components
    │   │   ├── query/                   # Query input components
    │   │   ├── results/                 # Results display components
    │   │   └── settings/                # Settings modal components
    │   ├── context/
    │   │   └── AppContext.tsx           # Application state management
    │   ├── services/
    │   │   └── api.ts                   # API service client
    │   ├── types/
    │   │   └── index.ts                 # Type definitions
    │   ├── App.tsx                      # Main application component
    │   └── main.tsx                     # Application entry point
    ├── package.json
    ├── tsconfig.json
    ├── tailwind.config.js
    └── vite.config.ts
```
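Before documents reach the vector store, `utils/chunking.py` splits them into pieces small enough to embed. A minimal sliding-window sketch is below; the `chunk_size` and `overlap` values are illustrative, not the project's actual defaults:

```python
# Illustrative sliding-window chunker; chunk_size/overlap values are
# assumptions, not the actual defaults in utils/chunking.py.
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

chunks = chunk_text("x" * 1200, chunk_size=500, overlap=50)
print(len(chunks))  # prints: 3
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.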
## API Endpoints
### Health Check
- `GET /` - Root endpoint returning API information
### Documents
- `GET /api/v1/documents` - Get list of uploaded documents
- `DELETE /api/v1/documents/{document_id}` - Delete a document
### File Upload
- `POST /api/v1/upload` - Upload PDF document
### Query
- `POST /api/v1/query` - Process query with specified mode
- Request body: `{"query": "your query", "mode": "pdf|web|hybrid|restricted", "document_ids": ["optional document IDs"]}`
- Response: `{"response": "answer", "sources": [], "confidence": 0.85}`
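A client request is a plain JSON object matching the schema above. The sketch below builds and validates such a payload client-side; it assumes only the fields shown in the spec (the server-side Pydantic model may accept more):

```python
# Build and validate a query payload matching the schema shown above.
VALID_MODES = {"pdf", "web", "hybrid", "restricted"}

def build_query_payload(query, mode="hybrid", document_ids=None):
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    payload = {"query": query, "mode": mode}
    if document_ids:  # optional per the endpoint spec
        payload["document_ids"] = document_ids
    return payload

# POST this with requests/aiohttp to http://localhost:8000/api/v1/query
print(build_query_payload("What is RAG?", mode="pdf", document_ids=["doc-1"]))
```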
### Additional Endpoints
- `GET /api/docs` - Interactive API documentation (Swagger UI)
- `GET /api/redoc` - Alternative API documentation (ReDoc)
## Frontend Components
### Layout Components
- **Header**: Navigation and branding
- **Sidebar**: Document management and settings
- **MainContent**: Primary content area
### Document Components
- **FileUpload**: Drag-and-drop PDF upload
- **DocumentList**: Display of uploaded documents
- **DocumentCard**: Individual document information
### Query Components
- **QueryInput**: Input field with mode selector
- **ModeSelector**: Options for PDF-only, Web-only, Hybrid, or Restricted queries
### Results Components
- **ResultsDisplay**: Container for query results
- **AnswerCard**: Display of the AI-generated answer
- **SourcesList**: List of source documents
- **SourceCard**: Detailed source information
- **ConfidenceIndicator**: Visual representation of response confidence
### Settings Components
- **SettingsModal**: Configuration options
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## Deploying to Hugging Face Spaces
This application is configured for deployment on Hugging Face Spaces using the Docker SDK. The repository includes:
- A `Dockerfile` that sets up the complete environment
- A `README.md` with proper Hugging Face metadata
- All necessary backend and frontend code
To deploy to your Space:
1. Create a new Space with the Docker SDK
2. Point it to this repository
3. Add your API keys as Space Secrets:
- `GROQ_API_KEY`: Your Groq API key
4. The Space will automatically build and deploy using the Dockerfile
Your application will be served at the port specified in the Dockerfile (7860).
### Option 1: Using the Docker Image
1. Create a new Space on Hugging Face with the following settings:
- **Space SDK**: Docker
- **Hardware**: Choose based on your needs (GPU recommended for better performance)
2. Add your Hugging Face token and API keys as secrets in the Space settings:
- `HF_TOKEN`: Your Hugging Face token (`your_hf_token_here`)
- `GROQ_API_KEY`: Your Groq API key
- Any other required API keys
3. Create a `Dockerfile` in your Space repository with the following content:
```dockerfile
FROM python:3.11-slim
WORKDIR /app
# Install nodejs for the frontend
RUN apt-get update && apt-get install -y nodejs npm && apt-get clean
# Copy backend requirements and install Python dependencies
COPY backend/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Install frontend dependencies (dev deps included: Vite is needed for the build below)
COPY frontend/package*.json ./frontend/
RUN cd frontend && npm ci
# Copy the rest of the application
COPY . .
# Build the frontend
RUN cd frontend && npm run build
# Expose the port Hugging Face Spaces expects
EXPOSE 7860
# Start both backend and frontend
CMD bash -c "cd backend && python -m uvicorn app.main:app --host 0.0.0.0 --port 7860 & cd frontend && npx serve -s dist -l 7861"
```
4. Create an `.env` file in the backend directory with your API keys:
```env
GROQ_API_KEY=your_groq_api_key_here
# Add other required environment variables
```
### Option 2: Deploying Your Existing React Frontend (Recommended)
To deploy your existing React frontend along with the FastAPI backend (this preserves your original UI):
1. In your Hugging Face Space repository, copy your entire project
2. Create a Dockerfile that builds and serves both applications:
```dockerfile
FROM node:18-alpine AS frontend-build
WORKDIR /app
COPY frontend/package*.json .
RUN npm ci
COPY frontend/ .
RUN npm run build
FROM python:3.11-slim AS backend-build
WORKDIR /app
# Install dependencies
COPY backend/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY backend/ .
# Copy built frontend
COPY --from=frontend-build /app/dist ./frontend/dist
# Install node for serving frontend
RUN apt-get update && apt-get install -y nodejs npm && apt-get clean
EXPOSE 7860
CMD python -m uvicorn app.main:app --host 0.0.0.0 --port 7860
```
3. Update your backend CORS settings in `backend/app/config.py` to allow the Hugging Face Space URL
4. Add your API keys as Space secrets:
- `GROQ_API_KEY`: Your Groq API key
- Other required API keys
Note: This approach maintains your original React interface, which is more feature-rich than a Gradio interface: your existing frontend with its document cards, sidebar, settings modal, and responsive design is preserved.
## Deployment Steps
1. Create a new repository on Hugging Face Spaces
2. Push your code to the repository
3. Add your API keys as secrets in the Space settings
4. The application will automatically build and deploy
Your RAG application is now ready for deployment on Hugging Face Spaces. Keep your Hugging Face token (`HF_TOKEN`) in the Space secrets, never in the repository.
## License
This project is licensed under the MIT License - see the LICENSE file for details.