---
title: PDF Summarization
emoji: π’
colorFrom: blue
colorTo: yellow
sdk: docker
pinned: false
short_description: PDF-Summarization
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
## Project Info
- **Project Name:** Hackrx
- **Python Version:** 3.13.3
# HackRx PDF RAG API
A production-ready FastAPI service that processes PDF documents and answers questions using Retrieval-Augmented Generation (RAG) with Google's Gemini AI.
## Features
- **PDF Processing**: Automatic extraction and chunking of PDF documents
- **AI-Powered Q&A**: Answers questions using Google's Gemini 2.5 Pro
- **Vector Search**: Fast similarity search using FAISS with sentence embeddings
- **Authentication**: Secure Bearer token authentication
- **Fast Response**: Sub-30-second response times
- **Public API**: HTTPS-ready for production deployment
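The chunking and vector-search features can be illustrated in miniature. The sketch below is not the project's implementation: it swaps in bag-of-words counts and a pure-Python cosine similarity as a stand-in for sentence embeddings and FAISS, and the names `chunk_text`, `embed`, `cosine`, and `top_k`, as well as the chunk size and overlap, are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk_text(text: str, size: int = 40, overlap: int = 10) -> List[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question: str, chunks: List[str], k: int = 3) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the question (stand-in for a FAISS search)."""
    q = embed(question)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]
```

In a RAG pipeline of this shape, the chunks returned by `top_k` would be passed to the language model as context for the question.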
## API Endpoints
### POST /hackrx/run
Process a PDF document and answer questions.
**Authentication**: Bearer Token Required
**Request Format**:
```json
{
  "documents": "https://example.com/document.pdf",
  "questions": [
    "What is the grace period for premium payment?",
    "What is the waiting period for pre-existing diseases?"
  ]
}
```
**Response Format**:
```json
{
  "answers": [
    "A grace period of thirty days is provided...",
    "There is a waiting period of thirty-six months..."
  ]
}
```
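A client can check a payload against the request format before sending it and pair each answer with its question afterwards. This is a minimal sketch using only the standard library; `validate_request` and `pair_answers` are hypothetical helpers, and it assumes (as the samples above suggest, but do not state) that `answers` is returned in the same order as `questions`.

```python
from typing import Any, Dict, List, Tuple
from urllib.parse import urlparse


def validate_request(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the payload matches the request format."""
    problems = []
    doc = payload.get("documents")
    if not isinstance(doc, str) or urlparse(doc).scheme not in ("http", "https"):
        problems.append("'documents' must be an http(s) URL string")
    questions = payload.get("questions")
    if not isinstance(questions, list) or not questions:
        problems.append("'questions' must be a non-empty list")
    elif not all(isinstance(q, str) and q.strip() for q in questions):
        problems.append("every question must be a non-empty string")
    return problems


def pair_answers(questions: List[str], response: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Zip questions with answers, assuming the response preserves question order."""
    answers = response.get("answers", [])
    if len(answers) != len(questions):
        raise ValueError("expected one answer per question")
    return list(zip(questions, answers))
```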
### GET /health
Health check endpoint for monitoring.
## Installation
### Prerequisites
- Python 3.8+
- pip package manager
- 4GB+ RAM recommended
### 1. Clone and Setup
```bash
git clone <your-repo-url>
cd hackrx-pdf-rag-api
```
### 2. Create Virtual Environment
```bash
# Windows
python -m venv venv
venv\Scripts\activate
# Linux/Mac
python3 -m venv venv
source venv/bin/activate
```
### 3. Install Dependencies
```bash
pip install -r requirements.txt
```
### 4. Environment Configuration
Create a `.env` file:
```bash
# API Configuration
API_KEY=hackrx-2024-secure-key
# Server Configuration
HOST=0.0.0.0
PORT=8000
DEBUG=False
```
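The variables above could be read into a typed settings object along these lines. This is a sketch under assumptions, not the application's actual loader: `Settings` and `load_settings` are hypothetical names, and it assumes the `.env` values have already been exported into the process environment (for example by a dotenv loader). It also chooses to fail when `API_KEY` is missing rather than fall back to a default.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_key: str
    host: str
    port: int
    debug: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, applying the documented defaults."""
    env = os.environ if env is None else env
    api_key = env.get("API_KEY")
    if not api_key:
        # Design choice of this sketch: refuse to start without an explicit key.
        raise RuntimeError("API_KEY is not set")
    return Settings(
        api_key=api_key,
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        debug=env.get("DEBUG", "False").strip().lower() in ("1", "true", "yes"),
    )
```

Passing a plain dictionary as `env` makes the loader easy to exercise in tests without touching the real environment.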
### 5. Run the Application
#### Local Development
```bash
# Run with auto-reload
python fastapi_app.py
# Or with uvicorn
uvicorn fastapi_app:app --reload --host 0.0.0.0 --port 8000
```
#### Production with HTTPS
```bash
# With SSL certificates
uvicorn fastapi_app:app --host 0.0.0.0 --port 443 \
--ssl-keyfile=path/to/key.pem \
--ssl-certfile=path/to/cert.pem
```
## Deployment Options
### Option 1: Render (Recommended)
1. Create account at [render.com](https://render.com)
2. Connect GitHub repository
3. Use `render.yaml` configuration
4. Deploy automatically
### Option 2: Railway
```bash
# Install Railway CLI
npm install -g @railway/cli
# Deploy
railway login
railway up
```
### Option 3: Heroku
```bash
# Requires the Heroku CLI to be installed and logged in (heroku login)
heroku create hackrx-pdf-rag-api
git push heroku main
```
### Option 4: Docker
```bash
# Build image
docker build -t hackrx-pdf-rag .
# Run container
docker run -p 8000:8000 hackrx-pdf-rag
```
## Usage Examples
### 1. Testing with curl
```bash
curl -X POST "http://localhost:8000/hackrx/run" \
  -H "Authorization: Bearer hackrx-2024-secure-key" \
  -H "Content-Type: application/json" \
  -d '{
    "documents": "https://hackrx.blob.core.windows.net/assets/policy.pdf?sv=2023-01-03&st=2025-07-04T09%3A11%3A24Z&se=2027-07-05T09%3A11%3A00Z&sr=b&sp=r&sig=N4a9OU0w0QXO6AOIBiu4bpl7AXvEZogeT%2FjUHNO7HzQ%3D",
    "questions": [
      "What is the grace period for premium payment?",
      "What is the waiting period for pre-existing diseases?"
    ]
  }'
```
### 2. Python Client
```python
import requests

API_URL = "https://your-domain.com/hackrx/run"
API_KEY = "hackrx-2024-secure-key"

response = requests.post(
    API_URL,
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "documents": "https://example.com/document.pdf",
        "questions": [
            "What is the grace period for premium payment?",
            "What is the waiting period for pre-existing diseases?",
            "Does this policy cover maternity expenses?",
            "What is the waiting period for cataract surgery?",
            "Are the medical expenses for an organ donor covered?",
            "What is the No Claim Discount (NCD) offered?",
            "Is there a benefit for preventive health check-ups?",
            "How does the policy define a 'Hospital'?",
            "What is the extent of coverage for AYUSH treatments?",
            "Are there any sub-limits on room rent and ICU charges?",
        ],
    },
)
response.raise_for_status()  # raise on 4xx/5xx instead of printing an error body
print(response.json())
```
### 3. JavaScript/Node.js
```javascript
const axios = require('axios');

async function processDocument() {
  const response = await axios.post('https://your-domain.com/hackrx/run', {
    documents: 'https://example.com/document.pdf',
    questions: [
      'What is the grace period for premium payment?',
      'What is the waiting period for pre-existing diseases?'
    ]
  }, {
    headers: {
      'Authorization': 'Bearer hackrx-2024-secure-key',
      'Content-Type': 'application/json'
    }
  });
  console.log(response.data);
}

processDocument();
```
## Configuration
### Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `API_KEY` | Bearer token for authentication | `hackrx-2024-secure-key` |
| `GEMINI_API_KEY` | Google Gemini API key | Provided |
| `HOST` | Server host | `0.0.0.0` |
| `PORT` | Server port | `8000` |
| `DEBUG` | Debug mode | `False` |
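Since `API_KEY` is used as the Bearer token, a server-side check of the `Authorization` header might look roughly like this. It is a framework-independent sketch, not the service's actual code: `is_authorized` is a hypothetical helper, and the constant-time comparison via `hmac.compare_digest` is a choice made here to avoid leaking key contents through timing.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_key: str) -> bool:
    """Return True if the header has the form 'Bearer <token>' and the token matches."""
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return False
    # Compare as bytes in constant time.
    return hmac.compare_digest(token.encode("utf-8"), expected_key.encode("utf-8"))
```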
### Performance Tuning
- **Memory**: Ensure 4GB+ RAM for large PDFs
- **Timeout**: Use a 30-second request timeout, in line with the sub-30-second response target
- **Concurrency**: Supports multiple simultaneous requests
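One way to apply the timeout is on the client side. The sketch below assumes that reading of the 30-second figure and uses only the standard library; `build_request` and `ask` are hypothetical helpers, and the URL and key passed to them are placeholders.

```python
import json
import urllib.request
from typing import List


def build_request(api_url: str, api_key: str, document_url: str,
                  questions: List[str]) -> urllib.request.Request:
    """Build the POST request for /hackrx/run without sending it."""
    body = json.dumps({"documents": document_url, "questions": list(questions)})
    return urllib.request.Request(
        api_url,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def ask(api_url: str, api_key: str, document_url: str,
        questions: List[str], timeout: float = 30.0) -> List[str]:
    """Send the request and return the answers; raises if no response arrives within `timeout` seconds."""
    req = build_request(api_url, api_key, document_url, questions)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["answers"]
```

Separating request construction from sending keeps the payload and headers testable without a running server.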
## Testing
### Run Tests
```bash
# Install test dependencies
pip install pytest pytest-asyncio httpx
# Run tests
pytest test_api.py -v
```
### Load Testing
```bash
# Install locust
pip install locust
# Run load test
locust -f load_test.py --host=http://localhost:8000
```
## Monitoring
### Health Check
```bash
curl http://localhost:8000/health
```
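The same check can be scripted for a monitoring job. This sketch assumes that any HTTP 200 from `/health` means the service is up, since the response body is not documented here; `is_healthy` is a hypothetical helper, and the `opener` parameter exists only so the function can be tested without a network.

```python
import urllib.error
import urllib.request
from typing import Callable


def is_healthy(base_url: str, timeout: float = 5.0,
               opener: Callable = urllib.request.urlopen) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx HTTP error.
        return False
```

Calling `is_healthy("http://localhost:8000")` against a locally running instance performs the same request as the curl command above.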
### Logging
- Logs are written to stdout
- Use `docker logs` for containerized deployments
- Integrate with monitoring tools like Datadog or New Relic
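Logging to stdout, as described above, can be set up with the standard `logging` module. This is a minimal sketch rather than the application's actual configuration: the logger name `hackrx`, the format string, and the `configure_logging` helper are assumptions.

```python
import logging
import sys
from typing import Optional, TextIO


def configure_logging(debug: bool = False, stream: Optional[TextIO] = None) -> logging.Logger:
    """Send log records to stdout (or `stream`), at DEBUG level when debug is True, else INFO."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
    logger = logging.getLogger("hackrx")  # hypothetical logger name
    logger.handlers.clear()  # avoid duplicate handlers when called more than once
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    logger.propagate = False  # keep records from being emitted again by the root logger
    return logger
```

Because the records go to stdout, `docker logs` and most hosted log collectors pick them up without further configuration.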
## Troubleshooting
### Common Issues
1. **PDF Download Fails**
   - Check URL accessibility
   - Verify file format (PDF only)
   - Ensure HTTPS URLs for production
2. **Slow Response Times**
   - Increase server resources
   - Optimize PDF size (< 10MB recommended)
   - Check network connectivity
3. **Authentication Errors**
   - Verify Bearer token format
   - Check API key in environment variables
4. **Memory Issues**
   - Monitor RAM usage during processing
   - Consider PDF size limits
   - Use streaming for large files
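Several of the checks above can be automated before a document is submitted. The sketch below is a hypothetical pre-flight helper, not part of the service: it relies on the fact that PDF files begin with the `%PDF-` header, and it treats the 10 MB recommendation as a limit of 10 × 1024 × 1024 bytes, which is an assumption.

```python
from typing import List
from urllib.parse import urlparse

MAX_PDF_BYTES = 10 * 1024 * 1024  # assumed reading of the "< 10MB recommended" guidance


def preflight(url: str, data: bytes, production: bool = True) -> List[str]:
    """Return a list of likely problems with a document URL and its downloaded bytes."""
    problems = []
    scheme = urlparse(url).scheme
    if scheme not in ("http", "https"):
        problems.append("URL must use http or https")
    elif production and scheme != "https":
        problems.append("use an HTTPS URL in production")
    if not data.startswith(b"%PDF-"):
        problems.append("file does not start with the %PDF- header")
    if len(data) > MAX_PDF_BYTES:
        problems.append("file is larger than the recommended 10 MB")
    return problems
```

An empty list means none of these particular checks failed; it does not guarantee that the PDF can be parsed.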
### Debug Mode
```bash
# Enable debug logging
DEBUG=True python fastapi_app.py
```
## Architecture