Commit · 667283b
Parent(s): bfc207f
Revert old hf readme
README.md
CHANGED
|
@@ -1,352 +1,11 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
- [Architecture](#architecture)
|
| 10 |
-
- [Installation](#installation)
|
| 11 |
-
- [Configuration](#configuration)
|
| 12 |
-
- [API Endpoints](#api-endpoints)
|
| 13 |
-
- [Usage](#usage)
|
| 14 |
-
- [Environment Variables](#environment-variables)
|
| 15 |
-
- [Development](#development)
|
| 16 |
-
- [Deployment](#deployment)
|
| 17 |
-
- [Troubleshooting](#troubleshooting)
|
| 18 |
-
- [License](#license)
|
| 19 |
-
|
| 20 |
-
## Overview
|
| 21 |
-
|
| 22 |
-
The PageIndex Memori Chatbot API enables natural language interactions with PDF documents using retrieval-augmented generation (RAG). It combines PageIndex's vectorless retrieval system with Telkom's AI LLM to provide accurate, context-aware responses to user queries about uploaded documents.
|
| 23 |
-
|
| 24 |
-
## Features
|
| 25 |
-
|
| 26 |
-
- **PDF Upload & Processing**: Upload PDF documents for intelligent processing and indexing
|
| 27 |
-
- **Vectorless RAG**: Uses PageIndex technology for document understanding without traditional vector embeddings
|
| 28 |
-
- **Natural Language Chat**: Engage in conversations with your documents using natural language
|
| 29 |
-
- **Session Management**: Maintain conversation context with session IDs
|
| 30 |
-
- **Document Metadata Storage**: Keeps track of uploaded documents and their processing status
|
| 31 |
-
- **Memory Management**: Uses Memori for conversation history and context
|
| 32 |
-
- **Asynchronous Processing**: Handles document processing and AI calls asynchronously
|
| 33 |
-
- **Comprehensive Logging**: Detailed logging for debugging and monitoring
|
| 34 |
-
|
| 35 |
-
## Architecture
|
| 36 |
-
|
| 37 |
-
The system consists of several key components:
|
| 38 |
-
|
| 39 |
-
```
|
| 40 |
-
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
|
| 41 |
-
│   Frontend      │◄──►│   FastAPI API    │◄──►│   PageIndex     │
|
| 42 |
-
│   (Client)      │    │   (Backend)      │    │   (Retrieval)   │
|
| 43 |
-
└─────────────────┘    └──────────────────┘    └─────────────────┘
|
| 44 |
-
                                │
|
| 45 |
-
                       ┌──────────────────┐
|
| 46 |
-
                       │      Memori      │
|
| 47 |
-
                       │     (Memory)     │
|
| 48 |
-
                       └──────────────────┘
|
| 49 |
-
                                │
|
| 50 |
-
                       ┌──────────────────┐
|
| 51 |
-
                       │   LLM Service    │
|
| 52 |
-
                       │   (Generation)   │
|
| 53 |
-
                       └──────────────────┘
|
| 54 |
-
```
|
| 55 |
-
|
| 56 |
-
### Key Components:
|
| 57 |
-
|
| 58 |
-
- **FastAPI**: Web framework providing RESTful API endpoints
|
| 59 |
-
- **PageIndex**: Vectorless RAG system for document processing and retrieval
|
| 60 |
-
- **Memori**: Memory management system for conversation context
|
| 61 |
-
- **Telkom LLM**: Large Language Model for response generation
|
| 62 |
-
- **Loguru**: Structured logging system
|
| 63 |
-
- **SQLAlchemy**: Database ORM for persistent storage
|
| 64 |
-
|
| 65 |
-
## Installation
|
| 66 |
-
|
| 67 |
-
### Prerequisites
|
| 68 |
-
|
| 69 |
-
- Python 3.11 or higher
|
| 70 |
-
- pip package manager
|
| 71 |
-
- Access to PageIndex API (https://dash.pageindex.ai/api-keys)
|
| 72 |
-
- Access to Telkom AI LLM service
|
| 73 |
-
|
| 74 |
-
### Setup
|
| 75 |
-
|
| 76 |
-
1. Clone the repository:
|
| 77 |
-
```bash
|
| 78 |
-
git clone [repository-url]
|
| 79 |
-
cd pageindex-trial
|
| 80 |
-
```
|
| 81 |
-
|
| 82 |
-
2. Create a virtual environment:
|
| 83 |
-
```bash
|
| 84 |
-
python -m venv .venv
|
| 85 |
-
source .venv/bin/activate # On Windows: .venv\Scripts\activate
|
| 86 |
-
```
|
| 87 |
-
|
| 88 |
-
3. Install dependencies:
|
| 89 |
-
```bash
|
| 90 |
-
pip install -r requirements.txt
|
| 91 |
-
# Or if using poetry:
|
| 92 |
-
poetry install
|
| 93 |
-
```
|
| 94 |
-
|
| 95 |
-
## Configuration
|
| 96 |
-
|
| 97 |
-
### Environment Variables
|
| 98 |
-
|
| 99 |
-
Create a `.env` file in the project root with the following variables:
|
| 100 |
-
|
| 101 |
-
```env
|
| 102 |
-
# PageIndex Configuration
|
| 103 |
-
PAGEINDEX_API_KEY=your_pageindex_api_key_here
|
| 104 |
-
|
| 105 |
-
# LLM Configuration
|
| 106 |
-
URL_CUSTOM_LLM=https://your-llm-endpoint.com/api/v1/chat/completions
|
| 107 |
-
TOKEN_CUSTOM_LLM=your_llm_api_token_here
|
| 108 |
-
|
| 109 |
-
# Database Configuration
|
| 110 |
-
DATABASE_URL=sqlite:///./memori.db
|
| 111 |
-
|
| 112 |
-
# Application Configuration
|
| 113 |
-
APP_NAME=PageIndex RAG Chatbot
|
| 114 |
-
DEBUG=False
|
| 115 |
-
HOST=0.0.0.0
|
| 116 |
-
PORT=8000
|
| 117 |
-
PDF_STORAGE_PATH=./uploaded_pdfs
|
| 118 |
-
MAX_FILE_SIZE=10485760 # 10MB in bytes
|
| 119 |
-
LLM_MODEL_NAME=telkom-ai-instruct
|
| 120 |
-
LLM_TEMPERATURE=0
|
| 121 |
-
LLM_MAX_TOKENS=3000
|
| 122 |
-
MAX_HISTORY_LENGTH=10
|
| 123 |
-
```
|
| 124 |
-
|
| 125 |
-
### Configuration Options
|
| 126 |
-
|
| 127 |
-
- `APP_NAME`: Name of your application
|
| 128 |
-
- `DEBUG`: Enable/disable debug mode
|
| 129 |
-
- `HOST`: Host address for the API server
|
| 130 |
-
- `PORT`: Port number for the API server
|
| 131 |
-
- `PDF_STORAGE_PATH`: Directory to store uploaded PDFs
|
| 132 |
-
- `MAX_FILE_SIZE`: Maximum allowed PDF file size in bytes
|
| 133 |
-
- `LLM_TEMPERATURE`: Temperature setting for LLM responses (0-1)
|
| 134 |
-
- `LLM_MAX_TOKENS`: Maximum tokens for LLM responses
|
| 135 |
-
- `MAX_HISTORY_LENGTH`: Number of conversation turns to maintain in memory
|
| 136 |
-
|
| 137 |
-
## API Endpoints
|
| 138 |
-
|
| 139 |
-
### Upload PDF
|
| 140 |
-
- **Endpoint**: `POST /upload`
|
| 141 |
-
- **Description**: Upload a PDF document for processing
|
| 142 |
-
- **Request**: Multipart form data with PDF file
|
| 143 |
-
- **Response**: PDF ID and upload status
|
| 144 |
-
|
| 145 |
-
```bash
|
| 146 |
-
curl -X POST "http://localhost:8000/upload" \
|
| 147 |
-
-H "accept: application/json" \
|
| 148 |
-
-H "Content-Type: multipart/form-data" \
|
| 149 |
-
-F "file=@document.pdf"
|
| 150 |
-
```
|
| 151 |
-
|
| 152 |
-
### Chat with PDF
|
| 153 |
-
- **Endpoint**: `POST /chat`
|
| 154 |
-
- **Description**: Chat with an uploaded PDF document
|
| 155 |
-
- **Request**: JSON with message, PDF ID, and optional session ID
|
| 156 |
-
- **Response**: AI-generated response and session information
|
| 157 |
-
|
| 158 |
-
```bash
|
| 159 |
-
curl -X POST "http://localhost:8000/chat" \
|
| 160 |
-
-H "Content-Type: application/json" \
|
| 161 |
-
-d '{
|
| 162 |
-
"message": "What is this document about?",
|
| 163 |
-
"pdf_id": "generated-uuid-here",
|
| 164 |
-
"session_id": "optional-session-id"
|
| 165 |
-
}'
|
| 166 |
-
```
|
| 167 |
-
|
| 168 |
-
### Delete PDF
|
| 169 |
-
- **Endpoint**: `DELETE /pdf/{pdf_id}`
|
| 170 |
-
- **Description**: Delete an uploaded PDF and its associated data
|
| 171 |
-
- **Request**: Path parameter with PDF ID
|
| 172 |
-
- **Response**: Deletion confirmation
|
| 173 |
-
|
| 174 |
-
```bash
|
| 175 |
-
curl -X DELETE "http://localhost:8000/pdf/{pdf_id}"
|
| 176 |
-
```
|
| 177 |
-
|
| 178 |
-
### Root Endpoint
|
| 179 |
-
- **Endpoint**: `GET /`
|
| 180 |
-
- **Description**: Health check and welcome message
|
| 181 |
-
- **Response**: Welcome message
|
| 182 |
-
|
| 183 |
-
```bash
|
| 184 |
-
curl -X GET "http://localhost:8000/"
|
| 185 |
-
```
|
| 186 |
-
|
| 187 |
-
## Usage
|
| 188 |
-
|
| 189 |
-
### Starting the Server
|
| 190 |
-
|
| 191 |
-
```bash
|
| 192 |
-
# Using uvicorn directly
|
| 193 |
-
uvicorn core.main:app --host 0.0.0.0 --port 8000 --reload
|
| 194 |
-
|
| 195 |
-
# Or run the main module
|
| 196 |
-
python -m core.main
|
| 197 |
-
```
|
| 198 |
-
|
| 199 |
-
### Basic Workflow
|
| 200 |
-
|
| 201 |
-
1. **Upload a PDF**: Send a POST request to `/upload` with your PDF file
|
| 202 |
-
2. **Wait for Processing**: The system will process the PDF using PageIndex
|
| 203 |
-
3. **Chat with Document**: Send messages to `/chat` with the PDF ID to get responses
|
| 204 |
-
4. **Manage Sessions**: Optionally provide session IDs to maintain conversation context
|
| 205 |
-
5. **Clean Up**: Delete PDFs when no longer needed using the delete endpoint
|
| 206 |
-
|
| 207 |
-
### Example Python Client
|
| 208 |
-
|
| 209 |
-
```python
|
| 210 |
-
import requests
|
| 211 |
-
|
| 212 |
-
# Upload PDF
|
| 213 |
-
with open('document.pdf', 'rb') as f:
|
| 214 |
-
    upload_response = requests.post(
|
| 215 |
-
        'http://localhost:8000/upload',
|
| 216 |
-
        files={'file': f}
|
| 217 |
-
    )
|
| 218 |
-
pdf_id = upload_response.json()['pdf_id']
|
| 219 |
-
|
| 220 |
-
# Chat with PDF
|
| 221 |
-
chat_response = requests.post(
|
| 222 |
-
    'http://localhost:8000/chat',
|
| 223 |
-
    json={
|
| 224 |
-
        'message': 'Summarize this document',
|
| 225 |
-
        'pdf_id': pdf_id
|
| 226 |
-
    }
|
| 227 |
-
)
|
| 228 |
-
print(chat_response.json()['response'])
|
| 229 |
-
```
|
| 230 |
-
|
| 231 |
-
## Development
|
| 232 |
-
|
| 233 |
-
### Project Structure
|
| 234 |
-
|
| 235 |
-
```
|
| 236 |
-
pageindex-trial/
|
| 237 |
-
├── core/ # Core application modules
|
| 238 |
-
│   ├── __init__.py
|
| 239 |
-
│   ├── config.py # Application configuration
|
| 240 |
-
│   ├── llm_engine.py # LLM integration
|
| 241 |
-
│   ├── main.py # FastAPI application
|
| 242 |
-
│   └── routes.py # API route definitions
|
| 243 |
-
├── data/ # Data files
|
| 244 |
-
├── uploaded_pdfs/ # Storage for uploaded PDFs
|
| 245 |
-
├── .env # Environment variables
|
| 246 |
-
├── .gitignore
|
| 247 |
-
├── README.md
|
| 248 |
-
├── requirements.txt # Python dependencies
|
| 249 |
-
├── pyproject.toml # Poetry configuration
|
| 250 |
-
└── uv.lock # Dependency lock file
|
| 251 |
-
```
|
| 252 |
-
|
| 253 |
-
### Running Tests
|
| 254 |
-
|
| 255 |
-
```bash
|
| 256 |
-
# Run all tests
|
| 257 |
-
pytest
|
| 258 |
-
|
| 259 |
-
# Run tests with coverage
|
| 260 |
-
pytest --cov=core
|
| 261 |
-
```
|
| 262 |
-
|
| 263 |
-
### Code Formatting
|
| 264 |
-
|
| 265 |
-
The project uses standard Python formatting. Run the following to format and lint the code:
|
| 266 |
-
|
| 267 |
-
```bash
|
| 268 |
-
# Using black
|
| 269 |
-
black .
|
| 270 |
-
|
| 271 |
-
# Using ruff
|
| 272 |
-
ruff check .
|
| 273 |
-
ruff format .
|
| 274 |
-
```
|
| 275 |
-
|
| 276 |
-
## Deployment
|
| 277 |
-
|
| 278 |
-
### Production Deployment
|
| 279 |
-
|
| 280 |
-
For production deployment, consider:
|
| 281 |
-
|
| 282 |
-
1. **Use a production process manager** such as Gunicorn with Uvicorn workers:
|
| 283 |
-
```bash
|
| 284 |
-
gunicorn core.main:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000
|
| 285 |
-
```
|
| 286 |
-
|
| 287 |
-
2. **Set DEBUG=False** in production
|
| 288 |
-
3. **Use a proper database** instead of SQLite
|
| 289 |
-
4. **Configure logging** for production
|
| 290 |
-
5. **Implement proper authentication/authorization**
|
| 291 |
-
6. **Set up monitoring and alerting**
|
| 292 |
-
|
| 293 |
-
### Docker Deployment
|
| 294 |
-
|
| 295 |
-
```dockerfile
|
| 296 |
-
FROM python:3.11-slim
|
| 297 |
-
|
| 298 |
-
WORKDIR /app
|
| 299 |
-
|
| 300 |
-
COPY requirements.txt .
|
| 301 |
-
RUN pip install --no-cache-dir -r requirements.txt
|
| 302 |
-
|
| 303 |
-
COPY . .
|
| 304 |
-
|
| 305 |
-
EXPOSE 8000
|
| 306 |
-
|
| 307 |
-
CMD ["uvicorn", "core.main:app", "--host", "0.0.0.0", "--port", "8000"]
|
| 308 |
-
```
|
| 309 |
-
|
| 310 |
-
## Troubleshooting
|
| 311 |
-
|
| 312 |
-
### Common Issues
|
| 313 |
-
|
| 314 |
-
1. **PageIndex API Key Issues**:
|
| 315 |
-
- Ensure your API key is correctly set in the environment
|
| 316 |
-
- Check that you have access to the PageIndex service
|
| 317 |
-
|
| 318 |
-
2. **LLM Connection Issues**:
|
| 319 |
-
- Verify the LLM endpoint URL and token are correct
|
| 320 |
-
- Check network connectivity to the LLM service
|
| 321 |
-
|
| 322 |
-
3. **PDF Processing Failures**:
|
| 323 |
-
- Ensure PDF files are not corrupted
|
| 324 |
-
- Check that file sizes are within the allowed limit
|
| 325 |
-
|
| 326 |
-
4. **Memory Issues**:
|
| 327 |
-
- Monitor memory usage during document processing
|
| 328 |
-
- Consider increasing server resources for large documents
|
| 329 |
-
|
| 330 |
-
### Logging
|
| 331 |
-
|
| 332 |
-
The application uses Loguru for detailed logging. Check the logs to diagnose issues:
|
| 333 |
-
|
| 334 |
-
- Info logs: General application flow
|
| 335 |
-
- Debug logs: Detailed information (when DEBUG=True)
|
| 336 |
-
- Warning logs: Potential issues
|
| 337 |
-
- Error logs: Failed operations
|
| 338 |
-
|
| 339 |
-
### Performance Considerations
|
| 340 |
-
|
| 341 |
-
- Large PDF files will take longer to process
|
| 342 |
-
- Complex documents may require more LLM calls
|
| 343 |
-
- Consider implementing document size limits for better performance
|
| 344 |
-
- Monitor API usage limits for PageIndex and LLM services
|
| 345 |
-
|
| 346 |
-
## License
|
| 347 |
-
|
| 348 |
-
This project is licensed under the [MIT License](LICENSE) - see the LICENSE file for details.
|
| 349 |
-
|
| 350 |
---
|
| 351 |
|
| 352 |
-
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: PageIndex Chat
|
| 3 |
+
emoji: π
|
| 4 |
+
colorFrom: indigo
|
| 5 |
+
colorTo: pink
|
| 6 |
+
sdk: docker
|
| 7 |
+
pinned: false
|
| 8 |
+
short_description: Simple PDF QnA Chatbot using PageIndex
|
| 9 |
---
|
| 10 |
|
| 11 |
+
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
|
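
The removed README documents a `POST /chat` request body with `message`, `pdf_id`, and an optional `session_id`, along with a `MAX_FILE_SIZE` of 10485760 bytes. Below is a minimal sketch of client-side helpers for those two pieces, assuming the field names shown in the removed curl example. The helper names `build_chat_payload` and `is_within_size_limit` are hypothetical and not part of the project, and treating the size limit as inclusive is also an assumption.

```python
import json
from typing import Optional

# Value taken from the removed README's .env example (10 MB in bytes).
MAX_FILE_SIZE = 10_485_760


def build_chat_payload(message: str, pdf_id: str, session_id: Optional[str] = None) -> dict:
    """Build a JSON body for POST /chat (hypothetical helper).

    session_id is left out of the body when the caller does not supply one,
    mirroring the README's description of it as optional.
    """
    if not message.strip():
        raise ValueError("message must not be empty")
    payload = {"message": message, "pdf_id": pdf_id}
    if session_id is not None:
        payload["session_id"] = session_id
    return payload


def is_within_size_limit(size_in_bytes: int, limit: int = MAX_FILE_SIZE) -> bool:
    """Return True when a file of this size is at or below the limit.

    Assumption: the limit is inclusive; the server may enforce it differently.
    """
    return 0 <= size_in_bytes <= limit


if __name__ == "__main__":
    body = build_chat_payload("What is this document about?", "generated-uuid-here")
    print(json.dumps(body))
    print(is_within_size_limit(5 * 1024 * 1024))
```

Checking the file size before upload avoids sending a request the server is likely to reject, and building the body in one place keeps the optional `session_id` handling consistent across calls.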