Enhance README and add data management features
- Updated README to include new "Clear All Data" functionality for user-friendly data management.
- Added detailed instructions for in-app data clearing, including environment detection and user feedback.
- Introduced `DataClearingService` for comprehensive data management, ensuring safe operations and automatic session resets.
- Enhanced UI elements to reflect new data management capabilities and improved status reporting.
README.md
CHANGED
|
@@ -36,12 +36,16 @@ A Hugging Face Space that converts various document formats to Markdown and lets
|
|
| 36 |
- **Usage limits** to prevent abuse on public spaces
|
| 37 |
- **Powered by Gemini 2.5 Flash** for high-quality responses
|
| 38 |
- **OpenAI embeddings** for accurate document retrieval
|
|
|
|
| 39 |
|
| 40 |
### User Interface
|
| 41 |
- **Dual-tab interface**: Document Converter + Chat
|
| 42 |
-
- **Real-time status monitoring** for RAG system
|
| 43 |
- **Auto-ingestion** of converted documents into chat system
|
| 44 |
-
-
|
| 45 |
|
| 46 |
## Using MarkItDown & Docling
|
| 47 |
|
|
@@ -130,7 +134,8 @@ The application uses centralized configuration management. You can enhance funct
|
|
| 130 |
3. Ask questions about your converted documents
|
| 131 |
4. Enjoy real-time streaming responses with document context
|
| 132 |
5. Use "New Session" to start fresh conversations
|
| 133 |
-
6.
|
|
|
|
| 134 |
|
| 135 |
## Local Development
|
| 136 |
|
|
@@ -191,6 +196,23 @@ This is particularly useful when:
|
|
| 191 |
- Resetting the system to a clean state
|
| 192 |
- Debugging document ingestion issues
|
| 193 |
|
| 194 |
### 🧪 **Development Features:**
|
| 195 |
- **Automatic Environment Setup**: Dependencies are checked and installed automatically
|
| 196 |
- **Configuration Validation**: Startup validation reports missing API keys and configuration issues
|
|
@@ -418,7 +440,8 @@ markit_v2/
|
|
| 418 |
│   │   └── latex_to_markdown_converter.py # LaTeX conversion utility
|
| 419 |
│   ├── services/ # Business logic layer
|
| 420 |
│   │   ├── __init__.py # Package initialization
|
| 421 |
-
β β
|
|
|
|
| 422 |
│   ├── parsers/ # Parser implementations
|
| 423 |
│   │   ├── __init__.py # Package initialization
|
| 424 |
│   │   ├── parser_interface.py # Enhanced parser interface
|
|
@@ -449,6 +472,7 @@ markit_v2/
|
|
| 449 |
- **Configuration Management**: Centralized API keys, model settings, and app configuration (`src/core/config.py`)
|
| 450 |
- **Exception Hierarchy**: Proper error handling with specific exception types (`src/core/exceptions.py`)
|
| 451 |
- **Service Layer**: Business logic separated from UI and core utilities (`src/services/document_service.py`)
|
|
|
|
| 452 |
- **Environment Management**: Automated dependency checking and setup (`src/core/environment.py`)
|
| 453 |
- **Enhanced Parser Interface**: Validation, metadata, and cancellation support
|
| 454 |
- **Lightweight Launcher**: Quick development startup with `run_app.py`
|
|
@@ -458,13 +482,22 @@ markit_v2/
|
|
| 458 |
### 🧠 **RAG System Architecture:**
|
| 459 |
- **Embeddings Management** (`src/rag/embeddings.py`): OpenAI text-embedding-3-small integration
|
| 460 |
- **Markdown-Aware Chunking** (`src/rag/chunking.py`): Preserves tables and code blocks as whole units
|
| 461 |
-
- **Vector Store** (`src/rag/vector_store.py`): Chroma database with persistent storage
|
| 462 |
- **Chat Memory** (`src/rag/memory.py`): Session management and conversation history
|
| 463 |
- **Chat Service** (`src/rag/chat_service.py`): Streaming RAG responses with Gemini 2.5 Flash
|
| 464 |
-
- **Document Ingestion** (`src/rag/ingestion.py`): Automated pipeline
|
| 465 |
- **Usage Limiting**: Anti-abuse measures for public deployment
|
| 466 |
- **Auto-Ingestion**: Seamless integration with document conversion workflow
|
| 467 |
|
| 468 |
### ZeroGPU Integration Notes
|
| 469 |
|
| 470 |
When developing for Hugging Face Spaces with Stateless GPU:
|
|
|
|
| 36 |
- **Usage limits** to prevent abuse on public spaces
|
| 37 |
- **Powered by Gemini 2.5 Flash** for high-quality responses
|
| 38 |
- **OpenAI embeddings** for accurate document retrieval
|
| 39 |
+
- **🗑️ Clear All Data** button for easy data management in both local and HF Space environments
|
| 40 |
|
| 41 |
### User Interface
|
| 42 |
- **Dual-tab interface**: Document Converter + Chat
|
| 43 |
+
- **Real-time status monitoring** for RAG system with environment detection
|
| 44 |
- **Auto-ingestion** of converted documents into chat system
|
| 45 |
+
- **Enhanced status display**: Shows vector store document count, chat history files, and environment type
|
| 46 |
+
- **Data management controls**: Clear All Data button with comprehensive feedback
|
| 47 |
+
- **Filename preservation**: Downloaded files maintain original names (e.g., "example data.pdf" → "example data.md")
|
| 48 |
+
- Clean, responsive UI with modern styling
|
| 49 |
|
| 50 |
## Using MarkItDown & Docling
|
| 51 |
|
|
|
|
| 134 |
3. Ask questions about your converted documents
|
| 135 |
4. Enjoy real-time streaming responses with document context
|
| 136 |
5. Use "New Session" to start fresh conversations
|
| 137 |
+
6. Use "🗑️ Clear All Data" to remove all documents and chat history
|
| 138 |
+
7. Monitor your usage limits in the status panel
|
| 139 |
|
| 140 |
## Local Development
|
| 141 |
|
|
|
|
| 196 |
- Resetting the system to a clean state
|
| 197 |
- Debugging document ingestion issues
|
| 198 |
|
| 199 |
+
### 🗑️ **In-App Data Clearing:**
|
| 200 |
+
In addition to command-line data clearing, you can clear data directly from the web interface:
|
| 201 |
+
|
| 202 |
+
1. Go to the **"Chat with Documents"** tab
|
| 203 |
+
2. Click the **"🗑️ Clear All Data"** button in the control panel
|
| 204 |
+
3. All vector store documents and chat history will be cleared
|
| 205 |
+
4. A new chat session will automatically start
|
| 206 |
+
5. The status panel will update to reflect the cleared state
|
| 207 |
+
|
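The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual `DataClearingService`: the function name `clear_all_data`, the result keys, and the use of a UUID as the new session identifier are all assumptions made for the example.

```python
import shutil
import tempfile
import uuid
from pathlib import Path


def clear_all_data(vector_store_dir: Path, chat_history_dir: Path) -> dict:
    """Empty both directories and report what happened (hypothetical helper)."""
    result = {"removed": 0, "errors": [], "session_id": None}
    for directory in (vector_store_dir, chat_history_dir):
        if not directory.exists():
            continue  # a missing directory simply has nothing to clear
        for entry in directory.iterdir():
            try:
                if entry.is_dir():
                    shutil.rmtree(entry)
                else:
                    entry.unlink()
                result["removed"] += 1
            except OSError as exc:
                # Record the failure and keep going instead of aborting.
                result["errors"].append(f"{entry}: {exc}")
    # Hand back a fresh session identifier so the UI can start a new chat.
    result["session_id"] = uuid.uuid4().hex
    return result
```

The returned dictionary is the kind of summary a status panel could display after the button is clicked: how many entries were removed, which ones failed, and the identifier of the new session.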
| 208 |
+
**Features of in-app clearing:**
|
| 209 |
+
- **Environment Detection**: Automatically works in both local and HF Space environments
|
| 210 |
+
- **Comprehensive Clearing**: Removes both vector store documents and chat history files
|
| 211 |
+
- **Smart Path Resolution**: Uses `/tmp/data/*` for HF Spaces, `./data/*` for local development
|
| 212 |
+
- **User Feedback**: Shows detailed results of what was cleared
|
| 213 |
+
- **Auto-Session Reset**: Starts fresh chat session after clearing
|
| 214 |
+
- **Safe Operation**: Handles errors gracefully and provides status updates
|
| 215 |
+
|
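As an illustration of the path resolution described above, here is a small sketch. It assumes the environment is detected through the `SPACE_ID` environment variable that Hugging Face Spaces set; the README does not say how detection is actually done. The subdirectory names `vector_store` and `chat_history` and both function names are hypothetical.

```python
import os
from pathlib import Path


def is_hf_space(env=None) -> bool:
    """Assumption: a non-empty SPACE_ID means we are running in an HF Space."""
    env = os.environ if env is None else env
    return bool(env.get("SPACE_ID"))


def resolve_data_dirs(env=None) -> dict:
    """Pick /tmp/data on HF Spaces and ./data locally (hypothetical layout)."""
    on_space = is_hf_space(env)
    base = Path("/tmp/data") if on_space else Path("./data")
    return {
        "environment": "hf_space" if on_space else "local",
        "vector_store": base / "vector_store",  # assumed subdirectory name
        "chat_history": base / "chat_history",  # assumed subdirectory name
    }
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.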
| 216 |
### 🧪 **Development Features:**
|
| 217 |
- **Automatic Environment Setup**: Dependencies are checked and installed automatically
|
| 218 |
- **Configuration Validation**: Startup validation reports missing API keys and configuration issues
|
|
|
|
| 440 |
│   │   └── latex_to_markdown_converter.py # LaTeX conversion utility
|
| 441 |
│   ├── services/ # Business logic layer
|
| 442 |
│   │   ├── __init__.py # Package initialization
|
| 443 |
+
│   │   ├── document_service.py # 🆕 Document processing service
|
| 444 |
+
│   │   └── data_clearing_service.py # 🆕 Data management and clearing service
|
| 445 |
│   ├── parsers/ # Parser implementations
|
| 446 |
│   │   ├── __init__.py # Package initialization
|
| 447 |
│   │   ├── parser_interface.py # Enhanced parser interface
|
|
|
|
| 472 |
- **Configuration Management**: Centralized API keys, model settings, and app configuration (`src/core/config.py`)
|
| 473 |
- **Exception Hierarchy**: Proper error handling with specific exception types (`src/core/exceptions.py`)
|
| 474 |
- **Service Layer**: Business logic separated from UI and core utilities (`src/services/document_service.py`)
|
| 475 |
+
- **Data Management Service**: Comprehensive data clearing functionality (`src/services/data_clearing_service.py`)
|
| 476 |
- **Environment Management**: Automated dependency checking and setup (`src/core/environment.py`)
|
| 477 |
- **Enhanced Parser Interface**: Validation, metadata, and cancellation support
|
| 478 |
- **Lightweight Launcher**: Quick development startup with `run_app.py`
|
|
|
|
| 482 |
### 🧠 **RAG System Architecture:**
|
| 483 |
- **Embeddings Management** (`src/rag/embeddings.py`): OpenAI text-embedding-3-small integration
|
| 484 |
- **Markdown-Aware Chunking** (`src/rag/chunking.py`): Preserves tables and code blocks as whole units
|
| 485 |
+
- **Vector Store** (`src/rag/vector_store.py`): Chroma database with persistent storage and deduplication
|
| 486 |
- **Chat Memory** (`src/rag/memory.py`): Session management and conversation history
|
| 487 |
- **Chat Service** (`src/rag/chat_service.py`): Streaming RAG responses with Gemini 2.5 Flash
|
| 488 |
+
- **Document Ingestion** (`src/rag/ingestion.py`): Automated pipeline with intelligent duplicate handling
|
| 489 |
- **Usage Limiting**: Anti-abuse measures for public deployment
|
| 490 |
- **Auto-Ingestion**: Seamless integration with document conversion workflow
|
| 491 |
|
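To make "Markdown-aware chunking" concrete, the sketch below shows one way to keep fenced code blocks and tables whole while packing text into size-limited chunks. It is not the code in `src/rag/chunking.py`; the function names and the `max_chars` parameter are assumptions, and a block larger than the limit is simply emitted as its own oversized chunk.

```python
FENCE = "`" * 3  # a Markdown code fence marker, built without typing it literally


def split_markdown_blocks(text: str) -> list:
    """Split on blank lines, except inside fenced code blocks."""
    blocks, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            current.append(line)
        elif in_fence or line.strip():
            # Table rows are consecutive non-blank lines, so they stay together.
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks


def chunk_markdown(text: str, max_chars: int = 500) -> list:
    """Pack whole blocks into chunks; a block is never split."""
    chunks, current = [], ""
    for block in split_markdown_blocks(text):
        candidate = f"{current}\n\n{block}" if current else block
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The design choice here is to treat the block, not the character count, as the smallest unit: the size limit decides where to cut between blocks but never inside one.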
| 492 |
+
### 🗑️ **Data Management & Deduplication:**
|
| 493 |
+
- **File Hash-Based Deduplication**: Uses SHA-256 hashes of original file content to prevent duplicates
|
| 494 |
+
- **Chroma Where Filter Integration**: Persistent duplicate detection using vector store metadata queries
|
| 495 |
+
- **Automatic Document Replacement**: When the same file is uploaded again, the old version is replaced with the new one
|
| 496 |
+
- **Cross-Environment Data Clearing**: Works seamlessly in both local development and HF Space environments
|
| 497 |
+
- **Environment-Aware Path Resolution**: Automatically detects and uses correct data paths (`./data/*` vs `/tmp/data/*`)
|
| 498 |
+
- **Comprehensive Status Reporting**: Real-time display of vector store documents, chat history files, and environment type
|
| 499 |
+
- **Safe Clearing Operations**: Graceful error handling with detailed feedback on clearing operations
|
| 500 |
+
|
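The hash-based deduplication listed above could look roughly like this. The sketch uses a small in-memory class as a stand-in for the Chroma collection so that it runs without extra dependencies; `InMemoryStore`, `ingest`, and the `file_hash` metadata key are hypothetical names, and the real pipeline in `src/rag/ingestion.py` may differ.

```python
import hashlib


def file_sha256(content: bytes) -> str:
    """SHA-256 hex digest of the original file content."""
    return hashlib.sha256(content).hexdigest()


class InMemoryStore:
    """Stand-in for a vector store that filters records by metadata."""

    def __init__(self):
        self.records = []  # each record: {"text": str, "metadata": dict}

    def get(self, where: dict) -> list:
        return [r for r in self.records
                if all(r["metadata"].get(k) == v for k, v in where.items())]

    def delete(self, where: dict) -> int:
        matches = self.get(where)
        self.records = [r for r in self.records if r not in matches]
        return len(matches)

    def add(self, text: str, metadata: dict) -> None:
        self.records.append({"text": text, "metadata": metadata})


def ingest(store: InMemoryStore, filename: str, content: bytes, chunks: list) -> str:
    """Replace any chunks stored under the same file hash, then add the new ones."""
    digest = file_sha256(content)
    replaced = store.delete({"file_hash": digest})
    for index, chunk in enumerate(chunks):
        store.add(chunk, {"file_hash": digest, "source": filename, "chunk": index})
    return "replaced" if replaced else "added"
```

Because the hash is stored as metadata alongside each chunk, the duplicate check survives restarts as long as the store itself is persistent, which is the property the "where filter" bullet describes.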
| 501 |
### ZeroGPU Integration Notes
|
| 502 |
|
| 503 |
When developing for Hugging Face Spaces with Stateless GPU:
|