AnseMin committed
Commit 35912f6 · 1 Parent(s): f46dfbd

Enhance README and add data management features


- Updated README to include new "Clear All Data" functionality for user-friendly data management.
- Added detailed instructions for in-app data clearing, including environment detection and user feedback.
- Introduced `DataClearingService` for comprehensive data management, ensuring safe operations and automatic session resets.
- Enhanced UI elements to reflect new data management capabilities and improved status reporting.

Files changed (1)
  1. README.md +39 -6
README.md CHANGED
@@ -36,12 +36,16 @@ A Hugging Face Space that converts various document formats to Markdown and lets
36
  - **Usage limits** to prevent abuse on public spaces
37
  - **Powered by Gemini 2.5 Flash** for high-quality responses
38
  - **OpenAI embeddings** for accurate document retrieval
 
39
 
40
  ### User Interface
41
  - **Dual-tab interface**: Document Converter + Chat
42
- - **Real-time status monitoring** for RAG system
43
  - **Auto-ingestion** of converted documents into chat system
44
- - Clean, responsive UI
45
 
46
  ## Using MarkItDown & Docling
47
 
@@ -130,7 +134,8 @@ The application uses centralized configuration management. You can enhance funct
130
  3. Ask questions about your converted documents
131
  4. Enjoy real-time streaming responses with document context
132
  5. Use "New Session" to start fresh conversations
133
- 6. Monitor your usage limits in the status panel
 
134
 
135
  ## Local Development
136
 
@@ -191,6 +196,23 @@ This is particularly useful when:
191
  - Resetting the system to a clean state
192
  - Debugging document ingestion issues
193
 
194
  ### 🧪 **Development Features:**
195
  - **Automatic Environment Setup**: Dependencies are checked and installed automatically
196
  - **Configuration Validation**: Startup validation reports missing API keys and configuration issues
@@ -418,7 +440,8 @@ markit_v2/
418
  │ │ └── latex_to_markdown_converter.py # LaTeX conversion utility
419
  │ ├── services/ # Business logic layer
420
  │ │ ├── __init__.py # Package initialization
421
- │ │ └── document_service.py # 🆕 Document processing service
 
422
  │ ├── parsers/ # Parser implementations
423
  │ │ ├── __init__.py # Package initialization
424
  │ │ ├── parser_interface.py # Enhanced parser interface
@@ -449,6 +472,7 @@ markit_v2/
449
  - **Configuration Management**: Centralized API keys, model settings, and app configuration (`src/core/config.py`)
450
  - **Exception Hierarchy**: Proper error handling with specific exception types (`src/core/exceptions.py`)
451
  - **Service Layer**: Business logic separated from UI and core utilities (`src/services/document_service.py`)
 
452
  - **Environment Management**: Automated dependency checking and setup (`src/core/environment.py`)
453
  - **Enhanced Parser Interface**: Validation, metadata, and cancellation support
454
  - **Lightweight Launcher**: Quick development startup with `run_app.py`
@@ -458,13 +482,22 @@ markit_v2/
458
  ### 🧠 **RAG System Architecture:**
459
  - **Embeddings Management** (`src/rag/embeddings.py`): OpenAI text-embedding-3-small integration
460
  - **Markdown-Aware Chunking** (`src/rag/chunking.py`): Preserves tables and code blocks as whole units
461
- - **Vector Store** (`src/rag/vector_store.py`): Chroma database with persistent storage
462
  - **Chat Memory** (`src/rag/memory.py`): Session management and conversation history
463
  - **Chat Service** (`src/rag/chat_service.py`): Streaming RAG responses with Gemini 2.5 Flash
464
- - **Document Ingestion** (`src/rag/ingestion.py`): Automated pipeline for converting documents to RAG-ready format
465
  - **Usage Limiting**: Anti-abuse measures for public deployment
466
  - **Auto-Ingestion**: Seamless integration with document conversion workflow
467
 
468
  ### ZeroGPU Integration Notes
469
 
470
  When developing for Hugging Face Spaces with Stateless GPU:
 
36
  - **Usage limits** to prevent abuse on public spaces
37
  - **Powered by Gemini 2.5 Flash** for high-quality responses
38
  - **OpenAI embeddings** for accurate document retrieval
39
+ - **πŸ—‘οΈ Clear All Data** button for easy data management in both local and HF Space environments
40
 
41
  ### User Interface
42
  - **Dual-tab interface**: Document Converter + Chat
43
+ - **Real-time status monitoring** for RAG system with environment detection
44
  - **Auto-ingestion** of converted documents into chat system
45
+ - **Enhanced status display**: Shows vector store document count, chat history files, and environment type
46
+ - **Data management controls**: Clear All Data button with comprehensive feedback
47
+ - **Filename preservation**: Downloaded files maintain original names (e.g., "example data.pdf" → "example data.md")
48
+ - Clean, responsive UI with modern styling
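
The filename-preservation bullet above can be illustrated with a minimal sketch. The helper name `markdown_download_name` is hypothetical and not taken from the repository; it only shows one way to keep the original stem (spaces included) while swapping the extension for `.md`.

```python
from pathlib import PurePath


def markdown_download_name(original_name: str) -> str:
    """Return the original file name with its final extension swapped for .md.

    Hypothetical helper for illustration; the app's real implementation may differ.
    """
    # .name drops any directory part; spaces in the stem are kept as-is.
    return PurePath(original_name).with_suffix(".md").name


print(markdown_download_name("example data.pdf"))
```

Only the last extension is replaced, so a multi-suffix name such as `archive.tar.gz` would keep its `.tar` part under this approach.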
49
 
50
  ## Using MarkItDown & Docling
51
 
 
134
  3. Ask questions about your converted documents
135
  4. Enjoy real-time streaming responses with document context
136
  5. Use "New Session" to start fresh conversations
137
+ 6. Use "🗑️ Clear All Data" to remove all documents and chat history
138
+ 7. Monitor your usage limits in the status panel
139
 
140
  ## Local Development
141
 
 
196
  - Resetting the system to a clean state
197
  - Debugging document ingestion issues
198
 
199
+ ### 🗑️ **In-App Data Clearing:**
200
+ In addition to command-line data clearing, you can also clear data directly from the web interface:
201
+
202
+ 1. Go to the **"Chat with Documents"** tab
203
+ 2. Click the **"🗑️ Clear All Data"** button in the control panel
204
+ 3. All vector store documents and chat history will be cleared
205
+ 4. A new chat session will automatically start
206
+ 5. The status panel will update to reflect the cleared state
207
+
208
+ **Features of in-app clearing:**
209
+ - **Environment Detection**: Automatically works in both local and HF Space environments
210
+ - **Comprehensive Clearing**: Removes both vector store documents and chat history files
211
+ - **Smart Path Resolution**: Uses `/tmp/data/*` for HF Spaces, `./data/*` for local development
212
+ - **User Feedback**: Shows detailed results of what was cleared
213
+ - **Auto-Session Reset**: Starts fresh chat session after clearing
214
+ - **Safe Operation**: Handles errors gracefully and provides status updates
215
+
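
The environment detection and path resolution described in the list above might look roughly like the sketch below. It is not the code of `DataClearingService`: the function names are hypothetical, and it assumes the app detects a Hugging Face Space through the `SPACE_ID` environment variable that Spaces set, which this diff does not confirm. Only the two paths (`/tmp/data` and `./data`) come from the README text.

```python
import os
import shutil
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the data root: /tmp/data on an HF Space, ./data locally.

    Assumption: a non-empty SPACE_ID variable marks a Space environment.
    """
    env = os.environ if env is None else env
    if env.get("SPACE_ID"):
        return Path("/tmp/data")
    return Path("./data")


def clear_directory(root: Path) -> dict:
    """Delete everything inside root and report what happened.

    Errors are collected instead of raised, so the UI can show them.
    """
    result = {"removed": 0, "errors": []}
    if not root.is_dir():
        # Nothing to clear; treat a missing directory as already clean.
        return result
    for entry in root.iterdir():
        try:
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
            result["removed"] += 1
        except OSError as exc:
            result["errors"].append(f"{entry}: {exc}")
    return result
```

Returning a result dictionary rather than raising keeps a failed deletion from aborting the rest of the clearing pass, which matches the "Safe Operation" and "User Feedback" bullets.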
216
  ### 🧪 **Development Features:**
217
  - **Automatic Environment Setup**: Dependencies are checked and installed automatically
218
  - **Configuration Validation**: Startup validation reports missing API keys and configuration issues
 
440
  │ │ └── latex_to_markdown_converter.py # LaTeX conversion utility
441
  │ ├── services/ # Business logic layer
442
  │ │ ├── __init__.py # Package initialization
443
+ │ │ ├── document_service.py # 🆕 Document processing service
444
+ │ │ └── data_clearing_service.py # 🆕 Data management and clearing service
445
  │ ├── parsers/ # Parser implementations
446
  │ │ ├── __init__.py # Package initialization
447
  │ │ ├── parser_interface.py # Enhanced parser interface
 
472
  - **Configuration Management**: Centralized API keys, model settings, and app configuration (`src/core/config.py`)
473
  - **Exception Hierarchy**: Proper error handling with specific exception types (`src/core/exceptions.py`)
474
  - **Service Layer**: Business logic separated from UI and core utilities (`src/services/document_service.py`)
475
+ - **Data Management Service**: Comprehensive data clearing functionality (`src/services/data_clearing_service.py`)
476
  - **Environment Management**: Automated dependency checking and setup (`src/core/environment.py`)
477
  - **Enhanced Parser Interface**: Validation, metadata, and cancellation support
478
  - **Lightweight Launcher**: Quick development startup with `run_app.py`
 
482
  ### 🧠 **RAG System Architecture:**
483
  - **Embeddings Management** (`src/rag/embeddings.py`): OpenAI text-embedding-3-small integration
484
  - **Markdown-Aware Chunking** (`src/rag/chunking.py`): Preserves tables and code blocks as whole units
485
+ - **Vector Store** (`src/rag/vector_store.py`): Chroma database with persistent storage and deduplication
486
  - **Chat Memory** (`src/rag/memory.py`): Session management and conversation history
487
  - **Chat Service** (`src/rag/chat_service.py`): Streaming RAG responses with Gemini 2.5 Flash
488
+ - **Document Ingestion** (`src/rag/ingestion.py`): Automated pipeline with intelligent duplicate handling
489
  - **Usage Limiting**: Anti-abuse measures for public deployment
490
  - **Auto-Ingestion**: Seamless integration with document conversion workflow
491
 
492
+ ### 🗑️ **Data Management & Deduplication:**
493
+ - **File Hash-Based Deduplication**: Uses SHA-256 hashes of original file content to prevent duplicates
494
+ - **Chroma Where Filter Integration**: Persistent duplicate detection using vector store metadata queries
495
+ - **Automatic Document Replacement**: When same file is uploaded again, old version is replaced with new one
496
+ - **Cross-Environment Data Clearing**: Works seamlessly in both local development and HF Space environments
497
+ - **Environment-Aware Path Resolution**: Automatically detects and uses correct data paths (`./data/*` vs `/tmp/data/*`)
498
+ - **Comprehensive Status Reporting**: Real-time display of vector store documents, chat history files, and environment type
499
+ - **Safe Clearing Operations**: Graceful error handling with detailed feedback on clearing operations
500
+
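
The hash-based deduplication and replacement flow listed above can be sketched as follows. This is an illustration only: `InMemoryStore` is a hypothetical stand-in for the Chroma vector store, and `ids_where` mimics a metadata filter query without using Chroma's API. The `file_hash` metadata key and the chunk ID scheme are likewise assumptions, not names from the repository.

```python
import hashlib
from typing import Dict, List, Tuple


def file_sha256(content: bytes) -> str:
    """SHA-256 hex digest of the original file bytes."""
    return hashlib.sha256(content).hexdigest()


class InMemoryStore:
    """Hypothetical stand-in for the vector store: chunk id -> (text, metadata)."""

    def __init__(self) -> None:
        self.chunks: Dict[str, Tuple[str, dict]] = {}

    def ids_where(self, key: str, value: str) -> List[str]:
        # Mimics a metadata filter: ids of chunks whose metadata[key] == value.
        return [cid for cid, (_, meta) in self.chunks.items() if meta.get(key) == value]

    def delete(self, ids: List[str]) -> None:
        for cid in ids:
            self.chunks.pop(cid, None)

    def add(self, ids: List[str], texts: List[str], metadatas: List[dict]) -> None:
        for cid, text, meta in zip(ids, texts, metadatas):
            self.chunks[cid] = (text, meta)


def ingest(store: InMemoryStore, content: bytes, chunks: List[str]) -> str:
    """Store chunks for a file, replacing any earlier copy with the same hash."""
    digest = file_sha256(content)
    existing = store.ids_where("file_hash", digest)
    if existing:
        store.delete(existing)  # drop the old version before re-adding
        status = "replaced"
    else:
        status = "added"
    ids = [f"{digest[:12]}-{i}" for i in range(len(chunks))]
    store.add(ids, chunks, [{"file_hash": digest} for _ in chunks])
    return status
```

Because the hash is stored as chunk metadata, duplicate detection survives restarts as long as the store itself is persistent, which is the point of querying the store rather than keeping a separate in-process list of seen files.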
501
  ### ZeroGPU Integration Notes
502
 
503
  When developing for Hugging Face Spaces with Stateless GPU: