Enhance README to reflect new features and improvements in document processing

- Updated GOT-OCR section to highlight native LaTeX output and Mathpix rendering capabilities.
- Introduced Smart Content-Aware Chunking for both Markdown and LaTeX, detailing preservation of structures and automatic format detection.
- Added new configuration options for LaTeX chunking in the application settings.
- Expanded usage instructions for GOT-OCR and other parsers, clarifying output formats.
- Included a new section on advanced LaTeX processing features and use cases for better user guidance.
README.md (CHANGED)
```diff
@@ -24,7 +24,7 @@
 - Multiple parser options:
   - MarkItDown: For comprehensive document conversion
   - Docling: For advanced PDF understanding with table structure recognition + **multi-document processing**
-  - GOT-OCR: For image-based OCR with LaTeX
+  - GOT-OCR: For image-based OCR with **native LaTeX output** and Mathpix rendering
   - Gemini Flash: For AI-powered text extraction with **advanced multi-document capabilities**
   - Mistral OCR: High-accuracy OCR for PDFs and images with optional *Document Understanding* mode + **multi-document processing**
 - **Intelligent Processing Types**:
```
```diff
@@ -42,7 +42,10 @@
 - **BM25 Keyword Search**: Traditional keyword-based retrieval
 - **Hybrid Search**: Combines semantic + keyword search for best accuracy
 - **Intelligent document retrieval** using vector embeddings
-- **
+- **Smart Content-Aware Chunking**:
+  - **Markdown chunking** that preserves tables and code blocks
+  - **LaTeX chunking** that preserves mathematical tables, environments, and structures
+  - **Automatic format detection** for optimal chunking strategy
 - **Streaming chat responses** for real-time interaction
 - **Chat history management** with session persistence
 - **Usage limits** to prevent abuse on public spaces
```
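The automatic format detection added above can be sketched as a simple heuristic. This is a hypothetical illustration only — the marker patterns, the threshold, and the function name are assumptions, not the project's actual code:

```python
import re

# Hypothetical sketch of automatic format detection: count LaTeX markers
# and fall back to Markdown chunking when too few are found.
LATEX_MARKERS = [
    r"\\begin\{[a-zA-Z*]+\}",   # environments such as \begin{tabular}
    r"\\section\{",
    r"\\title\{",
    r"\$[^$]+\$",               # inline math
]

def detect_format(text: str) -> str:
    """Return 'latex' when enough LaTeX markers are present, else 'markdown'."""
    hits = sum(len(re.findall(pattern, text)) for pattern in LATEX_MARKERS)
    return "latex" if hits >= 2 else "markdown"
```

A detector like this lets one ingestion pipeline route GOT-OCR output and ordinary Markdown to different chunkers without any user input.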
```diff
@@ -168,8 +171,10 @@
 - `VECTOR_STORE_PATH`: Path for vector database storage (default: ./data/vector_store)
 - `CHAT_HISTORY_PATH`: Path for chat history storage (default: ./data/chat_history)
 - `EMBEDDING_MODEL`: OpenAI embedding model (default: text-embedding-3-small)
-- `CHUNK_SIZE`: Document chunk size for
-- `CHUNK_OVERLAP`: Overlap between chunks (default: 200)
+- `CHUNK_SIZE`: Document chunk size for Markdown content (default: 1000)
+- `CHUNK_OVERLAP`: Overlap between chunks for Markdown (default: 200)
+- `LATEX_CHUNK_SIZE`: Document chunk size for LaTeX content (default: 1200)
+- `LATEX_CHUNK_OVERLAP`: Overlap between chunks for LaTeX (default: 150)
 - `MAX_MESSAGES_PER_SESSION`: Chat limit per session (default: 50)
 - `MAX_MESSAGES_PER_HOUR`: Chat limit per hour (default: 100)
 - `RETRIEVAL_K`: Number of documents to retrieve (default: 4)
```
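A minimal sketch of how the chunking settings above could be read from the environment, using the defaults listed in the README; the function name is hypothetical:

```python
import os

# Hypothetical helper: read the chunking environment variables documented
# above, falling back to the README's stated defaults.
def load_chunking_config() -> dict:
    return {
        "chunk_size": int(os.getenv("CHUNK_SIZE", "1000")),
        "chunk_overlap": int(os.getenv("CHUNK_OVERLAP", "200")),
        "latex_chunk_size": int(os.getenv("LATEX_CHUNK_SIZE", "1200")),
        "latex_chunk_overlap": int(os.getenv("LATEX_CHUNK_OVERLAP", "150")),
    }
```

Setting, for example, `LATEX_CHUNK_SIZE=800` before launch would then shrink LaTeX chunks without touching the Markdown settings.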
```diff
@@ -199,7 +204,9 @@
 - **"Gemini Flash"** for AI-powered text extraction
 4. Select an OCR method based on your chosen parser
 5. Click "Convert"
-6. View the
+6. **For GOT-OCR**: View the LaTeX output with **Mathpix rendering** for proper mathematical and tabular display
+7. **For other parsers**: View the Markdown output
+8. Download the converted file (.tex for GOT-OCR, .md for others)
 
 #### **Multi-Document Processing** (NEW!)
 1. Go to the **"Document Converter"** tab
```
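The Mathpix rendering step above displays the output inside an isolated iframe. A minimal sketch of base64 data-URI embedding (the helper name and iframe attributes are assumptions; the HTML content stands in for real rendered output):

```python
import base64

# Hypothetical sketch: wrap rendered HTML in a base64 data: URI so it can be
# shown in a sandboxed iframe, isolated from the host page.
def iframe_for_html(html: str) -> str:
    payload = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return (
        '<iframe sandbox src="data:text/html;base64,'
        + payload
        + '" width="100%" height="500"></iframe>'
    )
```

Because the document is carried inside the `src` attribute, no temporary files are needed and the sandboxed frame cannot reach the parent page.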
```diff
@@ -319,10 +326,58 @@
 - **Enhanced Error Messages**: Detailed error reporting for debugging
 - **Centralized Logging**: Configurable logging levels and output formats
 
+## GOT-OCR LaTeX Processing
+
+Markit v2 features **advanced LaTeX processing** for GOT-OCR results, providing proper mathematical and tabular content handling:
+
+### **Key Features:**
+
+#### **1. Native LaTeX Output**
+- **No LLM conversion**: GOT-OCR returns raw LaTeX directly for maximum accuracy
+- **Preserves mathematical structures**: Complex formulas, tables, and equations remain intact
+- **.tex file output**: Saves files in proper LaTeX format for external use
+
+#### **2. Mathpix Markdown Rendering**
+- **Professional display**: Uses the Mathpix Markdown library (same as the official GOT-OCR demo)
+- **Complex table support**: Renders `\begin{tabular}`, `\multirow`, and `\multicolumn` properly
+- **Mathematical expressions**: Displays LaTeX math with proper formatting
+- **Base64 iframe embedding**: Secure, isolated rendering environment
+
+#### **3. RAG-Compatible LaTeX Chunking**
+- **LaTeX-aware chunker**: Specialized chunking preserves LaTeX structures
+- **Complete table preservation**: Entire `\begin{tabular}...\end{tabular}` blocks stay intact
+- **Environment detection**: Maintains `\begin{env}...\end{env}` pairs
+- **Intelligent separators**: Uses LaTeX commands (`\section`, `\title`) as break points
+
+#### **4. Enhanced Metadata**
+- **Content type tracking**: `content_type: "latex"` for proper handling
+- **Structure detection**: Identifies tables, environments, and mathematical content
+- **Auto-format detection**: GOT-OCR results automatically use the LaTeX chunker
+
+### **Technical Implementation:**
+
+```javascript
+// Mathpix rendering (inspired by the official GOT-OCR demo)
+const html = window.render(latexContent, {htmlTags: true});
+```
+
+An example of the LaTeX structures that are preserved:
+
+```latex
+\begin{tabular}{|l|c|c|}
+\hline Disability & Participants & Results \\
+\hline Blind & 5 & $34.5\%, n=1$ \\
+\end{tabular}
+```
+
+### **Use Cases:**
+- **Research papers**: Mathematical formulas and data tables
+- **Scientific documents**: Complex equations and statistical data
+- **Financial reports**: Tabular data with calculations
+- **Academic content**: Mixed text, math, and structured data
+
 ## Credits
 
 - [MarkItDown](https://github.com/microsoft/markitdown) by Microsoft
 - [GOT-OCR](https://github.com/stepfun-ai/GOT-OCR-2.0) for image-based OCR
+- [Mathpix Markdown](https://github.com/Mathpix/mathpix-markdown-it) for LaTeX rendering
 - [Gradio](https://gradio.app/) for the UI framework
 
 ---
```
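The RAG-compatible LaTeX chunking described in the section above can be sketched roughly as follows. This is a simplified illustration, not the actual `src/rag/chunking.py` implementation — the names, the size limit, and the paragraph-level splitting between environments are assumptions:

```python
import re

# Match a whole \begin{env}...\end{env} pair so it can be kept atomic.
ENV = re.compile(r"\\begin\{(\w+\*?)\}.*?\\end\{\1\}", re.DOTALL)

def latex_chunks(text: str, max_size: int = 1200) -> list[str]:
    """Chunk LaTeX text without ever splitting inside an environment."""
    atoms, last = [], 0
    for match in ENV.finditer(text):
        atoms.extend(text[last:match.start()].split("\n\n"))  # splittable prose
        atoms.append(match.group(0))                          # atomic environment
        last = match.end()
    atoms.extend(text[last:].split("\n\n"))

    chunks, current = [], ""
    for atom in (a for a in atoms if a.strip()):
        if current and len(current) + len(atom) > max_size:
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{atom}" if current else atom
    if current:
        chunks.append(current)
    return chunks
```

The key property is that a `\begin{tabular}` block can never be separated from its `\end{tabular}`, so retrieved chunks always contain complete tables.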
```diff
@@ -343,6 +398,7 @@
 - **Best for**: General questions and semantic understanding
 - **Use case**: "What is the main topic of this document?"
 - **Configuration**: `{'k': 4, 'search_type': 'similarity'}`
+- **Chunking**: Uses content-aware chunking (Markdown or LaTeX) for optimal structure preservation
 
 ### **2. MMR (Maximal Marginal Relevance)**
 - **How it works**: Balances relevance with result diversity to reduce redundancy
```
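The configuration dict shown for each retrieval method can be translated into retriever arguments. A hypothetical sketch (the function and the `fetch_k` default are assumptions; in a LangChain-style vector store the result would typically be passed to `as_retriever`):

```python
# Hypothetical sketch: turn the per-method configuration shown above into
# (search_type, search_kwargs) for a vector store retriever.
def build_retriever_args(config: dict) -> tuple[str, dict]:
    search_type = config.get("search_type", "similarity")
    search_kwargs = {"k": config.get("k", 4)}
    if search_type == "mmr":
        # Fetch more candidates than k, then diversify (assumed default).
        search_kwargs["fetch_k"] = config.get("fetch_k", 20)
    return search_type, search_kwargs
```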
```diff
@@ -469,7 +525,12 @@
 
 ### **RAG System Architecture:**
 - **Embeddings Management** (`src/rag/embeddings.py`): OpenAI text-embedding-3-small integration
-- **
+- **Smart Content-Aware Chunking** (`src/rag/chunking.py`):
+  - **Unified chunker** supporting both Markdown and LaTeX content
+  - **Markdown chunking**: Preserves tables and code blocks as whole units
+  - **LaTeX chunking**: Preserves `\begin{tabular}`, mathematical environments, and LaTeX structures
+  - **Automatic format detection**: GOT-OCR results → LaTeX chunker, others → Markdown chunker
+  - **Enhanced metadata**: Content type tracking and structure detection
 - **Advanced Vector Store** (`src/rag/vector_store.py`): Multi-strategy retrieval system with:
   - **Similarity Search**: Traditional semantic retrieval using embeddings
   - **MMR Support**: Maximal Marginal Relevance for diverse results
```
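Putting the architecture pieces above together, the chunker dispatch could look like this minimal, hypothetical sketch. The real chunkers in `src/rag/chunking.py` are structure-aware; plain paragraph splitting here is only a stand-in:

```python
# Hypothetical sketch of the unified chunker dispatch: GOT-OCR output is
# routed to LaTeX handling, everything else to Markdown, and each chunk
# carries content-type metadata for downstream retrieval.
def chunk_document(text: str, parser: str) -> list[dict]:
    content_type = "latex" if parser.lower() == "got-ocr" else "markdown"
    # Stand-in splitter; the real chunkers keep tables and environments intact.
    parts = [p for p in text.split("\n\n") if p.strip()]
    return [
        {"text": part, "metadata": {"content_type": content_type}}
        for part in parts
    ]
```

Tagging every chunk with its `content_type` lets the retrieval layer render or post-process LaTeX and Markdown chunks differently.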