Space: kev216/extract_document_to_md

wang.lingxiao committed on Jun 12, 2025

Commit 4f8205f (1 parent: e37fd3c)

merge

Files changed (3)

README.md +130 -175
__pycache__/app.cpython-313.pyc +0 -0
app.py +236 -1189

README.md CHANGED

@@ -1,6 +1,6 @@
 ---
-title: Advanced Document to Markdown Converter
-emoji: 🚀
 colorFrom: blue
 colorTo: purple
 sdk: gradio
@@ -14,205 +14,162 @@ tags:
   - document-processing
   - markdown
   - pdf-converter
-  - ai-analysis
-  - mcp-server-track
-  - mcp-server
-  - nlp
-  - ocr
-short_description: Convert any document to Markdown with AI-powered analysis
 ---
-# 🚀 Advanced Document to Markdown Converter
-Convert documents to Markdown format with AI-powered analysis and advanced features.
 ## Features
 ### 📄 Supported Formats
-- **PDF** - With OCR support for image-based PDFs
-- **Word Documents** (.docx) - Full formatting preservation
-- **PowerPoint** (.pptx) - Slide-by-slide conversion
-- **Excel** (.xlsx) - Table extraction and formatting
-- **Plain Text** (.txt, .md) - Smart formatting detection
-- **Rich Text** (.rtf) - Complete formatting support
-- **E-books** (.epub) - Chapter and content extraction
-### 🧠 AI-Powered Features
-- **Structure Analysis** - Intelligent document organization
-- **Topic Extraction** - Automatic keyword and topic identification
-- **Entity Recognition** - Named entity detection and classification
-- **Content Summarization** - AI-generated document summaries
-- **Smart Heading Detection** - Context-aware heading hierarchy
-### ⚡ Advanced Capabilities
-- **Batch Processing** - Process multiple documents simultaneously
-- **OCR Integration** - Extract text from images and scanned documents
-- **Custom Templates** - Pre-configured output formats
-- **Caching System** - Improved performance for repeated processing
-- **Progress Tracking** - Real-time processing status
-- **Export Options** - Multiple output formats (MD, HTML, PDF)
-### 🔧 Technical Features
-- **MCP Server** - Model Context Protocol integration
-- **Concurrent Processing** - Multi-threaded document handling
-- **Memory Optimization** - Efficient large file processing
-- **Error Recovery** - Robust error handling and reporting
 ## Usage
-### Single Document Processing
-1. Upload your document
-2. Configure processing options
-3. Click "Process Document"
-4. View results in multiple tabs
-### Batch Processing
-1. Upload multiple documents
-2. Enable combination option if needed
-3. Process all documents simultaneously
-4. Export results as needed
-### MCP Integration
-This application can be used as an MCP server with Claude AI:
-```json
-{
-    "mcpServers": {
-        "document_converter": {
-            "command": "npx",
-            "args": [
-                "mcp-remote",
-                "https://YOUR-SPACE-URL/gradio_api/mcp/sse",
-                "--transport",
-                "sse-only"
-            ]
-        }
-    }
-}
-```
 ## Installation
 ### Local Development
 ```bash
-git clone https://huggingface.co/spaces/YOUR-USERNAME/advanced-document-converter
-cd advanced-document-converter
-pip install -r requirements.txt
-python app.py
-```
-### Docker Deployment
-```dockerfile
-FROM python:3.11-slim
-WORKDIR /app
-COPY requirements.txt .
-RUN pip install -r requirements.txt
-# Install system dependencies for OCR
-RUN apt-get update && apt-get install -y \
-    tesseract-ocr \
-    tesseract-ocr-eng \
-    && rm -rf /var/lib/apt/lists/*
-COPY . .
-EXPOSE 7860
-CMD ["python", "app.py"]
 ```
-## API Documentation
-### Core Functions
-#### `process_document(file_path, options)`
-Process a single document and convert to Markdown.
-**Parameters:**
-- `file_path` (str): Path to the document file
-- `options` (dict): Processing configuration
-  - `enable_ai_analysis` (bool): Enable AI-powered analysis
-  - `include_frontmatter` (bool): Add YAML frontmatter
-  - `generate_toc` (bool): Generate table of contents
-  - `use_cache` (bool): Enable result caching
-**Returns:**
-- Dictionary with markdown content, structure analysis, and metadata
-#### `process_multiple_documents(file_paths, options)`
-Process multiple documents concurrently.
-**Parameters:**
-- `file_paths` (list): List of file paths
-- `options` (dict): Processing configuration
-  - `combine_documents` (bool): Merge into single document
-  - Additional options from single document processing
-**Returns:**
-- Dictionary with results for each document and optional combined output
-### MCP Functions
-#### `extract_document_to_md_process_document`
-MCP-compatible function for document processing.
-**Parameters:**
-- `file_path` (str): HTTP/HTTPS URL to document
-- `show_prev` (bool): Return preview only
-- `show_struct` (bool): Include structure analysis
-## Configuration
-### Environment Variables
-- `MAX_FILE_SIZE_MB` - Maximum file size limit (default: 50)
-- `CACHE_DIR` - Directory for cached results
-- `WORKERS` - Number of concurrent workers
-- `ENABLE_OCR` - Enable OCR processing by default
-### Processing Options
-- **AI Analysis**: Uses spaCy NLP models for advanced text analysis
-- **OCR**: Tesseract-based optical character recognition
-- **Caching**: Redis-compatible caching for improved performance
-## Dependencies
-### Core Requirements
 - `gradio>=4.0.0` - Web interface framework
 - `python-docx>=1.1.0` - Word document processing
-- `PyMuPDF>=1.23.0` - PDF processing
-- `python-pptx>=0.6.21` - PowerPoint processing
-- `openpyxl>=3.1.0` - Excel file processing
-### AI/ML Requirements
-- `spacy>=3.7.0` - Natural language processing
-- `pytesseract>=0.3.10` - OCR capabilities
-- `transformers>=4.30.0` - Advanced AI models
-### Optional Features
-- `matplotlib>=3.7.0` - Visualization capabilities
-- `pandas>=2.0.0` - Data processing
-- `scikit-learn>=1.3.0` - Machine learning features
-## Performance
-### Benchmarks
-- **Small files** (<1MB): ~2-5 seconds
-- **Medium files** (1-10MB): ~10-30 seconds
-- **Large files** (10-50MB): ~30-120 seconds
-- **Batch processing**: Linear scaling with concurrent workers
-### Memory Usage
-- **Base memory**: ~200MB
-- **Per document**: ~50-100MB additional
-- **OCR processing**: +200-500MB peak usage
 ## Contributing
 1. Fork the repository
-2. Create feature branch: `git checkout -b feature-name`
-3. Commit changes: `git commit -am 'Add feature'`
-4. Push to branch: `git push origin feature-name`
-5. Submit pull request
 ## License
@@ -220,10 +177,8 @@ MIT License - see LICENSE file for details.
 ## Support
-- **Issues**: Report bugs and feature requests on GitHub
-- **Documentation**: Full API documentation available
-- **Community**: Join discussions in the Community tab
 ---
-*Built with ❤️ using Gradio, spaCy, and various document processing libraries*

 ---
+title: Document to Markdown Converter
+emoji: 📄
 colorFrom: blue
 colorTo: purple
 sdk: gradio
   - document-processing
   - markdown
   - pdf-converter
+  - text-extraction
+short_description: Convert PDF and DOCX documents to Markdown format
 ---
+# 📄 Document to Markdown Converter
+Convert PDF and DOCX documents to Markdown format with intelligent structure analysis.
 ## Features
 ### 📄 Supported Formats
+- **PDF** - Extract text with formatting preservation
+- **Word Documents** (.docx) - Full formatting and structure conversion
+### 🧠 Smart Processing
+- **Heading Detection** - Automatically detect headings based on styles and formatting
+- **Table Extraction** - Convert tables to Markdown format
+- **List Processing** - Preserve ordered and unordered lists
+- **Inline Formatting** - Maintain bold, italic, and other text formatting
+- **Structure Analysis** - Detailed document structure statistics
+### ⚡ Key Capabilities
+- **Font-based Heading Detection** - Uses font size and styling to identify headings
+- **Style Recognition** - Recognizes Word document styles (Title, Heading 1-6)
+- **Table Conversion** - Converts complex tables to Markdown table format
+- **List Recognition** - Identifies and converts various list formats
+- **Text Formatting** - Preserves bold, italic formatting in Markdown syntax
 ## Usage
+### Basic Processing
+1. Upload a PDF or DOCX file
+2. Click "Convert to Markdown"
+3. View the converted Markdown in the output tab
+### Options
+- **Structure Analysis**: Enable to see detailed document statistics
+- **Preview Mode**: Show only the first 500 characters for quick preview
+### Output Tabs
+- **Markdown Output**: The complete converted Markdown text
+- **Structure Analysis**: Statistics about headings, lists, tables, etc.
+- **File Information**: Basic file details (name, type, size)
+## Technical Details
+### PDF Processing
+- Uses PyMuPDF (fitz) for text extraction
+- Analyzes font sizes to determine heading hierarchy
+- Preserves text formatting flags (bold, italic)
+- Processes text blocks while maintaining structure
+### DOCX Processing
+- Uses python-docx for document parsing
+- Recognizes built-in Word styles
+- Extracts tables with proper formatting
+- Maintains paragraph-level formatting
+### Structure Analysis
+The application analyzes:
+- **Headings**: Count by level (H1-H6)
+- **Lists**: Ordered vs unordered list items
+- **Tables**: Number of tables detected
+- **Paragraphs**: Regular text paragraphs
+- **Formatting**: Bold and italic text occurrences
+- **Statistics**: Word count, character count, total lines
 ## Installation
 ### Local Development
 ```bash
+# Clone the repository
+git clone https://huggingface.co/spaces/YOUR-USERNAME/document-to-markdown-converter
+cd document-to-markdown-converter
+# Install dependencies
+pip install -r requirements.txt
+# Run the application
+python app.py
 ```
+### Dependencies
 - `gradio>=4.0.0` - Web interface framework
 - `python-docx>=1.1.0` - Word document processing
+- `PyMuPDF>=1.23.0` - PDF processing library
+## API
+### Core Function
+```python
+def extract_document_to_markdown(file_path: str) -> Dict[str, Any]:
+    """
+    Extract document content and convert to Markdown format
+    Args:
+        file_path: Path to PDF or DOCX file
+    Returns:
+        Dictionary containing:
+        - success: Boolean indicating success
+        - markdown: Converted Markdown content
+        - structure: Document structure analysis
+        - file_info: File metadata (name, type, size)
+        - preview: Short preview of content
+        - error: Error message if processing failed
+    """
+```
+### Structure Analysis Output
+```json
+{
+  "headings": {"h1": 2, "h2": 5, "h3": 8, "h4": 0, "h5": 0, "h6": 0},
+  "lists": {"ordered": 3, "unordered": 7},
+  "tables": 2,
+  "paragraphs": 45,
+  "bold_text": 12,
+  "italic_text": 8,
+  "total_lines": 120,
+  "word_count": 2500,
+  "character_count": 15000
+}
+```
+## Examples
+### Converting a PDF
+1. Upload a PDF file
+2. The application will:
+   - Extract text from each page
+   - Detect headings based on font size
+   - Preserve bold/italic formatting
+   - Convert to clean Markdown
+### Converting a DOCX
+1. Upload a Word document
+2. The application will:
+   - Parse document styles
+   - Convert headings based on style names
+   - Extract and format tables
+   - Maintain list structures
+## Limitations
+- **OCR**: Does not perform OCR on image-based PDFs
+- **Complex Layouts**: May not perfectly preserve complex document layouts
+- **Images**: Does not extract or convert embedded images
+- **Fonts**: Limited font analysis for PDFs
 ## Contributing
 1. Fork the repository
+2. Create a feature branch
+3. Make your changes
+4. Test thoroughly
+5. Submit a pull request
 ## License
 ## Support
+For issues and feature requests, please use the Community tab or create an issue on GitHub.
 ---
+*Built with ❤️ using Gradio, python-docx, and PyMuPDF*
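
Note on the "Structure Analysis Output" section of the new README: the statistics it lists could be derived from the converted Markdown roughly as sketched below. This is a minimal illustration under stated assumptions, not the logic in app.py. The function name `analyze_markdown_structure` and the regular expressions are hypothetical, and "paragraphs" is simplified to a count of non-empty lines that are not headings, list items, or table rows.

```python
import re
from typing import Any, Dict


def analyze_markdown_structure(markdown: str) -> Dict[str, Any]:
    """Count headings, list items, tables, and paragraphs in Markdown text.

    Hypothetical helper: it mirrors the keys shown in the README's
    "Structure Analysis Output" example, not the actual app.py code.
    """
    stats: Dict[str, Any] = {
        "headings": {f"h{level}": 0 for level in range(1, 7)},
        "lists": {"ordered": 0, "unordered": 0},
        "tables": 0,
        "paragraphs": 0,
        # **bold** runs on a single line
        "bold_text": len(re.findall(r"\*\*[^*\n]+\*\*", markdown)),
        # *italic* runs that are not part of a ** marker
        "italic_text": len(re.findall(r"(?<!\*)\*[^*\n]+\*(?!\*)", markdown)),
        "total_lines": len(markdown.splitlines()),
        "word_count": len(markdown.split()),
        "character_count": len(markdown),
    }

    in_table = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("|") and stripped.endswith("|"):
            # Count a table once, on its first row.
            if not in_table:
                stats["tables"] += 1
            in_table = True
            continue
        in_table = False
        if not stripped:
            continue
        heading = re.match(r"(#{1,6})\s+\S", stripped)
        if heading:
            stats["headings"][f"h{len(heading.group(1))}"] += 1
        elif re.match(r"\d+[.)]\s+", stripped):
            stats["lists"]["ordered"] += 1
        elif re.match(r"[-*+]\s+", stripped):
            stats["lists"]["unordered"] += 1
        else:
            # Simplification: every remaining non-empty line is one paragraph.
            stats["paragraphs"] += 1
    return stats
```

Calling `analyze_markdown_structure(result["markdown"])` on the `markdown` field documented for `extract_document_to_markdown` would yield a dictionary with the same keys as the README example.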

__pycache__/app.cpython-313.pyc CHANGED

Binary files a/__pycache__/app.cpython-313.pyc and b/__pycache__/app.cpython-313.pyc differ
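
Note on the README's "PDF Processing" bullets (font sizes determine heading hierarchy, bold and italic flags are preserved): one way this could work on the dictionary returned by PyMuPDF's `page.get_text("dict")` is sketched below. The flag bits 16 (bold) and 2 (italic) are the ones used in the app.py diff. The helper names and the size-ratio thresholds are assumptions chosen for illustration and are not taken from the commit.

```python
from collections import Counter
from typing import Any, Dict, List

BOLD_FLAG = 16   # span flag bit for bold, as used in the app.py diff
ITALIC_FLAG = 2  # span flag bit for italic, as used in the app.py diff


def body_font_size(page_dict: Dict[str, Any]) -> float:
    """Return the font size that covers the most characters on the page."""
    sizes: Counter = Counter()
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                sizes[round(span.get("size", 12), 1)] += len(span.get("text", ""))
    return sizes.most_common(1)[0][0] if sizes else 12.0


def heading_level(size: float, body: float) -> int:
    """Map a font size to a heading level; 0 means body text.

    The ratio thresholds are illustrative assumptions.
    """
    if body <= 0:
        return 0
    ratio = size / body
    if ratio >= 1.8:
        return 1
    if ratio >= 1.4:
        return 2
    if ratio >= 1.15:
        return 3
    return 0


def page_to_markdown(page_dict: Dict[str, Any]) -> str:
    """Convert one page dictionary to Markdown lines."""
    body = body_font_size(page_dict)
    out: List[str] = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:
            continue
        for line in block.get("lines", []):
            plain: List[str] = []
            styled: List[str] = []
            max_size = 0.0
            for span in line.get("spans", []):
                text = span.get("text", "").strip()
                if not text:
                    continue
                max_size = max(max_size, span.get("size", 12))
                flags = span.get("flags", 0)
                bold = bool(flags & BOLD_FLAG)
                italic = bool(flags & ITALIC_FLAG)
                plain.append(text)
                if bold and italic:
                    styled.append(f"***{text}***")
                elif bold:
                    styled.append(f"**{text}**")
                elif italic:
                    styled.append(f"*{text}*")
                else:
                    styled.append(text)
            if not plain:
                continue
            level = heading_level(max_size, body)
            # Headings use the plain text so bold titles do not become "# **Title**".
            out.append("#" * level + " " + " ".join(plain) if level else " ".join(styled))
    return "\n".join(out)
```

With PyMuPDF installed, each page's dictionary can be passed in as `page_to_markdown(page.get_text("dict"))`. Choosing the body size by character count rather than by span count keeps a page with many short captions from skewing the baseline.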

app.py CHANGED

@@ -1,452 +1,40 @@
 import gradio as gr
 import re
 import os
-import io
-import json
-import hashlib
-import zipfile
-import tempfile
-from datetime import datetime
-from typing import Dict, Any, Optional, List, Tuple
 from pathlib import Path
-from concurrent.futures import ThreadPoolExecutor, as_completed
-import threading
-import time
-# Import dependencies with fallbacks
-DEPENDENCIES = {
-    "docx": {"available": False, "module": None},
-    "pdf": {"available": False, "module": None},
-    "pptx": {"available": False, "module": None},
-    "xlsx": {"available": False, "module": None},
-    "ocr": {"available": False, "module": None},
-    "nlp": {"available": False, "module": None},
-    "epub": {"available": False, "module": None},
-    "rtf": {"available": False, "module": None},
-}
-# Try importing all dependencies
 try:
     import docx
-    DEPENDENCIES["docx"] = {"available": True, "module": docx}
 except ImportError:
-    pass
 try:
     import fitz  # PyMuPDF
-    DEPENDENCIES["pdf"] = {"available": True, "module": fitz}
-except ImportError:
-    pass
-try:
-    from pptx import Presentation
-    DEPENDENCIES["pptx"] = {"available": True, "module": Presentation}
-except ImportError:
-    pass
-try:
-    import openpyxl
-    DEPENDENCIES["xlsx"] = {"available": True, "module": openpyxl}
-except ImportError:
-    pass
-try:
-    import pytesseract
-    from PIL import Image
-    DEPENDENCIES["ocr"] = {"available": True, "module": (pytesseract, Image)}
-except ImportError:
-    pass
-try:
-    import spacy
-    DEPENDENCIES["nlp"] = {"available": True, "module": spacy}
-except ImportError:
-    pass
-try:
-    import ebooklib
-    from ebooklib import epub
-    DEPENDENCIES["epub"] = {"available": True, "module": (ebooklib, epub)}
 except ImportError:
-    pass
-try:
-    from striprtf.striprtf import rtf_to_text
-    DEPENDENCIES["rtf"] = {"available": True, "module": rtf_to_text}
-except ImportError:
-    pass
-class ProgressTracker:
-    """Thread-safe progress tracking"""
-    def __init__(self):
-        self.current = 0
-        self.total = 100
-        self.status = "Ready"
-        self.lock = threading.Lock()
-    def update(self, current: int, total: int, status: str):
-        with self.lock:
-            self.current = current
-            self.total = total
-            self.status = status
-    def get_progress(self) -> Tuple[int, str]:
-        with self.lock:
-            progress = int((self.current / self.total) * 100) if self.total > 0 else 0
-            return progress, self.status
-class DocumentCache:
-    """Simple file-based cache for processed documents"""
-    def __init__(self, cache_dir: str = "/tmp/doc_cache"):
-        self.cache_dir = Path(cache_dir)
-        self.cache_dir.mkdir(exist_ok=True)
-    def _get_file_hash(self, file_path: str) -> str:
-        """Generate hash for file content"""
-        hasher = hashlib.md5()
-        with open(file_path, "rb") as f:
-            for chunk in iter(lambda: f.read(4096), b""):
-                hasher.update(chunk)
-        return hasher.hexdigest()
-    def get(self, file_path: str) -> Optional[Dict]:
-        """Get cached result if available"""
-        try:
-            file_hash = self._get_file_hash(file_path)
-            cache_file = self.cache_dir / f"{file_hash}.json"
-            if cache_file.exists():
-                with open(cache_file, "r", encoding="utf-8") as f:
-                    return json.load(f)
-        except Exception:
-            pass
-        return None
-    def set(self, file_path: str, result: Dict):
-        """Cache the result"""
-        try:
-            file_hash = self._get_file_hash(file_path)
-            cache_file = self.cache_dir / f"{file_hash}.json"
-            with open(cache_file, "w", encoding="utf-8") as f:
-                json.dump(result, f, ensure_ascii=False, indent=2)
-        except Exception:
-            pass
-class AIContentAnalyzer:
-    """AI-powered content analysis and structuring"""
-    def __init__(self):
-        self.nlp = None
-        if DEPENDENCIES["nlp"]["available"]:
-            try:
-                self.nlp = spacy.load("en_core_web_sm")
-            except OSError:
-                pass
-    def analyze_structure(self, text: str) -> Dict[str, Any]:
-        """Analyze document structure using NLP"""
-        if not self.nlp:
-            return self._basic_structure_analysis(text)
-        doc = self.nlp(text)
-        # Extract entities, topics, and structure
-        entities = [(ent.text, ent.label_) for ent in doc.ents]
-        sentences = [sent.text.strip() for sent in doc.sents]
-        # Identify potential headings based on sentence structure
-        potential_headings = []
-        for sent in sentences:
-            if (
-                len(sent.split()) <= 10
-                and sent[0].isupper()
-                and not sent.endswith(".")
-                and len(sent) > 5
-            ):
-                potential_headings.append(sent)
-        return {
-            "entities": entities[:10],  # Top 10 entities
-            "potential_headings": potential_headings[:20],
-            "sentence_count": len(sentences),
-            "avg_sentence_length": sum(len(s.split()) for s in sentences)
-            / len(sentences)
-            if sentences
-            else 0,
-            "topics": self._extract_topics(doc),
-        }
-    def _basic_structure_analysis(self, text: str) -> Dict[str, Any]:
-        """Basic structure analysis without NLP"""
-        lines = text.split("\n")
-        sentences = re.split(r"[.!?]+", text)
-        return {
-            "entities": [],
-            "potential_headings": [
-                line.strip()
-                for line in lines
-                if len(line.strip().split()) <= 10 and line.strip()
-            ],
-            "sentence_count": len([s for s in sentences if s.strip()]),
-            "avg_sentence_length": sum(len(s.split()) for s in sentences if s.strip())
-            / len(sentences)
-            if sentences
-            else 0,
-            "topics": [],
-        }
-    def _extract_topics(self, doc) -> List[str]:
-        """Extract main topics from document"""
-        # Simple topic extraction based on noun phrases
-        topics = []
-        for chunk in doc.noun_chunks:
-            if len(chunk.text.split()) <= 3 and chunk.text.lower() not in [
-                "the",
-                "a",
-                "an",
-            ]:
-                topics.append(chunk.text)
-        return list(set(topics))[:10]
-    def generate_summary(self, text: str, max_length: int = 200) -> str:
-        """Generate document summary"""
-        sentences = re.split(r"[.!?]+", text)
-        sentences = [s.strip() for s in sentences if s.strip() and len(s.split()) > 5]
-        if not sentences:
-            return "No content to summarize."
-        # Simple extractive summarization - take first few and some middle sentences
-        summary_sentences = []
-        if len(sentences) <= 3:
-            summary_sentences = sentences
-        else:
-            summary_sentences.append(sentences[0])  # First sentence
-            if len(sentences) > 2:
-                summary_sentences.append(
-                    sentences[len(sentences) // 2]
-                )  # Middle sentence
-            summary_sentences.append(sentences[-1])  # Last sentence
-        summary = " ".join(summary_sentences)
-        if len(summary) > max_length:
-            summary = summary[:max_length] + "..."
-        return summary
-class AdvancedDocumentConverter:
-    """Advanced document converter with AI features"""
     def __init__(self):
-        self.progress = ProgressTracker()
-        self.cache = DocumentCache()
-        self.ai_analyzer = AIContentAnalyzer()
-        self.supported_formats = {
-            ".pdf": self.extract_from_pdf,
-            ".docx": self.extract_from_docx,
-            ".pptx": self.extract_from_pptx,
-            ".xlsx": self.extract_from_xlsx,
-            ".txt": self.extract_from_txt,
-            ".md": self.extract_from_txt,
-            ".rtf": self.extract_from_rtf,
-            ".epub": self.extract_from_epub,
-        }
-    def process_document(
-        self, file_path: str, options: Dict[str, Any] = None
-    ) -> Dict[str, Any]:
-        """Main document processing function"""
-        if not options:
-            options = {}
-        # Check cache first
-        if options.get("use_cache", True):
-            cached_result = self.cache.get(file_path)
-            if cached_result:
-                return cached_result
-        self.progress.update(10, 100, "Starting processing...")
-        if not os.path.exists(file_path):
-            return {"error": "File not found", "markdown": "", "structure": {}}
-        file_extension = Path(file_path).suffix.lower()
-        if file_extension not in self.supported_formats:
-            return {
-                "error": f"Unsupported file type: {file_extension}",
-                "markdown": "",
-                "structure": {},
-            }
-        try:
-            self.progress.update(
-                30, 100, f"Extracting content from {file_extension} file..."
-            )
-            # Extract content using appropriate method
-            extractor = self.supported_formats[file_extension]
-            markdown_content = extractor(file_path)
-            self.progress.update(60, 100, "Analyzing document structure...")
-            # Enhanced structure analysis
-            structure = self._analyze_document_structure(markdown_content)
-            self.progress.update(80, 100, "Performing AI analysis...")
-            # AI-powered analysis
-            if options.get("enable_ai_analysis", True):
-                ai_analysis = self.ai_analyzer.analyze_structure(markdown_content)
-                structure["ai_analysis"] = ai_analysis
-                structure["summary"] = self.ai_analyzer.generate_summary(
-                    markdown_content
-                )
-            # Generate frontmatter
-            frontmatter = self._generate_frontmatter(file_path, structure, options)
-            # Final markdown with frontmatter
-            if options.get("include_frontmatter", True):
-                final_markdown = frontmatter + "\n\n" + markdown_content
-            else:
-                final_markdown = markdown_content
-            # Create table of contents
-            if options.get("generate_toc", False):
-                toc = self._generate_table_of_contents(markdown_content)
-                final_markdown = toc + "\n\n" + final_markdown
-            self.progress.update(100, 100, "Processing complete!")
-            result = {
-                "success": True,
-                "file_info": {
-                    "name": Path(file_path).name,
-                    "type": file_extension.upper()[1:],
-                    "size_kb": round(os.path.getsize(file_path) / 1024, 2),
-                    "processed_at": datetime.now().isoformat(),
-                },
-                "markdown": final_markdown,
-                "structure": structure,
-                "frontmatter": frontmatter,
-                "preview": final_markdown[:800] + "..."
-                if len(final_markdown) > 800
-                else final_markdown,
-            }
-            # Cache the result
-            if options.get("use_cache", True):
-                self.cache.set(file_path, result)
-            return result
-        except Exception as e:
-            return {
-                "error": f"Error processing file: {str(e)}",
-                "markdown": "",
-                "structure": {},
-            }
-    def process_multiple_documents(
-        self, file_paths: List[str], options: Dict[str, Any] = None
-    ) -> Dict[str, Any]:
-        """Process multiple documents concurrently"""
-        if not file_paths:
-            return {"error": "No files provided", "results": []}
-        results = []
-        total_files = len(file_paths)
-        with ThreadPoolExecutor(max_workers=3) as executor:
-            # Submit all tasks
-            future_to_file = {
-                executor.submit(self.process_document, file_path, options): file_path
-                for file_path in file_paths
-            }
-            # Process completed tasks
-            for i, future in enumerate(as_completed(future_to_file)):
-                file_path = future_to_file[future]
-                try:
-                    result = future.result()
-                    result["file_path"] = file_path
-                    results.append(result)
-                except Exception as e:
-                    results.append(
-                        {
-                            "error": f"Failed to process {file_path}: {str(e)}",
-                            "file_path": file_path,
-                        }
-                    )
-                # Update progress
-                self.progress.update(
-                    i + 1, total_files, f"Processed {i + 1}/{total_files} files"
-                )
-        # Generate combined document if requested
-        combined_markdown = ""
-        if options and options.get("combine_documents", False):
-            combined_markdown = self._combine_documents(results)
-        return {
-            "success": True,
-            "total_files": total_files,
-            "results": results,
-            "combined_markdown": combined_markdown,
-        }
-    def extract_from_pdf(self, pdf_path: str) -> str:
-        """Enhanced PDF extraction with OCR support"""
-        if not DEPENDENCIES["pdf"]["available"]:
-            raise ImportError("PyMuPDF not installed. Run: pip install PyMuPDF")
-        fitz = DEPENDENCIES["pdf"]["module"]
-        doc = fitz.open(pdf_path)
-        markdown_content = []
-        for page_num in range(len(doc)):
-            page = doc.load_page(page_num)
-            # Extract text blocks
-            blocks = page.get_text("dict")
-            page_markdown = self._convert_pdf_blocks_to_markdown(blocks)
-            # OCR on images if text extraction failed
-            if not page_markdown.strip() and DEPENDENCIES["ocr"]["available"]:
-                page_markdown = self._ocr_pdf_page(page)
-            if page_markdown.strip():
-                markdown_content.append(f"## Page {page_num + 1}\n\n{page_markdown}")
-        doc.close()
-        return "\n\n---\n\n".join(markdown_content)
     def extract_from_docx(self, docx_path: str) -> str:
-        """Enhanced DOCX extraction"""
-        if not DEPENDENCIES["docx"]["available"]:
-            raise ImportError("python-docx not installed. Run: pip install python-docx")
-        docx = DEPENDENCIES["docx"]["module"]
         doc = docx.Document(docx_path)
         markdown_content = []
-        # Process paragraphs with enhanced formatting
         for paragraph in doc.paragraphs:
             if paragraph.text.strip():
                 md_text = self._convert_paragraph_to_markdown(paragraph)
@@ -461,223 +49,47 @@ class AdvancedDocumentConverter:
         return "\n\n".join(markdown_content)
-    def extract_from_pptx(self, pptx_path: str) -> str:
-        """Extract content from PowerPoint presentations"""
-        if not DEPENDENCIES["pptx"]["available"]:
-            raise ImportError("python-pptx not installed. Run: pip install python-pptx")
-        Presentation = DEPENDENCIES["pptx"]["module"]
-        prs = Presentation(pptx_path)
         markdown_content = []
-        for i, slide in enumerate(prs.slides):
-            slide_content = [f"## Slide {i + 1}\n"]
-            for shape in slide.shapes:
-                if hasattr(shape, "text") and shape.text.strip():
-                    # Determine if it's a title or content
-                    if shape == slide.shapes.title:
-                        slide_content.append(f"### {shape.text.strip()}\n")
-                    else:
-                        slide_content.append(f"{shape.text.strip()}\n")
-            if len(slide_content) > 1:  # More than just the slide header
-                markdown_content.append("\n".join(slide_content))
         return "\n\n---\n\n".join(markdown_content)
-    def extract_from_xlsx(self, xlsx_path: str) -> str:
-        """Extract content from Excel files"""
-        if not DEPENDENCIES["xlsx"]["available"]:
-            raise ImportError("openpyxl not installed. Run: pip install openpyxl")
-        openpyxl = DEPENDENCIES["xlsx"]["module"]
-        workbook = openpyxl.load_workbook(xlsx_path, data_only=True)
-        markdown_content = []
-        for sheet_name in workbook.sheetnames:
-            sheet = workbook[sheet_name]
-            markdown_content.append(f"## {sheet_name}\n")
-            # Find the data range
-            max_row = sheet.max_row
-            max_col = sheet.max_column
-            if max_row > 0 and max_col > 0:
-                # Create markdown table
-                table_rows = []
-                for row in range(1, min(max_row + 1, 101)):  # Limit to 100 rows
-                    row_data = []
-                    for col in range(1, max_col + 1):
-                        cell_value = sheet.cell(row=row, column=col).value
-                        row_data.append(
-                            str(cell_value) if cell_value is not None else ""
-                        )
-                    if any(cell.strip() for cell in row_data):  # Skip empty rows
-                        table_rows.append("| " + " | ".join(row_data) + " |")
-                if table_rows:
-                    # Add header separator after first row
-                    if len(table_rows) > 1:
-                        separator = "| " + " | ".join(["---"] * max_col) + " |"
-                        table_rows.insert(1, separator)
-                    markdown_content.append("\n".join(table_rows))
-        return "\n\n".join(markdown_content)
-    def extract_from_txt(self, txt_path: str) -> str:
-        """Extract content from text files"""
-        try:
-            with open(txt_path, "r", encoding="utf-8") as f:
-                content = f.read()
-        except UnicodeDecodeError:
-            with open(txt_path, "r", encoding="latin-1") as f:
-                content = f.read()
-        # If it's already markdown, return as-is
-        if txt_path.endswith(".md"):
-            return content
-        # Convert plain text to markdown with basic formatting
-        lines = content.split("\n")
-        markdown_lines = []
-        for line in lines:
-            line = line.strip()
-            if not line:
-                markdown_lines.append("")
-                continue
-            # Check if line looks like a heading
-            if (
-                len(line.split()) <= 8
-                and (line.isupper() or line.istitle())
-                and not line.endswith(".")
-            ):
-                markdown_lines.append(f"## {line}")
-            else:
-                markdown_lines.append(line)
-        return "\n".join(markdown_lines)
-    def extract_from_rtf(self, rtf_path: str) -> str:
-        """Extract content from RTF files"""
-        if not DEPENDENCIES["rtf"]["available"]:
-            raise ImportError("striprtf not installed. Run: pip install striprtf")
-        rtf_to_text = DEPENDENCIES["rtf"]["module"]
-        with open(rtf_path, "r", encoding="utf-8") as f:
-            rtf_content = f.read()
-        plain_text = rtf_to_text(rtf_content)
-        return self.extract_from_txt_content(plain_text)
-    def extract_from_epub(self, epub_path: str) -> str:
-        """Extract content from EPUB files"""
-        if not DEPENDENCIES["epub"]["available"]:
-            raise ImportError("ebooklib not installed. Run: pip install ebooklib")
-        ebooklib, epub = DEPENDENCIES["epub"]["module"]
-        book = epub.read_epub(epub_path)
-        markdown_content = []
-        for item in book.get_items():
-            if item.get_type() == ebooklib.ITEM_DOCUMENT:
-                content = item.get_content().decode("utf-8")
-                # Basic HTML to markdown conversion
-                text = re.sub(r"<[^>]+>", "", content)  # Remove HTML tags
-                text = re.sub(r"\s+", " ", text).strip()  # Clean whitespace
-                if text:
-                    markdown_content.append(text)
-        return "\n\n".join(markdown_content)
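Review note on the removed EPUB path: stripping tags with `re.sub(r"<[^>]+>", "", content)` leaves HTML entities such as `&amp;` undecoded and keeps the text inside `<style>` and `<script>` elements. If EPUB support is reintroduced, a parser-based approach may be more robust. The following is a minimal sketch using only the standard library; the names `_TextExtractor` and `html_to_text` are hypothetical and do not appear in the commit.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and ignores <script>/<style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities like &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent block elements from running together,
    # at the cost of a possible extra space around inline tags.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
```

In the removed method this would stand in for the two `re.sub` calls, for example `text = html_to_text(content)`.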
-    def _ocr_pdf_page(self, page) -> str:
-        """Perform OCR on PDF page"""
-        if not DEPENDENCIES["ocr"]["available"]:
-            return ""
-        pytesseract, Image = DEPENDENCIES["ocr"]["module"]
-        try:
-            # Convert page to image
-            pix = page.get_pixmap()
-            img_data = pix.tobytes("png")
-            image = Image.open(io.BytesIO(img_data))
-            # Perform OCR
-            text = pytesseract.image_to_string(image, lang="eng")
-            return text.strip()
-        except Exception:
-            return ""
-    def _convert_pdf_blocks_to_markdown(self, blocks_dict: Dict) -> str:
-        """Enhanced PDF blocks to markdown conversion"""
-        markdown_lines = []
-        for block in blocks_dict.get("blocks", []):
-            if block.get("type") == 0:  # Text block
-                for line in block.get("lines", []):
-                    line_text = ""
-                    for span in line.get("spans", []):
-                        text = span.get("text", "").strip()
-                        if text:
-                            font_size = span.get("size", 12)
-                            flags = span.get("flags", 0)
-                            is_bold = bool(flags & 16)
-                            is_italic = bool(flags & 2)
-                            # Apply inline formatting
-                            if is_bold and is_italic:
-                                text = f"***{text}***"
-                            elif is_bold:
-                                text = f"**{text}**"
-                            elif is_italic:
-                                text = f"*{text}*"
-                            # Apply heading formatting based on font size
-                            if font_size >= 20:
-                                text = f"# {text}"
-                            elif font_size >= 18:
-                                text = f"## {text}"
-                            elif font_size >= 16:
-                                text = f"### {text}"
-                            elif font_size >= 14:
-                                text = f"#### {text}"
-                            line_text += text + " "
-                    if line_text.strip():
-                        markdown_lines.append(line_text.strip())
-        return "\n\n".join(markdown_lines)
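Review note on the PDF block conversion: the heading prefix is applied per span, so a large-font line made of two spans comes out as `# A # B`. The sketch below applies the prefix once per line, based on the largest span size. It works on plain dictionaries shaped like the entries of `page.get_text("dict")`, so it can be exercised without PyMuPDF installed. The helper names `heading_prefix` and `line_to_markdown` are hypothetical; the size thresholds and flag bits are the ones used in the diff.

```python
from typing import Any, Dict

# Bit values used in the diff for span["flags"]: 2 = italic, 16 = bold.
ITALIC_FLAG = 2
BOLD_FLAG = 16

# (minimum font size, heading prefix), checked from largest to smallest.
HEADING_SIZES = [(20, "# "), (18, "## "), (16, "### "), (14, "#### ")]


def heading_prefix(font_size: float) -> str:
    """Map a font size to a Markdown heading prefix, or '' for body text."""
    for minimum, prefix in HEADING_SIZES:
        if font_size >= minimum:
            return prefix
    return ""


def line_to_markdown(line: Dict[str, Any]) -> str:
    """Convert one text line (a dict with a 'spans' list) to Markdown."""
    parts = []
    largest = 0.0
    for span in line.get("spans", []):
        text = span.get("text", "").strip()
        if not text:
            continue
        flags = span.get("flags", 0)
        bold = bool(flags & BOLD_FLAG)
        italic = bool(flags & ITALIC_FLAG)
        if bold and italic:
            text = f"***{text}***"
        elif bold:
            text = f"**{text}**"
        elif italic:
            text = f"*{text}*"
        parts.append(text)
        largest = max(largest, span.get("size", 12))
    if not parts:
        return ""
    # One prefix for the whole line instead of one per span.
    return heading_prefix(largest) + " ".join(parts)
```

Whether the largest span or the first span should decide the heading level is a judgment call; the largest is used here only as an assumption.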
     def _convert_paragraph_to_markdown(self, paragraph) -> str:
-        """Enhanced paragraph to markdown conversion"""
         text = paragraph.text.strip()
         if not text:
             return ""
         style_name = paragraph.style.name if paragraph.style else "Normal"
-        # Enhanced formatting detection
         is_bold = any(run.bold for run in paragraph.runs if run.bold)
-        is_italic = any(run.italic for run in paragraph.runs if run.italic)
-        # Font size detection
         font_size = 12
         if paragraph.runs:
             first_run = paragraph.runs[0]
             if first_run.font.size:
                 font_size = first_run.font.size.pt
-        # Advanced heading detection
         if "Title" in style_name or (is_bold and font_size >= 18):
             return f"# {text}"
         elif "Heading 1" in style_name or (is_bold and font_size >= 16):
@@ -693,114 +105,130 @@ class AdvancedDocumentConverter:
         elif "Heading 6" in style_name:
             return f"###### {text}"
         elif re.match(r"^[\d\w]\.\s|^[•\-\*]\s|^\d+\)\s", text):
-            # Enhanced list detection
-            if re.match(r"^\d+\.", text):
-                return f"1. {text[text.find('.') + 1 :].strip()}"
             else:
-                return f"- {text[1:].strip() if text[0] in '•-*' else text}"
         else:
-            # Apply inline formatting
             formatted_text = self._apply_inline_formatting(paragraph)
             return formatted_text
     def _apply_inline_formatting(self, paragraph) -> str:
-        """Enhanced inline formatting application"""
         result = ""
         for run in paragraph.runs:
             text = run.text
-            # Apply multiple formatting
             if run.bold and run.italic:
                 text = f"***{text}***"
             elif run.bold:
                 text = f"**{text}**"
             elif run.italic:
                 text = f"*{text}*"
-            elif run.underline:
-                text = f"<u>{text}</u>"
             result += text
         return result
     def _convert_table_to_markdown(self, table) -> str:
-        """Enhanced table conversion with better formatting"""
         if not table.rows:
             return ""
         markdown_rows = []
         # Process header row
-        header_cells = []
-        for cell in table.rows[0].cells:
-            cell_text = cell.text.strip().replace("\n", " ")
-            header_cells.append(cell_text if cell_text else "Header")
         markdown_rows.append("| " + " | ".join(header_cells) + " |")
         markdown_rows.append("| " + " | ".join(["---"] * len(header_cells)) + " |")
         # Process data rows
         for row in table.rows[1:]:
-            cells = []
-            for cell in row.cells:
-                cell_text = cell.text.strip().replace("\n", " ")
-                cells.append(cell_text if cell_text else " ")
             markdown_rows.append("| " + " | ".join(cells) + " |")
         return "\n".join(markdown_rows)
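Review note on table conversion: neither the old nor the new header handling escapes a literal `|` inside a cell, and the new version no longer flattens newlines, so either character can break the rendered table. A minimal sketch of the escaping step follows, written against plain lists of strings so it does not depend on python-docx. `escape_cell` and `rows_to_markdown` are hypothetical names, not functions from the commit.

```python
from typing import List


def escape_cell(text: str) -> str:
    """Flatten whitespace (including newlines) and escape pipe characters."""
    flat = " ".join(text.split())
    return flat.replace("|", "\\|")


def rows_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""
    header = [escape_cell(cell) for cell in rows[0]]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * len(header)) + " |",
    ]
    for row in rows[1:]:
        lines.append("| " + " | ".join(escape_cell(cell) for cell in row) + " |")
    return "\n".join(lines)
```

With python-docx, the input could be built as `[[cell.text for cell in row.cells] for row in table.rows]`.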
-    def _analyze_document_structure(self, markdown_text: str) -> Dict[str, Any]:
-        """Enhanced document structure analysis"""
         lines = markdown_text.split("\n")
         structure = {
             "headings": {"h1": 0, "h2": 0, "h3": 0, "h4": 0, "h5": 0, "h6": 0},
             "lists": {"ordered": 0, "unordered": 0},
             "tables": 0,
             "paragraphs": 0,
-            "code_blocks": 0,
-            "links": 0,
-            "images": 0,
             "bold_text": 0,
             "italic_text": 0,
             "total_lines": len(lines),
             "word_count": len(markdown_text.split()),
             "character_count": len(markdown_text),
-            "reading_time_minutes": max(
-                1, len(markdown_text.split()) // 200
-            ),  # ~200 WPM
         }
         in_table = False
-        in_code_block = False
         for line in lines:
-            original_line = line
             line = line.strip()
             if not line:
                 continue
-            # Code blocks
-            if line.startswith("```"):
-                in_code_block = not in_code_block
-                if in_code_block:
-                    structure["code_blocks"] += 1
-                continue
-            if in_code_block:
-                continue
-            # Headings
             if line.startswith("#"):
                 level = len(line) - len(line.lstrip("#"))
                 if level <= 6:
                     structure["headings"][f"h{level}"] += 1
-            # Lists
             elif re.match(r"^\d+\.\s", line):
                 structure["lists"]["ordered"] += 1
             elif re.match(r"^[\-\*\+]\s", line):
                 structure["lists"]["unordered"] += 1
-            # Tables
             elif "|" in line and not in_table:
                 structure["tables"] += 1
                 in_table = True
@@ -813,579 +241,198 @@ class AdvancedDocumentConverter:
                 ):
                     structure["paragraphs"] += 1
-            # Links and images
-            structure["links"] += len(re.findall(r"\[([^\]]+)\]\([^)]+\)", line))
-            structure["images"] += len(re.findall(r"!\[([^\]]*)\]\([^)]+\)", line))
-            # Formatting
             structure["bold_text"] += len(re.findall(r"\*\*[^*]+\*\*", line))
             structure["italic_text"] += len(re.findall(r"\*[^*]+\*", line))
         return structure
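Review note on the emphasis counters: the italic pattern `\*[^*]+\*` also matches inside bold markers, so a line like `**bold** and *it*` is counted as more than one italic run. The sketch below adds a lookbehind and a lookahead so that only single-asterisk runs are counted. `count_emphasis` is a hypothetical helper, and the patterns still ignore underscore-style emphasis, as the original does.

```python
import re
from typing import Dict

BOLD_RE = re.compile(r"\*\*[^*]+\*\*")
# A single '*' that is not adjacent to another '*', on both sides of the run.
ITALIC_RE = re.compile(r"(?<!\*)\*[^*]+\*(?!\*)")


def count_emphasis(line: str) -> Dict[str, int]:
    """Count bold and italic runs in one line of Markdown."""
    return {
        "bold_text": len(BOLD_RE.findall(line)),
        "italic_text": len(ITALIC_RE.findall(line)),
    }
```

The returned keys match the `structure` dictionary, so the two counts could be added to it directly.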
-    def _generate_frontmatter(
-        self, file_path: str, structure: Dict, options: Dict
-    ) -> str:
-        """Generate YAML frontmatter for the document"""
-        frontmatter_data = {
-            "title": Path(file_path).stem.replace("_", " ").replace("-", " ").title(),
-            "created": datetime.now().strftime("%Y-%m-%d"),
-            "source_file": Path(file_path).name,
-            "file_type": Path(file_path).suffix[1:].upper(),
-            "word_count": structure.get("word_count", 0),
-            "reading_time": f"{structure.get('reading_time_minutes', 1)} min",
-            "headings": structure.get("headings", {}),
-            "has_tables": structure.get("tables", 0) > 0,
-            "has_images": structure.get("images", 0) > 0,
-        }
-        # Add AI analysis if available
-        if "ai_analysis" in structure:
-            ai_data = structure["ai_analysis"]
-            if ai_data.get("entities"):
-                frontmatter_data["entities"] = [
-                    entity[0] for entity in ai_data["entities"][:5]
-                ]
-            if ai_data.get("topics"):
-                frontmatter_data["topics"] = ai_data["topics"][:5]
-        # Add summary if available
-        if "summary" in structure:
-            frontmatter_data["summary"] = structure["summary"]
-        # Convert to YAML
-        yaml_lines = ["---"]
-        for key, value in frontmatter_data.items():
-            if isinstance(value, dict):
-                yaml_lines.append(f"{key}:")
-                for subkey, subvalue in value.items():
-                    yaml_lines.append(f"  {subkey}: {subvalue}")
-            elif isinstance(value, list):
-                yaml_lines.append(f"{key}:")
-                for item in value:
-                    yaml_lines.append(f"  - {item}")
-            else:
-                yaml_lines.append(f"{key}: {value}")
-        yaml_lines.append("---")
-        return "\n".join(yaml_lines)
-    def _generate_table_of_contents(self, markdown_text: str) -> str:
-        """Generate table of contents from headings"""
-        toc_lines = ["## Table of Contents\n"]
-        lines = markdown_text.split("\n")
-        for line in lines:
-            line = line.strip()
-            if line.startswith("#"):
-                # Extract heading level and text
-                level = len(line) - len(line.lstrip("#"))
-                heading_text = line.lstrip("#").strip()
-                if level <= 4 and heading_text:  # Only include up to h4
-                    # Create anchor link
-                    anchor = (
-                        heading_text.lower().replace(" ", "-").replace("[^a-z0-9-]", "")
-                    )
-                    indent = "  " * (level - 1)
-                    toc_lines.append(f"{indent}- [{heading_text}](#{anchor})")
-        return "\n".join(toc_lines)
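Review note on the removed table-of-contents helper: `str.replace("[^a-z0-9-]", "")` treats its first argument as a literal string, not a regular expression, so punctuation was never stripped from the anchors. If the feature returns, `re.sub` is the call that does what the code appears to intend. The sketch below is an approximation of common heading-anchor rules, not the exact algorithm of any particular renderer; `heading_anchor` is a hypothetical name.

```python
import re


def heading_anchor(heading_text: str) -> str:
    """Build a lowercase, hyphen-separated anchor from heading text."""
    slug = heading_text.strip().lower().replace(" ", "-")
    # re.sub interprets the pattern as a regex; str.replace would not.
    return re.sub(r"[^a-z0-9-]", "", slug)
```

A table-of-contents entry would then read `f"{indent}- [{heading_text}](#{heading_anchor(heading_text)})"`.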
-    def _combine_documents(self, results: List[Dict]) -> str:
-        """Combine multiple documents into one"""
-        combined_parts = []
-        for i, result in enumerate(results):
-            if result.get("success") and result.get("markdown"):
-                file_name = result.get("file_info", {}).get("name", f"Document {i + 1}")
-                combined_parts.append(f"# {file_name}\n\n{result['markdown']}")
-        return "\n\n---\n\n".join(combined_parts)
-class EnhancedGradioInterface:
-    """Enhanced Gradio interface with advanced features"""
-    def __init__(self):
-        self.converter = AdvancedDocumentConverter()
-        self.processing_queue = []
-    def create_interface(self):
-        """Create the enhanced Gradio interface"""
-        # Custom CSS for better styling
-        custom_css = """
-        .container { max-width: 1200px; margin: auto; }
-        .upload-area { border: 2px dashed #ccc; border-radius: 10px; padding: 20px; text-align: center; }
-        .progress-bar { background: linear-gradient(90deg, #4CAF50, #45a049); }
-        .feature-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(250px, 1fr)); gap: 15px; }
-        .dependency-status { padding: 10px; border-radius: 5px; margin: 5px 0; }
-        .available { background-color: #d4edda; color: #155724; }
-        .unavailable { background-color: #f8d7da; color: #721c24; }
-        """
-        with gr.Blocks(
-            title="🚀 Advanced Document to Markdown Converter",
-            css=custom_css,
-            theme=gr.themes.Soft(),
-        ) as demo:
-            # Header
-            gr.Markdown("""
-            # 🚀 Advanced Document to Markdown Converter
-            **Convert any document to Markdown with AI-powered analysis and advanced features**
-            Supports: PDF, DOCX, PPTX, XLSX, TXT, MD, RTF, EPUB + OCR for images
-            """)
-            # Dependency status
-            self._create_dependency_status()
-            with gr.Tabs():
-                # Single Document Tab
-                with gr.TabItem("📄 Single Document"):
-                    self._create_single_document_tab()
-                # Batch Processing Tab
-                with gr.TabItem("📚 Batch Processing"):
-                    self._create_batch_processing_tab()
-                # Settings Tab
-                with gr.TabItem("⚙️ Settings"):
-                    self._create_settings_tab()
-                # Export Tab
-                with gr.TabItem("💾 Export"):
-                    self._create_export_tab()
-        return demo
-    def _create_dependency_status(self):
-        """Create dependency status display"""
-        with gr.Accordion("📋 System Status", open=False):
-            status_html = "<div class='feature-grid'>"
-            for dep_name, dep_info in DEPENDENCIES.items():
-                status_class = "available" if dep_info["available"] else "unavailable"
-                status_icon = "✅" if dep_info["available"] else "❌"
-                feature_map = {
-                    "docx": "Word Documents (.docx)",
-                    "pdf": "PDF Documents (.pdf)",
-                    "pptx": "PowerPoint (.pptx)",
-                    "xlsx": "Excel Files (.xlsx)",
-                    "ocr": "OCR (Image Text Extraction)",
-                    "nlp": "AI Text Analysis",
-                    "epub": "E-books (.epub)",
-                    "rtf": "Rich Text Format (.rtf)",
-                }
-                feature_name = feature_map.get(dep_name, dep_name.upper())
-                status_html += f"<div class='dependency-status {status_class}'>{status_icon} {feature_name}</div>"
-            status_html += "</div>"
-            gr.HTML(status_html)
-    def _create_single_document_tab(self):
-        """Create single document processing tab"""
         with gr.Row():
             with gr.Column(scale=1):
                 file_input = gr.File(
                     label="📎 Upload Document",
-                    file_types=[
-                        ".pdf",
-                        ".docx",
-                        ".pptx",
-                        ".xlsx",
-                        ".txt",
-                        ".md",
-                        ".rtf",
-                        ".epub",
-                    ],
                     type="filepath",
                 )
-                with gr.Accordion("🎛️ Processing Options", open=True):
-                    enable_ai = gr.Checkbox(label="🧠 Enable AI Analysis", value=True)
-                    include_frontmatter = gr.Checkbox(
-                        label="📋 Include Frontmatter", value=True
-                    )
-                    generate_toc = gr.Checkbox(
-                        label="📑 Generate Table of Contents", value=False
-                    )
-                    use_cache = gr.Checkbox(label="⚡ Use Cache", value=True)
-                process_btn = gr.Button(
-                    "🚀 Process Document", variant="primary", size="lg"
                 )
-                # Progress display
-                progress_bar = gr.Progress()
-                status_text = gr.Textbox(label="📊 Status", interactive=False)
             with gr.Column(scale=2):
                 with gr.Tabs():
                     with gr.TabItem("📝 Markdown Output"):
                         markdown_output = gr.Textbox(
                             label="Generated Markdown",
-                            lines=25,
-                            max_lines=50,
                             show_copy_button=True,
-                            placeholder="Processed markdown will appear here...",
                         )
-                    with gr.TabItem("🔍 Structure Analysis"):
                         structure_output = gr.JSON(label="Document Structure")
-                    with gr.TabItem("🧠 AI Analysis"):
-                        ai_analysis_output = gr.JSON(label="AI-Powered Analysis")
-                    with gr.TabItem("ℹ️ File Info"):
-                        file_info_output = gr.JSON(label="File Information")
-                    with gr.TabItem("📋 Frontmatter"):
-                        frontmatter_output = gr.Textbox(
-                            label="Generated Frontmatter",
-                            lines=15,
-                            show_copy_button=True,
-                        )
-        # Event handlers
-        def process_single_document(file_path, ai_enabled, frontmatter, toc, cache):
             if not file_path:
-                return "No file uploaded", {}, {}, {}, ""
-            options = {
-                "enable_ai_analysis": ai_enabled,
-                "include_frontmatter": frontmatter,
-                "generate_toc": toc,
-                "use_cache": cache,
-            }
-            result = self.converter.process_document(file_path, options)
             if "error" in result:
-                return f"❌ Error: {result['error']}", {}, {}, {}, ""
-            ai_analysis = result["structure"].get("ai_analysis", {})
-            return (
-                result["markdown"],
-                result["structure"],
-                ai_analysis,
-                result["file_info"],
-                result.get("frontmatter", ""),
-            )
-        process_btn.click(
-            fn=process_single_document,
-            inputs=[
-                file_input,
-                enable_ai,
-                include_frontmatter,
-                generate_toc,
-                use_cache,
-            ],
-            outputs=[
-                markdown_output,
-                structure_output,
-                ai_analysis_output,
-                file_info_output,
-                frontmatter_output,
-            ],
-        )
-    def _create_batch_processing_tab(self):
-        """Create batch processing tab"""
-        with gr.Row():
-            with gr.Column(scale=1):
-                batch_files = gr.File(
-                    label="📚 Upload Multiple Documents",
-                    file_count="multiple",
-                    file_types=[
-                        ".pdf",
-                        ".docx",
-                        ".pptx",
-                        ".xlsx",
-                        ".txt",
-                        ".md",
-                        ".rtf",
-                        ".epub",
-                    ],
-                    type="filepath",
-                )
-                with gr.Accordion("🎛️ Batch Options", open=True):
-                    combine_docs = gr.Checkbox(
-                        label="🔗 Combine into Single Document", value=False
-                    )
-                    batch_ai = gr.Checkbox(label="🧠 Enable AI Analysis", value=True)
-                    batch_frontmatter = gr.Checkbox(
-                        label="📋 Include Frontmatter", value=True
-                    )
-                    max_workers = gr.Slider(
-                        label="⚡ Concurrent Workers",
-                        minimum=1,
-                        maximum=5,
-                        value=3,
-                        step=1,
-                    )
-                batch_process_btn = gr.Button(
-                    "🚀 Process All Documents", variant="primary", size="lg"
-                )
-                # Batch progress
-                batch_progress = gr.Progress()
-                batch_status = gr.Textbox(label="📊 Batch Status", interactive=False)
-            with gr.Column(scale=2):
-                with gr.Tabs():
-                    with gr.TabItem("📋 Batch Results"):
-                        batch_results = gr.JSON(label="Processing Results")
-                    with gr.TabItem("📄 Combined Document"):
-                        combined_output = gr.Textbox(
-                            label="Combined Markdown",
-                            lines=25,
-                            show_copy_button=True,
-                            placeholder="Combined document will appear here if enabled...",
-                        )
-                    with gr.TabItem("📊 Batch Statistics"):
-                        batch_stats = gr.JSON(label="Batch Processing Statistics")
-        def process_batch_documents(
-            file_paths, combine, ai_enabled, frontmatter, workers
-        ):
-            if not file_paths:
-                return "No files uploaded", "", {}
-            options = {
-                "enable_ai_analysis": ai_enabled,
-                "include_frontmatter": frontmatter,
-                "combine_documents": combine,
-            }
-            result = self.converter.process_multiple_documents(file_paths, options)
-            # Generate statistics
-            stats = {
-                "total_files": result["total_files"],
-                "successful": len([r for r in result["results"] if r.get("success")]),
-                "failed": len([r for r in result["results"] if "error" in r]),
-                "total_words": sum(
-                    r.get("structure", {}).get("word_count", 0)
-                    for r in result["results"]
-                    if r.get("success")
-                ),
-                "processing_time": "N/A",  # Would need timing implementation
-            }
-            return result["results"], result.get("combined_markdown", ""), stats
-        batch_process_btn.click(
-            fn=process_batch_documents,
-            inputs=[
-                batch_files,
-                combine_docs,
-                batch_ai,
-                batch_frontmatter,
-                max_workers,
-            ],
-            outputs=[batch_results, combined_output, batch_stats],
-        )
-    def _create_settings_tab(self):
-        """Create settings and configuration tab"""
-        with gr.Column():
-            gr.Markdown("## ⚙️ Global Settings")
-            with gr.Row():
-                with gr.Column():
-                    gr.Markdown("### 🎨 Output Formatting")
-                    markdown_style = gr.Dropdown(
-                        label="Markdown Style",
-                        choices=["Standard", "GitHub Flavored", "CommonMark", "Pandoc"],
-                        value="GitHub Flavored",
-                    )
-                    heading_style = gr.Dropdown(
-                        label="Heading Style",
-                        choices=["ATX (# Header)", "Setext (Header\\n=====)"],
-                        value="ATX (# Header)",
-                    )
-                    line_break_style = gr.Dropdown(
-                        label="Line Break Style",
-                        choices=["Two Spaces", "Backslash"],
-                        value="Two Spaces",
-                    )
-                with gr.Column():
-                    gr.Markdown("### 🧠 AI Settings")
-                    ai_model = gr.Dropdown(
-                        label="NLP Model",
-                        choices=["en_core_web_sm", "en_core_web_md", "en_core_web_lg"],
-                        value="en_core_web_sm",
-                    )
-                    summary_length = gr.Slider(
-                        label="Summary Max Length",
-                        minimum=50,
-                        maximum=500,
-                        value=200,
-                        step=50,
-                    )
-                    max_topics = gr.Slider(
-                        label="Max Topics to Extract",
-                        minimum=5,
-                        maximum=20,
-                        value=10,
-                        step=1,
-                    )
-            with gr.Row():
-                with gr.Column():
-                    gr.Markdown("### 🔧 Processing Settings")
-                    cache_enabled = gr.Checkbox(label="Enable Global Cache", value=True)
-                    ocr_enabled = gr.Checkbox(label="Enable OCR by Default", value=True)
-                    preserve_formatting = gr.Checkbox(
-                        label="Preserve Original Formatting", value=True
-                    )
-                    max_file_size = gr.Slider(
-                        label="Max File Size (MB)",
-                        minimum=1,
-                        maximum=100,
-                        value=50,
-                        step=1,
-                    )
-                with gr.Column():
-                    gr.Markdown("### 📊 Performance")
-                    clear_cache_btn = gr.Button("🗑️ Clear Cache", variant="secondary")
-                    cache_info = gr.JSON(label="Cache Information")
-                    system_info = gr.JSON(
-                        label="System Information",
-                        value={
-                            "supported_formats": list(
-                                self.converter.supported_formats.keys()
-                            ),
-                            "available_features": [
-                                k for k, v in DEPENDENCIES.items() if v["available"]
-                            ],
-                            "missing_features": [
-                                k for k, v in DEPENDENCIES.items() if not v["available"]
-                            ],
-                        },
-                    )
-        def clear_cache():
-            # Implementation would clear the cache directory
-            return {"status": "Cache cleared", "timestamp": datetime.now().isoformat()}
-        clear_cache_btn.click(fn=clear_cache, outputs=[cache_info])
-    def _create_export_tab(self):
-        """Create export and download tab"""
-        with gr.Column():
-            gr.Markdown("## 💾 Export Options")
-            with gr.Row():
-                with gr.Column():
-                    gr.Markdown("### 📤 Export Formats")
-                    export_format = gr.Dropdown(
-                        label="Export Format",
-                        choices=[
-                            "Markdown (.md)",
-                            "HTML (.html)",
-                            "PDF (.pdf)",
-                            "ZIP Archive",
-                        ],
-                        value="Markdown (.md)",
-                    )
-                    include_metadata = gr.Checkbox(label="Include Metadata", value=True)
-                    include_css = gr.Checkbox(
-                        label="Include CSS (for HTML)", value=True
-                    )
-                    custom_css = gr.Textbox(
-                        label="Custom CSS",
-                        lines=10,
-                        placeholder="/* Custom CSS for HTML export */",
-                        visible=False,
-                    )
-                with gr.Column():
-                    gr.Markdown("### 📋 Export Templates")
-                    template_choice = gr.Dropdown(
-                        label="Document Template",
-                        choices=[
-                            "Default",
-                            "Academic Paper",
-                            "Technical Documentation",
-                            "Blog Post",
-                            "README",
-                        ],
-                        value="Default",
-                    )
-                    custom_header = gr.Textbox(
-                        label="Custom Header",
-                        lines=3,
-                        placeholder="Custom header to prepend to document",
-                    )
-                    custom_footer = gr.Textbox(
-                        label="Custom Footer",
-                        lines=3,
-                        placeholder="Custom footer to append to document",
-                    )
-            with gr.Row():
-                export_btn = gr.Button(
-                    "📦 Generate Export", variant="primary", size="lg"
-                )
-                download_btn = gr.File(label="📥 Download Export", interactive=False)
-            export_status = gr.Textbox(label="Export Status", interactive=False)
-        def update_css_visibility(format_choice):
-            return gr.update(visible="HTML" in format_choice)
-        export_format.change(
-            fn=update_css_visibility, inputs=[export_format], outputs=[custom_css]
         )
-# Create and launch the application
-def main():
-    """Main application entry point"""
-    interface = EnhancedGradioInterface()
-    demo = interface.create_interface()
-    # Launch with MCP server enabled
-    demo.launch(
-        mcp_server=True,
-        server_name="0.0.0.0",
-        server_port=7860,
-        share=True,
-        show_api=True,
-        show_error=True,
-    )
 if __name__ == "__main__":
-    main()

 import gradio as gr
 import re
+from typing import Dict, Any
 import os
 from pathlib import Path
+# Import dependencies for PDF and DOCX processing
 try:
     import docx
+    DOCX_AVAILABLE = True
 except ImportError:
+    DOCX_AVAILABLE = False
 try:
     import fitz  # PyMuPDF
+    PDF_AVAILABLE = True
 except ImportError:
+    PDF_AVAILABLE = False
+class DocumentToMarkdownConverter:
+    """Simple document to markdown converter"""
     def __init__(self):
+        pass
     def extract_from_docx(self, docx_path: str) -> str:
+        """Extract content from DOCX and convert to Markdown"""
+        if not DOCX_AVAILABLE:
+            raise ImportError("python-docx not installed")
         doc = docx.Document(docx_path)
         markdown_content = []
+        # Process paragraphs
         for paragraph in doc.paragraphs:
             if paragraph.text.strip():
                 md_text = self._convert_paragraph_to_markdown(paragraph)
         return "\n\n".join(markdown_content)
+    def extract_from_pdf(self, pdf_path: str) -> str:
+        """Extract content from PDF and convert to Markdown"""
+        if not PDF_AVAILABLE:
+            raise ImportError("PyMuPDF not installed")
+        doc = fitz.open(pdf_path)
         markdown_content = []
+        for page_num in range(len(doc)):
+            page = doc.load_page(page_num)
+            # Extract text blocks with formatting
+            blocks = page.get_text("dict")
+            page_markdown = self._convert_pdf_blocks_to_markdown(blocks)
+            if page_markdown.strip():
+                page_header = f"## Page {page_num + 1}"
+                markdown_content.append(page_header + "\n\n" + page_markdown)
+        doc.close()
         return "\n\n---\n\n".join(markdown_content)
     def _convert_paragraph_to_markdown(self, paragraph) -> str:
+        """Convert DOCX paragraph to Markdown"""
         text = paragraph.text.strip()
         if not text:
             return ""
         style_name = paragraph.style.name if paragraph.style else "Normal"
+        # Check if paragraph has bold formatting
         is_bold = any(run.bold for run in paragraph.runs if run.bold)
+        # Check font size for heading detection
         font_size = 12
         if paragraph.runs:
             first_run = paragraph.runs[0]
             if first_run.font.size:
                 font_size = first_run.font.size.pt
+        # Convert based on style and formatting
         if "Title" in style_name or (is_bold and font_size >= 18):
             return f"# {text}"
         elif "Heading 1" in style_name or (is_bold and font_size >= 16):
         elif "Heading 6" in style_name:
             return f"###### {text}"
         elif re.match(r"^(?:\d+|\w)\.\s|^[•\-\*]\s|^\d+\)\s", text):
+            # List items: "N." or "N)" markers with any number of digits are ordered
+            ordered_match = re.match(r"^\d+[.)]\s+", text)
+            if ordered_match:
+                return f"1. {text[ordered_match.end():].strip()}"
             else:
+                char_to_check = text[0] if text else ""
+                if char_to_check in "•-*":
+                    return f"- {text[1:].strip()}"
+                else:
+                    return f"- {text}"
         else:
+            # Regular paragraph
             formatted_text = self._apply_inline_formatting(paragraph)
             return formatted_text
     def _apply_inline_formatting(self, paragraph) -> str:
+        """Apply inline formatting (bold, italic) to text"""
         result = ""
         for run in paragraph.runs:
             text = run.text
             if run.bold and run.italic:
                 text = f"***{text}***"
             elif run.bold:
                 text = f"**{text}**"
             elif run.italic:
                 text = f"*{text}*"
             result += text
         return result
     def _convert_table_to_markdown(self, table) -> str:
+        """Convert DOCX table to Markdown table"""
         if not table.rows:
             return ""
         markdown_rows = []
         # Process header row
+        header_cells = [cell.text.strip() for cell in table.rows[0].cells]
         markdown_rows.append("| " + " | ".join(header_cells) + " |")
         markdown_rows.append("| " + " | ".join(["---"] * len(header_cells)) + " |")
         # Process data rows
         for row in table.rows[1:]:
+            cells = [cell.text.strip() for cell in row.cells]
             markdown_rows.append("| " + " | ".join(cells) + " |")
         return "\n".join(markdown_rows)
+    def _convert_pdf_blocks_to_markdown(self, blocks_dict) -> str:
+        """Convert PDF text blocks to Markdown"""
+        markdown_lines = []
+        for block in blocks_dict.get("blocks", []):
+            if block.get("type") == 0:  # Text block
+                for line in block.get("lines", []):
+                    line_text = ""
+                    for span in line.get("spans", []):
+                        text = span.get("text", "").strip()
+                        if text:
+                            # Check formatting
+                            font_size = span.get("size", 12)
+                            flags = span.get("flags", 0)
+                            # Bold = flags & 16, Italic = flags & 2
+                            is_bold = bool(flags & 16)
+                            is_italic = bool(flags & 2)
+                            # Apply formatting
+                            if is_bold and is_italic:
+                                text = f"***{text}***"
+                            elif is_bold:
+                                text = f"**{text}**"
+                            elif is_italic:
+                                text = f"*{text}*"
+                            # Check if it's a heading based on font size
+                            if font_size >= 18:
+                                text = f"# {text}"
+                            elif font_size >= 16:
+                                text = f"## {text}"
+                            elif font_size >= 14:
+                                text = f"### {text}"
+                            line_text += text + " "
+                    if line_text.strip():
+                        markdown_lines.append(line_text.strip())
+        return "\n\n".join(markdown_lines)
+    def analyze_markdown_structure(self, markdown_text: str) -> Dict[str, Any]:
+        """Analyze the structure of extracted Markdown"""
         lines = markdown_text.split("\n")
         structure = {
             "headings": {"h1": 0, "h2": 0, "h3": 0, "h4": 0, "h5": 0, "h6": 0},
             "lists": {"ordered": 0, "unordered": 0},
             "tables": 0,
             "paragraphs": 0,
             "bold_text": 0,
             "italic_text": 0,
             "total_lines": len(lines),
             "word_count": len(markdown_text.split()),
             "character_count": len(markdown_text),
         }
         in_table = False
         for line in lines:
             line = line.strip()
             if not line:
                 continue
+            # Count headings
             if line.startswith("#"):
                 level = len(line) - len(line.lstrip("#"))
                 if level <= 6:
                     structure["headings"][f"h{level}"] += 1
+            # Count lists
             elif re.match(r"^\d+\.\s", line):
                 structure["lists"]["ordered"] += 1
             elif re.match(r"^[\-\*\+]\s", line):
                 structure["lists"]["unordered"] += 1
+            # Count tables
             elif "|" in line and not in_table:
                 structure["tables"] += 1
                 in_table = True
                 ):
                     structure["paragraphs"] += 1
+            # Count formatting
             structure["bold_text"] += len(re.findall(r"\*\*[^*]+\*\*", line))
             # Lookarounds keep the "**" of bold spans from being counted as italic
             structure["italic_text"] += len(
                 re.findall(r"(?<!\*)\*(?!\*)[^*]+\*(?!\*)", line)
             )
         return structure
+def extract_document_to_markdown(file_path: str) -> Dict[str, Any]:
+    """
+    Extract document content and convert to Markdown format
+    Args:
+        file_path: Path to PDF or DOCX file
+    Returns:
+        Dictionary containing markdown content and structure analysis
+    """
+    if not file_path or not os.path.exists(file_path):
+        return {"error": "File not found", "markdown": "", "structure": {}}
+    converter = DocumentToMarkdownConverter()
+    file_extension = Path(file_path).suffix.lower()
+    try:
+        if file_extension == ".docx":
+            if not DOCX_AVAILABLE:
+                return {
+                    "error": "python-docx not installed. Run: pip install python-docx",
+                    "markdown": "",
+                    "structure": {},
+                }
+            markdown_content = converter.extract_from_docx(file_path)
+        elif file_extension == ".pdf":
+            if not PDF_AVAILABLE:
+                return {
+                    "error": "PyMuPDF not installed. Run: pip install PyMuPDF",
+                    "markdown": "",
+                    "structure": {},
+                }
+            markdown_content = converter.extract_from_pdf(file_path)
+        else:
+            return {
+                "error": f"Unsupported file type: {file_extension}. Only PDF and DOCX files are supported.",
+                "markdown": "",
+                "structure": {},
+            }
+        # Analyze markdown structure
+        structure = converter.analyze_markdown_structure(markdown_content)
+        return {
+            "success": True,
+            "file_info": {
+                "name": Path(file_path).name,
+                "type": file_extension.upper()[1:],
+                "size_kb": round(os.path.getsize(file_path) / 1024, 2),
+            },
+            "markdown": markdown_content,
+            "structure": structure,
+            "preview": markdown_content[:500] + "..."
+            if len(markdown_content) > 500
+            else markdown_content,
+        }
+    except Exception as e:
+        return {
+            "error": f"Error processing file: {str(e)}",
+            "markdown": "",
+            "structure": {},
+        }
+def create_interface():
+    """Create the main Gradio interface"""
+    with gr.Blocks(
+        title="Document to Markdown Converter", theme=gr.themes.Soft()
+    ) as demo:
+        gr.Markdown("""
+        # 📄 Document to Markdown Converter
+        Convert PDF and DOCX files to Markdown format with structure analysis.
+        **Supported formats:** PDF (.pdf), Word Documents (.docx)
+        """)
+        # Show dependency status
+        missing_deps = []
+        if not DOCX_AVAILABLE:
+            missing_deps.append("python-docx")
+        if not PDF_AVAILABLE:
+            missing_deps.append("PyMuPDF")
+        if missing_deps:
+            gr.Markdown(
+                f"⚠️ **Missing dependencies**: Some features may be limited. Missing: {', '.join(missing_deps)}"
+            )
+        else:
+            gr.Markdown("✅ **All dependencies available**: Full functionality enabled")
         with gr.Row():
             with gr.Column(scale=1):
+                # File upload
                 file_input = gr.File(
                     label="📎 Upload Document",
+                    file_types=[".pdf", ".docx"],
                     type="filepath",
                 )
+                # Process button
+                extract_btn = gr.Button(
+                    "🔄 Convert to Markdown", variant="primary", size="lg"
                 )
+                # Options
+                with gr.Accordion("⚙️ Options", open=False):
+                    show_structure = gr.Checkbox(
+                        label="📊 Show Structure Analysis", value=True
+                    )
+                    show_preview = gr.Checkbox(
+                        label="👁️ Show Preview Only (first 500 chars)", value=False
+                    )
             with gr.Column(scale=2):
+                # Output tabs
                 with gr.Tabs():
                     with gr.TabItem("📝 Markdown Output"):
                         markdown_output = gr.Textbox(
                             label="Generated Markdown",
+                            lines=20,
+                            max_lines=40,
                             show_copy_button=True,
+                            placeholder="Converted markdown will appear here...",
                         )
+                    with gr.TabItem("📊 Structure Analysis"):
                         structure_output = gr.JSON(label="Document Structure")
+                    with gr.TabItem("ℹ️ File Information"):
+                        info_output = gr.JSON(label="File Details")
+        # Event handler
+        def process_document(file_path, show_struct, show_prev):
+            """Process uploaded document"""
             if not file_path:
+                return "No file uploaded", {}, {}
+            result = extract_document_to_markdown(file_path)
             if "error" in result:
+                return f"❌ Error: {result['error']}", {}, {}
+            # Determine what to show
+            markdown_text = result["preview"] if show_prev else result["markdown"]
+            structure = result["structure"] if show_struct else {}
+            file_info = result["file_info"]
+            return markdown_text, structure, file_info
+        # Connect the button
+        extract_btn.click(
+            fn=process_document,
+            inputs=[file_input, show_structure, show_preview],
+            outputs=[markdown_output, structure_output, info_output],
         )
+        # Examples section
+        gr.Markdown("""
+        ## 📖 Usage Examples
+        1. **Upload a PDF or DOCX file** using the file uploader above
+        2. **Click "Convert to Markdown"** to process the document
+        3. **View results** in the tabs:
+           - **Markdown Output**: The converted markdown text
+           - **Structure Analysis**: Document statistics and structure
+           - **File Information**: Basic file details
+        ### ✨ Features
+        - **Smart heading detection** based on font size and styles
+        - **Table extraction** and markdown formatting
+        - **List detection** and proper markdown conversion
+        - **Inline formatting** preservation (bold, italic)
+        - **Structure analysis** with statistics
+        """)
+    return demo
 if __name__ == "__main__":
+    # Create and launch the interface
+    demo = create_interface()
+    demo.launch(server_name="0.0.0.0", mcp_server=True, server_port=7860, share=True)
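
Review note on `_convert_pdf_blocks_to_markdown`: the heading prefix is chosen per span, so a line made of several large-font spans can end up with `#` markers in the middle of the line, wrapped around bold markers. Below is a minimal sketch of one possible alternative, which decides the heading level once per line from the largest span and leaves heading text unstyled. It is not the committed implementation. The helper names `heading_prefix` and `line_to_markdown` and the sample data are hypothetical. The input is a hand-built dict shaped like one line of `page.get_text("dict")`, so the sketch runs without PyMuPDF installed. The size thresholds and flag bits are the ones used in the diff.

```python
from typing import Any, Dict, List

BOLD_FLAG = 16   # bold bit, as used in the diff
ITALIC_FLAG = 2  # italic bit, as used in the diff


def heading_prefix(font_size: float) -> str:
    """Map a font size to a Markdown heading prefix (thresholds from the diff)."""
    if font_size >= 18:
        return "# "
    if font_size >= 16:
        return "## "
    if font_size >= 14:
        return "### "
    return ""


def line_to_markdown(line: Dict[str, Any]) -> str:
    """Convert one line of a PyMuPDF-style "dict" block to Markdown."""
    spans = [s for s in line.get("spans", []) if s.get("text", "").strip()]
    if not spans:
        return ""
    # Decide the heading level once, from the largest span on the line.
    prefix = heading_prefix(max(s.get("size", 12) for s in spans))
    parts: List[str] = []
    for span in spans:
        text = span["text"].strip()
        flags = span.get("flags", 0)
        bold = bool(flags & BOLD_FLAG)
        italic = bool(flags & ITALIC_FLAG)
        if not prefix:  # emphasis markers only in body text, not in headings
            if bold and italic:
                text = f"***{text}***"
            elif bold:
                text = f"**{text}**"
            elif italic:
                text = f"*{text}*"
        parts.append(text)
    return prefix + " ".join(parts)


# Illustrative data only, shaped like lines from page.get_text("dict").
title_line = {"spans": [
    {"text": "Quarterly ", "size": 18.0, "flags": 16},
    {"text": "Report", "size": 18.0, "flags": 16},
]}
body_line = {"spans": [
    {"text": "Revenue grew ", "size": 11.0, "flags": 0},
    {"text": "strongly", "size": 11.0, "flags": 18},
]}

print(line_to_markdown(title_line))
print(line_to_markdown(body_line))
```

Dropping emphasis inside headings is a design choice of this sketch; if bold or italic heading text should be kept, the `if not prefix` guard can be removed.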