Spaces:

Agents-MCP-Hackathon
/

pdf_explainer

Sleeping

App Files Files Community

spagestic commited on Jun 9, 2025

Commit

e37b0d2

1 Parent(s): 1027486

docs: enhance README with detailed application overview, features, and installation instructions

Browse files

Files changed (1) hide show

README.md +217 -2

README.md CHANGED Viewed

@@ -10,10 +10,225 @@ pinned: false
 tags: [agent-demo-track]
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
-## Video Overview
 [Watch a video overview of Pdf Explainer](https://lifehkbueduhk-my.sharepoint.com/:v:/g/personal/22203133_life_hkbu_edu_hk/ESvvzCNfRJBGg0_mMwGMLGoBwBhEQLtoKc-JzOjWWQ_ZDw?nav=eyJyZWZlcnJhbEluZm8iOnsicmVmZXJyYWxBcHAiOiJPbmVEcml2ZUZvckJ1c2luZXNzIiwicmVmZXJyYWxBcHBQbGF0Zm9ybSI6IldlYiIsInJlZmVycmFsTW9kZSI6InZpZXciLCJyZWZlcnJhbFZpZXciOiJNeUZpbGVzTGlua0NvcHkifX0&e=iuKAGg)
 This video explains the usage and purpose of the Pdf Explainer application.

 tags: [agent-demo-track]
 ---
+# 🔍 PDF Explainer
+An intelligent PDF processing application that extracts text from PDF documents, generates easy-to-understand explanations, and creates audio narrations. This tool transforms complex PDF content into accessible formats using cutting-edge AI technologies.
+## 🎥 Video Overview
 [Watch a video overview of Pdf Explainer](https://lifehkbueduhk-my.sharepoint.com/:v:/g/personal/22203133_life_hkbu_edu_hk/ESvvzCNfRJBGg0_mMwGMLGoBwBhEQLtoKc-JzOjWWQ_ZDw?nav=eyJyZWZlcnJhbEluZm8iOnsicmVmZXJyYWxBcHAiOiJPbmVEcml2ZUZvckJ1c2luZXNzIiwicmVmZXJyYWxBcHBQbGF0Zm9ybSI6IldlYiIsInJlZmVycmFsTW9kZSI6InZpZXciLCJyZWZlcnJhbFZpZXciOiJNeUZpbGVzTGlua0NvcHkifX0&e=iuKAGg)
 This video explains the usage and purpose of the Pdf Explainer application.
+## ✨ Features
+- **📄 PDF Text Extraction**: Extract text content from PDF documents using advanced OCR technology
+- **🤖 Intelligent Explanations**: Generate simple, easy-to-understand explanations of complex content
+- **🔊 Audio Generation**: Convert explanations to high-quality audio narrations
+- **⚡ Parallel Processing**: Efficient processing of large documents with chunking and parallel audio generation
+- **🎯 Context-Aware**: Maintains context across document sections for coherent explanations
+- **📱 User-Friendly Interface**: Clean, responsive Gradio-based web interface
+## 🏗️ Architecture & Technology Stack
+### Core Technologies
+#### 1. **Mistral OCR** - Text Extraction
+- **Model**: `mistral-ocr-latest`
+- **Purpose**: Extract text and images from PDF documents
+- **Features**:
+  - Advanced OCR capabilities with markdown formatting
+  - Image extraction with coordinate mapping
+  - Multi-page document support
+  - Base64 encoding for secure document processing
+#### 2. **Mistral AI Models** - Content Generation
+- **Topic Extraction**: `ministral-8b-2410` for document topic identification
+- **Explanation Generation**: `mistral-small-2503` for creating simplified explanations
+- **Features**:
+  - Structured JSON output for topic extraction
+  - Chat history maintenance for contextual explanations
+  - Temperature-controlled generation for consistent results
+  - Section-by-section processing with heading analysis
+#### 3. **Chatterbox TTS** - Audio Generation
+- **Platform**: Modal-deployed APIs
+- **Endpoints**:
+  - `GENERATE_AUDIO_ENDPOINT`: Standard text-to-speech conversion
+  - `GENERATE_WITH_FILE_ENDPOINT`: Voice cloning with custom audio prompts
+- **Features**:
+  - High-quality audio synthesis
+  - Voice cloning capabilities
+  - Streaming audio responses
+  - Progress tracking for long generations
+### Processing Pipeline
+```mermaid
+graph TD
+    A[PDF Upload] --> B[Mistral OCR Processing]
+    B --> C[Text Extraction & Image Detection]
+    C --> D[Section Analysis & Heading Detection]
+    D --> E[Topic Identification - Ministral-8B]
+    E --> F[Explanation Generation - Mistral-Small]
+    F --> G[Text Chunking for Audio]
+    G --> H[Parallel Audio Processing]
+    H --> I[Chatterbox TTS Generation]
+    I --> J[Audio Concatenation]
+    J --> K[Final Output]
+```
+## 🔧 Installation & Setup
+### Prerequisites
+- Python 3.8+
+- Virtual environment (recommended)
+### Environment Variables
+Create a `.env` file based on `.env.example`:
+```bash
+# Mistral AI API Key
+MISTRAL_API_KEY=your_mistral_api_key_here
+# Chatterbox TTS API Endpoints (Modal)
+HEALTH_ENDPOINT=https://your-modal-endpoint/chatterbox-health
+GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-audio
+GENERATE_JSON_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-json
+GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-with-file
+GENERATE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate
+```
+### Installation
+1. **Clone the repository**:
+   ```bash
+   git clone <repository-url>
+   cd pdf_explainer
+   ```
+2. **Create virtual environment**:
+   ```bash
+   python -m venv .venv
+   source .venv/Scripts/activate  # Windows
+   # or
+   source .venv/bin/activate      # Linux/Mac
+   ```
+3. **Install dependencies**:
+   ```bash
+   pip install -r requirements.txt
+   ```
+4. **Run the application**:
+   ```bash
+   python app.py
+   ```
+## 🚀 Usage
+1. **Upload PDF**: Use the file upload interface to select your PDF document
+2. **Automatic Processing**: The application will:
+   - Extract text using Mistral OCR
+   - Generate explanations using Mistral AI
+   - Create audio narration using Chatterbox TTS
+3. **View Results**: Access extracted text, explanations, and audio in separate tabs
+4. **Download**: Copy text or download audio files as needed
+## 📁 Project Structure
+```
+pdf_explainer/
+├── app.py                      # Main application entry point
+├── requirements.txt            # Python dependencies
+├── .env.example               # Environment variables template
+├── src/
+│   ├── processors/            # Core processing modules
+│   │   ├── pdf_processor.py          # Main PDF processing orchestrator
+│   │   ├── pdf_text_extractor.py     # Mistral OCR integration
+│   │   ├── audio_processor.py        # Audio generation coordinator
+│   │   ├── generate_tts_audio.py     # Chatterbox TTS integration
+│   │   ├── text_chunker.py           # Text splitting for audio processing
+│   │   ├── parallel_processor.py     # Parallel audio generation
+│   │   └── audio_concatenator.py     # Audio chunk merging
+│   ├── ui_components/         # User interface components
+│   │   ├── interface.py              # Gradio interface builder
+│   │   └── styles.py                 # CSS styling
+│   └── utils/                 # Utility modules
+│       └── text_explainer.py         # Mistral AI explanation generation
+```
+## 🔧 Key Components
+### PDF Processing (`PDFTextExtractor`)
+- **OCR Integration**: Processes PDFs using Mistral's latest OCR model
+- **Multi-strategy Extraction**: Multiple fallback methods for text extraction
+- **Image Support**: Extracts and maps images with coordinates
+- **Error Handling**: Robust error recovery and debugging
+### Explanation Generation (`TextExplainer`)
+- **Section Analysis**: Automatic detection of markdown headings
+- **Context Maintenance**: Chat history for coherent multi-section explanations
+- **Topic Extraction**: Automatic identification of document themes
+- **Adaptive Processing**: Skips minimal content sections to optimize API usage
+### Audio Processing (`AudioProcessor`)
+- **Intelligent Chunking**: Splits text at natural boundaries (paragraphs, sentences)
+- **Parallel Generation**: Concurrent audio generation for faster processing
+- **Audio Concatenation**: Seamless merging with silence padding and fade effects
+- **Progress Tracking**: Real-time updates during long operations
+## 🎛️ Configuration Options
+### Text Chunking
+- `max_chunk_size`: Maximum characters per audio chunk (default: 800)
+- `overlap_sentences`: Sentence overlap between chunks for continuity
+### Audio Processing
+- `max_workers`: Parallel processing threads (default: 4)
+- `silence_duration`: Pause between audio chunks (default: 0.5s)
+- `fade_duration`: Fade in/out effects (default: 0.1s)
+### AI Models
+- Mistral OCR: Latest OCR model for text extraction
+- Ministral-8B: Topic extraction with structured output
+- Mistral-Small: Explanation generation with chat context
+## 🤝 Contributing
+1. Fork the repository
+2. Create a feature branch: `git checkout -b feature-name`
+3. Make your changes and test thoroughly
+4. Commit with descriptive messages: `git commit -m "Add feature description"`
+5. Push to your fork: `git push origin feature-name`
+6. Create a pull request
+## 📄 License
+This project is open source and available under the [MIT License](LICENSE).
+## 🆘 Support
+For questions, issues, or contributions:
+- Create an issue in the repository
+- Check the video overview for usage guidance
+- Review the code documentation for technical details
+---
+**Built with ❤️ using Mistral AI, Gradio, and Modal-deployed Chatterbox TTS**