Spaces:
Sleeping
Sleeping
| title: Pdf Explainer | |
| emoji: π¦ | |
| colorFrom: indigo | |
| colorTo: yellow | |
| sdk: gradio | |
| sdk_version: 5.33.0 | |
| app_file: app.py | |
| pinned: false | |
| tags: [agent-demo-track] | |
| # π PDF Explainer | |
| An intelligent PDF processing application that extracts text from PDF documents, generates easy-to-understand explanations, and creates audio narrations. This tool transforms complex PDF content into accessible formats using cutting-edge AI technologies. | |
| ## π₯ Video Overview | |
| [Watch a video overview of Pdf Explainer](https://lifehkbueduhk-my.sharepoint.com/:v:/g/personal/22203133_life_hkbu_edu_hk/ESvvzCNfRJBGg0_mMwGMLGoBwBhEQLtoKc-JzOjWWQ_ZDw?nav=eyJyZWZlcnJhbEluZm8iOnsicmVmZXJyYWxBcHAiOiJPbmVEcml2ZUZvckJ1c2luZXNzIiwicmVmZXJyYWxBcHBQbGF0Zm9ybSI6IldlYiIsInJlZmVycmFsTW9kZSI6InZpZXciLCJyZWZlcnJhbFZpZXciOiJNeUZpbGVzTGlua0NvcHkifX0&e=iuKAGg) | |
| This video explains the usage and purpose of the Pdf Explainer application. | |
| ## β¨ Features | |
| - **π PDF Text Extraction**: Extract text content from PDF documents using advanced OCR technology | |
| - **π€ Intelligent Explanations**: Generate simple, easy-to-understand explanations of complex content | |
| - **π Audio Generation**: Convert explanations to high-quality audio narrations | |
| - **β‘ Parallel Processing**: Efficient processing of large documents with chunking and parallel audio generation | |
| - **π― Context-Aware**: Maintains context across document sections for coherent explanations | |
| - **π± User-Friendly Interface**: Clean, responsive Gradio-based web interface | |
| ## ποΈ Architecture & Technology Stack | |
| ### Core Technologies | |
| #### 1. **Mistral OCR** - Text Extraction | |
| - **Model**: `mistral-ocr-latest` | |
| - **Purpose**: Extract text and images from PDF documents | |
| - **Features**: | |
| - Advanced OCR capabilities with markdown formatting | |
| - Image extraction with coordinate mapping | |
| - Multi-page document support | |
| - Base64 encoding for secure document processing | |
| #### 2. **Mistral AI Models** - Content Generation | |
| - **Topic Extraction**: `ministral-8b-2410` for document topic identification | |
| - **Explanation Generation**: `mistral-medium-2505` for creating simplified explanations | |
| - **Features**: | |
| - Structured JSON output for topic extraction | |
| - Chat history maintenance for contextual explanations | |
| - Temperature-controlled generation for consistent results | |
| - Section-by-section processing with heading analysis | |
| #### 3. **Chatterbox TTS** - Audio Generation | |
| - **Platform**: Modal-deployed APIs | |
| - **Endpoints**: | |
| - `GENERATE_AUDIO_ENDPOINT`: Standard text-to-speech conversion | |
| - `GENERATE_WITH_FILE_ENDPOINT`: Voice cloning with custom audio prompts | |
| - **Features**: | |
| - High-quality audio synthesis | |
| - Voice cloning capabilities | |
| - Streaming audio responses | |
| - Progress tracking for long generations | |
| ### Processing Pipeline | |
| ```mermaid | |
| graph TD | |
| A[PDF Upload] --> B[Mistral OCR Processing] | |
| B --> C[Text Extraction & Image Detection] | |
| C --> D[Section Analysis & Heading Detection] | |
| D --> E[Topic Identification - Ministral-8B] | |
| E --> F[Explanation Generation - Mistral-Small] | |
| F --> G[Text Chunking for Audio] | |
| G --> H[Parallel Audio Processing] | |
| H --> I[Chatterbox TTS Generation] | |
| I --> J[Audio Concatenation] | |
| J --> K[Final Output] | |
| ``` | |
| ## π§ Installation & Setup | |
| ### Prerequisites | |
| - Python 3.8+ | |
| - Virtual environment (recommended) | |
| ### Environment Variables | |
| Create a `.env` file based on `.env.example`: | |
| ```bash | |
| # Mistral AI API Key | |
| MISTRAL_API_KEY=your_mistral_api_key_here | |
| # Chatterbox TTS API Endpoints (Modal) | |
| HEALTH_ENDPOINT=https://your-modal-endpoint/chatterbox-health | |
| GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-audio | |
| GENERATE_JSON_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-json | |
| GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-with-file | |
| GENERATE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate | |
| ``` | |
| ### Installation | |
| 1. **Clone the repository**: | |
| ```bash | |
| git clone <repository-url> | |
| cd pdf_explainer | |
| ``` | |
| 2. **Create virtual environment**: | |
| ```bash | |
| python -m venv .venv | |
| source .venv/Scripts/activate # Windows | |
| # or | |
| source .venv/bin/activate # Linux/Mac | |
| ``` | |
| 3. **Install dependencies**: | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| 4. **Run the application**: | |
| ```bash | |
| python app.py | |
| ``` | |
| ## π Usage | |
| 1. **Upload PDF**: Use the file upload interface to select your PDF document | |
| 2. **Automatic Processing**: The application will: | |
| - Extract text using Mistral OCR | |
| - Generate explanations using Mistral AI | |
| - Create audio narration using Chatterbox TTS | |
| 3. **View Results**: Access extracted text, explanations, and audio in separate tabs | |
| 4. **Download**: Copy text or download audio files as needed | |
| ## π Project Structure | |
| ``` | |
| pdf_explainer/ | |
| βββ app.py # Main application entry point | |
| βββ requirements.txt # Python dependencies | |
| βββ .env.example # Environment variables template | |
| βββ src/ | |
| β βββ processors/ # Core processing modules | |
| β β βββ pdf_processor.py # Main PDF processing orchestrator | |
| β β βββ pdf_text_extractor.py # Mistral OCR integration | |
| β β βββ audio_processor.py # Audio generation coordinator | |
| β β βββ generate_tts_audio.py # Chatterbox TTS integration | |
| β β βββ text_chunker.py # Text splitting for audio processing | |
| β β βββ parallel_processor.py # Parallel audio generation | |
| β β βββ audio_concatenator.py # Audio chunk merging | |
| β βββ ui_components/ # User interface components | |
| β β βββ interface.py # Gradio interface builder | |
| β β βββ styles.py # CSS styling | |
| β βββ utils/ # Utility modules | |
| β βββ text_explainer.py # Mistral AI explanation generation | |
| ``` | |
| ## π§ Key Components | |
| ### PDF Processing (`PDFTextExtractor`) | |
| - **OCR Integration**: Processes PDFs using Mistral's latest OCR model | |
| - **Multi-strategy Extraction**: Multiple fallback methods for text extraction | |
| - **Image Support**: Extracts and maps images with coordinates | |
| - **Error Handling**: Robust error recovery and debugging | |
| ### Explanation Generation (`TextExplainer`) | |
| - **Section Analysis**: Automatic detection of markdown headings | |
| - **Context Maintenance**: Chat history for coherent multi-section explanations | |
| - **Topic Extraction**: Automatic identification of document themes | |
| - **Adaptive Processing**: Skips minimal content sections to optimize API usage | |
| ### Audio Processing (`AudioProcessor`) | |
| - **Intelligent Chunking**: Splits text at natural boundaries (paragraphs, sentences) | |
| - **Parallel Generation**: Concurrent audio generation for faster processing | |
| - **Audio Concatenation**: Seamless merging with silence padding and fade effects | |
| - **Progress Tracking**: Real-time updates during long operations | |
| ## ποΈ Configuration Options | |
| ### Text Chunking | |
| - `max_chunk_size`: Maximum characters per audio chunk (default: 800) | |
| - `overlap_sentences`: Sentence overlap between chunks for continuity | |
| ### Audio Processing | |
| - `max_workers`: Parallel processing threads (default: 4) | |
| - `silence_duration`: Pause between audio chunks (default: 0.5s) | |
| - `fade_duration`: Fade in/out effects (default: 0.1s) | |
| ### AI Models | |
| - Mistral OCR: Latest OCR model for text extraction | |
| - Ministral-8B: Topic extraction with structured output | |
| - Mistral-Small: Explanation generation with chat context | |
| ## π€ Contributing | |
| 1. Fork the repository | |
| 2. Create a feature branch: `git checkout -b feature-name` | |
| 3. Make your changes and test thoroughly | |
| 4. Commit with descriptive messages: `git commit -m "Add feature description"` | |
| 5. Push to your fork: `git push origin feature-name` | |
| 6. Create a pull request | |
| ## π License | |
| This project is open source and available under the [MIT License](LICENSE). | |
| ## π Support | |
| For questions, issues, or contributions: | |
| - Create an issue in the repository | |
| - Check the video overview for usage guidance | |
| - Review the code documentation for technical details | |
| --- | |
| **Built with β€οΈ using Mistral AI, Gradio, and Modal-deployed Chatterbox TTS** | |