---
license: mit
tags:
- OCR
- Apple Silicon
- MLX
- MLX-VLM
- Vision Language Model
- Document Processing
- Gradio
- Apple M1
- Apple M2
- Apple M3
- Apple M4
- MonkeyOCR
- Qwen2.5-VL
library_name: transformers
---

# 🚀 MonkeyOCR-MLX: Apple Silicon Optimized OCR

A high-performance OCR application optimized for Apple Silicon with **MLX-VLM acceleration**, featuring advanced document layout analysis and intelligent text extraction.

## 🔥 Key Features

- **⚡ MLX-VLM Optimization**: Native Apple Silicon acceleration using MLX framework
- **🚀 Faster Processing**: ~15-18 s vs. ~25-30 s for standard PyTorch on a complex tax form (see benchmarks below)
- **🧠 Advanced AI**: Powered by Qwen2.5-VL model with specialized layout analysis
- **📄 Multi-format Support**: PDF, PNG, JPG, JPEG with intelligent structure detection
- **🌐 Modern Web Interface**: Beautiful Gradio interface for easy document processing
- **🔄 Batch Processing**: Efficient handling of multiple documents
- **🎯 High Accuracy**: Specialized for complex financial documents and tables
- **🔒 100% Private**: All processing happens locally on your Mac

## 📊 Performance Benchmarks

**Test: Complex Financial Document (Tax Form)**
- **MLX-VLM**: ~15-18 seconds ⚡
- **Standard PyTorch**: ~25-30 seconds
- **CPU Only**: ~60-90 seconds

**MacBook M4 Pro Performance**:
- Model loading: ~1.7s
- Text extraction: ~15s  
- Table structure: ~18s
- Memory usage: ~13GB peak
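The relative speedups can be worked out from the timing ranges above alone. This sketch assumes the three ranges come from the same document on the same machine; no new measurements are introduced.

```python
# Speedup ranges derived only from the benchmark timings listed above (seconds).
# Each range is (fastest, slowest).
TIMINGS = {
    "mlx_vlm": (15.0, 18.0),
    "pytorch": (25.0, 30.0),
    "cpu_only": (60.0, 90.0),
}


def speedup_range(baseline, candidate):
    """Return (min, max) speedup of `candidate` over `baseline`.

    The minimum pairs the fastest baseline with the slowest candidate;
    the maximum pairs the slowest baseline with the fastest candidate.
    """
    low = baseline[0] / candidate[1]
    high = baseline[1] / candidate[0]
    return low, high


def midpoint_speedup(baseline, candidate) -> float:
    """Speedup computed from the midpoints of the two ranges."""
    return (sum(baseline) / 2) / (sum(candidate) / 2)


if __name__ == "__main__":
    for name in ("pytorch", "cpu_only"):
        lo, hi = speedup_range(TIMINGS[name], TIMINGS["mlx_vlm"])
        mid = midpoint_speedup(TIMINGS[name], TIMINGS["mlx_vlm"])
        print(f"MLX-VLM vs {name}: {lo:.1f}x-{hi:.1f}x (midpoint {mid:.2f}x)")
```

Running it prints `1.4x-2.0x (midpoint 1.67x)` against standard PyTorch and `3.3x-6.0x (midpoint 4.55x)` against CPU-only inference.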

## 🛠 Installation

### Prerequisites

- **macOS** with Apple Silicon (M1/M2/M3/M4)
- **Python 3.11+**
- **16GB+ RAM** (32GB+ recommended for large documents)
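
The prerequisites can be checked with a small preflight script. This is a hypothetical helper, not shipped with the repository or run by `setup.sh`; it assumes macOS's `sysctl -n hw.memsize` for reading total RAM and reports `None` for RAM anywhere that command is unavailable.

```python
# Hypothetical preflight check for the prerequisites listed above.
import platform
import subprocess
import sys
from typing import List, Optional

MIN_PYTHON = (3, 11)
MIN_RAM_GB = 16


def check_prereqs(system: str, machine: str, version: tuple, ram_gb: Optional[float]) -> List[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    if system != "Darwin":
        problems.append(f"macOS required (found {system})")
    if machine != "arm64":
        problems.append(f"Apple Silicon (arm64) required (found {machine})")
    if tuple(version[:2]) < MIN_PYTHON:
        problems.append(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required (found {version[0]}.{version[1]})")
    if ram_gb is not None and ram_gb < MIN_RAM_GB:
        problems.append(f"{MIN_RAM_GB}GB+ RAM recommended (found {ram_gb:.0f}GB)")
    return problems


def physical_ram_gb() -> Optional[float]:
    """Total RAM in GB via `sysctl -n hw.memsize` (macOS); None if unavailable."""
    try:
        out = subprocess.run(["sysctl", "-n", "hw.memsize"], capture_output=True, text=True, check=True)
        return int(out.stdout.strip()) / 1024**3
    except (OSError, subprocess.CalledProcessError, ValueError):
        return None


if __name__ == "__main__":
    issues = check_prereqs(platform.system(), platform.machine(), sys.version_info, physical_ram_gb())
    print("\n".join(issues) if issues else "All prerequisites satisfied")
```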

### Quick Setup

1. **Clone the repository**:
   ```bash
   git clone https://huggingface.co/Jimmi42/MonkeyOCR-Apple-Silicon
   cd MonkeyOCR-Apple-Silicon
   ```

2. **Run the automated setup script**:
   ```bash
   chmod +x setup.sh
   ./setup.sh
   ```
   
   This script will automatically:
   - Download MonkeyOCR from the official GitHub repository
   - **Apply MLX-VLM optimization patches** for Apple Silicon
   - **Enable smart backend auto-selection** (MLX/LMDeploy/transformers)
   - Install UV package manager if needed
   - Set up virtual environment with Python 3.11
   - Install all dependencies including MLX-VLM
   - Download required model weights
   - Configure optimal backend for your hardware

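3. **Alternative: manual installation** (instead of step 2):
   ```bash
   # Install UV if not already installed
   curl -LsSf https://astral.sh/uv/install.sh | sh
   
   # Download MonkeyOCR
   git clone https://github.com/Yuliang-Liu/MonkeyOCR.git MonkeyOCR
   
   # Install dependencies (includes mlx-vlm) and activate the environment
   uv sync
   source .venv/bin/activate
   
   # Download models
   cd MonkeyOCR && python tools/download_model.py && cd ..
   ```

   These steps do not apply the MLX-VLM patches that `setup.sh` adds, so the `MonkeyChat_MLX` backend is only available after running `./setup.sh` (see "Patches Not Applied" under Troubleshooting).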
   ```

## 🏃‍♂️ Usage

### Web Interface (Recommended)

```bash
# Activate virtual environment
source .venv/bin/activate  # or skip activation and run `uv run python app.py`

# Start the web app
python app.py
```

Access the interface at `http://localhost:7861`

### Command Line

```bash
python main.py path/to/document.pdf
```
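
To process a whole folder with the CLI above, a thin driver can call `main.py` once per file. This is a hypothetical sketch; it assumes only what the command above shows (`main.py` takes a single document path) and filters on the formats listed under Key Features.

```python
# Hypothetical batch driver: runs the documented CLI once per supported file.
import subprocess
import sys
from pathlib import Path

SUPPORTED = {".pdf", ".png", ".jpg", ".jpeg"}


def collect_documents(folder: str) -> list:
    """Return supported files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def process_all(folder: str) -> dict:
    """Run `python main.py <file>` for each document; map file name -> exit code."""
    results = {}
    for doc in collect_documents(folder):
        completed = subprocess.run([sys.executable, "main.py", str(doc)])
        results[doc.name] = completed.returncode
    return results


if __name__ == "__main__":
    codes = process_all(sys.argv[1] if len(sys.argv) > 1 else ".")
    failed = [name for name, code in codes.items() if code != 0]
    print(f"Processed {len(codes)} file(s); {len(failed)} failed: {failed}")
```

Each call starts a fresh process and reloads the models, so for large batches an in-process loop that loads the model once would avoid the repeated load time.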

## ⚙️ Configuration

### Smart Backend Selection (Default)

The app automatically detects your hardware and selects the optimal backend:

```yaml
# model_configs_mps.yaml
device: mps
chat_config:
  backend: auto  # Smart auto-selection
  batch_size: 1
  max_new_tokens: 256
  temperature: 0.0
```
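
A quick sanity check of the `chat_config` block can catch typos such as an unknown backend name. The allowed backend names below come from this README; the checks themselves are assumptions and not MonkeyOCR's own validation, and loading the file requires PyYAML.

```python
# Hypothetical validator for the chat_config block shown above.
VALID_BACKENDS = {"auto", "mlx", "lmdeploy", "transformers"}


def validate_chat_config(config: dict) -> list:
    """Return a list of problems found in config["chat_config"]."""
    chat = config.get("chat_config", {})
    problems = []
    backend = chat.get("backend", "auto")
    if backend not in VALID_BACKENDS:
        problems.append(f"unknown backend {backend!r}; expected one of {sorted(VALID_BACKENDS)}")
    for key, default in (("batch_size", 1), ("max_new_tokens", 256)):
        value = chat.get(key, default)
        if not isinstance(value, int) or value < 1:
            problems.append(f"{key} must be a positive integer")
    if chat.get("temperature", 0.0) < 0:
        problems.append("temperature must be >= 0")
    return problems


def load_and_validate(path: str = "model_configs_mps.yaml") -> list:
    """Load the YAML file and validate it."""
    import yaml  # PyYAML; imported lazily so the validator works without it

    with open(path, encoding="utf-8") as fh:
        return validate_chat_config(yaml.safe_load(fh) or {})
```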

**Auto-Selection Logic:**
- 🍎 **Apple Silicon (MPS)** → MLX-VLM (fastest option on Apple Silicon)
- 🖥️ **CUDA GPU** → LMDeploy (optimized for NVIDIA)  
- 💻 **CPU/Fallback** → Transformers (universal compatibility)
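
The selection rules above can be sketched as a small pure function. This is a hypothetical illustration, not the code patched into `custom_model.py`; it assumes selection depends on the configured `backend`, PyTorch's device checks (the same `torch.backends.mps.is_available()` used under Troubleshooting), and whether each backend's package is importable.

```python
# Hypothetical sketch of the auto-selection rules above.
import importlib.util


def choose_backend(requested: str, mps: bool, cuda: bool, available: set) -> str:
    """Resolve the configured backend to a concrete one.

    `available` holds backend names whose packages are importable.
    An explicit request (anything other than "auto") wins if installed;
    otherwise the hardware rules apply.
    """
    if requested != "auto" and requested in available:
        return requested
    if mps and "mlx" in available:
        return "mlx"          # Apple Silicon -> MLX-VLM
    if cuda and "lmdeploy" in available:
        return "lmdeploy"     # NVIDIA GPU -> LMDeploy
    return "transformers"     # universal fallback


def detect_environment():
    """Return (mps, cuda, available) for the current machine (needs PyTorch)."""
    import torch

    modules = {"mlx": "mlx_vlm", "lmdeploy": "lmdeploy", "transformers": "transformers"}
    available = {name for name, mod in modules.items() if importlib.util.find_spec(mod) is not None}
    return torch.backends.mps.is_available(), torch.cuda.is_available(), available


if __name__ == "__main__":
    mps, cuda, available = detect_environment()
    print("Selected backend:", choose_backend("auto", mps, cuda, available))
```

In this sketch, forcing `backend: mlx` corresponds to `requested="mlx"`: it is honored when `mlx_vlm` is installed and otherwise falls back to the hardware rules.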

### Performance Backends

| Backend | Speed | Memory | Best For | Auto-Selected |
|---------|-------|--------|----------|---------------|
| `auto` | Matches selected backend | Matches selected backend | **All systems** (Recommended) | ✅ Default |
| `mlx` | 🚀🚀🚀 | 🟢 | Apple Silicon | 🍎 Auto for MPS |
| `lmdeploy` | 🚀🚀 | 🟡 | CUDA systems | 🖥️ Auto for CUDA |
| `transformers` | 🚀 | 🟢 | Universal fallback | 💻 Auto for CPU |

## 🧠 Model Architecture

### Core Components
- **Layout Detection**: DocLayout-YOLO for document structure analysis
- **Vision-Language Model**: Qwen2.5-VL with MLX optimization
- **Layout Reading**: LayoutReader for reading order optimization
- **MLX Framework**: Native Apple Silicon acceleration

### Apple Silicon Optimizations
- **Metal Performance Shaders**: Direct GPU acceleration
- **Unified Memory**: Optimized memory access patterns
- **Float16 Precision**: Optimal speed/accuracy balance

## 🎯 Perfect For

### Document Types:
- 📊 **Financial Documents**: Tax forms, invoices, statements
- 📋 **Legal Documents**: Contracts, forms, certificates  
- 📄 **Academic Papers**: Research papers, articles
- 🏢 **Business Documents**: Reports, presentations, spreadsheets

### Advanced Features:
- ✅ Complex table extraction with highlighted cells
- ✅ Multi-column layouts and mixed content
- ✅ Mathematical formulas and equations
- ✅ Structured data output (Markdown, JSON)
- ✅ Batch processing for multiple files

## 🚨 Troubleshooting

### MLX-VLM Issues

```bash
# Test MLX-VLM availability
python -c "import mlx_vlm; print('✅ MLX-VLM available')"

# Check if auto backend selection is working
python -c "
from MonkeyOCR.magic_pdf.model.custom_model import MonkeyOCR
model = MonkeyOCR('model_configs_mps.yaml')
print(f'Selected backend: {type(model.chat_model).__name__}')
"
```

### Performance Issues

```bash
# Check MPS availability
python -c "import torch; print(f'MPS available: {torch.backends.mps.is_available()}')"

# Monitor memory usage during processing
top -pid $(pgrep -f "python app.py" | head -n 1)
```
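
To reproduce per-stage numbers like those in the benchmarks, a minimal wall-clock timer is enough. The stage names and the commented usage are illustrative assumptions, not MonkeyOCR APIs; the commented `MonkeyOCR(...)` call mirrors the troubleshooting snippet above.

```python
# Minimal stage timer for measuring model loading, extraction, etc.
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def stage(name: str):
    """Record wall-clock seconds spent inside the `with` block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


def report() -> str:
    """Format the recorded stages, one per line."""
    return "\n".join(f"{name}: {secs:.2f}s" for name, secs in timings.items())


# Example usage:
# with stage("model_loading"):
#     model = MonkeyOCR("model_configs_mps.yaml")
# print(report())
```

GPU work on PyTorch's MPS backend is asynchronous and MLX evaluates lazily, so for fine-grained timings call `torch.mps.synchronize()` or `mx.eval(...)` before leaving the block; end-to-end timings that produce text output are not affected.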

### Common Solutions

1. **Patches Not Applied**: 
   - Re-run `./setup.sh` to reapply patches
   - Check that `MonkeyOCR` directory exists and has our modifications
   - Verify `MonkeyChat_MLX` class exists in `MonkeyOCR/magic_pdf/model/custom_model.py`

2. **Wrong Backend Selected**: 
   - Check hardware detection with `python -c "import torch; print(torch.backends.mps.is_available())"`
   - Verify MLX-VLM is installed in the project environment: `uv pip install mlx-vlm`
   - Use `backend: mlx` in config to force MLX backend

3. **Slow Performance**: 
   - Ensure auto-selection chose MLX backend on Apple Silicon
   - Check GPU usage in Activity Monitor (Window → GPU History)
   - Verify `backend: auto` in `model_configs_mps.yaml`

4. **Memory Issues**: 
   - Reduce image resolution before processing
   - Close other memory-intensive applications
   - Keep `batch_size: 1` in the config (the default shown above)

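5. **Port Already in Use**: start the app on another port. Gradio reads `GRADIO_SERVER_PORT` only if `app.py` does not pass `server_port` to `launch()` explicitly: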
   ```bash
   GRADIO_SERVER_PORT=7862 python app.py
   ```
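
For the "Reduce image resolution" tip under Memory Issues, images can be shrunk before upload. This is a hypothetical helper that requires Pillow; the 2048 px default is an arbitrary example, not a MonkeyOCR setting, and PDFs are not handled.

```python
# Hypothetical pre-processing helper: cap the longest image side at `max_side`.
def fit_within(width: int, height: int, max_side: int) -> tuple:
    """Return (w, h) scaled down to fit max_side, preserving aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def downscale(src: str, dst: str, max_side: int = 2048) -> tuple:
    """Resize `src` with Pillow (only if needed) and save to `dst`; return new size."""
    from PIL import Image  # requires Pillow

    with Image.open(src) as img:
        size = fit_within(img.width, img.height, max_side)
        if size != img.size:
            img = img.resize(size, Image.LANCZOS)
        img.save(dst)
    return size
```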

## 📁 Project Structure

```
MonkeyOCR-Apple-Silicon/
├── 🌐 app.py                    # Gradio web interface
├── 🖥️ main.py                   # CLI interface
├── 🛠️ setup.sh                  # Automated setup and patching script
├── ⚙️ model_configs_mps.yaml    # MLX-optimized config
├── 📦 requirements.txt          # Dependencies (includes mlx-vlm)
├── 🛠️ torch_patch.py            # Compatibility patches
├── 🧠 MonkeyOCR/                # Core AI models (downloaded by setup.sh)
│   └── 🎯 magic_pdf/            # Processing engine
├── 📄 .gitignore                # Git ignore rules
└── 📚 README.md                 # This file
```

## 🔥 What's New in MLX Version

- ✨ **Smart Patching System**: Automatically applies MLX-VLM optimizations to official MonkeyOCR
- 🧠 **Intelligent Backend Selection**: Auto-detects hardware and selects optimal backend
- 🚀 **Faster Processing**: MLX-VLM acceleration on Apple Silicon
- 💾 **Better Memory Efficiency**: Optimized for unified memory architecture
- 🎯 **Improved Accuracy**: Enhanced table and structure detection
- 🔧 **Zero Configuration**: Works out-of-the-box with smart defaults
- 📊 **Performance Monitoring**: Built-in timing and metrics
- 🛠️ **Latest Fix (June 2025)**: Resolved MLX-VLM prompt formatting for optimal OCR output
- 🔄 **Always Up-to-Date**: Uses official MonkeyOCR repository with our patches applied

## 🔬 Technical Implementation

### Smart Patching System
- **Dynamic Code Injection**: Automatically adds MLX-VLM class to official MonkeyOCR
- **Backend Selection Logic**: Patches smart hardware detection into initialization
- **Zero Maintenance**: Always uses latest official MonkeyOCR with our optimizations
- **Seamless Integration**: Patches are applied transparently during setup
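
One way such patching can be made safe to re-run (as the "re-run `./setup.sh`" troubleshooting tip requires) is to guard the injected code with a marker. This is a minimal sketch under assumptions: the marker text and append-at-end strategy are hypothetical and not necessarily what `setup.sh` does.

```python
# Sketch of idempotent source patching guarded by a marker comment.
from pathlib import Path

MARKER = "# --- MLX-VLM patch (MonkeyChat_MLX) ---"


def apply_patch(target: str, snippet: str, marker: str = MARKER) -> bool:
    """Append `snippet` under `marker`; return True if the file was changed."""
    path = Path(target)
    source = path.read_text(encoding="utf-8")
    if marker in source:
        return False  # already patched: re-running setup is harmless
    if not source.endswith("\n"):
        source += "\n"
    path.write_text(f"{source}\n{marker}\n{snippet.rstrip()}\n", encoding="utf-8")
    return True
```

The marker check makes repeated runs a no-op; the trade-off is that it cannot detect when upstream MonkeyOCR changes the patched file in an incompatible way.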

### MLX-VLM Backend (`MonkeyChat_MLX`)
- Direct MLX framework integration
- Optimized for Apple's Metal Performance Shaders
- Native unified memory management
- Specialized prompt processing for OCR tasks
- Fixed prompt formatting for optimal output quality

### Intelligent Fallback System
- **Hardware Detection**: MPS → MLX, CUDA → LMDeploy, CPU → Transformers
- **Graceful Degradation**: Falls back to compatible backends if preferred unavailable
- **Cross-Platform**: Maintains compatibility across all systems
- **Error Recovery**: Automatic fallback on initialization failures
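
Automatic fallback on initialization failure can be sketched as trying backends in preference order. The factory callables here are hypothetical stand-ins for classes such as `MonkeyChat_MLX`; this is an illustration, not the project's actual recovery code.

```python
# Sketch of error recovery: first backend that constructs successfully wins.
from typing import Any, Callable, Sequence, Tuple


def init_with_fallback(factories: Sequence[Tuple[str, Callable[[], Any]]]) -> Tuple[str, Any]:
    """Return (name, instance) for the first factory that succeeds."""
    errors = []
    for name, factory in factories:
        try:
            return name, factory()
        except Exception as exc:  # e.g. ImportError, missing weights, out of memory
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("no backend could be initialised:\n" + "\n".join(errors))
```

Collecting every error before raising means a total failure still reports why each backend was skipped.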

## 🤝 Contributing

We welcome contributions! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit changes (`git commit -m 'Add amazing feature'`)
4. Push to branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- **Apple MLX Team**: For the incredible MLX framework
- **MonkeyOCR Team**: For the foundational OCR model  
- **Qwen Team**: For the excellent Qwen2.5-VL model
- **Gradio Team**: For the beautiful web interface
- **MLX-VLM Contributors**: For the MLX vision-language integration

## 📞 Support

- 🐛 **Bug Reports**: [Open a discussion](https://huggingface.co/Jimmi42/MonkeyOCR-Apple-Silicon/discussions)
- 💬 **Discussions**: [Hugging Face Discussions](https://huggingface.co/Jimmi42/MonkeyOCR-Apple-Silicon/discussions)
- 📖 **Documentation**: Check the troubleshooting section above
- ⭐ **Star the repository** if you find it useful!

---

**🚀 Supercharged for Apple Silicon • Made with ❤️ for the MLX Community**

*Experience the future of OCR with native Apple Silicon optimization*