File size: 10,175 Bytes
18352e1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
---
license: mit
tags:
- OCR
- Apple Silicon
- MLX
- MLX-VLM
- Vision Language Model
- Document Processing
- Gradio
- Apple M1
- Apple M2
- Apple M3
- Apple M4
- MonkeyOCR
- Qwen2.5-VL
library_name: transformers
---

# πŸš€ MonkeyOCR-MLX: Apple Silicon Optimized OCR

A high-performance OCR application optimized for Apple Silicon with **MLX-VLM acceleration**, featuring advanced document layout analysis and intelligent text extraction.

## πŸ”₯ Key Features

- **⚑ MLX-VLM Optimization**: Native Apple Silicon acceleration using MLX framework
- **πŸš€ 3x Faster Processing**: Compared to standard PyTorch on M-series chips  
- **🧠 Advanced AI**: Powered by Qwen2.5-VL model with specialized layout analysis
- **πŸ“„ Multi-format Support**: PDF, PNG, JPG, JPEG with intelligent structure detection
- **🌐 Modern Web Interface**: Beautiful Gradio interface for easy document processing
- **πŸ”„ Batch Processing**: Efficient handling of multiple documents
- **🎯 High Accuracy**: Specialized for complex financial documents and tables
- **πŸ”’ 100% Private**: All processing happens locally on your Mac

## πŸ“Š Performance Benchmarks

**Test: Complex Financial Document (Tax Form)**
- **MLX-VLM**: ~15-18 seconds ⚑
- **Standard PyTorch**: ~25-30 seconds
- **CPU Only**: ~60-90 seconds

**MacBook M4 Pro Performance**:
- Model loading: ~1.7s
- Text extraction: ~15s  
- Table structure: ~18s
- Memory usage: ~13GB peak

## πŸ›  Installation

### Prerequisites

- **macOS** with Apple Silicon (M1/M2/M3/M4)
- **Python 3.11+**
- **16GB+ RAM** (32GB+ recommended for large documents)

### Quick Setup

1. **Clone the repository**:
   ```bash
   git clone https://huggingface.co/Jimmi42/MonkeyOCR-Apple-Silicon
   cd MonkeyOCR-Apple-Silicon
   ```

2. **Run the automated setup script**:
   ```bash
   chmod +x setup.sh
   ./setup.sh
   ```
   
   This script will automatically:
   - Download MonkeyOCR from the official GitHub repository
   - **Apply MLX-VLM optimization patches** for Apple Silicon
   - **Enable smart backend auto-selection** (MLX/LMDeploy/transformers)
   - Install UV package manager if needed
   - Set up virtual environment with Python 3.11
   - Install all dependencies including MLX-VLM
   - Download required model weights
   - Configure optimal backend for your hardware

3. **Alternative manual installation**:
   ```bash
   # Install UV if not already installed
   curl -LsSf https://astral.sh/uv/install.sh | sh
   
   # Download MonkeyOCR
   git clone https://github.com/Yuliang-Liu/MonkeyOCR.git MonkeyOCR
   
   # Install dependencies (includes mlx-vlm)
   uv sync
   
   # Download models
   cd MonkeyOCR && python tools/download_model.py && cd ..
   ```

## πŸƒβ€β™‚οΈ Usage

### Web Interface (Recommended)

```bash
# Activate virtual environment
source .venv/bin/activate  # or `uv shell`

# Start the web app
python app.py
```

Access the interface at `http://localhost:7861`

### Command Line

```bash
python main.py path/to/document.pdf
```

## βš™οΈ Configuration

### Smart Backend Selection (Default)

The app automatically detects your hardware and selects the optimal backend:

```yaml
# model_configs_mps.yaml
device: mps
chat_config:
  backend: auto  # Smart auto-selection
  batch_size: 1
  max_new_tokens: 256
  temperature: 0.0
```

**Auto-Selection Logic:**
- 🍎 **Apple Silicon (MPS)** β†’ MLX-VLM (3x faster)
- πŸ–₯️ **CUDA GPU** β†’ LMDeploy (optimized for NVIDIA)  
- πŸ’» **CPU/Fallback** β†’ Transformers (universal compatibility)

### Performance Backends

| Backend | Speed | Memory | Best For | Auto-Selected |
|---------|-------|--------|----------|---------------|
| `auto` | ⚑ | 🧠 | **All systems** (Recommended) | βœ… Default |
| `mlx` | πŸš€πŸš€πŸš€ | 🟒 | Apple Silicon | 🍎 Auto for MPS |
| `lmdeploy` | πŸš€πŸš€ | 🟑 | CUDA systems | πŸ–₯️ Auto for CUDA |
| `transformers` | πŸš€ | 🟒 | Universal fallback | πŸ’» Auto for CPU |

## 🧠 Model Architecture

### Core Components
- **Layout Detection**: DocLayout-YOLO for document structure analysis
- **Vision-Language Model**: Qwen2.5-VL with MLX optimization
- **Layout Reading**: LayoutReader for reading order optimization
- **MLX Framework**: Native Apple Silicon acceleration

### Apple Silicon Optimizations
- **Metal Performance Shaders**: Direct GPU acceleration
- **Unified Memory**: Optimized memory access patterns
- **Neural Engine**: Utilizes Apple's dedicated AI hardware
- **Float16 Precision**: Optimal speed/accuracy balance

## 🎯 Perfect For

### Document Types:
- πŸ“Š **Financial Documents**: Tax forms, invoices, statements
- πŸ“‹ **Legal Documents**: Contracts, forms, certificates  
- πŸ“„ **Academic Papers**: Research papers, articles
- 🏒 **Business Documents**: Reports, presentations, spreadsheets

### Advanced Features:
- βœ… Complex table extraction with highlighted cells
- βœ… Multi-column layouts and mixed content
- βœ… Mathematical formulas and equations
- βœ… Structured data output (Markdown, JSON)
- βœ… Batch processing for multiple files

## 🚨 Troubleshooting

### MLX-VLM Issues

```bash
# Test MLX-VLM availability
python -c "import mlx_vlm; print('βœ… MLX-VLM available')"

# Check if auto backend selection is working
python -c "
from MonkeyOCR.magic_pdf.model.custom_model import MonkeyOCR
model = MonkeyOCR('model_configs_mps.yaml')
print(f'Selected backend: {type(model.chat_model).__name__}')
"
```

### Performance Issues

```bash
# Check MPS availability
python -c "import torch; print(f'MPS available: {torch.backends.mps.is_available()}')"

# Monitor memory usage during processing
top -pid $(pgrep -f "python app.py")
```

### Common Solutions

1. **Patches Not Applied**: 
   - Re-run `./setup.sh` to reapply patches
   - Check that `MonkeyOCR` directory exists and has our modifications
   - Verify `MonkeyChat_MLX` class exists in `MonkeyOCR/magic_pdf/model/custom_model.py`

2. **Wrong Backend Selected**: 
   - Check hardware detection with `python -c "import torch; print(torch.backends.mps.is_available())"`
   - Verify MLX-VLM is installed: `pip install mlx-vlm`
   - Use `backend: mlx` in config to force MLX backend

3. **Slow Performance**: 
   - Ensure auto-selection chose MLX backend on Apple Silicon
   - Check Activity Monitor for MPS GPU usage
   - Verify `backend: auto` in model_configs_mps.yaml

4. **Memory Issues**: 
   - Reduce image resolution before processing
   - Close other memory-intensive applications
   - Reduce batch_size to 1 in config

5. **Port Already in Use**:
   ```bash
   GRADIO_SERVER_PORT=7862 python app.py
   ```

## πŸ“ Project Structure

```
MonkeyOCR-MLX/
β”œβ”€β”€ 🌐 app.py                    # Gradio web interface
β”œβ”€β”€ πŸ–₯️ main.py                   # CLI interface  
β”œβ”€β”€ βš™οΈ model_configs_mps.yaml    # MLX-optimized config
β”œβ”€β”€ πŸ“¦ requirements.txt          # Dependencies (includes mlx-vlm)
β”œβ”€β”€ πŸ› οΈ torch_patch.py           # Compatibility patches
β”œβ”€β”€ 🧠 MonkeyOCR/               # Core AI models
β”‚   └── 🎯 magic_pdf/           # Processing engine
β”œβ”€β”€ πŸ“„ .gitignore               # Git ignore rules
└── πŸ“š README.md                # This file
```

## πŸ”₯ What's New in MLX Version

- ✨ **Smart Patching System**: Automatically applies MLX-VLM optimizations to official MonkeyOCR
- 🧠 **Intelligent Backend Selection**: Auto-detects hardware and selects optimal backend
- πŸš€ **3x Faster Processing**: MLX-VLM acceleration on Apple Silicon
- πŸ’Ύ **Better Memory Efficiency**: Optimized for unified memory architecture
- 🎯 **Improved Accuracy**: Enhanced table and structure detection
- πŸ”§ **Zero Configuration**: Works out-of-the-box with smart defaults
- πŸ“Š **Performance Monitoring**: Built-in timing and metrics
- πŸ› οΈ **Latest Fix (June 2025)**: Resolved MLX-VLM prompt formatting for optimal OCR output
- πŸ”„ **Always Up-to-Date**: Uses official MonkeyOCR repository with our patches applied

## πŸ”¬ Technical Implementation

### Smart Patching System
- **Dynamic Code Injection**: Automatically adds MLX-VLM class to official MonkeyOCR
- **Backend Selection Logic**: Patches smart hardware detection into initialization
- **Zero Maintenance**: Always uses latest official MonkeyOCR with our optimizations
- **Seamless Integration**: Patches are applied transparently during setup

### MLX-VLM Backend (`MonkeyChat_MLX`)
- Direct MLX framework integration
- Optimized for Apple's Metal Performance Shaders
- Native unified memory management
- Specialized prompt processing for OCR tasks
- Fixed prompt formatting for optimal output quality

### Intelligent Fallback System
- **Hardware Detection**: MPS β†’ MLX, CUDA β†’ LMDeploy, CPU β†’ Transformers
- **Graceful Degradation**: Falls back to compatible backends if preferred unavailable
- **Cross-Platform**: Maintains compatibility across all systems
- **Error Recovery**: Automatic fallback on initialization failures

## 🀝 Contributing

We welcome contributions! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit changes (`git commit -m 'Add amazing feature'`)
4. Push to branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## πŸ“„ License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## πŸ™ Acknowledgments

- **Apple MLX Team**: For the incredible MLX framework
- **MonkeyOCR Team**: For the foundational OCR model  
- **Qwen Team**: For the excellent Qwen2.5-VL model
- **Gradio Team**: For the beautiful web interface
- **MLX-VLM Contributors**: For the MLX vision-language integration

## πŸ“ž Support

- πŸ› **Bug Reports**: [Create an issue](https://huggingface.co/Jimmi42/MonkeyOCR-Apple-Silicon/discussions)
- πŸ’¬ **Discussions**: [Hugging Face Discussions](https://huggingface.co/Jimmi42/MonkeyOCR-Apple-Silicon/discussions)
- πŸ“– **Documentation**: Check the troubleshooting section above
- ⭐ **Star the repository** if you find it useful!

---

**πŸš€ Supercharged for Apple Silicon β€’ Made with ❀️ for the MLX Community**

*Experience the future of OCR with native Apple Silicon optimization*