# End-to-End OCR with CRNN, CTC, and FastAPI

This project provides a complete pipeline for Optical Character Recognition (OCR). It features a **Convolutional Recurrent Neural Network (CRNN)** trained with **Connectionist Temporal Classification (CTC)** loss for text recognition. The trained model is served via a **FastAPI** application, which uses the **CRAFT** model for initial text detection.

The entire workflow, from data preparation and model training to deployment as a web service, is documented and implemented in this repository.

## Table of Contents

- [Workflow Overview](#workflow-overview)
- [Features](#features)
- [Project Structure](#project-structure)
- [Technical Details](#technical-details)
  - [Text Detection: CRAFT](#text-detection-craft)
  - [Text Recognition: CRNN](#text-recognition-crnn)
- [Getting Started](#getting-started)
  - [Prerequisites](#prerequisites)
  - [Installation](#installation)
- [Model Training and Export](#model-training-and-export)
  - [Dataset](#dataset)
  - [Running the Training Notebook](#running-the-training-notebook)
  - [Exporting to ONNX](#exporting-to-onnx)
- [Inference](#inference)
  - [Notebook Inference](#notebook-inference)
- [API Server (FastAPI)](#api-server-fastapi)
  - [Running the Server](#running-the-server)
  - [API Endpoints](#api-endpoints)
    - [HTTP POST Request](#http-post-request)
    - [WebSocket Connection](#websocket-connection)
- [Deployment with Docker](#deployment-with-docker)

## Workflow Overview

The project follows a clear, step-by-step process from model creation to deployment:

1.  **Train the Model**: The `notebook/ocr.ipynb` notebook is used to train the CRNN text recognition model on the MJSynth dataset.
2.  **Export the Model**: The trained PyTorch model is exported to the ONNX format (`model.onnx`) for efficient inference.
3.  **Serve the Model**: The FastAPI application (`main/app.py`) loads the ONNX model and the CRAFT text detection model to provide OCR capabilities through a web API.
4.  **Deploy**: The entire application is containerized using Docker for easy and reproducible deployment.

## Features

-   **Two-Stage OCR**: Uses CRAFT for accurate text detection and a CRNN for robust text recognition.
-   **Deep Learning Model**: An optimized CRNN architecture implemented in PyTorch.
-   **Efficient Inference**: Model is exported to ONNX for fast performance.
-   **Web API**: A FastAPI server with both REST and WebSocket endpoints.
-   **Reproducible Environment**: Comes with a `Dockerfile` for easy setup and deployment.
-   **Complete Workflow**: Includes all steps from training to deployment.

## Project Structure

```
.
β”œβ”€β”€ .gitignore           # Files to be ignored by Git
β”œβ”€β”€ Dockerfile           # Docker configuration for the API
β”œβ”€β”€ main
β”‚   β”œβ”€β”€ app.py           # Main FastAPI application
β”‚   └── core             # Modularized application logic
β”‚       β”œβ”€β”€ __init__.py
β”‚       β”œβ”€β”€ config.py    # Configuration variables
β”‚       β”œβ”€β”€ models.py    # Model loading (CRAFT)
β”‚       β”œβ”€β”€ ocr.py       # Core OCR pipeline
β”‚       └── utils.py     # Utility functions
β”œβ”€β”€ models
β”‚   β”œβ”€β”€ final.pth        # Final trained PyTorch model weights
β”‚   └── model.onnx       # Trained model in ONNX format
β”œβ”€β”€ notebook
β”‚   └── ocr.ipynb        # Jupyter Notebook for training and export
└── requirements.txt     # Python dependencies
```

## Technical Details

### Text Detection: CRAFT

The FastAPI application first uses the **CRAFT (Character Region Awareness for Text detection)** model to locate text regions in the input image, producing bounding boxes around words or lines of text. This project uses the `hezarai/CRAFT` implementation.

### Text Recognition: CRNN

For each bounding box detected by CRAFT, a **Convolutional Recurrent Neural Network (CRNN)** is used to recognize the text within that region.

-   **Convolutional Layers (CNN)**: Serve as a powerful feature extractor, processing the image patch and outputting a sequence of feature vectors.
-   **Recurrent Layers (RNN)**: A bidirectional LSTM network processes the feature sequence, capturing contextual dependencies between characters.
-   **CTC Loss**: The model is trained with Connectionist Temporal Classification (CTC) loss, which removes the need for character-level alignment between the input image and the output text, making it well suited to OCR.
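The architecture described above can be sketched as a small PyTorch module. This is an illustration, not the notebook's exact model: the layer widths, LSTM size, and 32-pixel input height are assumptions.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Minimal CRNN: CNN feature extractor -> bidirectional LSTM -> per-timestep logits."""

    def __init__(self, num_classes, img_height=32):
        super().__init__()
        # Two conv+pool stages halve the spatial dims twice; the remaining
        # width axis becomes the sequence dimension fed to the RNN.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        feat_dim = 128 * (img_height // 4)
        self.rnn = nn.LSTM(feat_dim, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        # One extra output class for the CTC blank symbol.
        self.fc = nn.Linear(512, num_classes + 1)

    def forward(self, x):                # x: (batch, 1, H, W)
        f = self.cnn(x)                  # (batch, C, H/4, W/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # (batch, seq, feat)
        out, _ = self.rnn(f)
        return self.fc(out)              # (batch, seq, num_classes + 1)
```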

## Getting Started

### Prerequisites

-   Python 3.9 or higher
-   An NVIDIA GPU with CUDA (highly recommended for training)
-   Docker (for containerized deployment)

### Installation

1.  **Clone the repository:**
    ```bash
    git clone <repository-url>
    cd crnn-ctc-ocr
    ```

2.  **Set up a virtual environment (recommended):**
    ```bash
    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
    ```

3.  **Install dependencies:**
    ```bash
    pip install -r requirements.txt
    ```

## Model Training and Export

The `notebook/ocr.ipynb` notebook contains the complete code for training and exporting the model.

### Dataset

The model is trained on **MJSynth**, a large-scale synthetic dataset for text recognition. The notebook downloads it automatically via the Hugging Face `datasets` library (`priyank-m/MJSynth_text_recognition`).
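MJSynth crops vary in width, so a typical preprocessing step resizes each image to a fixed height while preserving aspect ratio, then normalizes pixel values. The sketch below is one common recipe, not necessarily the notebook's exact one; the 32-pixel target height is an assumption.

```python
import numpy as np
from PIL import Image

TARGET_H = 32  # assumed model input height

def preprocess(img: Image.Image, target_h: int = TARGET_H) -> np.ndarray:
    """Grayscale, resize to a fixed height keeping aspect ratio, scale to [0, 1]."""
    img = img.convert("L")
    w, h = img.size
    new_w = max(1, round(w * target_h / h))
    img = img.resize((new_w, target_h), Image.BILINEAR)
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return arr[None, :, :]  # add channel dim: (1, H, W)
```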

### Running the Training Notebook

1.  **Launch Jupyter:**
    ```bash
    jupyter notebook
    ```
2.  Open `notebook/ocr.ipynb`.
3.  You can run all cells to execute the full pipeline. The `DEMO` flag is set to `True` by default to train on a smaller subset for a quick run. Set it to `False` for full training.
4.  The notebook will:
    -   Load and preprocess the dataset.
    -   Define the CRNN model, loss function, and optimizer.
    -   Run the training loop, showing progress and validation metrics (Character Error Rate, Word Error Rate).
    -   Save the best model to `checkpoints/best.pth` and the final model to `checkpoints/final.pth`.
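At the heart of the training loop is PyTorch's `nn.CTCLoss`, which expects log-probabilities shaped `(seq, batch, classes)` plus per-sample input and target lengths, with targets passed as one concatenated 1-D tensor of label indices. A stripped-down version of a single training step might look like this (the `blank=0` convention and tensor layout are assumptions about the notebook's setup):

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)  # index 0 reserved for the blank

def train_step(model, optimizer, images, targets, target_lengths):
    """One CTC training step. `targets` is a 1-D tensor of concatenated label indices."""
    logits = model(images)                               # (batch, seq, classes)
    log_probs = logits.log_softmax(2).permute(1, 0, 2)   # CTCLoss wants (seq, batch, classes)
    seq_len = log_probs.size(0)
    input_lengths = torch.full((images.size(0),), seq_len, dtype=torch.long)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```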

### Exporting to ONNX

After training, the last cells of the notebook handle the export to ONNX.

-   It loads the trained CRNN model.
-   It exports the model to `export/model.onnx`. **This file is what the FastAPI application serves.** The repository already includes a pre-trained copy at `models/model.onnx`.

## Inference

### Notebook Inference

The notebook includes helper functions to test the model directly.

**1. Using the PyTorch model (`.pth`):**
```python
# predict with the best saved PyTorch model
predict('path/to/your/image.png', model_path='checkpoints/best.pth')
```

**2. Using the ONNX model:**
```python
# predict with the exported ONNX model
predict_onnx('path/to/your/image.png', onnx_path='export/model.onnx')
```
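Both helpers ultimately turn per-timestep model outputs into text via CTC decoding. The simplest variant, greedy decoding, takes the argmax at each timestep, collapses consecutive repeats, and drops blanks. A self-contained sketch (the blank index 0 and a blank-free charset are assumptions):

```python
BLANK = 0  # assumed index reserved for the CTC blank token

def ctc_greedy_decode(indices, charset):
    """Collapse repeated indices, drop blanks, and map the rest to characters."""
    text = []
    prev = None
    for idx in indices:
        if idx != prev and idx != BLANK:
            text.append(charset[idx - 1])  # charset holds real characters only
        prev = idx
    return "".join(text)
```

For example, the per-timestep argmaxes `[8, 8, 0, 5, 12, 12, 0, 12, 15]` over a lowercase charset collapse to `"hello"`: the repeated `8` and `12` merge, the blanks separate the two `l`s so they survive as distinct characters.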

## API Server (FastAPI)

The application provides a web server to perform OCR on uploaded images.

### Running the Server

To run the API server locally:
```bash
python main/app.py
```
The server will start on `http://localhost:8000`.

### API Endpoints

#### HTTP POST Request

-   **Endpoint**: `POST /predict/image`
-   **Description**: Upload an image and receive the OCR results in JSON format.
-   **Example using `curl`**:
    ```bash
    curl -X POST -F "file=@/path/to/your/image.jpg" http://localhost:8000/predict/image
    ```
-   **Response**: A JSON object containing the detected paragraph, lines, words, and a base64-encoded image with bounding boxes drawn on it.

#### WebSocket Connection

-   **Endpoint**: `ws://localhost:8000/ws/predict`
-   **Description**: A WebSocket endpoint for real-time OCR. Send an image as bytes, and the server will return the OCR result as a JSON message. This is useful for streaming or interactive applications.
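A minimal client for this endpoint might look like the following. The third-party `websockets` package is an assumed choice; per the endpoint description, the client sends raw image bytes and receives a JSON message back.

```python
import asyncio
import json

async def ocr_image(path, url="ws://localhost:8000/ws/predict"):
    """Send one image over the WebSocket and return the parsed OCR result."""
    import websockets  # third-party: pip install websockets
    async with websockets.connect(url) as ws:
        with open(path, "rb") as f:
            await ws.send(f.read())       # image goes out as raw bytes
        return json.loads(await ws.recv())  # server replies with JSON

# With the server running: result = asyncio.run(ocr_image("image.jpg"))
```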

## Deployment with Docker

The project includes a `Dockerfile` to easily containerize and deploy the application.

1.  **Prerequisite**: Make sure the ONNX model exists at the path the application expects. The code in `main/app.py` looks for `export/model.onnx`, so either place your exported model there or update the path in `main/app.py` (the repository also ships a pre-trained model at `models/model.onnx`). The file must be present at image build time.

2.  **Build the Docker image:**
    ```bash
    docker build -t ocr-api .
    ```

3.  **Run the Docker container:**
    ```bash
    docker run -p 8000:8000 ocr-api
    ```
    The application will be running and accessible at `http://localhost:8000`.