# OneOCR — Reverse-Engineered Cross-Platform OCR Pipeline

Reimplementation of Microsoft's OneOCR engine, which ships with the Windows Snipping Tool.  
`.onemodel` encryption cracked, 34 ONNX models extracted, all custom ops replaced; the ONNX pipeline runs on any OS with `onnxruntime`.

---

## Project Status

| Component | Status | Details |
|---|---|---|
| **.onemodel decryption** | ✅ Done | AES-256-CFB128, static key + IV |
| **Model extraction** | ✅ Done | 34 ONNX models, 33 config files |
| **Custom op unlocking** | ✅ Done | `OneOCRFeatureExtract` → `Gemm`/`Conv1x1` |
| **ONNX pipeline** | ⚠️ Partial | **53% match rate** vs DLL (10/19 test images) |
| **DLL pipeline (Windows)** | ✅ Done | ctypes wrapper, 100% accuracy |
| **DLL pipeline (Linux)** | ✅ Done | Wine bridge, 100% accuracy, Docker ready |

### Known ONNX Engine Limitations

The Python reimplementation achieves a **53% match rate** (10 of 19 test images) against the original DLL. The remaining issues are broken down below.

#### Issue 1: False FPN2 Detections (4 images)
**Images:** ocr_test 6, 13, 17, 18  
**Symptom:** Panel edges / dialog borders detected as text  
**Cause:** FPN2 (stride=4) sees edges as text-like textures  
**DLL solution:** `SeglinkProposals` — advanced C++ post-processing with multi-stage NMS:
- `textline_hardnms_iou_threshold = 0.32`
- `textline_groupnms_span_ratio_threshold = 0.3`
- `ambiguous_nms_threshold = 0.3` / `ambiguous_overlap_threshold = 0.5`
- `K_of_detections` — per-scale detection limit

#### Issue 2: Missing Small Characters "..." (2 images)
**Images:** ocr_test 7, 14  
**Symptom:** Three dots too small to detect  
**Cause:** The dots fall below the `min_component_pixels` and `min_area` minimum thresholds  
**DLL solution:** `SeglinkGroup` — groups neighboring segments into a single line

#### Issue 3: Character Recognition Errors (2 images)
**Images:** ocr_test 1, 15  
**Symptom:** "iob" instead of "job", extra text from margins  
**Cause:** Differences in text cropping/preprocessing  
**DLL solution:** `BaseNormalizer` — sophisticated text line normalization

#### Issue 4: Large Images (test.png — 31.8% match)
**Symptom:** 55 of 74 lines detected, some cut off at edges  
**Cause:** Adaptive Scaling — DLL scales at multiple levels  
**DLL solution:** `AdaptiveScaling` with `AS_LARGE_TEXT_THRESHOLD`

---

## Architecture

```
Image (PIL / numpy)
    │
    ▼
┌──────────────────────────────────┐
│  Detector (model_00)             │  PixelLink FPN (fpn2/3/4)
│  BGR, mean subtraction           │  stride = 4 / 8 / 16
│  → pixel_scores, link_scores    │  8-neighbor, Union-Find
│  → bounding quads (lines)       │  minAreaRect + NMS (IoU 0.2)
└──────────────────────────────────┘
    │
    ▼ for each detected line
┌──────────────────────────────────┐
│  Crop + padding (15%)            │  Axis-aligned / perspective
│  ScriptID (model_01)             │  10 scripts: Latin, CJK, Arabic...
│  RGB / 255.0, height=60px       │  HW/PC classification, flip detection
└──────────────────────────────────┘
    │
    ▼ per script
┌──────────────────────────────────┐
│  Recognizer (model_02–10)        │  DynamicQuantizeLSTM + CTC
│  Per-script character maps       │  Greedy decode with per-char confidence
│  → text + word confidences       │  Word splitting on spaces
└──────────────────────────────────┘
    │
    ▼
┌──────────────────────────────────┐
│  Line grouping & sorting         │  Y-overlap clustering
│  Per-word bounding boxes         │  Proportional quad interpolation
│  Text angle estimation           │  Median of top-edge angles
└──────────────────────────────────┘
```
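
The last stage of the diagram (Y-overlap clustering and median top-edge angle) can be sketched in plain Python. This is a minimal illustration, not the code in `engine_onnx.py`: the function names, the `min_overlap` value, and the assumption that each quad starts with its top-left and top-right corners are all hypothetical.

```python
import math
from statistics import median

def group_into_lines(boxes, min_overlap=0.5):
    """Cluster axis-aligned boxes (x0, y0, x1, y1) into rows by vertical overlap.

    min_overlap is an assumed tuning value, not taken from the project.
    """
    rows = []  # each row: {"y0": top, "y1": bottom, "boxes": [...]}
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        _, y0, _, y1 = box
        for row in rows:
            overlap = min(y1, row["y1"]) - max(y0, row["y0"])
            shorter = min(y1 - y0, row["y1"] - row["y0"])
            if shorter > 0 and overlap / shorter >= min_overlap:
                row["boxes"].append(box)
                row["y0"] = min(row["y0"], y0)
                row["y1"] = max(row["y1"], y1)
                break
        else:
            rows.append({"y0": y0, "y1": y1, "boxes": [box]})
    # Reading order: rows top to bottom, boxes left to right within a row.
    rows.sort(key=lambda r: r["y0"])
    return [sorted(r["boxes"], key=lambda b: b[0]) for r in rows]

def estimate_text_angle(quads):
    """Median angle in degrees of each quad's top edge, (x1, y1) -> (x2, y2)."""
    angles = [math.degrees(math.atan2(y2 - y1, x2 - x1))
              for (x1, y1, x2, y2, *_rest) in quads]
    return median(angles) if angles else 0.0
```

Using the median rather than the mean keeps a single badly fitted quad from skewing the reported page angle.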

### Model Registry (34 models)

| Index | Role | Script | Custom Op | Status |
|-------|------|--------|-----------|--------|
| 0 | Detector | Universal | `QLinearSigmoid` | ✅ Works |
| 1 | ScriptID | Universal | — | ✅ Works |
| 2–10 | Recognizers | Latin/CJK/Arabic/Cyrillic/Devanagari/Greek/Hebrew/Tamil/Thai | `DynamicQuantizeLSTM` | ✅ Work |
| 11–21 | LangSm (confidence) | Per-script | `OneOCRFeatureExtract` → **Gemm** | ✅ Unlocked |
| 22–32 | LangMd (rejection) | Per-script | `OneOCRFeatureExtract` → **Gemm** | ✅ Unlocked |
| 33 | LineLayout | Universal | `OneOCRFeatureExtract` → **Conv1x1** | ✅ Unlocked |

---

## Quick Start

### Requirements

```bash
pip install onnxruntime numpy opencv-python-headless Pillow pycryptodome onnx
```

Or with `uv`:
```bash
uv sync --extra extract
```

### Model Extraction (one-time)

```bash
# Full pipeline: decrypt → extract → unlock → verify
python tools/extract_pipeline.py ocr_data/oneocr.onemodel

# Verify existing models only
python tools/extract_pipeline.py --verify-only
```

### Usage

```python
# Recommended: Unified engine (auto-selects best backend)
from ocr.engine_unified import OcrEngineUnified
from PIL import Image

engine = OcrEngineUnified()  # auto: DLL → Wine → ONNX
result = engine.recognize_pil(Image.open("screenshot.png"))

print(f"Backend: {engine.backend_name}")  # "dll" / "wine" / "onnx"
print(result.text)                         # "Hello World"
print(result.average_confidence)           # 0.975

for line in result.lines:
    for word in line.words:
        print(f"  '{word.text}' conf={word.confidence:.0%} "
              f"bbox=({word.bounding_rect.x1:.0f},{word.bounding_rect.y1:.0f})")
```

```bash
# CLI:
python main.py screenshot.png                # auto backend
python main.py screenshot.png --backend dll  # force DLL (Windows)
python main.py screenshot.png --backend wine # force Wine (Linux)
python main.py screenshot.png --backend onnx # force ONNX (any OS)
python main.py screenshot.png -o result.json # save JSON output
```

### ONNX Engine (alternative — cross-platform, no Wine needed)

```python
from ocr.engine_onnx import OcrEngineOnnx
from PIL import Image

engine = OcrEngineOnnx()
result = engine.recognize_pil(Image.open("screenshot.png"))
print(result.text)
```

### API Reference

```python
engine = OcrEngineOnnx(
    models_dir="path/to/onnx_models",       # optional
    config_dir="path/to/config_data",        # optional
    providers=["CUDAExecutionProvider"],      # optional (default: CPU)
)

# Input formats:
result = engine.recognize_pil(pil_image)       # PIL Image
result = engine.recognize_numpy(rgb_array)     # numpy (H,W,3) RGB
result = engine.recognize_bytes(png_bytes)     # raw bytes (PNG/JPEG)

# Result:
result.text                # str — full recognized text
result.text_angle          # float — detected rotation angle
result.lines               # list[OcrLine]
result.average_confidence  # float — overall confidence 0-1
result.error               # str | None — error message

# Per-word:
word.text                  # str
word.confidence            # float — CTC confidence per word
word.bounding_rect         # BoundingRect (x1,y1...x4,y4 quadrilateral)
```

---

## Running on Linux (Wine Bridge — 100% accuracy)

The DLL has few dependencies (only `KERNEL32`, `bcrypt`, and `dbghelp`, plus the shipped `onnxruntime.dll`), which makes it compatible with Wine.

### Option A: Docker (recommended)

```bash
# Build
docker build -t oneocr .

# Run OCR on an image
docker run --rm -v "$(pwd)/working_space:/data" oneocr \
    python main.py /data/input/test.png --output /data/output/result.json

# Interactive shell
docker run --rm -it -v "$(pwd)/working_space:/data" oneocr bash
```

### Option B: Native Wine

```bash
# 1. Install Wine + MinGW cross-compiler
# Ubuntu/Debian:
sudo apt install wine64 mingw-w64

# Fedora:
sudo dnf install wine mingw64-gcc

# Arch:
sudo pacman -S wine mingw-w64-gcc

# 2. Initialize 64-bit Wine prefix
WINEARCH=win64 wineboot --init

# 3. Compile the Wine loader (one-time)
x86_64-w64-mingw32-gcc -O2 -o tools/oneocr_loader.exe tools/oneocr_loader.c

# 4. Test
python main.py screenshot.png --backend wine
```

### Wine Bridge Architecture

```
Linux Python ──► subprocess (wine64) ──► oneocr_loader.exe ──► oneocr.dll
    ▲                                           │
    │                                           ▼
    └──── JSON stdout ◄──── OCR results ◄──── onnxruntime.dll
```

**DLL Dependencies (all implemented in Wine ≥ 8.0):**

| DLL | Functions | Wine Status | Notes |
|-----|-----------|-------------|-------|
| `KERNEL32.dll` | 183 | ✅ Full | Standard WinAPI |
| `bcrypt.dll` | 12 | ✅ Full | AES-256-CFB128 for model decryption |
| `dbghelp.dll` | 5 | ✅ Stubs | Debug symbols — non-critical |
| `onnxruntime.dll` | 1 | N/A | Shipped with package |

---

## Project Structure

```
ONEOCR/
├── main.py                          # CLI entry point (auto-selects backend)
├── Dockerfile                       # Docker setup for Linux (Wine + DLL)
├── pyproject.toml                   # Project config & dependencies
├── README.md                        # This documentation
├── .gitignore
│
├── ocr/                             # Core OCR package
│   ├── __init__.py                  # Exports all engines & models
│   ├── engine.py                    # DLL wrapper (Windows only, 374 lines)
│   ├── engine_onnx.py               # ONNX engine (cross-platform, ~1100 lines)
│   ├── engine_unified.py            # Unified wrapper (DLL → Wine → ONNX)
│   └── models.py                    # Data models: OcrResult, OcrLine, OcrWord
│
├── tools/                           # Utilities
│   ├── extract_pipeline.py          # Extraction pipeline (decrypt→extract→unlock→verify)
│   ├── visualize_ocr.py             # OCR result visualization with bounding boxes
│   ├── test_quick.py                # Quick OCR test on images
│   ├── wine_bridge.py               # Wine bridge for Linux (C loader + Python API)
│   └── oneocr_loader.c              # C source for Wine loader (auto-generated)
│
├── ocr_data/                        # Runtime data (DO NOT commit)
│   ├── oneocr.dll                   # Original DLL (Windows only)
│   ├── oneocr.onemodel              # Encrypted model container
│   └── onnxruntime.dll              # ONNX Runtime DLL
│
├── oneocr_extracted/                # Extracted models (auto-generated)
│   ├── onnx_models/                 # 34 raw ONNX (models 11-33 use OneOCRFeatureExtract)
│   ├── onnx_models_unlocked/        # 23 unlocked (models 11-33, standard ONNX ops)
│   └── config_data/                 # Character maps, rnn_info, manifest, configs
│
├── working_space/                   # Test images
│   └── input/                       # 19 test images
│
└── _archive/                        # Archive — RE scripts, analyses, prototypes
    ├── temp/re_output/              # DLL reverse engineering results
    ├── attempts/                    # Decryption attempts
    ├── analysis/                    # Cryptographic analyses
    └── hooks/                       # Frida hooks
```

---

## Technical Details

### .onemodel Encryption

| Element | Value |
|---------|-------|
| Algorithm | AES-256-CFB128 |
| Master Key | `kj)TGtrK>f]b[Piow.gU+nC@s""""""4` (32B) |
| IV | `Copyright @ OneO` (16B) |
| DX key | `SHA256(master_key + file[8:24])` |
| Config key | `SHA256(DX[48:64] + DX[32:48])` |
| Chunk key | `SHA256(chunk_header[16:32] + chunk_header[0:16])` |

### OneOCRFeatureExtract — Cracked Custom Op

The proprietary op (domain `com.microsoft.oneocr`) stores its weights as a **big-endian float32** blob in a STRING tensor.

**Models 11–32** (21→50 features):
```
config_blob (4492B, big-endian float32):
  W[21×50] = 1050 floats     (weight matrix)
  b[50]    = 50 floats       (bias)
  metadata = 23 floats       (dimensions [21, 50, 2], flags, calibration)

  Replacement: Gemm(input, W^T, b)
```
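
A minimal sketch of reading such a blob with the standard library. It assumes the blob stores `W` row-major, then `b`, then the metadata, in the order listed above; that ordering and both function names are assumptions. Whether the replacement `Gemm` node needs the transposed matrix depends on how the node is configured, which this sketch does not model.

```python
import struct

def parse_feature_blob(blob, n_in, n_out):
    """Split a big-endian float32 blob into W (n_in x n_out), b, and trailing metadata.

    Assumes W is stored row-major, followed by b, followed by metadata.
    """
    count = len(blob) // 4
    values = struct.unpack(f">{count}f", blob[:count * 4])
    n_w = n_in * n_out
    if count < n_w + n_out:
        raise ValueError("blob too small for the requested dimensions")
    W = [list(values[r * n_out:(r + 1) * n_out]) for r in range(n_in)]
    b = list(values[n_w:n_w + n_out])
    metadata = list(values[n_w + n_out:])
    return W, b, metadata

def apply_linear(x, W, b):
    """Compute y = x @ W + b for a single feature vector x of length n_in."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]
```

For a 4492-byte blob, `parse_feature_blob(blob, 21, 50)` yields 1050 weights, 50 biases, and 23 metadata floats, matching the layout above.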

**Model 33** (256→16 channels):
```
config_blob (16548B, big-endian float32):
  W[256×16] = 4096 floats    (convolution weights)
  b[16]     = 16 floats      (bias)
  metadata  = 25 floats      (dimensions [256, 16], flags)

  Replacement: Conv(input, W[in,out].T → [16,256,1,1], b, kernel=1x1)
```
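
The 1x1 convolution is the same linear map applied independently at every pixel. The sketch below illustrates this on nested lists, with `W` indexed `[in][out]` as in the layout above; the function name and the input layout `[c_in][h][w]` are assumptions made for illustration.

```python
def conv1x1(feature_map, W, b):
    """Apply a 1x1 convolution to feature_map[c_in][h][w].

    W is indexed [in][out]; an ONNX Conv node would hold the same values
    transposed and reshaped to [out, in, 1, 1].
    """
    c_in = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    c_out = len(b)
    # Start every output channel at its bias value.
    out = [[[b[o]] * width for _ in range(height)] for o in range(c_out)]
    for o in range(c_out):
        for y in range(height):
            for x in range(width):
                out[o][y][x] += sum(feature_map[i][y][x] * W[i][o]
                                    for i in range(c_in))
    return out
```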

### Detector Configuration (from DLL protobuf manifest)

```
segment_conf_threshold:               0.7   (field 8)
textline_conf_threshold per-FPN:      P2=0.7, P3=0.8, P4=0.8  (field 9)
textline_nms_threshold:               0.2   (field 10)
textline_overlap_threshold:           0.4   (field 11)
text_confidence_threshold:            0.8   (field 13)
ambiguous_nms_threshold:              0.3   (field 15)
ambiguous_overlap_threshold:          0.5   (field 16)
ambiguous_save_threshold:             0.4   (field 17)
textline_hardnms_iou_threshold:       0.32  (field 20)
textline_groupnms_span_ratio_threshold: 0.3 (field 21)
```
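
As an illustration of how the per-FPN text-line thresholds could be applied, here is a small sketch. It assumes P2/P3/P4 correspond to fpn2/fpn3/fpn4 and that a detection passes when its score is greater than or equal to the threshold; the `(level, score, box)` tuple layout and the function name are hypothetical.

```python
# Per-FPN text-line confidence thresholds, copied from the manifest values above.
TEXTLINE_CONF_THRESHOLD = {"fpn2": 0.7, "fpn3": 0.8, "fpn4": 0.8}

def filter_by_level(detections):
    """Keep detections whose score reaches the threshold of their FPN level.

    Each detection is a (level, score, box) tuple; this layout is hypothetical.
    """
    return [d for d in detections if d[1] >= TEXTLINE_CONF_THRESHOLD[d[0]]]
```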

### PixelLink Detector

- **FPN levels**: fpn2 (stride=4), fpn3 (stride=8), fpn4 (stride=16)
- **Outputs per level**: `scores_hori/vert` (pixel text probability), `link_scores_hori/vert` (8-neighbor connectivity), `bbox_deltas_hori/vert` (corner offsets)
- **Post-processing**: Threshold pixels → Union-Find connected components → bbox regression → NMS
- **Detects text lines, not words**: word splitting comes from the recognizer
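
The thresholding and Union-Find steps can be sketched as follows. This is a simplified illustration on nested lists, not the project's implementation: the neighbour channel order, both threshold defaults, and the function names are assumptions, and bbox regression and NMS are left out.

```python
# Offsets of the 8 neighbours as (dy, dx); the channel order is an assumption.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def find(parent, i):
    """Return the root of i, compressing the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def connected_components(pixel_scores, link_scores, pixel_thr=0.7, link_thr=0.7):
    """Group text pixels into components via positive links.

    pixel_scores[y][x] is a text probability; link_scores[k][y][x] is the
    probability that (y, x) links to its k-th neighbour.
    """
    h, w = len(pixel_scores), len(pixel_scores[0])
    is_text = [[pixel_scores[y][x] >= pixel_thr for x in range(w)] for y in range(h)]
    parent = list(range(h * w))
    for y in range(h):
        for x in range(w):
            if not is_text[y][x]:
                continue
            for k, (dy, dx) in enumerate(NEIGHBORS):
                ny, nx = y + dy, x + dx
                inside = 0 <= ny < h and 0 <= nx < w
                if inside and is_text[ny][nx] and link_scores[k][y][x] >= link_thr:
                    ra, rb = find(parent, y * w + x), find(parent, ny * w + nx)
                    if ra != rb:
                        parent[rb] = ra  # merge the two components
    components = {}
    for y in range(h):
        for x in range(w):
            if is_text[y][x]:
                components.setdefault(find(parent, y * w + x), []).append((y, x))
    return list(components.values())
```

Each returned component is a list of `(y, x)` pixels; a rotated rectangle fitted to those pixels (for example with OpenCV's `minAreaRect`) would give the line quad.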

### CTC Recognition

- Target height: 60px, aspect ratio preserved
- Input: RGB / 255.0, NCHW format
- Output: log-softmax [T, 1, N_chars]
- Decoding: greedy argmax with repeat merging + blank removal
- Per-character confidence via `exp(max_logprob)`
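
A minimal sketch of the greedy decoding step, assuming the batch axis has already been dropped so the input is `log_probs[t][c]`. The blank index (0 here), the `charset` layout, and the choice to take the confidence from the first frame of each run are assumptions for illustration.

```python
import math

def ctc_greedy_decode(log_probs, charset, blank=0):
    """Greedy CTC decode of log_probs[t][c].

    Returns the decoded text and one confidence per emitted character.
    Repeated labels are merged and blanks are dropped.
    """
    text, confidences = [], []
    previous = blank
    for frame in log_probs:
        best = max(range(len(frame)), key=frame.__getitem__)  # argmax
        if best != blank and best != previous:
            text.append(charset[best])
            confidences.append(math.exp(frame[best]))  # exp(max_logprob)
        previous = best
    return "".join(text), confidences
```

A blank between two identical labels is what lets the decoder emit a doubled letter such as the "ll" in "Hello".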

---

## DLL Reverse Engineering — Results & Materials

### DLL Source Structure (from debug symbols)

```
C:\__w\1\s\CoreEngine\Native\
├── TextDetector/
│   ├── AdaptiveScaling           ← multi-level image scaling
│   ├── SeglinkProposal           ← KEY: detection post-processing
│   ├── SeglinkGroup.h            ← segment grouping into lines
│   ├── TextLinePolygon           ← precise text contouring
│   ├── RelationRCNNRpn2          ← relational region proposal network
│   ├── BaseRCNN, DQDETR          ← alternative detectors
│   ├── PolyFitting               ← polynomial fitting
│   └── BarcodePolygon            ← barcode detection
│
├── TextRecognizer/
│   ├── TextLineRecognizerImpl    ← main CTC implementation
│   ├── ArgMaxDecoder             ← CTC decoding
│   ├── ConfidenceProcessor       ← confidence models (models 11-21)
│   ├── RejectionProcessor        ← rejection models (models 22-32)
│   ├── DbLstm                    ← dynamic batch LSTM
│   └── CharacterMap/             ← per-script character maps
│
├── TextAnalyzer/
│   ├── TextAnalyzerImpl          ← text layout analysis
│   └── AuxMltClsClassifier       ← auxiliary classifier
│
├── TextNormalizer/
│   ├── BaseNormalizer            ← text line normalization
│   └── ConcatTextLines           ← line concatenation
│
├── TextPipeline/
│   ├── TextPipelineDevImpl       ← main pipeline
│   └── FilterXY                  ← position-based filtering
│
├── CustomOps/onnxruntime/
│   ├── SeglinkProposalsOp        ← ONNX op (NOT in our models)
│   ├── XYSeglinkProposalsOp      ← XY variant
│   └── FeatureExtractOp          ← = Gemm / Conv1x1
│
├── ModelParser/
│   ├── ModelParser               ← .onemodel parsing
│   └── Crypto                    ← AES-256-CFB128
│
└── Common/
    ├── ImageUtility              ← image conversion
    └── ImageFeature              ← image features
```

### RE Materials

Reverse engineering results in `_archive/temp/re_output/`:
- `03_oneocr_classes.txt` — 186 C++ classes
- `06_config_strings.txt` — 429 config strings
- `15_manifest_decoded.txt` — 1182 lines of decoded protobuf manifest
- `09_constants.txt` — 42 float + 14 double constants (800.0, 0.7, 0.8, 0.92...)
- `10_disassembly.txt` — disassembly of key exports

---

## For Future Developers — Roadmap

### Priority 1: SeglinkProposals (hardest, highest impact)

This is the key C++ post-processing step in the DLL that is NOT part of the ONNX models.  
It is responsible for roughly 80% of the differences between the DLL and our implementation.

**What it does:**
1. Takes raw pixel_scores + link_scores + bbox_deltas from all 3 FPN levels
2. Groups segments into lines (SeglinkGroup) — merges neighboring small components into a single line
3. Multi-stage NMS: textline_nms → hardnms → ambiguous_nms → groupnms
4. Confidence filtering with `text_confidence_threshold = 0.8`
5. `K_of_detections` — detection count limit
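
For black-box experiments, a single hard-NMS stage can be approximated with a generic greedy NMS. The sketch below is not the DLL's algorithm: it works on axis-aligned boxes rather than quads, and `max_detections` only stands in for `K_of_detections`, whose exact semantics are unknown. The default threshold reuses `textline_hardnms_iou_threshold = 0.32`.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def hard_nms(detections, iou_threshold=0.32, max_detections=None):
    """Greedy NMS over (score, box) pairs, highest score first."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep a box only if it does not overlap any already kept box too much.
        if all(iou(box, k[1]) <= iou_threshold for k in kept):
            kept.append((score, box))
            if max_detections is not None and len(kept) >= max_detections:
                break
    return kept
```

Running this with different thresholds against the DLL output on the 19 test images is one way to probe which stage removes the false FPN2 detections.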

**Where to look:**
- `_archive/temp/re_output/06_config_strings.txt` — parameter names
- `_archive/temp/re_output/15_manifest_decoded.txt` — parameter values
- `SeglinkProposal` class in DLL — ~2000 lines of C++

**Approach:**
- Decompile `SeglinkProposal::Process` with IDA Pro / Ghidra
- Alternatively: black-box testing of different NMS configurations

### Priority 2: AdaptiveScaling

The DLL dynamically scales images based on text size.

**Parameters:**
- `AS_LARGE_TEXT_THRESHOLD` — large text threshold
- Multi-scale: DLL can run the detector at multiple scales

### Priority 3: BaseNormalizer

The DLL normalizes text crops before recognition more effectively than our simple resize.

### Priority 4: Confidence/Rejection Models (11-32)

The DLL uses models 11-32 to filter results; the ONNX pipeline currently skips them.
Integrating them could improve precision by removing false detections.

---

## Performance

| Operation | ONNX (CPU) | DLL | Notes |
|---|---|---|---|
| Detection (PixelLink) | ~50-200ms | ~15-50ms | Model inference + post-processing |
| ScriptID | ~5ms | ~3ms | Single forward pass |
| Recognition (CTC) | ~30ms/line | ~10ms/line | Per-script LSTM |
| Full pipeline | ~300-1000ms | ~15-135ms | Depends on line count |

---

## License

For research and educational purposes only.