	# OneOCR — Reverse-Engineered Cross-Platform OCR Pipeline

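	Reimplementation of Microsoft's OneOCR engine from the Windows Snipping Tool.
	The `.onemodel` encryption is cracked, 34 ONNX models are extracted, and all custom ops are replaced, so the pipeline runs on any OS with `onnxruntime`. The pure-ONNX path currently matches the original DLL on 53% of test images; the DLL and Wine paths match 100% (see Project Status).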

	---

	## Project Status

	| Component | Status | Details |
	|---|---|---|
	| .onemodel decryption | ✅ Done | AES-256-CFB128, static key + IV |
	| Model extraction | ✅ Done | 34 ONNX models, 33 config files |
	| Custom op unlocking | ✅ Done | `OneOCRFeatureExtract` → `Gemm`/`Conv1x1` |
	| ONNX pipeline | ⚠️ Partial | 53% match rate vs DLL (10/19 test images) |
	| DLL pipeline (Windows) | ✅ Done | ctypes wrapper, 100% accuracy |
	| DLL pipeline (Linux) | ✅ Done | Wine bridge, 100% accuracy, Docker ready |

	### Known ONNX Engine Limitations

	The Python reimplementation achieves a 53% match rate against the original DLL (10 of 19 test images). Below is a detailed breakdown of the remaining issues.

	#### Issue 1: False FPN2 Detections (4 images)
	Images: ocr_test 6, 13, 17, 18
	Symptom: Panel edges / dialog borders detected as text
	Cause: FPN2 (stride=4) sees edges as text-like textures
	DLL solution: `SeglinkProposals` — advanced C++ post-processing with multi-stage NMS:
	- `textline_hardnms_iou_threshold = 0.32`
	- `textline_groupnms_span_ratio_threshold = 0.3`
	- `ambiguous_nms_threshold = 0.3` / `ambiguous_overlap_threshold = 0.5`
	- `K_of_detections` — per-scale detection limit

	#### Issue 2: Missing Small Characters "..." (2 images)
	Images: ocr_test 7, 14
	Symptom: Three dots too small to detect
	Cause: The `min_component_pixels` and `min_area` thresholds are poorly suited to components this small
	DLL solution: `SeglinkGroup` — groups neighboring segments into a single line

	#### Issue 3: Character Recognition Errors (2 images)
	Images: ocr_test 1, 15
	Symptom: "iob" instead of "job", extra text from margins
	Cause: Differences in text cropping/preprocessing
	DLL solution: `BaseNormalizer` — sophisticated text line normalization

	#### Issue 4: Large Images (test.png — 31.8% match)
	Symptom: 55 of 74 lines detected, some cut off at edges
	Cause: No adaptive scaling in our pipeline; the DLL scales the image at multiple levels
	DLL solution: `AdaptiveScaling` with `AS_LARGE_TEXT_THRESHOLD`

	---

	## Architecture

	```
	Image (PIL / numpy)
	        │
	        ▼
	┌──────────────────────────────────┐
	│ Detector (model_00)              │  PixelLink FPN (fpn2/3/4)
	│ BGR, mean subtraction            │  stride = 4 / 8 / 16
	│ → pixel_scores, link_scores      │  8-neighbor, Union-Find
	│ → bounding quads (lines)         │  minAreaRect + NMS (IoU 0.2)
	└──────────────────────────────────┘
	        │
	        ▼  for each detected line
	┌──────────────────────────────────┐
	│ Crop + padding (15%)             │  Axis-aligned / perspective
	│ ScriptID (model_01)              │  10 scripts: Latin, CJK, Arabic...
	│ RGB / 255.0, height=60px         │  HW/PC classification, flip detection
	└──────────────────────────────────┘
	        │
	        ▼  per script
	┌──────────────────────────────────┐
	│ Recognizer (model_02–10)         │  DynamicQuantizeLSTM + CTC
	│ Per-script character maps        │  Greedy decode with per-char confidence
	│ → text + word confidences        │  Word splitting on spaces
	└──────────────────────────────────┘
	        │
	        ▼
	┌──────────────────────────────────┐
	│ Line grouping & sorting          │  Y-overlap clustering
	│ Per-word bounding boxes          │  Proportional quad interpolation
	│ Text angle estimation            │  Median of top-edge angles
	└──────────────────────────────────┘
	```
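
The last stage of the diagram estimates the text angle as the median of the top-edge angles. A minimal sketch of that idea follows. The function names and the corner order (points 1 and 2 forming the top edge of each quad) are assumptions for illustration, not taken from `engine_onnx.py`.

```python
import math
from statistics import median


def top_edge_angle(quad):
    """Angle in degrees of the edge from point 1 to point 2 of a quad.

    Assumes quad = (x1, y1, x2, y2, x3, y3, x4, y4) and that points 1 and 2
    form the top edge (an assumption about the corner order).
    """
    x1, y1, x2, y2 = quad[:4]
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def estimate_text_angle(quads):
    """Median of the top-edge angles; 0.0 when no lines were detected."""
    if not quads:
        return 0.0
    return median(top_edge_angle(q) for q in quads)
```

Using the median rather than the mean keeps a single badly regressed quad from skewing the page-level angle.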

	### Model Registry (34 models)

	| Index | Role | Script | Custom Op | Status |
	|-------|------|--------|-----------|--------|
	| 0 | Detector | Universal | `QLinearSigmoid` | ✅ Works |
	| 1 | ScriptID | Universal | none | ✅ Works |
	| 2–10 | Recognizers | Latin/CJK/Arabic/Cyrillic/Devanagari/Greek/Hebrew/Tamil/Thai | `DynamicQuantizeLSTM` | ✅ Work |
	| 11–21 | LangSm (confidence) | Per-script | `OneOCRFeatureExtract` → Gemm | ✅ Unlocked |
	| 22–32 | LangMd (confidence) | Per-script | `OneOCRFeatureExtract` → Gemm | ✅ Unlocked |
	| 33 | LineLayout | Universal | `OneOCRFeatureExtract` → Conv1x1 | ✅ Unlocked |

	---

	## Quick Start

	### Requirements

	```bash
	pip install onnxruntime numpy opencv-python-headless Pillow pycryptodome onnx
	```

	Or with `uv`:
	```bash
	uv sync --extra extract
	```

	### Model Extraction (one-time)

	```bash
	# Full pipeline: decrypt → extract → unlock → verify
	python tools/extract_pipeline.py ocr_data/oneocr.onemodel

	# Verify existing models only
	python tools/extract_pipeline.py --verify-only
	```

	### Usage

	```python
	# Recommended: Unified engine (auto-selects best backend)
	from ocr.engine_unified import OcrEngineUnified
	from PIL import Image

	engine = OcrEngineUnified() # auto: DLL → Wine → ONNX
	result = engine.recognize_pil(Image.open("screenshot.png"))

	print(f"Backend: {engine.backend_name}") # "dll" / "wine" / "onnx"
	print(result.text) # "Hello World"
	print(result.average_confidence) # 0.975

	for line in result.lines:
	    for word in line.words:
	        print(f" '{word.text}' conf={word.confidence:.0%} "
	              f"bbox=({word.bounding_rect.x1:.0f},{word.bounding_rect.y1:.0f})")
	```

	```bash
	# CLI:
	python main.py screenshot.png # auto backend
	python main.py screenshot.png --backend dll # force DLL (Windows)
	python main.py screenshot.png --backend wine # force Wine (Linux)
	python main.py screenshot.png --backend onnx # force ONNX (any OS)
	python main.py screenshot.png -o result.json # save JSON output
	```

	### ONNX Engine (alternative — cross-platform, no Wine needed)

	```python
	from ocr.engine_onnx import OcrEngineOnnx
	from PIL import Image

	engine = OcrEngineOnnx()
	result = engine.recognize_pil(Image.open("screenshot.png"))
	print(result.text)
	```

	### API Reference

	```python
	engine = OcrEngineOnnx(
	    models_dir="path/to/onnx_models",     # optional
	    config_dir="path/to/config_data",     # optional
	    providers=["CUDAExecutionProvider"],  # optional (default: CPU)
	)

	# Input formats:
	result = engine.recognize_pil(pil_image) # PIL Image
	result = engine.recognize_numpy(rgb_array) # numpy (H,W,3) RGB
	result = engine.recognize_bytes(png_bytes) # raw bytes (PNG/JPEG)

	# Result:
	result.text # str — full recognized text
	result.text_angle # float — detected rotation angle
	result.lines # list[OcrLine]
	result.average_confidence # float — overall confidence 0-1
	result.error # str | None — error message

	# Per-word:
	word.text # str
	word.confidence # float — CTC confidence per word
	word.bounding_rect # BoundingRect (x1,y1...x4,y4 quadrilateral)
	```

	---

	## Running on Linux (Wine Bridge — 100% accuracy)

	The DLL has a remarkably clean dependency profile (only `KERNEL32`, `bcrypt`, `dbghelp` + shipped `onnxruntime.dll`), making it fully compatible with Wine.

	### Option A: Docker (recommended)

	```bash
	# Build
	docker build -t oneocr .

	# Run OCR on an image
	docker run --rm -v "$(pwd)/working_space:/data" oneocr \
	    python main.py /data/input/test.png --output /data/output/result.json

	# Interactive shell
	docker run --rm -it -v "$(pwd)/working_space:/data" oneocr bash
	```

	### Option B: Native Wine

	```bash
	# 1. Install Wine + MinGW cross-compiler
	# Ubuntu/Debian:
	sudo apt install wine64 mingw-w64

	# Fedora:
	sudo dnf install wine mingw64-gcc

	# Arch:
	sudo pacman -S wine mingw-w64-gcc

	# 2. Initialize 64-bit Wine prefix
	WINEARCH=win64 wineboot --init

	# 3. Compile the Wine loader (one-time)
	x86_64-w64-mingw32-gcc -O2 -o tools/oneocr_loader.exe tools/oneocr_loader.c

	# 4. Test
	python main.py screenshot.png --backend wine
	```

	### Wine Bridge Architecture

	```
	Linux Python ──► subprocess (wine64) ──► oneocr_loader.exe ──► oneocr.dll
	           ▲                                                     │
	           │                                                     ▼
	           └──── JSON stdout ◄──── OCR results ◄──── onnxruntime.dll
	```
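
The round trip in the diagram can be sketched in a few lines of Python. This is an illustration only: `run_loader` is a hypothetical helper, and the command line it builds (loader path followed by the image path) is an assumption, not the interface `tools/wine_bridge.py` actually uses.

```python
import json
import subprocess


def run_loader(image_path, cmd=("wine64", "tools/oneocr_loader.exe"), timeout=60):
    """Run the loader on one image and parse the JSON it prints to stdout.

    The argument layout (loader, then image path) is a hypothetical
    interface chosen for this sketch.
    """
    proc = subprocess.run(
        [*cmd, str(image_path)],
        capture_output=True,  # keep stdout (JSON) apart from stderr (Wine noise)
        text=True,
        timeout=timeout,
        check=False,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"loader failed ({proc.returncode}): {proc.stderr.strip()}")
    return json.loads(proc.stdout)
```

Keeping stdout and stderr separate matters here because Wine tends to write diagnostics to stderr, which would otherwise corrupt the JSON payload.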

	DLL Dependencies (all implemented in Wine ≥ 8.0):

	| DLL | Functions | Wine Status | Notes |
	|-----|-----------|-------------|-------|
	| `KERNEL32.dll` | 183 | ✅ Full | Standard WinAPI |
	| `bcrypt.dll` | 12 | ✅ Full | AES-256-CFB128 for model decryption |
	| `dbghelp.dll` | 5 | ✅ Stubs | Debug symbols, non-critical |
	| `onnxruntime.dll` | 1 | N/A | Shipped with package |

	---

	## Project Structure

	```
	ONEOCR/
	├── main.py # CLI entry point (auto-selects backend)
	├── Dockerfile # Docker setup for Linux (Wine + DLL)
	├── pyproject.toml # Project config & dependencies
	├── README.md # This documentation
	├── .gitignore
	│
	├── ocr/ # Core OCR package
	│ ├── __init__.py # Exports all engines & models
	│ ├── engine.py # DLL wrapper (Windows only, 374 lines)
	│ ├── engine_onnx.py # ONNX engine (cross-platform, ~1100 lines)
	│ ├── engine_unified.py # Unified wrapper (DLL → Wine → ONNX)
	│ └── models.py # Data models: OcrResult, OcrLine, OcrWord
	│
	├── tools/ # Utilities
	│ ├── extract_pipeline.py # Extraction pipeline (decrypt→extract→unlock→verify)
	│ ├── visualize_ocr.py # OCR result visualization with bounding boxes
	│ ├── test_quick.py # Quick OCR test on images
	│ ├── wine_bridge.py # Wine bridge for Linux (C loader + Python API)
	│ └── oneocr_loader.c # C source for Wine loader (auto-generated)
	│
	├── ocr_data/ # Runtime data (DO NOT commit)
	│ ├── oneocr.dll # Original DLL (Windows only)
	│ ├── oneocr.onemodel # Encrypted model container
	│ └── onnxruntime.dll # ONNX Runtime DLL
	│
	├── oneocr_extracted/ # Extracted models (auto-generated)
	│ ├── onnx_models/ # 34 raw ONNX (models 11-33 have custom ops)
	│ ├── onnx_models_unlocked/ # 23 unlocked (models 11-33, standard ONNX ops)
	│ └── config_data/ # Character maps, rnn_info, manifest, configs
	│
	├── working_space/ # Test images
	│ └── input/ # 19 test images
	│
	└── _archive/ # Archive — RE scripts, analyses, prototypes
	  ├── temp/re_output/ # DLL reverse engineering results
	  ├── attempts/ # Decryption attempts
	  ├── analysis/ # Cryptographic analyses
	  └── hooks/ # Frida hooks
	```

	---

	## Technical Details

	### .onemodel Encryption

	| Element | Value |
	|---------|-------|
	| Algorithm | AES-256-CFB128 |
	| Master Key | `kj)TGtrK>f]b[Piow.gU+nC@s""""""4` (32B) |
	| IV | `Copyright @ OneO` (16B) |
	| DX key | `SHA256(master_key + file[8:24])` |
	| Config key | `SHA256(DX[48:64] + DX[32:48])` |
	| Chunk key | `SHA256(chunk_header[16:32] + chunk_header[0:16])` |
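
The key schedule in the table can be written out as a short sketch. The helper names are hypothetical, and two points are assumptions: that `DX` in the config-key formula means the decrypted DX block (a 32-byte digest has no bytes 48 to 64), and that the same static IV is used for every decryption. `extract_pipeline.py` remains the reference.

```python
import hashlib

MASTER_KEY = b'kj)TGtrK>f]b[Piow.gU+nC@s""""""4'  # 32 bytes, from the table
IV = b"Copyright @ OneO"                          # 16 bytes, from the table


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def derive_dx_key(file_bytes: bytes) -> bytes:
    # SHA256(master_key + file[8:24])
    return sha256(MASTER_KEY + file_bytes[8:24])


def derive_config_key(dx_block: bytes) -> bytes:
    # Assumption: dx_block is the decrypted DX block, not the DX key itself.
    return sha256(dx_block[48:64] + dx_block[32:48])


def derive_chunk_key(chunk_header: bytes) -> bytes:
    # SHA256(chunk_header[16:32] + chunk_header[0:16])
    return sha256(chunk_header[16:32] + chunk_header[0:16])


def decrypt_cfb128(key: bytes, data: bytes) -> bytes:
    # Needs pycryptodome; imported lazily so the key helpers work without it.
    from Crypto.Cipher import AES
    return AES.new(key, AES.MODE_CFB, iv=IV, segment_size=128).decrypt(data)
```

Note the `segment_size=128`: pycryptodome's CFB mode defaults to 8-bit segments, which would not match CFB128.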

	### OneOCRFeatureExtract — Cracked Custom Op

	Proprietary op (domain `com.microsoft.oneocr`) stores weights as a big-endian float32 blob in a STRING tensor.

	Models 11–32 (21→50 features):
	```
	config_blob (4492B, big-endian float32):
	W[21×50] = 1050 floats (weight matrix)
	b[50] = 50 floats (bias)
	metadata = 23 floats (dimensions [21, 50, 2], flags, calibration)

	Replacement: Gemm(input, W^T, b)
	```

	Model 33 (256→16 channels):
	```
	config_blob (16548B, big-endian float32):
	W[256×16] = 4096 floats (convolution weights)
	b[16] = 16 floats (bias)
	metadata = 25 floats (dimensions [256, 16], flags)

	Replacement: Conv(input, W[in,out].T → [16,256,1,1], b, kernel=1x1)
	```
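
As a rough illustration of both layouts above, the blob can be split with the standard `struct` module. The sketch assumes the fields appear in the listed order (weights, bias, metadata) and that `W` is stored row-major as `[n_in][n_out]`; neither is confirmed here, and the function names are hypothetical.

```python
import struct


def parse_feature_blob(blob: bytes, n_in: int, n_out: int):
    """Split a big-endian float32 blob into (weights, bias, metadata).

    Assumes the order W, b, metadata and a row-major W[n_in][n_out];
    both are assumptions read off the layout listing.
    """
    if len(blob) % 4:
        raise ValueError("blob length is not a multiple of 4")
    floats = struct.unpack(f">{len(blob) // 4}f", blob)
    n_w = n_in * n_out
    if len(floats) < n_w + n_out:
        raise ValueError("blob too short for the requested shape")
    weights = [list(floats[r * n_out:(r + 1) * n_out]) for r in range(n_in)]
    bias = list(floats[n_w:n_w + n_out])
    metadata = list(floats[n_w + n_out:])
    return weights, bias, metadata


def apply_linear(x, weights, bias):
    """y = x @ W + b for one input vector, in plain Python."""
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]
```

For models 11–32 this would be called as `parse_feature_blob(blob, 21, 50)`, leaving 23 metadata floats from a 4492-byte blob; for model 33, `parse_feature_blob(blob, 256, 16)` leaves 25 from 16548 bytes.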

	### Detector Configuration (from DLL protobuf manifest)

	```
	segment_conf_threshold: 0.7 (field 8)
	textline_conf_threshold per-FPN: P2=0.7, P3=0.8, P4=0.8 (field 9)
	textline_nms_threshold: 0.2 (field 10)
	textline_overlap_threshold: 0.4 (field 11)
	text_confidence_threshold: 0.8 (field 13)
	ambiguous_nms_threshold: 0.3 (field 15)
	ambiguous_overlap_threshold: 0.5 (field 16)
	ambiguous_save_threshold: 0.4 (field 17)
	textline_hardnms_iou_threshold: 0.32 (field 20)
	textline_groupnms_span_ratio_threshold: 0.3 (field 21)
	```
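
To make the role of `textline_nms_threshold` concrete, here is a minimal greedy NMS sketch. It is a simplification: it uses axis-aligned boxes instead of the quads the detector produces, and it covers only the first of the NMS stages the DLL applies. The function names are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def greedy_nms(boxes, scores, iou_threshold=0.2):
    """Return indices of kept boxes, highest score first.

    A box is dropped when it overlaps an already kept box by more than
    iou_threshold (0.2 mirrors textline_nms_threshold).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```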

	### PixelLink Detector

	- FPN levels: fpn2 (stride=4), fpn3 (stride=8), fpn4 (stride=16)
	- Outputs per level: `scores_hori/vert` (pixel text probability), `link_scores_hori/vert` (8-neighbor connectivity), `bbox_deltas_hori/vert` (corner offsets)
	- Post-processing: Threshold pixels → Union-Find connected components → bbox regression → NMS
	- Detects TEXT LINES — word splitting comes from the recognizer
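
The thresholding and Union-Find steps can be sketched as follows. This is a simplified, pure-Python illustration: it works on one score map instead of the separate `hori`/`vert` outputs, skips bbox regression and NMS, and uses an assumed link threshold of 0.5 (the 0.7 pixel threshold is borrowed from `segment_conf_threshold`). All names are hypothetical.

```python
# 8-neighborhood offsets as (dy, dx); index k selects link_scores[y][x][k].
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def find(parent, i):
    """Root of i, with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def connected_components(pixel_scores, link_scores, pixel_thr=0.7, link_thr=0.5):
    """Group text pixels that are joined by a positive link.

    pixel_scores: H x W grid of text probabilities.
    link_scores:  H x W x 8 grid, one link probability per neighbor.
    link_thr=0.5 is an assumed value, not taken from the manifest.
    Returns a list of components, each a list of (x, y) pixels.
    """
    h, w = len(pixel_scores), len(pixel_scores[0])
    is_text = [[pixel_scores[y][x] >= pixel_thr for x in range(w)] for y in range(h)]
    parent = list(range(h * w))
    for y in range(h):
        for x in range(w):
            if not is_text[y][x]:
                continue
            for k, (dy, dx) in enumerate(NEIGHBORS):
                ny, nx = y + dy, x + dx
                if (0 <= ny < h and 0 <= nx < w and is_text[ny][nx]
                        and link_scores[y][x][k] >= link_thr):
                    ra, rb = find(parent, y * w + x), find(parent, ny * w + nx)
                    if ra != rb:
                        parent[rb] = ra
    groups = {}
    for y in range(h):
        for x in range(w):
            if is_text[y][x]:
                groups.setdefault(find(parent, y * w + x), []).append((x, y))
    return list(groups.values())
```

Because every text pixel checks all eight of its own links, two neighbors are merged when either side reports a link above the threshold.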

	### CTC Recognition

	- Target height: 60px, aspect ratio preserved
	- Input: RGB / 255.0, NCHW format
	- Output: log-softmax [T, 1, N_chars]
	- Decoding: greedy argmax with repeat merging + blank removal
	- Per-character confidence via `exp(max_logprob)`
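
The decoding rule above fits in a few lines. The sketch below assumes the blank symbol sits at index 0 and that `charmap` is a plain list indexed by class id; both are assumptions, and the real character maps live in `config_data/`.

```python
import math


def ctc_greedy_decode(log_probs, charmap, blank=0):
    """Greedy CTC decode of one text line.

    log_probs: T rows of N log-probabilities (the [T, 1, N_chars] output
    with the batch axis removed). blank=0 is an assumption.
    Returns (text, per-character confidences).
    """
    chars, confs = [], []
    prev = blank
    for row in log_probs:
        idx = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        if idx != blank and idx != prev:                  # merge repeats, drop blanks
            chars.append(charmap[idx])
            confs.append(math.exp(row[idx]))              # exp(max_logprob)
        prev = idx
    return "".join(chars), confs
```

In this sketch a character's confidence comes from the first frame that emits it; how the confidences are aggregated into the per-word values reported by the engine is not covered.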

	---

	## DLL Reverse Engineering — Results & Materials

	### DLL Source Structure (from debug symbols)

	```
	C:\__w\1\s\CoreEngine\Native\
	├── TextDetector/
	│ ├── AdaptiveScaling ← multi-level image scaling
	│ ├── SeglinkProposal ← KEY: detection post-processing
	│ ├── SeglinkGroup.h ← segment grouping into lines
	│ ├── TextLinePolygon ← precise text contouring
	│ ├── RelationRCNNRpn2 ← relational region proposal network
	│ ├── BaseRCNN, DQDETR ← alternative detectors
	│ ├── PolyFitting ← polynomial fitting
	│ └── BarcodePolygon ← barcode detection
	│
	├── TextRecognizer/
	│ ├── TextLineRecognizerImpl ← main CTC implementation
	│ ├── ArgMaxDecoder ← CTC decoding
	│ ├── ConfidenceProcessor ← confidence models (models 11-21)
	│ ├── RejectionProcessor ← rejection models (models 22-32)
	│ ├── DbLstm ← dynamic batch LSTM
	│ └── CharacterMap/ ← per-script character maps
	│
	├── TextAnalyzer/
	│ ├── TextAnalyzerImpl ← text layout analysis
	│ └── AuxMltClsClassifier ← auxiliary classifier
	│
	├── TextNormalizer/
	│ ├── BaseNormalizer ← text line normalization
	│ └── ConcatTextLines ← line concatenation
	│
	├── TextPipeline/
	│ ├── TextPipelineDevImpl ← main pipeline
	│ └── FilterXY ← position-based filtering
	│
	├── CustomOps/onnxruntime/
	│ ├── SeglinkProposalsOp ← ONNX op (NOT in our models)
	│ ├── XYSeglinkProposalsOp ← XY variant
	│ └── FeatureExtractOp ← = Gemm / Conv1x1
	│
	├── ModelParser/
	│ ├── ModelParser ← .onemodel parsing
	│ └── Crypto ← AES-256-CFB128
	│
	└── Common/
	  ├── ImageUtility ← image conversion
	  └── ImageFeature ← image features
	```

	### RE Materials

	Reverse engineering results in `_archive/temp/re_output/`:
	- `03_oneocr_classes.txt` — 186 C++ classes
	- `06_config_strings.txt` — 429 config strings
	- `15_manifest_decoded.txt` — 1182 lines of decoded protobuf manifest
	- `09_constants.txt` — 42 float + 14 double constants (800.0, 0.7, 0.8, 0.92...)
	- `10_disassembly.txt` — disassembly of key exports

	---

	## For Future Developers — Roadmap

	### Priority 1: SeglinkProposals (hardest, highest impact)

	This is the key C++ post-processing step in the DLL; it is NOT part of the ONNX models.
	It accounts for an estimated ~80% of the differences between the DLL and our implementation.

	What it does:
	1. Takes raw pixel_scores + link_scores + bbox_deltas from all 3 FPN levels
	2. Groups segments into lines (SeglinkGroup) — merges neighboring small components into a single line
	3. Multi-stage NMS: textline_nms → hardnms → ambiguous_nms → groupnms
	4. Confidence filtering with `text_confidence_threshold = 0.8`
	5. `K_of_detections` — detection count limit

	Where to look:
	- `_archive/temp/re_output/06_config_strings.txt` — parameter names
	- `_archive/temp/re_output/15_manifest_decoded.txt` — parameter values
	- `SeglinkProposal` class in DLL — ~2000 lines of C++

	Approach:
	- Decompile `SeglinkProposal::Process` with IDA Pro / Ghidra
	- Alternatively: black-box testing of different NMS configurations

	### Priority 2: AdaptiveScaling

	The DLL dynamically scales images based on text size.

	Parameters:
	- `AS_LARGE_TEXT_THRESHOLD` — large text threshold
	- Multi-scale: DLL can run the detector at multiple scales

	### Priority 3: BaseNormalizer

	The DLL normalizes text crops before recognition more effectively than our simple resize.

	### Priority 4: Confidence/Rejection Models (11-32)

	The DLL uses models 11-32 to filter results — we skip them. Integration could improve
	precision by removing false detections.

	---

	## Performance

	| Operation | ONNX (CPU) | DLL | Notes |
	|---|---|---|---|
	| Detection (PixelLink) | ~50-200ms | ~15-50ms | Model inference + post-processing |
	| ScriptID | ~5ms | ~3ms | Single forward pass |
	| Recognition (CTC) | ~30ms/line | ~10ms/line | Per-script LSTM |
	| Full pipeline | ~300-1000ms | ~15-135ms | Depends on line count |

	---

	## License

	For research and educational purposes only.