metadata
title: OCR Dataset Generator
emoji: π
colorFrom: purple
colorTo: blue
sdk: docker
app_file: Dockerfile
app_port: 7860
pinned: false
Synthetic Text Recognition Dataset Generator
Universal OCR Dataset Builder for Low-Resource Languages
A high-performance, client-side OCR dataset generator with a web interface, built with Next.js and TypeScript. It is designed for low-resource languages such as Kashmiri and exports labels for CRNN, TrOCR, PaddleOCR, and HuggingFace.
✨ Features
- Multi-format output: CRNN, TrOCR, PaddleOCR, CSV/JSON, HuggingFace
- Multi-font support: Upload custom fonts with percentage-based distribution
- 25+ augmentation types: Rotation, blur, noise, brightness, skew, and more
- Custom backgrounds: Use preset paper textures or upload your own images
- Unicode-safe: Full RTL support for Arabic, Hebrew, Kashmiri, etc.
- Reproducible: Seeded random generation
- Client-side: All generation happens in your browser; no data is uploaded
Quick Start
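Seeded generation and percentage-based font distribution fit together naturally: if every random choice is drawn from one seeded generator, the same seed reproduces the same dataset. The following is a minimal sketch under stated assumptions. The `WeightedFont` shape and the function names are hypothetical and not taken from the app's source, and mulberry32 is used only as a convenient seedable generator because `Math.random()` cannot be seeded.

```typescript
// Hypothetical shape for an uploaded font; the app's real types are not shown in this README.
interface WeightedFont {
  name: string;
  percent: number; // share of samples that should use this font
}

// mulberry32: a small seedable 32-bit PRNG that returns floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pick one font with probability proportional to its percentage.
function pickWeighted(fonts: WeightedFont[], rand: () => number): WeightedFont {
  if (fonts.length === 0) throw new Error("at least one font is required");
  const total = fonts.reduce((sum, f) => sum + f.percent, 0);
  if (total <= 0) throw new Error("percentages must sum to a positive number");
  let threshold = rand() * total;
  for (const font of fonts) {
    threshold -= font.percent;
    if (threshold < 0) return font;
  }
  return fonts[fonts.length - 1]; // guard against floating-point round-off
}

// Draw `count` font names; the same seed always yields the same sequence.
function sampleFonts(fonts: WeightedFont[], count: number, seed: number): string[] {
  const rand = mulberry32(seed);
  const names: string[] = [];
  for (let i = 0; i < count; i++) {
    names.push(pickWeighted(fonts, rand).name);
  }
  return names;
}
```

Calling `sampleFonts(fonts, 1000, 42)` twice returns identical arrays, while a different seed gives a different but equally weighted sequence. The percentages do not need to sum to 100 because they are normalized by their total.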
Web Interface (Recommended)
Visit the app and:
- Configure - Set dataset size, image dimensions, upload fonts
- Preview - See live preview of your configuration
- Generate - Click start and download your ZIP file
- Statistics - View generation statistics
Local Development
```bash
cd web
npm install
npm run dev
```
📦 Output Formats
| Format | Files Generated |
|---|---|
| CRNN / PaddleOCR | labels.txt |
| TrOCR / JSONL | data.jsonl |
| CSV | data.csv |
| JSON | data.json |
| HuggingFace | dataset_dict.json |
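To make the table concrete, here is a hedged sketch of how image/label pairs could be serialized into three of these files. The README does not state the exact layout the app writes, so the `Sample` shape, the `file` and `text` field names, and the CSV header are assumptions. The tab-separated `path<TAB>label` line follows the convention PaddleOCR uses for recognition labels.

```typescript
// Hypothetical record for one generated image; the field names are assumptions.
interface Sample {
  file: string; // image path relative to the dataset root
  text: string; // ground-truth transcription
}

// labels.txt: one "path<TAB>text" line per image.
// Tabs and line breaks inside the text would corrupt the layout, so they become spaces.
function toLabelsTxt(samples: Sample[]): string {
  return samples
    .map((s) => `${s.file}\t${s.text.replace(/[\t\r\n]+/g, " ")}`)
    .join("\n");
}

// data.jsonl: one JSON object per line. JSON.stringify escapes quotes and
// newlines and leaves non-ASCII characters (such as Arabic script) intact.
function toJsonl(samples: Sample[]): string {
  return samples
    .map((s) => JSON.stringify({ file: s.file, text: s.text }))
    .join("\n");
}

// Quote a CSV field in the RFC 4180 style when it contains a quote,
// comma, or line break; embedded quotes are doubled.
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// data.csv: a header row followed by one row per image.
function toCsv(samples: Sample[]): string {
  const rows = samples.map((s) => `${csvField(s.file)},${csvField(s.text)}`);
  return ["file,text", ...rows].join("\n");
}
```

Because all three writers operate on JavaScript strings, right-to-left text passes through unchanged; only the delimiter characters of each format need escaping.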
🎨 Background Styles
- Preset styles: Aged paper, notebook, newspaper, parchment, and more
- Custom images: Upload your own background textures
- Mix mode: Randomly combine multiple backgrounds by percentage
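The README does not say whether mix mode draws a background independently for each image or fixes the per-background counts up front. As one possible reading, the sketch below converts percentages into exact image counts for a dataset of a given size, using the largest-remainder method so the counts always add up to the total. The `BackgroundShare` type and `allocateCounts` function are hypothetical names, not part of the app.

```typescript
// Hypothetical description of one background and its share of the dataset.
interface BackgroundShare {
  name: string;
  percent: number;
}

// Turn percentages into whole image counts that sum exactly to `total`.
function allocateCounts(shares: BackgroundShare[], total: number): Record<string, number> {
  const sum = shares.reduce((s, b) => s + b.percent, 0);
  if (sum <= 0) throw new Error("percentages must sum to a positive number");

  // Ideal (fractional) count for each background, then its floor.
  const exact = shares.map((b) => (b.percent / sum) * total);
  const counts = exact.map((x) => Math.floor(x));
  let remaining = total - counts.reduce((s, c) => s + c, 0);

  // Hand the leftover images to the largest fractional parts first;
  // ties are broken by the original order.
  const order = exact
    .map((x, i) => ({ i, frac: x - Math.floor(x) }))
    .sort((p, q) => q.frac - p.frac || p.i - q.i);
  for (const entry of order) {
    if (remaining <= 0) break;
    counts[entry.i] += 1;
    remaining -= 1;
  }

  const result: Record<string, number> = {};
  shares.forEach((b, i) => {
    result[b.name] = counts[i];
  });
  return result;
}
```

For shares of 50, 30, and 20 percent over 7 images, the ideal counts are 3.5, 2.1, and 1.4, so the single leftover image goes to the first background. Shuffling the resulting assignment with a seeded generator would keep the mix random in order while exact in proportion.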
License
MIT