title: OCR Dataset Generator
emoji: 📝
colorFrom: purple
colorTo: blue
sdk: docker
app_file: Dockerfile
app_port: 7860
pinned: false

Synthetic Text Recognition Dataset Generator

Universal OCR Dataset Builder for Low-Resource Languages

A high-performance, client-side OCR dataset generator with a web interface. Built with Next.js and TypeScript, it is designed for low-resource languages such as Kashmiri and exports to the label formats used by CRNN, TrOCR, PaddleOCR, and HuggingFace pipelines.

✨ Features

Multi-format output: CRNN, TrOCR, PaddleOCR, CSV/JSON, HuggingFace
Multi-font support: Upload custom fonts with percentage-based distribution
25+ augmentation types: Rotation, blur, noise, brightness, skew, and more
Custom backgrounds: Use preset paper textures or upload your own images
Unicode-safe: Full RTL support for Arabic, Hebrew, Kashmiri, etc.
Reproducible: Seeded random generation
Client-side: All generation happens in your browser; no data is uploaded
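
As an illustration of what seeded, reproducible generation can look like, here is a minimal sketch that drives augmentation parameters from a small seeded PRNG (Mulberry32). The `AugmentParams` fields, the default ranges, and the function names are assumptions made for this example, not the app's actual internals.

```typescript
// Mulberry32: a small, well-known 32-bit seeded PRNG. Returns floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical augmentation parameters; the real option names may differ.
interface AugmentParams {
  rotationDeg: number; // in [-maxRotationDeg, +maxRotationDeg)
  blurRadius: number; // in [0, maxBlurRadius)
}

// Draw one set of parameters from the supplied random source.
function sampleAugment(
  rng: () => number,
  maxRotationDeg = 5,
  maxBlurRadius = 1.5
): AugmentParams {
  return {
    rotationDeg: (rng() * 2 - 1) * maxRotationDeg,
    blurRadius: rng() * maxBlurRadius,
  };
}

// The same seed always yields the same sequence of parameters.
function sampleAll(seed: number, count: number): AugmentParams[] {
  const rng = mulberry32(seed);
  return Array.from({ length: count }, () => sampleAugment(rng));
}
```

Calling `sampleAll(42, 1000)` twice returns identical arrays, which is what makes a generated dataset repeatable from its seed alone.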

🚀 Quick Start

Web Interface (Recommended)

Visit the app and:

Configure - Set dataset size, image dimensions, upload fonts
Preview - See live preview of your configuration
Generate - Click start and download your ZIP file
Statistics - View generation statistics

Local Development

```bash
cd web
npm install
npm run dev
```

📦 Output Formats

| Format | Files Generated |
| --- | --- |
| CRNN / PaddleOCR | `labels.txt` |
| TrOCR / JSONL | `data.jsonl` |
| CSV | `data.csv` |
| JSON | `data.json` |
| HuggingFace | `dataset_dict.json` |
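
To make the file formats above more concrete, the following sketch serializes a list of samples into three of them. The exact layout the app writes is not documented here, so treat the details as assumptions: tab-separated `path<TAB>text` lines for `labels.txt` (a common convention for CRNN and PaddleOCR recognition labels), one JSON object per line with hypothetical `file_name` and `text` keys for `data.jsonl`, and a two-column `data.csv`.

```typescript
// Hypothetical in-memory record for one generated image.
interface Sample {
  fileName: string;
  text: string;
}

// Join lines so that every line, including the last, ends with "\n".
const joinLines = (rows: string[]): string => rows.map((r) => r + "\n").join("");

// labels.txt: "path<TAB>text". Tabs and newlines inside the text would break
// the one-sample-per-line layout, so they are collapsed to a single space.
function toLabelsTxt(samples: Sample[]): string {
  return joinLines(
    samples.map((s) => `${s.fileName}\t${s.text.replace(/[\t\r\n]+/g, " ")}`)
  );
}

// data.jsonl: one JSON object per line. JSON.stringify leaves non-ASCII
// characters as they are, so RTL text stays readable in the file.
function toJsonl(samples: Sample[]): string {
  return joinLines(
    samples.map((s) => JSON.stringify({ file_name: s.fileName, text: s.text }))
  );
}

// data.csv: quote a field only when it contains a comma, quote, or line break.
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

function toCsv(samples: Sample[]): string {
  const rows = samples.map((s) => `${csvField(s.fileName)},${csvField(s.text)}`);
  return joinLines(["file_name,text", ...rows]);
}
```

All three functions return plain strings, so in a browser they can be added to a ZIP archive or wrapped in a `Blob` for download without any server round trip.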

🎨 Background Styles

Preset styles: Aged paper, notebook, newspaper, parchment, and more
Custom images: Upload your own background textures
Mix mode: Randomly combine multiple backgrounds by percentage
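
Mixing by percentage amounts to a weighted random choice per image. The sketch below shows one way to do it; `WeightedItem`, `pickByPercent`, and the background names are hypothetical and only illustrate the idea. The same approach would apply to percentage-based font distribution.

```typescript
// One candidate together with its share of the mix, e.g. 70 for 70%.
interface WeightedItem<T> {
  item: T;
  percent: number;
}

// Pick one item with probability proportional to its percentage.
// Percentages do not need to sum to 100; they are normalized by their total.
// Pass a seeded generator as `rng` to make the choice reproducible.
function pickByPercent<T>(
  choices: WeightedItem<T>[],
  rng: () => number = Math.random
): T {
  if (choices.some((c) => c.percent < 0)) {
    throw new Error("percentages must not be negative");
  }
  const total = choices.reduce((sum, c) => sum + c.percent, 0);
  if (total <= 0) {
    throw new Error("need at least one choice with a positive percentage");
  }
  let r = rng() * total;
  for (const c of choices) {
    r -= c.percent;
    if (r < 0) return c.item;
  }
  // Only reachable through floating-point round-off; fall back to the last item.
  return choices[choices.length - 1].item;
}

// Example mix: mostly aged paper, some notebook, a little newspaper.
const backgroundMix: WeightedItem<string>[] = [
  { item: "aged-paper", percent: 70 },
  { item: "notebook", percent: 20 },
  { item: "newspaper", percent: 10 },
];
```

Calling `pickByPercent(backgroundMix)` once per generated image gives roughly a 70/20/10 split over a large dataset. Individual runs vary unless a seeded `rng` is supplied.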

📜 License

MIT