title: OCR Dataset Generator
emoji: 📝
colorFrom: purple
colorTo: blue
sdk: docker
app_file: Dockerfile
app_port: 7860
pinned: false

Synthetic Text Recognition Dataset Generator

Universal OCR Dataset Builder for Low-Resource Languages

A high-performance, client-side OCR dataset generator with a web interface. It is built with Next.js and TypeScript, designed for low-resource languages such as Kashmiri, and exports to the label formats used by CRNN, TrOCR, and PaddleOCR pipelines, as well as CSV, JSON, and HuggingFace datasets.

✨ Features

  • Multi-format output: CRNN, TrOCR, PaddleOCR, CSV/JSON, HuggingFace
  • Multi-font support: Upload custom fonts with percentage-based distribution
  • 25+ augmentation types: Rotation, blur, noise, brightness, skew, and more
  • Custom backgrounds: Use preset paper textures or upload your own images
  • Unicode-safe: Full RTL support for Arabic, Hebrew, Kashmiri, etc.
  • Reproducible: Seeded random generation
  • Client-side: All generation happens in your browser; no data is uploaded
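
To illustrate what "seeded random generation" means in practice, here is a minimal sketch of a seeded generator driving one augmentation parameter. It is not taken from this project's source: `mulberry32` is one commonly used small PRNG, and `randomRotation` is a hypothetical helper named only for this example. The point is that two runs with the same seed draw identical augmentation values.

```typescript
// mulberry32: a small 32-bit seeded PRNG (an assumed choice, not
// necessarily the generator this app uses). Returns values in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: draw a rotation angle in [-maxDegrees, +maxDegrees].
function randomRotation(rng: () => number, maxDegrees: number): number {
  return (rng() * 2 - 1) * maxDegrees;
}

// Two generators with the same seed produce the same angle sequence.
const runA = mulberry32(42);
const runB = mulberry32(42);
const anglesA = [0, 1, 2].map(() => randomRotation(runA, 5));
const anglesB = [0, 1, 2].map(() => randomRotation(runB, 5));
console.log(anglesA.every((angle, i) => angle === anglesB[i]));
```

Passing the generator as a plain `() => number` function keeps augmentation code independent of how the seed is chosen.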

🚀 Quick Start

Web Interface (Recommended)

Visit the app and:

  1. Configure - Set dataset size, image dimensions, upload fonts
  2. Preview - See live preview of your configuration
  3. Generate - Click start and download your ZIP file
  4. Statistics - View generation statistics

Local Development

cd web
npm install
npm run dev

📦 Output Formats

| Format | Files Generated |
|--------|-----------------|
| CRNN / PaddleOCR | labels.txt |
| TrOCR / JSONL | data.jsonl |
| CSV | data.csv |
| JSON | data.json |
| HuggingFace | dataset_dict.json |
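
The exact layout of each file is not specified here, so the following is only a sketch of what the two line-oriented formats could look like. It assumes `labels.txt` holds one tab-separated `path<TAB>text` pair per line (a layout commonly used for PaddleOCR recognition labels) and that `data.jsonl` holds one JSON object per line. The `Sample` type and the `file_name` / `text` keys are assumptions made for this example.

```typescript
// Hypothetical sample record; the app's real field names may differ.
interface Sample {
  fileName: string;
  text: string;
}

// labels.txt sketch: one "path<TAB>text" line per image.
// Assumes the text itself contains no tabs or newlines.
function toLabelsTxt(samples: Sample[]): string {
  return samples.map((s) => `${s.fileName}\t${s.text}`).join("\n") + "\n";
}

// data.jsonl sketch: one JSON object per line.
function toJsonl(samples: Sample[]): string {
  return (
    samples
      .map((s) => JSON.stringify({ file_name: s.fileName, text: s.text }))
      .join("\n") + "\n"
  );
}

const demo: Sample[] = [
  { fileName: "images/000001.png", text: "hello" },
  { fileName: "images/000002.png", text: "world" },
];
console.log(toLabelsTxt(demo));
console.log(toJsonl(demo));
```

`JSON.stringify` leaves non-ASCII characters unescaped, so RTL and other Unicode text stays readable in the JSONL output.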

🎨 Background Styles

  • Preset styles: Aged paper, notebook, newspaper, parchment, and more
  • Custom images: Upload your own background textures
  • Mix mode: Randomly combine multiple backgrounds by percentage
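
Percentage-based mixing can be pictured as weighted random selection. The sketch below is an assumption about how such a mix might work, not the app's actual code: `WeightedItem` and `pickByPercent` are hypothetical names, and the same idea would apply to percentage-based font distribution.

```typescript
// Hypothetical config shape; the real app's types are not shown in this README.
interface WeightedItem {
  name: string;
  percent: number; // share of samples, e.g. 50 for 50%
}

// Pick one item with probability proportional to its percentage.
// `rand` must return a number in [0, 1); pass a seeded generator
// instead of Math.random to make the selection reproducible.
function pickByPercent(
  items: WeightedItem[],
  rand: () => number = Math.random
): WeightedItem {
  if (items.length === 0) throw new Error("no items to pick from");
  const total = items.reduce((sum, item) => sum + item.percent, 0);
  if (total <= 0) throw new Error("percentages must sum to a positive value");
  let threshold = rand() * total;
  for (const item of items) {
    threshold -= item.percent;
    if (threshold < 0) return item;
  }
  return items[items.length - 1]; // guard against floating-point drift
}

const backgrounds: WeightedItem[] = [
  { name: "aged-paper", percent: 50 },
  { name: "notebook", percent: 30 },
  { name: "parchment", percent: 20 },
];
console.log(pickByPercent(backgrounds).name);
```

Normalizing by the total means the percentages do not have to add up to exactly 100.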

📜 License

MIT