
TrOCR-SWIN HTR

Model Description

This is a TrOCR (Transformer-based Optical Character Recognition) model fine-tuned for handwritten text recognition. It uses a SWIN (Shifted Window Transformer) backbone as the image encoder and a BERT-based decoder for text generation.

  • Architecture: VisionEncoderDecoder with SWIN encoder and BERT decoder
  • Encoder: SwinForImageClassification (hidden size 1024, GELU activation)
  • Decoder: BertForMaskedLM (hidden size 768, 12 layers, 12 attention heads)
  • Training Time: 1.75 hours (6292 seconds)
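The encoder/decoder pairing above can be sketched with the transformers config classes. This is a minimal sketch: the `embed_dim` and `depths` values are assumptions chosen so that a Swin-base-style encoder reproduces the stated 1024-dim hidden size; only the dimensions listed above come from this card.

from transformers import BertConfig, SwinConfig, VisionEncoderDecoderConfig

# Swin encoder: embed_dim=128 with 4 stages gives a final hidden size of
# 128 * 2**3 = 1024, matching the encoder spec above (embed_dim/depths
# themselves are assumptions, not stated in the card).
encoder_config = SwinConfig(embed_dim=128, depths=[2, 2, 6, 2])

# BERT-style decoder with the dimensions listed above.
decoder_config = BertConfig(hidden_size=768, num_hidden_layers=12,
                            num_attention_heads=12)

# The helper marks the decoder as a decoder and adds cross-attention.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)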

Intended Use

  • Handwritten text recognition (HTR)
  • Document digitization
  • Historical document processing

Training Configuration

Key Hyperparameters:

  • Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-8)
  • Batch Handling: Even batches enabled, seedable sampler
  • Precision: BF16 disabled
  • DataLoader: Pin memory enabled, 0 workers, no drop last
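Assuming training used standard PyTorch components, the optimizer and dataloader settings above correspond to something like the following sketch (the model and dataset here are stand-ins):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(4, 2)  # stand-in for the actual VisionEncoderDecoder model

# Adam with the listed beta1/beta2/epsilon values
optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.999), eps=1e-8)

# DataLoader matching the listed settings: pinned memory, 0 workers,
# and incomplete final batches kept (drop_last=False).
dataset = TensorDataset(torch.randn(10, 4))
loader = DataLoader(dataset, batch_size=4, pin_memory=True,
                    num_workers=0, drop_last=False)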

Decoder Specifications:

  • Vocabulary Size: 119,547
  • Max Position Embeddings: 512
  • Hidden Dropout Probability: 0.1
  • Attention Dropout Probability: 0.1
  • Layer Normalization EPS: 1e-12
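As a sketch, the decoder configuration above can be reconstructed with `BertConfig` (the 119,547-token vocabulary matches multilingual BERT; everything else is taken directly from the list above):

from transformers import BertConfig

# Decoder configuration matching the specifications listed above.
decoder_config = BertConfig(
    vocab_size=119547,                  # multilingual-BERT-sized vocabulary
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    layer_norm_eps=1e-12,
)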

Accelerator Configuration:

  • Even batches: true
  • Non-blocking: false
  • Split batches: false
  • Use seedable sampler: true

Usage

from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained('model_name')
model = VisionEncoderDecoderModel.from_pretrained('model_name')

# Load a handwriting image as RGB (the path is illustrative)
image = Image.open('handwriting_sample.png').convert('RGB')

# Preprocess the image and generate text
pixel_values = processor(image, return_tensors="pt").pixel_values
outputs = model.generate(pixel_values)
texts = processor.batch_decode(outputs, skip_special_tokens=True)

print(texts[0])

Limitations

  • Primarily trained on handwritten text samples
  • Performance may vary with printed text or unusual fonts
  • Best results with clear, legible handwriting

Training Data

To be updated.

Environmental Impact

  • GPU: NVIDIA T4 (16 GB VRAM)
  • Environment: Google Colab
  • Training Time: 1.75 hours

Model Card Contact

For questions about this model, please check the original training logs or contact the model owner.

Model size: 0.3B parameters (Safetensors; tensor types I64 and F32)