TrOCR-SWIN HTR
Model Description
This is a TrOCR (Transformer-based Optical Character Recognition) model fine-tuned for handwritten text recognition. It uses a SWIN (Shifted Window Transformer) backbone as the image encoder and a BERT-based decoder for text generation.
- Architecture: VisionEncoderDecoder with SWIN encoder and BERT decoder
- Encoder: SwinForImageClassification (1024 hidden size, GELU activation)
- Decoder: BertForMaskedLM (768 hidden size, 12 layers, 12 attention heads)
- Training Time: 1.75 hours (6292 seconds)
Intended Use
- Handwritten text recognition (HTR)
- Document digitization
- Historical document processing
Training Configuration
Key Hyperparameters:
- Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-8)
- Batch Handling: Even batches enabled, seedable sampler
- Precision: BF16 disabled
- DataLoader: Pin memory enabled, 0 workers, no drop last
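The optimizer and dataloader settings above can be reproduced in a short PyTorch sketch. The learning rate, batch size, and the dummy parameters/dataset below are illustrative assumptions, not values recorded in this card.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy parameters stand in for the real encoder-decoder model.
params = [torch.nn.Parameter(torch.zeros(2, 2))]

# Optimizer settings from the card: Adam with beta1=0.9, beta2=0.999, eps=1e-8.
# The learning rate is an assumption; the card does not state it.
optimizer = torch.optim.Adam(params, lr=5e-5, betas=(0.9, 0.999), eps=1e-8)

# DataLoader settings from the card: pin_memory enabled, 0 workers, no drop_last.
# Batch size and dataset are placeholders.
dataset = TensorDataset(torch.zeros(10, 3))
loader = DataLoader(dataset, batch_size=4, pin_memory=True,
                    num_workers=0, drop_last=False)
```

With `drop_last=False`, the final, smaller batch is kept, so 10 samples at batch size 4 yield 3 batches.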
Decoder Specifications:
- Vocabulary Size: 119,547
- Max Position Embeddings: 512
- Hidden Dropout Probability: 0.1
- Attention Dropout Probability: 0.1
- Layer Normalization EPS: 1e-12
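The decoder specifications above correspond to a BERT-base-sized configuration with a multilingual vocabulary. A minimal sketch of building that configuration with `transformers.BertConfig` (field names follow the library's config schema):

```python
from transformers import BertConfig

# Decoder configuration matching the specs listed in this card:
# 119,547-token vocabulary, 768 hidden size, 12 layers, 12 heads.
decoder_config = BertConfig(
    vocab_size=119547,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    layer_norm_eps=1e-12,
)
```

The 119,547-token vocabulary is the size used by multilingual BERT checkpoints.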
Accelerator Configuration:
- Even batches: true
- Non-blocking: false
- Split batches: false
- Use seedable sampler: true
Usage
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained('model_name')
model = VisionEncoderDecoderModel.from_pretrained('model_name')

# Load an RGB image containing a line of handwritten text
# (replace the path with your own image)
image = Image.open('handwritten_sample.png').convert('RGB')

# Preprocess the image and generate text
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(texts[0])
Limitations
- Primarily trained on handwritten text samples
- Performance may vary with printed text or unusual fonts
- Best results with clear, legible handwriting
Training Data
To be updated.
Environmental Impact
- GPU: NVIDIA T4 (16 GB VRAM)
- Environment: Google Colab
- Training Time: 1.75 hours
Model Card Contact
For questions about this model, please check the original training logs or contact the model owner.