metadata
sdk: static
title: TexTAR
emoji: 📚
license: mit

We introduce TexTAR, a multi-task, context-aware Transformer for Textual Attribute Recognition (TAR), capable of handling both positional cues (bold, italic) and visual cues (underline, strikeout) in noisy, multilingual document images.

MMTAD Dataset

MMTAD (Multilingual Multi-domain Textual Attribute Dataset) comprises 1,623 real-world document images—from legislative records and notices to textbooks and notary documents—captured under diverse lighting, layout, and noise conditions. It delivers 1,117,716 word-level annotations for two attribute groups:

  • T1: Bold, Italic, Bold & Italic

  • T2: Underline, Strikeout, Underline & Strikeout
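
One way to picture the two groups is as two independent label sets, with every word receiving one label from each. The sketch below is illustrative only: the enum names, the string values, and the extra NONE class for plain words are assumptions, not the dataset's actual label schema.

```python
from enum import Enum


class T1(Enum):
    """Group T1 from the list above; NONE is an assumed class for plain words."""
    NONE = "none"
    BOLD = "bold"
    ITALIC = "italic"
    BOLD_ITALIC = "bold_italic"


class T2(Enum):
    """Group T2 from the list above; NONE is an assumed class for plain words."""
    NONE = "none"
    UNDERLINE = "underline"
    STRIKEOUT = "strikeout"
    UNDERLINE_STRIKEOUT = "underline_strikeout"


def encode_word(bold=False, italic=False, underline=False, strikeout=False):
    """Map four per-word boolean flags to one (T1, T2) label pair."""
    if bold and italic:
        t1 = T1.BOLD_ITALIC
    elif bold:
        t1 = T1.BOLD
    elif italic:
        t1 = T1.ITALIC
    else:
        t1 = T1.NONE

    if underline and strikeout:
        t2 = T2.UNDERLINE_STRIKEOUT
    elif underline:
        t2 = T2.UNDERLINE
    elif strikeout:
        t2 = T2.STRIKEOUT
    else:
        t2 = T2.NONE

    return t1, t2
```

For example, `encode_word(bold=True, underline=True)` yields one label per group, which matches the framing of TAR as two parallel classification tasks rather than a single flat label set.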

Language & Domain Coverage

  • English, Spanish, and six South Asian languages
  • Distribution: 67.4 % Hindi, 8.2 % Telugu, 8.0 % Marathi, 5.9 % Punjabi, 5.4 % Bengali, 5.2 % Gujarati/Tamil/others
  • 300–500 annotated words per image on average

To address class imbalance (e.g., fewer italic or strikeout samples), we apply context-aware augmentations:

  • Shear transforms to generate additional italics
  • Realistic, noisy underline and strikeout overlays

These augmentations preserve document context and mimic real-world distortions, yielding a richer, better-balanced benchmark for textual attribute recognition.
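
The two augmentation ideas listed above can be sketched on a single grayscale word crop. This is a simplified illustration, not the authors' pipeline: the function names, the parameters, and the list-of-rows image representation (0 = ink, 255 = background) are all assumptions, and the real augmentations operate within the full document context.

```python
import random


def shear_right(img, slant=0.25, background=255):
    """Synthesize an italic-like slant by shifting each row to the right
    in proportion to its height above the bottom row of the crop."""
    height = len(img)
    pad = int(slant * (height - 1) + 0.5)  # largest shift, applied to the top row
    out = []
    for y, row in enumerate(img):
        shift = int(slant * (height - 1 - y) + 0.5)
        out.append([background] * shift + list(row) + [background] * (pad - shift))
    return out


def draw_line(img, kind="underline", thickness=1, jitter=1, ink=0, seed=None):
    """Overlay a horizontal stroke with per-column vertical jitter on a copy
    of the crop; the input image is left untouched."""
    if kind == "underline":
        base = len(img) - thickness          # stroke hugs the bottom edge
    elif kind == "strikeout":
        base = (len(img) - thickness) // 2   # stroke runs through the middle
    else:
        raise ValueError(f"unknown kind: {kind!r}")

    rng = random.Random(seed)
    height, width = len(img), len(img[0])
    out = [list(row) for row in img]
    for x in range(width):
        top = base + rng.randint(-jitter, jitter)
        for dy in range(thickness):
            y = min(max(top + dy, 0), height - 1)  # clamp to the crop
            out[y][x] = ink
    return out
```

Calling `shear_right(crop)` widens the crop so no ink is clipped, while `draw_line(crop, kind="strikeout", seed=0)` gives a reproducible, slightly uneven stroke. Setting `jitter=0` produces a perfectly straight line, which is useful for checking placement.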

More Information
For detailed documentation and resources, visit our website: TexTAR

Downloading the Dataset

from datasets import load_dataset

ds = load_dataset("textar/MMTAD")
print(ds)

Dataset contains

  • textar-testset: document images
  • testset_labels.json: a JSON array or dict where each key/entry is an image filename and the value is its annotated attribute labels (bold, italic, underline, strikeout, etc. for each word)
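
Because the label file may be either a JSON array or a dict, it can help to inspect its top-level shape before writing any parsing code. The sketch below assumes `testset_labels.json` has already been downloaded to the working directory; the helper names are hypothetical, and no assumption is made about the per-word label schema itself.

```python
import json


def load_labels(path="testset_labels.json"):
    """Read the label file from a local path (assumed to be downloaded already)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_labels(data):
    """Report the top-level container type, the entry count, and one sample
    entry, without assuming anything about the label schema inside."""
    if isinstance(data, dict):
        sample = next(iter(data.items()), None)  # (filename, labels) pair
        return {"container": "dict", "entries": len(data), "sample": sample}
    if isinstance(data, list):
        sample = data[0] if data else None
        return {"container": "list", "entries": len(data), "sample": sample}
    raise TypeError(f"unexpected top-level JSON type: {type(data).__name__}")
```

Running `summarize_labels(load_labels())` once and reading the sample entry shows how filenames map to word-level labels, which is a safer starting point than guessing the field names.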