---
sdk: static
title: TexTAR
emoji: "📚"
license: mit
---

We introduce **TexTAR**, a multi-task, context-aware Transformer for Textual Attribute Recognition (TAR), capable of handling both positional cues (bold, italic) and visual cues (underline, strikeout) in noisy, multilingual document images.

## MMTAD Dataset

**MMTAD** (Multilingual Multi-domain Textual Attribute Dataset) comprises **1,623** real-world document images, ranging from legislative records and notices to textbooks and notary documents, captured under diverse lighting, layout, and noise conditions. It provides **1,117,716** word-level annotations for two attribute groups:

- **T1**: Bold, Italic, Bold & Italic
- **T2**: Underline, Strikeout, Underline & Strikeout
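
One possible way to turn the two groups into per-word class indices for a multi-task model is sketched below. The class order, the extra `none` class for words without any attribute, and the names `T1_CLASSES`, `T2_CLASSES`, and `encode_word` are illustrative assumptions, not the dataset's official label schema.

```python
# Hypothetical encoding of the two attribute groups. The class order and the
# "none" class are assumptions, not the official MMTAD label schema.
T1_CLASSES = ["none", "bold", "italic", "bold_italic"]
T2_CLASSES = ["none", "underline", "strikeout", "underline_strikeout"]


def encode_word(bold=False, italic=False, underline=False, strikeout=False):
    """Map four boolean attribute flags to a (T1 index, T2 index) pair."""
    t1 = int(bold) + 2 * int(italic)          # 0 none, 1 bold, 2 italic, 3 both
    t2 = int(underline) + 2 * int(strikeout)  # 0 none, 1 underline, 2 strikeout, 3 both
    return t1, t2


t1, t2 = encode_word(bold=True, italic=True, strikeout=True)
print(T1_CLASSES[t1], T2_CLASSES[t2])  # bold_italic strikeout
```

Keeping the groups as two separate indices lets each group be predicted by its own classification head, which is one natural reading of the multi-task setup.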

**Language & Domain Coverage**
- English, Spanish, and six South Asian languages
- Distribution: 67.4% Hindi, 8.2% Telugu, 8.0% Marathi, 5.9% Punjabi, 5.4% Bengali, 5.2% Gujarati/Tamil/others
- 300–500 annotated words per image on average

To address class imbalance (e.g., fewer italic or strikeout samples), we apply **context-aware augmentations**:
- Shear transforms to generate additional italics
- Realistic, noisy underline and strikeout overlays

These augmentations preserve document context and mimic real-world distortions, giving a richer, better-balanced benchmark for textual attribute recognition.
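
As a rough illustration of the two augmentation ideas, the sketch below shears a tiny binary word crop to imitate italics and overlays a jittered underline with occasional gaps. It works on nested lists so that it runs without dependencies; the function names, the shear factor, and the jitter model are assumptions made for illustration and are not the authors' augmentation pipeline, which operates on real document images in context.

```python
import random


def shear_rows(crop, shear=0.5, background=0):
    """Shift each row right in proportion to its height above the bottom row,
    a simple horizontal shear that slants glyphs like italics."""
    height = len(crop)
    max_shift = int(round(shear * (height - 1)))
    sheared = []
    for y, row in enumerate(crop):
        shift = int(round(shear * (height - 1 - y)))  # top rows move furthest
        # Pad on both sides so every output row has the same width.
        sheared.append([background] * shift + list(row) + [background] * (max_shift - shift))
    return sheared


def add_noisy_underline(crop, ink=1, jitter=1, gap_prob=0.1, seed=0):
    """Draw an underline near the bottom row with vertical jitter and random gaps."""
    rng = random.Random(seed)
    out = [list(row) for row in crop]  # copy so the input crop is untouched
    height, width = len(out), len(out[0])
    for x in range(width):
        if rng.random() < gap_prob:
            continue  # leave a gap to mimic broken or faded ink
        y = max(0, height - 1 - rng.randint(0, jitter))
        out[y][x] = ink
    return out


# Toy 4x4 "word" crop: 1 is ink, 0 is background.
word = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
italic = shear_rows(word, shear=0.5)
underlined = add_noisy_underline(word)
for row in italic:
    print("".join("#" if p else "." for p in row))
```

A strikeout overlay could follow the same pattern by drawing the line around the vertical middle of the crop instead of near the bottom.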

**More Information**
For detailed documentation and resources, visit our website: [TexTAR](https://tex-tar.github.io/)

**Downloading the Dataset**
```python
# Load MMTAD from the Hugging Face Hub (requires the `datasets` package)
from datasets import load_dataset

ds = load_dataset("textar/MMTAD")
print(ds)  # summary of the loaded splits and features
```

The dataset contains:
- `textar-testset`: document images
- `testset_labels.json`: a JSON array or dict in which each entry corresponds to an image filename and holds the attribute labels (bold, italic, underline, strikeout, etc.) annotated for each word in that image
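
Since the label file may be either an array or a dict, a loader can normalise both shapes before use. The sketch below is a minimal example under assumed field names (`filename`, `words`, and per-word boolean flags such as `bold`); these names are hypothetical, so the actual schema of `testset_labels.json` should be inspected before relying on it.

```python
import json
from collections import Counter


def normalize_labels(data):
    """Return {image filename: list of word annotations} from parsed JSON
    that is either a dict or an array of per-image entries."""
    if isinstance(data, dict):
        return data
    # Assumed array layout: [{"filename": ..., "words": [...]}, ...]
    return {entry["filename"]: entry["words"] for entry in data}


def count_attributes(labels, attributes=("bold", "italic", "underline", "strikeout")):
    """Count how many annotated words carry each attribute flag."""
    counts = Counter()
    for words in labels.values():
        for word in words:
            for name in attributes:
                if word.get(name):
                    counts[name] += 1
    return counts


# Inline sample in the assumed array layout (hypothetical data, not from MMTAD).
sample = json.loads(
    '[{"filename": "page_001.png", "words": ['
    '{"text": "Notice", "bold": true, "underline": true}, '
    '{"text": "of", "italic": true}]}]'
)
labels = normalize_labels(sample)
counts = count_attributes(labels)
print(dict(counts))
```

With the real file, the same functions would be applied to the result of `json.load` on `testset_labels.json`, after adjusting the field names to match the actual annotations.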