Commit ·
4f8f851
1
Parent(s): 32fc2af
added readme
README.md
CHANGED
|
@@ -1,10 +1,35 @@
|
|
| 1 |
-
--
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
-
|
| 9 |
-
|
| 10 |
-
|
|
| 1 |
+
We introduce **TexTAR**, a multi-task, context-aware Transformer for Textual Attribute Recognition (TAR),
|
| 2 |
+
capable of handling both positional cues (bold, italic) and visual cues (underline, strikeout) in
|
| 3 |
+
noisy, multilingual document images.
|
| 4 |
+
## MMTAD Dataset
|
| 5 |
+
|
| 6 |
+
**MMTAD** (Multilingual Multi-domain Textual Attribute Dataset) comprises **1,623** real-world document images, ranging from legislative records and notices to textbooks and notary documents, captured under diverse lighting, layout, and noise conditions. It provides **1,117,716** word-level annotations for two attribute groups:
|
| 7 |
+
|
| 8 |
+
- **T1**: Bold, Italic, Bold & Italic
|
| 9 |
+
|
| 10 |
+
- **T2**: Underline, Strikeout, Underline & Strikeout
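
As a rough illustration of the two attribute groups, the sketch below maps four per-word flags to one label per group. The function name `group_labels`, the label strings, and the extra `"none"` class for words without an attribute are assumptions made for this example; they are not taken from the dataset or the TexTAR code.

```python
# Illustrative only: label names and the "none" class are assumptions.
T1_LABELS = {
    (False, False): "none",
    (True, False): "bold",
    (False, True): "italic",
    (True, True): "bold_italic",
}
T2_LABELS = {
    (False, False): "none",
    (True, False): "underline",
    (False, True): "strikeout",
    (True, True): "underline_strikeout",
}


def group_labels(bold, italic, underline, strikeout):
    """Map four per-word flags to a (T1, T2) label pair."""
    t1 = T1_LABELS[(bool(bold), bool(italic))]
    t2 = T2_LABELS[(bool(underline), bool(strikeout))]
    return t1, t2


# A word that is bold, italic, and struck out gets one label in each group.
print(group_labels(True, True, False, True))
```

Treating T1 and T2 as separate label spaces means a word can carry one label from each group at the same time, which matches the multi-task framing above.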
|
| 11 |
+
|
| 12 |
+
**Language & Domain Coverage**
|
| 13 |
+
- English, Spanish, and six South Asian languages
|
| 14 |
+
- Distribution: 67.4% Hindi, 8.2% Telugu, 8.0% Marathi, 5.9% Punjabi, 5.4% Bengali, 5.2% Gujarati/Tamil/others
|
| 15 |
+
- 300–500 annotated words per image on average
|
| 16 |
+
|
| 17 |
+
To address class imbalance (e.g., fewer italic or strikeout samples), we apply **context-aware augmentations**:
|
| 18 |
+
- Shear transforms to generate additional italics
|
| 19 |
+
- Realistic, noisy underline and strikeout overlays
|
| 20 |
+
|
| 21 |
+
These augmentations preserve document context and mimic real-world distortions, giving a more balanced benchmark for textual attribute recognition.
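
The two augmentation ideas can be sketched on a toy pixel grid. This is a minimal illustration, not the pipeline used to build MMTAD: the function names `shear_rows` and `add_noisy_underline`, their parameters, and the list-of-lists image representation are all assumptions for this example.

```python
import random


def shear_rows(image, slant=0.25, fill=0):
    """Shear a 2D pixel grid to the right, anchored at the bottom row.

    Rows nearer the top are shifted further right, which gives upright
    glyphs an italic-like slant. The output is wider than the input.
    """
    height = len(image)
    max_shift = int(round(slant * (height - 1))) if height else 0
    sheared = []
    for y, row in enumerate(image):
        shift = int(round(slant * (height - 1 - y)))
        sheared.append([fill] * shift + list(row) + [fill] * (max_shift - shift))
    return sheared


def add_noisy_underline(image, row, ink=1, drop_prob=0.1, seed=0):
    """Draw a horizontal stroke on `row`, skipping random pixels.

    The skipped pixels imitate a broken or faded line. Use a row below the
    text for an underline, or a row through the middle for a strikeout.
    """
    rng = random.Random(seed)
    out = [list(r) for r in image]  # copy so the input is not modified
    for x in range(len(out[row])):
        if rng.random() >= drop_prob:
            out[row][x] = ink
    return out


# Toy 3x2 "word" made of ink pixels, sheared by one pixel per row.
word = [[1, 1], [1, 1], [1, 1]]
for line in shear_rows(word, slant=1.0):
    print(line)
```

On real document crops the same operations would normally be done with an image library's affine transform and line-drawing routines, applied in a way that keeps the surrounding words intact.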
|
| 22 |
+
|
| 23 |
+
**More Information**
|
| 24 |
+
For detailed documentation and resources, visit our website: [TexTAR](https://tex-tar.github.io/)
|
| 25 |
+
|
| 26 |
+
**Downloading the Dataset**
|
| 27 |
+
```python
|
| 28 |
+
from datasets import load_dataset
|
| 29 |
+
|
| 30 |
+
ds = load_dataset("textar/MMTAD")
|
| 31 |
+
print(ds)
|
| 32 |
+
```
|
| 33 |
+
The dataset contains:
|
| 34 |
+
- `textar-testset`: document images
|
| 35 |
+
- `testset_labels.json`: a JSON array or dict in which each entry maps an image filename to the attribute labels (bold, italic, underline, strikeout, etc.) annotated for each of its words
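
Once downloaded, the label file could be explored along the following lines. The schema used here is an assumption for illustration: a dict keyed by image filename, whose values are lists of word records with hypothetical `word` and `attributes` fields. The real layout of `testset_labels.json` may differ, so inspect the file before adapting this.

```python
import json
from collections import Counter

# Hypothetical sample; filenames and field names are made up for this sketch.
SAMPLE = """
{
  "page_001.png": [
    {"word": "Notice", "attributes": ["bold", "underline"]},
    {"word": "of", "attributes": []},
    {"word": "Hearing", "attributes": ["bold"]}
  ],
  "page_002.png": [
    {"word": "void", "attributes": ["strikeout"]}
  ]
}
"""


def count_attributes(labels):
    """Count how often each attribute appears across all annotated words."""
    counts = Counter()
    for words in labels.values():
        for word in words:
            counts.update(word["attributes"])
    return counts


labels = json.loads(SAMPLE)
counts = count_attributes(labels)
print(counts.most_common())
```

With the real file, `json.loads(SAMPLE)` would be replaced by `json.load` on the opened `testset_labels.json`, with the field access adjusted to the actual schema.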
|