Swaroopa-jinka committed on
Commit 4f8f851 · 1 Parent(s): 32fc2af

added readme

Files changed (1)
  1. README.md +35 -10
README.md CHANGED
@@ -1,10 +1,35 @@
- ---
- title: README
- emoji: 👀
- colorFrom: purple
- colorTo: red
- sdk: static
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ We introduce **TexTAR**, a multi-task, context-aware Transformer for Textual Attribute Recognition (TAR),
+ capable of handling both positional cues (bold, italic) and visual cues (underline, strikeout) in
+ noisy, multilingual document images.
+ ## MMTAD Dataset
+
+ **MMTAD** (Multilingual Multi-domain Textual Attribute Dataset) comprises **1,623** real-world document images, ranging from legislative records and notices to textbooks and notary documents, captured under diverse lighting, layout, and noise conditions. It provides **1,117,716** word-level annotations for two attribute groups:
+
+ - **T1**: Bold, Italic, Bold & Italic
+
+ - **T2**: Underline, Strikeout, Underline & Strikeout
+
+ **Language & Domain Coverage**
+ - English, Spanish, and six South Asian languages
+ - Distribution: 67.4% Hindi, 8.2% Telugu, 8.0% Marathi, 5.9% Punjabi, 5.4% Bengali, 5.2% Gujarati/Tamil/others
+ - 300–500 annotated words per image on average
+
+ To address class imbalance (e.g., fewer italic or strikeout samples), we apply **context-aware augmentations**:
+ - Shear transforms to generate additional italics
+ - Realistic, noisy underline and strikeout overlays
+
+ These augmentations preserve document context and mimic real-world distortions, yielding a more balanced benchmark for textual attribute recognition.
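Editor's note, not part of the commit: the shear idea mentioned above can be illustrated with a minimal sketch. It assumes a word crop is a small binary mask stored as a list of rows; `shear_rows`, its `slant` parameter, and the toy `upright` mask are hypothetical names chosen for illustration. This is not the authors' augmentation code, which presumably works on real image crops with an imaging library.

```python
def shear_rows(mask, slant=0.3):
    """Slant a binary mask to the right by shifting upper rows further than lower ones.

    Hypothetical helper: `mask` is a list of equal-length rows of 0/1 values,
    and `slant` is the horizontal shift (in pixels) per row of height.
    """
    height = len(mask)
    max_shift = round(slant * (height - 1))
    sheared = []
    for y, row in enumerate(mask):
        # The bottom row (the baseline) stays put; the top row moves the most.
        shift = round(slant * (height - 1 - y))
        sheared.append([0] * shift + list(row) + [0] * (max_shift - shift))
    return sheared


# Toy 3x2 "upright" stroke, sheared by one pixel per row.
upright = [
    [1, 1],
    [1, 1],
    [1, 1],
]
slanted = shear_rows(upright, slant=1.0)
for row in slanted:
    print(row)
```

The output mask is wider than the input by the maximum shift, so no stroke pixels are cropped away; a real pipeline would also need to decide how to blend the widened crop back into the page.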
+
+ **More Information**
+ For detailed documentation and resources, visit our website: [TexTAR](https://tex-tar.github.io/)
+
+ **Downloading the Dataset**
+ ```python
+ from datasets import load_dataset
+
+ ds = load_dataset("textar/MMTAD")
+ print(ds)
+ ```
+ The dataset contains:
+ - `textar-testset`: document images
+ - `testset_labels.json`: a JSON file (array or dict) in which each key/entry is an image filename and the value is that image's annotated attribute labels (bold, italic, underline, strikeout, etc.) for each word
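
Editor's note, not part of the commit: since the README leaves open whether `testset_labels.json` is an array or a dict, a reader might normalize both forms before use. The sketch below rests on assumptions throughout. The array form is taken to be a list of one-entry `{filename: labels}` dicts, each word is assumed to carry an `"attributes"` list, and the filenames and keys in `sample` are invented for illustration. The actual schema should be checked against the downloaded file.

```python
import json
from collections import Counter


def normalize_labels(data):
    """Return {image filename: labels} from either assumed JSON layout."""
    if isinstance(data, dict):
        return dict(data)
    # Assumed array layout: a list of one-entry dicts, {filename: labels}.
    merged = {}
    for entry in data:
        merged.update(entry)
    return merged


def load_labels(path):
    """Read a label file from disk and normalize it."""
    with open(path, encoding="utf-8") as f:
        return normalize_labels(json.load(f))


def count_attributes(labels_by_image):
    """Tally attribute names over all words (assumes an "attributes" list per word)."""
    counts = Counter()
    for words in labels_by_image.values():
        for word in words:
            counts.update(word["attributes"])
    return counts


# Hypothetical in-memory stand-in for the real label file.
sample = {
    "page_001.png": [
        {"text": "Notice", "attributes": ["bold", "underline"]},
        {"text": "draft", "attributes": ["italic"]},
    ],
    "page_002.png": [
        {"text": "void", "attributes": ["strikeout", "bold"]},
    ],
}
attribute_counts = count_attributes(normalize_labels(sample))
print(attribute_counts)
```

With the real file, `load_labels("testset_labels.json")` would replace the in-memory `sample`, and the per-word key names would need adjusting to whatever the file actually uses.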