Swaroopa-jinka committed on Jul 18, 2025

Commit

4f8f851

1 Parent(s): 32fc2af

added readme

Files changed (1)

README.md +35 -10

@@ -1,10 +1,35 @@
----
-title: README
-emoji: 👀
-colorFrom: purple
-colorTo: red
-sdk: static
-pinned: false
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+We introduce **TexTAR**, a multi-task, context-aware Transformer for Textual Attribute Recognition (TAR),
+capable of handling both positional cues (bold, italic) and visual cues (underline, strikeout) in
+noisy, multilingual document images.
+## MMTAD Dataset
+**MMTAD** (Multilingual Multi-domain Textual Attribute Dataset) comprises **1,623** real-world document images—from legislative records and notices to textbooks and notary documents—captured under diverse lighting, layout, and noise conditions. It delivers **1,117,716** word-level annotations for two attribute groups:
+- **T1**: Bold, Italic, Bold & Italic
+- **T2**: Underline, Strikeout, Underline & Strikeout
+**Language & Domain Coverage**
+- English, Spanish, and six South Asian languages
+- Distribution: 67.4 % Hindi, 8.2 % Telugu, 8.0 % Marathi, 5.9 % Punjabi, 5.4 % Bengali, 5.2 % Gujarati/Tamil/others
+- 300–500 annotated words per image on average
+To address class imbalance (e.g., fewer italic or strikeout samples), we apply **context-aware augmentations**:
+- Shear transforms to generate additional italics
+- Realistic, noisy underline and strikeout overlays
+These augmentations preserve document context and mimic real-world distortions, ensuring a rich, balanced benchmark for textual attribute recognition.
+**More Information**
+For detailed documentation and resources, visit our website: [TexTAR](https://tex-tar.github.io/)
+**Downloading the Dataset**
+```python
+from datasets import load_dataset
+ds = load_dataset("textar/MMTAD")
+print(ds)
+```
+The dataset contains:
+- `textar-testset`: document images
+- `testset_labels.json`: a JSON array or dict where each key/entry is an image filename and the value is its annotated attribute labels (bold, italic, underline, strikeout, etc. for each word)
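
The README leaves open whether `testset_labels.json` is a JSON array or a dict, so a reader may want to normalize both shapes before use. The following is a minimal sketch, not part of the commit: the `filename` field used for the array case and the local file path are assumptions made for illustration, and the real schema should be checked against the downloaded file.

```python
import json
from typing import Any, Dict


def normalize_labels(raw: Any, name_key: str = "filename") -> Dict[str, Any]:
    """Return a {image filename: labels} mapping from either JSON shape.

    `name_key` is a hypothetical field name for the array form; adjust it
    to whatever the actual file uses.
    """
    if isinstance(raw, dict):
        # Dict form: keys are already image filenames.
        return dict(raw)
    if isinstance(raw, list):
        # Array form: assume each entry carries its own filename field.
        labels = {}
        for entry in raw:
            entry = dict(entry)           # copy so the input is not mutated
            name = entry.pop(name_key)    # raises KeyError if the field is absent
            labels[name] = entry
        return labels
    raise TypeError(f"unsupported label container: {type(raw).__name__}")


def load_labels(path: str) -> Dict[str, Any]:
    """Read a labels file from disk and normalize it."""
    with open(path, encoding="utf-8") as f:
        return normalize_labels(json.load(f))
```

With a local copy of the file, `load_labels("testset_labels.json")` would then give one lookup table keyed by image filename regardless of which shape the file takes.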