	---
	sdk: static
	title: TexTAR
	emoji: "📚"
	license: mit
	---

	We introduce TexTAR, a multi-task, context-aware Transformer for Textual Attribute Recognition (TAR),
	capable of handling both positional cues (bold, italic) and visual cues (underline, strikeout) in
	noisy, multilingual document images.
	## MMTAD Dataset

	MMTAD (Multilingual Multi-domain Textual Attribute Dataset) comprises 1,623 real-world document images—from legislative records and notices to textbooks and notary documents—captured under diverse lighting, layout, and noise conditions. It delivers 1,117,716 word-level annotations for two attribute groups:

	- T1: Bold, Italic, Bold & Italic
	- T2: Underline, Strikeout, Underline & Strikeout
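
One way to picture the two attribute groups is as two independent labels per word, one from T1 and one from T2. The sketch below is illustrative only: the class names, their integer order, and the explicit `NONE` class are assumptions, not the encoding used in the released labels.

```python
from enum import IntEnum


class T1(IntEnum):
    """Hypothetical class ids for the bold/italic group."""
    NONE = 0
    BOLD = 1
    ITALIC = 2
    BOLD_ITALIC = 3


class T2(IntEnum):
    """Hypothetical class ids for the underline/strikeout group."""
    NONE = 0
    UNDERLINE = 1
    STRIKEOUT = 2
    UNDERLINE_STRIKEOUT = 3


def encode_word(bold=False, italic=False, underline=False, strikeout=False):
    """Map four boolean flags to one (T1, T2) label pair."""
    # Bit 0 is bold/underline and bit 1 is italic/strikeout,
    # which matches the (assumed) enum order above.
    t1 = T1(int(bold) + 2 * int(italic))
    t2 = T2(int(underline) + 2 * int(strikeout))
    return t1, t2


t1, t2 = encode_word(bold=True, italic=True)
print(t1.name, t2.name)  # BOLD_ITALIC NONE
```

Keeping the groups separate means a word that is both italic and underlined gets one label from each group rather than needing a combined class.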

	### Language & Domain Coverage
	- English, Spanish, and six South Asian languages
	- Distribution: 67.4 % Hindi, 8.2 % Telugu, 8.0 % Marathi, 5.9 % Punjabi, 5.4 % Bengali, 5.2 % Gujarati/Tamil/others
	- 300–500 annotated words per image on average

	To address class imbalance (e.g., fewer italic or strikeout samples), we apply context-aware augmentations:
	- Shear transforms to generate additional italics
	- Realistic, noisy underline and strikeout overlays

	These augmentations preserve document context and mimic real-world distortions, which helps balance the under-represented attribute classes in the benchmark.
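
The two augmentation ideas can be sketched on a toy example. This is a minimal illustration, not the project's augmentation code: it works on a small binary grid in pure Python so that it runs without an imaging library, and the function names, the `factor`, `jitter`, and `dropout` parameters, and the noise model are all assumptions.

```python
import random


def shear_right(glyph, factor=0.5):
    """Slant a binary word crop to the right, as a crude synthetic italic.

    `glyph` is a list of equal-length rows of 0/1 pixels (top row first).
    Each row is shifted right in proportion to its height above the bottom row.
    """
    height = len(glyph)
    max_shift = int(factor * (height - 1) + 0.5)
    sheared = []
    for y, row in enumerate(glyph):
        shift = int(factor * (height - 1 - y) + 0.5)
        # Pad both sides so every output row has the same width.
        sheared.append([0] * shift + list(row) + [0] * (max_shift - shift))
    return sheared


def add_noisy_line(glyph, row, rng, jitter=1, dropout=0.1):
    """Overlay a roughly horizontal stroke with vertical jitter and gaps."""
    out = [list(r) for r in glyph]  # copy, so the input crop is not modified
    height = len(out)
    for x in range(len(out[0])):
        if rng.random() < dropout:
            continue  # leave a gap, like a faded or broken pen stroke
        y = min(height - 1, max(0, row + rng.randint(-jitter, jitter)))
        out[y][x] = 1
    return out


rng = random.Random(0)  # fixed seed so the sketch is repeatable
word = [[1, 1, 1, 1] for _ in range(5)]
italic = shear_right(word, factor=0.5)
underlined = add_noisy_line(word, row=len(word) - 1, rng=rng)
struck = add_noisy_line(word, row=len(word) // 2, rng=rng)
for r in italic:
    print("".join("#" if p else "." for p in r))
```

On real document crops the same steps would more likely be done with an affine transform and drawn strokes in an imaging library. The grid version only shows the geometry: a height-proportional horizontal shift for italics, and a jittered, gapped stroke near the bottom or middle of the crop for underline and strikeout.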

	## More Information

	For detailed documentation and resources, visit our website: [TexTAR](https://tex-tar.github.io/)

	## Downloading the Dataset
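	```python
	# Requires the Hugging Face `datasets` package (pip install datasets).
	from datasets import load_dataset

	# Download MMTAD from the Hugging Face Hub and print a summary of what was loaded.
	ds = load_dataset("textar/MMTAD")
	print(ds)
	```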
	The dataset contains:
	- `textar-testset`: the document images
	- `testset_labels.json`: a JSON array or dict in which each key or entry corresponds to an image filename, and the value holds the attribute labels (bold, italic, underline, strikeout, etc.) annotated for each word in that image
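
Since the description leaves open whether `testset_labels.json` is an array or a dict, it can help to normalise both shapes before using the labels. The sketch below is not based on the actual file: the `image` and `words` keys, the per-word `text` and `attrs` fields, and the sample data are all hypothetical and would need to be adapted after inspecting the real JSON.

```python
import json
from collections import Counter

# Hypothetical sample; the real file's field names may differ.
SAMPLE = """
{
  "page_001.png": [
    {"text": "Notice", "attrs": ["bold", "underline"]},
    {"text": "hereby", "attrs": []},
    {"text": "void", "attrs": ["strikeout"]}
  ]
}
"""


def by_filename(labels, name_key="image", words_key="words"):
    """Return a dict of filename -> word labels for either JSON shape."""
    if isinstance(labels, dict):
        return labels
    # Array form: assume each entry carries its own filename (hypothetical keys).
    return {entry[name_key]: entry[words_key] for entry in labels}


def attribute_counts(labels):
    """Count how often each attribute occurs across all words."""
    counts = Counter()
    for words in by_filename(labels).values():
        for word in words:
            counts.update(word["attrs"])
    return counts


counts = attribute_counts(json.loads(SAMPLE))
print(dict(counts))
```

A per-attribute count like this is a quick way to see the class imbalance in whichever split is loaded.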