	---
	title: README
	emoji: 🚀
	colorFrom: gray
	colorTo: red
	sdk: static
	pinned: false
	---

	## tasksource: 600+ dataset harmonization preprocessings with structured annotations for frictionless extreme multi-task learning and evaluation

	Hugging Face Datasets is a great library, but it lacks standardization, so datasets need preprocessing before they can be used interchangeably.
	`tasksource` automates this and facilitates reproducible multi-task learning scaling.

	Each dataset is standardized to a `MultipleChoice`, `Classification`, or `TokenClassification` dataset with identical fields. Generation tasks are not supported, as they are addressed by [promptsource](https://github.com/bigscience-workshop/promptsource). All implemented preprocessings are in [tasks.py](https://github.com/sileod/tasksource/blob/main/src/tasksource/tasks.py) or [tasks.md](https://github.com/sileod/tasksource/blob/main/tasks.md). A preprocessing is a function that accepts a dataset and returns the standardized dataset. Preprocessing code is concise and human-readable.
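To illustrate the idea of a preprocessing as "a function that accepts a dataset and returns the standardized dataset", here is a minimal sketch in plain Python. It does not use the tasksource API: the helper `make_classification_preprocessing`, the target field names (`sentence1`, `sentence2`, `labels`), and the toy rows are all assumptions made for this example.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Dataset = List[Row]  # stand-in for a Hugging Face dataset split


def make_classification_preprocessing(
    sentence1: str, sentence2: str, labels: str
) -> Callable[[Dataset], Dataset]:
    """Build a function that renames source columns to a shared schema.

    The target field names are assumptions made for this sketch,
    not a statement of the actual tasksource schema.
    """
    mapping = {sentence1: "sentence1", sentence2: "sentence2", labels: "labels"}

    def preprocess(dataset: Dataset) -> Dataset:
        # Keep only the mapped columns, under their standardized names.
        return [{new: row[old] for old, new in mapping.items()} for row in dataset]

    return preprocess


# Two toy datasets with different column names (hypothetical data).
nli_rows = [{"premise": "A man sleeps.", "hypothesis": "A man is awake.", "label": 2}]
para_rows = [
    {"question1": "How old are you?", "question2": "What is your age?", "is_duplicate": 1}
]

nli_standard = make_classification_preprocessing("premise", "hypothesis", "label")(nli_rows)
para_standard = make_classification_preprocessing(
    "question1", "question2", "is_duplicate"
)(para_rows)

# After preprocessing, both datasets expose the same fields.
print(sorted(nli_standard[0]) == sorted(para_standard[0]))
```

Once every source dataset is mapped to the same fields, downstream training or evaluation code can be written once and reused across tasks.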

	GitHub: https://github.com/sileod/tasksource

	### Installation and usage:
	`pip install tasksource`
	```python
	from tasksource import list_tasks, load_task
	df = list_tasks()

	for task_id in df[df.task_type == "MultipleChoice"].id:
	    dataset = load_task(task_id)
	    # all yielded datasets can be used interchangeably
	```

	See the 600+ supported tasks in [tasks.md](https://github.com/sileod/tasksource/blob/main/tasks.md) (200+ MultipleChoice tasks, 200+ Classification tasks), and feel free to request a new task. Datasets are downloaded to `$HF_DATASETS_CACHE` (like any Hugging Face dataset), so make sure more than 100 GB of space is available there.
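Before downloading many tasks, it can help to check how much room the cache location has. The sketch below uses only the Python standard library. The fallback path `~/.cache/huggingface/datasets` is assumed to be the default cache location, and the helper `free_gigabytes` and the `REQUIRED_GB` threshold are names introduced for this example.

```python
import os
import shutil
from pathlib import Path

# Use HF_DATASETS_CACHE if set; otherwise fall back to the assumed default location.
cache_dir = Path(
    os.environ.get(
        "HF_DATASETS_CACHE",
        Path.home() / ".cache" / "huggingface" / "datasets",
    )
)


def free_gigabytes(path: Path) -> float:
    """Return the free space (in GB) on the filesystem that holds `path`."""
    # Walk up to the nearest existing directory, since the cache may not exist yet.
    probe = path
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1e9


REQUIRED_GB = 100  # threshold taken from the space guidance above

free = free_gigabytes(cache_dir)
if free < REQUIRED_GB:
    print(f"Only {free:.0f} GB free at {cache_dir}; consider pointing HF_DATASETS_CACHE elsewhere.")
else:
    print(f"{free:.0f} GB free at {cache_dir}.")
```

If space is short, setting `HF_DATASETS_CACHE` to a larger disk in the shell, before Python imports `datasets`, should redirect the downloads.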

	### Pretrained model:

	A text encoder pretrained on tasksource reached state-of-the-art results: [🤗/deberta-v3-base-tasksource-nli](https://hf.co/sileod/deberta-v3-base-tasksource-nli)

	### Contact and citation
	I can help you integrate tasksource into your experiments: `damien.sileo@inria.fr`

	More details are in this [article](https://aclanthology.org/2024.lrec-main.1361/):
	```bib
	@inproceedings{sileo-2024-tasksource-large,
	    title = "tasksource: A Large Collection of {NLP} tasks with a Structured Dataset Preprocessing Framework",
	    author = "Sileo, Damien",
	    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
	    month = may,
	    year = "2024",
	    address = "Torino, Italia",
	    publisher = "ELRA and ICCL",
	    url = "https://aclanthology.org/2024.lrec-main.1361",
	    pages = "15655--15684",
	}
	```