| title: README | |
| emoji: π | |
| colorFrom: gray | |
| colorTo: red | |
| sdk: static | |
| pinned: false | |
| ## tasksource: 600+ dataset harmonization preprocessings with structured annotations for frictionless extreme multi-task learning and evaluation | |
| Hugging Face Datasets is a great library, but it lacks standardization, and datasets require preprocessing work before they can be used interchangeably. | |
| `tasksource` automates this and facilitates reproducible multi-task learning scaling. | |
| Each dataset is standardized to a `MultipleChoice`, `Classification`, or `TokenClassification` dataset with identical fields. We do not support generation tasks, as they are addressed by [promptsource](https://github.com/bigscience-workshop/promptsource). All implemented preprocessings are listed in [tasks.py](https://github.com/sileod/tasksource/blob/main/src/tasksource/tasks.py) and [tasks.md](https://github.com/sileod/tasksource/blob/main/tasks.md). A preprocessing is a function that accepts a dataset and returns the standardized dataset. Preprocessing code is concise and human-readable. | |
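To make the idea of "a function that accepts a dataset and returns the standardized dataset" concrete, here is a minimal sketch in plain Python. It is not the tasksource implementation: the factory `make_classification_preprocessing`, the output field names `sentence1`, `sentence2`, and `labels`, and the toy rows are all assumptions chosen for illustration, and datasets are modeled as simple lists of dictionaries.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
Preprocessing = Callable[[List[Row]], List[Row]]


def make_classification_preprocessing(
    sentence1: str, labels: str, sentence2: Optional[str] = None
) -> Preprocessing:
    """Build a function that maps source columns onto one shared schema.

    Hypothetical helper: the target field names are illustrative assumptions.
    """

    def preprocess(rows: List[Row]) -> List[Row]:
        standardized = []
        for row in rows:
            out: Row = {"sentence1": row[sentence1], "labels": row[labels]}
            if sentence2 is not None:
                out["sentence2"] = row[sentence2]
            standardized.append(out)
        return standardized

    return preprocess


# Two toy "datasets" with different column names (made-up rows).
nli_rows = [{"premise": "A man sleeps.", "hypothesis": "A man is awake.", "gold": 2}]
sentiment_rows = [{"text": "Great movie.", "score": 1}]

# One declarative mapping per dataset; both outputs share the same fields.
nli = make_classification_preprocessing("premise", "gold", sentence2="hypothesis")
sentiment = make_classification_preprocessing("text", "score")

print(nli(nli_rows))
print(sentiment(sentiment_rows))
```

Because each preprocessing only declares which source column maps to which shared field, downstream training code can treat every standardized dataset the same way.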
| GitHub: https://github.com/sileod/tasksource | |
| ### Installation and usage: | |
| `pip install tasksource` | |
| ```python | |
| from tasksource import list_tasks, load_task | |
| df = list_tasks()  # table of available tasks | |
| for task_id in df[df.task_type == "MultipleChoice"].id: | |
|     dataset = load_task(task_id) | |
|     # all yielded datasets can be used interchangeably | |
| ``` | |
| See the 600+ supported tasks in [tasks.md](https://github.com/sileod/tasksource/blob/main/tasks.md) (200+ MultipleChoice tasks, 200+ Classification tasks) and feel free to request a new task. Datasets are downloaded to `$HF_DATASETS_CACHE` (like any Hugging Face dataset), so make sure you have more than 100GB of free space there. | |
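Before loading many tasks, it may help to check how much space is free at the cache location. The sketch below uses only the Python standard library. The helper names `cache_dir` and `free_gb` are hypothetical, and the fallback path `~/.cache/huggingface/datasets` is a simplifying assumption that ignores other settings (such as `HF_HOME`) that can move the cache.

```python
import os
import shutil
from pathlib import Path


def cache_dir() -> Path:
    """Return $HF_DATASETS_CACHE, or an assumed default location if it is unset."""
    default = Path.home() / ".cache" / "huggingface" / "datasets"
    return Path(os.environ.get("HF_DATASETS_CACHE", default)).expanduser()


def free_gb(path: Path) -> float:
    """Free space in GB on the filesystem that holds `path`."""
    # disk_usage needs an existing path, so walk up to the nearest existing parent.
    probe = path
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1e9


target = cache_dir()
available = free_gb(target)
print(f"{target}: {available:.0f} GB free")
if available < 100:
    print("Warning: less than 100 GB free; loading every task may fill the disk.")
```

If space is short, pointing `HF_DATASETS_CACHE` at a larger volume before importing the library is one option.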
| ### Pretrained model: | |
| A text encoder pretrained on tasksource reached state-of-the-art results: [sileod/deberta-v3-base-tasksource-nli](https://hf.co/sileod/deberta-v3-base-tasksource-nli) | |
| ### Contact and citation | |
| I can help you integrate tasksource into your experiments: `damien.sileo@inria.fr` | |
| More details are available in this [article](https://aclanthology.org/2024.lrec-main.1361/). | |
| ```bib | |
| @inproceedings{sileo-2024-tasksource-large, | |
| title = "tasksource: A Large Collection of {NLP} tasks with a Structured Dataset Preprocessing Framework", | |
| author = "Sileo, Damien", | |
| booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)", | |
| month = may, | |
| year = "2024", | |
| address = "Torino, Italia", | |
| publisher = "ELRA and ICCL", | |
| url = "https://aclanthology.org/2024.lrec-main.1361", | |
| pages = "15655--15684", | |
| } | |
| ``` |