pinned: false
---

## tasksource: 500+ dataset harmonization preprocessings with structured annotations for frictionless extreme multi-task learning and evaluation

Hugging Face Datasets is a great library, but it lacks standardization, and datasets require preprocessing work before they can be used interchangeably.
`tasksource` automates this preprocessing and facilitates scaling up reproducible multi-task learning.

Each dataset is standardized to a `MultipleChoice`, `Classification`, or `TokenClassification` dataset with identical fields. We do not support generation tasks, as they are addressed by [promptsource](https://github.com/bigscience-workshop/promptsource). All implemented preprocessings are listed in [tasks.py](https://github.com/sileod/tasksource/blob/main/src/tasksource/tasks.py) and [tasks.md](https://github.com/sileod/tasksource/blob/main/tasks.md). A preprocessing is a function that accepts a dataset and returns the standardized dataset. Preprocessing code is concise and human-readable.
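
To illustrate the idea (a hypothetical sketch, not tasksource's actual code — the real preprocessings live in tasks.py), a harmonization step for an RTE-style entailment dataset could map raw fields to a standardized `Classification` schema; the field names `sentence1`, `sentence2`, and `labels` below are assumptions for illustration:

```python
def preprocess_rte(example):
    """Map one raw RTE-style example to an assumed standardized
    Classification schema (sentence1, sentence2, labels)."""
    return {
        "sentence1": example["premise"],
        "sentence2": example["hypothesis"],
        "labels": example["label"],
    }

# A raw example as it might appear in the source dataset:
raw = {"premise": "A man is sleeping.", "hypothesis": "Someone rests.", "label": 0}
standardized = preprocess_rte(raw)
```

In practice such a function would be applied over a whole dataset split; the point is that each preprocessing is a small, readable field mapping.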

### Installation and usage:
`pip install tasksource`
```python
from tasksource import list_tasks, load_task

df = list_tasks()

for id in df[df.task_type == "MultipleChoice"].id:
    dataset = load_task(id)
    # all yielded datasets can be used interchangeably
```
See the 500+ supported tasks in [tasks.md](https://github.com/sileod/tasksource/blob/main/tasks.md) (200+ MultipleChoice tasks, 200+ Classification tasks) and feel free to request a new task. Datasets are downloaded to `$HF_DATASETS_CACHE` (as with any Hugging Face dataset), so make sure you have more than 100GB of free space there.
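
If the default cache disk is too small, the `HF_DATASETS_CACHE` environment variable (standard Hugging Face `datasets` behavior, not tasksource-specific) can redirect downloads to a roomier location; the path below is a hypothetical example:

```python
import os

# Redirect the Hugging Face datasets cache to a larger disk (hypothetical path).
# This must be set before `datasets` (and hence tasksource) is imported.
os.environ["HF_DATASETS_CACHE"] = "/mnt/bigdisk/hf_datasets"
```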

### Pretrained model:

A text encoder pretrained on tasksource reached state-of-the-art results: [🤗/deberta-v3-base-tasksource-nli](https://hf.co/sileod/deberta-v3-base-tasksource-nli)
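
An NLI-trained encoder like this can typically be used for zero-shot classification through the `transformers` pipeline API. A sketch under that assumption (requires `transformers` installed, and network access on first call to download the weights):

```python
def load_zero_shot_classifier():
    """Build a zero-shot classification pipeline backed by the
    tasksource NLI encoder. Downloads weights on first call."""
    from transformers import pipeline  # deferred so defining this stays lightweight
    return pipeline("zero-shot-classification",
                    model="sileod/deberta-v3-base-tasksource-nli")

# Usage (not run here, as it downloads the model):
# classifier = load_zero_shot_classifier()
# classifier("tasksource standardizes 500+ datasets",
#            candidate_labels=["machine learning", "cooking"])
```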

### Contact and citation
I can help you integrate tasksource into your experiments. `damien.sileo@inria.fr`

More details in this [article](https://arxiv.org/abs/2301.05948):
```bib
@article{sileo2023tasksource,
  title={tasksource: Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation},
  author={Sileo, Damien},
  url={https://arxiv.org/abs/2301.05948},
  journal={arXiv preprint arXiv:2301.05948},
  year={2023}
}
```