| title: README | |
| emoji: 🐨 | |
| colorFrom: purple | |
| colorTo: indigo | |
| sdk: static | |
| pinned: false | |
| *This organization and its dataset are no longer actively maintained. You are still welcome to add similar datasets to it.* | |
| **Feel free to join the organization if you want to add a dataset with a similar purpose :) Please [tell me](https://tillwenke.github.io/about/) about your dataset before asking to join the org.** | |
| To test your **RAG** and other **semantic information retrieval solutions**, it helps to have a dataset that consists of a text corpus, | |
| correct responses to queries (e.g. question-answer pairs) to test the solution end-to-end, and possibly also a set of relevant passages | |
| from the text corpus for each query to test the retrieval component separately as well. | |
| We call this a question-answer-passages dataset. | |
| There are plenty of large-scale datasets of this kind such as [Google's Natural Questions](https://ai.google.com/research/NaturalQuestions/). | |
| Still, we lack **small-scale**, **narrow-domain** datasets of this kind for quickly testing a RAG solution or seeing how it performs | |
| in a certain domain context. | |
| We created this space to build a collection of such datasets to boost the development of RAG solutions, and we welcome any feedback on what your ideal RAG dataset would look like. :) | |
| Datasets consist of: | |
| * A **text corpus** already split into passages, referencing passages by id. | |
| * A dataset for testing, where each entry consists of: | |
|   * A **question**, and one or ideally both of the following: | |
|     * A correct **short answer**. | |
|     * A **list of the passage ids** that are relevant to answer the question. | |
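
The structure listed above can be sketched in Python as follows. This is only an illustration under stated assumptions, not a format prescribed by this organization: the names `Passage`, `TestExample`, `validate`, and `retrieval_recall` are hypothetical, and the toy passages are made up. The sketch shows a corpus of passages referenced by id, a test entry that carries a short answer, relevant passage ids, or both, and one simple way the passage ids could be used to check a retrieval component on its own.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Passage:
    """One passage of the pre-split text corpus (hypothetical layout)."""
    id: int
    text: str


@dataclass
class TestExample:
    """One test entry: a question plus a short answer, relevant ids, or both."""
    question: str
    answer: Optional[str] = None
    relevant_passage_ids: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        # At least one of the two optional parts has to be present.
        if self.answer is None and not self.relevant_passage_ids:
            raise ValueError("need a short answer, relevant passage ids, or both")


def validate(corpus: List[Passage], examples: List[TestExample]) -> None:
    """Check that passage ids are unique and every referenced id exists."""
    known = {p.id for p in corpus}
    if len(known) != len(corpus):
        raise ValueError("duplicate passage ids in corpus")
    for ex in examples:
        missing = set(ex.relevant_passage_ids) - known
        if missing:
            raise ValueError(f"unknown passage ids {sorted(missing)} in {ex.question!r}")


def retrieval_recall(retrieved_ids: List[int], example: TestExample) -> float:
    """Fraction of the relevant passages that the retriever returned."""
    relevant = set(example.relevant_passage_ids)
    if not relevant:
        raise ValueError("example has no relevant passage ids")
    return len(relevant & set(retrieved_ids)) / len(relevant)


# Made-up toy data, for illustration only.
corpus = [
    Passage(0, "The workshop starts at nine in room A."),
    Passage(1, "Lunch is served in the cafeteria."),
    Passage(2, "Room A is on the second floor."),
]
example = TestExample(
    question="Where and when does the workshop start?",
    answer="At nine in room A",
    relevant_passage_ids=[0, 2],
)
validate(corpus, [example])

# Suppose a retriever returned passages 2 and 1 for the question above.
recall = retrieval_recall([2, 1], example)
print(recall)
```

Recall over the relevant passage ids is just one possible retrieval metric; the short answer would be compared against the end-to-end output of the RAG solution separately.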