|
|
--- |
|
|
license: cc-by-sa-4.0 |
|
|
language: |
|
|
- bg |
|
|
- cs |
|
|
- da |
|
|
- de |
|
|
- el |
|
|
- en |
|
|
- es |
|
|
- et |
|
|
- fi |
|
|
- fr |
|
|
- ga |
|
|
- hr |
|
|
- hu |
|
|
- it |
|
|
- lt |
|
|
- lv |
|
|
- mt |
|
|
- nl |
|
|
- pl |
|
|
- pt |
|
|
- ro |
|
|
- sk |
|
|
- sl |
|
|
- sv |
|
|
--- |
|
|
# Dactory models |
|
|
|
|
|
## Model description |
|
|
|
|
|
This is a set of fastText-based models for evaluating the quality and domain of text in the 24 official languages of the European Union. |
|
|
The main use of these models is to preprocess data from the Common Crawl project in order to obtain a training set for large language models. |
|
|
These models can be used as part of the dactory pipeline, released by Kyutai to process Common Crawl. |
|
|
|
|
|
There is one model per language, and each model is a multilabel classifier with the following eight labels: |
|
|
random webpages (`rand`), Wikipedia articles (`wiki`), textbooks (`books`), scientific articles from pes2o (`science`), |
|
|
Stack Exchange websites related to STEM (`stem`), the humanities (`hum`), pop culture (`pop`), and life advice (`life`). |
|
|
The models were trained to distinguish lines sampled uniformly from these different sources. |
|
|
To get training data for languages other than English, we translated the English training set with MADLAD, except for the `rand` and `wiki` labels, for which data is readily available in all languages. |
|
|
|
|
|
* **Model name**: Dactory models |
|
|
* **Languages**: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Irish, Croatian, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish |
|
|
* **Developed by**: Kyutai |
|
|
* **Model type**: Classification |
|
|
* **License**: CC-BY-SA 4.0 |
|
|
* **Version**: 1.0 |
|
|
* **Released**: April 2025 |
|
|
|
|
|
## Use cases |
|
|
|
|
|
These models can be used to evaluate the quality of text by estimating how similar it is to text from high-quality sources. |
|
|
In particular, one can take the score corresponding to the `rand` label as an estimate of the text quality. |
|
|
They can also be used to organize a collection of documents by their similarity to the different data sources used to train the model. |
|
|
For example, a large language model trained mostly on documents labeled as `books` will perform well on multiple-choice Q&A benchmarks such as MMLU, while an LLM trained mostly on documents labeled as `wiki` will perform well on general-knowledge Q&A benchmarks such as TriviaQA. |
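Bucketing a corpus by top label could be sketched like this (a minimal illustration: `classify` stands in for a call such as `model.predict(doc)`, and `fake_classify` with its hard-coded labels is a hypothetical stand-in, not part of these models): |

```python |
from collections import defaultdict |

def bucket_documents(docs, classify): |
    """Group documents by their highest-scoring label. |
    `classify` should return (labels, probs) in fastText's format.""" |
    buckets = defaultdict(list) |
    for doc in docs: |
        labels, probs = classify(doc) |
        # Pick the label with the highest probability |
        top = max(zip(labels, probs), key=lambda lp: lp[1])[0] |
        buckets[top.removeprefix("__label__")].append(doc) |
    return dict(buckets) |

# Toy stand-in classifier; a real pipeline would use model.predict(doc) |
def fake_classify(doc): |
    label = "__label__wiki" if "encyclopedia" in doc else "__label__rand" |
    return (label,), (0.9,) |

docs = ["an encyclopedia entry", "random boilerplate"] |
print(bucket_documents(docs, fake_classify)) |
# {'wiki': ['an encyclopedia entry'], 'rand': ['random boilerplate']} |
``` |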
|
|
|
|
|
## How to use |
|
|
|
|
|
You can download the files locally by using the [huggingface-hub Python package](https://huggingface.co/docs/hub/en/models-downloading). |
|
|
|
|
|
For example: |
|
|
|
|
|
```python |
import fasttext |
from huggingface_hub import hf_hub_download |

# Download the English quality/domain classifier from the Hugging Face Hub |
local_path = hf_hub_download(repo_id="kyutai/dactory-models", filename="filter_en.bin") |
model = fasttext.load_model(local_path) |

# Returns the top label and its probability for the given line of text |
print(model.predict("A computer scientist is a scientist who specializes in the academic study of computer science.")) |
``` |
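fastText's `predict` returns a tuple of label names (each prefixed with `__label__`) alongside their probabilities, and passing `k=-1` requests scores for all labels. A small helper can turn that pair into a plain dict; the values below are illustrative, not real model output (with a loaded model you would call `labels, probs = model.predict(text, k=-1)`): |

```python |
def scores_to_dict(labels, probs): |
    """Pair fastText's parallel label/probability outputs into a dict, |
    stripping the '__label__' prefix fastText adds to each class name.""" |
    return {label.removeprefix("__label__"): float(p) |
            for label, p in zip(labels, probs)} |

# Illustrative values in the shape fastText returns |
labels = ("__label__wiki", "__label__rand", "__label__science") |
probs = [0.71, 0.17, 0.12] |
print(scores_to_dict(labels, probs)) |
# {'wiki': 0.71, 'rand': 0.17, 'science': 0.12} |
``` |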