| --- |
| pipeline_tag: text-classification |
| language: |
| - multilingual |
| license: apache-2.0 |
| library_name: transformers |
| --- |
| |
| # Model Description |
|
|
| This model was built by translating the FineWeb-Edu annotations into 15 languages with Tower LLM 70B, a proprietary LLM specialized for translation. |
|
|
| The translation model excels at translating entire documents, which makes it a good fit for translating the texts used to train the classifier. |
|
|
| The classifier is trained for English, German, Spanish, Japanese, Chinese, Russian, Hindi, Czech, Ukrainian, Icelandic, Portuguese, French, Dutch, Italian and Korean. Since it is built on top of [mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base), it should be able to generalize to other languages. |
|
|
| ## Running the Model: |
| To run inference, install the following libraries: |
| ``` |
| pip install transformers[torch] |
| pip install datasets |
| pip install pandas |
| pip install tqdm |
| ``` |
|
|
| After installing those libraries, you can run the following code: |
|
|
| ```python |
| import pandas as pd |
| import torch |
| from transformers import AutoModelForSequenceClassification, AutoTokenizer |
| from tqdm import tqdm |
| |
| |
| # Use the GPU when one is available, otherwise fall back to CPU. |
| device = "cuda" if torch.cuda.is_available() else "cpu" |
| path = "Unbabel/mfineweb-edu-classifier" |
| model = AutoModelForSequenceClassification.from_pretrained( |
|     path, |
|     device_map=device, |
|     trust_remote_code=True, |
|     torch_dtype=torch.bfloat16, |
| ) |
| tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True) |
| |
| def get_model_outputs(texts): |
|     """Return regression scores and binary-head probabilities for a batch of texts.""" |
|     inputs = tokenizer( |
|         texts, padding=True, truncation=True, return_tensors="pt", max_length=512 |
|     ).to(model.device) |
|     with torch.no_grad(): |
|         outputs = model(**inputs) |
|     score = outputs.logits |
|     prob = torch.sigmoid(outputs.binary_logits) |
|     # Cast to float32 so the tensors can later be converted to NumPy |
|     # (bfloat16 tensors cannot be). |
|     return score.float().cpu(), prob.float().cpu() |
| |
| def batchify_texts(texts, batch_size): |
|     """Yield consecutive slices of `texts` with at most `batch_size` items.""" |
|     for i in range(0, len(texts), batch_size): |
|         yield texts[i:i + batch_size] |
| |
| # TODO: replace the next line with the texts you want to classify |
| texts = LIST_WITH_TEXTS_TO_CLASSIFY |
| batch_size = 64  # Adjust based on your available memory and model capacity |
| num_batches = (len(texts) + batch_size - 1) // batch_size |
| |
| all_scores = [] |
| all_probs = [] |
| with tqdm(total=num_batches, dynamic_ncols=True) as pbar: |
|     for batch_num, batch in enumerate(batchify_texts(texts, batch_size), 1): |
|         score, probs = get_model_outputs(batch) |
|         all_scores.append(score) |
|         all_probs.append(probs) |
|         pbar.set_description(f"Processing Batch {batch_num}/{num_batches}") |
|         pbar.update(1) |
| |
| # SCORES is the output of the regression head and should reflect the |
| # educational score of the text. |
| # squeeze(-1) drops only the trailing dimension, so a single text still yields a 1-D tensor. |
| scores = torch.cat(all_scores, dim=0).squeeze(-1) |
| |
| # BINARY_PRED is the output of the classification head that tells |
| # whether a text has an acceptable educational score or not. |
| # NOTE: Converting the scores into binary predictions is also possible. |
| all_probs = torch.cat(all_probs, dim=0).squeeze(-1) |
| binary_pred = (all_probs >= 0.5).numpy().astype(int) |
| ``` |
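The note in the code above says the regression scores can also be converted into binary predictions. The sketch below shows one way to do that, assuming the convention of the original FineWeb-Edu classifier: clip the raw score to the 0 to 5 range, round it to an integer, and treat scores of 3 or more as acceptable. The threshold of 3 is an assumption rather than something this card states. It is consistent with the support counts reported in the results, where classes 3, 4 and 5 sum to 2310 + 645 + 1 = 2956, the size of the positive class for the binary head. The helper names `to_int_score` and `to_binary` are hypothetical.

```python
def to_int_score(score, low=0, high=5):
    """Clip a raw regression score to [low, high] and round it to an integer."""
    return int(round(max(low, min(score, high))))


def to_binary(score, threshold=3):
    """Return 1 if the rounded score reaches the (assumed) threshold, else 0."""
    return int(to_int_score(score) >= threshold)


# Illustrative raw scores, not real model outputs.
raw_scores = [-0.3, 0.8, 2.4, 2.6, 3.7, 5.9]

int_scores = [to_int_score(s) for s in raw_scores]
binary = [to_binary(s) for s in raw_scores]

print(int_scores)  # [0, 1, 2, 3, 4, 5]
print(binary)      # [0, 0, 0, 1, 1, 1]
```

With the `scores` tensor from the inference code, the same helpers could be applied to `scores.tolist()`.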
|
|
| ## English Results: |
|
|
| When tested on an English partition with 37537 samples, the model achieves results comparable to those of the original FineWeb-Edu classifier. |
|
|
| Regression head results: |
| ``` |
|               precision    recall  f1-score   support |
| |
|            0       0.80      0.53      0.64      5130 |
|            1       0.80      0.88      0.83     21602 |
|            2       0.63      0.58      0.61      7849 |
|            3       0.54      0.62      0.58      2310 |
|            4       0.62      0.48      0.54       645 |
|            5       0.00      0.00      0.00         1 |
| |
|     accuracy                           0.74     37537 |
|    macro avg       0.56      0.51      0.53     37537 |
| weighted avg       0.74      0.74      0.74     37537 |
| ``` |
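The gap between the macro and weighted averages in the report above comes from class imbalance. Class 5 has a single sample and an F1 of 0.00, which counts as much as any other class in the macro average but contributes almost nothing to the weighted average. The sketch below recomputes both averages from the per-class rows. It uses the rounded values from the table, so the results can only be expected to match to two decimals.

```python
# (f1-score, support) per class, copied from the English regression head report.
per_class = {
    0: (0.64, 5130),
    1: (0.80 + 0.03, 21602),  # 0.83, written out to match the table row
    2: (0.61, 7849),
    3: (0.58, 2310),
    4: (0.54, 645),
    5: (0.00, 1),
}

total_support = sum(support for _, support in per_class.values())

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

print(total_support)          # 37537
print(round(macro_f1, 2))     # 0.53
print(round(weighted_f1, 2))  # 0.74
```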
|
|
| Binary head results: |
| ``` |
|               precision    recall  f1-score   support |
| |
|            0       0.98      0.97      0.98     34581 |
|            1       0.71      0.74      0.73      2956 |
| |
|     accuracy                           0.96     37537 |
|    macro avg       0.85      0.86      0.85     37537 |
| weighted avg       0.96      0.96      0.96     37537 |
| ``` |
|
|
| ## Multilingual Results: |
|
|
| When we evaluate on the same texts translated into 15 different languages, the results are almost identical. |
|
|
| Regression head results: |
| ``` |
|               precision    recall  f1-score   support |
| |
|            0       0.80      0.50      0.61      5130 |
|            1       0.79      0.87      0.83     21602 |
|            2       0.61      0.58      0.59      7849 |
|            3       0.52      0.61      0.56      2310 |
|            4       0.61      0.38      0.47       645 |
|            5       0.00      0.00      0.00         1 |
| |
|     accuracy                           0.73     37537 |
|    macro avg       0.55      0.49      0.51     37537 |
| weighted avg       0.73      0.73      0.73     37537 |
| ``` |
|
|
| Binary head results: |
| ``` |
|               precision    recall  f1-score   support |
| |
|            0       0.98      0.97      0.97     34581 |
|            1       0.70      0.71      0.71      2956 |
| |
|     accuracy                           0.95     37537 |
|    macro avg       0.84      0.84      0.84     37537 |
| weighted avg       0.95      0.95      0.95     37537 |
| ``` |
|
|