Update README.md

README.md (CHANGED)
datasets:
- ymoslem/wmt-da-human-evaluation
model-index:
- name: Quality Estimation for Machine Translation
  results:
  - task:
      type: regression
    dataset:
      name: ymoslem/wmt-da-human-evaluation
      type: QE
    metrics:
    - name: Pearson Correlation
      type: Pearson
      value: 0.4458
    - name: Mean Absolute Error
      type: MAE
      value: 0.1876
    - name: Root Mean Squared Error
      type: RMSE
      value: 0.2393
    - name: R-Squared
      type: R2
      value: 0.1987
metrics:
- pearsonr
- mae
- r_squared
---

# Quality Estimation for Machine Translation

This model is a fine-tuned version of [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large)
on the [ymoslem/wmt-da-human-evaluation](https://huggingface.co/ymoslem/wmt-da-human-evaluation) dataset.

It achieves the following results on the evaluation set:
- Loss: 0.0564

## Model description

This model is for reference-free quality estimation (QE) of machine translation (MT) systems.
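
Given a source segment and its machine translation joined by the tokenizer's separator token, the model returns a single sentence-level quality score. A minimal sketch (the sentence pair below is made up for illustration; the Inference section gives the full, batched workflow):

```python
from transformers import pipeline

# Quick single-pair scoring with the "text-classification" pipeline.
qe = pipeline("text-classification", model="ymoslem/ModernBERT-large-qe-v1")

src = "Today the weather is nice."   # hypothetical source segment
mt = "Das Wetter ist heute schön."   # hypothetical machine translation
print(qe(f"{src} {qe.tokenizer.sep_token} {mt}")[0]["score"])
```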

## Training procedure

### Training hyperparameters

This version of the model uses the tokenizer's full maximum sequence length of 8192 tokens.
The model with a maximum length of 512 tokens is available here: [ymoslem/ModernBERT-large-qe-maxlen512-v1](https://huggingface.co/ymoslem/ModernBERT-large-qe-maxlen512-v1).
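
As a quick check (a minimal sketch; `model_max_length` is the limit the inference code below relies on):

```python
from transformers import AutoTokenizer

# The tokenizer of this model should report the 8192-token limit mentioned above.
tokenizer = AutoTokenizer.from_pretrained("ymoslem/ModernBERT-large-qe-v1")
print(tokenizer.model_max_length)
```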

The following hyperparameters were used during training:
- learning_rate: 8e-05
- train_batch_size: 128

| Training Loss | Epoch  | Step  | Validation Loss |
|:-------------:|:------:|:-----:|:---------------:|
| 0.0631        | 0.1004 | 1000  | 0.0674          |
| 0.0614        | 0.2007 | 2000  | 0.0599          |
| 0.0578        | 0.3011 | 3000  | 0.0585          |
| 0.0585        | 0.4015 | 4000  | 0.0579          |
| 0.0568        | 0.5019 | 5000  | 0.0570          |
| 0.057         | 0.6022 | 6000  | 0.0568          |
| 0.0579        | 0.7026 | 7000  | 0.0567          |
| 0.0573        | 0.8030 | 8000  | 0.0565          |
| 0.0568        | 0.9033 | 9000  | 0.0564          |
| 0.0571        | 1.0037 | 10000 | 0.0564          |

### Framework versions

- Pytorch 2.4.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0

## Inference

1. Install the required libraries.

```bash
pip3 install --upgrade datasets accelerate transformers
pip3 install --upgrade flash_attn triton
```

2. Load the test dataset.

```python
from datasets import load_dataset

test_dataset = load_dataset("ymoslem/wmt-da-human-evaluation",
                            split="test",
                            trust_remote_code=True
                            )
print(test_dataset)
```

3. Load the model and tokenizer:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the fine-tuned model and tokenizer
model_name = "ymoslem/ModernBERT-large-qe-v1"
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()
```

4. Prepare the dataset. Each source segment `src` and target segment `tgt` are joined by the tokenizer's `sep_token`.

```python
sep_token = tokenizer.sep_token
input_test_texts = [f"{src} {sep_token} {tgt}" for src, tgt in zip(test_dataset["src"], test_dataset["mt"])]
```

5. Generate predictions.

If you print `model.config.problem_type`, the output is `regression`.
Still, you can use the "text-classification" pipeline as follows (cf. [pipeline documentation](https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.TextClassificationPipeline)):

```python
from transformers import pipeline

classifier = pipeline("text-classification",
                      model=model_name,
                      tokenizer=tokenizer,
                      device=0,
                      )

predictions = classifier(input_test_texts,
                         batch_size=128,
                         truncation=True,
                         padding="max_length",
                         max_length=tokenizer.model_max_length,
                         )
predictions = [prediction["score"] for prediction in predictions]
```

Alternatively, you can use a more elaborate version of the code, which is slightly faster and provides more control.

```python
from torch.utils.data import DataLoader
import torch
from tqdm.auto import tqdm

# Tokenization function
def process_batch(batch, tokenizer, device):
    sep_token = tokenizer.sep_token
    input_texts = [f"{src} {sep_token} {tgt}" for src, tgt in zip(batch["src"], batch["mt"])]
    tokens = tokenizer(input_texts,
                       truncation=True,
                       padding="max_length",
                       max_length=tokenizer.model_max_length,
                       return_tensors="pt",
                       ).to(device)
    return tokens


# Create a DataLoader for batching
test_dataloader = DataLoader(test_dataset,
                             batch_size=128,  # Adjust batch size as needed
                             shuffle=False)

# List to store all predictions
predictions = []

with torch.no_grad():
    for batch in tqdm(test_dataloader, desc="Inference Progress", unit="batch"):

        tokens = process_batch(batch, tokenizer, device)

        # Forward pass: generate the model's logits
        outputs = model(**tokens)

        # Get logits (predictions)
        logits = outputs.logits

        # Extract the regression predicted values
        batch_predictions = logits.squeeze()

        # Extend the list with the predictions
        predictions.extend(batch_predictions.tolist())
```
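
To reproduce the evaluation metrics reported in the model index (Pearson correlation, MAE, RMSE, and R-squared), the predictions can be compared against the human quality scores of the test split. A minimal sketch, assuming `scipy` and `scikit-learn` are installed and that the gold scores live in a `score` column (adjust the column name to the dataset's actual schema):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# "score" is assumed to be the column holding the human quality judgments.
references = np.array(test_dataset["score"], dtype=float)
hypotheses = np.array(predictions, dtype=float)

pearson = pearsonr(references, hypotheses)[0]
mae = mean_absolute_error(references, hypotheses)
rmse = mean_squared_error(references, hypotheses) ** 0.5
r2 = r2_score(references, hypotheses)

print(f"Pearson: {pearson:.4f}  MAE: {mae:.4f}  RMSE: {rmse:.4f}  R2: {r2:.4f}")
```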