AHAAM/B2BERT

Ali Mekky committed on Feb 12, 2025

Commit 4e4c436 · 1 Parent(s): fdf0710

Update README.md

Files changed (1)

README.md +62 -1

@@ -46,7 +46,7 @@ Users should be aware of biases in dataset annotation and carefully validate out
 - **Testing Data:** NADI 2024 Test set
 - **Metrics:** Macro F1-score, precision, recall
-- **Link to NADI2024 Leaderboard** https://huggingface.co/spaces/AMR-KELEG/NADI2024-leaderboard
+- **Link to NADI2024 Leaderboard** https://huggingface.co/spaces/AMR-KELEG/MLADI
@@ -61,4 +61,65 @@ Users should be aware of biases in dataset annotation and carefully validate out
 - **Hardware:** NVIDIA RTX 6000 (24GB VRAM)
 - **Software:** Python, PyTorch, Hugging Face Transformers
+## Using the Model
+```
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+# Load the model and tokenizer
+model_name = "AliMekky/MDABERT"
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+# Define dialects
+DIALECTS = [
+    "Algeria", "Bahrain", "Egypt", "Iraq", "Jordan", "Kuwait", "Lebanon", "Libya",
+    "Morocco", "Oman", "Palestine", "Qatar", "Saudi_Arabia", "Sudan", "Syria",
+    "Tunisia", "UAE", "Yemen"
+]
+def predict_binary_outcomes(model, tokenizer, texts, threshold=0.3):
+    """Predict the validity in each dialect by applying a sigmoid activation to each dialect's logit.
+    Dialects with probabilities (sigmoid activations) above the threshold (default 0.3) are predicted as valid.
+    The model generates logits for each dialect in the following order:
+    Algeria, Bahrain, Egypt, Iraq, Jordan, Kuwait, Lebanon, Libya, Morocco, Oman, Palestine, Qatar,
+    Saudi_Arabia, Sudan, Syria, Tunisia, UAE, Yemen.
+    """
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    model.to(device)
+    encodings = tokenizer(
+        texts, truncation=True, padding=True, max_length=128, return_tensors="pt"
+    )
+    input_ids = encodings["input_ids"].to(device)
+    attention_mask = encodings["attention_mask"].to(device)
+    with torch.no_grad():
+        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
+        logits = outputs.logits
+    probabilities = torch.sigmoid(logits).cpu().numpy().reshape(-1)
+    binary_predictions = (probabilities >= threshold).astype(int)
+    # Map indices to actual labels
+    predicted_dialects = [
+        dialect
+        for dialect, dialect_prediction in zip(DIALECTS, binary_predictions)
+        if dialect_prediction == 1
+    ]
+    return predicted_dialects
+text = "كيف حالك؟"
+# A threshold of 0.3 (the default) is recommended for better results.
+predicted_dialects = predict_binary_outcomes(model, tokenizer, [text])
+print(f"Predicted Dialects: {predicted_dialects}")
+```
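
Note on the snippet above: `reshape(-1)` flattens the probabilities across the whole batch, so when `texts` holds more than one string, `zip(DIALECTS, binary_predictions)` pairs the dialect names with the first text's 18 values only and the rest are silently dropped. A minimal sketch of per-text decoding follows, assuming the logits arrive as one row of 18 values per text (for example via `outputs.logits.cpu().tolist()`). The names `stable_sigmoid` and `decode_per_text` are hypothetical helpers and not part of the committed README, and the example logits are made up for illustration.

```python
import math

# Same label order as the DIALECTS list in the README snippet.
DIALECTS = [
    "Algeria", "Bahrain", "Egypt", "Iraq", "Jordan", "Kuwait", "Lebanon", "Libya",
    "Morocco", "Oman", "Palestine", "Qatar", "Saudi_Arabia", "Sudan", "Syria",
    "Tunisia", "UAE", "Yemen",
]


def stable_sigmoid(x):
    """Sigmoid that avoids overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_per_text(logit_rows, labels=DIALECTS, threshold=0.3):
    """Return one list of predicted labels per input row (hypothetical helper).

    Each row must hold exactly one logit per label; a label is kept when its
    sigmoid probability is at or above the threshold.
    """
    results = []
    for row in logit_rows:
        if len(row) != len(labels):
            raise ValueError(f"expected {len(labels)} logits, got {len(row)}")
        results.append(
            [label for label, logit in zip(labels, row)
             if stable_sigmoid(logit) >= threshold]
        )
    return results


# Made-up logits for two texts (not real model output).
example_logits = [[-5.0] * 18, [-5.0] * 18]
example_logits[0][2] = 2.0    # Egypt: sigmoid(2.0) is about 0.88
example_logits[1][8] = 1.0    # Morocco: sigmoid(1.0) is about 0.73
example_logits[1][15] = -0.5  # Tunisia: sigmoid(-0.5) is about 0.38

predictions = decode_per_text(example_logits)
print(predictions)  # [['Egypt'], ['Morocco', 'Tunisia']]
```

With the model from the snippet above, the same helper could be called as `decode_per_text(outputs.logits.cpu().tolist())`, which would return one list of dialects for each entry in `texts`.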