---
library_name: transformers
tags: []
---

# Model Card for Modern-EgyBert-Embedding

Modern-EgyBert-Embedding is a BERT-style Arabic sentence-embedding model covering Modern Standard Arabic (MSA) and Egyptian dialect. It is fine-tuned from [`metga97/Modern-EgyBert-Base`](https://huggingface.co/metga97/Modern-EgyBert-Base) and produces 768-dimensional sentence vectors (via mean pooling) for similarity, retrieval, and clustering tasks.

|
## Model Details

### Model Description

- **Developed by:** Mohammad Essam ([metga97](https://huggingface.co/metga97))
- **Model type:** BERT-style encoder
- **Language(s):** Arabic (MSA + Egyptian dialect)
- **License:** MIT
- **Finetuned from model:** [`metga97/Modern-EgyBert-Base`](https://huggingface.co/metga97/Modern-EgyBert-Base)

|
## Uses

This model is intended for generating sentence embeddings for downstream tasks such as:

- Sentence similarity
- Semantic retrieval
- Clustering of Arabic sentences
- Intent classification
- Duplicate detection

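For semantic retrieval, pooled sentence embeddings can be ranked by cosine similarity against a query embedding. A minimal sketch on toy 768-dimensional vectors (in a real pipeline these would be mean-pooled model outputs, produced as shown in the next section):

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for pooled sentence embeddings (hidden size 768).
corpus = F.normalize(torch.randn(4, 768), dim=-1)   # 4 indexed sentences
query = F.normalize(torch.randn(1, 768), dim=-1)    # 1 query sentence

# With L2-normalized vectors, a dot product is the cosine similarity.
scores = query @ corpus.T                            # shape (1, 4)
ranking = scores.argsort(dim=-1, descending=True)    # best match first
print(scores.shape)  # torch.Size([1, 4])
```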
|
## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("metga97/Modern-EgyBert-Embedding")
model = AutoModel.from_pretrained("metga97/Modern-EgyBert-Embedding")

text = ["الجو النهارده جميل"]  # "The weather is beautiful today" (Egyptian Arabic)
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    last_hidden = outputs.last_hidden_state

# Mask-aware mean pooling over token embeddings
attention_mask = inputs["attention_mask"]
input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden.size()).float()
sum_embeddings = torch.sum(last_hidden * input_mask_expanded, dim=1)
sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
sentence_embedding = sum_embeddings / sum_mask

print(sentence_embedding.shape)  # torch.Size([1, 768])
```
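For sentence similarity, the pooling step can be factored into a reusable helper and the resulting vectors compared with cosine similarity. A minimal sketch on toy tensors: `mean_pool` mirrors the inline pooling above, and in practice its inputs would be the model's `last_hidden_state` and `attention_mask` rather than random data.

```python
import torch
import torch.nn.functional as F

def mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mask-aware mean pooling, equivalent to the inline snippet above."""
    mask = attention_mask.unsqueeze(-1).expand(last_hidden.size()).float()
    summed = (last_hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts

# Toy batch: 2 sequences of 5 tokens, hidden size 768; the second is padded.
hidden = torch.randn(2, 5, 768)
mask = torch.tensor([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])
embeddings = mean_pool(hidden, mask)                       # shape (2, 768)

similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2])
print(embeddings.shape, similarity.shape)  # torch.Size([2, 768]) torch.Size([1])
```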