---
language: fa
license: mit
tags:
- keyword-extraction
- persian
- farsi
- token-classification
- xlm-roberta
- nlp
datasets:
- custom
metrics:
- precision
- recall
- f1
widget:
- text: "ایران کشوری با تاریخ و فرهنگ غنی است که دارای جاذبه‌های گردشگری فراوان می‌باشد."
---

# Model Datacard: Persian Keyword Extraction Model

## Model Details

- **Model Name**: keyword_Roberta_base_per
- **Base Model**: xlm-roberta-large
- **Task**: Keyword Extraction
- **Language**: Persian (Farsi)
- **Developer**: PakdamanAli
- **Model Version**: 1.0.0

## Intended Use

This model is designed to extract keywords from Persian text. It can be used for:

- Automatic tagging of content
- Search engine optimization
- Content categorization
- Topic modeling
- Information retrieval enhancement

### Primary Intended Uses

- Content analysis for Persian websites
- Academic research on Persian text
- Information extraction systems

### Out-of-Scope Use Cases

- Translation services
- Text summarization
- Persian named entity recognition (unless specifically trained for this)
- Other NLP tasks beyond keyword extraction

## Training Data

- **Dataset Size**: 40,000 Persian text samples
- **Training Procedure**: Fine-tuned from the `xlm-roberta-large` checkpoint

## Performance Evaluation

Metrics and evaluation results will be published in a future update.
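
Until official numbers are available, the metrics named in the card's metadata (precision, recall, F1) could be computed per document by comparing predicted keywords with a reference set. The following is a minimal sketch that assumes exact-match, set-based scoring; the function name `keyword_prf` and the toy keyword lists are hypothetical and do not represent results for this model.

```python
def keyword_prf(predicted, reference):
    """Set-based precision, recall, and F1 for one document (exact match)."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy illustration only: these lists are made up, not model output.
predicted = ["ایران", "تاریخ", "کشور"]
reference = ["ایران", "تاریخ", "فرهنگ", "گردشگری"]
precision, recall, f1 = keyword_prf(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Exact matching is strict; a real evaluation might normalize spelling or allow partial matches, which would change the scores.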

## Limitations

- The model may not perform well on domain-specific content that was not represented in the training data
- Performance may vary for very short or extremely long texts
- The model may occasionally extract words that are not truly "key" to the content
- Dialect variations in Persian might affect extraction quality
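
For the long-text limitation, one possible workaround is to split the input into shorter chunks and run the extractor on each chunk separately. XLM-RoBERTa models accept at most 512 tokens per input, so the sketch below packs whole sentences into chunks under a character budget. It assumes that sentence boundaries can be approximated by Latin and Persian end punctuation and that a character count is an acceptable stand-in for the token limit; `split_into_chunks` and `max_chars` are hypothetical names, not part of this model's API.

```python
import re


def split_into_chunks(text, max_chars=1000):
    """Greedily pack whole sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole in its own chunk.
    """
    # Split after sentence-ending punctuation, including the Persian question mark.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?؟])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to the pipeline and the per-chunk results concatenated. A stricter version would count tokens with the model's tokenizer instead of characters.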

## Ethical Considerations

- The model is trained on Persian text and may reflect biases present in that content
- Users should review extracted keywords for sensitive content before using them in automated systems
- The model should not be used to extract or analyze personally identifiable information without proper consent

## Technical Specifications

- **Input**: Persian text (UTF-8 encoded)
- **Output**: List of extracted keywords
- **Framework**: Transformers (Hugging Face)
- **Requirements**: PyTorch, Transformers

## Pipeline Usage

To use this model with the Hugging Face pipeline:

```python
from transformers import pipeline

# Initialize the token-classification pipeline with the fine-tuned model
keyword_extractor = pipeline(
    task="token-classification",
    model="PakdamanAli/keyword_Roberta_base_per",
    tokenizer="PakdamanAli/keyword_Roberta_base_per",
)

# Example usage. The sentence means: "Iran is a country with a rich history
# and culture that has many tourist attractions."
text = "ایران کشوری با تاریخ و فرهنگ غنی است که دارای جاذبه‌های گردشگری فراوان می‌باشد."
keywords = keyword_extractor(text)

# Each result is a dict with keys such as "entity", "score", "word",
# "start", and "end"; "word" may be a subword piece rather than a full word.
# Example: extracted_keywords = [item["word"] for item in keywords]
```
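
Because XLM-RoBERTa tokenizers are based on SentencePiece, the `word` field of each result may hold a word fragment, with the `▁` character marking the start of a new word. The sketch below shows one way to join consecutive pieces back into whole words. It assumes the un-aggregated pipeline output format (dicts with `word` and `index` keys); `merge_subwords` is a hypothetical helper, and the sample dicts are hand-written rather than real model output. The pipeline's own `aggregation_strategy` argument may offer similar grouping, depending on the model's label scheme.

```python
WORD_START = "\u2581"  # SentencePiece marker "▁" for the start of a word


def merge_subwords(tokens):
    """Join SentencePiece pieces into whole words.

    A piece starts a new word when it begins with the marker or when its
    token index is not adjacent to the previous piece.
    """
    words, current, last_index = [], "", None
    for tok in tokens:
        piece = tok["word"]
        starts_word = piece.startswith(WORD_START)
        adjacent = last_index is not None and tok["index"] == last_index + 1
        if current and (starts_word or not adjacent):
            words.append(current)
            current = ""
        current += piece.lstrip(WORD_START)
        last_index = tok["index"]
    if current:
        words.append(current)
    return words


# Hand-written sample in the pipeline's output shape (not real model output).
sample = [
    {"word": "▁ایران", "index": 1},
    {"word": "▁گرد", "index": 12},
    {"word": "شگری", "index": 13},
]
print(merge_subwords(sample))
```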

## Example

```python
from transformers import pipeline

extractor = pipeline(
    task="token-classification",
    model="PakdamanAli/keyword_Roberta_base_per",
    tokenizer="PakdamanAli/keyword_Roberta_base_per",
)

text = "ایران کشوری با تاریخ و فرهنگ غنی است که دارای جاذبه‌های گردشگری فراوان می‌باشد."
results = extractor(text)

# Extract just the words from the results
keywords = [item["word"] for item in results]
print(keywords)
```
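
The printed list may contain repeated or low-confidence entries. A minimal post-processing sketch follows. It assumes each result carries the `word` and `score` keys that the token-classification pipeline returns; `select_keywords` and the `min_score=0.5` threshold are hypothetical choices, and the sample results are hand-written placeholders rather than real model output.

```python
def select_keywords(results, min_score=0.5):
    """Keep confident predictions, drop duplicates, preserve first-seen order."""
    seen, selected = set(), []
    for item in results:
        word = item["word"].strip()
        if item["score"] >= min_score and word and word not in seen:
            seen.add(word)
            selected.append(word)
    return selected


# Hand-written placeholder results (not real model output).
sample_results = [
    {"word": "ایران", "score": 0.9},
    {"word": "تاریخ", "score": 0.3},
    {"word": "ایران", "score": 0.8},
    {"word": "فرهنگ", "score": 0.7},
]
print(select_keywords(sample_results))
```

A suitable threshold depends on the application and would need to be tuned on held-out data.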