---
language: fa
license: mit
tags:
- keyword-extraction
- persian
- farsi
- token-classification
- distilbert
- nlp
datasets:
- custom
metrics:
- precision
- recall
- f1
widget:
- text: "ایران کشوری با تاریخ و فرهنگ غنی است که دارای جاذبه‌های گردشگری فراوان می‌باشد."
---
# Model Card: Persian Keyword Extraction Model
## Model Details
- **Model Name**: keyword_distilbert_base_per
- **Base Model**: DistilBERT
- **Task**: Keyword Extraction
- **Language**: Persian (Farsi)
- **Developer**: PakdamanAli
- **Model Version**: 1.0.0
## Intended Use
This model is designed to extract keywords from Persian text. It can be used for:
- Automatic tagging of content
- Search engine optimization
- Content categorization
- Topic modeling
- Information retrieval enhancement
### Primary Intended Uses
- Content analysis for Persian websites
- Academic research on Persian text
- Information extraction systems
### Out-of-Scope Use Cases
- Translation services
- Text summarization
- Persian named entity recognition (unless specifically trained for this)
- Other NLP tasks beyond keyword extraction
## Training Data
- **Dataset Size**: 40,000 Persian text samples
- **Data Preparation**: Fine-tuned on xlm-roberta-large
## Performance Evaluation
Metrics and evaluation results will be published in a future update.
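Until those results are available, the sketch below shows one common way to compute the precision, recall, and F1 listed in this card's metadata for a single document. It assumes exact-match, set-based scoring of predicted keywords against a reference list; this is an illustrative assumption, not the evaluation protocol used for this model, and `keyword_prf` is a hypothetical helper name.

```python
def keyword_prf(predicted, gold):
    """Set-based precision, recall, and F1 for one document.

    Assumption: a predicted keyword counts as correct only if it
    exactly matches a reference keyword (no stemming or normalization).
    """
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: 2 of 4 predictions are correct, 2 of 3 references are found
p, r, f1 = keyword_prf(["a", "b", "c", "d"], ["a", "b", "e"])
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Exact matching is strict for Persian, where spacing and zero-width non-joiner variants can make equivalent keywords differ as strings, so normalizing both lists before scoring may be worthwhile.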
## Limitations
- The model may not perform well on domain-specific content that was not represented in the training data
- Performance may vary for very short or extremely long texts
- The model may occasionally extract words that are not truly "key" to the content
- Dialect variations in Persian might affect extraction quality
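For long texts, one possible workaround is to split the input into sentence-aligned chunks and run the extractor on each chunk separately. The sketch below is a minimal illustration, not part of this model: `split_into_chunks` is a hypothetical helper, and the `max_chars` budget is an arbitrary assumption that should be tuned to the model's actual input limit.

```python
import re


def split_into_chunks(text, max_chars=1000):
    """Group sentences into chunks of at most max_chars characters.

    Assumption: sentences end with '.', '!', '?' or the Persian
    question mark. A single sentence longer than max_chars is kept
    whole as its own chunk rather than being cut mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?؟])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > max_chars:
            # Adding this sentence would exceed the budget: start a new chunk
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("one. two! three? four.", max_chars=10))
```

Each chunk can then be passed to the pipeline on its own, and the per-chunk keyword lists merged afterwards.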
## Ethical Considerations
- The model is trained on Persian text and may reflect biases present in that content
- Users should review extracted keywords for sensitive content before using them in automated systems
- The model should not be used to extract or analyze personally identifiable information without proper consent
## Technical Specifications
- **Input**: Persian text (UTF-8 encoded)
- **Output**: Token-level predictions from the token-classification pipeline, from which the list of extracted keywords is collected
- **Framework**: Transformers (Hugging Face)
- **Requirements**: PyTorch, Transformers
## Pipeline Usage
To use this model with the Hugging Face pipeline:
```python
from transformers import pipeline
# Initialize the pipeline
keyword_extractor = pipeline(
    task="token-classification",
    model="PakdamanAli/keyword_distilbert_base_per",
    tokenizer="PakdamanAli/keyword_distilbert_base_per",
)
# Example usage
text = "ایران کشوری با تاریخ و فرهنگ غنی است که دارای جاذبه‌های گردشگری فراوان می‌باشد."
keywords = keyword_extractor(text)
# Process the results based on the model output format
# Example: extracted_keywords = [item["word"] for item in keywords]
```
## Example
```python
from transformers import pipeline
extractor = pipeline(
    task="token-classification",
    model="PakdamanAli/keyword_distilbert_base_per",
    tokenizer="PakdamanAli/keyword_distilbert_base_per",
)
text = "ایران کشوری با تاریخ و فرهنگ غنی است که دارای جاذبه‌های گردشگری فراوان می‌باشد."
results = extractor(text)
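# Extract just the words from the results.
# Each item is a dict; with the pipeline's default settings (no aggregation),
# an item may be a sub-word piece rather than a full word.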
keywords = [item["word"] for item in results]
print(keywords)
```
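Because the raw pipeline output is token-level, a small post-processing step can help turn it into a clean keyword list. The sketch below is one possible approach under stated assumptions: each item carries `score`, `start`, and `end` fields (as returned by the token-classification pipeline with a fast tokenizer), non-keyword tokens have already been dropped by the pipeline, and the `min_score` threshold of 0.5 is arbitrary. `collect_keywords` is a hypothetical helper, not part of the model or the Transformers library.

```python
def collect_keywords(results, text, min_score=0.5):
    """Merge touching sub-word spans and return unique keywords in order.

    Assumption: every item has 'score', 'start', and 'end' keys, where
    start/end are character offsets into the original text.
    """
    spans = []
    for item in sorted(results, key=lambda r: r["start"]):
        if item["score"] < min_score:
            continue  # skip low-confidence predictions
        if spans and item["start"] <= spans[-1][1]:
            # Touches or overlaps the previous span: treat as the same word
            spans[-1][1] = max(spans[-1][1], item["end"])
        else:
            spans.append([item["start"], item["end"]])

    seen, keywords = set(), []
    for start, end in spans:
        word = text[start:end].strip()
        if word and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords


# Hand-written items in the pipeline's output shape, for illustration only
sample_text = "keyword extraction works"
sample_results = [
    {"word": "key", "score": 0.90, "start": 0, "end": 3},
    {"word": "##word", "score": 0.80, "start": 3, "end": 7},
    {"word": "extraction", "score": 0.20, "start": 8, "end": 18},
    {"word": "works", "score": 0.95, "start": 19, "end": 24},
]
print(collect_keywords(sample_results, sample_text))
```

With the example above, this would be called as `collect_keywords(results, text)`. Slicing the original text by offsets avoids having to strip tokenizer-specific sub-word markers from the `word` field.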