PakdamanAli
/

keyword_distilbert_base_per

Token Classification

keyword-extraction

	---
	language: fa
	license: mit
	tags:
	- keyword-extraction
	- persian
	- farsi
	- token-classification
	- distilbert
	- nlp
	datasets:
	- custom
	metrics:
	- precision
	- recall
	- f1
	widget:
	- text: "ایران کشوری با تاریخ و فرهنگ غنی است که دارای جاذبه‌های گردشگری فراوان می‌باشد."
	---

	# Model Card: Persian Keyword Extraction Model

	## Model Details
	- Model Name: keyword_distilbert_base_per
	- Base Model: DistilBERT
	- Task: Keyword Extraction
	- Language: Persian (Farsi)
	- Developer: PakdamanAli
	- Model Version: 1.0.0

	## Intended Use
	This model is designed to extract keywords from Persian text. It can be used for:
	- Automatic tagging of content
	- Search engine optimization
	- Content categorization
	- Topic modeling
	- Information retrieval enhancement

	### Primary Intended Uses
	- Content analysis for Persian websites
	- Academic research on Persian text
	- Information extraction systems

	### Out-of-Scope Use Cases
	- Translation services
	- Text summarization
	- Persian named entity recognition (unless specifically trained for this)
	- Other NLP tasks beyond keyword extraction

	## Training Data
	- Dataset Size: 40,000 Persian text samples
	- Data Preparation: Fine-tuned on xlm-roberta-large

	## Performance Evaluation
	Metrics and evaluation results will be published in a future update.
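
Until results are published, the sketch below shows one common way the listed metrics (precision, recall, F1) could be computed for keyword extraction, by comparing a predicted keyword set against a reference set for a single document. The function name `keyword_prf` and the exact-match criterion are assumptions made for illustration; this is not necessarily the evaluation protocol used for this model.

```python
def keyword_prf(predicted, gold):
    """Set-based precision, recall, and F1 for one document's keywords.

    Hypothetical helper: keywords count as correct only on exact string
    match after trimming whitespace.
    """
    pred_set = {k.strip() for k in predicted if k.strip()}
    gold_set = {k.strip() for k in gold if k.strip()}

    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0

    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative (made-up) keyword lists, not model output
p, r, f1 = keyword_prf(
    ["ایران", "تاریخ", "فرهنگ"],
    ["ایران", "فرهنگ", "گردشگری", "جاذبه"],
)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Exact matching is strict; partial or stemmed matching would give different numbers, so any reported score should state which criterion was used.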

	## Limitations
	- The model may not perform well on domain-specific content that was not represented in the training data
	- Performance may vary for very short or extremely long texts
	- The model may occasionally extract words that are not truly "key" to the content
	- Dialect variations in Persian might affect extraction quality
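
For the long-text limitation above, one possible workaround is to split the input into sentence-based chunks and run the extractor on each chunk. The following is a minimal sketch under stated assumptions: `split_into_chunks` is a hypothetical helper, the 400-character default is an arbitrary illustrative value rather than a limit documented for this model, and the sentence splitting is a simple punctuation-based heuristic.

```python
import re


def split_into_chunks(text, max_chars=400):
    """Group sentences into chunks of roughly max_chars characters.

    Hypothetical helper. A single sentence longer than max_chars is kept
    whole, so a chunk can exceed the limit in that case.
    """
    # Split after Latin or Persian sentence-ending punctuation (. ! ? ؟)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?؟])\s+", text) if s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to the pipeline separately and the per-chunk keywords combined afterwards.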

	## Ethical Considerations
	- The model is trained on Persian text and may reflect biases present in that content
	- Users should verify extracted keywords for sensitive content before implementing in automated systems
	- The model should not be used to extract or analyze personally identifiable information without proper consent

	## Technical Specifications
	- Input: Persian text (UTF-8 encoded)
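	- Output: Token-level predictions (a list of dictionaries from the token-classification pipeline), from which the extracted keywords are read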
	- Framework: Transformers (Hugging Face)
	- Requirements: PyTorch, Transformers

	## Pipeline Usage
	To use this model with the Hugging Face pipeline:

	```python
	from transformers import pipeline

	# Initialize the pipeline
	keyword_extractor = pipeline(
	    task="token-classification",
	    model="PakdamanAli/keyword_distilbert_base_per",
	    tokenizer="PakdamanAli/keyword_distilbert_base_per"
	)

	# Example usage
	text = "ایران کشوری با تاریخ و فرهنگ غنی است که دارای جاذبه‌های گردشگری فراوان می‌باشد."
	keywords = keyword_extractor(text)

	# Each result is a dict describing one predicted token; collect the token text
	extracted_keywords = [item["word"] for item in keywords]
	print(extracted_keywords)
	```

	## Example
	```python
	from transformers import pipeline

	extractor = pipeline(
	    task="token-classification",
	    model="PakdamanAli/keyword_distilbert_base_per",
	    tokenizer="PakdamanAli/keyword_distilbert_base_per"
	)

	text = "ایران کشوری با تاریخ و فرهنگ غنی است که دارای جاذبه‌های گردشگری فراوان می‌باشد."
	results = extractor(text)

	# Extract just the words from the results
	keywords = [item["word"] for item in results]
	print(keywords)
	```
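
Reading `item["word"]` directly can return sub-word pieces and repeated entries. The sketch below is one possible post-processing step, not part of the model itself: it merges predictions that sit next to each other in the text into a single phrase and removes duplicates. It assumes each result dictionary carries `score`, `start`, and `end` keys with integer character offsets, as the token-classification pipeline provides when a fast tokenizer is used. The name `collect_keywords` and the `min_score` default are hypothetical.

```python
def collect_keywords(text, results, min_score=0.5):
    """Merge adjacent predicted spans into phrases and drop duplicates.

    Hypothetical helper: `results` is a list of dicts with "score",
    "start", and "end" keys (character offsets into `text`).
    """
    kept = sorted(
        (r for r in results if r["score"] >= min_score),
        key=lambda r: r["start"],
    )

    spans = []  # list of [start, end] character offsets
    for r in kept:
        # Merge when only whitespace (or nothing) separates this span
        # from the previous one, e.g. sub-word pieces or adjacent words
        if spans and text[spans[-1][1]:r["start"]].strip() == "":
            spans[-1][1] = max(spans[-1][1], r["end"])
        else:
            spans.append([r["start"], r["end"]])

    keywords = []
    for start, end in spans:
        phrase = text[start:end]
        if phrase not in keywords:  # keep first occurrence, preserve order
            keywords.append(phrase)
    return keywords
```

With the example above, this would be called as `collect_keywords(text, results)`. Merging adjacent spans is a design choice: it yields multi-word key phrases, but it also joins two separate keywords that happen to be neighbours, so it may not suit every use case.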