# Datasets for Specific Tasks
This folder contains dataset examples for specific NLP tasks.
## Tasks
### ❓ Question Answering
#### Extractive QA (SQuAD-style)
```python
{
    'context': 'Paris is the capital of France...',
    'question': 'What is the capital of France?',
    'answers': {
        'text': ['Paris'],
        'answer_start': [0]
    }
}
```
#### Multiple Choice QA
```python
{
    'question': 'What is 2+2?',
    'choices': ['3', '4', '5', '6'],
    'answer': 1  # index of the correct answer
}
```
**Best Practices:**
- Validate answer spans
- Handle impossible questions
- Question type classification
- Context length management
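The "validate answer spans" point can be sketched as a check that each `answer_start` offset actually indexes the answer text inside the context. Field names follow the extractive example above; `validate_spans` is an illustrative helper, not a library function:

```python
def validate_spans(example):
    """Check that every answer span matches the context at answer_start."""
    context = example['context']
    answers = example['answers']
    for text, start in zip(answers['text'], answers['answer_start']):
        if context[start:start + len(text)] != text:
            return False
    return True

example = {
    'context': 'Paris is the capital of France.',
    'question': 'What is the capital of France?',
    'answers': {'text': ['Paris'], 'answer_start': [0]},
}
print(validate_spans(example))  # True
```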
### 📝 Summarization
#### News Summarization
```python
{
    'article': 'Long news article...',
    'summary': 'Brief summary...',
    'compression_ratio': 0.24
}
```
**Metrics:**
- ROUGE scores
- Compression ratio (20-30% optimal)
- Abstractive vs. extractive
**Best Practices:**
- Multiple reference summaries
- Length constraints
- Quality validation
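The compression-ratio guideline can be checked mechanically. This sketch assumes a word-level ratio (the section does not specify word- vs. character-level), and `compression_ratio` is a hypothetical helper:

```python
def compression_ratio(article, summary):
    """Word-level compression ratio: summary length / article length."""
    return len(summary.split()) / len(article.split())

article = ' '.join(['word'] * 100)  # stand-in for a 100-word article
summary = ' '.join(['word'] * 24)   # stand-in for a 24-word summary
ratio = compression_ratio(article, summary)
print(round(ratio, 2))  # 0.24, inside the suggested 20-30% band
```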
### 🏷️ Named Entity Recognition
#### BIO Tagging
```python
{
    'tokens': ['John', 'Smith', 'works', 'at', 'Google'],
    'ner_tags': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG']
}
```
**Tag Schema:**
- B-PER, I-PER (Person)
- B-ORG, I-ORG (Organization)
- B-LOC, I-LOC (Location)
- O (Outside)
**Best Practices:**
- Consistent tagging scheme
- Entity type taxonomy
- Nested entity handling
- Entity linking (optional)
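A consistent tagging scheme can be enforced with a well-formedness check: in BIO, an `I-X` tag is only valid directly after a `B-X` or `I-X` of the same type. `valid_bio` is an illustrative helper:

```python
def valid_bio(tags):
    """Return True if a BIO tag sequence is well-formed:
    every I-X must follow a B-X or I-X of the same entity type."""
    prev = 'O'
    for tag in tags:
        if tag.startswith('I-'):
            entity = tag[2:]
            if prev not in (f'B-{entity}', f'I-{entity}'):
                return False
        prev = tag
    return True

print(valid_bio(['B-PER', 'I-PER', 'O', 'O', 'B-ORG']))  # True
print(valid_bio(['O', 'I-PER']))                         # False
```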
### 😊 Sentiment Analysis
#### Binary/Multi-class
```python
{
    'text': 'This product is amazing!',
    'label': 2,  # 0: neg, 1: neutral, 2: pos
    'confidence': 0.95
}
```
#### Aspect-Based
```python
{
    'text': 'Great product but slow delivery',
    'aspect_sentiments': {
        'product': 'positive',
        'delivery': 'negative'
    }
}
```
**Best Practices:**
- Multi-level granularity
- Confidence scores
- Domain-specific lexicons
- Emotion detection
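Both formats above can be validated against a fixed label set. This sketch assumes the 0/1/2 mapping from the inline comment (negative/neutral/positive); the helper names are illustrative:

```python
LABELS = {'negative': 0, 'neutral': 1, 'positive': 2}

def validate_multiclass(example):
    """Check the integer label and the confidence range for the multi-class format."""
    return example['label'] in LABELS.values() and 0.0 <= example['confidence'] <= 1.0

def validate_aspects(example):
    """Check that every aspect uses a known sentiment string."""
    return all(s in LABELS for s in example['aspect_sentiments'].values())

print(validate_multiclass({'text': 'This product is amazing!',
                           'label': 2, 'confidence': 0.95}))  # True
print(validate_aspects({'text': 'Great product but slow delivery',
                        'aspect_sentiments': {'product': 'positive',
                                              'delivery': 'negative'}}))  # True
```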
### 📊 Text Classification
#### Topic Classification
```python
{
    'text': 'Article text...',
    'label': 'technology',
    'label_id': 0
}
```
**Best Practices:**
- Balanced classes
- Hierarchical categories
- Multi-label support
- Class imbalance handling
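The class-balance points can be monitored with a simple frequency check. The `3.0` imbalance threshold below is an arbitrary illustrative choice, not a standard value:

```python
from collections import Counter

def class_balance(labels, max_ratio=3.0):
    """Report class counts and whether the most frequent class stays
    within max_ratio x the rarest class."""
    counts = Counter(labels)
    most, least = max(counts.values()), min(counts.values())
    return counts, most / least <= max_ratio

labels = ['technology'] * 80 + ['sports'] * 70 + ['politics'] * 50
counts, balanced = class_balance(labels)
print(dict(counts), balanced)  # {'technology': 80, 'sports': 70, 'politics': 50} True
```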
### 🎯 Multi-Task Learning
#### Unified Format
```python
{
    'text': 'Sample text...',
    'sentiment': 'positive',
    'topic': 'technology',
    'quality_score': 0.85
}
```
**Best Practices:**
- Consistent preprocessing
- Task-specific heads
- Shared representations
- Task weighting
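Task weighting is typically implemented as a weighted sum of per-task losses feeding one training objective. This is a minimal sketch with made-up loss values and weights:

```python
def multi_task_loss(losses, weights):
    """Combine per-task losses into one objective; weights are a tuning choice."""
    return sum(weights[task] * loss for task, loss in losses.items())

losses = {'sentiment': 0.8, 'topic': 1.2, 'quality': 0.5}
weights = {'sentiment': 1.0, 'topic': 0.5, 'quality': 2.0}
print(round(multi_task_loss(losses, weights), 2))  # 2.4
```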
## Dataset Statistics
| Task | Examples | Format |
|------|----------|--------|
| QA | 300 | Extractive + MC |
| Summarization | 100 | News articles |
| NER | 100 | BIO tagged |
| Sentiment | 350 | Multi-class + Aspect |
| Classification | 200 | Topic |
| Multi-Task | 100 | Unified |
## Quality Metrics
### QA
- Exact Match (EM)
- F1 Score
- Answer span accuracy
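EM and token-level F1 can be computed without external dependencies. This sketch uses simple lowercase/whitespace normalization (the official SQuAD script additionally strips punctuation and articles):

```python
from collections import Counter

def exact_match(pred, gold):
    """Exact Match after lowercase/whitespace normalization."""
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Token-overlap F1 between a predicted and a gold answer."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match('Paris', 'paris'))           # 1
print(token_f1('the capital Paris', 'Paris'))  # 0.5
```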
### Summarization
- ROUGE-1, ROUGE-2, ROUGE-L
- Compression ratio
- Factual consistency
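Production setups usually rely on a ROUGE package, but ROUGE-1 reduces to unigram-overlap F1 and can be sketched directly; `rouge1_f1` is an illustrative helper:

```python
from collections import Counter

def rouge1_f1(summary, reference):
    """Minimal ROUGE-1: unigram-overlap F1 between summary and reference."""
    sum_counts = Counter(summary.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((sum_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(sum_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1('the cat sat', 'the cat sat on the mat'))  # ~0.667
```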
### NER
- Precision, recall, F1 per entity type
- Exact match
- Partial match
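Exact-match NER scoring works on entity spans rather than individual tags, so the BIO sequence is first converted to `(type, start, end)` spans. `extract_spans` is an illustrative helper (stray `I-` tags are treated as closing the current entity):

```python
def extract_spans(tags):
    """Convert a BIO tag sequence into (type, start, end) entity spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ['O']):  # sentinel 'O' closes a trailing entity
        if tag.startswith('B-') or tag == 'O' or (tag.startswith('I-') and tag[2:] != etype):
            if etype is not None:
                spans.append((etype, start, i))
            start, etype = (i, tag[2:]) if tag.startswith('B-') else (None, None)
    return spans

gold = extract_spans(['B-PER', 'I-PER', 'O', 'O', 'B-ORG'])
pred = extract_spans(['B-PER', 'O', 'O', 'O', 'B-ORG'])
tp = len(set(gold) & set(pred))  # only exact span matches count
precision, recall = tp / len(pred), tp / len(gold)
print(precision, recall)  # 0.5 0.5
```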
### Sentiment
- Accuracy
- Macro/micro F1
- Confusion matrix
### Classification
- Accuracy
- Per-class F1
- Macro/weighted F1
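Per-class, macro, and weighted F1 can be sketched from raw predictions (helper names are illustrative; real projects would typically use scikit-learn's `f1_score`):

```python
from collections import Counter

def per_class_f1(y_true, y_pred, label):
    """One-vs-rest F1 for a single class label."""
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label != t for t, p in zip(y_true, y_pred))
    fn = sum(t == label != p for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = ['tech', 'tech', 'tech', 'sport']
y_pred = ['tech', 'tech', 'sport', 'sport']
labels = ['tech', 'sport']
f1s = [per_class_f1(y_true, y_pred, lbl) for lbl in labels]
macro = sum(f1s) / len(f1s)                      # unweighted mean over classes
counts = Counter(y_true)
weighted = sum(f * counts[lbl] for f, lbl in zip(f1s, labels)) / len(y_true)
print(round(macro, 3), round(weighted, 3))  # 0.733 0.767
```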
## Best Practices (General)
- ✅ Clear annotation guidelines
- ✅ Inter-annotator agreement
- ✅ Quality control checks
- ✅ Regular dataset updates
- ✅ Version control
- ✅ Documentation
- ✅ Ethical considerations
- ✅ Bias analysis