# Datasets for Specific Tasks
This folder contains example datasets for specific NLP tasks.
## Tasks
### ❓ Question Answering
#### Extractive QA (SQuAD-style)
```python
{
    'context': 'Paris is the capital of France...',
    'question': 'What is the capital of France?',
    'answers': {
        'text': ['Paris'],
        'answer_start': [0]
    }
}
```
#### Multiple Choice QA
```python
{
    'question': 'What is 2+2?',
    'choices': ['3', '4', '5', '6'],
    'answer': 1  # Index of correct answer
}
```
**Best Practices:**
- Validate answer spans
- Handle impossible questions
- Question type classification
- Context length management
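
To make "validate answer spans" concrete, the sketch below checks that each answer text really occurs in the context at its stated character offset. The function name `validate_answer_spans` is hypothetical, and treating an empty answer list as an "impossible" question is an assumption borrowed from SQuAD 2.0-style data, not something this folder specifies.

```python
def validate_answer_spans(example):
    """Return True if every answer text matches the context at its offset."""
    context = example["context"]
    texts = example["answers"]["text"]
    starts = example["answers"]["answer_start"]
    if len(texts) != len(starts):
        return False
    # Assumption: an empty answer list marks an impossible question and is valid.
    return all(
        start >= 0 and context[start:start + len(text)] == text
        for text, start in zip(texts, starts)
    )


example = {
    "context": "Paris is the capital of France...",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}
print(validate_answer_spans(example))
```

Running a check like this over the whole dataset before training catches offsets that drifted after text cleaning or re-tokenization.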
### 📝 Summarization
#### News Summarization
```python
{
    'article': 'Long news article...',
    'summary': 'Brief summary...',
    'compression_ratio': 0.24
}
```
**Metrics:**
- ROUGE scores
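- Compression ratio (summary length / article length; roughly 20-30% is the target)
- Abstractive vs. extractive (whether the summary rewrites the article or copies spans from it)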
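
As an illustration of the compression ratio, here is a minimal sketch that divides summary length by article length. Counting whitespace-separated tokens is an assumption; the README does not say whether the `compression_ratio` field is measured in characters, words, or model tokens.

```python
def compression_ratio(article, summary):
    """Summary length divided by article length, in whitespace tokens."""
    article_len = len(article.split())
    if article_len == 0:
        raise ValueError("article is empty")
    return len(summary.split()) / article_len


def in_target_range(ratio, low=0.20, high=0.30):
    """True if the ratio falls inside the 20-30% band mentioned above."""
    return low <= ratio <= high


ratio = compression_ratio("a b c d e f g h i j", "a b")
print(ratio, in_target_range(ratio))
```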
**Best Practices:**
- Multiple reference summaries
- Length constraints
- Quality validation
### 🏷️ Named Entity Recognition
#### BIO Tagging
```python
{
    'tokens': ['John', 'Smith', 'works', 'at', 'Google'],
    'ner_tags': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG']
}
```
**Tag Schema:**
- B-PER, I-PER (Person)
- B-ORG, I-ORG (Organization)
- B-LOC, I-LOC (Location)
- O (Outside)
**Best Practices:**
- Consistent tagging scheme
- Entity type taxonomy
- Handling of nested entities
- Entity linking (optional)
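
One way to enforce a consistent tagging scheme is to decode the BIO tags into entity spans and reject sequences that break the scheme. The sketch below is one possible check, not the project's own tooling; `bio_to_spans` is a hypothetical helper, and it takes the strict view that an `I-` tag must continue an entity of the same type.

```python
def bio_to_spans(tokens, tags):
    """Convert BIO tags into (entity_type, start, end) spans; end is exclusive."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans, start, current = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and current == tag[2:]:
            continue  # the open entity continues
        if current is not None:
            spans.append((current, start, i))  # close the open entity
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
        elif tag.startswith("I-"):
            raise ValueError(f"I- tag without matching B- at position {i}: {tag}")
        elif tag != "O":
            raise ValueError(f"unknown tag at position {i}: {tag}")
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


tokens = ["John", "Smith", "works", "at", "Google"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG"]
print(bio_to_spans(tokens, tags))
```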
### 😊 Sentiment Analysis
#### Binary/Multi-class
```python
{
    'text': 'This product is amazing!',
    'label': 2,  # 0: neg, 1: neutral, 2: pos
    'confidence': 0.95
}
```
#### Aspect-Based
```python
{
    'text': 'Great product but slow delivery',
    'aspect_sentiments': {
        'product': 'positive',
        'delivery': 'negative'
    }
}
```
**Best Practices:**
- Multi-level granularity
- Confidence scores
- Domain-specific lexicons
- Emotion detection
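
A small record check can catch label and confidence problems early. The sketch below assumes the three-way label mapping from the multi-class example (0: negative, 1: neutral, 2: positive) and a confidence in the range [0, 1]; `check_sentiment_record` is a hypothetical helper name.

```python
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def check_sentiment_record(record):
    """Return a list of problems found in a multi-class sentiment record."""
    problems = []
    if not str(record.get("text", "")).strip():
        problems.append("empty text")
    if record.get("label") not in ID2LABEL:
        problems.append(f"unknown label: {record.get('label')!r}")
    confidence = record.get("confidence")
    # Assumption: confidence is optional, but must lie in [0, 1] when present.
    if confidence is not None and not 0.0 <= confidence <= 1.0:
        problems.append(f"confidence out of range: {confidence!r}")
    return problems


record = {"text": "This product is amazing!", "label": 2, "confidence": 0.95}
print(check_sentiment_record(record))
```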
### 📊 Text Classification
#### Topic Classification
```python
{
    'text': 'Article text...',
    'label': 'technology',
    'label_id': 0
}
```
**Best Practices:**
- Balanced classes
- Hierarchical categories
- Multi-label support
- Class imbalance handling
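
For class imbalance, one common option is inverse-frequency class weights. The sketch below uses the formula `n_samples / (n_classes * count)`, which is the heuristic behind scikit-learn's `class_weight="balanced"`. Whether weighting, resampling, or neither is appropriate depends on the dataset, so treat this as one option among several.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


labels = ["technology", "technology", "technology", "sports"]
print(class_weights(labels))
```

Rare classes get weights above 1 and frequent classes get weights below 1, so a weighted loss pays more attention to the under-represented labels.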
### 🎯 Multi-Task Learning
#### Unified Format
```python
{
    'text': 'Sample text...',
    'sentiment': 'positive',
    'topic': 'technology',
    'quality_score': 0.85
}
```
**Best Practices:**
- Consistent preprocessing
- Task-specific heads
- Shared representations
- Task weighting
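
Task weighting is often implemented as a weighted sum of per-task losses. The sketch below shows only that arithmetic in plain Python; the task names mirror the unified format above, while the loss values and weights are made-up illustration numbers, not tuned settings.

```python
def combine_task_losses(losses, weights):
    """Weighted sum of per-task losses; every task needs an explicit weight."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight for tasks: {sorted(missing)}")
    return sum(weights[task] * loss for task, loss in losses.items())


# Hypothetical per-task losses and weights, for illustration only.
losses = {"sentiment": 0.5, "topic": 1.0, "quality_score": 0.2}
weights = {"sentiment": 1.0, "topic": 0.5, "quality_score": 2.0}
print(combine_task_losses(losses, weights))
```

In a real training loop the same sum would be taken over framework tensors so that gradients flow back into the shared representation.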
## Dataset Statistics
| Task | Examples | Format |
|------|----------|--------|
| QA | 300 | Extractive + MC |
| Summarization | 100 | News articles |
| NER | 100 | BIO tagged |
| Sentiment | 350 | Multi-class + Aspect |
| Classification | 200 | Topic |
| Multi-Task | 100 | Unified |
## Quality Metrics
### QA
- Exact Match (EM)
- F1 Score
- Answer span accuracy
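
The sketch below shows how Exact Match and token-level F1 are commonly computed for extractive QA. The normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace) is modeled on the SQuAD evaluation script; the function names here are illustrative, and the project may score differently.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction, reference):
    pred = normalize_answer(prediction).split()
    ref = normalize_answer(reference).split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Paris.", "paris"), token_f1("Paris France", "Paris"))
```

When an example has several reference answers, the usual convention is to take the maximum score over the references.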
### Summarization
- ROUGE-1, ROUGE-2, ROUGE-L
- Compression ratio
- Factual consistency
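
To show what ROUGE-1 and ROUGE-2 measure, here is a simplified n-gram overlap sketch. It assumes lowercased whitespace tokenization with no stemming, so its scores will differ from established ROUGE implementations; use it to understand the metric rather than to report results.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap as precision, recall, F1."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # counts clipped to the reference
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


print(rouge_n("the cat sat", "the cat sat on the mat", n=1))
print(rouge_n("the cat sat", "the cat sat on the mat", n=2))
```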
### NER
- Precision, Recall, F1 per entity type
- Exact match
- Partial match
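
Per-entity-type scores under exact matching can be sketched as follows: a predicted span counts as correct only if its type, start, and end all match a gold span. The `(entity_type, start, end)` tuple format and the function name are assumptions for illustration; partial-match scoring would need a separate overlap rule.

```python
from collections import defaultdict


def per_type_prf(gold_spans, pred_spans):
    """Exact-match precision, recall, and F1 per entity type."""
    gold, pred = set(gold_spans), set(pred_spans)
    stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for span in pred:
        stats[span[0]]["tp" if span in gold else "fp"] += 1
    for span in gold - pred:
        stats[span[0]]["fn"] += 1
    results = {}
    for etype, s in stats.items():
        predicted, actual = s["tp"] + s["fp"], s["tp"] + s["fn"]
        precision = s["tp"] / predicted if predicted else 0.0
        recall = s["tp"] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        results[etype] = {"precision": precision, "recall": recall, "f1": f1}
    return results


gold = [("PER", 0, 2), ("ORG", 4, 5)]
pred = [("PER", 0, 2), ("ORG", 3, 5)]  # ORG boundary is off by one token
print(per_type_prf(gold, pred))
```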
### Sentiment
- Accuracy
- Macro/Micro F1
- Confusion matrix
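
The three sentiment metrics are closely related, since macro and micro F1 can both be read off the confusion matrix. The sketch below assumes integer class ids (as in the multi-class example) and single-label predictions, in which case micro F1 equals accuracy.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[t][p] counts examples with true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def macro_micro_f1(matrix):
    """Return (macro_f1, micro_f1) for a single-label confusion matrix."""
    n = len(matrix)
    f1_scores, tp_total = [], 0
    for c in range(n):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n)) - tp
        fn = sum(matrix[c]) - tp
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
        tp_total += tp
    total = sum(sum(row) for row in matrix)
    micro = tp_total / total if total else 0.0  # equals accuracy here
    return sum(f1_scores) / n, micro


y_true = [0, 1, 2, 2]
y_pred = [0, 2, 2, 2]
matrix = confusion_matrix(y_true, y_pred, n_classes=3)
print(matrix, macro_micro_f1(matrix))
```

A large gap between macro and micro F1 usually means that a minority class, often neutral, is being predicted poorly.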
### Classification
- Accuracy
- Per-class F1
- Macro/Weighted F1
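
Weighted F1 differs from macro F1 only in how the per-class scores are averaged: each class is weighted by its support, meaning its number of true examples. A minimal sketch with string labels, assuming single-label classification and a non-empty `y_true`:

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 for every label that appears in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 weighted by each class's support in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


y_true = ["technology", "technology", "technology", "sports"]
y_pred = ["technology", "technology", "sports", "sports"]
print(per_class_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```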
## Best Practices (General)
✅ Clear annotation guidelines
✅ Inter-annotator agreement
✅ Quality control checks
✅ Regular dataset updates
✅ Version control
✅ Documentation
✅ Ethical considerations
✅ Bias analysis
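
Inter-annotator agreement between two annotators is often quantified with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The README does not name a specific agreement statistic, so the choice of kappa here is an assumption; the labels below are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_a = ["pos", "pos", "neg", "neg"]
annotator_b = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(annotator_a, annotator_b))
```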