# Datasets for Specific Tasks

This folder contains example datasets for specific NLP tasks.

## Tasks

### ❓ Question Answering

#### Extractive QA (SQuAD-style)
```python
{
    'context': 'Paris is the capital of France...',
    'question': 'What is the capital of France?',
    'answers': {
        'text': ['Paris'],
        'answer_start': [0]
    }
}
```

#### Multiple Choice QA
```python
{
    'question': 'What is 2+2?',
    'choices': ['3', '4', '5', '6'],
    'answer': 1  # Index of correct answer
}
```

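**Best Practices:**
- Validate that each `answer_start` offset points at the answer text inside `context`
- Handle impossible (unanswerable) questions explicitly
- Classify questions by type
- Manage context length so long contexts do not exceed the model's input limit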
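
The span check from the list above can be sketched as follows. The function name `validate_answer_spans` is hypothetical, and the sketch assumes that an unanswerable question is stored with empty `text` and `answer_start` lists.

```python
def validate_answer_spans(example):
    """Return True if every answer text is found in the context at its offset."""
    context = example["context"]
    texts = example["answers"]["text"]
    starts = example["answers"]["answer_start"]
    # The two lists must line up one-to-one.
    if len(texts) != len(starts):
        return False
    # Assumption: empty lists mark an unanswerable question, which is valid.
    return all(
        start >= 0 and context[start:start + len(text)] == text
        for text, start in zip(texts, starts)
    )


example = {
    "context": "Paris is the capital of France...",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}
is_valid = validate_answer_spans(example)
```

Running this check over the whole dataset before training catches offsets that drifted after text cleaning or re-tokenization.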

### 📝 Summarization

#### News Summarization
```python
{
    'article': 'Long news article...',
    'summary': 'Brief summary...',
    'compression_ratio': 0.24
}
```

**Metrics:**
- ROUGE scores
- Compression ratio (summary length divided by article length; 20-30% is the target range)
- Abstractive vs. extractive style of each summary
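
A minimal sketch of the compression ratio follows. It assumes length is counted in whitespace-separated words, which the source does not specify; `compression_ratio` and `in_target_range` are hypothetical helper names.

```python
def compression_ratio(article, summary):
    """Summary length divided by article length, counted in words (assumption)."""
    article_words = len(article.split())
    if article_words == 0:
        raise ValueError("article must contain at least one word")
    return len(summary.split()) / article_words


def in_target_range(ratio, low=0.20, high=0.30):
    """Check the ratio against the 20-30% range named above."""
    return low <= ratio <= high


ratio = compression_ratio("a b c d e f g h", "a b")
ok = in_target_range(ratio)
```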

**Best Practices:**
- Multiple reference summaries
- Length constraints
- Quality validation

### 🏷️ Named Entity Recognition

#### BIO Tagging
```python
{
    'tokens': ['John', 'Smith', 'works', 'at', 'Google'],
    'ner_tags': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG']
}
```

**Tag Schema:**
- B-PER, I-PER (Person)
- B-ORG, I-ORG (Organization)
- B-LOC, I-LOC (Location)
- O (Outside)

**Best Practices:**
- Consistent tagging scheme
- Entity type taxonomy
- Nested entities handling
- Entity linking (optional)
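
One way to enforce a consistent tagging scheme is to reject sequences where an `I-` tag does not continue an entity of the same type. The sketch below is an illustration under that rule; `find_bio_errors` is a hypothetical name.

```python
def find_bio_errors(tags):
    """Return the indices of I- tags that do not continue a same-type entity."""
    errors = []
    previous = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # An I-X tag is only valid right after B-X or I-X.
            if previous not in (f"B-{entity_type}", f"I-{entity_type}"):
                errors.append(i)
        previous = tag
    return errors


valid_errors = find_bio_errors(["B-PER", "I-PER", "O", "O", "B-ORG"])
broken_errors = find_bio_errors(["O", "I-PER", "B-ORG", "I-LOC"])
```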

### 😊 Sentiment Analysis

#### Binary/Multi-class
```python
{
    'text': 'This product is amazing!',
    'label': 2,  # 0: neg, 1: neutral, 2: pos
    'confidence': 0.95
}
```

#### Aspect-Based
```python
{
    'text': 'Great product but slow delivery',
    'aspect_sentiments': {
        'product': 'positive',
        'delivery': 'negative'
    }
}
```

**Best Practices:**
- Multi-level granularity
- Confidence scores
- Domain-specific lexicons
- Emotion detection
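
The confidence scores can be used to filter out uncertain labels. The sketch below assumes the label ids from the Binary/Multi-class example; the `0.8` threshold and the name `keep_confident` are illustrative choices, not values from this dataset.

```python
# Label ids as given in the Binary/Multi-class example.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def keep_confident(examples, min_confidence=0.8):
    """Keep only examples whose confidence meets the (assumed) threshold."""
    return [ex for ex in examples if ex["confidence"] >= min_confidence]


examples = [
    {"text": "This product is amazing!", "label": 2, "confidence": 0.95},
    {"text": "It is okay, I guess", "label": 1, "confidence": 0.55},
]
kept = keep_confident(examples)
kept_labels = [ID2LABEL[ex["label"]] for ex in kept]
```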

### 📊 Text Classification

#### Topic Classification
```python
{
    'text': 'Article text...',
    'label': 'technology',
    'label_id': 0
}
```

**Best Practices:**
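- Aim for balanced classes when collecting data
- Support hierarchical categories
- Support multi-label examples
- Handle remaining class imbalance, for example with resampling or class weights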
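
One common way to handle class imbalance is inverse-frequency class weights. The sketch below uses the formula `n_samples / (n_classes * class_count)`; this is an illustrative choice rather than something this dataset prescribes, and `inverse_frequency_weights` is a hypothetical name.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


weights = inverse_frequency_weights(
    ["technology", "technology", "technology", "sports"]
)
```

Rare classes receive weights above 1 and frequent classes below 1, so a weighted loss pays more attention to the minority class.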

### 🎯 Multi-Task Learning

#### Unified Format
```python
{
    'text': 'Sample text...',
    'sentiment': 'positive',
    'topic': 'technology',
    'quality_score': 0.85
}
```

**Best Practices:**
- Consistent preprocessing
- Task-specific heads
- Shared representations
- Task weighting
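
Task weighting is often implemented as a weighted sum of the per-task losses. The sketch below assumes the losses are already computed as plain floats; the task names and weights are illustrative, and `combine_task_losses` is a hypothetical name.

```python
def combine_task_losses(losses, weights):
    """Return the weighted sum of per-task losses."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for tasks: {sorted(missing)}")
    return sum(weights[task] * loss for task, loss in losses.items())


# Illustrative values only.
total_loss = combine_task_losses(
    losses={"sentiment": 0.5, "topic": 1.0},
    weights={"sentiment": 1.0, "topic": 0.5},
)
```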

## Dataset Statistics

| Task | Examples | Format |
|------|----------|--------|
| QA | 300 | Extractive + MC |
| Summarization | 100 | News articles |
| NER | 100 | BIO tagged |
| Sentiment | 350 | Multi-class + Aspect |
| Classification | 200 | Topic |
| Multi-Task | 100 | Unified |

## Quality Metrics

### QA
- Exact Match (EM)
- F1 Score
- Answer span accuracy
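
A simplified sketch of Exact Match and token-level F1 follows. It only lowercases and splits on whitespace; the official SQuAD evaluation script also strips punctuation and articles, so scores from this sketch may differ slightly. The function names are hypothetical.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1 if the strings match after trimming and lowercasing, else 0."""
    return int(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("Paris", " paris ")
f1 = token_f1("the city of Paris", "Paris")
```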

### Summarization
- ROUGE-1, ROUGE-2, ROUGE-L
- Compression ratio
- Factual consistency

### NER
- Precision, Recall, F1 per entity type
- Exact match
- Partial match
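
Exact-match entity scoring can be sketched by converting BIO tags to spans and comparing the span sets. This is a minimal illustration, not a replacement for an established evaluation library; `extract_entities` and `entity_prf` are hypothetical names, and stray `I-` tags without a preceding `B-` are ignored here.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans, with end exclusive."""
    entities = set()
    start, current = None, None
    # The trailing "O" sentinel closes an entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == current
        if current is not None and not continues:
            entities.add((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return entities


def entity_prf(gold_tags, pred_tags, entity_type=None):
    """Exact-match precision, recall, F1; optionally for one entity type."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    if entity_type is not None:
        gold = {e for e in gold if e[0] == entity_type}
        pred = {e for e in pred if e[0] == entity_type}
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "O", "B-ORG"]
pred = ["B-PER", "O", "O", "O", "B-ORG"]
scores = entity_prf(gold, pred)
```

In this example the predicted person span stops one token early, so under exact matching it counts as both a false positive and a false negative.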

### Sentiment
- Accuracy
- Macro/Micro F1
- Confusion matrix
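
The difference between macro and micro F1 can be sketched in plain Python. Macro F1 averages the per-class scores equally, while micro F1 pools the counts over all classes; for single-label classification, micro F1 equals accuracy. `macro_micro_f1` is a hypothetical name.

```python
def macro_micro_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    per_class = []
    total_tp = total_fp = total_fn = 0
    for label in set(y_true) | set(y_pred):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Each label occurs at least once, so the denominator is never zero.
        per_class.append(2 * tp / (2 * tp + fp + fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    macro = sum(per_class) / len(per_class)
    micro = 2 * total_tp / (2 * total_tp + total_fp + total_fn)
    return macro, micro


macro, micro = macro_micro_f1([0, 1, 2, 2], [0, 2, 2, 2])
```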

### Classification
- Accuracy
- Per-class F1
- Macro/Weighted F1

## Best Practices (General)

✅ Clear annotation guidelines  
✅ Inter-annotator agreement  
✅ Quality control checks  
✅ Regular dataset updates  
✅ Version control  
✅ Documentation  
✅ Ethical considerations  
✅ Bias analysis
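
Inter-annotator agreement between two annotators is commonly measured with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is one possible measure, not necessarily the one used for these datasets; `cohen_kappa` is a hypothetical name.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


kappa = cohen_kappa(
    ["pos", "pos", "neg", "neg"],
    ["pos", "neg", "neg", "neg"],
)
```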