MEHMET TUĞRUL KAYA
Initial commit: Advanced Dataset Tutorial
2e6a47d

Datasets for Specific Tasks

This folder contains example datasets for specific NLP tasks.

Tasks

❓ Question Answering

Extractive QA (SQuAD-style)

{
    'context': 'Paris is the capital of France...',
    'question': 'What is the capital of France?',
    'answers': {
        'text': ['Paris'],
        'answer_start': [0]
    }
}

Multiple Choice QA

{
    'question': 'What is 2+2?',
    'choices': ['3', '4', '5', '6'],
    'answer': 1  # Index of correct answer
}

Best Practices:

  • Validate answer spans
  • Handle impossible questions
  • Question type classification
  • Context length management
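
As a minimal sketch of the "validate answer spans" practice, the check below confirms that each answer text really occurs in the context at its stated offset. The function name `validate_answer_spans` is hypothetical, and treating an empty answer list as an impossible question is an assumption borrowed from the SQuAD 2.0 convention.

```python
def validate_answer_spans(example):
    """Return True if every answer text occurs in the context at its answer_start."""
    context = example["context"]
    texts = example["answers"]["text"]
    starts = example["answers"]["answer_start"]
    if len(texts) != len(starts):
        return False
    # Assumption: an empty answer list marks an impossible question and is valid.
    return all(
        start >= 0 and context[start:start + len(text)] == text
        for text, start in zip(texts, starts)
    )


example = {
    "context": "Paris is the capital of France...",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}
print(validate_answer_spans(example))
```

Running a check like this over the whole split before training catches offsets that drifted after text cleaning.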

📝 Summarization

News Summarization

{
    'article': 'Long news article...',
    'summary': 'Brief summary...',
    'compression_ratio': 0.24
}

Metrics:

  • ROUGE scores
  • Compression ratio (summary length / article length; 20-30% is the target range here)
  • Abstractive vs Extractive

Best Practices:

  • Multiple reference summaries
  • Length constraints
  • Quality validation
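
The compression ratio and length constraint can be sketched as follows. Counting whitespace-separated words is an assumption (the source does not say whether length is measured in words, tokens, or characters), and the helper names and default bounds are illustrative only.

```python
def compression_ratio(article, summary):
    """Ratio of summary length to article length, counted in whitespace-separated words."""
    article_len = len(article.split())
    if article_len == 0:
        raise ValueError("article must contain at least one word")
    return len(summary.split()) / article_len


def within_target(ratio, low=0.20, high=0.30):
    """Check a ratio against the 20-30% range; the bounds are adjustable assumptions."""
    return low <= ratio <= high


article = "one two three four five six seven eight nine ten"
summary = "one two"
ratio = compression_ratio(article, summary)
print(ratio, within_target(ratio))
```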

🏷️ Named Entity Recognition

BIO Tagging

{
    'tokens': ['John', 'Smith', 'works', 'at', 'Google'],
    'ner_tags': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG']
}

Tag Schema:

  • B-PER, I-PER (Person)
  • B-ORG, I-ORG (Organization)
  • B-LOC, I-LOC (Location)
  • O (Outside)

Best Practices:

  • Consistent tagging scheme
  • Entity type taxonomy
  • Nested entities handling
  • Entity linking (optional)
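
A consistent tagging scheme can be enforced with a small validator. The sketch below assumes the strict BIO convention in which an `I-X` tag must directly follow `B-X` or `I-X`; `is_valid_bio` is a hypothetical helper name.

```python
def is_valid_bio(tags):
    """Return True if the tag sequence follows strict BIO rules."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # An I- tag must continue an entity of the same type.
            if prev not in (f"B-{entity_type}", f"I-{entity_type}"):
                return False
        elif tag != "O" and not tag.startswith("B-"):
            return False  # unknown tag format
        prev = tag
    return True


print(is_valid_bio(["B-PER", "I-PER", "O", "O", "B-ORG"]))
print(is_valid_bio(["O", "I-PER"]))
```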

😊 Sentiment Analysis

Binary/Multi-class

{
    'text': 'This product is amazing!',
    'label': 2,  # 0: neg, 1: neutral, 2: pos
    'confidence': 0.95
}

Aspect-Based

{
    'text': 'Great product but slow delivery',
    'aspect_sentiments': {
        'product': 'positive',
        'delivery': 'negative'
    }
}

Best Practices:

  • Multi-level granularity
  • Confidence scores
  • Domain-specific lexicons
  • Emotion detection

📊 Text Classification

Topic Classification

{
    'text': 'Article text...',
    'label': 'technology',
    'label_id': 0
}

Best Practices:

  • Balanced classes
  • Hierarchical categories
  • Multi-label support
  • Class imbalance handling
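
One common way to handle class imbalance is to weight each class inversely to its frequency. The sketch below uses the heuristic `n_samples / (n_classes * class_count)`; whether this tutorial's datasets use weighting at all is not stated, so treat it as one possible approach.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


labels = ["technology", "technology", "technology", "sports"]
print(Counter(labels))        # class distribution
print(class_weights(labels))  # rarer classes receive larger weights
```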

🎯 Multi-Task Learning

Unified Format

{
    'text': 'Sample text...',
    'sentiment': 'positive',
    'topic': 'technology',
    'quality_score': 0.85
}

Best Practices:

  • Consistent preprocessing
  • Task-specific heads
  • Shared representations
  • Task weighting
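
Task weighting is often implemented as a weighted sum of per-task losses. The following is a framework-free sketch with plain floats; the task names mirror the unified format above, while the loss values and weights are made up for illustration.

```python
def combine_losses(losses, weights):
    """Weighted sum of per-task losses; every task must have a weight."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight for tasks: {sorted(missing)}")
    return sum(weights[task] * loss for task, loss in losses.items())


# Illustrative values only, not measured results.
losses = {"sentiment": 0.5, "topic": 1.0, "quality_score": 0.2}
weights = {"sentiment": 1.0, "topic": 0.5, "quality_score": 2.0}
print(combine_losses(losses, weights))
```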

Dataset Statistics

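| Task           | Examples | Format               |
|----------------|----------|----------------------|
| QA             | 300      | Extractive + MC      |
| Summarization  | 100      | News articles        |
| NER            | 100      | BIO tagged           |
| Sentiment      | 350      | Multi-class + Aspect |
| Classification | 200      | Topic                |
| Multi-Task     | 100      | Unified              |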

Quality Metrics

QA

  • Exact Match (EM)
  • F1 Score
  • Answer span accuracy
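
Exact Match and token-level F1 can be sketched as below. The normalization (lowercasing, stripping punctuation and English articles) follows the usual SQuAD-style evaluation, which is assumed here rather than confirmed by this tutorial.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Paris.", "Paris"))
print(token_f1("Paris France", "Paris"))
```

With multiple gold answers, the usual practice is to take the maximum score over the references.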

Summarization

  • ROUGE-1, ROUGE-2, ROUGE-L
  • Compression ratio
  • Factual consistency
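
To show what ROUGE-1 measures, here is a simplified unigram-overlap sketch. It lowercases and splits on whitespace with no stemming, so its numbers may differ from an established ROUGE package, which is the better choice for reported results.

```python
from collections import Counter


def rouge1(reference, candidate):
    """Simplified ROUGE-1: clipped unigram overlap between reference and candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # each word counts at most min(ref, cand) times
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat on the mat", "the cat sat")
print(scores)
```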

NER

  • Precision, Recall, F1 per entity type
  • Exact match
  • Partial match
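
Entity-level exact-match scores can be computed by extracting spans from the BIO tags and comparing sets, as sketched below. The helper names are hypothetical, stray `I-` tags without a preceding `B-` are ignored by assumption, and partial-match scoring is not covered.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from BIO tags; end is exclusive."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not continues:
            entities.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def entity_prf(gold_tags, pred_tags, entity_type=None):
    """Exact-match precision, recall, F1; optionally restricted to one entity type."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    if entity_type is not None:
        gold = {e for e in gold if e[0] == entity_type}
        pred = {e for e in pred if e[0] == entity_type}
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "O", "B-ORG"]
pred = ["B-PER", "O", "O", "O", "B-ORG"]
print(entity_prf(gold, pred))
print(entity_prf(gold, pred, entity_type="ORG"))
```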

Sentiment

  • Accuracy
  • Macro/Micro F1
  • Confusion matrix
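
The confusion matrix and the macro/micro F1 scores can be derived from integer labels as sketched below. It assumes single-label classification with labels in `0..num_classes-1` (as in the 0/1/2 sentiment scheme above); in that setting micro-F1 equals accuracy.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts examples with true label t predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def macro_micro_f1(matrix):
    k = len(matrix)
    per_class_f1, tp_total = [], 0
    for c in range(k):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(k)) - tp
        fn = sum(matrix[c]) - tp
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        tp_total += tp
    total = sum(sum(row) for row in matrix)
    micro = tp_total / total if total else 0.0  # equals accuracy for single-label data
    return sum(per_class_f1) / k, micro


y_true = [0, 1, 2, 2]
y_pred = [0, 2, 2, 2]
matrix = confusion_matrix(y_true, y_pred, num_classes=3)
print(matrix)
print(macro_micro_f1(matrix))
```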

Classification

  • Accuracy
  • Per-class F1
  • Macro/Weighted F1

Best Practices (General)

✅ Clear annotation guidelines
✅ Inter-annotator agreement
✅ Quality control checks
✅ Regular dataset updates
✅ Version control
✅ Documentation
✅ Ethical considerations
✅ Bias analysis
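
Inter-annotator agreement from the checklist above is commonly quantified with Cohen's kappa for two annotators. The sketch below is one possible measure, not necessarily the one used for these datasets, and the labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_a = ["pos", "pos", "neg", "neg"]
annotator_b = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(annotator_a, annotator_b))
```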