MEHMET TUĞRUL KAYA
Initial commit: Advanced Dataset Tutorial
2e6a47d

Advanced Techniques Examples

This folder contains examples of advanced dataset processing techniques.

Techniques

📦 Custom Data Collators

1. Simple Collator

class SimpleCollator:
    def __call__(self, batch):
        texts = [ex['text'] for ex in batch]
        labels = [ex['label'] for ex in batch]
        return {'texts': texts, 'labels': labels}

2. Padding Collator

class PaddingCollator:
    def __call__(self, batch):
        # Dynamic padding
        max_len = max(len(ex['text']) for ex in batch)
        # Pad to max_len...
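
The padding step is elided above. One way it might be completed is sketched below, assuming each example's 'text' is a list of token ids and that 0 is the padding id. Both are assumptions, since the source does not say how the text is encoded, and the class name `ListPaddingCollator` is hypothetical.

```python
class ListPaddingCollator:
    """Hypothetical completion of the padding step for lists of token ids."""

    def __init__(self, pad_id=0):
        self.pad_id = pad_id  # assumed padding value

    def __call__(self, batch):
        # Dynamic padding: pad only up to the longest example in this batch
        max_len = max(len(ex['text']) for ex in batch)
        padded = [
            list(ex['text']) + [self.pad_id] * (max_len - len(ex['text']))
            for ex in batch
        ]
        labels = [ex['label'] for ex in batch]
        return {'texts': padded, 'labels': labels}


batch = [{'text': [5, 6, 7], 'label': 1}, {'text': [8], 'label': 0}]
out = ListPaddingCollator()(batch)
print(out['texts'])
```

Because the target length is computed per batch, short batches are not padded to a global maximum.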

3. Advanced Collator

class AdvancedCollator:
    def __call__(self, batch):
        # Padding + normalization + stats
        return {
            'input_ids': padded,
            'attention_mask': masks,
            'labels': labels,
            'batch_stats': {...}
        }
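
The names `padded`, `masks`, and `labels` are not defined in the outline above. A minimal sketch of how they could be produced follows, assuming examples carry an 'input_ids' list and a 'label', and that 0 is the padding id. The function name `advanced_collate` and the particular statistics are illustrative choices; the normalization step is left out because the source does not say what is normalized.

```python
def advanced_collate(batch, pad_id=0):
    """Hypothetical collate function: pads, builds masks, reports batch stats."""
    lengths = [len(ex['input_ids']) for ex in batch]
    max_len = max(lengths)
    padded, masks = [], []
    for ex, n in zip(batch, lengths):
        pad = max_len - n
        padded.append(list(ex['input_ids']) + [pad_id] * pad)
        masks.append([1] * n + [0] * pad)  # 1 = real token, 0 = padding
    total = max_len * len(batch)
    return {
        'input_ids': padded,
        'attention_mask': masks,
        'labels': [ex['label'] for ex in batch],
        'batch_stats': {
            'max_len': max_len,
            'mean_len': sum(lengths) / len(lengths),
            'padding_ratio': (total - sum(lengths)) / total,
        },
    }


batch = [
    {'input_ids': [1, 2, 3, 4], 'label': 0},
    {'input_ids': [5, 6], 'label': 1},
]
out = advanced_collate(batch)
print(out['batch_stats'])
```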

🔧 Feature Engineering

  • Extraction of 10+ features
  • Normalization (min-max, z-score)
  • Interaction features
  • Domain-specific features
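
The two normalization schemes and an interaction feature listed above can be sketched with the standard library alone. The helper names `min_max` and `z_score` and the sample columns are hypothetical, not taken from the source.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / span for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


lengths = [2, 4, 6]
scores = [0.5, 1.0, 2.0]

scaled = min_max(lengths)
standardized = z_score(lengths)
# Interaction feature: element-wise product of two base features
interaction = [a * b for a, b in zip(lengths, scores)]
print(scaled, interaction)
```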

🎲 Data Augmentation

  • Word deletion (random)
  • Word swap
  • Synonym replacement
  • Class balancing (3x data increase)
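
As a rough illustration of the first two techniques, the sketch below implements random word deletion and word swap on whitespace-separated text. The function names, the default deletion probability, and the `rng` parameter are assumptions made for the example. Synonym replacement is omitted because it needs a synonym dictionary that the source does not specify.

```python
import random


def random_deletion(text, p=0.2, rng=random):
    """Drop each word with probability p, always keeping at least one word."""
    words = text.split()
    if not words:
        return text
    kept = [w for w in words if rng.random() > p]
    return ' '.join(kept) if kept else rng.choice(words)


def random_swap(text, rng=random):
    """Swap two randomly chosen word positions."""
    words = text.split()
    if len(words) < 2:
        return text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return ' '.join(words)


rng = random.Random(0)  # seeded generator so runs are repeatable
print(random_deletion("the quick brown fox jumps", p=0.3, rng=rng))
print(random_swap("the quick brown fox jumps", rng=rng))
```

Passing a seeded `random.Random` instance keeps augmented datasets reproducible across runs.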

📊 Advanced Sampling

Stratified Sampling

# Balanced train/test splits
train, test = stratified_split(
    dataset, 
    stratify_column='label',
    train_ratio=0.8
)
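
The source does not show how `stratified_split` is implemented. One possible version is sketched here, assuming the dataset is a list of dicts; the `seed` parameter is an addition for reproducibility.

```python
import random
from collections import defaultdict


def stratified_split(dataset, stratify_column, train_ratio=0.8, seed=42):
    """Split so each class keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in dataset:
        groups[ex[stratify_column]].append(ex)

    train, test = [], []
    for examples in groups.values():
        rng.shuffle(examples)  # shuffle within the class before cutting
        cut = round(len(examples) * train_ratio)
        train.extend(examples[:cut])
        test.extend(examples[cut:])
    return train, test


dataset = [{'text': f't{i}', 'label': i % 2} for i in range(20)]
train, test = stratified_split(dataset, stratify_column='label', train_ratio=0.8)
print(len(train), len(test))
```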

Diversity Sampling

# Maximum diversity
diverse = max_diversity_sampling(
    dataset,
    n_samples=100,
    feature_columns=['length', 'score']
)
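
The selection strategy behind `max_diversity_sampling` is not given in the source. A common choice is greedy farthest-point selection, sketched below under that assumption: start from the first example, then repeatedly add the example farthest from everything already chosen. In practice the feature columns would usually be normalized first so that no single column dominates the distance.

```python
import math


def max_diversity_sampling(dataset, n_samples, feature_columns):
    """Greedy farthest-point selection (one possible diversity heuristic)."""
    points = [tuple(float(ex[c]) for c in feature_columns) for ex in dataset]
    n_samples = min(n_samples, len(points))
    if n_samples == 0:
        return []

    chosen = [0]  # arbitrary starting point: the first example
    # Distance from every point to its nearest already-chosen point
    nearest = [math.dist(p, points[0]) for p in points]
    while len(chosen) < n_samples:
        candidates = (i for i in range(len(points)) if i not in chosen)
        idx = max(candidates, key=nearest.__getitem__)
        chosen.append(idx)
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return [dataset[i] for i in chosen]


dataset = [
    {'length': 1, 'score': 0.0},
    {'length': 2, 'score': 0.0},
    {'length': 10, 'score': 0.0},
    {'length': 5, 'score': 0.0},
]
diverse = max_diversity_sampling(dataset, n_samples=3,
                                 feature_columns=['length', 'score'])
print([ex['length'] for ex in diverse])
```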

Active Learning

# Uncertainty-based
uncertain = uncertainty_sampling(
    dataset,
    uncertainty_scores,
    n_samples=100
)
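
A minimal sketch of what `uncertainty_sampling` might do: rank examples by their uncertainty score and keep the top `n_samples`. Using prediction entropy as the score is an assumption; the source does not say how `uncertainty_scores` are computed.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uncertainty_sampling(dataset, uncertainty_scores, n_samples):
    """Return the n_samples examples with the highest uncertainty."""
    ranked = sorted(range(len(dataset)),
                    key=lambda i: uncertainty_scores[i], reverse=True)
    return [dataset[i] for i in ranked[:n_samples]]


dataset = ['a', 'b', 'c']
predictions = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]  # made-up model outputs
uncertainty_scores = [entropy(p) for p in predictions]
uncertain = uncertainty_sampling(dataset, uncertainty_scores, n_samples=2)
print(uncertain)
```

The selected examples are the ones a labeling budget would be spent on first.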

📦 Dynamic Batching

Length-Based

# Group examples of similar length
batches = length_based_batching(
    dataset,
    length_column='length'
)
# Result: 40% reduction in padding
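
One plausible implementation of `length_based_batching` sorts by length and cuts the sorted list into fixed-size chunks. The `batch_size` parameter and the `padding_tokens` helper are hypothetical additions for the example. The toy data below only illustrates the mechanism and is not meant to reproduce the 40% figure.

```python
def length_based_batching(dataset, length_column, batch_size=2):
    """Sort by length, then cut into consecutive batches."""
    ordered = sorted(dataset, key=lambda ex: ex[length_column])
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]


def padding_tokens(batches, length_column):
    """Count pad positions needed to bring each batch to its longest item."""
    total = 0
    for batch in batches:
        longest = max(ex[length_column] for ex in batch)
        total += sum(longest - ex[length_column] for ex in batch)
    return total


dataset = [{'length': n} for n in (10, 2, 9, 3)]
naive = [dataset[0:2], dataset[2:4]]  # batches in original order
batches = length_based_batching(dataset, length_column='length')
print(padding_tokens(naive, 'length'), padding_tokens(batches, 'length'))
```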

Bucket Batching

# Split into buckets
batches = bucket_batching(
    dataset,
    n_buckets=5
)
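
How `bucket_batching` assigns examples to buckets is not shown in the source. The sketch below assumes equal-width buckets over the observed length range and a `length_column` parameter defaulting to 'length'; both are assumptions.

```python
def bucket_batching(dataset, n_buckets=5, length_column='length'):
    """Group examples into equal-width length buckets (assumed scheme)."""
    lengths = [ex[length_column] for ex in dataset]
    if not lengths:
        return []
    lo, hi = min(lengths), max(lengths)
    width = (hi - lo) / n_buckets or 1  # fall back to 1 if all lengths match
    buckets = [[] for _ in range(n_buckets)]
    for ex in dataset:
        # Clamp so the longest example falls in the last bucket
        idx = min(int((ex[length_column] - lo) / width), n_buckets - 1)
        buckets[idx].append(ex)
    return [b for b in buckets if b]  # drop empty buckets


dataset = [{'length': n} for n in range(1, 11)]
batches = bucket_batching(dataset, n_buckets=5)
print([[ex['length'] for ex in b] for b in batches])
```

Quantile-based bucket edges are an alternative when lengths are heavily skewed, since equal-width buckets can end up very uneven in size.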

Pipeline Pattern

pipeline = DataPipeline("My Pipeline")
pipeline.add_step("clean", clean_fn)
pipeline.add_step("features", extract_features)
pipeline.add_step("normalize", normalize_fn)

result = pipeline.run(dataset)
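
The `DataPipeline` class itself is not shown. A minimal sketch consistent with the calls above might look like this, assuming each step is a function that takes a dataset and returns a new one. The step bodies here are illustrative stand-ins.

```python
class DataPipeline:
    """Hypothetical pipeline: runs named steps in the order they were added."""

    def __init__(self, name):
        self.name = name
        self.steps = []

    def add_step(self, name, fn):
        self.steps.append((name, fn))
        return self  # allows chained add_step calls

    def run(self, dataset):
        for _name, fn in self.steps:
            dataset = fn(dataset)  # each step receives the previous output
        return dataset


def clean_fn(dataset):
    # Illustrative cleaning: trim whitespace and lowercase
    return [{**ex, 'text': ex['text'].strip().lower()} for ex in dataset]


def extract_features(dataset):
    # Illustrative feature: word count
    return [{**ex, 'length': len(ex['text'].split())} for ex in dataset]


pipeline = DataPipeline("My Pipeline")
pipeline.add_step("clean", clean_fn)
pipeline.add_step("features", extract_features)
result = pipeline.run([{'text': '  Hello World '}])
print(result)
```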

Performance

Technique            Gain   Use case
Batch Processing     2.3x   All operations
Dynamic Batching     40%    Padding reduction
Data Augmentation    3x     Data increase
Stratified Sampling  -      Balanced splits

Best Practices

✅ Customize the collator for your model
✅ Use the pipeline pattern
✅ Balance classes with augmentation
✅ Use stratified sampling to help the model generalize
✅ Optimize with dynamic batching