# Advanced Techniques Examples

This folder contains examples of advanced dataset processing techniques.

## Teknikler

### 📦 Custom Data Collators

#### 1. Simple Collator
```python
class SimpleCollator:
    def __call__(self, batch):
        # Gather per-example fields into batch-level lists
        texts = [ex['text'] for ex in batch]
        labels = [ex['label'] for ex in batch]
        return {'texts': texts, 'labels': labels}
```

#### 2. Padding Collator
```python
class PaddingCollator:
    def __call__(self, batch):
        # Dynamic padding: pad each sequence to the longest in the batch
        # (assumes ex['text'] is already a list of token ids; 0 = pad)
        max_len = max(len(ex['text']) for ex in batch)
        padded = [ex['text'] + [0] * (max_len - len(ex['text']))
                  for ex in batch]
        return {'texts': padded, 'labels': [ex['label'] for ex in batch]}
```

#### 3. Advanced Collator
```python
class AdvancedCollator:
    def __call__(self, batch):
        # Padding + attention masks + batch statistics
        lengths = [len(ex['input_ids']) for ex in batch]
        max_len = max(lengths)
        padded = [ex['input_ids'] + [0] * (max_len - n)
                  for ex, n in zip(batch, lengths)]
        masks = [[1] * n + [0] * (max_len - n) for n in lengths]
        labels = [ex['label'] for ex in batch]
        return {
            'input_ids': padded,
            'attention_mask': masks,
            'labels': labels,
            # Example stats; track whatever your training loop needs
            'batch_stats': {'max_len': max_len,
                            'mean_len': sum(lengths) / len(lengths)},
        }
```

### 🔧 Feature Engineering
- 10+ extracted features
- Normalization (min-max, z-score)
- Interaction features
- Domain-specific features
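
The extractors listed above are not shown in code; a minimal sketch, assuming plain-dict examples with a `text` field (function names and feature choices are illustrative):

```python
def extract_features(example):
    # A few simple text-derived features (illustrative subset)
    text = example['text']
    words = text.split()
    return {
        'length': len(text),
        'n_words': len(words),
        'avg_word_len': sum(len(w) for w in words) / max(len(words), 1),
        'n_digits': sum(c.isdigit() for c in text),
    }

def z_score(values):
    # Z-score normalization: (x - mean) / std
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std if std else 0.0 for v in values]
```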

### 🎲 Data Augmentation
- Word deletion (random)
- Word swap
- Synonym replacement
- Class balancing (3x data increase)
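
Two of these augmentations (word deletion and word swap) can be sketched as follows, using a seeded `random.Random` for reproducibility; function names are illustrative:

```python
import random

def random_deletion(words, p=0.1, rng=None):
    # Drop each word with probability p; always keep at least one word
    rng = rng or random.Random(0)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

def random_swap(words, n_swaps=1, rng=None):
    # Swap n_swaps random pairs of word positions
    rng = rng or random.Random(0)
    words = list(words)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(words)), rng.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return words
```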

### 📊 Advanced Sampling

#### Stratified Sampling
```python
# Balanced train/test splits
train, test = stratified_split(
    dataset, 
    stratify_column='label',
    train_ratio=0.8
)
```
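
`stratified_split` itself is not defined in this snippet; one possible implementation, assuming the dataset is a list of dicts:

```python
import random
from collections import defaultdict

def stratified_split(dataset, stratify_column='label', train_ratio=0.8, seed=0):
    # Group examples by class, then split each group with the same ratio,
    # so train and test keep the original class proportions
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in dataset:
        groups[ex[stratify_column]].append(ex)
    train, test = [], []
    for examples in groups.values():
        rng.shuffle(examples)
        cut = int(len(examples) * train_ratio)
        train.extend(examples[:cut])
        test.extend(examples[cut:])
    return train, test
```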

#### Diversity Sampling
```python
# Maximum diversity
diverse = max_diversity_sampling(
    dataset,
    n_samples=100,
    feature_columns=['length', 'score']
)
```
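
`max_diversity_sampling` is likewise undefined here; a greedy farthest-point sketch over the given numeric feature columns (an assumption about how the helper works):

```python
def max_diversity_sampling(dataset, n_samples, feature_columns):
    # Greedy farthest-point selection: repeatedly pick the example
    # farthest from its nearest already-selected neighbor
    def dist(a, b):
        return sum((a[c] - b[c]) ** 2 for c in feature_columns) ** 0.5
    selected = [dataset[0]]
    remaining = list(dataset[1:])
    while len(selected) < n_samples and remaining:
        best = max(remaining, key=lambda ex: min(dist(ex, s) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```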

#### Active Learning
```python
# Uncertainty-based
uncertain = uncertainty_sampling(
    dataset,
    uncertainty_scores,
    n_samples=100
)
```
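
A minimal sketch of `uncertainty_sampling`, assuming `uncertainty_scores` is a list aligned with the dataset where higher means less certain:

```python
def uncertainty_sampling(dataset, uncertainty_scores, n_samples):
    # Rank examples by uncertainty (descending) and take the top n
    ranked = sorted(range(len(dataset)),
                    key=lambda i: uncertainty_scores[i], reverse=True)
    return [dataset[i] for i in ranked[:n_samples]]
```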

### 📦 Dynamic Batching

#### Length-Based
```python
# Group similar lengths together
batches = length_based_batching(
    dataset,
    length_column='length'
)
# Result: ~40% less padding
```
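
One way `length_based_batching` could work: sort by length, then slice into fixed-size batches so each batch spans a narrow length range (a sketch; the real helper may differ):

```python
def length_based_batching(dataset, length_column='length', batch_size=4):
    # Sorting first means each batch groups similar lengths,
    # which minimizes padding inside the batch
    ordered = sorted(dataset, key=lambda ex: ex[length_column])
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]
```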

#### Bucket Batching
```python
# Split into buckets
batches = bucket_batching(
    dataset,
    n_buckets=5
)
```
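
`bucket_batching` can be sketched the same way, splitting the length-sorted dataset into `n_buckets` contiguous ranges (an assumption about the helper's behavior):

```python
def bucket_batching(dataset, n_buckets=5, length_column='length'):
    # Sort by length, then cut into n_buckets contiguous slices,
    # so each bucket covers a narrow length range
    ordered = sorted(dataset, key=lambda ex: ex[length_column])
    size = -(-len(ordered) // n_buckets)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]
```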

## Pipeline Pattern

```python
pipeline = DataPipeline("My Pipeline")
pipeline.add_step("clean", clean_fn)
pipeline.add_step("features", extract_features)
pipeline.add_step("normalize", normalize_fn)

result = pipeline.run(dataset)
```
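
`DataPipeline` itself is not shown; a minimal sketch matching the usage above, applying each named step to every example in order:

```python
class DataPipeline:
    # Minimal pipeline: named steps applied in registration order
    def __init__(self, name):
        self.name = name
        self.steps = []

    def add_step(self, step_name, fn):
        self.steps.append((step_name, fn))

    def run(self, dataset):
        for step_name, fn in self.steps:
            dataset = [fn(ex) for ex in dataset]
        return dataset
```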

## Performance

| Technique | Gain | Use Case |
|-----------|------|----------|
| Batch Processing | 2.3x | All operations |
| Dynamic Batching | 40% | Less padding |
| Data Augmentation | 3x | More data |
| Stratified Sampling | - | Balanced splits |

## Best Practices

✅ Customize the collator for your model  
✅ Use the pipeline pattern  
✅ Balance classes with augmentation  
✅ Generalize better with stratified sampling  
✅ Optimize throughput with dynamic batching