# Advanced Techniques Examples

This folder contains examples of advanced dataset processing techniques.

## Teknikler

### 📦 Custom Data Collators

#### 1. Simple Collator
```python
class SimpleCollator:
    def __call__(self, batch):
        # Gather per-example fields into batch-level lists
        texts = [ex['text'] for ex in batch]
        labels = [ex['label'] for ex in batch]
        return {'texts': texts, 'labels': labels}
```

#### 2. Padding Collator
```python
class PaddingCollator:
    def __call__(self, batch):
        # Dynamic padding: pad each sequence to the longest in the batch
        # (assumes ex['text'] is already a list of token ids; 0 = pad)
        max_len = max(len(ex['text']) for ex in batch)
        padded = [ex['text'] + [0] * (max_len - len(ex['text']))
                  for ex in batch]
        return {'texts': padded, 'labels': [ex['label'] for ex in batch]}
```

#### 3. Advanced Collator
```python
class AdvancedCollator:
    def __call__(self, batch):
        # Padding + attention masks + batch statistics
        lengths = [len(ex['input_ids']) for ex in batch]
        max_len = max(lengths)
        padded = [ex['input_ids'] + [0] * (max_len - n)
                  for ex, n in zip(batch, lengths)]
        masks = [[1] * n + [0] * (max_len - n) for n in lengths]
        labels = [ex['label'] for ex in batch]
        return {
            'input_ids': padded,
            'attention_mask': masks,
            'labels': labels,
            # Example stats; track whatever your training loop needs
            'batch_stats': {'max_len': max_len,
                            'mean_len': sum(lengths) / len(lengths)},
        }
```

### 🔧 Feature Engineering
- 10+ extracted features
- Normalization (min-max, z-score)
- Interaction features
- Domain-specific features
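
The extractors listed above are not shown in code; a minimal sketch, assuming plain-dict examples with a `text` field (function names and feature choices are illustrative):

```python
def extract_features(example):
    # A few simple text-derived features (illustrative subset)
    text = example['text']
    words = text.split()
    return {
        'length': len(text),
        'n_words': len(words),
        'avg_word_len': sum(len(w) for w in words) / max(len(words), 1),
        'n_digits': sum(c.isdigit() for c in text),
    }

def z_score(values):
    # Z-score normalization: (x - mean) / std
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std if std else 0.0 for v in values]
```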

### 🎲 Data Augmentation
- Word deletion (random)
- Word swap
- Synonym replacement
- Class balancing (3x data increase)
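
Two of these augmentations (word deletion and word swap) can be sketched as follows, using a seeded `random.Random` for reproducibility; function names are illustrative:

```python
import random

def random_deletion(words, p=0.1, rng=None):
    # Drop each word with probability p; always keep at least one word
    rng = rng or random.Random(0)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

def random_swap(words, n_swaps=1, rng=None):
    # Swap n_swaps random pairs of word positions
    rng = rng or random.Random(0)
    words = list(words)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(words)), rng.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return words
```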

### 📊 Advanced Sampling

#### Stratified Sampling
```python
# Balanced train/test splits
train, test = stratified_split(
    dataset, 
    stratify_column='label',
    train_ratio=0.8
)
```
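
`stratified_split` itself is not defined in this snippet; one possible implementation, assuming the dataset is a list of dicts:

```python
import random
from collections import defaultdict

def stratified_split(dataset, stratify_column='label', train_ratio=0.8, seed=0):
    # Group examples by class, then split each group with the same ratio,
    # so train and test keep the original class proportions
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in dataset:
        groups[ex[stratify_column]].append(ex)
    train, test = [], []
    for examples in groups.values():
        rng.shuffle(examples)
        cut = int(len(examples) * train_ratio)
        train.extend(examples[:cut])
        test.extend(examples[cut:])
    return train, test
```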

#### Diversity Sampling
```python
# Maximum diversity
diverse = max_diversity_sampling(
    dataset,
    n_samples=100,
    feature_columns=['length', 'score']
)
```
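
`max_diversity_sampling` is likewise undefined here; a greedy farthest-point sketch over the given numeric feature columns (an assumption about how the helper works):

```python
def max_diversity_sampling(dataset, n_samples, feature_columns):
    # Greedy farthest-point selection: repeatedly pick the example
    # farthest from its nearest already-selected neighbor
    def dist(a, b):
        return sum((a[c] - b[c]) ** 2 for c in feature_columns) ** 0.5
    selected = [dataset[0]]
    remaining = list(dataset[1:])
    while len(selected) < n_samples and remaining:
        best = max(remaining, key=lambda ex: min(dist(ex, s) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```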

#### Active Learning
```python
# Uncertainty-based
uncertain = uncertainty_sampling(
    dataset,
    uncertainty_scores,
    n_samples=100
)
```
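
A minimal sketch of `uncertainty_sampling`, assuming `uncertainty_scores` is a list aligned with the dataset where higher means less certain:

```python
def uncertainty_sampling(dataset, uncertainty_scores, n_samples):
    # Rank examples by uncertainty (descending) and take the top n
    ranked = sorted(range(len(dataset)),
                    key=lambda i: uncertainty_scores[i], reverse=True)
    return [dataset[i] for i in ranked[:n_samples]]
```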

### 📦 Dynamic Batching

#### Length-Based
```python
# Group similar lengths together
batches = length_based_batching(
    dataset,
    length_column='length'
)
# Result: ~40% less padding
```
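
One way `length_based_batching` could work: sort by length, then slice into fixed-size batches so each batch spans a narrow length range (a sketch; the real helper may differ):

```python
def length_based_batching(dataset, length_column='length', batch_size=4):
    # Sorting first means each batch groups similar lengths,
    # which minimizes padding inside the batch
    ordered = sorted(dataset, key=lambda ex: ex[length_column])
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]
```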

#### Bucket Batching
```python
# Split into buckets
batches = bucket_batching(
    dataset,
    n_buckets=5
)
```
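
`bucket_batching` can be sketched the same way, splitting the length-sorted dataset into `n_buckets` contiguous ranges (an assumption about the helper's behavior):

```python
def bucket_batching(dataset, n_buckets=5, length_column='length'):
    # Sort by length, then cut into n_buckets contiguous slices,
    # so each bucket covers a narrow length range
    ordered = sorted(dataset, key=lambda ex: ex[length_column])
    size = -(-len(ordered) // n_buckets)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]
```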

## Pipeline Pattern

```python
pipeline = DataPipeline("My Pipeline")
pipeline.add_step("clean", clean_fn)
pipeline.add_step("features", extract_features)
pipeline.add_step("normalize", normalize_fn)

result = pipeline.run(dataset)
```
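
`DataPipeline` itself is not shown; a minimal sketch matching the usage above, applying each named step to every example in order:

```python
class DataPipeline:
    # Minimal pipeline: named steps applied in registration order
    def __init__(self, name):
        self.name = name
        self.steps = []

    def add_step(self, step_name, fn):
        self.steps.append((step_name, fn))

    def run(self, dataset):
        for step_name, fn in self.steps:
            dataset = [fn(ex) for ex in dataset]
        return dataset
```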

## Performance

| Technique | Gain | Use Case |
|-----------|------|----------|
| Batch Processing | 2.3x | All operations |
| Dynamic Batching | 40% | Less padding |
| Data Augmentation | 3x | More data |
| Stratified Sampling | - | Balanced splits |

## Best Practices

✅ Customize the collator for your model  
✅ Use the pipeline pattern  
✅ Balance classes with augmentation  
✅ Generalize better with stratified sampling  
✅ Optimize throughput with dynamic batching