# Advanced Techniques Examples
This folder contains examples of advanced dataset processing techniques.
## Teknikler
### 📦 Custom Data Collators
#### 1. Simple Collator
```python
class SimpleCollator:
    def __call__(self, batch):
        texts = [ex['text'] for ex in batch]
        labels = [ex['label'] for ex in batch]
        return {'texts': texts, 'labels': labels}
```
#### 2. Padding Collator
```python
class PaddingCollator:
    def __init__(self, pad_value=0):
        self.pad_value = pad_value

    def __call__(self, batch):
        # Dynamic padding: pad every sequence to the longest in this batch
        max_len = max(len(ex['text']) for ex in batch)
        padded = [ex['text'] + [self.pad_value] * (max_len - len(ex['text']))
                  for ex in batch]
        return {'texts': padded, 'labels': [ex['label'] for ex in batch]}
```
#### 3. Advanced Collator
```python
class AdvancedCollator:
    def __call__(self, batch):
        # Padding + normalization + per-batch statistics
        return {
            'input_ids': padded,
            'attention_mask': masks,
            'labels': labels,
            'batch_stats': {...}
        }
```
### 🔧 Feature Engineering
- 10+ extracted features
- Normalization (min-max, z-score)
- Interaction features
- Domain-specific features
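A minimal plain-Python sketch of these ideas; the function and feature names below are illustrative, not this folder's actual API:

```python
def extract_features(example):
    """Derive simple numeric features from a text example."""
    text = example['text']
    words = text.split()
    feats = {
        'length': len(text),
        'n_words': len(words),
        'avg_word_len': sum(len(w) for w in words) / max(len(words), 1),
    }
    # Interaction feature: product of two base features
    feats['len_x_words'] = feats['length'] * feats['n_words']
    return feats

def min_max_normalize(values):
    """Scale a list of numbers into [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1
    return [(v - lo) / span for v in values]

def z_score_normalize(values):
    """Center to mean 0, scale to unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5 or 1
    return [(v - mean) / std for v in values]
```

Normalization is computed per column across the dataset, so run it after all features have been extracted.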
### 🎲 Data Augmentation
- Word deletion (random)
- Word swap
- Synonym replacement
- Class balancing (3x data increase)
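The deletion and swap strategies can be sketched as follows (the parameter names and seeded `rng` are illustrative choices, not this folder's exact API):

```python
import random

def random_word_deletion(text, p=0.2, rng=None):
    """Drop each word with probability p, keeping at least one word."""
    rng = rng or random.Random()
    words = text.split()
    kept = [w for w in words if rng.random() > p]
    return ' '.join(kept) if kept else rng.choice(words)

def random_word_swap(text, n_swaps=1, rng=None):
    """Swap n_swaps random pairs of word positions."""
    rng = rng or random.Random()
    words = text.split()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return ' '.join(words)
```

For class balancing, such transforms are applied repeatedly to minority-class examples until the classes are roughly even.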
### 📊 Advanced Sampling
#### Stratified Sampling
```python
# Balanced train/test splits
train, test = stratified_split(
    dataset,
    stratify_column='label',
    train_ratio=0.8
)
```
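`stratified_split` comes from this folder's own helpers; a pure-Python version of the same idea might look like:

```python
from collections import defaultdict

def stratified_split(dataset, stratify_column, train_ratio=0.8):
    """Split so each class keeps roughly the same train/test proportion."""
    by_label = defaultdict(list)
    for ex in dataset:
        by_label[ex[stratify_column]].append(ex)
    train, test = [], []
    for examples in by_label.values():
        cut = int(len(examples) * train_ratio)
        train.extend(examples[:cut])
        test.extend(examples[cut:])
    return train, test
```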
#### Diversity Sampling
```python
# Maximum diversity
diverse = max_diversity_sampling(
    dataset,
    n_samples=100,
    feature_columns=['length', 'score']
)
```
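One common way to implement this is greedy farthest-point selection; the repo's helper may differ, so treat this as a sketch of the idea:

```python
def max_diversity_sampling(dataset, n_samples, feature_columns):
    """Greedily pick points that are far apart in feature space."""
    def vec(ex):
        return [ex[c] for c in feature_columns]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    selected = [dataset[0]]
    remaining = list(dataset[1:])
    while remaining and len(selected) < n_samples:
        # Pick the point farthest from its nearest already-selected neighbour
        best = max(remaining,
                   key=lambda ex: min(dist(vec(ex), vec(s)) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

This is O(n² · k) in dataset size, which is fine for selecting a few hundred samples but worth optimizing for very large datasets.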
#### Active Learning
```python
# Uncertainty-based selection
uncertain = uncertainty_sampling(
    dataset,
    uncertainty_scores,
    n_samples=100
)
```
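At its simplest, uncertainty sampling just takes the examples the model is least sure about; a minimal sketch of the helper used above:

```python
def uncertainty_sampling(dataset, uncertainty_scores, n_samples):
    """Return the n_samples examples with the highest uncertainty."""
    ranked = sorted(range(len(dataset)),
                    key=lambda i: uncertainty_scores[i],
                    reverse=True)
    return [dataset[i] for i in ranked[:n_samples]]
```

Scores are typically derived from model outputs, e.g. entropy of the predicted class probabilities.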
### 📦 Dynamic Batching
#### Length-Based
```python
# Group similar lengths together
batches = length_based_batching(
    dataset,
    length_column='length'
)
# Result: 40% less padding
```
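The core trick is sort-then-chunk, so sequences in the same batch have similar lengths and little padding is wasted. A sketch (the `batch_size` default is an assumption):

```python
def length_based_batching(dataset, length_column, batch_size=4):
    """Sort by length, then chunk, so each batch holds similar lengths."""
    ordered = sorted(dataset, key=lambda ex: ex[length_column])
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]
```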
#### Bucket Batching
```python
# Split into buckets
batches = bucket_batching(
    dataset,
    n_buckets=5
)
```
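Bucket batching partitions the sorted dataset into a fixed number of equal-size buckets instead of fixed-size batches; a sketch (the `length_column` parameter is an assumption):

```python
def bucket_batching(dataset, n_buckets=5, length_column='length'):
    """Partition length-sorted examples into n_buckets equal buckets."""
    ordered = sorted(dataset, key=lambda ex: ex[length_column])
    size = max(1, -(-len(ordered) // n_buckets))  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]
```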
## Pipeline Pattern
```python
pipeline = DataPipeline("My Pipeline")
pipeline.add_step("clean", clean_fn)
pipeline.add_step("features", extract_features)
pipeline.add_step("normalize", normalize_fn)
result = pipeline.run(dataset)
```
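The `DataPipeline` class itself is not shown in this snippet; a minimal sketch of the interface used above could be:

```python
class DataPipeline:
    """Runs named processing steps over a dataset in order."""
    def __init__(self, name):
        self.name = name
        self.steps = []

    def add_step(self, step_name, fn):
        self.steps.append((step_name, fn))
        return self  # allow chaining

    def run(self, dataset):
        for step_name, fn in self.steps:
            dataset = fn(dataset)
        return dataset
```

Each step takes the whole dataset and returns a transformed one, so steps stay composable and easy to test in isolation.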
## Performance
| Technique | Gain | Use Case |
|-----------|------|----------|
| Batch Processing | 2.3x | All operations |
| Dynamic Batching | 40% | Padding reduction |
| Data Augmentation | 3x | Data increase |
| Stratified Sampling | - | Balanced splits |
## Best Practices
✅ Customize the collator for your model
✅ Use the pipeline pattern
✅ Balance classes with augmentation
✅ Generalize better with stratified sampling
✅ Optimize throughput with dynamic batching