MEHMET TUĞRUL KAYA
# Advanced Techniques Examples
This folder contains examples of advanced dataset processing techniques.
## Techniques
### 📦 Custom Data Collators
#### 1. Simple Collator
```python
class SimpleCollator:
    def __call__(self, batch):
        texts = [ex['text'] for ex in batch]
        labels = [ex['label'] for ex in batch]
        return {'texts': texts, 'labels': labels}
```
#### 2. Padding Collator
```python
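class PaddingCollator:
    pad_value = 0  # assumed padding value; not specified in the original

    def __call__(self, batch):
        # Dynamic padding: pad only up to the longest example in this batch.
        # Assumes ex['text'] is a list of token ids, not a raw string.
        max_len = max(len(ex['text']) for ex in batch)
        # Pad to max_len
        return [list(ex['text']) + [self.pad_value] * (max_len - len(ex['text']))
                for ex in batch]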
```
#### 3. Advanced Collator
```python
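class AdvancedCollator:
    def __call__(self, batch):
        # Padding + normalization + stats
        # Sketch: assumes each example has 'input_ids' (list of ints) and 'label';
        # the normalization step is not shown in the original and is omitted here.
        lengths = [len(ex['input_ids']) for ex in batch]
        max_len = max(lengths)
        padded = [ex['input_ids'] + [0] * (max_len - n) for ex, n in zip(batch, lengths)]
        masks = [[1] * n + [0] * (max_len - n) for n in lengths]
        labels = [ex['label'] for ex in batch]
        return {
            'input_ids': padded,
            'attention_mask': masks,
            'labels': labels,
            'batch_stats': {...}  # statistics elided in the original
        }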
```
### 🔧 Feature Engineering
- Extraction of 10+ features
- Normalization (min-max, z-score)
- Interaction features
- Domain-specific features
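
The items above can be sketched in plain Python. This is a minimal illustration, not the code from this folder: the function names `extract_features`, `min_max`, and `z_score`, and the specific features chosen, are assumptions made for the example.

```python
from statistics import mean, pstdev

def extract_features(text):
    """Compute a few simple text features (illustrative choices only)."""
    words = text.split()
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "avg_word_len": mean(len(w) for w in words) if words else 0.0,
    }

def min_max(values):
    """Scale values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

def z_score(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

texts = ["short text", "a somewhat longer piece of text", "mid length text"]
features = [extract_features(t) for t in texts]

# Normalize one feature column across the dataset.
n_words_scaled = min_max([f["n_words"] for f in features])

# Interaction feature: the product of two base features.
for f in features:
    f["chars_x_words"] = f["n_chars"] * f["n_words"]
```

Min-max scaling is sensitive to outliers, while z-scores assume the spread is meaningful; which one fits depends on the feature.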
### 🎲 Data Augmentation
- Word deletion (random)
- Word swap
- Synonym replacement
- Class balancing (3x increase in data)
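
A rough sketch of random word deletion, word swap, and augmentation-based class balancing follows. The function names, the `rng` parameter, and the list-of-dicts dataset layout are assumptions for illustration. Synonym replacement is left out because it needs an external synonym source.

```python
import random

def random_word_deletion(text, p=0.2, rng=random):
    """Drop each word with probability p, always keeping at least one word."""
    words = text.split()
    if len(words) <= 1:
        return text
    kept = [w for w in words if rng.random() > p]
    return " ".join(kept) if kept else rng.choice(words)

def random_word_swap(text, rng=random):
    """Swap two randomly chosen word positions."""
    words = text.split()
    if len(words) < 2:
        return text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)

def balance_by_augmentation(examples, augment, label_key="label", rng=random):
    """Add augmented copies of minority-class examples until every class
    has as many examples as the largest class."""
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex[label_key], []).append(ex)
    target = max(len(group) for group in by_label.values())
    balanced = list(examples)
    for group in by_label.values():
        for _ in range(target - len(group)):
            source = rng.choice(group)
            balanced.append({**source, "text": augment(source["text"])})
    return balanced
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible.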
### 📊 Advanced Sampling
#### Stratified Sampling
```python
# Balanced train/test splits
train, test = stratified_split(
    dataset,
    stratify_column='label',
    train_ratio=0.8
)
```
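
`stratified_split` is not defined in this README. One way it could be implemented, assuming the dataset is a list of dicts and adding a hypothetical `seed` parameter for reproducibility, is to split each label group separately so both parts keep roughly the same label proportions:

```python
import random
from collections import defaultdict

def stratified_split(dataset, stratify_column, train_ratio=0.8, seed=42):
    """Split each label group separately so label proportions are preserved."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in dataset:
        groups[ex[stratify_column]].append(ex)

    train, test = [], []
    for group in groups.values():
        rng.shuffle(group)
        cut = round(len(group) * train_ratio)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test
```

With very small groups the rounding can leave a class out of the test set, so it is worth checking the label counts after splitting.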
#### Diversity Sampling
```python
# Maximum diversity
diverse = max_diversity_sampling(
    dataset,
    n_samples=100,
    feature_columns=['length', 'score']
)
```
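
The README does not say how `max_diversity_sampling` selects examples. A common approach is greedy farthest-point selection, sketched below under the assumption that the dataset is a list of dicts with numeric feature columns. The choice of the first example as the starting point is arbitrary.

```python
import math

def max_diversity_sampling(dataset, n_samples, feature_columns):
    """Greedy farthest-point selection: repeatedly pick the example that is
    farthest (Euclidean distance) from everything selected so far."""
    if not dataset:
        return []
    n_samples = min(n_samples, len(dataset))
    points = [[float(ex[c]) for c in feature_columns] for ex in dataset]

    selected = [0]  # arbitrary starting point
    # Distance from each point to its nearest selected point.
    min_dist = [math.dist(p, points[0]) for p in points]
    min_dist[0] = -1.0  # mark as already selected

    while len(selected) < n_samples:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        min_dist[nxt] = -1.0
        for i, p in enumerate(points):
            if min_dist[i] >= 0:
                min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return [dataset[i] for i in selected]
```

Features on different scales should be normalized first, otherwise the column with the largest range dominates the distance.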
#### Active Learning
```python
# Uncertainty-based
uncertain = uncertainty_sampling(
    dataset,
    uncertainty_scores,
    n_samples=100
)
```
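
A minimal sketch of what `uncertainty_sampling` might do: rank examples by score and return the most uncertain ones. The `entropy` helper is a hypothetical addition showing one common way to derive the scores from predicted class probabilities.

```python
import math

def entropy(probs):
    """Shannon entropy of a probability vector; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_sampling(dataset, uncertainty_scores, n_samples):
    """Return the n_samples examples with the highest uncertainty scores."""
    if len(dataset) != len(uncertainty_scores):
        raise ValueError("dataset and uncertainty_scores must have equal length")
    ranked = sorted(
        range(len(dataset)),
        key=lambda i: uncertainty_scores[i],
        reverse=True,
    )
    return [dataset[i] for i in ranked[:n_samples]]

# Example: score three examples by the entropy of their predicted probabilities.
predictions = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
scores = [entropy(p) for p in predictions]
to_label = uncertainty_sampling(["a", "b", "c"], scores, n_samples=2)
```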
### 📦 Dynamic Batching
#### Length-Based
```python
# Group examples of similar length
batches = length_based_batching(
    dataset,
    length_column='length'
)
# Result: 40% less padding
```
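
One possible implementation of `length_based_batching` sorts by length and slices into fixed-size batches. The `batch_size` parameter and the `padding_tokens` helper are assumptions added here so the padding saving can be measured on your own data. The actual saving depends on the length distribution.

```python
def length_based_batching(dataset, length_column, batch_size=32):
    """Sort by length, then cut into consecutive batches of similar length."""
    ordered = sorted(dataset, key=lambda ex: ex[length_column])
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def padding_tokens(batches, length_column):
    """Count padding positions needed when each batch is padded to its longest example."""
    total = 0
    for batch in batches:
        longest = max(ex[length_column] for ex in batch)
        total += sum(longest - ex[length_column] for ex in batch)
    return total

data = [{"length": n} for n in [1, 10, 2, 9]]
naive = [data[0:2], data[2:4]]  # batches in the original order
grouped = length_based_batching(data, "length", batch_size=2)
saved = padding_tokens(naive, "length") - padding_tokens(grouped, "length")
```

Fully sorted batches remove randomness from batch composition, so shuffling the order of the batches between epochs is usually worthwhile.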
#### Bucket Batching
```python
# Split into buckets
batches = bucket_batching(
    dataset,
    n_buckets=5
)
```
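
A sketch of how `bucket_batching` could work, assuming equal-width length buckets. The `length_column`, `batch_size`, and `seed` parameters are hypothetical additions. Unlike full sorting, examples are shuffled inside each bucket, which keeps some randomness while still limiting padding.

```python
import random

def bucket_batching(dataset, n_buckets=5, length_column="length",
                    batch_size=32, seed=0):
    """Assign examples to equal-width length buckets, then batch within each bucket."""
    if not dataset:
        return []
    rng = random.Random(seed)
    lengths = [ex[length_column] for ex in dataset]
    lo, hi = min(lengths), max(lengths)
    width = (hi - lo) / n_buckets or 1  # avoid zero width when all lengths match

    buckets = [[] for _ in range(n_buckets)]
    for ex in dataset:
        idx = min(int((ex[length_column] - lo) / width), n_buckets - 1)
        buckets[idx].append(ex)

    batches = []
    for bucket in buckets:
        rng.shuffle(bucket)  # randomize order within a bucket
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    rng.shuffle(batches)  # mix short and long batches across the epoch
    return batches
```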
## Pipeline Pattern
```python
pipeline = DataPipeline("My Pipeline")
pipeline.add_step("clean", clean_fn)
pipeline.add_step("features", extract_features)
pipeline.add_step("normalize", normalize_fn)
result = pipeline.run(dataset)
```
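
The `DataPipeline` class is not shown in this README. A minimal version consistent with the calls above might look like the following. The step functions `clean_fn` and `lowercase_fn` are placeholders invented for this example, and each step is assumed to take a dataset and return a new one.

```python
class DataPipeline:
    """Run named processing steps in order, passing each output to the next step."""

    def __init__(self, name):
        self.name = name
        self.steps = []

    def add_step(self, name, fn):
        self.steps.append((name, fn))
        return self  # allow chained add_step calls

    def run(self, dataset):
        for step_name, fn in self.steps:
            try:
                dataset = fn(dataset)
            except Exception as exc:
                # Report which step failed while keeping the original traceback.
                raise RuntimeError(
                    f"{self.name}: step '{step_name}' failed"
                ) from exc
        return dataset

def clean_fn(dataset):
    return [{**ex, "text": ex["text"].strip()} for ex in dataset]

def lowercase_fn(dataset):
    return [{**ex, "text": ex["text"].lower()} for ex in dataset]

pipeline = DataPipeline("Demo")
pipeline.add_step("clean", clean_fn).add_step("lowercase", lowercase_fn)
result = pipeline.run([{"text": "  Hello World  "}])
```

Naming each step makes failures easy to locate and lets steps be reordered or swapped without touching the rest of the pipeline.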
## Performance
| Technique | Gain | Use Case |
|-----------|------|----------|
| Batch Processing | 2.3x | All operations |
| Dynamic Batching | 40% | Padding reduction |
| Data Augmentation | 3x | Data growth |
| Stratified Sampling | - | Balanced splits |
## Best Practices
✅ Customize the collator for your model
✅ Use the pipeline pattern
✅ Balance classes with augmentation
✅ Use stratified sampling for better generalization
✅ Optimize with dynamic batching