# Advanced Techniques Examples
This folder contains examples of advanced dataset processing techniques.
## Techniques
### 📦 Custom Data Collators
#### 1. Simple Collator
```python
class SimpleCollator:
    def __call__(self, batch):
        # Gather each field into a plain list; no padding or tensor conversion
        texts = [ex['text'] for ex in batch]
        labels = [ex['label'] for ex in batch]
        return {'texts': texts, 'labels': labels}
```
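#### 2. Padding Collator
```python
class PaddingCollator:
    def __call__(self, batch, pad_value=0):
        # Dynamic padding: pad only to the longest example in this batch
        max_len = max(len(ex['text']) for ex in batch)
        # Assumes ex['text'] is a list (e.g. token IDs); pad_value is an assumed filler
        return [ex['text'] + [pad_value] * (max_len - len(ex['text'])) for ex in batch]
```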
#### 3. Advanced Collator
```python
class AdvancedCollator:
    def __call__(self, batch):
        # Padding + normalization + stats
        # Schematic: padded, masks, and labels stand for values computed from batch
        return {
            'input_ids': padded,
            'attention_mask': masks,
            'labels': labels,
            'batch_stats': {...}
        }
```
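
The schematic above leaves out how the returned values are built. A minimal runnable sketch follows, under these assumptions (none of them come from the original): each example carries a list of token IDs under `input_ids`, the padding ID is `0`, and `batch_stats` holds the three fields shown. The normalization step is left out because the source does not say what is normalized.

```python
from statistics import mean


class AdvancedCollator:
    """Pads token-ID lists, builds attention masks, and reports batch statistics."""

    def __init__(self, pad_id=0):
        self.pad_id = pad_id  # assumed padding ID; use your tokenizer's value

    def __call__(self, batch):
        lengths = [len(ex['input_ids']) for ex in batch]
        max_len = max(lengths)

        padded, masks = [], []
        for ex, n in zip(batch, lengths):
            pad = max_len - n
            padded.append(ex['input_ids'] + [self.pad_id] * pad)
            masks.append([1] * n + [0] * pad)  # 1 = real token, 0 = padding

        return {
            'input_ids': padded,
            'attention_mask': masks,
            'labels': [ex['label'] for ex in batch],
            'batch_stats': {  # hypothetical fields, chosen for illustration
                'max_len': max_len,
                'mean_len': mean(lengths),
                'padding_ratio': 1 - sum(lengths) / (max_len * len(batch)),
            },
        }


batch = [
    {'input_ids': [5, 6, 7], 'label': 1},
    {'input_ids': [8], 'label': 0},
]
out = AdvancedCollator()(batch)
```

The sketch returns plain Python lists; converting them to tensors is left to the caller so the example stays framework-independent.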
### 🔧 Feature Engineering
- Extraction of 10+ features
- Normalization (min-max, z-score)
- Interaction features
- Domain-specific features
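
The two normalization schemes and an interaction feature can be sketched with the standard library alone. The function names and the `length`/`score` columns below are illustrative assumptions, not the folder's actual API.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center to mean 0 and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def add_interaction(row, a, b):
    """Return a copy of row with the product of features a and b added."""
    return {**row, f'{a}_x_{b}': row[a] * row[b]}


lengths = [10, 20, 30]
scaled = min_max(lengths)
standardized = z_score(lengths)
row = add_interaction({'length': 10, 'score': 0.5}, 'length', 'score')
```

Min-max keeps values in a fixed range but is sensitive to outliers; z-score is usually the safer default when a column has a long tail.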
### 🎲 Data Augmentation
- Word deletion (random)
- Word swap
- Synonym replacement
- Class balancing (3x more data)
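
The three text-level operations above might look like the following sketch. The function names, the deletion probability, and the caller-supplied synonym table are assumptions made for illustration; class balancing is not covered here.

```python
import random


def random_deletion(words, p, rng):
    """Drop each word with probability p, but never return an empty sentence."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept or [rng.choice(words)]


def random_swap(words, rng):
    """Swap two randomly chosen positions; the set of words is unchanged."""
    out = list(words)
    if len(out) < 2:
        return out
    i, j = rng.sample(range(len(out)), 2)
    out[i], out[j] = out[j], out[i]
    return out


def synonym_replacement(words, synonyms):
    """Replace every word that has an entry in the synonym table."""
    return [synonyms.get(w, w) for w in words]


rng = random.Random(0)  # seeded so runs are reproducible
words = 'the quick brown fox'.split()
deleted = random_deletion(words, 0.3, rng)
swapped = random_swap(words, rng)
replaced = synonym_replacement(words, {'quick': 'fast'})
```

Passing an explicit `random.Random` instance keeps augmentation reproducible without touching the global random state.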
### 📊 Advanced Sampling
#### Stratified Sampling
```python
# Balanced train/test splits
train, test = stratified_split(
    dataset,
    stratify_column='label',
    train_ratio=0.8
)
```
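
The implementation of `stratified_split` is not shown here. One way it could work, assuming the dataset is a list of dicts, is to split each class separately so both halves keep the original class proportions. The `seed` parameter is an added assumption.

```python
import random
from collections import defaultdict


def stratified_split(dataset, stratify_column, train_ratio=0.8, seed=0):
    """Split so each class keeps roughly train_ratio of its rows in train."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in dataset:
        by_class[row[stratify_column]].append(row)

    train, test = [], []
    for rows in by_class.values():
        rng.shuffle(rows)  # shuffle within the class before cutting
        cut = round(len(rows) * train_ratio)
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test


dataset = [{'text': f't{i}', 'label': i % 2} for i in range(20)]
train, test = stratified_split(dataset, stratify_column='label', train_ratio=0.8)
```

With 10 rows per class and `train_ratio=0.8`, each class contributes 8 rows to `train` and 2 to `test`. Very small classes can end up entirely in one split because of rounding.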
#### Diversity Sampling
```python
# Maximum diversity
diverse = max_diversity_sampling(
    dataset,
    n_samples=100,
    feature_columns=['length', 'score']
)
```
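
How `max_diversity_sampling` picks rows is not specified. A common heuristic for this goal is greedy farthest-point selection, sketched below as a stand-in: start from one row, then repeatedly add the row whose nearest already-selected row is farthest away. Starting from the first row and using Euclidean distance are both assumptions.

```python
import math


def max_diversity_sampling(dataset, n_samples, feature_columns):
    """Greedy farthest-point selection over the given numeric feature columns."""
    if not dataset or n_samples <= 0:
        return []
    points = [[row[c] for c in feature_columns] for row in dataset]
    selected = [0]  # seed with the first row (an arbitrary choice)
    remaining = list(range(1, len(points)))
    # min_dist[i]: distance from row i to its nearest already-selected row
    min_dist = [math.dist(p, points[0]) for p in points]
    while remaining and len(selected) < n_samples:
        nxt = max(remaining, key=lambda i: min_dist[i])
        selected.append(nxt)
        remaining.remove(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return [dataset[i] for i in selected]


dataset = [
    {'length': 1, 'score': 0.0},
    {'length': 2, 'score': 0.0},
    {'length': 10, 'score': 0.0},
    {'length': 5, 'score': 0.0},
]
diverse = max_diversity_sampling(dataset, n_samples=3, feature_columns=['length', 'score'])
```

Because the distance is Euclidean, features on very different scales should be normalized first, otherwise the largest-scale column dominates the selection.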
#### Active Learning
```python
# Uncertainty-based
uncertain = uncertainty_sampling(
    dataset,
    uncertainty_scores,
    n_samples=100
)
```
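
The call above takes precomputed `uncertainty_scores`. The sketch below assumes the scores are the entropy of each example's predicted class probabilities and that the function returns the highest-scoring rows; both are assumptions, since the original shows only the call.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uncertainty_sampling(dataset, uncertainty_scores, n_samples):
    """Return the n_samples rows with the highest uncertainty scores."""
    if len(dataset) != len(uncertainty_scores):
        raise ValueError('dataset and uncertainty_scores must have equal length')
    order = sorted(range(len(dataset)), key=lambda i: uncertainty_scores[i], reverse=True)
    return [dataset[i] for i in order[:n_samples]]


dataset = ['a', 'b', 'c']
predictions = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]  # hypothetical model outputs
uncertainty_scores = [entropy(p) for p in predictions]
uncertain = uncertainty_sampling(dataset, uncertainty_scores, n_samples=2)
```

The example selects `'b'` first (a 50/50 prediction is the least certain), then `'c'`.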
### 📦 Dynamic Batching
#### Length-Based
```python
# Group examples of similar length
batches = length_based_batching(
    dataset,
    length_column='length'
)
# Result: 40% less padding
```
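
To show why grouping by length cuts padding, here is a sketch that sorts rows by length before batching and counts pad positions for both orderings. The `batch_size` parameter and the `padding_tokens` helper are assumptions added for the illustration, and the toy numbers are not the measurements behind the 40% figure.

```python
def length_based_batching(dataset, length_column='length', batch_size=3):
    """Sort rows by length, then cut the sorted list into consecutive batches."""
    ordered = sorted(dataset, key=lambda row: row[length_column])
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def padding_tokens(batches, length_column='length'):
    """Count pad positions when every batch is padded to its own longest row."""
    total = 0
    for batch in batches:
        longest = max(row[length_column] for row in batch)
        total += sum(longest - row[length_column] for row in batch)
    return total


dataset = [{'length': n} for n in (3, 50, 4, 48, 5, 52)]
naive = [dataset[i:i + 3] for i in range(0, len(dataset), 3)]  # original order
grouped = length_based_batching(dataset, batch_size=3)

naive_padding = padding_tokens(naive)
grouped_padding = padding_tokens(grouped)
```

On this toy data the original order needs 144 pad positions and the length-grouped batches need 9. During training, shuffling the order of the batches (not their contents) keeps the padding benefit while avoiding a short-to-long curriculum.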
#### Bucket Batching
```python
# Split into buckets
batches = bucket_batching(
    dataset,
    n_buckets=5
)
```
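
The bucketing rule is not stated. A plausible reading, sketched below, is to divide the observed length range into `n_buckets` equal-width intervals and group rows by interval. The `length_column` parameter and the choice to drop empty buckets are assumptions.

```python
def bucket_batching(dataset, n_buckets=5, length_column='length'):
    """Assign rows to equal-width length buckets; empty buckets are dropped."""
    if not dataset:
        return []
    lengths = [row[length_column] for row in dataset]
    lo, hi = min(lengths), max(lengths)
    width = (hi - lo) / n_buckets or 1  # fall back to 1 when all lengths match
    buckets = [[] for _ in range(n_buckets)]
    for row, n in zip(dataset, lengths):
        # Clamp so the maximum length lands in the last bucket
        idx = min(int((n - lo) / width), n_buckets - 1)
        buckets[idx].append(row)
    return [b for b in buckets if b]


dataset = [{'length': n} for n in (1, 2, 11, 12, 21, 50)]
batches = bucket_batching(dataset, n_buckets=5)
```

Equal-width buckets are simple but can be unbalanced when lengths are skewed; quantile-based boundaries are an alternative if bucket sizes need to be similar.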
## Pipeline Pattern
```python
pipeline = DataPipeline("My Pipeline")
pipeline.add_step("clean", clean_fn)
pipeline.add_step("features", extract_features)
pipeline.add_step("normalize", normalize_fn)
result = pipeline.run(dataset)
```
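
The `DataPipeline` class itself is not shown. A minimal sketch that is consistent with the calls above: steps are stored in order and each step receives the previous step's output. The error wrapping and the two step functions are illustrative assumptions.

```python
class DataPipeline:
    """Runs named steps in insertion order, feeding each output to the next step."""

    def __init__(self, name):
        self.name = name
        self.steps = []  # list of (step_name, function) pairs

    def add_step(self, step_name, fn):
        self.steps.append((step_name, fn))
        return self  # returning self allows chained add_step calls

    def run(self, data):
        for step_name, fn in self.steps:
            try:
                data = fn(data)
            except Exception as exc:
                # Name the failing step so errors are easy to locate
                raise RuntimeError(f'{self.name}: step {step_name!r} failed') from exc
        return data


def clean_fn(rows):
    """Hypothetical step: trim, lowercase, and drop empty strings."""
    return [r.strip().lower() for r in rows if r.strip()]


def extract_features(rows):
    """Hypothetical step: attach a length feature to each text."""
    return [{'text': r, 'length': len(r)} for r in rows]


pipeline = DataPipeline('My Pipeline')
pipeline.add_step('clean', clean_fn)
pipeline.add_step('features', extract_features)
result = pipeline.run(['  Hello ', '', 'World!'])
```

Keeping each step a plain function of one argument makes steps easy to test in isolation and to reorder.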
## Performance
| Technique | Gain | Use Case |
|-----------|------|----------|
| Batch Processing | 2.3x | All operations |
| Dynamic Batching | 40% | Padding reduction |
| Data Augmentation | 3x | Data volume increase |
| Stratified Sampling | - | Balanced splits |
## Best Practices
- ✅ Customize the collator for your model
- ✅ Use the pipeline pattern
- ✅ Balance classes with augmentation
- ✅ Use stratified sampling for better generalization
- ✅ Optimize with dynamic batching