# İleri Teknikler Örnekleri Bu klasör, advanced dataset processing teknikleri içerir. ## Teknikler ### 📦 Custom Data Collators #### 1. Simple Collator ```python class SimpleCollator: def __call__(self, batch): texts = [ex['text'] for ex in batch] labels = [ex['label'] for ex in batch] return {'texts': texts, 'labels': labels} ``` #### 2. Padding Collator ```python class PaddingCollator: def __call__(self, batch): # Dynamic padding max_len = max(len(ex['text']) for ex in batch) # Pad to max_len... ``` #### 3. Advanced Collator ```python class AdvancedCollator: def __call__(self, batch): # Padding + normalization + stats return { 'input_ids': padded, 'attention_mask': masks, 'labels': labels, 'batch_stats': {...} } ``` ### 🔧 Feature Engineering - 10+ feature extraction - Normalization (min-max, z-score) - Interaction features - Domain-specific features ### 🎲 Data Augmentation - Word deletion (random) - Word swap - Synonym replacement - Class balancing (3x veri artışı) ### 📊 Advanced Sampling #### Stratified Sampling ```python # Balanced train/test splits train, test = stratified_split( dataset, stratify_column='label', train_ratio=0.8 ) ``` #### Diversity Sampling ```python # Maximum diversity diverse = max_diversity_sampling( dataset, n_samples=100, feature_columns=['length', 'score'] ) ``` #### Active Learning ```python # Uncertainty-based uncertain = uncertainty_sampling( dataset, uncertainty_scores, n_samples=100 ) ``` ### 📦 Dynamic Batching #### Length-Based ```python # Benzer uzunlukları grupla batches = length_based_batching( dataset, length_column='length' ) # Result: 40% padding azalması ``` #### Bucket Batching ```python # Bucket'lara ayır batches = bucket_batching( dataset, n_buckets=5 ) ``` ## Pipeline Pattern ```python pipeline = DataPipeline("My Pipeline") pipeline.add_step("clean", clean_fn) pipeline.add_step("features", extract_features) pipeline.add_step("normalize", normalize_fn) result = pipeline.run(dataset) ``` ## Performans | Teknik | Artış | Use Case | |--------|-------|----------| | Batch Processing | 2.3x | Tüm işlemler | | Dynamic Batching | 40% | Padding azalması | | Data Augmentation | 3x | Veri artışı | | Stratified Sampling | - | Balanced splits | ## Best Practices ✅ Collator'ı modele göre özelleştir ✅ Pipeline pattern kullan ✅ Augmentation ile balance et ✅ Stratified sampling ile generalize et ✅ Dynamic batching ile optimize et