Advanced Technique Examples
This folder contains advanced dataset processing techniques.
Techniques
📦 Custom Data Collators
1. Simple Collator
class SimpleCollator:
    def __call__(self, batch):
        texts = [ex['text'] for ex in batch]
        labels = [ex['label'] for ex in batch]
        return {'texts': texts, 'labels': labels}
2. Padding Collator
class PaddingCollator:
    def __call__(self, batch):
        # Dynamic padding: the target length is the longest example in this batch
        max_len = max(len(ex['text']) for ex in batch)
        # Pad to max_len...
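The padding step elided above could look roughly like the following. This is a minimal sketch, not the repository's implementation: it assumes each `ex['text']` is already a list of token IDs, and the `pad_id` parameter and the `input_ids` output key are illustrative choices.

```python
class PaddingCollator:
    """Pads token-ID lists to the longest sequence in each batch."""

    def __init__(self, pad_id=0):
        self.pad_id = pad_id  # assumed padding token ID (hypothetical parameter)

    def __call__(self, batch):
        # Dynamic padding: the target length is recomputed for every batch.
        max_len = max(len(ex["text"]) for ex in batch)
        padded = [
            list(ex["text"]) + [self.pad_id] * (max_len - len(ex["text"]))
            for ex in batch
        ]
        return {"input_ids": padded, "labels": [ex["label"] for ex in batch]}


batch = [
    {"text": [5, 6, 7], "label": 1},
    {"text": [8], "label": 0},
]
out = PaddingCollator()(batch)
print(out["input_ids"])  # [[5, 6, 7], [8, 0, 0]]
```

Because the maximum is taken per batch rather than over the whole dataset, short batches carry no more padding than their own longest example requires.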
3. Advanced Collator
class AdvancedCollator:
    def __call__(self, batch):
        # Padding + normalization + stats
        return {
            'input_ids': padded,
            'attention_mask': masks,
            'labels': labels,
            'batch_stats': {...}
        }
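One way to fill in the outline above is sketched below. It assumes the examples are already tokenized under an `input_ids` key, uses a hypothetical `pad_id` parameter, and picks three illustrative statistics for `batch_stats`; the normalization step mentioned in the outline is left out because the source does not say what is normalized.

```python
class AdvancedCollator:
    """Pads a batch, builds attention masks, and reports padding statistics."""

    def __init__(self, pad_id=0):
        self.pad_id = pad_id  # assumed padding token ID (hypothetical parameter)

    def __call__(self, batch):
        lengths = [len(ex["input_ids"]) for ex in batch]
        max_len = max(lengths)
        padded, masks = [], []
        for ex, n in zip(batch, lengths):
            pad = max_len - n
            padded.append(list(ex["input_ids"]) + [self.pad_id] * pad)
            # 1 marks a real token, 0 marks padding.
            masks.append([1] * n + [0] * pad)
        return {
            "input_ids": padded,
            "attention_mask": masks,
            "labels": [ex["label"] for ex in batch],
            "batch_stats": {
                "max_len": max_len,
                "mean_len": sum(lengths) / len(lengths),
                # Fraction of positions in the padded batch that are padding.
                "padding_ratio": 1 - sum(lengths) / (max_len * len(batch)),
            },
        }


batch = [
    {"input_ids": [1, 2, 3, 4], "label": 0},
    {"input_ids": [5, 6], "label": 1},
]
out = AdvancedCollator()(batch)
print(out["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Tracking `padding_ratio` per batch gives a direct way to check whether a batching strategy is actually reducing wasted computation.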
🔧 Feature Engineering
- Extraction of 10+ features
- Normalization (min-max, z-score)
- Interaction features
- Domain-specific features
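The two normalization schemes named in the list can be sketched with the standard library alone. The function names `min_max` and `z_score` are illustrative, and returning zeros for a constant column is an assumption about how that edge case should be handled.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to rescale (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center values on the mean and scale by the population std. deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


lengths = [2, 4, 6]
print(min_max(lengths))  # [0.0, 0.5, 1.0]
print(z_score(lengths))
```

Min-max scaling keeps values in a fixed range but is sensitive to outliers; z-score scaling is unbounded but less distorted by a single extreme value.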
🎲 Data Augmentation
- Word deletion (random)
- Word swap
- Synonym replacement
- Class balancing (3x increase in data)
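Random word deletion and word swap from the list above can be sketched as follows. The function names, the deletion probability `p`, and the whitespace-based tokenization are assumptions made for illustration.

```python
import random


def random_word_deletion(text, p=0.1, rng=random):
    """Drop each word independently with probability p."""
    words = text.split()
    kept = [w for w in words if rng.random() > p]
    # Never return an empty string: fall back to one randomly chosen word.
    if not kept and words:
        kept = [rng.choice(words)]
    return " ".join(kept)


def random_word_swap(text, rng=random):
    """Swap two randomly chosen word positions."""
    words = text.split()
    if len(words) < 2:
        return text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


rng = random.Random(0)  # fixed seed so the augmentation is reproducible
sentence = "the quick brown fox jumps"
deleted = random_word_deletion(sentence, p=0.3, rng=rng)
swapped = random_word_swap(sentence, rng=rng)
print(deleted)
print(swapped)
```

For class balancing, augmented copies like these would typically be generated only for the minority classes until the class counts are roughly even.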
📊 Advanced Sampling
Stratified Sampling
# Balanced train/test splits
train, test = stratified_split(
    dataset,
    stratify_column='label',
    train_ratio=0.8
)
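The call above does not show how `stratified_split` works internally. A minimal sketch, assuming the dataset is a list of dicts and adding a hypothetical `seed` parameter for reproducibility, might be:

```python
import random
from collections import defaultdict


def stratified_split(dataset, stratify_column, train_ratio=0.8, seed=42):
    """Split so that each class keeps roughly the same train/test proportion."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in dataset:
        groups[ex[stratify_column]].append(ex)

    train, test = [], []
    for examples in groups.values():
        rng.shuffle(examples)
        # Apply the ratio per class rather than to the dataset as a whole.
        cut = round(len(examples) * train_ratio)
        train.extend(examples[:cut])
        test.extend(examples[cut:])
    return train, test


dataset = [{"text": f"t{i}", "label": i % 2} for i in range(20)]
train, test = stratified_split(dataset, stratify_column="label", train_ratio=0.8)
print(len(train), len(test))  # 16 4
```

Splitting each class separately is what guarantees that a rare class cannot end up entirely in one of the two splits by chance.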
Diversity Sampling
# Maximum diversity
diverse = max_diversity_sampling(
    dataset,
    n_samples=100,
    feature_columns=['length', 'score']
)
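The source does not say which algorithm `max_diversity_sampling` uses. One common choice is greedy farthest-point selection, sketched here under the assumptions that the dataset is a list of dicts with numeric feature columns and that Euclidean distance is an acceptable measure.

```python
import math


def max_diversity_sampling(dataset, n_samples, feature_columns):
    """Greedy farthest-point selection in the given feature space."""
    if n_samples <= 0 or not dataset:
        return []

    def vec(ex):
        return [float(ex[c]) for c in feature_columns]

    selected = [0]  # start from the first example (arbitrary choice)
    chosen = {0}
    # Distance from each example to its nearest already-selected example.
    nearest = [math.dist(vec(ex), vec(dataset[0])) for ex in dataset]

    while len(selected) < min(n_samples, len(dataset)):
        # Pick the unselected example farthest from everything chosen so far.
        idx = max(
            (i for i in range(len(dataset)) if i not in chosen),
            key=nearest.__getitem__,
        )
        selected.append(idx)
        chosen.add(idx)
        for i, ex in enumerate(dataset):
            nearest[i] = min(nearest[i], math.dist(vec(ex), vec(dataset[idx])))
    return [dataset[i] for i in selected]


dataset = [
    {"length": 1, "score": 1},
    {"length": 2, "score": 1},
    {"length": 10, "score": 10},
    {"length": 5, "score": 5},
]
diverse = max_diversity_sampling(dataset, n_samples=3, feature_columns=["length", "score"])
print([ex["length"] for ex in diverse])  # [1, 10, 5]
```

Features on very different scales should be normalized first, otherwise the column with the largest range dominates the distance.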
Active Learning
# Uncertainty-based
uncertain = uncertainty_sampling(
    dataset,
    uncertainty_scores,
    n_samples=100
)
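A plausible reading of the call above is "return the `n_samples` examples with the highest uncertainty score". The sketch below implements that reading and also shows one common way to compute the scores, prediction entropy; the `entropy` helper and the sample probabilities are illustrative assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uncertainty_sampling(dataset, uncertainty_scores, n_samples):
    """Return the n_samples examples with the highest uncertainty score."""
    if len(dataset) != len(uncertainty_scores):
        raise ValueError("dataset and uncertainty_scores must have equal length")
    ranked = sorted(
        range(len(dataset)),
        key=lambda i: uncertainty_scores[i],
        reverse=True,
    )
    return [dataset[i] for i in ranked[:n_samples]]


dataset = ["a", "b", "c"]
predictions = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]  # hypothetical model outputs
uncertainty_scores = [entropy(p) for p in predictions]
uncertain = uncertainty_sampling(dataset, uncertainty_scores, n_samples=2)
print(uncertain)  # ['b', 'c']
```

The selected examples are the ones a human annotator would label next in an active learning loop.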
📦 Dynamic Batching
Length-Based
# Group examples of similar length
batches = length_based_batching(
    dataset,
    length_column='length'
)
# Result: 40% less padding
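The simplest way to group similar lengths is to sort by length and slice into consecutive batches. The sketch below assumes that approach and adds a hypothetical `batch_size` parameter; `padding_tokens` is an illustrative helper for measuring the effect, and the numbers are for this toy input only, not a reproduction of the 40% figure.

```python
def length_based_batching(dataset, length_column, batch_size=32):
    """Sort by length, then slice into consecutive batches."""
    ordered = sorted(dataset, key=lambda ex: ex[length_column])
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def padding_tokens(batches, length_column):
    """Total padding needed if every batch is padded to its own longest example."""
    total = 0
    for batch in batches:
        longest = max(ex[length_column] for ex in batch)
        total += sum(longest - ex[length_column] for ex in batch)
    return total


dataset = [{"length": n} for n in (3, 50, 4, 48, 5, 52)]
naive = [dataset[0:3], dataset[3:6]]  # batches in original order
grouped = length_based_batching(dataset, length_column="length", batch_size=3)

print(padding_tokens(naive, "length"))    # 144
print(padding_tokens(grouped, "length"))  # 9
```

In training, the order of the resulting batches is usually shuffled so the model does not see examples strictly from shortest to longest.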
Bucket Batching
# Split into buckets
batches = bucket_batching(
    dataset,
    n_buckets=5
)
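Bucket batching trades a little padding efficiency for randomness: examples are assigned to length ranges, then shuffled and batched within each range. The sketch below assumes equal-width buckets; `length_column`, `batch_size`, and `seed` are hypothetical parameters that the call above does not show.

```python
import random


def bucket_batching(dataset, n_buckets=5, length_column="length",
                    batch_size=32, seed=0):
    """Assign examples to equal-width length buckets, then batch per bucket."""
    if not dataset:
        return []
    lengths = [ex[length_column] for ex in dataset]
    lo, hi = min(lengths), max(lengths)
    width = (hi - lo) / n_buckets or 1  # avoid zero width when all lengths match

    buckets = [[] for _ in range(n_buckets)]
    for ex in dataset:
        # Clamp so the longest example lands in the last bucket.
        idx = min(int((ex[length_column] - lo) / width), n_buckets - 1)
        buckets[idx].append(ex)

    rng = random.Random(seed)
    batches = []
    for bucket in buckets:
        rng.shuffle(bucket)  # random order inside each bucket
        batches.extend(
            bucket[i:i + batch_size] for i in range(0, len(bucket), batch_size)
        )
    rng.shuffle(batches)  # mix short and long batches
    return batches


dataset = [{"length": n} for n in range(1, 101)]
batches = bucket_batching(dataset, n_buckets=5, batch_size=10)
print(len(batches))  # 10
```

Unlike strict length sorting, the composition of each batch changes with the seed, which helps when the same data is iterated over many epochs.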
Pipeline Pattern
pipeline = DataPipeline("My Pipeline")
pipeline.add_step("clean", clean_fn)
pipeline.add_step("features", extract_features)
pipeline.add_step("normalize", normalize_fn)
result = pipeline.run(dataset)
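The `DataPipeline` class itself is not shown in this README. A minimal sketch consistent with the calls above, assuming every step is a function that takes the whole dataset and returns a new one, could be the following; `clean_fn` and `extract_features` here are toy stand-ins.

```python
class DataPipeline:
    """Runs named processing steps over a dataset, in insertion order."""

    def __init__(self, name):
        self.name = name
        self.steps = []

    def add_step(self, name, fn):
        self.steps.append((name, fn))
        return self  # returning self allows chained add_step calls

    def run(self, dataset):
        for step_name, fn in self.steps:
            try:
                dataset = fn(dataset)
            except Exception as exc:
                # Report which step failed while keeping the original traceback.
                raise RuntimeError(
                    f"{self.name}: step '{step_name}' failed"
                ) from exc
        return dataset


def clean_fn(dataset):
    return [{**ex, "text": ex["text"].strip().lower()} for ex in dataset]


def extract_features(dataset):
    return [{**ex, "length": len(ex["text"].split())} for ex in dataset]


pipeline = DataPipeline("My Pipeline")
pipeline.add_step("clean", clean_fn)
pipeline.add_step("features", extract_features)
result = pipeline.run([{"text": "  Hello World "}])
print(result)  # [{'text': 'hello world', 'length': 2}]
```

Naming each step makes failures easy to locate and lets individual steps be reordered or replaced without touching the others.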
Performance
| Technique | Improvement | Use Case |
|---|---|---|
| Batch Processing | 2.3x | All operations |
| Dynamic Batching | 40% | Padding reduction |
| Data Augmentation | 3x | Data increase |
| Stratified Sampling | - | Balanced splits |
Best Practices
✅ Customize the collator to fit your model
✅ Use the pipeline pattern
✅ Balance classes with augmentation
✅ Improve generalization with stratified sampling
✅ Optimize with dynamic batching