
MEHMET TUĞRUL KAYA

Initial commit: Advanced Dataset Tutorial

2e6a47d 3 months ago

	# Advanced Techniques Examples

	This folder contains examples of advanced dataset processing techniques.

	## Techniques

	### 📦 Custom Data Collators

	#### 1. Simple Collator
	```python
	class SimpleCollator:
	    def __call__(self, batch):
	        texts = [ex['text'] for ex in batch]
	        labels = [ex['label'] for ex in batch]
	        return {'texts': texts, 'labels': labels}
	```

	#### 2. Padding Collator
	```python
	class PaddingCollator:
	    def __call__(self, batch):
	        # Dynamic padding: pad only to the longest example in this batch
	        max_len = max(len(ex['text']) for ex in batch)
	        # Pad to max_len...
	```
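
The padding step itself is elided above. A minimal sketch of one way to fill it in follows. It assumes each example already carries a tokenized `input_ids` list and that `0` is the padding id; both are illustrative assumptions, and `DynamicPaddingCollator` is a hypothetical name, not a class from this tutorial.

```python
class DynamicPaddingCollator:
    """Hypothetical collator: pads every sequence to the longest one in the batch."""

    def __init__(self, pad_id=0):
        self.pad_id = pad_id  # assumed padding token id

    def __call__(self, batch):
        max_len = max(len(ex["input_ids"]) for ex in batch)
        input_ids, attention_mask = [], []
        for ex in batch:
            ids = list(ex["input_ids"])
            n_pad = max_len - len(ids)
            input_ids.append(ids + [self.pad_id] * n_pad)
            # 1 marks a real token, 0 marks padding
            attention_mask.append([1] * len(ids) + [0] * n_pad)
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": [ex["label"] for ex in batch],
        }


batch = [
    {"input_ids": [5, 6, 7], "label": 1},
    {"input_ids": [8], "label": 0},
]
out = DynamicPaddingCollator()(batch)
print(out["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(out["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

Because the target length is computed per batch, a batch of short examples is never padded to the longest example in the whole dataset.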

	#### 3. Advanced Collator
	```python
	class AdvancedCollator:
	    def __call__(self, batch):
	        # Padding + normalization + stats
	        return {
	            'input_ids': padded,
	            'attention_mask': masks,
	            'labels': labels,
	            'batch_stats': {...}
	        }
	```
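
The contents of `batch_stats` are not shown above. As a rough illustration only, a collator could report how much of each padded batch is wasted on padding. The helper name `compute_batch_stats` and the chosen statistics are assumptions, not part of this tutorial.

```python
def compute_batch_stats(lengths):
    """Hypothetical helper: summarize the padding cost of one batch."""
    max_len = max(lengths)
    total_slots = max_len * len(lengths)  # cells in the padded matrix
    real_tokens = sum(lengths)
    return {
        "batch_size": len(lengths),
        "max_len": max_len,
        "mean_len": real_tokens / len(lengths),
        # share of the padded matrix that holds padding rather than tokens
        "padding_ratio": (total_slots - real_tokens) / total_slots,
    }


stats = compute_batch_stats([3, 1, 2])
print(stats["max_len"], stats["mean_len"])  # 3 2.0
```

A collator could call such a helper with the unpadded sequence lengths and attach the result under the `batch_stats` key.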

	### 🔧 Feature Engineering
	- Extraction of 10+ features
	- Normalization (min-max, z-score)
	- Interaction features
	- Domain-specific features
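
The two normalization schemes and an interaction feature can be sketched with the standard library alone. The column names `lengths` and `scores` are illustrative assumptions, and the helpers below are not the tutorial's own functions.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


lengths = [10, 20, 30]
scores = [0.5, 1.0, 1.5]

scaled = min_max(lengths)        # [0.0, 0.5, 1.0]
standardized = z_score(lengths)  # symmetric around 0
# Interaction feature: product of two existing features
interaction = [l * s for l, s in zip(lengths, scores)]  # [5.0, 20.0, 45.0]
```

Min-max scaling is sensitive to outliers because a single extreme value sets the range; z-score scaling is usually the safer default when the feature has a long tail.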

	### 🎲 Data Augmentation
	- Word deletion (random)
	- Word swap
	- Synonym replacement
	- Class balancing (3x increase in data)
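
A minimal sketch of word deletion, word swap, and augmentation-based class balancing, assuming whitespace-tokenized text and a dataset given as a list of dicts with `text` and `label` keys. All function names here are hypothetical. Synonym replacement is left out because it needs a synonym dictionary.

```python
import random


def random_deletion(words, p, rng):
    """Drop each word with probability p, but never return an empty list."""
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


def random_swap(words, rng):
    """Swap the words at two randomly chosen positions."""
    words = list(words)
    if len(words) >= 2:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def balance_by_augmentation(examples, rng):
    """Add augmented copies of minority-class examples until classes are even."""
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex["label"], []).append(ex)
    target = max(len(group) for group in by_label.values())
    out = list(examples)
    for label, group in by_label.items():
        for _ in range(target - len(group)):
            src = rng.choice(group)
            new_words = random_swap(src["text"].split(), rng)
            out.append({"text": " ".join(new_words), "label": label})
    return out


rng = random.Random(0)  # fixed seed for reproducible augmentation
words = "the quick brown fox jumps".split()
deleted = random_deletion(words, p=0.2, rng=rng)
swapped = random_swap(words, rng)

examples = [
    {"text": "good movie overall", "label": 0},
    {"text": "really liked it", "label": 0},
    {"text": "would watch again", "label": 0},
    {"text": "not my taste", "label": 1},
]
balanced = balance_by_augmentation(examples, rng)
```

Augment only the training split; augmented copies that leak into the test split inflate the evaluation scores.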

	### 📊 Advanced Sampling

	#### Stratified Sampling
	```python
	# Balanced train/test splits
	train, test = stratified_split(
	    dataset,
	    stratify_column='label',
	    train_ratio=0.8
	)
	```
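
The body of `stratified_split` is not shown in this README. One possible implementation, assuming the dataset is a plain list of dicts (the tutorial's actual dataset type may differ) and adding a `seed` parameter that is not in the call above:

```python
import random
from collections import defaultdict


def stratified_split(dataset, stratify_column, train_ratio=0.8, seed=42):
    """Split each class separately so both splits keep the class proportions."""
    groups = defaultdict(list)
    for row in dataset:
        groups[row[stratify_column]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for rows in groups.values():
        rows = list(rows)
        rng.shuffle(rows)
        cut = round(len(rows) * train_ratio)
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test


dataset = [{"text": f"t{i}", "label": i % 2} for i in range(20)]
train, test = stratified_split(dataset, stratify_column="label", train_ratio=0.8)
print(len(train), len(test))  # 16 4
```

Each class contributes 8 of its 10 rows to `train` and 2 to `test`, so the 50/50 label balance of the input is preserved in both splits.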

	#### Diversity Sampling
	```python
	# Maximum diversity
	diverse = max_diversity_sampling(
	    dataset,
	    n_samples=100,
	    feature_columns=['length', 'score']
	)
	```
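
The README does not say how `max_diversity_sampling` picks its samples. A common choice is greedy farthest-point selection, sketched here under the assumptions that the dataset is a list of dicts and that the feature columns are numeric and on comparable scales. This is a swapped-in technique for illustration, not necessarily the tutorial's method.

```python
import math


def max_diversity_sampling(dataset, n_samples, feature_columns):
    """Greedy farthest-point selection in the space of the given features."""
    points = [[row[c] for c in feature_columns] for row in dataset]
    n_samples = min(n_samples, len(points))
    if n_samples == 0:
        return []

    selected = [0]  # start from the first row (arbitrary choice)
    # Distance from every point to its nearest selected point
    min_dist = [math.dist(p, points[0]) for p in points]
    min_dist[0] = -1.0  # mark as already selected

    while len(selected) < n_samples:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
        min_dist[nxt] = -1.0
    return [dataset[i] for i in selected]


data = [
    {"length": 1, "score": 0.0},
    {"length": 2, "score": 0.0},
    {"length": 3, "score": 0.0},
    {"length": 100, "score": 0.0},
]
diverse = max_diversity_sampling(data, n_samples=2, feature_columns=["length", "score"])
print([row["length"] for row in diverse])  # [1, 100]
```

Unscaled features distort the distances: here `length` dominates `score`, so normalizing the columns first is usually advisable.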

	#### Active Learning
	```python
	# Uncertainty-based
	uncertain = uncertainty_sampling(
	    dataset,
	    uncertainty_scores,
	    n_samples=100
	)
	```
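
How `uncertainty_scores` are produced is not shown. One common option is the entropy of the model's predicted class probabilities. The sketch below assumes that choice and a list-of-dicts dataset; the `entropy` helper and the probabilities are illustrative, not taken from the tutorial.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability vector; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uncertainty_sampling(dataset, uncertainty_scores, n_samples):
    """Return the n_samples rows with the highest uncertainty score."""
    ranked = sorted(
        range(len(dataset)),
        key=lambda i: uncertainty_scores[i],
        reverse=True,
    )
    return [dataset[i] for i in ranked[:n_samples]]


pool = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
# Hypothetical model outputs for a binary classifier
probabilities = [[0.5, 0.5], [0.9, 0.1], [0.99, 0.01]]
scores = [entropy(p) for p in probabilities]

uncertain = uncertainty_sampling(pool, scores, n_samples=2)
print([row["text"] for row in uncertain])  # ['a', 'b']
```

The selected rows are the ones to send for labeling next, since the model is least sure about them.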

	### 📦 Dynamic Batching

	#### Length-Based
	```python
	# Group examples of similar length
	batches = length_based_batching(
	    dataset,
	    length_column='length'
	)
	# Result: 40% reduction in padding
	```
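
A sketch of why grouping by length reduces padding, assuming a list-of-dicts dataset and an added `batch_size` parameter that is not in the call above. The numbers come from this toy input only and say nothing about the 40% figure reported for the tutorial's data.

```python
def length_based_batching(dataset, length_column, batch_size=2):
    """Sort by length, then cut into consecutive batches."""
    ordered = sorted(dataset, key=lambda row: row[length_column])
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def padding_tokens(batches, length_column):
    """Count padding slots when each batch is padded to its longest row."""
    total = 0
    for batch in batches:
        longest = max(row[length_column] for row in batch)
        total += sum(longest - row[length_column] for row in batch)
    return total


data = [{"length": n} for n in (2, 50, 3, 48)]

# Baseline: batches in the original order
naive = [data[i:i + 2] for i in range(0, len(data), 2)]
grouped = length_based_batching(data, length_column="length")

print(padding_tokens(naive, "length"))    # 93
print(padding_tokens(grouped, "length"))  # 3
```

Fully sorted batches are always seen in the same order, so it is common to shuffle the order of the batches between epochs.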

	#### Bucket Batching
	```python
	# Split into buckets
	batches = bucket_batching(
	    dataset,
	    n_buckets=5
	)
	```
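
The bucketing rule is not given in this README. The sketch below assumes equal-width length ranges, a list-of-dicts dataset, and two extra parameters (`length_column`, `batch_size`) that do not appear in the call above.

```python
def bucket_batching(dataset, n_buckets=5, length_column="length", batch_size=2):
    """Assign rows to equal-width length buckets, then batch within each bucket."""
    lengths = [row[length_column] for row in dataset]
    lo, hi = min(lengths), max(lengths)
    width = (hi - lo) / n_buckets or 1  # fall back to 1 if all lengths are equal

    buckets = [[] for _ in range(n_buckets)]
    for row in dataset:
        # Clamp so the longest row lands in the last bucket
        idx = min(int((row[length_column] - lo) / width), n_buckets - 1)
        buckets[idx].append(row)

    batches = []
    for bucket in buckets:
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    return batches


data = [{"length": n} for n in (1, 2, 3, 10, 11, 12, 50, 51)]
batches = bucket_batching(data, n_buckets=5)
print([[row["length"] for row in b] for b in batches])
# [[1, 2], [3, 10], [11, 12], [50, 51]]
```

Unlike a full sort, bucketing keeps rows within a bucket in arbitrary order, so shuffling inside each bucket retains some randomness while still bounding the length spread per batch.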

	## Pipeline Pattern

	```python
	pipeline = DataPipeline("My Pipeline")
	pipeline.add_step("clean", clean_fn)
	pipeline.add_step("features", extract_features)
	pipeline.add_step("normalize", normalize_fn)

	result = pipeline.run(dataset)
	```
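
The `DataPipeline` class itself is not shown here. A minimal sketch of what such a class might look like, assuming each step is a function that takes the whole dataset and returns a new one; the three step functions are hypothetical stand-ins for `clean_fn`, `extract_features`, and `normalize_fn`.

```python
class DataPipeline:
    """Hypothetical pipeline: runs named steps in the order they were added."""

    def __init__(self, name):
        self.name = name
        self.steps = []

    def add_step(self, name, fn):
        self.steps.append((name, fn))
        return self  # allows chained add_step calls

    def run(self, data):
        for _step_name, fn in self.steps:
            data = fn(data)
        return data


def clean_fn(rows):
    return [{**r, "text": r["text"].strip().lower()} for r in rows]


def extract_features(rows):
    return [{**r, "length": len(r["text"].split())} for r in rows]


def normalize_fn(rows):
    lo = min(r["length"] for r in rows)
    hi = max(r["length"] for r in rows)
    span = (hi - lo) or 1  # avoid division by zero
    return [{**r, "length": (r["length"] - lo) / span} for r in rows]


pipeline = DataPipeline("My Pipeline")
pipeline.add_step("clean", clean_fn)
pipeline.add_step("features", extract_features)
pipeline.add_step("normalize", normalize_fn)

result = pipeline.run([{"text": "  Hello World "}, {"text": "A b c d"}])
print(result[0])  # {'text': 'hello world', 'length': 0.0}
```

Keeping steps as plain functions makes each one testable on its own and lets the step order be changed without touching their code.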

	## Performance

	| Technique | Gain | Use Case |
	|-----------|------|----------|
	| Batch Processing | 2.3x | All operations |
	| Dynamic Batching | 40% | Padding reduction |
	| Data Augmentation | 3x | Data increase |
	| Stratified Sampling | - | Balanced splits |

	## Best Practices

	✅ Customize the collator for your model
	✅ Use the pipeline pattern
	✅ Balance classes with augmentation
	✅ Use stratified sampling for better generalization
	✅ Optimize with dynamic batching