
MEHMET TUĞRUL KAYA


---
title: Advanced Dataset Tutorial - Hugging Face Datasets, Advanced Level
emoji: 📚
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: space/app.py
pinned: false
license: mit
tags:
- datasets
- tutorial
- nlp
- machine-learning
- data-processing
- Turkish
---

# 📚 Advanced Dataset Tutorial - Hugging Face Datasets, Advanced Level

Comprehensive Turkish-language tutorial material on advanced data processing techniques with the Hugging Face Datasets library.

## 🎯 About the Project

This project is a comprehensive tutorial series for anyone who wants to use the Hugging Face Datasets library at a professional level. It contains 4 main modules and 20+ practical examples.

## 📖 Modules

### 1️⃣ Large-Scale Datasets
- Processing large data with streaming (750GB+ datasets)
- Memory-efficient preprocessing
- Batch processing optimization (2.3x speedup)
- Multi-process parallelization (64x speedup)
- Cache management (12.1x speedup)
- Dataset sharding and distributed training

Performance gains:
- ⚡ Batch processing: 2.3x faster
- 💾 Caching: 12.1x faster
- 🚀 Multi-processing: 64x faster
- 📦 Generator pattern: minimal RAM usage
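
The generator pattern listed above can be sketched in plain Python, without the `datasets` library. This is a minimal illustration, not the tutorial's own implementation: `fake_stream` and `iter_batches` are hypothetical names, and the batch size of 1000 is an arbitrary choice. Only one batch is held in memory at a time, which is why RAM usage stays small regardless of the total record count.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_batches(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `batch_size` records without materializing the input."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_stream(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for a streamed dataset: records are created lazily."""
    for i in range(n):
        yield {"id": i, "text": f"example {i}"}


batch_sizes = [len(batch) for batch in iter_batches(fake_stream(2500), batch_size=1000)]
print(batch_sizes)  # [1000, 1000, 500]
```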

### 2️⃣ Domain-Specific Datasets
- Scientific papers (arXiv, PubMed style)
- Code datasets (6 programming languages)
- Financial analysis (sentiment + market data)
- Medical/health data (PHI anonymization)
- Cross-domain integration (3 solution approaches)

Generated datasets:
- 🔬 2,000 scientific papers
- 💻 2,000 code samples
- 💰 2,000 financial records
- 🏥 2,000 medical records
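
As a rough idea of what PHI anonymization can look like, the sketch below replaces a few identifier types with placeholders using regular expressions. The patterns, placeholder tokens, and the `anonymize` helper are assumptions made for illustration; they are not taken from the tutorial's modules, and real de-identification of medical text needs a vetted, audited tool rather than three regexes.

```python
import re

# Hypothetical patterns for illustration only; they cover a tiny subset of PHI.
PHI_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "[DATE]"),
]


def anonymize(text: str) -> str:
    """Replace every match of each pattern with its placeholder token."""
    for pattern, placeholder in PHI_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


note = "Patient seen on 2023-04-17, contact jane.doe@example.com or 555-123-4567."
print(anonymize(note))
# Patient seen on [DATE], contact [EMAIL] or [PHONE].
```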

### 3️⃣ Advanced Techniques
- Custom data collators (3 different types)
- Advanced feature extraction (10+ features)
- Preprocessing pipelines (modular and reusable)
- Data augmentation (3x more data)
- Stratified sampling (balanced splits)
- Dynamic batching (40% less padding)
- Active learning integration

Techniques:
- 📦 Simple, padding, and advanced collators
- 🔧 Feature engineering pipeline
- 🎲 Smart data augmentation
- 📊 Diversity and uncertainty sampling
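
To see why dynamic batching reduces padding, the sketch below counts pad tokens for the same sequence lengths batched in arrival order and batched after sorting by length. The lengths are made up for the example, so the reduction it shows is not the 40% figure reported above, and `padding_tokens` is a hypothetical helper. Plain sorting is a simplification: in training, similar-length examples are usually grouped into buckets that are then shuffled.

```python
from typing import List


def padding_tokens(lengths: List[int], batch_size: int) -> int:
    """Count pad tokens when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


# Made-up token counts: short and long sequences alternate.
lengths = [5, 48, 7, 50, 6, 47, 8, 49]

fixed = padding_tokens(lengths, batch_size=4)            # batches in arrival order
dynamic = padding_tokens(sorted(lengths), batch_size=4)  # similar lengths together

print(fixed, dynamic)  # 176 12
```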

### 4️⃣ Datasets for Specific Tasks
- Question answering (SQuAD-style)
- Summarization (CNN/DailyMail)
- Named entity recognition (BIO tagging)
- Sentiment analysis (aspect-based)
- Text classification (multi-class)
- Multi-task learning

Task-specific datasets:
- ❓ 200 QA pairs + 100 multiple-choice questions
- 📝 100 summarization pairs
- 🏷️ 100 NER-annotated sentences
- 😊 300 sentiment reviews
- 📊 200 topic classification examples
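
BIO tagging, mentioned in the NER item above, marks the first token of an entity with `B-`, the following tokens of the same entity with `I-`, and everything else with `O`. The sketch below converts token-level spans to BIO tags; the span format `(start, end, label)` with an exclusive end and the `spans_to_bio` helper are assumptions for this example, not the tutorial's data format.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, label) token spans, end exclusive, to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = ["Ada", "Lovelace", "works", "at", "Hugging", "Face", "."]
spans = [(0, 2, "PER"), (4, 6, "ORG")]
tags = spans_to_bio(tokens, spans)

print(tags)
# ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O']
```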

## 🚀 Quick Start

### Online Demo (Gradio)
```bash
# Run the Space app
python space/app.py
```

### Manual Usage
```python
from datasets import load_dataset

# Example: load the tutorial dataset from the Hub
dataset = load_dataset("tugrulkaya/advanced-dataset-tutorial")
```

## 💻 Installation

```bash
# Required libraries
pip install datasets transformers numpy pandas

# Optional
pip install gradio  # For the interactive demo
```

## 📂 Project Structure

```
advanced-dataset-tutorial/
├── 📊 datasets/                          # Example datasets
│   ├── large_scale_example/              # Large-scale examples
│   ├── domain_specific_example/          # Domain-specific examples
│   ├── advanced_techniques_example/      # Advanced technique examples
│   └── task_specific_example/            # Task-specific examples
│
├── 🌐 space/                             # Gradio Space
│   ├── app.py                            # Main application
│   ├── modules/                          # All module scripts
│   │   ├── 01_buyuk_olcekli_datasets_complete.py
│   │   ├── 02_domain_specific_datasets.py
│   │   ├── 02b_cross_domain_fix.py
│   │   ├── 03_ileri_teknikler_part1.py
│   │   ├── 03_ileri_teknikler_part2.py
│   │   └── 04_ozel_gorevler.py
│   └── README.md
│
└── README.md                             # This file
```

## 🎓 Learning Path

### Beginner Level
1. ✅ Part 1: Large-Scale Datasets
   - Streaming basics
   - Batch processing
   - Memory management

### Intermediate Level
2. ✅ Part 2: Domain-Specific Datasets
   - Scientific data
   - Code datasets
   - Cross-domain integration

### Advanced Level
3. ✅ Part 3: Advanced Techniques
   - Custom collators
   - Pipeline patterns
   - Advanced sampling

### Expert Level
4. ✅ Part 4: Specific Tasks
   - Task-specific preprocessing
   - Quality metrics
   - Multi-task learning

## 📊 Performance Metrics

| Technique | Performance Gain | Use Case |
|-----------|------------------|----------|
| Batch Processing | 2.3x faster | All preprocessing |
| Caching | 12.1x faster | Repeated operations |
| Multi-Processing | 64x faster | CPU-intensive tasks |
| Dynamic Batching | 40% less padding | Training efficiency |
| Data Augmentation | 3x more data | Class imbalance |

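## 🔧 Best Practices

### Memory Efficiency
```python
from datasets import load_dataset

# "huge_dataset" is a placeholder name, not a real Hub dataset.

# ✅ DO: stream large datasets so examples are fetched lazily
dataset = load_dataset("huge_dataset", streaming=True)

# ❌ DON'T: download and prepare the whole dataset up front
dataset = load_dataset("huge_dataset")  # needs the full download (e.g. 100GB) before first use
```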

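### Batch Processing
```python
# ✅ DO: batched operations (process_fn receives up to 1000 rows per call)
dataset = dataset.map(process_fn, batched=True, batch_size=1000)

# ❌ DON'T: process examples one at a time
dataset = dataset.map(process_fn, batched=False)  # one call per row; 2.3x slower in this tutorial
```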
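
With `batched=True`, `map` passes the function a dictionary that maps each column name to a list of values, and expects a dictionary of lists back. The sketch below shows one possible `process_fn` written for that contract and calls it directly on a hand-built batch, so it runs without the `datasets` library. The `text` column and the `n_words` feature are assumptions for this example.

```python
from typing import Dict, List


def process_fn(batch: Dict[str, List]) -> Dict[str, List]:
    """Batched transform: one call handles many rows at once."""
    texts = [t.strip().lower() for t in batch["text"]]
    return {"text": texts, "n_words": [len(t.split()) for t in texts]}


# A hand-built batch in the same column -> list layout that map() would pass.
batch = {"text": ["  Hello World ", "Datasets ARE fast"], "label": [0, 1]}
result = process_fn(batch)

print(result)
# {'text': ['hello world', 'datasets are fast'], 'n_words': [2, 3]}
```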

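### Cross-Domain Integration
```python
import json
from datasets import concatenate_datasets

# ✅ DO: normalize every dataset to a common schema first
def normalize(example, domain):
    return {
        'text': example.get('text') or example.get('content'),
        'domain': domain,
        # Serialize nested metadata so every dataset has the same string column
        'metadata': json.dumps(example.get('meta', {})),
    }

# ❌ DON'T: concatenate datasets with different schemas directly
combined = concatenate_datasets([ds1, ds2])  # fails on mismatched schemas (ArrowTypeError)
```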
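
The effect of normalizing to a common schema can be checked on plain Python records before involving the library. In the sketch below, the two record lists and their field names (`text`, `content`, `meta`) are made-up examples of mismatched schemas; after normalization every record has the same three keys, which is the precondition for concatenating them.

```python
import json


def normalize(example: dict, domain: str) -> dict:
    """Map a record from any domain onto the shared text/domain/metadata schema."""
    return {
        "text": example.get("text") or example.get("content"),
        "domain": domain,
        "metadata": json.dumps(example.get("meta", {})),
    }


# Made-up records: each domain names its fields differently.
science = [{"text": "We study streaming data loaders.", "meta": {"year": 2020}}]
code = [{"content": "print('hi')"}]

combined = [normalize(r, "science") for r in science] + [normalize(r, "code") for r in code]
schemas = {tuple(sorted(record)) for record in combined}

print(schemas)  # {('domain', 'metadata', 'text')}
```

With the `datasets` library, the same function could be applied through `ds.map(normalize, fn_kwargs={"domain": "science"}, remove_columns=ds.column_names)` so that only the shared columns remain before concatenation.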

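## 🎯 Usage Examples

### 1. Processing a Large Dataset
```python
from datasets import load_dataset

# Streaming mode: examples are fetched lazily instead of downloading everything
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

def process(example):
    """Placeholder: replace with your own per-example logic."""
    return len(example["text"])

# Process the first 1000 examples
for example in dataset.take(1000):
    process(example)
```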

### 2. Custom Collator
```python
from torch.utils.data import DataLoader

class CustomCollator:
    """Gather texts and labels from a list of examples into one batch dict."""
    def __call__(self, batch):
        texts = [ex['text'] for ex in batch]
        labels = [ex['label'] for ex in batch]
        return {'texts': texts, 'labels': labels}

# Use with a DataLoader
collator = CustomCollator()
dataloader = DataLoader(dataset, batch_size=32, collate_fn=collator)
```
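
A padding collator follows the same shape but also equalizes sequence lengths within the batch. The sketch below is one possible version in plain Python; the `input_ids` and `label` field names, the pad id of 0, and the `PaddingCollator` class itself are assumptions for illustration rather than the tutorial's implementation. It returns lists; a training loop would typically convert them to tensors.

```python
from typing import Dict, List


class PaddingCollator:
    """Pad every sequence in a batch to the length of the longest one."""

    def __init__(self, pad_id: int = 0):
        self.pad_id = pad_id

    def __call__(self, batch: List[Dict]) -> Dict[str, List]:
        max_len = max(len(ex["input_ids"]) for ex in batch)
        input_ids, attention_mask = [], []
        for ex in batch:
            ids = ex["input_ids"]
            pad = max_len - len(ids)
            input_ids.append(ids + [self.pad_id] * pad)
            # 1 marks a real token, 0 marks padding
            attention_mask.append([1] * len(ids) + [0] * pad)
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": [ex["label"] for ex in batch],
        }


batch = [{"input_ids": [7, 8, 9], "label": 1}, {"input_ids": [4], "label": 0}]
out = PaddingCollator()(batch)

print(out["input_ids"])       # [[7, 8, 9], [4, 0, 0]]
print(out["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```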

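### 3. Data Augmentation
```python
import random

def augment(example):
    # Random word deletion: drop up to 2 words while keeping the word order
    words = example['text'].split()
    n_keep = min(len(words), max(1, len(words) - 2))
    keep = sorted(random.sample(range(len(words)), k=n_keep))
    augmented = ' '.join(words[i] for i in keep)
    return {'text': augmented, 'label': example['label']}

augmented_dataset = dataset.map(augment)
```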
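
A per-example `map` like the one above keeps the row count unchanged. To grow the data, for instance to the 3x mentioned earlier, one option is a batched function that returns more rows than it receives. The sketch below shows such a function in plain Python, called on a hand-built batch; `augment_batch`, `delete_random_word`, the `copies` parameter, and the fixed seed are assumptions for this example.

```python
import random
from typing import Dict, List


def delete_random_word(text: str, rng: random.Random) -> str:
    """Drop one randomly chosen word; leave very short texts untouched."""
    words = text.split()
    if len(words) < 2:
        return text
    drop = rng.randrange(len(words))
    return " ".join(w for i, w in enumerate(words) if i != drop)


def augment_batch(batch: Dict[str, List], copies: int = 2, seed: int = 0) -> Dict[str, List]:
    """Return each original row followed by `copies` augmented variants."""
    rng = random.Random(seed)
    texts, labels = [], []
    for text, label in zip(batch["text"], batch["label"]):
        texts.append(text)
        labels.append(label)
        for _ in range(copies):
            texts.append(delete_random_word(text, rng))
            labels.append(label)
    return {"text": texts, "label": labels}


batch = {"text": ["the quick brown fox", "datasets are fast"], "label": [0, 1]}
out = augment_batch(batch)

print(len(batch["text"]), "->", len(out["text"]))  # 2 -> 6
```

With the `datasets` library, a function like this would be passed to `map` with `batched=True`; any columns it does not return need to be dropped with `remove_columns`, because all columns must end up with the same number of rows.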

## 📈 Statistics

- Total lines of code: 5,000+
- Sample count: 20,000+
- Techniques: 50+
- Best practices: 100+

## 🤝 Contributing

This project is open source and open to contributions!

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push the branch (`git push origin feature/amazing`)
5. Open a Pull Request

## 📝 License

MIT License - see the [LICENSE](LICENSE) file for details.

## 👨‍💻 Author

This tutorial was prepared to give Hugging Face Datasets users practical, applicable knowledge.

## 🙏 Acknowledgments

- The Hugging Face team, for the great `datasets` library
- The open-source community, for its ongoing contributions

## 📚 Resources

- [Hugging Face Datasets Documentation](https://huggingface.co/docs/datasets)
- [Hugging Face Hub](https://huggingface.co/datasets)
- [Apache Arrow](https://arrow.apache.org/)

## 🔗 Links

- 🌐 [Hugging Face Space](https://huggingface.co/spaces/tugrulkaya/advanced-dataset-tutorial)
- 📊 [Datasets](https://huggingface.co/datasets/tugrulkaya/advanced-dataset-tutorial)
- 💬 [Discussions](https://huggingface.co/spaces/tugrulkaya/advanced-dataset-tutorial/discussions)

---

⭐ If you like it, don't forget to leave a star!

🔄 Follow for updates!

💬 Open a Discussion for your questions!