MEHMET TUĞRUL KAYA commited on
Commit ·
2e6a47d
1
Parent(s): e600950
Initial commit: Advanced Dataset Tutorial
Browse files- DEPLOYMENT.md +191 -0
- LICENSE +21 -0
- README.md +288 -7
- datasets/advanced_techniques_example/README.md +131 -0
- datasets/domain_specific_example/README.md +76 -0
- datasets/large_scale_example/README.md +56 -0
- datasets/task_specific_example/README.md +189 -0
- requirements.txt +5 -0
- space/app.py +493 -0
- space/modules/01_buyuk_olcekli_datasets_complete.py +617 -0
- space/modules/02_domain_specific_datasets.py +870 -0
- space/modules/02b_cross_domain_fix.py +498 -0
- space/modules/03_ileri_teknikler_part1.py +856 -0
- space/modules/03_ileri_teknikler_part2.py +776 -0
- space/modules/04_ozel_gorevler.py +1039 -0
DEPLOYMENT.md
ADDED
|
@@ -0,0 +1,191 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# 🚀 Hugging Face'e Yükleme Talimatları
|
| 2 |
+
|
| 3 |
+
Bu dosya, projeyi Hugging Face'e nasıl yükleyeceğinizi açıklar.
|
| 4 |
+
|
| 5 |
+
## 📋 Ön Hazırlık
|
| 6 |
+
|
| 7 |
+
1. **Hugging Face hesabı** oluşturun: https://huggingface.co/join
|
| 8 |
+
2. **Access token** alın: https://huggingface.co/settings/tokens
|
| 9 |
+
3. **Git LFS** kurun (büyük dosyalar için):
|
| 10 |
+
```bash
|
| 11 |
+
git lfs install
|
| 12 |
+
```
|
| 13 |
+
|
| 14 |
+
## 🌐 Space Olarak Yükleme
|
| 15 |
+
|
| 16 |
+
### 1. Yeni Space Oluştur
|
| 17 |
+
|
| 18 |
+
Hugging Face'te: https://huggingface.co/new-space
|
| 19 |
+
|
| 20 |
+
- **Space name**: `advanced-dataset-tutorial`
|
| 21 |
+
- **License**: MIT
|
| 22 |
+
- **SDK**: Gradio
|
| 23 |
+
- **Hardware**: CPU (basic)
|
| 24 |
+
|
| 25 |
+
### 2. Repository'yi Clone Et
|
| 26 |
+
|
| 27 |
+
```bash
|
| 28 |
+
git clone https://huggingface.co/spaces/YOUR-USERNAME/advanced-dataset-tutorial
|
| 29 |
+
cd advanced-dataset-tutorial
|
| 30 |
+
```
|
| 31 |
+
|
| 32 |
+
### 3. Dosyaları Kopyala
|
| 33 |
+
|
| 34 |
+
```bash
|
| 35 |
+
# Proje dosyalarını kopyala
|
| 36 |
+
cp -r /path/to/advanced-dataset-tutorial/* .
|
| 37 |
+
|
| 38 |
+
# Yapı:
|
| 39 |
+
# .
|
| 40 |
+
# ├── README.md
|
| 41 |
+
# ├── requirements.txt
|
| 42 |
+
# ├── LICENSE
|
| 43 |
+
# ├── .gitignore
|
| 44 |
+
# ├── datasets/
|
| 45 |
+
# └── space/
|
| 46 |
+
# ├── app.py
|
| 47 |
+
# └── modules/
|
| 48 |
+
```
|
| 49 |
+
|
| 50 |
+
### 4. Push Et
|
| 51 |
+
|
| 52 |
+
```bash
|
| 53 |
+
git add .
|
| 54 |
+
git commit -m "Initial commit: Advanced Dataset Tutorial"
|
| 55 |
+
git push
|
| 56 |
+
```
|
| 57 |
+
|
| 58 |
+
### 5. Space Otomatik Deploy Olacak! 🎉
|
| 59 |
+
|
| 60 |
+
Birkaç dakika içinde: `https://huggingface.co/spaces/YOUR-USERNAME/advanced-dataset-tutorial`
|
| 61 |
+
|
| 62 |
+
## 📊 Dataset Olarak Yükleme (Opsiyonel)
|
| 63 |
+
|
| 64 |
+
### 1. Dataset Repository Oluştur
|
| 65 |
+
|
| 66 |
+
```bash
|
| 67 |
+
# Yeni dataset repository
|
| 68 |
+
huggingface-cli repo create advanced-dataset-tutorial --type dataset
|
| 69 |
+
|
| 70 |
+
# Clone
|
| 71 |
+
git clone https://huggingface.co/datasets/YOUR-USERNAME/advanced-dataset-tutorial
|
| 72 |
+
cd advanced-dataset-tutorial
|
| 73 |
+
```
|
| 74 |
+
|
| 75 |
+
### 2. Dataset Dosyalarını Hazırla
|
| 76 |
+
|
| 77 |
+
```python
|
| 78 |
+
# create_datasets.py
|
| 79 |
+
from datasets import Dataset, DatasetDict
|
| 80 |
+
|
| 81 |
+
# Örnek dataset'leri oluştur
|
| 82 |
+
datasets = DatasetDict({
|
| 83 |
+
'large_scale_examples': ...,
|
| 84 |
+
'domain_specific_examples': ...,
|
| 85 |
+
'advanced_techniques_examples': ...,
|
| 86 |
+
'task_specific_examples': ...
|
| 87 |
+
})
|
| 88 |
+
|
| 89 |
+
# Kaydet
|
| 90 |
+
datasets.save_to_disk('dataset')
|
| 91 |
+
```
|
| 92 |
+
|
| 93 |
+
### 3. Push Dataset
|
| 94 |
+
|
| 95 |
+
```bash
|
| 96 |
+
git add .
|
| 97 |
+
git commit -m "Add dataset examples"
|
| 98 |
+
git push
|
| 99 |
+
```
|
| 100 |
+
|
| 101 |
+
## 🔗 GitHub Integration (Opsiyonel)
|
| 102 |
+
|
| 103 |
+
### 1. GitHub Repository Oluştur
|
| 104 |
+
|
| 105 |
+
```bash
|
| 106 |
+
# GitHub'da repo oluştur, sonra:
|
| 107 |
+
git remote add github https://github.com/YOUR-USERNAME/advanced-dataset-tutorial
|
| 108 |
+
git push github main
|
| 109 |
+
```
|
| 110 |
+
|
| 111 |
+
### 2. Hugging Face ile Sync Et
|
| 112 |
+
|
| 113 |
+
Hugging Face Space settings'de:
|
| 114 |
+
- GitHub repository'yi bağla
|
| 115 |
+
- Auto-sync etkinleştir
|
| 116 |
+
|
| 117 |
+
## 📝 Yükleme Sonrası Checklist
|
| 118 |
+
|
| 119 |
+
- [ ] Space çalışıyor mu? Test et
|
| 120 |
+
- [ ] README düzgün görünüyor mu?
|
| 121 |
+
- [ ] Gradio demo açılıyor mu?
|
| 122 |
+
- [ ] Tüm modüller yüklendi mi?
|
| 123 |
+
- [ ] License doğru mu?
|
| 124 |
+
- [ ] Tags eklendi mi?
|
| 125 |
+
|
| 126 |
+
## 🎨 Customization
|
| 127 |
+
|
| 128 |
+
### Space Settings
|
| 129 |
+
|
| 130 |
+
Settings'den düzenle:
|
| 131 |
+
- **Title**: Advanced Dataset Tutorial
|
| 132 |
+
- **Emoji**: 📚
|
| 133 |
+
- **Theme**: Soft (veya istediğiniz)
|
| 134 |
+
- **Hardware**: CPU Basic (ücretsiz)
|
| 135 |
+
|
| 136 |
+
### README Metadata
|
| 137 |
+
|
| 138 |
+
README.md başındaki metadata'yı güncelleyin:
|
| 139 |
+
```yaml
|
| 140 |
+
---
|
| 141 |
+
title: Advanced Dataset Tutorial
|
| 142 |
+
emoji: 📚
|
| 143 |
+
colorFrom: blue
|
| 144 |
+
colorTo: purple
|
| 145 |
+
sdk: gradio
|
| 146 |
+
sdk_version: 4.44.0
|
| 147 |
+
app_file: space/app.py
|
| 148 |
+
---
|
| 149 |
+
```
|
| 150 |
+
|
| 151 |
+
## 🐛 Sorun Giderme
|
| 152 |
+
|
| 153 |
+
### Space Build Hatası
|
| 154 |
+
|
| 155 |
+
1. Logs'u kontrol et
|
| 156 |
+
2. `requirements.txt` doğru mu?
|
| 157 |
+
3. `app.py` path'i doğru mu?
|
| 158 |
+
|
| 159 |
+
### Import Hatası
|
| 160 |
+
|
| 161 |
+
```python
|
| 162 |
+
# app.py'de path ekle
|
| 163 |
+
import sys
|
| 164 |
+
from pathlib import Path
|
| 165 |
+
sys.path.append(str(Path(__file__).parent / "modules"))
|
| 166 |
+
```
|
| 167 |
+
|
| 168 |
+
### Network Hatası
|
| 169 |
+
|
| 170 |
+
Hugging Face'te bazı URL'ler block'lanabilir. Lokal dataset'ler kullanın.
|
| 171 |
+
|
| 172 |
+
## 📚 Kaynaklar
|
| 173 |
+
|
| 174 |
+
- [Hugging Face Spaces Docs](https://huggingface.co/docs/hub/spaces)
|
| 175 |
+
- [Gradio Docs](https://gradio.app/docs/)
|
| 176 |
+
- [Git LFS](https://git-lfs.github.com/)
|
| 177 |
+
|
| 178 |
+
## ✅ Başarı!
|
| 179 |
+
|
| 180 |
+
Space'iniz hazır! Şimdi:
|
| 181 |
+
|
| 182 |
+
1. 🌐 **Demo'yu paylaş**: Space URL'ini arkadaşlarınla paylaş
|
| 183 |
+
2. ⭐ **Community**: Discussion'lar aç, feedback al
|
| 184 |
+
3. 🔄 **Güncelle**: Düzenli olarak yeni örnekler ekle
|
| 185 |
+
4. 📊 **İstatistik**: Space usage'i takip et
|
| 186 |
+
|
| 187 |
+
---
|
| 188 |
+
|
| 189 |
+
**İyi şanslar! 🚀**
|
| 190 |
+
|
| 191 |
+
Sorularınız için: [@yourusername](https://huggingface.co/YOUR-USERNAME)
|
LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
MIT License
|
| 2 |
+
|
| 3 |
+
Copyright (c) 2024 Advanced Dataset Tutorial
|
| 4 |
+
|
| 5 |
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
| 6 |
+
of this software and associated documentation files (the "Software"), to deal
|
| 7 |
+
in the Software without restriction, including without limitation the rights
|
| 8 |
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
| 9 |
+
copies of the Software, and to permit persons to whom the Software is
|
| 10 |
+
furnished to do so, subject to the following conditions:
|
| 11 |
+
|
| 12 |
+
The above copyright notice and this permission notice shall be included in all
|
| 13 |
+
copies or substantial portions of the Software.
|
| 14 |
+
|
| 15 |
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
| 16 |
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
| 17 |
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
| 18 |
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
| 19 |
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
| 20 |
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
| 21 |
+
SOFTWARE.
|
README.md
CHANGED
|
@@ -1,12 +1,293 @@
|
|
| 1 |
---
|
| 2 |
-
title: Advanced Dataset Tutorial
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
sdk: gradio
|
| 7 |
-
sdk_version:
|
| 8 |
-
app_file: app.py
|
| 9 |
pinned: false
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 10 |
---
|
| 11 |
|
| 12 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
+
title: Advanced Dataset Tutorial - Hugging Face Datasets İleri Seviye
|
| 3 |
+
emoji: 📚
|
| 4 |
+
colorFrom: blue
|
| 5 |
+
colorTo: purple
|
| 6 |
sdk: gradio
|
| 7 |
+
sdk_version: 4.44.0
|
| 8 |
+
app_file: space/app.py
|
| 9 |
pinned: false
|
| 10 |
+
license: mit
|
| 11 |
+
tags:
|
| 12 |
+
- datasets
|
| 13 |
+
- tutorial
|
| 14 |
+
- nlp
|
| 15 |
+
- machine-learning
|
| 16 |
+
- data-processing
|
| 17 |
+
- Turkish
|
| 18 |
---
|
| 19 |
|
| 20 |
+
# 📚 Advanced Dataset Tutorial - Hugging Face Datasets İleri Seviye
|
| 21 |
+
|
| 22 |
+
Hugging Face Datasets kütüphanesi ile ileri seviye veri işleme teknikleri için kapsamlı Türkçe eğitim materyali.
|
| 23 |
+
|
| 24 |
+
## 🎯 Proje Hakkında
|
| 25 |
+
|
| 26 |
+
Bu proje, Hugging Face Datasets kütüphanesini profesyonel düzeyde kullanmak isteyenler için hazırlanmış kapsamlı bir eğitim serisidir. 4 ana modül ve 20+ pratik örnek içerir.
|
| 27 |
+
|
| 28 |
+
## 📖 Modüller
|
| 29 |
+
|
| 30 |
+
### 1️⃣ Büyük Ölçekli Datasets
|
| 31 |
+
- **Streaming ile büyük veri işleme** (750GB+ datasets)
|
| 32 |
+
- **Memory-efficient preprocessing**
|
| 33 |
+
- **Batch processing optimizasyonu** (2.3x hızlandırma)
|
| 34 |
+
- **Multi-process parallelization** (64x hızlandırma)
|
| 35 |
+
- **Cache yönetimi** (12.1x hızlandırma)
|
| 36 |
+
- **Dataset sharding ve distributed training**
|
| 37 |
+
|
| 38 |
+
**Performans Kazanımları:**
|
| 39 |
+
- ⚡ Batch processing: 2.3x daha hızlı
|
| 40 |
+
- 💾 Cache kullanımı: 12.1x daha hızlı
|
| 41 |
+
- 🚀 Multi-processing: 64x daha hızlı
|
| 42 |
+
- 📦 Generator pattern: Minimal RAM kullanımı
|
| 43 |
+
|
| 44 |
+
### 2️⃣ Domain-Specific Datasets
|
| 45 |
+
- **Bilimsel makaleler** (arXiv, PubMed style)
|
| 46 |
+
- **Kod datasets** (6 programlama dili)
|
| 47 |
+
- **Finansal analiz** (sentiment + market data)
|
| 48 |
+
- **Tıbbi/sağlık** (PHI anonymization)
|
| 49 |
+
- **Cross-domain integration** (3 çözüm yöntemi)
|
| 50 |
+
|
| 51 |
+
**Üretilen Datasets:**
|
| 52 |
+
- 🔬 2,000 bilimsel makale
|
| 53 |
+
- 💻 2,000 kod örneği
|
| 54 |
+
- 💰 2,000 finansal kayıt
|
| 55 |
+
- 🏥 2,000 tıbbi kayıt
|
| 56 |
+
|
| 57 |
+
### 3️⃣ İleri Teknikler
|
| 58 |
+
- **Custom Data Collators** (3 farklı tip)
|
| 59 |
+
- **Advanced Feature Extraction** (10+ feature)
|
| 60 |
+
- **Preprocessing Pipelines** (modular & reusable)
|
| 61 |
+
- **Data Augmentation** (3x veri artışı)
|
| 62 |
+
- **Stratified Sampling** (balanced splits)
|
| 63 |
+
- **Dynamic Batching** (40% padding azalması)
|
| 64 |
+
- **Active Learning integration**
|
| 65 |
+
|
| 66 |
+
**Teknikler:**
|
| 67 |
+
- 📦 Simple, Padding, Advanced Collators
|
| 68 |
+
- 🔧 Feature Engineering Pipeline
|
| 69 |
+
- 🎲 Smart Data Augmentation
|
| 70 |
+
- 📊 Diversity & Uncertainty Sampling
|
| 71 |
+
|
| 72 |
+
### 4️⃣ Özel Görevler İçin Datasets
|
| 73 |
+
- **Question Answering** (SQuAD-style)
|
| 74 |
+
- **Summarization** (CNN/DailyMail)
|
| 75 |
+
- **Named Entity Recognition** (BIO tagging)
|
| 76 |
+
- **Sentiment Analysis** (aspect-based)
|
| 77 |
+
- **Text Classification** (multi-class)
|
| 78 |
+
- **Multi-Task Learning**
|
| 79 |
+
|
| 80 |
+
**Task-Specific Datasets:**
|
| 81 |
+
- ❓ 200 QA pairs + 100 multiple choice
|
| 82 |
+
- 📝 100 summarization pairs
|
| 83 |
+
- 🏷️ 100 NER annotated sentences
|
| 84 |
+
- 😊 300 sentiment reviews
|
| 85 |
+
- 📊 200 topic classification
|
| 86 |
+
|
| 87 |
+
## 🚀 Hızlı Başlangıç
|
| 88 |
+
|
| 89 |
+
### Online Demo (Gradio)
|
| 90 |
+
```bash
|
| 91 |
+
# Space'i çalıştır
|
| 92 |
+
python space/app.py
|
| 93 |
+
```
|
| 94 |
+
|
| 95 |
+
### Manuel Kullanım
|
| 96 |
+
```python
|
| 97 |
+
from datasets import load_dataset
|
| 98 |
+
|
| 99 |
+
# Örnek: Büyük dataset streaming
|
| 100 |
+
dataset = load_dataset("tugrulkaya/advanced-dataset-tutorial")
|
| 101 |
+
```
|
| 102 |
+
|
| 103 |
+
## 💻 Kurulum
|
| 104 |
+
|
| 105 |
+
```bash
|
| 106 |
+
# Gerekli kütüphaneler
|
| 107 |
+
pip install datasets transformers numpy pandas
|
| 108 |
+
|
| 109 |
+
# Opsiyonel
|
| 110 |
+
pip install gradio # İnteraktif demo için
|
| 111 |
+
```
|
| 112 |
+
|
| 113 |
+
## 📂 Proje Yapısı
|
| 114 |
+
|
| 115 |
+
```
|
| 116 |
+
advanced-dataset-tutorial/
|
| 117 |
+
├── 📊 datasets/ # Örnek dataset'ler
|
| 118 |
+
│ ├── large_scale_example/ # Büyük ölçekli örnekler
|
| 119 |
+
│ ├── domain_specific_example/ # Domain-specific örnekler
|
| 120 |
+
│ ├── advanced_techniques_example/ # İleri teknik örnekleri
|
| 121 |
+
│ └── task_specific_example/ # Task-specific örnekler
|
| 122 |
+
│
|
| 123 |
+
├── 🌐 space/ # Gradio Space
|
| 124 |
+
│ ├── app.py # Ana uygulama
|
| 125 |
+
│ ├── modules/ # Tüm modül scriptleri
|
| 126 |
+
│ │ ├── 01_buyuk_olcekli_datasets_complete.py
|
| 127 |
+
│ │ ├── 02_domain_specific_datasets.py
|
| 128 |
+
│ │ ├── 02b_cross_domain_fix.py
|
| 129 |
+
│ │ ├── 03_ileri_teknikler_part1.py
|
| 130 |
+
│ │ ├── 03_ileri_teknikler_part2.py
|
| 131 |
+
│ │ └── 04_ozel_gorevler.py
|
| 132 |
+
│ └── README.md
|
| 133 |
+
│
|
| 134 |
+
└── README.md # Bu dosya
|
| 135 |
+
```
|
| 136 |
+
|
| 137 |
+
## 🎓 Öğrenme Yolu
|
| 138 |
+
|
| 139 |
+
### Başlangıç Seviyesi
|
| 140 |
+
1. ✅ Bölüm 1: Büyük Ölçekli Datasets
|
| 141 |
+
- Streaming basics
|
| 142 |
+
- Batch processing
|
| 143 |
+
- Memory management
|
| 144 |
+
|
| 145 |
+
### Orta Seviye
|
| 146 |
+
2. ✅ Bölüm 2: Domain-Specific Datasets
|
| 147 |
+
- Scientific data
|
| 148 |
+
- Code datasets
|
| 149 |
+
- Cross-domain integration
|
| 150 |
+
|
| 151 |
+
### İleri Seviye
|
| 152 |
+
3. ✅ Bölüm 3: İleri Teknikler
|
| 153 |
+
- Custom collators
|
| 154 |
+
- Pipeline patterns
|
| 155 |
+
- Advanced sampling
|
| 156 |
+
|
| 157 |
+
### Uzman Seviye
|
| 158 |
+
4. ✅ Bölüm 4: Özel Görevler
|
| 159 |
+
- Task-specific preprocessing
|
| 160 |
+
- Quality metrics
|
| 161 |
+
- Multi-task learning
|
| 162 |
+
|
| 163 |
+
## 📊 Performans Metrikleri
|
| 164 |
+
|
| 165 |
+
| Teknik | Performans Artışı | Kullanım Senaryosu |
|
| 166 |
+
|--------|-------------------|-------------------|
|
| 167 |
+
| Batch Processing | 2.3x daha hızlı | Tüm preprocessing |
|
| 168 |
+
| Cache Kullanımı | 12.1x daha hızlı | Tekrarlanan işlemler |
|
| 169 |
+
| Multi-Processing | 64x daha hızlı | CPU-intensive tasks |
|
| 170 |
+
| Dynamic Batching | 40% padding azalması | Training efficiency |
|
| 171 |
+
| Data Augmentation | 3x veri artışı | Class imbalance |
|
| 172 |
+
|
| 173 |
+
## 🔧 Best Practices
|
| 174 |
+
|
| 175 |
+
### Memory Efficiency
|
| 176 |
+
```python
|
| 177 |
+
# ✅ DOĞRU: Streaming ile büyük veri
|
| 178 |
+
dataset = load_dataset("huge_dataset", streaming=True)
|
| 179 |
+
|
| 180 |
+
# ❌ YANLIŞ: Tüm veriyi RAM'e yükleme
|
| 181 |
+
dataset = load_dataset("huge_dataset") # 100GB RAM!
|
| 182 |
+
```
|
| 183 |
+
|
| 184 |
+
### Batch Processing
|
| 185 |
+
```python
|
| 186 |
+
# ✅ DOĞRU: Batched operations
|
| 187 |
+
dataset.map(process_fn, batched=True, batch_size=1000)
|
| 188 |
+
|
| 189 |
+
# ❌ YANLIŞ: Tek tek işleme
|
| 190 |
+
dataset.map(process_fn, batched=False) # 10x-100x yavaş!
|
| 191 |
+
```
|
| 192 |
+
|
| 193 |
+
### Cross-Domain Integration
|
| 194 |
+
```python
|
| 195 |
+
# ✅ DOĞRU: Ortak schema'ya normalize et
|
| 196 |
+
def normalize(example, domain):
|
| 197 |
+
return {
|
| 198 |
+
'text': example.get('text') or example.get('content'),
|
| 199 |
+
'domain': domain,
|
| 200 |
+
'metadata': json.dumps(example.get('meta', {}))
|
| 201 |
+
}
|
| 202 |
+
|
| 203 |
+
# ❌ YANLIŞ: Farklı schema'ları direkt birleştirme
|
| 204 |
+
combined = concatenate_datasets([ds1, ds2]) # ArrowTypeError!
|
| 205 |
+
```
|
| 206 |
+
|
| 207 |
+
## 🎯 Kullanım Örnekleri
|
| 208 |
+
|
| 209 |
+
### 1. Büyük Dataset İşleme
|
| 210 |
+
```python
|
| 211 |
+
from datasets import load_dataset
|
| 212 |
+
|
| 213 |
+
# Streaming mode
|
| 214 |
+
dataset = load_dataset("c4", "en", split="train", streaming=True)
|
| 215 |
+
|
| 216 |
+
# İlk 1000 örneği işle
|
| 217 |
+
for i, example in enumerate(dataset.take(1000)):
|
| 218 |
+
process(example)
|
| 219 |
+
```
|
| 220 |
+
|
| 221 |
+
### 2. Custom Collator
|
| 222 |
+
```python
|
| 223 |
+
class CustomCollator:
|
| 224 |
+
def __call__(self, batch):
|
| 225 |
+
texts = [ex['text'] for ex in batch]
|
| 226 |
+
labels = [ex['label'] for ex in batch]
|
| 227 |
+
return {'texts': texts, 'labels': labels}
|
| 228 |
+
|
| 229 |
+
# DataLoader ile kullan
|
| 230 |
+
collator = CustomCollator()
|
| 231 |
+
dataloader = DataLoader(dataset, collate_fn=collator)
|
| 232 |
+
```
|
| 233 |
+
|
| 234 |
+
### 3. Data Augmentation
|
| 235 |
+
```python
|
| 236 |
+
def augment(example):
|
| 237 |
+
# Word deletion
|
| 238 |
+
words = example['text'].split()
|
| 239 |
+
augmented = ' '.join(random.sample(words, k=len(words)-2))
|
| 240 |
+
return {'text': augmented, 'label': example['label']}
|
| 241 |
+
|
| 242 |
+
augmented_dataset = dataset.map(augment)
|
| 243 |
+
```
|
| 244 |
+
|
| 245 |
+
## 📈 İstatistikler
|
| 246 |
+
|
| 247 |
+
- **Toplam Kod Satırı**: 5,000+
|
| 248 |
+
- **Örnek Sayısı**: 20,000+
|
| 249 |
+
- **Teknik Sayısı**: 50+
|
| 250 |
+
- **Best Practices**: 100+
|
| 251 |
+
|
| 252 |
+
## 🤝 Katkıda Bulunma
|
| 253 |
+
|
| 254 |
+
Bu proje açık kaynaklıdır ve katkılara açıktır!
|
| 255 |
+
|
| 256 |
+
1. Fork edin
|
| 257 |
+
2. Feature branch oluşturun (`git checkout -b feature/amazing`)
|
| 258 |
+
3. Commit edin (`git commit -m 'Add amazing feature'`)
|
| 259 |
+
4. Push edin (`git push origin feature/amazing`)
|
| 260 |
+
5. Pull Request açın
|
| 261 |
+
|
| 262 |
+
## 📝 Lisans
|
| 263 |
+
|
| 264 |
+
MIT License - detaylar için [LICENSE](LICENSE) dosyasına bakın.
|
| 265 |
+
|
| 266 |
+
## 👨💻 Yazar
|
| 267 |
+
|
| 268 |
+
Bu eğitim materyali, Hugging Face Datasets kullanıcıları için pratik ve uygulanabilir bilgi sağlamak amacıyla hazırlanmıştır.
|
| 269 |
+
|
| 270 |
+
## 🙏 Teşekkürler
|
| 271 |
+
|
| 272 |
+
- Hugging Face ekibine harika `datasets` kütüphanesi için
|
| 273 |
+
- Açık kaynak topluluğuna sürekli katkıları için
|
| 274 |
+
|
| 275 |
+
## 📚 Kaynaklar
|
| 276 |
+
|
| 277 |
+
- [Hugging Face Datasets Documentation](https://huggingface.co/docs/datasets)
|
| 278 |
+
- [Hugging Face Hub](https://huggingface.co/datasets)
|
| 279 |
+
- [Apache Arrow](https://arrow.apache.org/)
|
| 280 |
+
|
| 281 |
+
## 🔗 Bağlantılar
|
| 282 |
+
|
| 283 |
+
- 🌐 [Hugging Face Space](https://huggingface.co/spaces/tugrulkaya/advanced-dataset-tutorial)
|
| 284 |
+
- 📊 [Datasets](https://huggingface.co/datasets/tugrulkaya/advanced-dataset-tutorial)
|
| 285 |
+
- 💬 [Discussions](https://huggingface.co/spaces/tugrulkaya/advanced-dataset-tutorial/discussions)
|
| 286 |
+
|
| 287 |
+
---
|
| 288 |
+
|
| 289 |
+
**⭐ Beğendiyseniz yıldız vermeyi unutmayın!**
|
| 290 |
+
|
| 291 |
+
**🔄 Güncellemeler için takip edin!**
|
| 292 |
+
|
| 293 |
+
**💬 Sorularınız için Discussion açın!**
|
datasets/advanced_techniques_example/README.md
ADDED
|
@@ -0,0 +1,131 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# İleri Teknikler Örnekleri
|
| 2 |
+
|
| 3 |
+
Bu klasör, advanced dataset processing teknikleri içerir.
|
| 4 |
+
|
| 5 |
+
## Teknikler
|
| 6 |
+
|
| 7 |
+
### 📦 Custom Data Collators
|
| 8 |
+
|
| 9 |
+
#### 1. Simple Collator
|
| 10 |
+
```python
|
| 11 |
+
class SimpleCollator:
|
| 12 |
+
def __call__(self, batch):
|
| 13 |
+
texts = [ex['text'] for ex in batch]
|
| 14 |
+
labels = [ex['label'] for ex in batch]
|
| 15 |
+
return {'texts': texts, 'labels': labels}
|
| 16 |
+
```
|
| 17 |
+
|
| 18 |
+
#### 2. Padding Collator
|
| 19 |
+
```python
|
| 20 |
+
class PaddingCollator:
|
| 21 |
+
def __call__(self, batch):
|
| 22 |
+
# Dynamic padding
|
| 23 |
+
max_len = max(len(ex['text']) for ex in batch)
|
| 24 |
+
# Pad to max_len...
|
| 25 |
+
```
|
| 26 |
+
|
| 27 |
+
#### 3. Advanced Collator
|
| 28 |
+
```python
|
| 29 |
+
class AdvancedCollator:
|
| 30 |
+
def __call__(self, batch):
|
| 31 |
+
# Padding + normalization + stats
|
| 32 |
+
return {
|
| 33 |
+
'input_ids': padded,
|
| 34 |
+
'attention_mask': masks,
|
| 35 |
+
'labels': labels,
|
| 36 |
+
'batch_stats': {...}
|
| 37 |
+
}
|
| 38 |
+
```
|
| 39 |
+
|
| 40 |
+
### 🔧 Feature Engineering
|
| 41 |
+
- 10+ feature extraction
|
| 42 |
+
- Normalization (min-max, z-score)
|
| 43 |
+
- Interaction features
|
| 44 |
+
- Domain-specific features
|
| 45 |
+
|
| 46 |
+
### 🎲 Data Augmentation
|
| 47 |
+
- Word deletion (random)
|
| 48 |
+
- Word swap
|
| 49 |
+
- Synonym replacement
|
| 50 |
+
- Class balancing (3x veri artışı)
|
| 51 |
+
|
| 52 |
+
### 📊 Advanced Sampling
|
| 53 |
+
|
| 54 |
+
#### Stratified Sampling
|
| 55 |
+
```python
|
| 56 |
+
# Balanced train/test splits
|
| 57 |
+
train, test = stratified_split(
|
| 58 |
+
dataset,
|
| 59 |
+
stratify_column='label',
|
| 60 |
+
train_ratio=0.8
|
| 61 |
+
)
|
| 62 |
+
```
|
| 63 |
+
|
| 64 |
+
#### Diversity Sampling
|
| 65 |
+
```python
|
| 66 |
+
# Maximum diversity
|
| 67 |
+
diverse = max_diversity_sampling(
|
| 68 |
+
dataset,
|
| 69 |
+
n_samples=100,
|
| 70 |
+
feature_columns=['length', 'score']
|
| 71 |
+
)
|
| 72 |
+
```
|
| 73 |
+
|
| 74 |
+
#### Active Learning
|
| 75 |
+
```python
|
| 76 |
+
# Uncertainty-based
|
| 77 |
+
uncertain = uncertainty_sampling(
|
| 78 |
+
dataset,
|
| 79 |
+
uncertainty_scores,
|
| 80 |
+
n_samples=100
|
| 81 |
+
)
|
| 82 |
+
```
|
| 83 |
+
|
| 84 |
+
### 📦 Dynamic Batching
|
| 85 |
+
|
| 86 |
+
#### Length-Based
|
| 87 |
+
```python
|
| 88 |
+
# Benzer uzunlukları grupla
|
| 89 |
+
batches = length_based_batching(
|
| 90 |
+
dataset,
|
| 91 |
+
length_column='length'
|
| 92 |
+
)
|
| 93 |
+
# Result: 40% padding azalması
|
| 94 |
+
```
|
| 95 |
+
|
| 96 |
+
#### Bucket Batching
|
| 97 |
+
```python
|
| 98 |
+
# Bucket'lara ayır
|
| 99 |
+
batches = bucket_batching(
|
| 100 |
+
dataset,
|
| 101 |
+
n_buckets=5
|
| 102 |
+
)
|
| 103 |
+
```
|
| 104 |
+
|
| 105 |
+
## Pipeline Pattern
|
| 106 |
+
|
| 107 |
+
```python
|
| 108 |
+
pipeline = DataPipeline("My Pipeline")
|
| 109 |
+
pipeline.add_step("clean", clean_fn)
|
| 110 |
+
pipeline.add_step("features", extract_features)
|
| 111 |
+
pipeline.add_step("normalize", normalize_fn)
|
| 112 |
+
|
| 113 |
+
result = pipeline.run(dataset)
|
| 114 |
+
```
|
| 115 |
+
|
| 116 |
+
## Performans
|
| 117 |
+
|
| 118 |
+
| Teknik | Artış | Use Case |
|
| 119 |
+
|--------|-------|----------|
|
| 120 |
+
| Batch Processing | 2.3x | Tüm işlemler |
|
| 121 |
+
| Dynamic Batching | 40% | Padding azalması |
|
| 122 |
+
| Data Augmentation | 3x | Veri artışı |
|
| 123 |
+
| Stratified Sampling | - | Balanced splits |
|
| 124 |
+
|
| 125 |
+
## Best Practices
|
| 126 |
+
|
| 127 |
+
✅ Collator'ı modele göre özelleştir
|
| 128 |
+
✅ Pipeline pattern kullan
|
| 129 |
+
✅ Augmentation ile balance et
|
| 130 |
+
✅ Stratified sampling ile generalize et
|
| 131 |
+
✅ Dynamic batching ile optimize et
|
datasets/domain_specific_example/README.md
ADDED
|
@@ -0,0 +1,76 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Domain-Specific Datasets Örnekleri
|
| 2 |
+
|
| 3 |
+
Bu klasör, farklı domain'ler için özelleştirilmiş dataset örnekleri içerir.
|
| 4 |
+
|
| 5 |
+
## Domain'ler
|
| 6 |
+
|
| 7 |
+
### 🔬 Bilimsel Makaleler
|
| 8 |
+
- arXiv, PubMed style
|
| 9 |
+
- 2,000 örnek
|
| 10 |
+
- Citation tracking
|
| 11 |
+
- Abstract + full text
|
| 12 |
+
|
| 13 |
+
### 💻 Kod Datasets
|
| 14 |
+
- 6 programlama dili
|
| 15 |
+
- 2,000 kod örneği
|
| 16 |
+
- Syntax parsing
|
| 17 |
+
- Docstring extraction
|
| 18 |
+
|
| 19 |
+
### 💰 Finansal Veri
|
| 20 |
+
- Sentiment analysis
|
| 21 |
+
- Market data
|
| 22 |
+
- 2,000 kayıt
|
| 23 |
+
- Time series
|
| 24 |
+
|
| 25 |
+
### 🏥 Tıbbi Veri
|
| 26 |
+
- PHI anonymization
|
| 27 |
+
- HIPAA compliance
|
| 28 |
+
- 2,000 kayıt
|
| 29 |
+
- Clinical notes
|
| 30 |
+
|
| 31 |
+
## Cross-Domain Integration
|
| 32 |
+
|
| 33 |
+
### Problem: Schema Mismatch
|
| 34 |
+
```python
|
| 35 |
+
# ❌ Bu HATA verir
|
| 36 |
+
combined = concatenate_datasets([sci_ds, code_ds])
|
| 37 |
+
# ArrowTypeError: struct fields don't match
|
| 38 |
+
```
|
| 39 |
+
|
| 40 |
+
### Çözüm 1: Flatten Approach
|
| 41 |
+
```python
|
| 42 |
+
# ✅ Ortak schema
|
| 43 |
+
def normalize(ex, domain):
|
| 44 |
+
return {
|
| 45 |
+
'text': ex.get('text'),
|
| 46 |
+
'domain': domain,
|
| 47 |
+
'field1': ex.get('field1'),
|
| 48 |
+
'field2': ex.get('field2'),
|
| 49 |
+
# ... tüm field'lar
|
| 50 |
+
}
|
| 51 |
+
```
|
| 52 |
+
|
| 53 |
+
### Çözüm 2: JSON Metadata
|
| 54 |
+
```python
|
| 55 |
+
# ✅ Esnek yapı
|
| 56 |
+
def normalize(ex, domain):
|
| 57 |
+
return {
|
| 58 |
+
'text': ex.get('text'),
|
| 59 |
+
'domain': domain,
|
| 60 |
+
'metadata_json': json.dumps(ex.get('meta', {}))
|
| 61 |
+
}
|
| 62 |
+
```
|
| 63 |
+
|
| 64 |
+
### Çözüm 3: Separate Tables
|
| 65 |
+
```python
|
| 66 |
+
# ✅ Database-style
|
| 67 |
+
unified_table + metadata_tables
|
| 68 |
+
```
|
| 69 |
+
|
| 70 |
+
## Best Practices
|
| 71 |
+
|
| 72 |
+
✅ Domain expertise kullan
|
| 73 |
+
✅ Specialized tokenization
|
| 74 |
+
✅ Quality filtering
|
| 75 |
+
✅ Ethical guidelines
|
| 76 |
+
✅ Schema normalization
|
datasets/large_scale_example/README.md
ADDED
|
@@ -0,0 +1,56 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Büyük Ölçekli Datasets Örnekleri
|
| 2 |
+
|
| 3 |
+
Bu klasör, büyük ölçekli dataset işleme teknikleri için örnek kodlar içerir.
|
| 4 |
+
|
| 5 |
+
## Teknikler
|
| 6 |
+
|
| 7 |
+
### 1. Streaming
|
| 8 |
+
- 750GB+ veri işleme
|
| 9 |
+
- RAM kullanımı minimal
|
| 10 |
+
- Generator pattern
|
| 11 |
+
|
| 12 |
+
### 2. Batch Processing
|
| 13 |
+
- 2.3x hızlandırma
|
| 14 |
+
- Vectorized operations
|
| 15 |
+
- Optimal batch size: 32-1000
|
| 16 |
+
|
| 17 |
+
### 3. Multi-Processing
|
| 18 |
+
- 64x hızlandırma
|
| 19 |
+
- CPU parallelization
|
| 20 |
+
- num_proc optimization
|
| 21 |
+
|
| 22 |
+
### 4. Cache Yönetimi
|
| 23 |
+
- 12.1x hızlandırma
|
| 24 |
+
- Disk caching
|
| 25 |
+
- Arrow format
|
| 26 |
+
|
| 27 |
+
## Kullanım
|
| 28 |
+
|
| 29 |
+
```python
|
| 30 |
+
# Streaming örneği
|
| 31 |
+
from datasets import load_dataset
|
| 32 |
+
|
| 33 |
+
dataset = load_dataset(
|
| 34 |
+
"c4",
|
| 35 |
+
"en",
|
| 36 |
+
split="train",
|
| 37 |
+
streaming=True
|
| 38 |
+
)
|
| 39 |
+
|
| 40 |
+
for example in dataset.take(1000):
|
| 41 |
+
process(example)
|
| 42 |
+
```
|
| 43 |
+
|
| 44 |
+
## Performans Metrikleri
|
| 45 |
+
|
| 46 |
+
- Batch processing: **2.3x** daha hızlı
|
| 47 |
+
- Cache: **12.1x** daha hızlı
|
| 48 |
+
- Multi-processing: **64x** daha hızlı
|
| 49 |
+
|
| 50 |
+
## Best Practices
|
| 51 |
+
|
| 52 |
+
✅ Her zaman `batched=True` kullan
|
| 53 |
+
✅ Optimal batch_size seç (32-1000)
|
| 54 |
+
✅ `num_proc` ile paralelize et
|
| 55 |
+
✅ Cache stratejisi belirle
|
| 56 |
+
✅ Streaming ile büyük veri işle
|
datasets/task_specific_example/README.md
ADDED
|
@@ -0,0 +1,189 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Özel Görevler İçin Datasets
|
| 2 |
+
|
| 3 |
+
Bu klasör, specific NLP task'leri için dataset örnekleri içerir.
|
| 4 |
+
|
| 5 |
+
## Task'ler
|
| 6 |
+
|
| 7 |
+
### ❓ Question Answering
|
| 8 |
+
|
| 9 |
+
#### Extractive QA (SQuAD-style)
|
| 10 |
+
```python
|
| 11 |
+
{
|
| 12 |
+
'context': 'Paris is the capital of France...',
|
| 13 |
+
'question': 'What is the capital of France?',
|
| 14 |
+
'answers': {
|
| 15 |
+
'text': ['Paris'],
|
| 16 |
+
'answer_start': [0]
|
| 17 |
+
}
|
| 18 |
+
}
|
| 19 |
+
```
|
| 20 |
+
|
| 21 |
+
#### Multiple Choice QA
|
| 22 |
+
```python
|
| 23 |
+
{
|
| 24 |
+
'question': 'What is 2+2?',
|
| 25 |
+
'choices': ['3', '4', '5', '6'],
|
| 26 |
+
'answer': 1 # Index of correct answer
|
| 27 |
+
}
|
| 28 |
+
```
|
| 29 |
+
|
| 30 |
+
**Best Practices:**
|
| 31 |
+
- Validate answer spans
|
| 32 |
+
- Handle impossible questions
|
| 33 |
+
- Question type classification
|
| 34 |
+
- Context length management
|
| 35 |
+
|
| 36 |
+
### 📝 Summarization
|
| 37 |
+
|
| 38 |
+
#### News Summarization
|
| 39 |
+
```python
|
| 40 |
+
{
|
| 41 |
+
'article': 'Long news article...',
|
| 42 |
+
'summary': 'Brief summary...',
|
| 43 |
+
'compression_ratio': 0.24
|
| 44 |
+
}
|
| 45 |
+
```
|
| 46 |
+
|
| 47 |
+
**Metrics:**
|
| 48 |
+
- ROUGE scores
|
| 49 |
+
- Compression ratio (20-30% optimal)
|
| 50 |
+
- Abstractive vs Extractive
|
| 51 |
+
|
| 52 |
+
**Best Practices:**
|
| 53 |
+
- Multiple reference summaries
|
| 54 |
+
- Length constraints
|
| 55 |
+
- Quality validation
|
| 56 |
+
|
| 57 |
+
### 🏷️ Named Entity Recognition
|
| 58 |
+
|
| 59 |
+
#### BIO Tagging
|
| 60 |
+
```python
|
| 61 |
+
{
|
| 62 |
+
'tokens': ['John', 'Smith', 'works', 'at', 'Google'],
|
| 63 |
+
'ner_tags': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG']
|
| 64 |
+
}
|
| 65 |
+
```
|
| 66 |
+
|
| 67 |
+
**Tag Schema:**
|
| 68 |
+
- B-PER, I-PER (Person)
|
| 69 |
+
- B-ORG, I-ORG (Organization)
|
| 70 |
+
- B-LOC, I-LOC (Location)
|
| 71 |
+
- O (Outside)
|
| 72 |
+
|
| 73 |
+
**Best Practices:**
|
| 74 |
+
- Consistent tagging scheme
|
| 75 |
+
- Entity type taxonomy
|
| 76 |
+
- Nested entities handling
|
| 77 |
+
- Entity linking (optional)
|
| 78 |
+
|
| 79 |
+
### 😊 Sentiment Analysis
|
| 80 |
+
|
| 81 |
+
#### Binary/Multi-class
|
| 82 |
+
```python
|
| 83 |
+
{
|
| 84 |
+
'text': 'This product is amazing!',
|
| 85 |
+
'label': 2, # 0: neg, 1: neutral, 2: pos
|
| 86 |
+
'confidence': 0.95
|
| 87 |
+
}
|
| 88 |
+
```
|
| 89 |
+
|
| 90 |
+
#### Aspect-Based
|
| 91 |
+
```python
|
| 92 |
+
{
|
| 93 |
+
'text': 'Great product but slow delivery',
|
| 94 |
+
'aspect_sentiments': {
|
| 95 |
+
'product': 'positive',
|
| 96 |
+
'delivery': 'negative'
|
| 97 |
+
}
|
| 98 |
+
}
|
| 99 |
+
```
|
| 100 |
+
|
| 101 |
+
**Best Practices:**
|
| 102 |
+
- Multi-level granularity
|
| 103 |
+
- Confidence scores
|
| 104 |
+
- Domain-specific lexicons
|
| 105 |
+
- Emotion detection
|
| 106 |
+
|
| 107 |
+
### 📊 Text Classification
|
| 108 |
+
|
| 109 |
+
#### Topic Classification
|
| 110 |
+
```python
|
| 111 |
+
{
|
| 112 |
+
'text': 'Article text...',
|
| 113 |
+
'label': 'technology',
|
| 114 |
+
'label_id': 0
|
| 115 |
+
}
|
| 116 |
+
```
|
| 117 |
+
|
| 118 |
+
**Best Practices:**
|
| 119 |
+
- Balanced classes
|
| 120 |
+
- Hierarchical categories
|
| 121 |
+
- Multi-label support
|
| 122 |
+
- Class imbalance handling
|
| 123 |
+
|
| 124 |
+
### 🎯 Multi-Task Learning
|
| 125 |
+
|
| 126 |
+
#### Unified Format
|
| 127 |
+
```python
|
| 128 |
+
{
|
| 129 |
+
'text': 'Sample text...',
|
| 130 |
+
'sentiment': 'positive',
|
| 131 |
+
'topic': 'technology',
|
| 132 |
+
'quality_score': 0.85
|
| 133 |
+
}
|
| 134 |
+
```
|
| 135 |
+
|
| 136 |
+
**Best Practices:**
|
| 137 |
+
- Consistent preprocessing
|
| 138 |
+
- Task-specific heads
|
| 139 |
+
- Shared representations
|
| 140 |
+
- Task weighting
|
| 141 |
+
|
| 142 |
+
## Dataset Statistics
|
| 143 |
+
|
| 144 |
+
| Task | Örnekler | Format |
|
| 145 |
+
|------|----------|--------|
|
| 146 |
+
| QA | 300 | Extractive + MC |
|
| 147 |
+
| Summarization | 100 | News articles |
|
| 148 |
+
| NER | 100 | BIO tagged |
|
| 149 |
+
| Sentiment | 350 | Multi-class + Aspect |
|
| 150 |
+
| Classification | 200 | Topic |
|
| 151 |
+
| Multi-Task | 100 | Unified |
|
| 152 |
+
|
| 153 |
+
## Quality Metrics
|
| 154 |
+
|
| 155 |
+
### QA
|
| 156 |
+
- Exact Match (EM)
|
| 157 |
+
- F1 Score
|
| 158 |
+
- Answer span accuracy
|
| 159 |
+
|
| 160 |
+
### Summarization
|
| 161 |
+
- ROUGE-1, ROUGE-2, ROUGE-L
|
| 162 |
+
- Compression ratio
|
| 163 |
+
- Factual consistency
|
| 164 |
+
|
| 165 |
+
### NER
|
| 166 |
+
- Precision, Recall, F1 per entity type
|
| 167 |
+
- Exact match
|
| 168 |
+
- Partial match
|
| 169 |
+
|
| 170 |
+
### Sentiment
|
| 171 |
+
- Accuracy
|
| 172 |
+
- Macro/Micro F1
|
| 173 |
+
- Confusion matrix
|
| 174 |
+
|
| 175 |
+
### Classification
|
| 176 |
+
- Accuracy
|
| 177 |
+
- Per-class F1
|
| 178 |
+
- Macro/Weighted F1
|
| 179 |
+
|
| 180 |
+
## Best Practices (Genel)
|
| 181 |
+
|
| 182 |
+
✅ Clear annotation guidelines
|
| 183 |
+
✅ Inter-annotator agreement
|
| 184 |
+
✅ Quality control checks
|
| 185 |
+
✅ Regular dataset updates
|
| 186 |
+
✅ Version control
|
| 187 |
+
✅ Documentation
|
| 188 |
+
✅ Ethical considerations
|
| 189 |
+
✅ Bias analysis
|
requirements.txt
ADDED
|
@@ -0,0 +1,5 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
datasets>=2.14.0
|
| 2 |
+
transformers>=4.30.0
|
| 3 |
+
gradio>=4.44.0
|
| 4 |
+
numpy>=1.24.0
|
| 5 |
+
pandas>=2.0.0
|
space/app.py
ADDED
|
@@ -0,0 +1,493 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Advanced Dataset Tutorial - Interactive Gradio Demo
|
| 3 |
+
===================================================
|
| 4 |
+
|
| 5 |
+
Hugging Face Datasets ile ileri seviye teknikler için interaktif demo
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
import gradio as gr
|
| 9 |
+
import sys
|
| 10 |
+
import os
|
| 11 |
+
from pathlib import Path
|
| 12 |
+
|
| 13 |
+
# Modülleri import edebilmek için path ekle
|
| 14 |
+
sys.path.append(str(Path(__file__).parent / "modules"))
|
| 15 |
+
|
| 16 |
+
# Demo için basit örnekler
|
| 17 |
+
DEMO_CODES = {
|
| 18 |
+
"Büyük Ölçekli - Streaming": """
|
| 19 |
+
from datasets import load_dataset
|
| 20 |
+
|
| 21 |
+
# Streaming mode - RAM'i patlatmadan büyük veri
|
| 22 |
+
dataset = load_dataset(
|
| 23 |
+
"c4",
|
| 24 |
+
"en",
|
| 25 |
+
split="train",
|
| 26 |
+
streaming=True # ✨ Anahtar parametre
|
| 27 |
+
)
|
| 28 |
+
|
| 29 |
+
# İlk 1000 örneği işle
|
| 30 |
+
for i, example in enumerate(dataset.take(1000)):
|
| 31 |
+
print(f"Example {i}: {example['text'][:100]}...")
|
| 32 |
+
""",
|
| 33 |
+
|
| 34 |
+
"Büyük Ölçekli - Batch Processing": """
|
| 35 |
+
from datasets import load_dataset
|
| 36 |
+
|
| 37 |
+
dataset = load_dataset("imdb", split="train")
|
| 38 |
+
|
| 39 |
+
# ❌ YAVAŞ: Tek tek işleme
|
| 40 |
+
def process_single(example):
|
| 41 |
+
return {'length': len(example['text'])}
|
| 42 |
+
|
| 43 |
+
slow = dataset.map(process_single)
|
| 44 |
+
|
| 45 |
+
# ✅ HIZLI: Batch processing
|
| 46 |
+
def process_batch(examples):
|
| 47 |
+
return {'length': [len(t) for t in examples['text']]}
|
| 48 |
+
|
| 49 |
+
fast = dataset.map(
|
| 50 |
+
process_batch,
|
| 51 |
+
batched=True, # 🚀 10x-100x daha hızlı!
|
| 52 |
+
batch_size=1000
|
| 53 |
+
)
|
| 54 |
+
""",
|
| 55 |
+
|
| 56 |
+
"Domain-Specific - Cross-Domain Fix": """
|
| 57 |
+
from datasets import Dataset, concatenate_datasets
|
| 58 |
+
import json
|
| 59 |
+
|
| 60 |
+
# ❌ PROBLEM: Farklı schema'lar
|
| 61 |
+
sci_data = Dataset.from_dict({
|
| 62 |
+
'text': ['Scientific paper...'],
|
| 63 |
+
'metadata': {'year': 2024, 'citations': 10}
|
| 64 |
+
})
|
| 65 |
+
|
| 66 |
+
code_data = Dataset.from_dict({
|
| 67 |
+
'code': ['def hello(): pass'],
|
| 68 |
+
'language': 'Python'
|
| 69 |
+
})
|
| 70 |
+
|
| 71 |
+
# Bu HATA verir! ArrowTypeError
|
| 72 |
+
# combined = concatenate_datasets([sci_data, code_data])
|
| 73 |
+
|
| 74 |
+
# ✅ ÇÖZÜM: JSON metadata approach
|
| 75 |
+
def normalize_to_json(example, domain):
|
| 76 |
+
return {
|
| 77 |
+
'text': example.get('text') or example.get('code'),
|
| 78 |
+
'domain': domain,
|
| 79 |
+
'metadata_json': json.dumps(example.get('metadata', {}))
|
| 80 |
+
}
|
| 81 |
+
|
| 82 |
+
sci_norm = sci_data.map(lambda x: normalize_to_json(x, 'scientific'))
|
| 83 |
+
code_norm = code_data.map(lambda x: normalize_to_json(x, 'code'))
|
| 84 |
+
|
| 85 |
+
# Şimdi ÇALIŞIR! ✅
|
| 86 |
+
combined = concatenate_datasets([sci_norm, code_norm])
|
| 87 |
+
""",
|
| 88 |
+
|
| 89 |
+
"İleri Teknikler - Custom Collator": """
|
| 90 |
+
from datasets import Dataset
|
| 91 |
+
|
| 92 |
+
class AdvancedCollator:
|
| 93 |
+
def __init__(self, max_length=128, pad_token='[PAD]'):
|
| 94 |
+
self.max_length = max_length
|
| 95 |
+
self.pad_token = pad_token
|
| 96 |
+
|
| 97 |
+
def __call__(self, batch):
|
| 98 |
+
# Tokenize (basit örnek)
|
| 99 |
+
tokenized = [ex['text'].split()[:self.max_length]
|
| 100 |
+
for ex in batch]
|
| 101 |
+
|
| 102 |
+
# Dynamic padding - batch içindeki max length'e göre
|
| 103 |
+
max_len = max(len(tokens) for tokens in tokenized)
|
| 104 |
+
|
| 105 |
+
padded = []
|
| 106 |
+
masks = []
|
| 107 |
+
for tokens in tokenized:
|
| 108 |
+
pad_len = max_len - len(tokens)
|
| 109 |
+
padded.append(tokens + [self.pad_token] * pad_len)
|
| 110 |
+
masks.append([1] * len(tokens) + [0] * pad_len)
|
| 111 |
+
|
| 112 |
+
return {
|
| 113 |
+
'input_tokens': padded,
|
| 114 |
+
'attention_mask': masks,
|
| 115 |
+
'labels': [ex['label'] for ex in batch]
|
| 116 |
+
}
|
| 117 |
+
|
| 118 |
+
# Kullanım
|
| 119 |
+
collator = AdvancedCollator()
|
| 120 |
+
batch = [
|
| 121 |
+
{'text': 'Short text', 'label': 0},
|
| 122 |
+
{'text': 'Much longer text here', 'label': 1}
|
| 123 |
+
]
|
| 124 |
+
collated = collator(batch)
|
| 125 |
+
""",
|
| 126 |
+
|
| 127 |
+
"İleri Teknikler - Data Augmentation": """
|
| 128 |
+
from datasets import Dataset
|
| 129 |
+
import random
|
| 130 |
+
|
| 131 |
+
class DataAugmenter:
|
| 132 |
+
def augment(self, text):
|
| 133 |
+
words = text.split()
|
| 134 |
+
|
| 135 |
+
# Random word deletion
|
| 136 |
+
if random.random() < 0.3:
|
| 137 |
+
words = [w for w in words if random.random() > 0.1]
|
| 138 |
+
|
| 139 |
+
# Random word swap
|
| 140 |
+
if len(words) > 1 and random.random() < 0.3:
|
| 141 |
+
i, j = random.sample(range(len(words)), 2)
|
| 142 |
+
words[i], words[j] = words[j], words[i]
|
| 143 |
+
|
| 144 |
+
return ' '.join(words) if words else text
|
| 145 |
+
|
| 146 |
+
def augment_dataset(self, dataset, n_augmentations=2):
|
| 147 |
+
augmented = []
|
| 148 |
+
|
| 149 |
+
for example in dataset:
|
| 150 |
+
# Original
|
| 151 |
+
augmented.append({
|
| 152 |
+
**example,
|
| 153 |
+
'is_augmented': False
|
| 154 |
+
})
|
| 155 |
+
|
| 156 |
+
# Augmented versions
|
| 157 |
+
for _ in range(n_augmentations):
|
| 158 |
+
augmented.append({
|
| 159 |
+
**example,
|
| 160 |
+
'text': self.augment(example['text']),
|
| 161 |
+
'is_augmented': True
|
| 162 |
+
})
|
| 163 |
+
|
| 164 |
+
return Dataset.from_list(augmented)
|
| 165 |
+
|
| 166 |
+
# Kullanım: 1 örnek → 3 örnek (1 original + 2 augmented)
|
| 167 |
+
augmenter = DataAugmenter()
|
| 168 |
+
original = Dataset.from_dict({'text': ['Hello world'], 'label': [0]})
|
| 169 |
+
augmented = augmenter.augment_dataset(original, n_augmentations=2)
|
| 170 |
+
print(f"Dataset boyutu: {len(original)} → {len(augmented)}")
|
| 171 |
+
""",
|
| 172 |
+
|
| 173 |
+
"Özel Görevler - Question Answering": """
|
| 174 |
+
from datasets import Dataset
|
| 175 |
+
|
| 176 |
+
# SQuAD-style QA dataset
|
| 177 |
+
qa_dataset = Dataset.from_dict({
|
| 178 |
+
'context': [
|
| 179 |
+
'The Eiffel Tower is in Paris. It was built in 1889.'
|
| 180 |
+
],
|
| 181 |
+
'question': [
|
| 182 |
+
'Where is the Eiffel Tower?'
|
| 183 |
+
],
|
| 184 |
+
'answers': [{
|
| 185 |
+
'text': ['Paris'],
|
| 186 |
+
'answer_start': [23] # Character position
|
| 187 |
+
}]
|
| 188 |
+
})
|
| 189 |
+
|
| 190 |
+
# Preprocessing
|
| 191 |
+
def preprocess_qa(example):
|
| 192 |
+
# Answer'ı validate et
|
| 193 |
+
context = example['context']
|
| 194 |
+
answer = example['answers']['text'][0]
|
| 195 |
+
start = example['answers']['answer_start'][0]
|
| 196 |
+
|
| 197 |
+
# Extract ve kontrol et
|
| 198 |
+
extracted = context[start:start + len(answer)]
|
| 199 |
+
is_valid = extracted == answer
|
| 200 |
+
|
| 201 |
+
return {
|
| 202 |
+
**example,
|
| 203 |
+
'is_valid': is_valid,
|
| 204 |
+
'question_type': example['question'].split()[0].lower()
|
| 205 |
+
}
|
| 206 |
+
|
| 207 |
+
qa_processed = qa_dataset.map(preprocess_qa)
|
| 208 |
+
""",
|
| 209 |
+
|
| 210 |
+
"Özel Görevler - NER": """
|
| 211 |
+
from datasets import Dataset
|
| 212 |
+
|
| 213 |
+
# Named Entity Recognition (BIO tagging)
|
| 214 |
+
ner_dataset = Dataset.from_dict({
|
| 215 |
+
'tokens': [
|
| 216 |
+
['John', 'Smith', 'works', 'at', 'Google']
|
| 217 |
+
],
|
| 218 |
+
'ner_tags': [
|
| 219 |
+
['B-PER', 'I-PER', 'O', 'O', 'B-ORG']
|
| 220 |
+
]
|
| 221 |
+
})
|
| 222 |
+
|
| 223 |
+
# Tag to ID mapping
|
| 224 |
+
tag2id = {
|
| 225 |
+
'O': 0,
|
| 226 |
+
'B-PER': 1, 'I-PER': 2,
|
| 227 |
+
'B-ORG': 3, 'I-ORG': 4,
|
| 228 |
+
'B-LOC': 5, 'I-LOC': 6
|
| 229 |
+
}
|
| 230 |
+
|
| 231 |
+
# Convert tags to IDs
|
| 232 |
+
def convert_tags(example):
|
| 233 |
+
return {
|
| 234 |
+
**example,
|
| 235 |
+
'ner_tag_ids': [tag2id[tag] for tag in example['ner_tags']],
|
| 236 |
+
'sentence': ' '.join(example['tokens'])
|
| 237 |
+
}
|
| 238 |
+
|
| 239 |
+
ner_processed = ner_dataset.map(convert_tags)
|
| 240 |
+
|
| 241 |
+
# Entity statistics
|
| 242 |
+
def count_entities(dataset):
|
| 243 |
+
entity_types = {}
|
| 244 |
+
for ex in dataset:
|
| 245 |
+
for tag in ex['ner_tags']:
|
| 246 |
+
if tag.startswith('B-'):
|
| 247 |
+
entity_type = tag.split('-')[1]
|
| 248 |
+
entity_types[entity_type] = entity_types.get(entity_type, 0) + 1
|
| 249 |
+
return entity_types
|
| 250 |
+
|
| 251 |
+
print(count_entities(ner_processed))
|
| 252 |
+
""",
|
| 253 |
+
|
| 254 |
+
"Özel Görevler - Sentiment Analysis": """
|
| 255 |
+
from datasets import Dataset
|
| 256 |
+
|
| 257 |
+
# Sentiment classification dataset
|
| 258 |
+
sentiment_dataset = Dataset.from_dict({
|
| 259 |
+
'text': [
|
| 260 |
+
'This product is amazing!',
|
| 261 |
+
'Terrible, waste of money.',
|
| 262 |
+
'It\\'s okay, nothing special.'
|
| 263 |
+
],
|
| 264 |
+
'label': [2, 0, 1], # 0: negative, 1: neutral, 2: positive
|
| 265 |
+
'label_text': ['positive', 'negative', 'neutral']
|
| 266 |
+
})
|
| 267 |
+
|
| 268 |
+
# Feature extraction
|
| 269 |
+
def extract_sentiment_features(example):
|
| 270 |
+
text = example['text'].lower()
|
| 271 |
+
|
| 272 |
+
positive_words = ['amazing', 'great', 'excellent', 'love']
|
| 273 |
+
negative_words = ['terrible', 'waste', 'bad', 'poor']
|
| 274 |
+
|
| 275 |
+
pos_count = sum(1 for word in positive_words if word in text)
|
| 276 |
+
neg_count = sum(1 for word in negative_words if word in text)
|
| 277 |
+
|
| 278 |
+
return {
|
| 279 |
+
**example,
|
| 280 |
+
'positive_words': pos_count,
|
| 281 |
+
'negative_words': neg_count,
|
| 282 |
+
'sentiment_score': pos_count - neg_count,
|
| 283 |
+
'has_exclamation': '!' in example['text']
|
| 284 |
+
}
|
| 285 |
+
|
| 286 |
+
sentiment_featured = sentiment_dataset.map(extract_sentiment_features)
|
| 287 |
+
|
| 288 |
+
# Class balancing with augmentation
|
| 289 |
+
def balance_classes(dataset, target_per_class=100):
|
| 290 |
+
from collections import defaultdict
|
| 291 |
+
|
| 292 |
+
# Group by label
|
| 293 |
+
by_label = defaultdict(list)
|
| 294 |
+
for ex in dataset:
|
| 295 |
+
by_label[ex['label']].append(ex)
|
| 296 |
+
|
| 297 |
+
# Augment minority classes
|
| 298 |
+
balanced = []
|
| 299 |
+
for label, examples in by_label.items():
|
| 300 |
+
balanced.extend(examples)
|
| 301 |
+
|
| 302 |
+
# Add augmented copies if needed
|
| 303 |
+
while len([e for e in balanced if e['label'] == label]) < target_per_class:
|
| 304 |
+
# Simple augmentation: copy with modified text
|
| 305 |
+
ex = examples[len(balanced) % len(examples)]
|
| 306 |
+
balanced.append({
|
| 307 |
+
**ex,
|
| 308 |
+
'is_augmented': True
|
| 309 |
+
})
|
| 310 |
+
|
| 311 |
+
return Dataset.from_list(balanced)
|
| 312 |
+
"""
|
| 313 |
+
}
|
| 314 |
+
|
| 315 |
+
BEST_PRACTICES = """
|
| 316 |
+
# 🎯 Best Practices Özeti
|
| 317 |
+
|
| 318 |
+
## Memory Efficiency
|
| 319 |
+
```python
|
| 320 |
+
# ✅ DOĞRU: Streaming
|
| 321 |
+
dataset = load_dataset("huge_data", streaming=True)
|
| 322 |
+
|
| 323 |
+
# ❌ YANLIŞ: Tüm veriyi RAM'e yükleme
|
| 324 |
+
dataset = load_dataset("huge_data") # 100GB RAM!
|
| 325 |
+
```
|
| 326 |
+
|
| 327 |
+
## Batch Processing
|
| 328 |
+
```python
|
| 329 |
+
# ✅ DOĞRU: Batched=True
|
| 330 |
+
dataset.map(fn, batched=True, batch_size=1000)
|
| 331 |
+
|
| 332 |
+
# ❌ YANLIŞ: Tek tek
|
| 333 |
+
dataset.map(fn) # 10x-100x yavaş!
|
| 334 |
+
```
|
| 335 |
+
|
| 336 |
+
## Cross-Domain
|
| 337 |
+
```python
|
| 338 |
+
# ✅ DOĞRU: Normalize et
|
| 339 |
+
def normalize(ex, domain):
|
| 340 |
+
return {'text': ex.get('text'), 'domain': domain}
|
| 341 |
+
|
| 342 |
+
# ❌ YANLIŞ: Direkt birleştir
|
| 343 |
+
concatenate_datasets([ds1, ds2]) # Error!
|
| 344 |
+
```
|
| 345 |
+
|
| 346 |
+
## Performans
|
| 347 |
+
- **Streaming**: RAM tasarrufu
|
| 348 |
+
- **Batched**: 10x-100x hız
|
| 349 |
+
- **num_proc**: CPU parallelization
|
| 350 |
+
- **Cache**: Tekrar kullanım
|
| 351 |
+
"""
|
| 352 |
+
|
| 353 |
+
def show_code(module_name):
|
| 354 |
+
"""Seçilen modül için kod göster"""
|
| 355 |
+
return DEMO_CODES.get(module_name, "Kod örneği yükleniyor...")
|
| 356 |
+
|
| 357 |
+
def show_best_practices():
|
| 358 |
+
"""Best practices göster"""
|
| 359 |
+
return BEST_PRACTICES
|
| 360 |
+
|
| 361 |
+
# Gradio Interface
|
| 362 |
+
with gr.Blocks(title="Advanced Dataset Tutorial", theme=gr.themes.Soft()) as demo:
|
| 363 |
+
gr.Markdown("""
|
| 364 |
+
# 📚 Advanced Dataset Tutorial
|
| 365 |
+
## Hugging Face Datasets - İleri Seviye Türkçe Eğitim
|
| 366 |
+
|
| 367 |
+
Bu interaktif demo, 4 modül ve 20+ teknik içeren kapsamlı dataset eğitiminin özetini sunar.
|
| 368 |
+
""")
|
| 369 |
+
|
| 370 |
+
with gr.Tabs():
|
| 371 |
+
with gr.Tab("🚀 Kod Örnekleri"):
|
| 372 |
+
gr.Markdown("### Her modülden pratik kod örnekleri")
|
| 373 |
+
|
| 374 |
+
module_dropdown = gr.Dropdown(
|
| 375 |
+
choices=list(DEMO_CODES.keys()),
|
| 376 |
+
label="Modül Seçin",
|
| 377 |
+
value=list(DEMO_CODES.keys())[0]
|
| 378 |
+
)
|
| 379 |
+
|
| 380 |
+
code_output = gr.Code(
|
| 381 |
+
label="Kod Örneği",
|
| 382 |
+
language="python",
|
| 383 |
+
value=DEMO_CODES[list(DEMO_CODES.keys())[0]]
|
| 384 |
+
)
|
| 385 |
+
|
| 386 |
+
module_dropdown.change(
|
| 387 |
+
fn=show_code,
|
| 388 |
+
inputs=[module_dropdown],
|
| 389 |
+
outputs=[code_output]
|
| 390 |
+
)
|
| 391 |
+
|
| 392 |
+
with gr.Tab("📖 Modüller"):
|
| 393 |
+
gr.Markdown("""
|
| 394 |
+
## 4 Ana Modül
|
| 395 |
+
|
| 396 |
+
### 1️⃣ Büyük Ölçekli Datasets
|
| 397 |
+
- ⚡ Streaming (750GB+ data)
|
| 398 |
+
- 💾 Batch processing (2.3x hızlı)
|
| 399 |
+
- 🚀 Multi-processing (64x hızlı)
|
| 400 |
+
- 📦 Cache (12.1x hızlı)
|
| 401 |
+
|
| 402 |
+
### 2️⃣ Domain-Specific Datasets
|
| 403 |
+
- 🔬 Bilimsel makaleler (2,000 örnek)
|
| 404 |
+
- 💻 Kod datasets (6 dil, 2,000 örnek)
|
| 405 |
+
- 💰 Finansal veri (2,000 kayıt)
|
| 406 |
+
- 🏥 Tıbbi veri (PHI anonymization)
|
| 407 |
+
|
| 408 |
+
### 3️⃣ İleri Teknikler
|
| 409 |
+
- 📦 Custom Collators (3 tip)
|
| 410 |
+
- 🔧 Feature Engineering (10+ feature)
|
| 411 |
+
- 🎲 Data Augmentation (3x veri)
|
| 412 |
+
- 📊 Advanced Sampling (diversity, stratified)
|
| 413 |
+
|
| 414 |
+
### 4️⃣ Özel Görevler
|
| 415 |
+
- ❓ Question Answering (SQuAD)
|
| 416 |
+
- 📝 Summarization (ROUGE)
|
| 417 |
+
- 🏷️ NER (BIO tagging)
|
| 418 |
+
- 😊 Sentiment Analysis
|
| 419 |
+
- 📊 Multi-Task Learning
|
| 420 |
+
""")
|
| 421 |
+
|
| 422 |
+
with gr.Tab("🎯 Best Practices"):
|
| 423 |
+
gr.Code(
|
| 424 |
+
value=BEST_PRACTICES,
|
| 425 |
+
label="Best Practices",
|
| 426 |
+
language="python"
|
| 427 |
+
)
|
| 428 |
+
|
| 429 |
+
with gr.Tab("📊 Performans"):
|
| 430 |
+
gr.Markdown("""
|
| 431 |
+
## Performans Metrikleri
|
| 432 |
+
|
| 433 |
+
| Teknik | Artış | Kullanım |
|
| 434 |
+
|--------|-------|----------|
|
| 435 |
+
| **Batch Processing** | 2.3x | Tüm preprocessing |
|
| 436 |
+
| **Cache** | 12.1x | Tekrar işlemler |
|
| 437 |
+
| **Multi-Processing** | 64x | CPU tasks |
|
| 438 |
+
| **Dynamic Batching** | 40% | Padding azalması |
|
| 439 |
+
| **Data Augmentation** | 3x | Veri artışı |
|
| 440 |
+
|
| 441 |
+
## İstatistikler
|
| 442 |
+
|
| 443 |
+
- 📝 **5,000+** kod satırı
|
| 444 |
+
- 🔢 **20,000+** örnek dataset
|
| 445 |
+
- 🛠️ **50+** teknik
|
| 446 |
+
- ✅ **100+** best practice
|
| 447 |
+
|
| 448 |
+
## Kazanımlar
|
| 449 |
+
|
| 450 |
+
✅ Büyük ölçekli veri işleme
|
| 451 |
+
✅ Domain-specific preprocessing
|
| 452 |
+
✅ Production-ready pipelines
|
| 453 |
+
✅ Task-specific optimization
|
| 454 |
+
✅ Multi-task learning
|
| 455 |
+
""")
|
| 456 |
+
|
| 457 |
+
with gr.Tab("ℹ️ Hakkında"):
|
| 458 |
+
gr.Markdown("""
|
| 459 |
+
## Proje Bilgileri
|
| 460 |
+
|
| 461 |
+
**Amaç:** Hugging Face Datasets kütüphanesini profesyonel düzeyde kullanmak isteyenler için kapsamlı Türkçe kaynak
|
| 462 |
+
|
| 463 |
+
**İçerik:**
|
| 464 |
+
- 4 ana modül
|
| 465 |
+
- 20+ pratik örnek
|
| 466 |
+
- 50+ teknik
|
| 467 |
+
- 100+ best practice
|
| 468 |
+
|
| 469 |
+
**Hedef Kitle:**
|
| 470 |
+
- NLP mühendisleri
|
| 471 |
+
- ML researchers
|
| 472 |
+
- Data scientists
|
| 473 |
+
- AI developers
|
| 474 |
+
|
| 475 |
+
**Lisans:** MIT
|
| 476 |
+
|
| 477 |
+
**Kaynaklar:**
|
| 478 |
+
- [Hugging Face Datasets Docs](https://huggingface.co/docs/datasets)
|
| 479 |
+
- [GitHub Repository](https://github.com/yourusername/advanced-dataset-tutorial)
|
| 480 |
+
- [Hugging Face Hub](https://huggingface.co/datasets)
|
| 481 |
+
|
| 482 |
+
---
|
| 483 |
+
|
| 484 |
+
⭐ **Beğendiyseniz yıldız vermeyi unutmayın!**
|
| 485 |
+
""")
|
| 486 |
+
|
| 487 |
+
gr.Markdown("""
|
| 488 |
+
---
|
| 489 |
+
💡 **Not:** Bu demo, tam eğitim materyalinin özeti içindir. Detaylı örnekler ve açıklamalar için modül scriptlerine bakın.
|
| 490 |
+
""")
|
| 491 |
+
|
| 492 |
+
if __name__ == "__main__":
|
| 493 |
+
demo.launch()
|
space/modules/01_buyuk_olcekli_datasets_complete.py
ADDED
|
@@ -0,0 +1,617 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
BÜYÜK ÖLÇEKLİ DATASETS - İLERİ SEVİYE HUGGING FACE
|
| 3 |
+
=================================================
|
| 4 |
+
Network bağımsız versiyon - Sentetik ve lokal örneklerle
|
| 5 |
+
|
| 6 |
+
Bu modülde öğrenecekleriniz:
|
| 7 |
+
1. Streaming simülasyonu ve büyük veri prensipleri
|
| 8 |
+
2. Dataset sharding ve chunk'lama
|
| 9 |
+
3. Memory-efficient preprocessing
|
| 10 |
+
4. Batch processing optimizasyonu
|
| 11 |
+
5. Cache yönetimi
|
| 12 |
+
"""
|
| 13 |
+
|
| 14 |
+
from datasets import Dataset, DatasetDict, IterableDataset
|
| 15 |
+
from datasets import concatenate_datasets
|
| 16 |
+
import numpy as np
|
| 17 |
+
from typing import Iterator, Dict, List
|
| 18 |
+
import time
|
| 19 |
+
from functools import partial
|
| 20 |
+
import sys
|
| 21 |
+
|
| 22 |
+
print("="*60)
|
| 23 |
+
print("1. STREAMING DATASET SİMÜLASYONU")
|
| 24 |
+
print("="*60)
|
| 25 |
+
|
| 26 |
+
# Büyük bir dataset simülasyonu
|
| 27 |
+
def generate_large_dataset(num_samples=100000):
|
| 28 |
+
"""
|
| 29 |
+
Büyük dataset simülasyonu - Generator pattern kullanarak
|
| 30 |
+
Bu, gerçek streaming dataset'lerin çalışma prensibidir
|
| 31 |
+
"""
|
| 32 |
+
def gen():
|
| 33 |
+
for i in range(num_samples):
|
| 34 |
+
yield {
|
| 35 |
+
"id": i,
|
| 36 |
+
"text": f"Bu {i}. örnek metindir. " * np.random.randint(10, 100),
|
| 37 |
+
"label": np.random.randint(0, 5),
|
| 38 |
+
"metadata": {
|
| 39 |
+
"source": f"source_{i % 10}",
|
| 40 |
+
"timestamp": i * 1000
|
| 41 |
+
}
|
| 42 |
+
}
|
| 43 |
+
|
| 44 |
+
return gen
|
| 45 |
+
|
| 46 |
+
# Iterable Dataset oluştur (streaming gibi çalışır)
|
| 47 |
+
print("\n📚 Büyük Dataset (100K örnekli) - Generator Pattern")
|
| 48 |
+
print("Normal yükleme = Tüm veri RAM'de")
|
| 49 |
+
print("Streaming/Generator = Sadece işlenen kısım RAM'de\n")
|
| 50 |
+
|
| 51 |
+
streaming_dataset = Dataset.from_generator(
|
| 52 |
+
generate_large_dataset(10000),
|
| 53 |
+
cache_dir=None
|
| 54 |
+
)
|
| 55 |
+
|
| 56 |
+
print(f"Dataset boyutu: {len(streaming_dataset)} örnek")
|
| 57 |
+
print(f"Bellek kullanımı: Minimal (generator pattern)")
|
| 58 |
+
|
| 59 |
+
# İlk 3 örneği al
|
| 60 |
+
print("\nİlk 3 örnek:")
|
| 61 |
+
for i in range(3):
|
| 62 |
+
example = streaming_dataset[i]
|
| 63 |
+
print(f"\n{i+1}. Örnek:")
|
| 64 |
+
print(f" ID: {example['id']}")
|
| 65 |
+
print(f" Text uzunluğu: {len(example['text'])} karakter")
|
| 66 |
+
print(f" Label: {example['label']}")
|
| 67 |
+
print(f" İlk 80 karakter: {example['text'][:80]}...")
|
| 68 |
+
|
| 69 |
+
|
| 70 |
+
print("\n" + "="*60)
|
| 71 |
+
print("2. DATASET SHARDING VE PARALEL İŞLEME")
|
| 72 |
+
print("="*60)
|
| 73 |
+
|
| 74 |
+
print("\n🔀 Dataset Sharding - Distributed Training için")
|
| 75 |
+
|
| 76 |
+
# Dataset'i parçalara böl
|
| 77 |
+
num_shards = 4
|
| 78 |
+
dataset_size = len(streaming_dataset)
|
| 79 |
+
shard_size = dataset_size // num_shards
|
| 80 |
+
|
| 81 |
+
print(f"\nToplam dataset: {dataset_size} örnek")
|
| 82 |
+
print(f"Shard sayısı: {num_shards}")
|
| 83 |
+
print(f"Her shard: ~{shard_size} örnek")
|
| 84 |
+
|
| 85 |
+
for shard_id in range(num_shards):
|
| 86 |
+
start_idx = shard_id * shard_size
|
| 87 |
+
end_idx = start_idx + shard_size if shard_id < num_shards - 1 else dataset_size
|
| 88 |
+
|
| 89 |
+
shard = streaming_dataset.select(range(start_idx, end_idx))
|
| 90 |
+
|
| 91 |
+
print(f"\n Shard {shard_id}:")
|
| 92 |
+
print(f" - İndeksler: {start_idx} - {end_idx}")
|
| 93 |
+
print(f" - Boyut: {len(shard)} örnek")
|
| 94 |
+
print(f" - İlk örnek ID: {shard[0]['id']}")
|
| 95 |
+
print(f" - Use case: GPU {shard_id} için")
|
| 96 |
+
|
| 97 |
+
|
| 98 |
+
print("\n" + "="*60)
|
| 99 |
+
print("3. BATCH PROCESSING - VERİMLİ PREPROCESSING")
|
| 100 |
+
print("="*60)
|
| 101 |
+
|
| 102 |
+
print("\n⚡ Batch vs Single Processing Karşılaştırması")
|
| 103 |
+
|
| 104 |
+
# Test dataset'i
|
| 105 |
+
test_dataset = streaming_dataset.select(range(1000))
|
| 106 |
+
|
| 107 |
+
# YÖNTEM 1: Tek tek işleme (YAVAŞ)
|
| 108 |
+
print("\n1️⃣ TEK TEK İŞLEME:")
|
| 109 |
+
start = time.time()
|
| 110 |
+
|
| 111 |
+
def process_single(example):
|
| 112 |
+
"""Her örneği tek tek işle"""
|
| 113 |
+
example['text_length'] = len(example['text'])
|
| 114 |
+
example['word_count'] = len(example['text'].split())
|
| 115 |
+
example['label_squared'] = example['label'] ** 2
|
| 116 |
+
return example
|
| 117 |
+
|
| 118 |
+
processed_single = test_dataset.map(
|
| 119 |
+
process_single,
|
| 120 |
+
desc="Single processing"
|
| 121 |
+
)
|
| 122 |
+
time_single = time.time() - start
|
| 123 |
+
print(f" Süre: {time_single:.3f}s")
|
| 124 |
+
print(f" İşlem hızı: {len(test_dataset)/time_single:.0f} örnek/saniye")
|
| 125 |
+
|
| 126 |
+
# YÖNTEM 2: Batch processing (HIZLI)
|
| 127 |
+
print("\n2️⃣ BATCH İŞLEME:")
|
| 128 |
+
start = time.time()
|
| 129 |
+
|
| 130 |
+
def process_batch(examples):
|
| 131 |
+
"""Batch'i bir arada işle - VECTORIZED!"""
|
| 132 |
+
examples['text_length'] = [len(text) for text in examples['text']]
|
| 133 |
+
examples['word_count'] = [len(text.split()) for text in examples['text']]
|
| 134 |
+
examples['label_squared'] = [label ** 2 for label in examples['label']]
|
| 135 |
+
return examples
|
| 136 |
+
|
| 137 |
+
processed_batch = test_dataset.map(
|
| 138 |
+
process_batch,
|
| 139 |
+
batched=True,
|
| 140 |
+
batch_size=100, # 100 örneği birlikte işle
|
| 141 |
+
desc="Batch processing"
|
| 142 |
+
)
|
| 143 |
+
time_batch = time.time() - start
|
| 144 |
+
print(f" Süre: {time_batch:.3f}s")
|
| 145 |
+
print(f" İşlem hızı: {len(test_dataset)/time_batch:.0f} örnek/saniye")
|
| 146 |
+
print(f"\n ⚡ HIZ ARTIŞI: {time_single/time_batch:.1f}x DAHA HIZLI!")
|
| 147 |
+
|
| 148 |
+
# Sonuçları kontrol et
|
| 149 |
+
print("\n✅ Sonuç kontrolü:")
|
| 150 |
+
print(f" İlk örnek - text_length: {processed_batch[0]['text_length']}")
|
| 151 |
+
print(f" İlk örnek - word_count: {processed_batch[0]['word_count']}")
|
| 152 |
+
|
| 153 |
+
|
| 154 |
+
print("\n" + "="*60)
|
| 155 |
+
print("4. MEMORY-EFFICIENT FILTERING")
|
| 156 |
+
print("="*60)
|
| 157 |
+
|
| 158 |
+
print("\n🔍 Büyük Dataset'te Filtreleme")
|
| 159 |
+
|
| 160 |
+
# Farklı filtre stratejileri
|
| 161 |
+
print("\n📊 Orijinal dataset:")
|
| 162 |
+
print(f" Toplam: {len(streaming_dataset)} örnek")
|
| 163 |
+
|
| 164 |
+
# Filtre 1: Kısa metinleri çıkar
|
| 165 |
+
filtered_1 = streaming_dataset.filter(
|
| 166 |
+
lambda x: len(x['text']) > 500,
|
| 167 |
+
desc="Filtering short texts"
|
| 168 |
+
)
|
| 169 |
+
print(f"\n1️⃣ Uzun metinler (>500 char): {len(filtered_1)} örnek")
|
| 170 |
+
|
| 171 |
+
# Filtre 2: Belirli label'ları al
|
| 172 |
+
filtered_2 = streaming_dataset.filter(
|
| 173 |
+
lambda x: x['label'] in [0, 1],
|
| 174 |
+
desc="Filtering by label"
|
| 175 |
+
)
|
| 176 |
+
print(f"2️⃣ Label 0 veya 1: {len(filtered_2)} örnek")
|
| 177 |
+
|
| 178 |
+
# Filtre 3: Kompleks filtre - BATCH ile daha hızlı
|
| 179 |
+
def complex_filter(examples):
|
| 180 |
+
"""
|
| 181 |
+
Batch filtering - çok daha hızlı!
|
| 182 |
+
"""
|
| 183 |
+
return [
|
| 184 |
+
len(text) > 300 and len(text) < 1000 and label < 3
|
| 185 |
+
for text, label in zip(examples['text'], examples['label'])
|
| 186 |
+
]
|
| 187 |
+
|
| 188 |
+
start = time.time()
|
| 189 |
+
filtered_3 = streaming_dataset.filter(
|
| 190 |
+
complex_filter,
|
| 191 |
+
batched=True,
|
| 192 |
+
batch_size=1000,
|
| 193 |
+
desc="Complex batch filtering"
|
| 194 |
+
)
|
| 195 |
+
filter_time = time.time() - start
|
| 196 |
+
print(f"3️⃣ Kompleks filtre (300-1000 char, label<3): {len(filtered_3)} örnek")
|
| 197 |
+
print(f" Filtreleme süresi: {filter_time:.3f}s")
|
| 198 |
+
|
| 199 |
+
|
| 200 |
+
print("\n" + "="*60)
|
| 201 |
+
print("5. CHUNK-BASED PROCESSING")
|
| 202 |
+
print("="*60)
|
| 203 |
+
|
| 204 |
+
print("\n📦 Chunk Tabanlı İşleme - Çok Büyük Datasets için")
|
| 205 |
+
|
| 206 |
+
def process_in_chunks(dataset, chunk_size=2000, num_chunks=5):
|
| 207 |
+
"""
|
| 208 |
+
Dataset'i chunk'lar halinde işle
|
| 209 |
+
Her chunk işlendikten sonra sonuçlar kaydedilir, bellek temizlenir
|
| 210 |
+
"""
|
| 211 |
+
chunk_results = []
|
| 212 |
+
total_size = len(dataset)
|
| 213 |
+
|
| 214 |
+
print(f"\nToplam: {total_size} örnek")
|
| 215 |
+
print(f"Chunk boyutu: {chunk_size}")
|
| 216 |
+
print(f"İşlenecek chunk: {num_chunks}")
|
| 217 |
+
|
| 218 |
+
for chunk_id in range(num_chunks):
|
| 219 |
+
start_idx = chunk_id * chunk_size
|
| 220 |
+
end_idx = min(start_idx + chunk_size, total_size)
|
| 221 |
+
|
| 222 |
+
if start_idx >= total_size:
|
| 223 |
+
break
|
| 224 |
+
|
| 225 |
+
print(f"\n Chunk {chunk_id + 1}/{num_chunks} işleniyor...")
|
| 226 |
+
|
| 227 |
+
# Bir chunk al
|
| 228 |
+
chunk = dataset.select(range(start_idx, end_idx))
|
| 229 |
+
|
| 230 |
+
# İstatistikleri hesapla
|
| 231 |
+
lengths = [len(ex['text']) for ex in chunk]
|
| 232 |
+
labels = [ex['label'] for ex in chunk]
|
| 233 |
+
|
| 234 |
+
chunk_results.append({
|
| 235 |
+
'chunk_id': chunk_id,
|
| 236 |
+
'size': len(chunk),
|
| 237 |
+
'avg_length': np.mean(lengths),
|
| 238 |
+
'max_length': np.max(lengths),
|
| 239 |
+
'min_length': np.min(lengths),
|
| 240 |
+
'label_dist': {i: labels.count(i) for i in range(5)}
|
| 241 |
+
})
|
| 242 |
+
|
| 243 |
+
# Chunk işlendi, bellek temizleniyor
|
| 244 |
+
del chunk
|
| 245 |
+
|
| 246 |
+
return chunk_results
|
| 247 |
+
|
| 248 |
+
# Dataset'i chunk'lar halinde işle
|
| 249 |
+
results = process_in_chunks(streaming_dataset, chunk_size=2000, num_chunks=5)
|
| 250 |
+
|
| 251 |
+
print("\n📊 Chunk İstatistikleri:")
|
| 252 |
+
for result in results:
|
| 253 |
+
print(f"\n Chunk {result['chunk_id']}:")
|
| 254 |
+
print(f" Boyut: {result['size']:,} örnek")
|
| 255 |
+
print(f" Ortalama uzunluk: {result['avg_length']:.0f} karakter")
|
| 256 |
+
print(f" Min/Max: {result['min_length']:.0f} / {result['max_length']:.0f}")
|
| 257 |
+
print(f" Label dağılımı: {result['label_dist']}")
|
| 258 |
+
|
| 259 |
+
|
| 260 |
+
print("\n" + "="*60)
|
| 261 |
+
print("6. DATASET BİRLEŞTİRME VE KARMAŞIK İŞLEMLER")
|
| 262 |
+
print("="*60)
|
| 263 |
+
|
| 264 |
+
print("\n🔄 Birden Fazla Dataset'i Birleştirme")
|
| 265 |
+
|
| 266 |
+
# Farklı kaynaklardan dataset'ler oluştur
|
| 267 |
+
def create_dataset(name, size, label_shift=0):
|
| 268 |
+
def gen():
|
| 269 |
+
for i in range(size):
|
| 270 |
+
yield {
|
| 271 |
+
"text": f"Dataset {name}: Örnek {i}. " * np.random.randint(5, 20),
|
| 272 |
+
"label": (i % 3) + label_shift,
|
| 273 |
+
"source": name
|
| 274 |
+
}
|
| 275 |
+
return Dataset.from_generator(gen)
|
| 276 |
+
|
| 277 |
+
dataset_a = create_dataset("A", 1000, label_shift=0)
|
| 278 |
+
dataset_b = create_dataset("B", 1500, label_shift=2)
|
| 279 |
+
dataset_c = create_dataset("C", 800, label_shift=4)
|
| 280 |
+
|
| 281 |
+
print(f"Dataset A: {len(dataset_a)} örnek")
|
| 282 |
+
print(f"Dataset B: {len(dataset_b)} örnek")
|
| 283 |
+
print(f"Dataset C: {len(dataset_c)} örnek")
|
| 284 |
+
|
| 285 |
+
# Datasets'leri birleştir
|
| 286 |
+
combined = concatenate_datasets([dataset_a, dataset_b, dataset_c])
|
| 287 |
+
print(f"\n✅ Birleştirilmiş: {len(combined)} örnek")
|
| 288 |
+
|
| 289 |
+
# Her kaynaktan örnek sayıları
|
| 290 |
+
sources = [ex['source'] for ex in combined.select(range(min(100, len(combined))))]
|
| 291 |
+
print("\nİlk 100 örnekte kaynak dağılımı:")
|
| 292 |
+
for source in ['A', 'B', 'C']:
|
| 293 |
+
count = sources.count(source)
|
| 294 |
+
print(f" {source}: {count} örnek")
|
| 295 |
+
|
| 296 |
+
|
| 297 |
+
print("\n" + "="*60)
|
| 298 |
+
print("7. CACHE YÖNETİMİ VE OPTIMIZATION")
|
| 299 |
+
print("="*60)
|
| 300 |
+
|
| 301 |
+
print("\n💾 Cache Kullanımı - İşlem Hızlandırma")
|
| 302 |
+
|
| 303 |
+
# Ağır bir preprocessing işlemi simüle et
|
| 304 |
+
def heavy_preprocessing(examples):
|
| 305 |
+
"""
|
| 306 |
+
Ağır işlem simülasyonu
|
| 307 |
+
"""
|
| 308 |
+
time.sleep(0.0001) # Yapay gecikme
|
| 309 |
+
return {
|
| 310 |
+
'processed_text': [text.lower()[:100] for text in examples['text']],
|
| 311 |
+
'features': [[len(text), len(text.split())] for text in examples['text']]
|
| 312 |
+
}
|
| 313 |
+
|
| 314 |
+
test_set = streaming_dataset.select(range(1000))
|
| 315 |
+
|
| 316 |
+
# İlk işleme - cache oluştur
|
| 317 |
+
print("\n1️⃣ İlk işleme (cache oluşturuluyor):")
|
| 318 |
+
start = time.time()
|
| 319 |
+
processed_1 = test_set.map(
|
| 320 |
+
heavy_preprocessing,
|
| 321 |
+
batched=True,
|
| 322 |
+
batch_size=100,
|
| 323 |
+
desc="Processing with cache"
|
| 324 |
+
)
|
| 325 |
+
first_time = time.time() - start
|
| 326 |
+
print(f" Süre: {first_time:.3f}s")
|
| 327 |
+
|
| 328 |
+
# İkinci işleme - cache'den oku (aynı fonksiyon)
|
| 329 |
+
print("\n2️⃣ İkinci işleme (cache kullanılıyor):")
|
| 330 |
+
start = time.time()
|
| 331 |
+
processed_2 = test_set.map(
|
| 332 |
+
heavy_preprocessing,
|
| 333 |
+
batched=True,
|
| 334 |
+
batch_size=100,
|
| 335 |
+
desc="Using cache"
|
| 336 |
+
)
|
| 337 |
+
second_time = time.time() - start
|
| 338 |
+
print(f" Süre: {second_time:.3f}s")
|
| 339 |
+
|
| 340 |
+
if first_time > second_time:
|
| 341 |
+
speedup = first_time / second_time
|
| 342 |
+
print(f"\n ⚡ CACHE HIZ ARTIŞI: {speedup:.1f}x daha hızlı!")
|
| 343 |
+
|
| 344 |
+
|
| 345 |
+
print("\n" + "="*60)
|
| 346 |
+
print("8. MULTI-PROCESS PROCESSING")
|
| 347 |
+
print("="*60)
|
| 348 |
+
|
| 349 |
+
print("\n🚀 Paralel İşleme - Tüm CPU Çekirdeklerini Kullan")
|
| 350 |
+
|
| 351 |
+
import multiprocessing
|
| 352 |
+
num_cores = multiprocessing.cpu_count()
|
| 353 |
+
|
| 354 |
+
print(f"\nSistem CPU çekirdek sayısı: {num_cores}")
|
| 355 |
+
|
| 356 |
+
# CPU-intensive işlem
|
| 357 |
+
def cpu_intensive_processing(examples):
|
| 358 |
+
"""
|
| 359 |
+
CPU-yoğun işlem simülasyonu
|
| 360 |
+
"""
|
| 361 |
+
results = []
|
| 362 |
+
for text in examples['text']:
|
| 363 |
+
# Basit ama CPU kullanan işlem
|
| 364 |
+
result = sum(ord(c) for c in text[:1000])
|
| 365 |
+
results.append(result)
|
| 366 |
+
return {'computed_hash': results}
|
| 367 |
+
|
| 368 |
+
test_parallel = streaming_dataset.select(range(5000))
|
| 369 |
+
|
| 370 |
+
# Tek işlem
|
| 371 |
+
print("\n1️⃣ Tek işlem (num_proc=1):")
|
| 372 |
+
start = time.time()
|
| 373 |
+
processed_single_proc = test_parallel.map(
|
| 374 |
+
cpu_intensive_processing,
|
| 375 |
+
batched=True,
|
| 376 |
+
batch_size=100,
|
| 377 |
+
num_proc=1,
|
| 378 |
+
desc="Single process"
|
| 379 |
+
)
|
| 380 |
+
time_single_proc = time.time() - start
|
| 381 |
+
print(f" Süre: {time_single_proc:.3f}s")
|
| 382 |
+
|
| 383 |
+
# Çoklu işlem (mümkünse)
|
| 384 |
+
if num_cores > 1:
|
| 385 |
+
num_proc = min(4, num_cores)
|
| 386 |
+
print(f"\n2️⃣ Çoklu işlem (num_proc={num_proc}):")
|
| 387 |
+
start = time.time()
|
| 388 |
+
processed_multi_proc = test_parallel.map(
|
| 389 |
+
cpu_intensive_processing,
|
| 390 |
+
batched=True,
|
| 391 |
+
batch_size=100,
|
| 392 |
+
num_proc=num_proc,
|
| 393 |
+
desc="Multi process"
|
| 394 |
+
)
|
| 395 |
+
time_multi_proc = time.time() - start
|
| 396 |
+
print(f" Süre: {time_multi_proc:.3f}s")
|
| 397 |
+
|
| 398 |
+
if time_multi_proc < time_single_proc:
|
| 399 |
+
speedup = time_single_proc / time_multi_proc
|
| 400 |
+
print(f"\n ⚡ PARALEL HIZ ARTIŞI: {speedup:.1f}x daha hızlı!")
|
| 401 |
+
|
| 402 |
+
|
| 403 |
+
print("\n" + "="*60)
|
| 404 |
+
print("9. DATASET İSTATİSTİKLERİ - BÜYÜK VERİDE")
|
| 405 |
+
print("="*60)
|
| 406 |
+
|
| 407 |
+
print("\n📊 Comprehensive Dataset Analysis")
|
| 408 |
+
|
| 409 |
+
def compute_comprehensive_stats(dataset, sample_size=None):
|
| 410 |
+
"""
|
| 411 |
+
Dataset için detaylı istatistikler
|
| 412 |
+
"""
|
| 413 |
+
if sample_size and len(dataset) > sample_size:
|
| 414 |
+
print(f" Dataset çok büyük, {sample_size} örnek üzerinden analiz...")
|
| 415 |
+
dataset = dataset.select(range(sample_size))
|
| 416 |
+
|
| 417 |
+
# Text uzunlukları
|
| 418 |
+
lengths = [len(ex['text']) for ex in dataset]
|
| 419 |
+
word_counts = [len(ex['text'].split()) for ex in dataset]
|
| 420 |
+
labels = [ex['label'] for ex in dataset]
|
| 421 |
+
|
| 422 |
+
return {
|
| 423 |
+
'num_examples': len(dataset),
|
| 424 |
+
'text_length': {
|
| 425 |
+
'mean': np.mean(lengths),
|
| 426 |
+
'median': np.median(lengths),
|
| 427 |
+
'std': np.std(lengths),
|
| 428 |
+
'min': np.min(lengths),
|
| 429 |
+
'max': np.max(lengths),
|
| 430 |
+
'percentile_25': np.percentile(lengths, 25),
|
| 431 |
+
'percentile_75': np.percentile(lengths, 75),
|
| 432 |
+
},
|
| 433 |
+
'word_count': {
|
| 434 |
+
'mean': np.mean(word_counts),
|
| 435 |
+
'median': np.median(word_counts),
|
| 436 |
+
},
|
| 437 |
+
'label_distribution': {
|
| 438 |
+
label: labels.count(label)
|
| 439 |
+
for label in set(labels)
|
| 440 |
+
}
|
| 441 |
+
}
|
| 442 |
+
|
| 443 |
+
stats = compute_comprehensive_stats(streaming_dataset, sample_size=5000)
|
| 444 |
+
|
| 445 |
+
print("\n📈 Dataset İstatistikleri (5000 örnek üzerinden):")
|
| 446 |
+
print(f"\n Toplam örnek: {stats['num_examples']:,}")
|
| 447 |
+
|
| 448 |
+
print("\n 📝 Text Uzunluğu:")
|
| 449 |
+
for key, value in stats['text_length'].items():
|
| 450 |
+
print(f" {key}: {value:.1f} karakter")
|
| 451 |
+
|
| 452 |
+
print("\n 📚 Kelime Sayısı:")
|
| 453 |
+
for key, value in stats['word_count'].items():
|
| 454 |
+
print(f" {key}: {value:.1f} kelime")
|
| 455 |
+
|
| 456 |
+
print("\n 🏷️ Label Dağılımı:")
|
| 457 |
+
total = sum(stats['label_distribution'].values())
|
| 458 |
+
for label, count in sorted(stats['label_distribution'].items()):
|
| 459 |
+
pct = (count / total) * 100
|
| 460 |
+
print(f" Label {label}: {count:,} ({pct:.1f}%)")
|
| 461 |
+
|
| 462 |
+
|
| 463 |
+
print("\n" + "="*60)
|
| 464 |
+
print("10. ADVANCED PATTERNS VE BEST PRACTICES")
|
| 465 |
+
print("="*60)
|
| 466 |
+
|
| 467 |
+
print("\n🎯 Memory-Efficient Data Pipeline")
|
| 468 |
+
|
| 469 |
+
class DataPipeline:
|
| 470 |
+
"""
|
| 471 |
+
Production-ready data pipeline
|
| 472 |
+
"""
|
| 473 |
+
def __init__(self, dataset, batch_size=32):
|
| 474 |
+
self.dataset = dataset
|
| 475 |
+
self.batch_size = batch_size
|
| 476 |
+
self.processed = None
|
| 477 |
+
|
| 478 |
+
def preprocess(self, keep_columns=None):
|
| 479 |
+
"""Step 1: Preprocessing"""
|
| 480 |
+
def process(examples):
|
| 481 |
+
return {
|
| 482 |
+
'text_clean': [text.lower().strip() for text in examples['text']],
|
| 483 |
+
'length': [len(text) for text in examples['text']]
|
| 484 |
+
}
|
| 485 |
+
|
| 486 |
+
self.processed = self.dataset.map(
|
| 487 |
+
process,
|
| 488 |
+
batched=True,
|
| 489 |
+
batch_size=self.batch_size,
|
| 490 |
+
remove_columns=['metadata'] if keep_columns is None else None
|
| 491 |
+
)
|
| 492 |
+
return self
|
| 493 |
+
|
| 494 |
+
def filter_valid(self, min_length=100):
|
| 495 |
+
"""Step 2: Filtering"""
|
| 496 |
+
self.processed = self.processed.filter(
|
| 497 |
+
lambda x: x['length'] >= min_length,
|
| 498 |
+
batched=False
|
| 499 |
+
)
|
| 500 |
+
return self
|
| 501 |
+
|
| 502 |
+
def get_stats(self):
|
| 503 |
+
"""Step 3: Get statistics"""
|
| 504 |
+
lengths = [ex['length'] for ex in self.processed.select(range(min(1000, len(self.processed))))]
|
| 505 |
+
return {
|
| 506 |
+
'count': len(self.processed),
|
| 507 |
+
'avg_length': np.mean(lengths),
|
| 508 |
+
'median_length': np.median(lengths)
|
| 509 |
+
}
|
| 510 |
+
|
| 511 |
+
# Pipeline kullanımı
|
| 512 |
+
print("\n🔧 Pipeline Örneği:")
|
| 513 |
+
pipeline = DataPipeline(streaming_dataset.select(range(5000)), batch_size=100)
|
| 514 |
+
|
| 515 |
+
print("\n Step 1: Preprocessing...")
|
| 516 |
+
pipeline.preprocess()
|
| 517 |
+
|
| 518 |
+
print(" Step 2: Filtering (min_length=400)...")
|
| 519 |
+
pipeline.filter_valid(min_length=400)
|
| 520 |
+
|
| 521 |
+
print(" Step 3: İstatistikler...")
|
| 522 |
+
stats = pipeline.get_stats()
|
| 523 |
+
|
| 524 |
+
print(f"\n ✅ Sonuçlar:")
|
| 525 |
+
print(f" Kalan örnek: {stats['count']:,}")
|
| 526 |
+
print(f" Ortalama uzunluk: {stats['avg_length']:.0f}")
|
| 527 |
+
print(f" Median uzunluk: {stats['median_length']:.0f}")
|
| 528 |
+
|
| 529 |
+
|
| 530 |
+
print("\n" + "="*60)
|
| 531 |
+
print("📚 ÖNEMLİ NOTLAR VE BEST PRACTICES")
|
| 532 |
+
print("="*60)
|
| 533 |
+
|
| 534 |
+
print("""
|
| 535 |
+
✅ STREAMING / GENERATOR PATTERN:
|
| 536 |
+
- Dataset > 10GB → Streaming kullan
|
| 537 |
+
- Generator pattern ile memory efficient
|
| 538 |
+
- İterasyon sırasında veri üretilir
|
| 539 |
+
- Disk I/O'yu minimize et
|
| 540 |
+
|
| 541 |
+
✅ BATCH PROCESSING:
|
| 542 |
+
- DAIMA batched=True kullan!
|
| 543 |
+
- Batch size: 32-1000 arası optimal
|
| 544 |
+
- List comprehension kullan (hızlı)
|
| 545 |
+
- Vectorization mümkünse tercih et
|
| 546 |
+
|
| 547 |
+
✅ MULTI-PROCESSING:
|
| 548 |
+
- CPU-bound işlemler için num_proc kullan
|
| 549 |
+
- num_proc = min(8, cpu_count) genelde optimal
|
| 550 |
+
- I/O-bound işlemlerde fayda sağlamaz
|
| 551 |
+
- Batch size ile beraber ayarla
|
| 552 |
+
|
| 553 |
+
✅ MEMORY YÖNETİMİ:
|
| 554 |
+
- Gereksiz kolonları erken remove_columns ile kaldır
|
| 555 |
+
- Chunk-based processing büyük veri için
|
| 556 |
+
- Cache stratejisi belirle (load_from_disk/save_to_disk)
|
| 557 |
+
- Generator pattern kullan
|
| 558 |
+
|
| 559 |
+
✅ FILTERING:
|
| 560 |
+
- Filtreyi erken uygula (veri pipeline'ın başında)
|
| 561 |
+
- Batch filtering daha hızlı
|
| 562 |
+
- Kompleks filtreler için lambda yerine fonksiyon
|
| 563 |
+
- Filter chain'i yerine tek complex filter
|
| 564 |
+
|
| 565 |
+
✅ PERFORMANCE İPUÇLARI:
|
| 566 |
+
- map() > iterate (her zaman)
|
| 567 |
+
- batched=True > batched=False (10x-100x hızlı)
|
| 568 |
+
- num_proc kullan ama oversubscribe etme
|
| 569 |
+
- Cache akıllıca kullan
|
| 570 |
+
- Arrow format (.arrow) kullan, pickle yerine
|
| 571 |
+
|
| 572 |
+
✅ PRODUCTION PATTERNS:
|
| 573 |
+
- Pipeline pattern kullan (clean code)
|
| 574 |
+
- Error handling ekle
|
| 575 |
+
- Progress bars kullan (desc parameter)
|
| 576 |
+
- Logging ekle
|
| 577 |
+
- Validation adımları ekle
|
| 578 |
+
- Reproducibility için seed kullan
|
| 579 |
+
|
| 580 |
+
✅ BENCHMARK VE PROFILE:
|
| 581 |
+
- time.time() ile zamanla
|
| 582 |
+
- memory_profiler kullan
|
| 583 |
+
- Farklı batch size'ları test et
|
| 584 |
+
- Farklı num_proc değerleri test et
|
| 585 |
+
- Optimal değerleri belirle
|
| 586 |
+
""")
|
| 587 |
+
|
| 588 |
+
print("\n" + "="*60)
|
| 589 |
+
print("✅ BÖLÜM 1 TAMAMLANDI!")
|
| 590 |
+
print("="*60)
|
| 591 |
+
print(f"""
|
| 592 |
+
Bu bölümde öğrendikleriniz:
|
| 593 |
+
✓ Streaming/Generator pattern ile büyük veri
|
| 594 |
+
✓ Memory-efficient preprocessing
|
| 595 |
+
✓ Batch processing {time_single/time_batch:.1f}x hız artışı
|
| 596 |
+
✓ Dataset sharding ve parallelization
|
| 597 |
+
✓ Cache yönetimi
|
| 598 |
+
✓ Chunk-based processing
|
| 599 |
+
✓ Multi-process processing
|
| 600 |
+
✓ Comprehensive statistics
|
| 601 |
+
✓ Production-ready pipeline pattern
|
| 602 |
+
|
| 603 |
+
📊 PERFORMANS KAZANIMLARI:
|
| 604 |
+
- Batch processing: {time_single/time_batch:.1f}x hızlandırma
|
| 605 |
+
- Multi-processing: {num_cores}x CPU çekirdeği
|
| 606 |
+
- Memory: Generator pattern ile minimal kullanım
|
| 607 |
+
|
| 608 |
+
📚 SONRAKI BÖLÜM: Domain-Specific Datasets
|
| 609 |
+
- Bilimsel makaleler (arXiv, PubMed)
|
| 610 |
+
- Kod datasets (The Stack)
|
| 611 |
+
- Finansal veri
|
| 612 |
+
- Tıbbi datasets
|
| 613 |
+
- Özel domain adaptasyonu
|
| 614 |
+
""")
|
| 615 |
+
|
| 616 |
+
print("\n🚀 Harika! İlk bölümü tamamladık!")
|
| 617 |
+
print("Sonraki bölüme geçelim mi? (Evet yazın)")
|
space/modules/02_domain_specific_datasets.py
ADDED
|
@@ -0,0 +1,870 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
DOMAIN-SPECIFIC DATASETS - İLERİ SEVİYE HUGGING FACE
|
| 3 |
+
====================================================
|
| 4 |
+
|
| 5 |
+
Bu modülde öğrenecekleriniz:
|
| 6 |
+
1. Bilimsel Makaleler (arXiv, PubMed) - Academic datasets
|
| 7 |
+
2. Kod Datasets (The Stack, CodeParrot) - Programming datasets
|
| 8 |
+
3. Finansal Analiz Datasets - Finance & Business
|
| 9 |
+
4. Tıbbi/Sağlık Datasets - Medical & Healthcare
|
| 10 |
+
5. Domain-specific preprocessing
|
| 11 |
+
6. Custom tokenization
|
| 12 |
+
7. Domain adaptation techniques
|
| 13 |
+
"""
|
| 14 |
+
|
| 15 |
+
from datasets import Dataset, load_dataset, DatasetDict
|
| 16 |
+
import numpy as np
|
| 17 |
+
import json
|
| 18 |
+
from typing import Dict, List
|
| 19 |
+
import time
|
| 20 |
+
from collections import Counter
|
| 21 |
+
import re
|
| 22 |
+
|
| 23 |
+
print("="*70)
|
| 24 |
+
print("🔬 DOMAIN-SPECIFIC DATASETS - İLERİ SEVİYE")
|
| 25 |
+
print("="*70)
|
| 26 |
+
|
| 27 |
+
print("\n" + "="*70)
|
| 28 |
+
print("1. BİLİMSEL MAKALELER - ACADEMIC DATASETS")
|
| 29 |
+
print("="*70)
|
| 30 |
+
|
| 31 |
+
# Sentetik bilimsel makale dataset'i
|
| 32 |
+
def generate_scientific_papers(num_samples=1000):
|
| 33 |
+
"""
|
| 34 |
+
Bilimsel makale formatında sentetik veri
|
| 35 |
+
"""
|
| 36 |
+
domains = ['Physics', 'Computer Science', 'Biology', 'Mathematics', 'Chemistry']
|
| 37 |
+
|
| 38 |
+
def gen():
|
| 39 |
+
for i in range(num_samples):
|
| 40 |
+
domain = np.random.choice(domains)
|
| 41 |
+
|
| 42 |
+
# Makale yapısı
|
| 43 |
+
abstract = f"This paper presents a novel approach to {domain.lower()} research. " \
|
| 44 |
+
f"We propose a methodology that addresses key challenges in the field. " \
|
| 45 |
+
f"Our experimental results show significant improvements over baseline methods. " \
|
| 46 |
+
f"The proposed framework demonstrates applicability across multiple scenarios."
|
| 47 |
+
|
| 48 |
+
yield {
|
| 49 |
+
'id': f'arxiv.{i:06d}',
|
| 50 |
+
'title': f'Advanced Methods in {domain} Research: A Comprehensive Study {i}',
|
| 51 |
+
'abstract': abstract,
|
| 52 |
+
'authors': [f'Author {j}' for j in range(np.random.randint(2, 6))],
|
| 53 |
+
'domain': domain,
|
| 54 |
+
'year': np.random.randint(2015, 2025),
|
| 55 |
+
'citations': np.random.randint(0, 500),
|
| 56 |
+
'keywords': [f'keyword{j}' for j in range(np.random.randint(3, 8))],
|
| 57 |
+
'full_text': abstract + " " + abstract * np.random.randint(5, 15)
|
| 58 |
+
}
|
| 59 |
+
|
| 60 |
+
return Dataset.from_generator(gen)
|
| 61 |
+
|
| 62 |
+
print("\n📚 Bilimsel Makale Dataset'i Oluşturuluyor...")
|
| 63 |
+
scientific_dataset = generate_scientific_papers(2000)
|
| 64 |
+
|
| 65 |
+
print(f"✅ {len(scientific_dataset)} bilimsel makale yüklendi")
|
| 66 |
+
print(f"\nÖrnek makale:")
|
| 67 |
+
sample = scientific_dataset[0]
|
| 68 |
+
print(f" ID: {sample['id']}")
|
| 69 |
+
print(f" Başlık: {sample['title']}")
|
| 70 |
+
print(f" Domain: {sample['domain']}")
|
| 71 |
+
print(f" Yazar sayısı: {len(sample['authors'])}")
|
| 72 |
+
print(f" Yıl: {sample['year']}")
|
| 73 |
+
print(f" Atıf sayısı: {sample['citations']}")
|
| 74 |
+
print(f" Abstract: {sample['abstract'][:150]}...")
|
| 75 |
+
|
| 76 |
+
# Domain bazlı istatistikler
|
| 77 |
+
print("\n📊 Domain Dağılımı:")
|
| 78 |
+
domains = [ex['domain'] for ex in scientific_dataset]
|
| 79 |
+
domain_counts = Counter(domains)
|
| 80 |
+
for domain, count in domain_counts.most_common():
|
| 81 |
+
pct = (count / len(scientific_dataset)) * 100
|
| 82 |
+
print(f" {domain}: {count} ({pct:.1f}%)")
|
| 83 |
+
|
| 84 |
+
# Yıllara göre analiz
|
| 85 |
+
print("\n📅 Yıllara Göre Yayın Sayısı:")
|
| 86 |
+
years = [ex['year'] for ex in scientific_dataset]
|
| 87 |
+
year_counts = Counter(years)
|
| 88 |
+
for year in sorted(year_counts.keys())[-5:]:
|
| 89 |
+
print(f" {year}: {year_counts[year]} makale")
|
| 90 |
+
|
| 91 |
+
# Atıf analizi
|
| 92 |
+
citations = [ex['citations'] for ex in scientific_dataset]
|
| 93 |
+
print(f"\n📈 Atıf İstatistikleri:")
|
| 94 |
+
print(f" Ortalama: {np.mean(citations):.1f}")
|
| 95 |
+
print(f" Median: {np.median(citations):.1f}")
|
| 96 |
+
print(f" En çok atıf: {np.max(citations)}")
|
| 97 |
+
|
| 98 |
+
# Preprocessing - Bilimsel text temizleme
|
| 99 |
+
print("\n🔧 Bilimsel Text Preprocessing:")
|
| 100 |
+
|
| 101 |
+
def preprocess_scientific_text(examples):
|
| 102 |
+
"""
|
| 103 |
+
Bilimsel metin için özel preprocessing
|
| 104 |
+
"""
|
| 105 |
+
processed = []
|
| 106 |
+
|
| 107 |
+
for text in examples['abstract']:
|
| 108 |
+
# Küçük harfe çevir
|
| 109 |
+
text = text.lower()
|
| 110 |
+
|
| 111 |
+
# Özel karakterleri temizle
|
| 112 |
+
text = re.sub(r'[^\w\s\.]', '', text)
|
| 113 |
+
|
| 114 |
+
# Fazla boşlukları temizle
|
| 115 |
+
text = ' '.join(text.split())
|
| 116 |
+
|
| 117 |
+
processed.append(text)
|
| 118 |
+
|
| 119 |
+
return {
|
| 120 |
+
'abstract_clean': processed,
|
| 121 |
+
'abstract_length': [len(t) for t in processed],
|
| 122 |
+
'word_count': [len(t.split()) for t in processed]
|
| 123 |
+
}
|
| 124 |
+
|
| 125 |
+
scientific_processed = scientific_dataset.map(
|
| 126 |
+
preprocess_scientific_text,
|
| 127 |
+
batched=True,
|
| 128 |
+
batch_size=500,
|
| 129 |
+
desc="Preprocessing scientific texts"
|
| 130 |
+
)
|
| 131 |
+
|
| 132 |
+
print(f"✅ {len(scientific_processed)} makale işlendi")
|
| 133 |
+
print(f"\nÖrnek işlenmiş abstract:")
|
| 134 |
+
print(f" Original: {scientific_processed[0]['abstract'][:100]}...")
|
| 135 |
+
print(f" Cleaned: {scientific_processed[0]['abstract_clean'][:100]}...")
|
| 136 |
+
print(f" Word count: {scientific_processed[0]['word_count']}")
|
| 137 |
+
|
| 138 |
+
|
| 139 |
+
print("\n" + "="*70)
|
| 140 |
+
print("2. KOD DATASETS - PROGRAMMING & SOFTWARE")
|
| 141 |
+
print("="*70)
|
| 142 |
+
|
| 143 |
+
# Sentetik kod dataset'i
|
| 144 |
+
def generate_code_dataset(num_samples=1000):
|
| 145 |
+
"""
|
| 146 |
+
Çeşitli programlama dilleri için kod örnekleri
|
| 147 |
+
"""
|
| 148 |
+
languages = ['Python', 'JavaScript', 'Java', 'C++', 'Go', 'Rust']
|
| 149 |
+
|
| 150 |
+
code_templates = {
|
| 151 |
+
'Python': '''def {func_name}({params}):
|
| 152 |
+
"""
|
| 153 |
+
{docstring}
|
| 154 |
+
"""
|
| 155 |
+
result = {body}
|
| 156 |
+
return result''',
|
| 157 |
+
|
| 158 |
+
'JavaScript': '''function {func_name}({params}) {{
|
| 159 |
+
// {docstring}
|
| 160 |
+
const result = {body};
|
| 161 |
+
return result;
|
| 162 |
+
}}''',
|
| 163 |
+
|
| 164 |
+
'Java': '''public {return_type} {func_name}({params}) {{
|
| 165 |
+
// {docstring}
|
| 166 |
+
{return_type} result = {body};
|
| 167 |
+
return result;
|
| 168 |
+
}}''',
|
| 169 |
+
}
|
| 170 |
+
|
| 171 |
+
def gen():
|
| 172 |
+
for i in range(num_samples):
|
| 173 |
+
lang = np.random.choice(languages)
|
| 174 |
+
|
| 175 |
+
# Kod özellikleri
|
| 176 |
+
func_name = f"process_data_{i}"
|
| 177 |
+
params = "data, config"
|
| 178 |
+
docstring = f"Process data using method {i}"
|
| 179 |
+
body = "data * 2 + config"
|
| 180 |
+
|
| 181 |
+
if lang in code_templates:
|
| 182 |
+
code = code_templates[lang].format(
|
| 183 |
+
func_name=func_name,
|
| 184 |
+
params=params,
|
| 185 |
+
docstring=docstring,
|
| 186 |
+
body=body,
|
| 187 |
+
return_type='int' if lang == 'Java' else ''
|
| 188 |
+
)
|
| 189 |
+
else:
|
| 190 |
+
code = f"// {lang} code example\n{func_name}({params})"
|
| 191 |
+
|
| 192 |
+
yield {
|
| 193 |
+
'id': f'code_{i:06d}',
|
| 194 |
+
'language': lang,
|
| 195 |
+
'code': code,
|
| 196 |
+
'func_name': func_name,
|
| 197 |
+
'lines_of_code': len(code.split('\n')),
|
| 198 |
+
'has_docstring': 'docstring' in code.lower(),
|
| 199 |
+
'complexity': np.random.choice(['low', 'medium', 'high']),
|
| 200 |
+
'repo': f'github.com/user/repo_{i % 100}',
|
| 201 |
+
'stars': np.random.randint(0, 10000)
|
| 202 |
+
}
|
| 203 |
+
|
| 204 |
+
return Dataset.from_generator(gen)
|
| 205 |
+
|
| 206 |
+
print("\n💻 Kod Dataset'i Oluşturuluyor...")
|
| 207 |
+
code_dataset = generate_code_dataset(2000)
|
| 208 |
+
|
| 209 |
+
print(f"✅ {len(code_dataset)} kod örneği yüklendi")
|
| 210 |
+
print(f"\nÖrnek kod:")
|
| 211 |
+
code_sample = code_dataset[0]
|
| 212 |
+
print(f" ID: {code_sample['id']}")
|
| 213 |
+
print(f" Dil: {code_sample['language']}")
|
| 214 |
+
print(f" Satır sayısı: {code_sample['lines_of_code']}")
|
| 215 |
+
print(f" Karmaşıklık: {code_sample['complexity']}")
|
| 216 |
+
print(f"\n Kod:\n{code_sample['code']}\n")
|
| 217 |
+
|
| 218 |
+
# Dil dağılımı
|
| 219 |
+
print("\n📊 Programlama Dili Dağılımı:")
|
| 220 |
+
languages = [ex['language'] for ex in code_dataset]
|
| 221 |
+
lang_counts = Counter(languages)
|
| 222 |
+
for lang, count in lang_counts.most_common():
|
| 223 |
+
pct = (count / len(code_dataset)) * 100
|
| 224 |
+
print(f" {lang}: {count} ({pct:.1f}%)")
|
| 225 |
+
|
| 226 |
+
# Kod analizi
|
| 227 |
+
print("\n📈 Kod Metrikleri:")
|
| 228 |
+
loc_values = [ex['lines_of_code'] for ex in code_dataset]
|
| 229 |
+
print(f" Ortalama satır sayısı: {np.mean(loc_values):.1f}")
|
| 230 |
+
print(f" Median satır sayısı: {np.median(loc_values):.1f}")
|
| 231 |
+
|
| 232 |
+
has_docstring = sum([1 for ex in code_dataset if ex['has_docstring']])
|
| 233 |
+
print(f" Docstring oranı: {(has_docstring/len(code_dataset)*100):.1f}%")
|
| 234 |
+
|
| 235 |
+
# Kod preprocessing
|
| 236 |
+
print("\n🔧 Kod Preprocessing:")
|
| 237 |
+
|
| 238 |
+
def preprocess_code(examples):
|
| 239 |
+
"""
|
| 240 |
+
Kod için özel preprocessing
|
| 241 |
+
"""
|
| 242 |
+
def extract_functions(code):
|
| 243 |
+
# Fonksiyon isimlerini çıkar (basit regex)
|
| 244 |
+
funcs = re.findall(r'def\s+(\w+)|function\s+(\w+)|public\s+\w+\s+(\w+)', code)
|
| 245 |
+
return [f for group in funcs for f in group if f]
|
| 246 |
+
|
| 247 |
+
def count_comments(code):
|
| 248 |
+
# Yorum satırlarını say
|
| 249 |
+
return len(re.findall(r'#|//|/\*|\*/', code))
|
| 250 |
+
|
| 251 |
+
return {
|
| 252 |
+
'functions': [extract_functions(code) for code in examples['code']],
|
| 253 |
+
'comment_count': [count_comments(code) for code in examples['code']],
|
| 254 |
+
'code_chars': [len(code) for code in examples['code']],
|
| 255 |
+
'code_tokens': [len(code.split()) for code in examples['code']]
|
| 256 |
+
}
|
| 257 |
+
|
| 258 |
+
code_processed = code_dataset.map(
|
| 259 |
+
preprocess_code,
|
| 260 |
+
batched=True,
|
| 261 |
+
batch_size=500,
|
| 262 |
+
desc="Analyzing code"
|
| 263 |
+
)
|
| 264 |
+
|
| 265 |
+
print(f"✅ {len(code_processed)} kod örneği analiz edildi")
|
| 266 |
+
print(f"\nÖrnek analiz:")
|
| 267 |
+
print(f" Fonksiyonlar: {code_processed[0]['functions']}")
|
| 268 |
+
print(f" Yorum sayısı: {code_processed[0]['comment_count']}")
|
| 269 |
+
print(f" Token sayısı: {code_processed[0]['code_tokens']}")
|
| 270 |
+
|
| 271 |
+
|
| 272 |
+
print("\n" + "="*70)
|
| 273 |
+
print("3. FİNANSAL ANALİZ DATASETS")
|
| 274 |
+
print("="*70)
|
| 275 |
+
|
| 276 |
+
# Sentetik finansal veri
|
| 277 |
+
def generate_financial_dataset(num_samples=1000):
|
| 278 |
+
"""
|
| 279 |
+
Finansal haber ve analiz dataset'i
|
| 280 |
+
"""
|
| 281 |
+
companies = ['TechCorp', 'FinanceBank', 'RetailCo', 'EnergyInc', 'HealthMed']
|
| 282 |
+
sentiments = ['positive', 'negative', 'neutral']
|
| 283 |
+
categories = ['earnings', 'merger', 'product_launch', 'scandal', 'expansion']
|
| 284 |
+
|
| 285 |
+
def gen():
|
| 286 |
+
for i in range(num_samples):
|
| 287 |
+
company = np.random.choice(companies)
|
| 288 |
+
sentiment = np.random.choice(sentiments)
|
| 289 |
+
category = np.random.choice(categories)
|
| 290 |
+
|
| 291 |
+
# Finansal haber metni
|
| 292 |
+
if sentiment == 'positive':
|
| 293 |
+
text = f"{company} announces strong quarterly earnings, exceeding market expectations. " \
|
| 294 |
+
f"Stock prices surged following the announcement. Analysts remain optimistic."
|
| 295 |
+
elif sentiment == 'negative':
|
| 296 |
+
text = f"{company} faces challenges in the current market. " \
|
| 297 |
+
f"Quarterly results fell short of expectations. Investors express concern."
|
| 298 |
+
else:
|
| 299 |
+
text = f"{company} maintains steady performance in Q{i%4+1}. " \
|
| 300 |
+
f"Market reaction remains moderate. Company outlook unchanged."
|
| 301 |
+
|
| 302 |
+
yield {
|
| 303 |
+
'id': f'fin_{i:06d}',
|
| 304 |
+
'company': company,
|
| 305 |
+
'text': text,
|
| 306 |
+
'sentiment': sentiment,
|
| 307 |
+
'category': category,
|
| 308 |
+
'date': f'2024-{(i%12)+1:02d}-{(i%28)+1:02d}',
|
| 309 |
+
'stock_change': np.random.uniform(-10, 10),
|
| 310 |
+
'volume': np.random.randint(1000000, 10000000),
|
| 311 |
+
'market_cap': np.random.uniform(1e9, 100e9),
|
| 312 |
+
'sector': np.random.choice(['Tech', 'Finance', 'Retail', 'Energy', 'Healthcare'])
|
| 313 |
+
}
|
| 314 |
+
|
| 315 |
+
return Dataset.from_generator(gen)
|
| 316 |
+
|
| 317 |
+
print("\n💰 Finansal Dataset Oluşturuluyor...")
|
| 318 |
+
financial_dataset = generate_financial_dataset(2000)
|
| 319 |
+
|
| 320 |
+
print(f"✅ {len(financial_dataset)} finansal kayıt yüklendi")
|
| 321 |
+
print(f"\nÖrnek finansal kayıt:")
|
| 322 |
+
fin_sample = financial_dataset[0]
|
| 323 |
+
print(f" ID: {fin_sample['id']}")
|
| 324 |
+
print(f" Şirket: {fin_sample['company']}")
|
| 325 |
+
print(f" Sentiment: {fin_sample['sentiment']}")
|
| 326 |
+
print(f" Kategori: {fin_sample['category']}")
|
| 327 |
+
print(f" Hisse değişimi: {fin_sample['stock_change']:.2f}%")
|
| 328 |
+
print(f" Metin: {fin_sample['text'][:120]}...")
|
| 329 |
+
|
| 330 |
+
# Sentiment analizi
|
| 331 |
+
print("\n📊 Sentiment Dağılımı:")
|
| 332 |
+
sentiments = [ex['sentiment'] for ex in financial_dataset]
|
| 333 |
+
sent_counts = Counter(sentiments)
|
| 334 |
+
for sent, count in sent_counts.items():
|
| 335 |
+
pct = (count / len(financial_dataset)) * 100
|
| 336 |
+
print(f" {sent.capitalize()}: {count} ({pct:.1f}%)")
|
| 337 |
+
|
| 338 |
+
# Şirket bazlı analiz
|
| 339 |
+
print("\n🏢 Şirket Bazlı Analiz:")
|
| 340 |
+
companies = [ex['company'] for ex in financial_dataset]
|
| 341 |
+
company_counts = Counter(companies)
|
| 342 |
+
for company, count in company_counts.most_common():
|
| 343 |
+
avg_change = np.mean([ex['stock_change'] for ex in financial_dataset if ex['company'] == company])
|
| 344 |
+
print(f" {company}: {count} haber, ortalama değişim: {avg_change:+.2f}%")
|
| 345 |
+
|
| 346 |
+
# Finansal preprocessing
|
| 347 |
+
print("\n🔧 Finansal Text Preprocessing:")
|
| 348 |
+
|
| 349 |
+
def preprocess_financial_text(examples):
|
| 350 |
+
"""
|
| 351 |
+
Finansal metin için özel preprocessing
|
| 352 |
+
"""
|
| 353 |
+
def extract_numbers(text):
|
| 354 |
+
# Sayıları ve yüzdeleri çıkar
|
| 355 |
+
numbers = re.findall(r'\d+\.?\d*%?', text)
|
| 356 |
+
return numbers
|
| 357 |
+
|
| 358 |
+
def extract_financial_terms(text):
|
| 359 |
+
# Finansal terimleri say
|
| 360 |
+
terms = ['earnings', 'stock', 'market', 'quarterly', 'revenue',
|
| 361 |
+
'profit', 'loss', 'growth', 'decline']
|
| 362 |
+
count = sum([1 for term in terms if term in text.lower()])
|
| 363 |
+
return count
|
| 364 |
+
|
| 365 |
+
return {
|
| 366 |
+
'numbers_found': [extract_numbers(text) for text in examples['text']],
|
| 367 |
+
'financial_term_count': [extract_financial_terms(text) for text in examples['text']],
|
| 368 |
+
'text_length': [len(text) for text in examples['text']],
|
| 369 |
+
'has_percentage': ['%' in text for text in examples['text']]
|
| 370 |
+
}
|
| 371 |
+
|
| 372 |
+
financial_processed = financial_dataset.map(
|
| 373 |
+
preprocess_financial_text,
|
| 374 |
+
batched=True,
|
| 375 |
+
batch_size=500,
|
| 376 |
+
desc="Processing financial texts"
|
| 377 |
+
)
|
| 378 |
+
|
| 379 |
+
print(f"✅ {len(financial_processed)} finansal kayıt işlendi")
|
| 380 |
+
print(f"\nÖrnek analiz:")
|
| 381 |
+
print(f" Sayılar: {financial_processed[0]['numbers_found']}")
|
| 382 |
+
print(f" Finansal terim sayısı: {financial_processed[0]['financial_term_count']}")
|
| 383 |
+
print(f" Yüzde var mı: {financial_processed[0]['has_percentage']}")
|
| 384 |
+
|
| 385 |
+
|
| 386 |
+
print("\n" + "="*70)
|
| 387 |
+
print("4. TIBBİ/SAĞLIK DATASETS")
|
| 388 |
+
print("="*70)
|
| 389 |
+
|
| 390 |
+
# Sentetik tıbbi veri
|
| 391 |
+
def generate_medical_dataset(num_samples=1000):
|
| 392 |
+
"""
|
| 393 |
+
Tıbbi notlar ve tanılar
|
| 394 |
+
"""
|
| 395 |
+
conditions = ['Diabetes', 'Hypertension', 'Asthma', 'Arthritis', 'Migraine']
|
| 396 |
+
treatments = ['Medication', 'Physical Therapy', 'Surgery', 'Lifestyle Changes']
|
| 397 |
+
severities = ['mild', 'moderate', 'severe']
|
| 398 |
+
|
| 399 |
+
def gen():
|
| 400 |
+
for i in range(num_samples):
|
| 401 |
+
condition = np.random.choice(conditions)
|
| 402 |
+
treatment = np.random.choice(treatments)
|
| 403 |
+
severity = np.random.choice(severities)
|
| 404 |
+
|
| 405 |
+
# Tıbbi not
|
| 406 |
+
note = f"Patient presents with {severity} {condition.lower()}. " \
|
| 407 |
+
f"Symptoms include relevant clinical findings. " \
|
| 408 |
+
f"Recommended treatment: {treatment}. " \
|
| 409 |
+
f"Follow-up scheduled. Patient advised on preventive measures."
|
| 410 |
+
|
| 411 |
+
yield {
|
| 412 |
+
'id': f'med_{i:06d}',
|
| 413 |
+
'patient_id': f'P{i:05d}',
|
| 414 |
+
'condition': condition,
|
| 415 |
+
'severity': severity,
|
| 416 |
+
'treatment': treatment,
|
| 417 |
+
'note': note,
|
| 418 |
+
'age': np.random.randint(18, 90),
|
| 419 |
+
'gender': np.random.choice(['M', 'F']),
|
| 420 |
+
'visit_date': f'2024-{(i%12)+1:02d}-{(i%28)+1:02d}',
|
| 421 |
+
'diagnosis_confidence': np.random.uniform(0.7, 1.0),
|
| 422 |
+
'follow_up_required': np.random.choice([True, False])
|
| 423 |
+
}
|
| 424 |
+
|
| 425 |
+
return Dataset.from_generator(gen)
|
| 426 |
+
|
| 427 |
+
print("\n🏥 Tıbbi Dataset Oluşturuluyor...")
|
| 428 |
+
medical_dataset = generate_medical_dataset(2000)
|
| 429 |
+
|
| 430 |
+
print(f"✅ {len(medical_dataset)} tıbbi kayıt yüklendi")
|
| 431 |
+
print(f"\nÖrnek tıbbi kayıt:")
|
| 432 |
+
med_sample = medical_dataset[0]
|
| 433 |
+
print(f" ID: {med_sample['id']}")
|
| 434 |
+
print(f" Hasta ID: {med_sample['patient_id']}")
|
| 435 |
+
print(f" Durum: {med_sample['condition']}")
|
| 436 |
+
print(f" Şiddet: {med_sample['severity']}")
|
| 437 |
+
print(f" Tedavi: {med_sample['treatment']}")
|
| 438 |
+
print(f" Yaş: {med_sample['age']}")
|
| 439 |
+
print(f" Tanı güveni: {med_sample['diagnosis_confidence']:.2f}")
|
| 440 |
+
print(f" Not: {med_sample['note'][:100]}...")
|
| 441 |
+
|
| 442 |
+
# Durum dağılımı
|
| 443 |
+
print("\n📊 Tıbbi Durum Dağılımı:")
|
| 444 |
+
conditions = [ex['condition'] for ex in medical_dataset]
|
| 445 |
+
cond_counts = Counter(conditions)
|
| 446 |
+
for cond, count in cond_counts.most_common():
|
| 447 |
+
pct = (count / len(medical_dataset)) * 100
|
| 448 |
+
print(f" {cond}: {count} ({pct:.1f}%)")
|
| 449 |
+
|
| 450 |
+
# Şiddet analizi
|
| 451 |
+
print("\n⚠️ Şiddet Dağılımı:")
|
| 452 |
+
severities = [ex['severity'] for ex in medical_dataset]
|
| 453 |
+
sev_counts = Counter(severities)
|
| 454 |
+
for sev, count in sorted(sev_counts.items()):
|
| 455 |
+
pct = (count / len(medical_dataset)) * 100
|
| 456 |
+
print(f" {sev.capitalize()}: {count} ({pct:.1f}%)")
|
| 457 |
+
|
| 458 |
+
# Yaş grupları
|
| 459 |
+
print("\n👥 Yaş Grubu Analizi:")
|
| 460 |
+
ages = [ex['age'] for ex in medical_dataset]
|
| 461 |
+
age_groups = {
|
| 462 |
+
'18-30': sum([1 for age in ages if 18 <= age <= 30]),
|
| 463 |
+
'31-50': sum([1 for age in ages if 31 <= age <= 50]),
|
| 464 |
+
'51-70': sum([1 for age in ages if 51 <= age <= 70]),
|
| 465 |
+
'71+': sum([1 for age in ages if age > 70])
|
| 466 |
+
}
|
| 467 |
+
for group, count in age_groups.items():
|
| 468 |
+
pct = (count / len(ages)) * 100
|
| 469 |
+
print(f" {group}: {count} ({pct:.1f}%)")
|
| 470 |
+
|
| 471 |
+
# Tıbbi preprocessing
|
| 472 |
+
print("\n🔧 Tıbbi Text Preprocessing (PHI Removal):")
|
| 473 |
+
|
| 474 |
+
def preprocess_medical_text(examples):
|
| 475 |
+
"""
|
| 476 |
+
Tıbbi metin için özel preprocessing
|
| 477 |
+
PHI (Protected Health Information) temizleme simülasyonu
|
| 478 |
+
"""
|
| 479 |
+
def anonymize_text(text, patient_id):
|
| 480 |
+
# Hasta ID'lerini anonimleştir
|
| 481 |
+
text = text.replace(patient_id, '[PATIENT_ID]')
|
| 482 |
+
|
| 483 |
+
# Tarihleri anonimleştir
|
| 484 |
+
text = re.sub(r'\d{4}-\d{2}-\d{2}', '[DATE]', text)
|
| 485 |
+
|
| 486 |
+
return text
|
| 487 |
+
|
| 488 |
+
def extract_medical_entities(text):
|
| 489 |
+
# Tıbbi terimleri say (basit örnek)
|
| 490 |
+
terms = ['patient', 'symptoms', 'treatment', 'diagnosis',
|
| 491 |
+
'medication', 'therapy', 'condition']
|
| 492 |
+
count = sum([1 for term in terms if term in text.lower()])
|
| 493 |
+
return count
|
| 494 |
+
|
| 495 |
+
return {
|
| 496 |
+
'note_anonymized': [
|
| 497 |
+
anonymize_text(note, pid)
|
| 498 |
+
for note, pid in zip(examples['note'], examples['patient_id'])
|
| 499 |
+
],
|
| 500 |
+
'medical_entity_count': [extract_medical_entities(note) for note in examples['note']],
|
| 501 |
+
'note_length': [len(note) for note in examples['note']],
|
| 502 |
+
'requires_follow_up': examples['follow_up_required']
|
| 503 |
+
}
|
| 504 |
+
|
| 505 |
+
medical_processed = medical_dataset.map(
|
| 506 |
+
preprocess_medical_text,
|
| 507 |
+
batched=True,
|
| 508 |
+
batch_size=500,
|
| 509 |
+
desc="Anonymizing medical records"
|
| 510 |
+
)
|
| 511 |
+
|
| 512 |
+
print(f"✅ {len(medical_processed)} tıbbi kayıt anonimleştirildi")
|
| 513 |
+
print(f"\nÖrnek anonimleştirilmiş not:")
|
| 514 |
+
print(f" Orijinal: {medical_processed[0]['note'][:100]}...")
|
| 515 |
+
print(f" Anonimleştirilmiş: {medical_processed[0]['note_anonymized'][:100]}...")
|
| 516 |
+
print(f" Tıbbi entity sayısı: {medical_processed[0]['medical_entity_count']}")
|
| 517 |
+
|
| 518 |
+
|
| 519 |
+
print("\n" + "="*70)
|
| 520 |
+
print("5. DOMAIN-SPECIFIC TOKENIZATION")
|
| 521 |
+
print("="*70)
|
| 522 |
+
|
| 523 |
+
print("\n🔤 Domain-Specific Tokenization Stratejileri:")
|
| 524 |
+
|
| 525 |
+
# Bilimsel metin için
|
| 526 |
+
print("\n1️⃣ Bilimsel Metin Tokenization:")
|
| 527 |
+
scientific_sample = scientific_dataset[0]['abstract']
|
| 528 |
+
print(f" Orijinal: {scientific_sample[:80]}...")
|
| 529 |
+
|
| 530 |
+
# Basit word tokenization
|
| 531 |
+
words = scientific_sample.split()
|
| 532 |
+
print(f" Word tokens: {len(words)} kelime")
|
| 533 |
+
print(f" İlk 5 token: {words[:5]}")
|
| 534 |
+
|
| 535 |
+
# Sentence tokenization
|
| 536 |
+
sentences = scientific_sample.split('.')
|
| 537 |
+
print(f" Sentence tokens: {len([s for s in sentences if s.strip()])} cümle")
|
| 538 |
+
|
| 539 |
+
# Kod için
|
| 540 |
+
print("\n2️⃣ Kod Tokenization:")
|
| 541 |
+
code_sample = code_dataset[0]['code']
|
| 542 |
+
print(f" Kod:\n{code_sample}")
|
| 543 |
+
|
| 544 |
+
# Satır bazlı
|
| 545 |
+
lines = code_sample.split('\n')
|
| 546 |
+
print(f" Satır sayısı: {len(lines)}")
|
| 547 |
+
|
| 548 |
+
# Token bazlı (basit)
|
| 549 |
+
code_tokens = re.findall(r'\w+|[^\w\s]', code_sample)
|
| 550 |
+
print(f" Token sayısı: {len(code_tokens)}")
|
| 551 |
+
print(f" İlk 10 token: {code_tokens[:10]}")
|
| 552 |
+
|
| 553 |
+
|
| 554 |
+
print("\n" + "="*70)
|
| 555 |
+
print("6. CROSS-DOMAIN DATASET BİRLEŞTİRME")
|
| 556 |
+
print("="*70)
|
| 557 |
+
|
| 558 |
+
print("\n🔄 Farklı Domain'lerden Dataset Birleştirme:")
|
| 559 |
+
|
| 560 |
+
# Her domain'den küçük subset al
|
| 561 |
+
sci_subset = scientific_dataset.select(range(100))
|
| 562 |
+
code_subset = code_dataset.select(range(100))
|
| 563 |
+
fin_subset = financial_dataset.select(range(100))
|
| 564 |
+
|
| 565 |
+
# Ortak format'a çevir
|
| 566 |
+
def normalize_scientific(example):
|
| 567 |
+
return {
|
| 568 |
+
'text': example['abstract'],
|
| 569 |
+
'domain': 'scientific',
|
| 570 |
+
'metadata': {
|
| 571 |
+
'type': example['domain'],
|
| 572 |
+
'year': example['year']
|
| 573 |
+
}
|
| 574 |
+
}
|
| 575 |
+
|
| 576 |
+
def normalize_code(example):
|
| 577 |
+
return {
|
| 578 |
+
'text': example['code'],
|
| 579 |
+
'domain': 'code',
|
| 580 |
+
'metadata': {
|
| 581 |
+
'language': example['language'],
|
| 582 |
+
'lines': example['lines_of_code']
|
| 583 |
+
}
|
| 584 |
+
}
|
| 585 |
+
|
| 586 |
+
def normalize_financial(example):
|
| 587 |
+
return {
|
| 588 |
+
'text': example['text'],
|
| 589 |
+
'domain': 'financial',
|
| 590 |
+
'metadata': {
|
| 591 |
+
'sentiment': example['sentiment'],
|
| 592 |
+
'company': example['company']
|
| 593 |
+
}
|
| 594 |
+
}
|
| 595 |
+
|
| 596 |
+
print("\n📦 Dataset'leri normalize ediyoruz...")
|
| 597 |
+
sci_norm = sci_subset.map(normalize_scientific, remove_columns=sci_subset.column_names)
|
| 598 |
+
code_norm = code_subset.map(normalize_code, remove_columns=code_subset.column_names)
|
| 599 |
+
fin_norm = fin_subset.map(normalize_financial, remove_columns=fin_subset.column_names)
|
| 600 |
+
|
| 601 |
+
# Birleştir
|
| 602 |
+
from datasets import concatenate_datasets
|
| 603 |
+
multi_domain = concatenate_datasets([sci_norm, code_norm, fin_norm])
|
| 604 |
+
|
| 605 |
+
print(f"✅ Multi-domain dataset: {len(multi_domain)} örnek")
|
| 606 |
+
print(f"\nDomain dağılımı:")
|
| 607 |
+
domains = [ex['domain'] for ex in multi_domain]
|
| 608 |
+
domain_dist = Counter(domains)
|
| 609 |
+
for domain, count in domain_dist.items():
|
| 610 |
+
print(f" {domain}: {count}")
|
| 611 |
+
|
| 612 |
+
print(f"\nÖrnek multi-domain kayıtlar:")
|
| 613 |
+
for i in range(3):
|
| 614 |
+
ex = multi_domain[i * 100] # Her domain'den birer örnek
|
| 615 |
+
print(f"\n {i+1}. Domain: {ex['domain']}")
|
| 616 |
+
print(f" Text: {ex['text'][:80]}...")
|
| 617 |
+
print(f" Metadata: {ex['metadata']}")
|
| 618 |
+
|
| 619 |
+
|
| 620 |
+
print("\n" + "="*70)
|
| 621 |
+
print("7. DOMAIN ADAPTATION TEKNİKLERİ")
|
| 622 |
+
print("="*70)
|
| 623 |
+
|
| 624 |
+
print("\n🎯 Domain Adaptation Stratejileri:")
|
| 625 |
+
|
| 626 |
+
# Örnek: Genel domain'den specific domain'e transfer
|
| 627 |
+
print("\n1️⃣ Domain-Specific Vocabulary Analysis:")
|
| 628 |
+
|
| 629 |
+
def analyze_domain_vocabulary(dataset, text_column, domain_name):
|
| 630 |
+
"""
|
| 631 |
+
Domain-specific kelime dağarcığı analizi
|
| 632 |
+
"""
|
| 633 |
+
all_words = []
|
| 634 |
+
for example in dataset:
|
| 635 |
+
words = example[text_column].lower().split()
|
| 636 |
+
all_words.extend(words)
|
| 637 |
+
|
| 638 |
+
vocab_counts = Counter(all_words)
|
| 639 |
+
|
| 640 |
+
return {
|
| 641 |
+
'domain': domain_name,
|
| 642 |
+
'total_words': len(all_words),
|
| 643 |
+
'unique_words': len(vocab_counts),
|
| 644 |
+
'top_10_words': vocab_counts.most_common(10)
|
| 645 |
+
}
|
| 646 |
+
|
| 647 |
+
# Her domain için vocabulary analizi
|
| 648 |
+
sci_vocab = analyze_domain_vocabulary(
|
| 649 |
+
scientific_dataset.select(range(500)),
|
| 650 |
+
'abstract',
|
| 651 |
+
'Scientific'
|
| 652 |
+
)
|
| 653 |
+
code_vocab = analyze_domain_vocabulary(
|
| 654 |
+
code_dataset.select(range(500)),
|
| 655 |
+
'code',
|
| 656 |
+
'Code'
|
| 657 |
+
)
|
| 658 |
+
fin_vocab = analyze_domain_vocabulary(
|
| 659 |
+
financial_dataset.select(range(500)),
|
| 660 |
+
'text',
|
| 661 |
+
'Financial'
|
| 662 |
+
)
|
| 663 |
+
|
| 664 |
+
print("\n📚 Domain Vocabulary İstatistikleri:")
|
| 665 |
+
for vocab in [sci_vocab, code_vocab, fin_vocab]:
|
| 666 |
+
print(f"\n {vocab['domain']}:")
|
| 667 |
+
print(f" Toplam kelime: {vocab['total_words']:,}")
|
| 668 |
+
print(f" Benzersiz kelime: {vocab['unique_words']:,}")
|
| 669 |
+
print(f" Vocabulary zenginliği: {vocab['unique_words']/vocab['total_words']:.3f}")
|
| 670 |
+
print(f" Top 5 kelime: {[w for w, c in vocab['top_10_words'][:5]]}")
|
| 671 |
+
|
| 672 |
+
|
| 673 |
+
print("\n2️⃣ Domain-Specific Data Augmentation:")
|
| 674 |
+
|
| 675 |
+
def augment_scientific_text(example):
|
| 676 |
+
"""
|
| 677 |
+
Bilimsel metin için data augmentation
|
| 678 |
+
"""
|
| 679 |
+
text = example['abstract']
|
| 680 |
+
|
| 681 |
+
# Synonym replacement (basit simülasyon)
|
| 682 |
+
augmented = text.replace('novel', 'innovative')
|
| 683 |
+
augmented = augmented.replace('propose', 'present')
|
| 684 |
+
augmented = augmented.replace('demonstrate', 'show')
|
| 685 |
+
|
| 686 |
+
return {
|
| 687 |
+
**example,
|
| 688 |
+
'abstract_augmented': augmented
|
| 689 |
+
}
|
| 690 |
+
|
| 691 |
+
print("\n Bilimsel metin augmentation örneği:")
|
| 692 |
+
aug_sample = augment_scientific_text(scientific_dataset[0])
|
| 693 |
+
print(f" Original: {aug_sample['abstract'][:100]}...")
|
| 694 |
+
print(f" Augmented: {aug_sample['abstract_augmented'][:100]}...")
|
| 695 |
+
|
| 696 |
+
|
| 697 |
+
print("\n3️⃣ Domain-Specific Filtering:")
|
| 698 |
+
|
| 699 |
+
def filter_high_quality_scientific(example):
|
| 700 |
+
"""
|
| 701 |
+
Yüksek kaliteli bilimsel makaleleri filtrele
|
| 702 |
+
"""
|
| 703 |
+
return (
|
| 704 |
+
example['citations'] > 50 and # Çok atıf almış
|
| 705 |
+
example['year'] >= 2020 and # Son yıllarda yayınlanmış
|
| 706 |
+
len(example['abstract'].split()) > 100 # Detaylı abstract
|
| 707 |
+
)
|
| 708 |
+
|
| 709 |
+
high_quality_sci = scientific_dataset.filter(
|
| 710 |
+
filter_high_quality_scientific,
|
| 711 |
+
desc="Filtering high-quality papers"
|
| 712 |
+
)
|
| 713 |
+
|
| 714 |
+
print(f"\n Kaliteli makale filtreleme:")
|
| 715 |
+
print(f" Orijinal: {len(scientific_dataset)} makale")
|
| 716 |
+
print(f" Filtrelenmiş: {len(high_quality_sci)} makale")
|
| 717 |
+
print(f" Oran: {len(high_quality_sci)/len(scientific_dataset)*100:.1f}%")
|
| 718 |
+
|
| 719 |
+
|
| 720 |
+
print("\n" + "="*70)
|
| 721 |
+
print("8. DOMAIN-SPECIFIC EVALUATION METRİKLERİ")
|
| 722 |
+
print("="*70)
|
| 723 |
+
|
| 724 |
+
print("\n📊 Domain-Specific Kalite Metrikleri:")
|
| 725 |
+
|
| 726 |
+
def calculate_domain_metrics(dataset, domain_name):
|
| 727 |
+
"""
|
| 728 |
+
Domain-specific kalite metrikleri
|
| 729 |
+
"""
|
| 730 |
+
if domain_name == 'scientific':
|
| 731 |
+
# Bilimsel metrikler
|
| 732 |
+
avg_citations = np.mean([ex['citations'] for ex in dataset])
|
| 733 |
+
avg_authors = np.mean([len(ex['authors']) for ex in dataset])
|
| 734 |
+
recent_papers = sum([1 for ex in dataset if ex['year'] >= 2020])
|
| 735 |
+
|
| 736 |
+
return {
|
| 737 |
+
'domain': domain_name,
|
| 738 |
+
'avg_citations': avg_citations,
|
| 739 |
+
'avg_authors': avg_authors,
|
| 740 |
+
'recent_ratio': recent_papers / len(dataset)
|
| 741 |
+
}
|
| 742 |
+
|
| 743 |
+
elif domain_name == 'code':
|
| 744 |
+
# Kod metrikleri
|
| 745 |
+
avg_loc = np.mean([ex['lines_of_code'] for ex in dataset])
|
| 746 |
+
has_doc = sum([1 for ex in dataset if ex['has_docstring']])
|
| 747 |
+
high_stars = sum([1 for ex in dataset if ex['stars'] > 1000])
|
| 748 |
+
|
| 749 |
+
return {
|
| 750 |
+
'domain': domain_name,
|
| 751 |
+
'avg_lines_of_code': avg_loc,
|
| 752 |
+
'documentation_ratio': has_doc / len(dataset),
|
| 753 |
+
'popular_ratio': high_stars / len(dataset)
|
| 754 |
+
}
|
| 755 |
+
|
| 756 |
+
elif domain_name == 'financial':
|
| 757 |
+
# Finansal metrikler
|
| 758 |
+
sentiments = [ex['sentiment'] for ex in dataset]
|
| 759 |
+
sent_dist = Counter(sentiments)
|
| 760 |
+
avg_change = np.mean([ex['stock_change'] for ex in dataset])
|
| 761 |
+
|
| 762 |
+
return {
|
| 763 |
+
'domain': domain_name,
|
| 764 |
+
'sentiment_distribution': dict(sent_dist),
|
| 765 |
+
'avg_stock_change': avg_change,
|
| 766 |
+
'volatility': np.std([ex['stock_change'] for ex in dataset])
|
| 767 |
+
}
|
| 768 |
+
|
| 769 |
+
print("\n1️⃣ Scientific Metrics:")
|
| 770 |
+
sci_metrics = calculate_domain_metrics(scientific_dataset, 'scientific')
|
| 771 |
+
for key, value in sci_metrics.items():
|
| 772 |
+
print(f" {key}: {value}")
|
| 773 |
+
|
| 774 |
+
print("\n2️⃣ Code Metrics:")
|
| 775 |
+
code_metrics = calculate_domain_metrics(code_dataset, 'code')
|
| 776 |
+
for key, value in code_metrics.items():
|
| 777 |
+
print(f" {key}: {value}")
|
| 778 |
+
|
| 779 |
+
print("\n3️⃣ Financial Metrics:")
|
| 780 |
+
fin_metrics = calculate_domain_metrics(financial_dataset, 'financial')
|
| 781 |
+
for key, value in fin_metrics.items():
|
| 782 |
+
print(f" {key}: {value}")
|
| 783 |
+
|
| 784 |
+
|
| 785 |
+
print("\n" + "="*70)
|
| 786 |
+
print("9. BEST PRACTICES - DOMAIN-SPECIFIC DATASETS")
|
| 787 |
+
print("="*70)
|
| 788 |
+
|
| 789 |
+
print("""
|
| 790 |
+
✅ BİLİMSEL DATASETS:
|
| 791 |
+
- Citation metadata ekle
|
| 792 |
+
- Abstract + full text ayrımı
|
| 793 |
+
- Domain/field classification
|
| 794 |
+
- Author disambiguation
|
| 795 |
+
- Reference parsing
|
| 796 |
+
- LaTeX formül handling
|
| 797 |
+
|
| 798 |
+
✅ KOD DATASETS:
|
| 799 |
+
- Programlama dili ayrımı
|
| 800 |
+
- Syntax parsing
|
| 801 |
+
- Docstring extraction
|
| 802 |
+
- Repository metadata
|
| 803 |
+
- License bilgisi
|
| 804 |
+
- Code quality metrics (complexity, coverage)
|
| 805 |
+
|
| 806 |
+
✅ FİNANSAL DATASETS:
|
| 807 |
+
- Sentiment annotation
|
| 808 |
+
- Entity recognition (companies, people)
|
| 809 |
+
- Temporal information
|
| 810 |
+
- Numerical data extraction
|
| 811 |
+
- Market data integration
|
| 812 |
+
- Real-time updates
|
| 813 |
+
|
| 814 |
+
✅ TIBBİ DATASETS:
|
| 815 |
+
- PHI (Protected Health Information) removal
|
| 816 |
+
- HIPAA compliance
|
| 817 |
+
- Clinical terminology standardization
|
| 818 |
+
- ICD code mapping
|
| 819 |
+
- Anonymization
|
| 820 |
+
- Ethical considerations
|
| 821 |
+
|
| 822 |
+
✅ GENEL PRENSİPLER:
|
| 823 |
+
- Domain expertise gerekir
|
| 824 |
+
- Specialized tokenization
|
| 825 |
+
- Domain-specific validation
|
| 826 |
+
- Quality filtering
|
| 827 |
+
- Ethical guidelines takip et
|
| 828 |
+
- License ve copyright kontrol et
|
| 829 |
+
|
| 830 |
+
✅ DATA QUALITY:
|
| 831 |
+
- Domain experts ile validate et
|
| 832 |
+
- Inter-annotator agreement hesapla
|
| 833 |
+
- Bias analysis yap
|
| 834 |
+
- Coverage analysis
|
| 835 |
+
- Statistical validation
|
| 836 |
+
- Regular updates
|
| 837 |
+
""")
|
| 838 |
+
|
| 839 |
+
|
| 840 |
+
print("\n" + "="*70)
|
| 841 |
+
print("✅ BÖLÜM 2 TAMAMLANDI!")
|
| 842 |
+
print("="*70)
|
| 843 |
+
print(f"""
|
| 844 |
+
Bu bölümde öğrendikleriniz:
|
| 845 |
+
✓ Bilimsel makale datasets ({len(scientific_dataset)} örnek)
|
| 846 |
+
✓ Kod datasets ({len(code_dataset)} örnek)
|
| 847 |
+
✓ Finansal analiz datasets ({len(financial_dataset)} örnek)
|
| 848 |
+
✓ Tıbbi/sağlık datasets ({len(medical_dataset)} örnek)
|
| 849 |
+
✓ Domain-specific preprocessing
|
| 850 |
+
✓ Cross-domain dataset birleştirme
|
| 851 |
+
✓ Domain adaptation teknikleri
|
| 852 |
+
✓ Domain-specific evaluation metrikleri
|
| 853 |
+
|
| 854 |
+
📊 ÜRETİLEN DATASETS:
|
| 855 |
+
- Scientific: {len(scientific_dataset):,} makale
|
| 856 |
+
- Code: {len(code_dataset):,} kod örneği
|
| 857 |
+
- Financial: {len(financial_dataset):,} finansal kayıt
|
| 858 |
+
- Medical: {len(medical_dataset):,} tıbbi kayıt
|
| 859 |
+
- Multi-domain: {len(multi_domain):,} birleştirilmiş örnek
|
| 860 |
+
|
| 861 |
+
📚 SONRAKI BÖLÜM: İleri Teknikler
|
| 862 |
+
- Dataset streaming (büyük datasets için)
|
| 863 |
+
- Custom data collators
|
| 864 |
+
- Feature extraction ve transformation
|
| 865 |
+
- Dataset preprocessing pipelines
|
| 866 |
+
- Advanced filtering strategies
|
| 867 |
+
""")
|
| 868 |
+
|
| 869 |
+
print("\n🚀 Harika! İkinci bölümü tamamladık!")
|
| 870 |
+
print("Üçüncü bölüme (İleri Teknikler) geçelim mi?")
|
space/modules/02b_cross_domain_fix.py
ADDED
|
@@ -0,0 +1,498 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
CROSS-DOMAIN DATASET BİRLEŞTİRME - DOĞRU YÖNTEM
|
| 3 |
+
===============================================
|
| 4 |
+
|
| 5 |
+
Bu modül, farklı domain'lerden dataset'leri birleştirirken
|
| 6 |
+
karşılaşılan schema mismatch problemini çözer ve best practices gösterir.
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
from datasets import Dataset, concatenate_datasets
|
| 10 |
+
import numpy as np
|
| 11 |
+
import json
|
| 12 |
+
|
| 13 |
+
print("="*70)
|
| 14 |
+
print("🔧 CROSS-DOMAIN DATASET BİRLEŞTİRME - PROBLEM VE ÇÖZÜM")
|
| 15 |
+
print("="*70)
|
| 16 |
+
|
| 17 |
+
# Sentetik dataset'ler oluştur
|
| 18 |
+
def generate_scientific_papers(num_samples=100):
|
| 19 |
+
def gen():
|
| 20 |
+
for i in range(num_samples):
|
| 21 |
+
yield {
|
| 22 |
+
'id': f'sci_{i}',
|
| 23 |
+
'abstract': f'Scientific text {i}',
|
| 24 |
+
'domain': 'Physics',
|
| 25 |
+
'year': 2020 + (i % 5)
|
| 26 |
+
}
|
| 27 |
+
return Dataset.from_generator(gen)
|
| 28 |
+
|
| 29 |
+
def generate_code_dataset(num_samples=100):
|
| 30 |
+
def gen():
|
| 31 |
+
for i in range(num_samples):
|
| 32 |
+
yield {
|
| 33 |
+
'id': f'code_{i}',
|
| 34 |
+
'code': f'def func_{i}(): pass',
|
| 35 |
+
'language': 'Python',
|
| 36 |
+
'lines_of_code': 5
|
| 37 |
+
}
|
| 38 |
+
return Dataset.from_generator(gen)
|
| 39 |
+
|
| 40 |
+
def generate_financial_dataset(num_samples=100):
|
| 41 |
+
def gen():
|
| 42 |
+
for i in range(num_samples):
|
| 43 |
+
yield {
|
| 44 |
+
'id': f'fin_{i}',
|
| 45 |
+
'text': f'Company {i} reports earnings',
|
| 46 |
+
'sentiment': 'positive',
|
| 47 |
+
'company': f'Corp{i}'
|
| 48 |
+
}
|
| 49 |
+
return Dataset.from_generator(gen)
|
| 50 |
+
|
| 51 |
+
print("\n📚 Sample Datasets Oluşturuluyor...")
|
| 52 |
+
sci_dataset = generate_scientific_papers(100)
|
| 53 |
+
code_dataset = generate_code_dataset(100)
|
| 54 |
+
fin_dataset = generate_financial_dataset(100)
|
| 55 |
+
|
| 56 |
+
print(f"✅ Scientific: {len(sci_dataset)} örnekler")
|
| 57 |
+
print(f" Kolonlar: {sci_dataset.column_names}")
|
| 58 |
+
print(f"✅ Code: {len(code_dataset)} örnekler")
|
| 59 |
+
print(f" Kolonlar: {code_dataset.column_names}")
|
| 60 |
+
print(f"✅ Financial: {len(fin_dataset)} örnekler")
|
| 61 |
+
print(f" Kolonlar: {fin_dataset.column_names}")
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
print("\n" + "="*70)
|
| 65 |
+
print("❌ PROBLEM: YANLIŞ YÖNTEM")
|
| 66 |
+
print("="*70)
|
| 67 |
+
|
| 68 |
+
print("""
|
| 69 |
+
Hatalı Yaklaşım:
|
| 70 |
+
- Her dataset farklı metadata structure'ı
|
| 71 |
+
- Schema mismatch hatası
|
| 72 |
+
- Arrow type error
|
| 73 |
+
|
| 74 |
+
Örnek hatalı kod:
|
| 75 |
+
metadata: {'type': domain, 'year': year} # Scientific
|
| 76 |
+
metadata: {'language': lang, 'lines': loc} # Code
|
| 77 |
+
metadata: {'sentiment': sent, 'company': comp} # Financial
|
| 78 |
+
|
| 79 |
+
❌ concatenate_datasets() çalışmaz!
|
| 80 |
+
""")
|
| 81 |
+
|
| 82 |
+
|
| 83 |
+
print("\n" + "="*70)
|
| 84 |
+
print("✅ ÇÖZÜM 1: ORTAK SCHEMA - FLATTEN APPROACH")
|
| 85 |
+
print("="*70)
|
| 86 |
+
|
| 87 |
+
print("\n🔧 Tüm alanları flatten edelim (en basit çözüm):")
|
| 88 |
+
|
| 89 |
+
def normalize_to_flat_schema(example, domain_type):
|
| 90 |
+
"""
|
| 91 |
+
Tüm alanları ayrı kolonlara çıkar
|
| 92 |
+
Missing değerler için None kullan
|
| 93 |
+
"""
|
| 94 |
+
base = {
|
| 95 |
+
'id': example.get('id', ''),
|
| 96 |
+
'text': '',
|
| 97 |
+
'domain': domain_type,
|
| 98 |
+
# Scientific fields
|
| 99 |
+
'abstract': None,
|
| 100 |
+
'sci_domain': None,
|
| 101 |
+
'year': None,
|
| 102 |
+
# Code fields
|
| 103 |
+
'code': None,
|
| 104 |
+
'language': None,
|
| 105 |
+
'lines_of_code': None,
|
| 106 |
+
# Financial fields
|
| 107 |
+
'sentiment': None,
|
| 108 |
+
'company': None,
|
| 109 |
+
}
|
| 110 |
+
|
| 111 |
+
# Domain'e göre doldur
|
| 112 |
+
if domain_type == 'scientific':
|
| 113 |
+
base['text'] = example.get('abstract', '')
|
| 114 |
+
base['abstract'] = example.get('abstract', '')
|
| 115 |
+
base['sci_domain'] = example.get('domain', '')
|
| 116 |
+
base['year'] = example.get('year', None)
|
| 117 |
+
elif domain_type == 'code':
|
| 118 |
+
base['text'] = example.get('code', '')
|
| 119 |
+
base['code'] = example.get('code', '')
|
| 120 |
+
base['language'] = example.get('language', '')
|
| 121 |
+
base['lines_of_code'] = example.get('lines_of_code', None)
|
| 122 |
+
elif domain_type == 'financial':
|
| 123 |
+
base['text'] = example.get('text', '')
|
| 124 |
+
base['sentiment'] = example.get('sentiment', '')
|
| 125 |
+
base['company'] = example.get('company', '')
|
| 126 |
+
|
| 127 |
+
return base
|
| 128 |
+
|
| 129 |
+
# Her dataset'i normalize et
|
| 130 |
+
print(" Normalizing scientific dataset...")
|
| 131 |
+
sci_flat = sci_dataset.map(
|
| 132 |
+
lambda x: normalize_to_flat_schema(x, 'scientific'),
|
| 133 |
+
remove_columns=sci_dataset.column_names,
|
| 134 |
+
desc="Flattening scientific"
|
| 135 |
+
)
|
| 136 |
+
|
| 137 |
+
print(" Normalizing code dataset...")
|
| 138 |
+
code_flat = code_dataset.map(
|
| 139 |
+
lambda x: normalize_to_flat_schema(x, 'code'),
|
| 140 |
+
remove_columns=code_dataset.column_names,
|
| 141 |
+
desc="Flattening code"
|
| 142 |
+
)
|
| 143 |
+
|
| 144 |
+
print(" Normalizing financial dataset...")
|
| 145 |
+
fin_flat = fin_dataset.map(
|
| 146 |
+
lambda x: normalize_to_flat_schema(x, 'financial'),
|
| 147 |
+
remove_columns=fin_dataset.column_names,
|
| 148 |
+
desc="Flattening financial"
|
| 149 |
+
)
|
| 150 |
+
|
| 151 |
+
# Şimdi birleştir - ÇALIŞIR!
|
| 152 |
+
print("\n✅ Birleştiriliyor...")
|
| 153 |
+
multi_domain_flat = concatenate_datasets([sci_flat, code_flat, fin_flat])
|
| 154 |
+
|
| 155 |
+
print(f"\n🎉 BAŞARILI! Multi-domain dataset: {len(multi_domain_flat)} örnek")
|
| 156 |
+
print(f"Kolonlar: {multi_domain_flat.column_names}")
|
| 157 |
+
|
| 158 |
+
# Örnekleri göster
|
| 159 |
+
print("\n📊 Her domain'den örnek:")
|
| 160 |
+
print("\n1. Scientific örnek:")
|
| 161 |
+
sci_ex = multi_domain_flat[0]
|
| 162 |
+
print(f" Domain: {sci_ex['domain']}")
|
| 163 |
+
print(f" Text: {sci_ex['text'][:50]}...")
|
| 164 |
+
print(f" Year: {sci_ex['year']}")
|
| 165 |
+
print(f" Language: {sci_ex['language']}") # None olmalı
|
| 166 |
+
|
| 167 |
+
print("\n2. Code örnek:")
|
| 168 |
+
code_ex = multi_domain_flat[100]
|
| 169 |
+
print(f" Domain: {code_ex['domain']}")
|
| 170 |
+
print(f" Text: {code_ex['text'][:50]}...")
|
| 171 |
+
print(f" Language: {code_ex['language']}")
|
| 172 |
+
print(f" Year: {code_ex['year']}") # None olmalı
|
| 173 |
+
|
| 174 |
+
print("\n3. Financial örnek:")
|
| 175 |
+
fin_ex = multi_domain_flat[200]
|
| 176 |
+
print(f" Domain: {fin_ex['domain']}")
|
| 177 |
+
print(f" Text: {fin_ex['text'][:50]}...")
|
| 178 |
+
print(f" Sentiment: {fin_ex['sentiment']}")
|
| 179 |
+
print(f" Company: {fin_ex['company']}")
|
| 180 |
+
|
| 181 |
+
|
| 182 |
+
print("\n" + "="*70)
|
| 183 |
+
print("✅ ÇÖZÜM 2: JSON METADATA - FLEXIBLE APPROACH")
|
| 184 |
+
print("="*70)
|
| 185 |
+
|
| 186 |
+
print("\n🔧 Metadata'yı JSON string olarak sakla (daha esnek):")
|
| 187 |
+
|
| 188 |
+
def normalize_to_json_schema(example, domain_type):
|
| 189 |
+
"""
|
| 190 |
+
Domain-specific metadata'yı JSON string olarak sakla
|
| 191 |
+
Bu yaklaşım daha esnek ve genişletilebilir
|
| 192 |
+
"""
|
| 193 |
+
base = {
|
| 194 |
+
'id': example.get('id', ''),
|
| 195 |
+
'text': '',
|
| 196 |
+
'domain': domain_type,
|
| 197 |
+
'metadata_json': ''
|
| 198 |
+
}
|
| 199 |
+
|
| 200 |
+
metadata = {}
|
| 201 |
+
|
| 202 |
+
if domain_type == 'scientific':
|
| 203 |
+
base['text'] = example.get('abstract', '')
|
| 204 |
+
metadata = {
|
| 205 |
+
'domain': example.get('domain', ''),
|
| 206 |
+
'year': example.get('year', None)
|
| 207 |
+
}
|
| 208 |
+
elif domain_type == 'code':
|
| 209 |
+
base['text'] = example.get('code', '')
|
| 210 |
+
metadata = {
|
| 211 |
+
'language': example.get('language', ''),
|
| 212 |
+
'lines_of_code': example.get('lines_of_code', None)
|
| 213 |
+
}
|
| 214 |
+
elif domain_type == 'financial':
|
| 215 |
+
base['text'] = example.get('text', '')
|
| 216 |
+
metadata = {
|
| 217 |
+
'sentiment': example.get('sentiment', ''),
|
| 218 |
+
'company': example.get('company', '')
|
| 219 |
+
}
|
| 220 |
+
|
| 221 |
+
base['metadata_json'] = json.dumps(metadata)
|
| 222 |
+
return base
|
| 223 |
+
|
| 224 |
+
# Normalize
|
| 225 |
+
print(" Normalizing with JSON metadata...")
|
| 226 |
+
sci_json = sci_dataset.map(
|
| 227 |
+
lambda x: normalize_to_json_schema(x, 'scientific'),
|
| 228 |
+
remove_columns=sci_dataset.column_names
|
| 229 |
+
)
|
| 230 |
+
code_json = code_dataset.map(
|
| 231 |
+
lambda x: normalize_to_json_schema(x, 'code'),
|
| 232 |
+
remove_columns=code_dataset.column_names
|
| 233 |
+
)
|
| 234 |
+
fin_json = fin_dataset.map(
|
| 235 |
+
lambda x: normalize_to_json_schema(x, 'financial'),
|
| 236 |
+
remove_columns=fin_dataset.column_names
|
| 237 |
+
)
|
| 238 |
+
|
| 239 |
+
# Birleştir
|
| 240 |
+
multi_domain_json = concatenate_datasets([sci_json, code_json, fin_json])
|
| 241 |
+
|
| 242 |
+
print(f"\n✅ Multi-domain (JSON): {len(multi_domain_json)} örnek")
|
| 243 |
+
print(f"Kolonlar: {multi_domain_json.column_names}")
|
| 244 |
+
|
| 245 |
+
# Metadata'yı parse et
|
| 246 |
+
print("\n📊 JSON Metadata Örnekleri:")
|
| 247 |
+
for i, idx in enumerate([0, 100, 200]):
|
| 248 |
+
ex = multi_domain_json[idx]
|
| 249 |
+
metadata = json.loads(ex['metadata_json'])
|
| 250 |
+
print(f"\n{i+1}. {ex['domain'].capitalize()}:")
|
| 251 |
+
print(f" Text: {ex['text'][:50]}...")
|
| 252 |
+
print(f" Metadata: {metadata}")
|
| 253 |
+
|
| 254 |
+
|
| 255 |
+
print("\n" + "="*70)
|
| 256 |
+
print("✅ ÇÖZÜM 3: SEPARATE TABLES - DATABASE APPROACH")
|
| 257 |
+
print("="*70)
|
| 258 |
+
|
| 259 |
+
print("""
|
| 260 |
+
🗄️ Database-style Approach:
|
| 261 |
+
|
| 262 |
+
Ana tablo (unified):
|
| 263 |
+
- id
|
| 264 |
+
- text
|
| 265 |
+
- domain
|
| 266 |
+
- reference_id
|
| 267 |
+
|
| 268 |
+
Domain-specific tablolar:
|
| 269 |
+
- scientific_metadata: reference_id -> {year, domain, ...}
|
| 270 |
+
- code_metadata: reference_id -> {language, lines, ...}
|
| 271 |
+
- financial_metadata: reference_id -> {sentiment, company, ...}
|
| 272 |
+
|
| 273 |
+
장점:
|
| 274 |
+
✓ Schema flexibility
|
| 275 |
+
✓ Easy to extend
|
| 276 |
+
✓ Efficient storage
|
| 277 |
+
✓ Type safety
|
| 278 |
+
|
| 279 |
+
단점:
|
| 280 |
+
✗ Join gerekir
|
| 281 |
+
✗ Daha kompleks
|
| 282 |
+
""")
|
| 283 |
+
|
| 284 |
+
# Simple implementation
|
| 285 |
+
def create_separated_tables(datasets_dict):
|
| 286 |
+
"""
|
| 287 |
+
Ana tablo + ayrı metadata tabloları
|
| 288 |
+
"""
|
| 289 |
+
# Ana tablo
|
| 290 |
+
unified = []
|
| 291 |
+
metadata_tables = {
|
| 292 |
+
'scientific': [],
|
| 293 |
+
'code': [],
|
| 294 |
+
'financial': []
|
| 295 |
+
}
|
| 296 |
+
|
| 297 |
+
ref_id = 0
|
| 298 |
+
|
| 299 |
+
# Scientific
|
| 300 |
+
for ex in datasets_dict['scientific']:
|
| 301 |
+
unified.append({
|
| 302 |
+
'id': ex['id'],
|
| 303 |
+
'text': ex['abstract'],
|
| 304 |
+
'domain': 'scientific',
|
| 305 |
+
'reference_id': ref_id
|
| 306 |
+
})
|
| 307 |
+
metadata_tables['scientific'].append({
|
| 308 |
+
'reference_id': ref_id,
|
| 309 |
+
'sci_domain': ex['domain'],
|
| 310 |
+
'year': ex['year']
|
| 311 |
+
})
|
| 312 |
+
ref_id += 1
|
| 313 |
+
|
| 314 |
+
# Code
|
| 315 |
+
for ex in datasets_dict['code']:
|
| 316 |
+
unified.append({
|
| 317 |
+
'id': ex['id'],
|
| 318 |
+
'text': ex['code'],
|
| 319 |
+
'domain': 'code',
|
| 320 |
+
'reference_id': ref_id
|
| 321 |
+
})
|
| 322 |
+
metadata_tables['code'].append({
|
| 323 |
+
'reference_id': ref_id,
|
| 324 |
+
'language': ex['language'],
|
| 325 |
+
'lines_of_code': ex['lines_of_code']
|
| 326 |
+
})
|
| 327 |
+
ref_id += 1
|
| 328 |
+
|
| 329 |
+
# Financial
|
| 330 |
+
for ex in datasets_dict['financial']:
|
| 331 |
+
unified.append({
|
| 332 |
+
'id': ex['id'],
|
| 333 |
+
'text': ex['text'],
|
| 334 |
+
'domain': 'financial',
|
| 335 |
+
'reference_id': ref_id
|
| 336 |
+
})
|
| 337 |
+
metadata_tables['financial'].append({
|
| 338 |
+
'reference_id': ref_id,
|
| 339 |
+
'sentiment': ex['sentiment'],
|
| 340 |
+
'company': ex['company']
|
| 341 |
+
})
|
| 342 |
+
ref_id += 1
|
| 343 |
+
|
| 344 |
+
return {
|
| 345 |
+
'unified': Dataset.from_dict({k: [d[k] for d in unified] for k in unified[0].keys()}),
|
| 346 |
+
'metadata': {k: Dataset.from_dict({k: [d[k] for d in v] for k in v[0].keys()})
|
| 347 |
+
for k, v in metadata_tables.items()}
|
| 348 |
+
}
|
| 349 |
+
|
| 350 |
+
print("\n🔧 Creating separated tables...")
|
| 351 |
+
separated = create_separated_tables({
|
| 352 |
+
'scientific': sci_dataset,
|
| 353 |
+
'code': code_dataset,
|
| 354 |
+
'financial': fin_dataset
|
| 355 |
+
})
|
| 356 |
+
|
| 357 |
+
print(f"\n✅ Unified table: {len(separated['unified'])} records")
|
| 358 |
+
print(f" Columns: {separated['unified'].column_names}")
|
| 359 |
+
|
| 360 |
+
for domain, meta_table in separated['metadata'].items():
|
| 361 |
+
print(f"\n✅ {domain.capitalize()} metadata: {len(meta_table)} records")
|
| 362 |
+
print(f" Columns: {meta_table.column_names}")
|
| 363 |
+
|
| 364 |
+
# Join örneği
|
| 365 |
+
print("\n🔗 Join Example - Scientific record:")
|
| 366 |
+
unified_ex = separated['unified'][0]
|
| 367 |
+
ref_id = unified_ex['reference_id']
|
| 368 |
+
sci_meta = [ex for ex in separated['metadata']['scientific'] if ex['reference_id'] == ref_id][0]
|
| 369 |
+
|
| 370 |
+
print(f" Main table: {unified_ex}")
|
| 371 |
+
print(f" Metadata: {sci_meta}")
|
| 372 |
+
|
| 373 |
+
|
| 374 |
+
print("\n" + "="*70)
|
| 375 |
+
print("📚 BEST PRACTICES - CROSS-DOMAIN DATASETS")
|
| 376 |
+
print("="*70)
|
| 377 |
+
|
| 378 |
+
print("""
|
| 379 |
+
✅ FLATTEN APPROACH:
|
| 380 |
+
장점:
|
| 381 |
+
- En basit yöntem
|
| 382 |
+
- Hızlı erişim
|
| 383 |
+
- Tüm veriler bir yerde
|
| 384 |
+
단점:
|
| 385 |
+
- Çok fazla None değer (sparse)
|
| 386 |
+
- Schema değişikliği zor
|
| 387 |
+
- Memory inefficient
|
| 388 |
+
|
| 389 |
+
Ne zaman kullan:
|
| 390 |
+
- Az sayıda domain
|
| 391 |
+
- Benzer field'lar
|
| 392 |
+
- Simple queries
|
| 393 |
+
|
| 394 |
+
✅ JSON METADATA APPROACH:
|
| 395 |
+
장점:
|
| 396 |
+
- Esnek schema
|
| 397 |
+
- Kolay extend
|
| 398 |
+
- Daha az None
|
| 399 |
+
단점:
|
| 400 |
+
- Parse gerekir
|
| 401 |
+
- Type safety yok
|
| 402 |
+
- Query daha yavaş
|
| 403 |
+
|
| 404 |
+
Ne zaman kullan:
|
| 405 |
+
- Çok farklı domain'ler
|
| 406 |
+
- Sık schema değişikliği
|
| 407 |
+
- Prototype/exploration
|
| 408 |
+
|
| 409 |
+
✅ SEPARATE TABLES APPROACH:
|
| 410 |
+
장점:
|
| 411 |
+
- Temiz schema
|
| 412 |
+
- Type safe
|
| 413 |
+
- Efficient storage
|
| 414 |
+
- Professional approach
|
| 415 |
+
단점:
|
| 416 |
+
- Join gerekir
|
| 417 |
+
- Daha kompleks
|
| 418 |
+
- Setup overhead
|
| 419 |
+
|
| 420 |
+
Ne zaman kullan:
|
| 421 |
+
- Production systems
|
| 422 |
+
- Çok domain
|
| 423 |
+
- Complex queries
|
| 424 |
+
- Large scale
|
| 425 |
+
|
| 426 |
+
✅ HYBRID APPROACH:
|
| 427 |
+
- Common fields flatten
|
| 428 |
+
- Rare fields JSON
|
| 429 |
+
- Best of both worlds
|
| 430 |
+
|
| 431 |
+
Örnek:
|
| 432 |
+
{
|
| 433 |
+
'id': string,
|
| 434 |
+
'text': string,
|
| 435 |
+
'domain': string,
|
| 436 |
+
'common_field_1': value,
|
| 437 |
+
'common_field_2': value,
|
| 438 |
+
'extra_metadata_json': json_string
|
| 439 |
+
}
|
| 440 |
+
|
| 441 |
+
🎯 RECOMMENDATION:
|
| 442 |
+
Small project → JSON approach
|
| 443 |
+
Medium project → Flatten approach
|
| 444 |
+
Large project → Separate tables
|
| 445 |
+
Research → Hybrid approach
|
| 446 |
+
""")
|
| 447 |
+
|
| 448 |
+
|
| 449 |
+
print("\n" + "="*70)
|
| 450 |
+
print("🔍 KARŞILAŞTIRMA - PERFORMANCE & STORAGE")
|
| 451 |
+
print("="*70)
|
| 452 |
+
|
| 453 |
+
import sys
|
| 454 |
+
|
| 455 |
+
print("\n📊 Memory Usage Comparison:")
|
| 456 |
+
print(f" Flatten: {sys.getsizeof(multi_domain_flat.data)} bytes")
|
| 457 |
+
print(f" JSON: {sys.getsizeof(multi_domain_json.data)} bytes")
|
| 458 |
+
print(f" Separated (unified): {sys.getsizeof(separated['unified'].data)} bytes")
|
| 459 |
+
|
| 460 |
+
print("\n🚀 Query Speed Simulation:")
|
| 461 |
+
print(" Flatten: O(1) - Direct column access")
|
| 462 |
+
print(" JSON: O(1) + parse overhead")
|
| 463 |
+
print(" Separated: O(log n) - Join required")
|
| 464 |
+
|
| 465 |
+
print("\n💾 Storage Efficiency:")
|
| 466 |
+
total_flat = len(multi_domain_flat) * len(multi_domain_flat.column_names)
|
| 467 |
+
total_json = len(multi_domain_json) * len(multi_domain_json.column_names)
|
| 468 |
+
total_sep = len(separated['unified']) + sum(len(t) for t in separated['metadata'].values())
|
| 469 |
+
|
| 470 |
+
print(f" Flatten: {total_flat} total fields")
|
| 471 |
+
print(f" JSON: {total_json} total fields")
|
| 472 |
+
print(f" Separated: {total_sep} total fields")
|
| 473 |
+
|
| 474 |
+
|
| 475 |
+
print("\n" + "="*70)
|
| 476 |
+
print("✅ ÇÖZÜM ÖZETİ")
|
| 477 |
+
print("="*70)
|
| 478 |
+
|
| 479 |
+
print("""
|
| 480 |
+
🎯 Ana Sorun:
|
| 481 |
+
ArrowTypeError: struct fields don't match
|
| 482 |
+
|
| 483 |
+
🔧 Çözümler:
|
| 484 |
+
1. Flatten: Tüm field'ları ayrı kolonlara çıkar
|
| 485 |
+
2. JSON: Metadata'yı JSON string olarak sakla
|
| 486 |
+
3. Separated: Ana tablo + metadata tabloları
|
| 487 |
+
|
| 488 |
+
✅ En İyi Yaklaşım:
|
| 489 |
+
- Küçük projeler: JSON
|
| 490 |
+
- Orta projeler: Flatten + JSON hybrid
|
| 491 |
+
- Büyük projeler: Separated tables
|
| 492 |
+
|
| 493 |
+
⚡ Key Takeaway:
|
| 494 |
+
Farklı schema'ları birleştirmeden önce
|
| 495 |
+
ortak bir format'a normalize et!
|
| 496 |
+
""")
|
| 497 |
+
|
| 498 |
+
print("\n🎉 Problem çözüldü! Artık cross-domain dataset'leri güvenle birleştirebilirsiniz.")
|
space/modules/03_ileri_teknikler_part1.py
ADDED
|
@@ -0,0 +1,856 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
İLERİ TEKNİKLER - HUGGING FACE DATASETS
|
| 3 |
+
========================================
|
| 4 |
+
|
| 5 |
+
Bu modülde öğrenecekleriniz:
|
| 6 |
+
1. Custom Data Collators
|
| 7 |
+
2. Advanced Feature Extraction & Transformation
|
| 8 |
+
3. Dataset Preprocessing Pipelines
|
| 9 |
+
4. Data Augmentation Strategies
|
| 10 |
+
5. Advanced Filtering & Sampling
|
| 11 |
+
6. Dynamic Batching
|
| 12 |
+
7. Feature Engineering
|
| 13 |
+
"""
|
| 14 |
+
|
| 15 |
+
from datasets import Dataset, DatasetDict
|
| 16 |
+
import numpy as np
|
| 17 |
+
from typing import Dict, List, Any, Callable
|
| 18 |
+
import time
|
| 19 |
+
from collections import defaultdict
|
| 20 |
+
import random
|
| 21 |
+
|
| 22 |
+
print("="*70)
|
| 23 |
+
print("🚀 İLERİ TEKNİKLER - ADVANCED HUGGING FACE DATASETS")
|
| 24 |
+
print("="*70)
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
print("\n" + "="*70)
|
| 28 |
+
print("1. CUSTOM DATA COLLATORS")
|
| 29 |
+
print("="*70)
|
| 30 |
+
|
| 31 |
+
print("\n📦 Data Collator Nedir?")
|
| 32 |
+
print("""
|
| 33 |
+
Data Collator: Batch'teki örnekleri işleyip model input'una çevirir
|
| 34 |
+
- Padding ekler
|
| 35 |
+
- Tensor'lara çevirir
|
| 36 |
+
- Batch oluşturur
|
| 37 |
+
- Dynamic behavior
|
| 38 |
+
""")
|
| 39 |
+
|
| 40 |
+
# Örnek dataset
|
| 41 |
+
def create_sample_dataset(num_samples=100):
|
| 42 |
+
def gen():
|
| 43 |
+
for i in range(num_samples):
|
| 44 |
+
yield {
|
| 45 |
+
'text': f"Sample text {i} " * np.random.randint(5, 20),
|
| 46 |
+
'label': np.random.randint(0, 3),
|
| 47 |
+
'length': np.random.randint(10, 100),
|
| 48 |
+
'metadata': {'id': i, 'score': np.random.random()}
|
| 49 |
+
}
|
| 50 |
+
return Dataset.from_generator(gen)
|
| 51 |
+
|
| 52 |
+
dataset = create_sample_dataset(200)
|
| 53 |
+
print(f"\n✅ Dataset: {len(dataset)} örnek")
|
| 54 |
+
print(f"Örnek: {dataset[0]}")
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
print("\n1️⃣ Basit Collator - Text + Label:")
|
| 58 |
+
|
| 59 |
+
class SimpleCollator:
|
| 60 |
+
"""
|
| 61 |
+
En basit collator - sadece text ve label'ı işler
|
| 62 |
+
"""
|
| 63 |
+
def __init__(self, max_length=50):
|
| 64 |
+
self.max_length = max_length
|
| 65 |
+
|
| 66 |
+
def __call__(self, batch: List[Dict]) -> Dict[str, List]:
|
| 67 |
+
"""
|
| 68 |
+
Batch'i işle
|
| 69 |
+
"""
|
| 70 |
+
# Text'leri al ve truncate et
|
| 71 |
+
texts = []
|
| 72 |
+
for example in batch:
|
| 73 |
+
text = example['text']
|
| 74 |
+
words = text.split()[:self.max_length]
|
| 75 |
+
texts.append(' '.join(words))
|
| 76 |
+
|
| 77 |
+
# Label'ları al
|
| 78 |
+
labels = [example['label'] for example in batch]
|
| 79 |
+
|
| 80 |
+
# Length'leri hesapla
|
| 81 |
+
lengths = [len(text.split()) for text in texts]
|
| 82 |
+
|
| 83 |
+
return {
|
| 84 |
+
'texts': texts,
|
| 85 |
+
'labels': labels,
|
| 86 |
+
'lengths': lengths
|
| 87 |
+
}
|
| 88 |
+
|
| 89 |
+
# Test
|
| 90 |
+
simple_collator = SimpleCollator(max_length=30)
|
| 91 |
+
batch = [dataset[i] for i in range(4)]
|
| 92 |
+
|
| 93 |
+
collated = simple_collator(batch)
|
| 94 |
+
print(f"\n✅ Collated batch:")
|
| 95 |
+
print(f" Texts: {len(collated['texts'])} samples")
|
| 96 |
+
print(f" Labels: {collated['labels']}")
|
| 97 |
+
print(f" Lengths: {collated['lengths']}")
|
| 98 |
+
print(f"\n İlk text: {collated['texts'][0][:80]}...")
|
| 99 |
+
|
| 100 |
+
|
| 101 |
+
print("\n2️⃣ Padding Collator - Dynamic Padding:")
|
| 102 |
+
|
| 103 |
+
class PaddingCollator:
|
| 104 |
+
"""
|
| 105 |
+
Dynamic padding - batch içindeki max uzunluğa göre padding
|
| 106 |
+
"""
|
| 107 |
+
def __init__(self, pad_token='[PAD]', max_length=None):
|
| 108 |
+
self.pad_token = pad_token
|
| 109 |
+
self.max_length = max_length
|
| 110 |
+
|
| 111 |
+
def __call__(self, batch: List[Dict]) -> Dict[str, Any]:
|
| 112 |
+
# Tokenize (basit - space split)
|
| 113 |
+
tokenized = []
|
| 114 |
+
for example in batch:
|
| 115 |
+
tokens = example['text'].split()
|
| 116 |
+
if self.max_length:
|
| 117 |
+
tokens = tokens[:self.max_length]
|
| 118 |
+
tokenized.append(tokens)
|
| 119 |
+
|
| 120 |
+
# Batch içindeki max length'i bul
|
| 121 |
+
max_len = max(len(tokens) for tokens in tokenized)
|
| 122 |
+
|
| 123 |
+
# Padding ekle
|
| 124 |
+
padded = []
|
| 125 |
+
attention_masks = []
|
| 126 |
+
|
| 127 |
+
for tokens in tokenized:
|
| 128 |
+
# Padding
|
| 129 |
+
padding_length = max_len - len(tokens)
|
| 130 |
+
padded_tokens = tokens + [self.pad_token] * padding_length
|
| 131 |
+
|
| 132 |
+
# Attention mask (1 = real token, 0 = padding)
|
| 133 |
+
mask = [1] * len(tokens) + [0] * padding_length
|
| 134 |
+
|
| 135 |
+
padded.append(padded_tokens)
|
| 136 |
+
attention_masks.append(mask)
|
| 137 |
+
|
| 138 |
+
labels = [ex['label'] for ex in batch]
|
| 139 |
+
|
| 140 |
+
return {
|
| 141 |
+
'input_tokens': padded,
|
| 142 |
+
'attention_mask': attention_masks,
|
| 143 |
+
'labels': labels,
|
| 144 |
+
'original_lengths': [len(tokens) for tokens in tokenized]
|
| 145 |
+
}
|
| 146 |
+
|
| 147 |
+
# Test
|
| 148 |
+
padding_collator = PaddingCollator(max_length=20)
|
| 149 |
+
batch = [dataset[i] for i in range(4)]
|
| 150 |
+
|
| 151 |
+
padded_batch = padding_collator(batch)
|
| 152 |
+
print(f"\n✅ Padded batch:")
|
| 153 |
+
print(f" Batch size: {len(padded_batch['input_tokens'])}")
|
| 154 |
+
print(f" Max length: {len(padded_batch['input_tokens'][0])}")
|
| 155 |
+
print(f" Original lengths: {padded_batch['original_lengths']}")
|
| 156 |
+
print(f"\n İlk örnek tokens: {padded_batch['input_tokens'][0][:15]}")
|
| 157 |
+
print(f" İlk örnek mask: {padded_batch['attention_mask'][0][:15]}")
|
| 158 |
+
|
| 159 |
+
|
| 160 |
+
print("\n3️⃣ Advanced Collator - Multiple Features:")
|
| 161 |
+
|
| 162 |
+
class AdvancedCollator:
|
| 163 |
+
"""
|
| 164 |
+
Çoklu feature'ları handle eden advanced collator
|
| 165 |
+
"""
|
| 166 |
+
def __init__(self,
|
| 167 |
+
pad_token='[PAD]',
|
| 168 |
+
max_length=50,
|
| 169 |
+
include_metadata=True,
|
| 170 |
+
normalize_scores=True):
|
| 171 |
+
self.pad_token = pad_token
|
| 172 |
+
self.max_length = max_length
|
| 173 |
+
self.include_metadata = include_metadata
|
| 174 |
+
self.normalize_scores = normalize_scores
|
| 175 |
+
|
| 176 |
+
def tokenize_and_pad(self, texts):
|
| 177 |
+
"""Tokenize ve pad"""
|
| 178 |
+
tokenized = [text.split()[:self.max_length] for text in texts]
|
| 179 |
+
max_len = max(len(tokens) for tokens in tokenized)
|
| 180 |
+
|
| 181 |
+
padded = []
|
| 182 |
+
masks = []
|
| 183 |
+
for tokens in tokenized:
|
| 184 |
+
pad_len = max_len - len(tokens)
|
| 185 |
+
padded.append(tokens + [self.pad_token] * pad_len)
|
| 186 |
+
masks.append([1] * len(tokens) + [0] * pad_len)
|
| 187 |
+
|
| 188 |
+
return padded, masks
|
| 189 |
+
|
| 190 |
+
def __call__(self, batch: List[Dict]) -> Dict[str, Any]:
|
| 191 |
+
texts = [ex['text'] for ex in batch]
|
| 192 |
+
labels = [ex['label'] for ex in batch]
|
| 193 |
+
lengths = [ex['length'] for ex in batch]
|
| 194 |
+
|
| 195 |
+
# Tokenize and pad
|
| 196 |
+
padded_tokens, attention_masks = self.tokenize_and_pad(texts)
|
| 197 |
+
|
| 198 |
+
result = {
|
| 199 |
+
'input_tokens': padded_tokens,
|
| 200 |
+
'attention_mask': attention_masks,
|
| 201 |
+
'labels': labels,
|
| 202 |
+
'lengths': lengths
|
| 203 |
+
}
|
| 204 |
+
|
| 205 |
+
# Metadata ekle
|
| 206 |
+
if self.include_metadata:
|
| 207 |
+
ids = [ex['metadata']['id'] for ex in batch]
|
| 208 |
+
scores = [ex['metadata']['score'] for ex in batch]
|
| 209 |
+
|
| 210 |
+
if self.normalize_scores:
|
| 211 |
+
# Min-max normalization
|
| 212 |
+
min_score = min(scores)
|
| 213 |
+
max_score = max(scores)
|
| 214 |
+
if max_score > min_score:
|
| 215 |
+
scores = [(s - min_score) / (max_score - min_score)
|
| 216 |
+
for s in scores]
|
| 217 |
+
|
| 218 |
+
result['ids'] = ids
|
| 219 |
+
result['scores'] = scores
|
| 220 |
+
|
| 221 |
+
# Batch statistics
|
| 222 |
+
result['batch_stats'] = {
|
| 223 |
+
'size': len(batch),
|
| 224 |
+
'avg_length': np.mean(lengths),
|
| 225 |
+
'max_length': max(lengths),
|
| 226 |
+
'label_distribution': {
|
| 227 |
+
label: labels.count(label) for label in set(labels)
|
| 228 |
+
}
|
| 229 |
+
}
|
| 230 |
+
|
| 231 |
+
return result
|
| 232 |
+
|
| 233 |
+
# Test
|
| 234 |
+
advanced_collator = AdvancedCollator(
|
| 235 |
+
max_length=25,
|
| 236 |
+
include_metadata=True,
|
| 237 |
+
normalize_scores=True
|
| 238 |
+
)
|
| 239 |
+
|
| 240 |
+
batch = [dataset[i] for i in range(8)]
|
| 241 |
+
advanced_batch = advanced_collator(batch)
|
| 242 |
+
|
| 243 |
+
print(f"\n✅ Advanced collated batch:")
|
| 244 |
+
print(f" Input tokens shape: {len(advanced_batch['input_tokens'])} x {len(advanced_batch['input_tokens'][0])}")
|
| 245 |
+
print(f" Labels: {advanced_batch['labels']}")
|
| 246 |
+
print(f" Normalized scores: {[f'{s:.3f}' for s in advanced_batch['scores']]}")
|
| 247 |
+
print(f" Batch stats: {advanced_batch['batch_stats']}")
|
| 248 |
+
|
| 249 |
+
|
| 250 |
+
print("\n" + "="*70)
|
| 251 |
+
print("2. ADVANCED FEATURE EXTRACTION & TRANSFORMATION")
|
| 252 |
+
print("="*70)
|
| 253 |
+
|
| 254 |
+
print("\n🔧 Feature Engineering Pipeline:")
|
| 255 |
+
|
| 256 |
+
class FeatureExtractor:
|
| 257 |
+
"""
|
| 258 |
+
Comprehensive feature extraction
|
| 259 |
+
"""
|
| 260 |
+
def __init__(self):
|
| 261 |
+
self.features = []
|
| 262 |
+
|
| 263 |
+
def extract_text_features(self, text: str) -> Dict[str, Any]:
|
| 264 |
+
"""Text'ten çeşitli feature'lar çıkar"""
|
| 265 |
+
words = text.split()
|
| 266 |
+
|
| 267 |
+
return {
|
| 268 |
+
# Basic features
|
| 269 |
+
'word_count': len(words),
|
| 270 |
+
'char_count': len(text),
|
| 271 |
+
'avg_word_length': np.mean([len(w) for w in words]) if words else 0,
|
| 272 |
+
|
| 273 |
+
# Complexity features
|
| 274 |
+
'unique_words': len(set(words)),
|
| 275 |
+
'vocabulary_richness': len(set(words)) / len(words) if words else 0,
|
| 276 |
+
|
| 277 |
+
# Statistical features
|
| 278 |
+
'word_length_std': np.std([len(w) for w in words]) if words else 0,
|
| 279 |
+
'max_word_length': max([len(w) for w in words]) if words else 0,
|
| 280 |
+
|
| 281 |
+
# Pattern features
|
| 282 |
+
'has_numbers': any(char.isdigit() for char in text),
|
| 283 |
+
'uppercase_ratio': sum(1 for c in text if c.isupper()) / len(text) if text else 0,
|
| 284 |
+
'punctuation_count': sum(1 for c in text if c in '.,!?;:')
|
| 285 |
+
}
|
| 286 |
+
|
| 287 |
+
def extract_all_features(self, example: Dict) -> Dict:
|
| 288 |
+
"""Tüm feature'ları çıkar"""
|
| 289 |
+
text_features = self.extract_text_features(example['text'])
|
| 290 |
+
|
| 291 |
+
# Mevcut feature'ları koru
|
| 292 |
+
result = {**example}
|
| 293 |
+
|
| 294 |
+
# Yeni feature'ları ekle
|
| 295 |
+
for key, value in text_features.items():
|
| 296 |
+
result[f'feat_{key}'] = value
|
| 297 |
+
|
| 298 |
+
return result
|
| 299 |
+
|
| 300 |
+
# Test feature extraction
|
| 301 |
+
print("\n1️⃣ Basic Feature Extraction:")
|
| 302 |
+
extractor = FeatureExtractor()
|
| 303 |
+
|
| 304 |
+
sample_text = "This is a sample text for feature extraction! It has 123 numbers."
|
| 305 |
+
features = extractor.extract_text_features(sample_text)
|
| 306 |
+
|
| 307 |
+
print(f" Text: {sample_text}")
|
| 308 |
+
print(f"\n Extracted features:")
|
| 309 |
+
for key, value in features.items():
|
| 310 |
+
print(f" {key}: {value:.3f}" if isinstance(value, float) else f" {key}: {value}")
|
| 311 |
+
|
| 312 |
+
# Dataset'e uygula
|
| 313 |
+
print("\n2️⃣ Applying to Dataset:")
|
| 314 |
+
featured_dataset = dataset.map(
|
| 315 |
+
extractor.extract_all_features,
|
| 316 |
+
desc="Extracting features"
|
| 317 |
+
)
|
| 318 |
+
|
| 319 |
+
print(f"\n✅ Featured dataset:")
|
| 320 |
+
print(f" Original columns: {dataset.column_names}")
|
| 321 |
+
print(f" New columns: {featured_dataset.column_names}")
|
| 322 |
+
print(f" Total columns: {len(featured_dataset.column_names)}")
|
| 323 |
+
|
| 324 |
+
# Feature istatistikleri
|
| 325 |
+
print(f"\n📊 Feature Statistics:")
|
| 326 |
+
for col in ['feat_word_count', 'feat_vocabulary_richness', 'feat_punctuation_count']:
|
| 327 |
+
values = [ex[col] for ex in featured_dataset.select(range(100))]
|
| 328 |
+
print(f" {col}:")
|
| 329 |
+
print(f" Mean: {np.mean(values):.2f}")
|
| 330 |
+
print(f" Std: {np.std(values):.2f}")
|
| 331 |
+
print(f" Min/Max: {np.min(values):.2f} / {np.max(values):.2f}")
|
| 332 |
+
|
| 333 |
+
|
| 334 |
+
print("\n3️⃣ Advanced Transformations:")
|
| 335 |
+
|
| 336 |
+
class AdvancedTransformer:
|
| 337 |
+
"""
|
| 338 |
+
Complex transformations
|
| 339 |
+
"""
|
| 340 |
+
def __init__(self):
|
| 341 |
+
self.scaler_params = {}
|
| 342 |
+
|
| 343 |
+
def fit_scaler(self, dataset, columns):
|
| 344 |
+
"""Scaling parameters'ı hesapla"""
|
| 345 |
+
print(" Fitting scaler...")
|
| 346 |
+
for col in columns:
|
| 347 |
+
values = [ex[col] for ex in dataset]
|
| 348 |
+
self.scaler_params[col] = {
|
| 349 |
+
'mean': np.mean(values),
|
| 350 |
+
'std': np.std(values),
|
| 351 |
+
'min': np.min(values),
|
| 352 |
+
'max': np.max(values)
|
| 353 |
+
}
|
| 354 |
+
|
| 355 |
+
def normalize(self, example, columns, method='minmax'):
|
| 356 |
+
"""Feature normalization"""
|
| 357 |
+
result = {**example}
|
| 358 |
+
|
| 359 |
+
for col in columns:
|
| 360 |
+
value = example[col]
|
| 361 |
+
params = self.scaler_params.get(col, {})
|
| 362 |
+
|
| 363 |
+
if method == 'minmax':
|
| 364 |
+
# Min-max scaling [0, 1]
|
| 365 |
+
min_val = params.get('min', 0)
|
| 366 |
+
max_val = params.get('max', 1)
|
| 367 |
+
if max_val > min_val:
|
| 368 |
+
normalized = (value - min_val) / (max_val - min_val)
|
| 369 |
+
else:
|
| 370 |
+
normalized = 0
|
| 371 |
+
elif method == 'zscore':
|
| 372 |
+
# Z-score normalization
|
| 373 |
+
mean = params.get('mean', 0)
|
| 374 |
+
std = params.get('std', 1)
|
| 375 |
+
if std > 0:
|
| 376 |
+
normalized = (value - mean) / std
|
| 377 |
+
else:
|
| 378 |
+
normalized = 0
|
| 379 |
+
else:
|
| 380 |
+
normalized = value
|
| 381 |
+
|
| 382 |
+
result[f'{col}_normalized'] = normalized
|
| 383 |
+
|
| 384 |
+
return result
|
| 385 |
+
|
| 386 |
+
def create_interaction_features(self, example):
|
| 387 |
+
"""Interaction features oluştur"""
|
| 388 |
+
result = {**example}
|
| 389 |
+
|
| 390 |
+
# Örnek: word_count * vocabulary_richness
|
| 391 |
+
if 'feat_word_count' in example and 'feat_vocabulary_richness' in example:
|
| 392 |
+
result['interaction_wc_vr'] = (
|
| 393 |
+
example['feat_word_count'] * example['feat_vocabulary_richness']
|
| 394 |
+
)
|
| 395 |
+
|
| 396 |
+
# Örnek: char_count / word_count (avg word length)
|
| 397 |
+
if 'feat_char_count' in example and 'feat_word_count' in example:
|
| 398 |
+
if example['feat_word_count'] > 0:
|
| 399 |
+
result['interaction_char_per_word'] = (
|
| 400 |
+
example['feat_char_count'] / example['feat_word_count']
|
| 401 |
+
)
|
| 402 |
+
else:
|
| 403 |
+
result['interaction_char_per_word'] = 0
|
| 404 |
+
|
| 405 |
+
return result
|
| 406 |
+
|
| 407 |
+
# Test transformations
|
| 408 |
+
transformer = AdvancedTransformer()
|
| 409 |
+
|
| 410 |
+
# Scaler fit et
|
| 411 |
+
numeric_features = ['feat_word_count', 'feat_char_count', 'feat_vocabulary_richness']
|
| 412 |
+
transformer.fit_scaler(featured_dataset, numeric_features)
|
| 413 |
+
|
| 414 |
+
print("\n Scaler parameters:")
|
| 415 |
+
for col, params in transformer.scaler_params.items():
|
| 416 |
+
print(f" {col}: μ={params['mean']:.2f}, σ={params['std']:.2f}")
|
| 417 |
+
|
| 418 |
+
# Normalize et
|
| 419 |
+
print("\n Normalizing features...")
|
| 420 |
+
normalized_dataset = featured_dataset.map(
|
| 421 |
+
lambda x: transformer.normalize(x, numeric_features, method='minmax'),
|
| 422 |
+
desc="Normalizing"
|
| 423 |
+
)
|
| 424 |
+
|
| 425 |
+
print(f"\n✅ Normalized dataset: {len(normalized_dataset)} examples")
|
| 426 |
+
print(f" New columns added: {[c for c in normalized_dataset.column_names if 'normalized' in c]}")
|
| 427 |
+
|
| 428 |
+
# Örnek normalized values
|
| 429 |
+
print(f"\n Sample normalized values:")
|
| 430 |
+
sample = normalized_dataset[0]
|
| 431 |
+
for col in numeric_features:
|
| 432 |
+
print(f" {col}: {sample[col]:.2f} → {sample[f'{col}_normalized']:.3f}")
|
| 433 |
+
|
| 434 |
+
# Interaction features
|
| 435 |
+
print("\n Creating interaction features...")
|
| 436 |
+
interaction_dataset = normalized_dataset.map(
|
| 437 |
+
transformer.create_interaction_features,
|
| 438 |
+
desc="Creating interactions"
|
| 439 |
+
)
|
| 440 |
+
|
| 441 |
+
print(f"\n✅ Interaction features added:")
|
| 442 |
+
print(f" interaction_wc_vr: {interaction_dataset[0]['interaction_wc_vr']:.3f}")
|
| 443 |
+
print(f" interaction_char_per_word: {interaction_dataset[0]['interaction_char_per_word']:.3f}")
|
| 444 |
+
|
| 445 |
+
|
| 446 |
+
print("\n" + "="*70)
|
| 447 |
+
print("3. DATASET PREPROCESSING PIPELINES")
|
| 448 |
+
print("="*70)
|
| 449 |
+
|
| 450 |
+
print("\n🔄 End-to-End Pipeline:")
|
| 451 |
+
|
| 452 |
+
class DataPipeline:
|
| 453 |
+
"""
|
| 454 |
+
Modular preprocessing pipeline
|
| 455 |
+
"""
|
| 456 |
+
def __init__(self, name="pipeline"):
|
| 457 |
+
self.name = name
|
| 458 |
+
self.steps = []
|
| 459 |
+
self.statistics = {}
|
| 460 |
+
|
| 461 |
+
def add_step(self, name: str, func: Callable, **kwargs):
|
| 462 |
+
"""Pipeline'a step ekle"""
|
| 463 |
+
self.steps.append({
|
| 464 |
+
'name': name,
|
| 465 |
+
'func': func,
|
| 466 |
+
'kwargs': kwargs
|
| 467 |
+
})
|
| 468 |
+
return self
|
| 469 |
+
|
| 470 |
+
def run(self, dataset: Dataset, verbose=True) -> Dataset:
|
| 471 |
+
"""Pipeline'ı çalıştır"""
|
| 472 |
+
if verbose:
|
| 473 |
+
print(f"\n🚀 Running pipeline: {self.name}")
|
| 474 |
+
print(f" Input: {len(dataset)} examples, {len(dataset.column_names)} columns")
|
| 475 |
+
|
| 476 |
+
result = dataset
|
| 477 |
+
|
| 478 |
+
for i, step in enumerate(self.steps):
|
| 479 |
+
if verbose:
|
| 480 |
+
print(f"\n Step {i+1}/{len(self.steps)}: {step['name']}")
|
| 481 |
+
|
| 482 |
+
start_time = time.time()
|
| 483 |
+
|
| 484 |
+
# Step'i çalıştır
|
| 485 |
+
result = step['func'](result, **step['kwargs'])
|
| 486 |
+
|
| 487 |
+
elapsed = time.time() - start_time
|
| 488 |
+
|
| 489 |
+
if verbose:
|
| 490 |
+
print(f" ✓ Completed in {elapsed:.3f}s")
|
| 491 |
+
print(f" Output: {len(result)} examples, {len(result.column_names)} columns")
|
| 492 |
+
|
| 493 |
+
# İstatistikleri kaydet
|
| 494 |
+
self.statistics[step['name']] = {
|
| 495 |
+
'elapsed_time': elapsed,
|
| 496 |
+
'output_size': len(result),
|
| 497 |
+
'output_columns': len(result.column_names)
|
| 498 |
+
}
|
| 499 |
+
|
| 500 |
+
if verbose:
|
| 501 |
+
print(f"\n✅ Pipeline completed!")
|
| 502 |
+
print(f" Total time: {sum(s['elapsed_time'] for s in self.statistics.values()):.3f}s")
|
| 503 |
+
|
| 504 |
+
return result
|
| 505 |
+
|
| 506 |
+
def get_statistics(self):
|
| 507 |
+
"""Pipeline istatistiklerini al"""
|
| 508 |
+
return self.statistics
|
| 509 |
+
|
| 510 |
+
|
| 511 |
+
# Pipeline step'leri tanımla
|
| 512 |
+
def step_clean_text(dataset, min_length=10):
|
| 513 |
+
"""Text temizleme step"""
|
| 514 |
+
def clean(example):
|
| 515 |
+
text = example['text'].strip()
|
| 516 |
+
text = ' '.join(text.split()) # Fazla boşlukları temizle
|
| 517 |
+
example['text_clean'] = text
|
| 518 |
+
return example
|
| 519 |
+
|
| 520 |
+
return dataset.map(clean, desc="Cleaning text")
|
| 521 |
+
|
| 522 |
+
def step_filter_short(dataset, min_words=5):
|
| 523 |
+
"""Kısa metinleri filtrele"""
|
| 524 |
+
return dataset.filter(
|
| 525 |
+
lambda x: len(x['text'].split()) >= min_words,
|
| 526 |
+
desc="Filtering short texts"
|
| 527 |
+
)
|
| 528 |
+
|
| 529 |
+
def step_extract_features(dataset):
|
| 530 |
+
"""Feature extraction"""
|
| 531 |
+
extractor = FeatureExtractor()
|
| 532 |
+
return dataset.map(
|
| 533 |
+
extractor.extract_all_features,
|
| 534 |
+
desc="Extracting features"
|
| 535 |
+
)
|
| 536 |
+
|
| 537 |
+
def step_normalize_features(dataset, columns):
|
| 538 |
+
"""Feature normalization"""
|
| 539 |
+
transformer = AdvancedTransformer()
|
| 540 |
+
transformer.fit_scaler(dataset, columns)
|
| 541 |
+
|
| 542 |
+
return dataset.map(
|
| 543 |
+
lambda x: transformer.normalize(x, columns, method='minmax'),
|
| 544 |
+
desc="Normalizing features"
|
| 545 |
+
)
|
| 546 |
+
|
| 547 |
+
# Pipeline oluştur ve çalıştır
|
| 548 |
+
print("\n1️⃣ Creating Pipeline:")
|
| 549 |
+
pipeline = DataPipeline(name="Text Processing Pipeline")
|
| 550 |
+
|
| 551 |
+
pipeline.add_step("clean_text", step_clean_text, min_length=10)
|
| 552 |
+
pipeline.add_step("filter_short", step_filter_short, min_words=5)
|
| 553 |
+
pipeline.add_step("extract_features", step_extract_features)
|
| 554 |
+
pipeline.add_step("normalize_features", step_normalize_features,
|
| 555 |
+
columns=['feat_word_count', 'feat_char_count'])
|
| 556 |
+
|
| 557 |
+
# Yeni bir dataset oluştur
|
| 558 |
+
raw_dataset = create_sample_dataset(500)
|
| 559 |
+
|
| 560 |
+
# Pipeline'ı çalıştır
|
| 561 |
+
processed_dataset = pipeline.run(raw_dataset, verbose=True)
|
| 562 |
+
|
| 563 |
+
# Sonuçları göster
|
| 564 |
+
print(f"\n📊 Pipeline Results:")
|
| 565 |
+
print(f" Input examples: {len(raw_dataset)}")
|
| 566 |
+
print(f" Output examples: {len(processed_dataset)}")
|
| 567 |
+
print(f" Columns added: {len(processed_dataset.column_names) - len(raw_dataset.column_names)}")
|
| 568 |
+
|
| 569 |
+
# İstatistikler
|
| 570 |
+
print(f"\n📈 Step Statistics:")
|
| 571 |
+
for step_name, stats in pipeline.get_statistics().items():
|
| 572 |
+
print(f" {step_name}:")
|
| 573 |
+
print(f" Time: {stats['elapsed_time']:.3f}s")
|
| 574 |
+
print(f" Output size: {stats['output_size']}")
|
| 575 |
+
|
| 576 |
+
|
| 577 |
+
print("\n2️⃣ Reusable Pipeline Template:")
|
| 578 |
+
|
| 579 |
+
class PipelineTemplate:
|
| 580 |
+
"""
|
| 581 |
+
Re-usable pipeline templates
|
| 582 |
+
"""
|
| 583 |
+
@staticmethod
|
| 584 |
+
def basic_nlp_pipeline():
|
| 585 |
+
"""Basic NLP preprocessing"""
|
| 586 |
+
pipeline = DataPipeline("Basic NLP")
|
| 587 |
+
pipeline.add_step("clean", step_clean_text)
|
| 588 |
+
pipeline.add_step("filter", step_filter_short, min_words=3)
|
| 589 |
+
return pipeline
|
| 590 |
+
|
| 591 |
+
@staticmethod
|
| 592 |
+
def feature_engineering_pipeline():
|
| 593 |
+
"""Feature engineering pipeline"""
|
| 594 |
+
pipeline = DataPipeline("Feature Engineering")
|
| 595 |
+
pipeline.add_step("clean", step_clean_text)
|
| 596 |
+
pipeline.add_step("features", step_extract_features)
|
| 597 |
+
pipeline.add_step("normalize", step_normalize_features,
|
| 598 |
+
columns=['feat_word_count', 'feat_char_count',
|
| 599 |
+
'feat_vocabulary_richness'])
|
| 600 |
+
return pipeline
|
| 601 |
+
|
| 602 |
+
@staticmethod
|
| 603 |
+
def full_pipeline():
|
| 604 |
+
"""Complete preprocessing pipeline"""
|
| 605 |
+
pipeline = DataPipeline("Full Pipeline")
|
| 606 |
+
pipeline.add_step("clean", step_clean_text, min_length=10)
|
| 607 |
+
pipeline.add_step("filter", step_filter_short, min_words=5)
|
| 608 |
+
pipeline.add_step("features", step_extract_features)
|
| 609 |
+
pipeline.add_step("normalize", step_normalize_features,
|
| 610 |
+
columns=['feat_word_count', 'feat_char_count'])
|
| 611 |
+
return pipeline
|
| 612 |
+
|
| 613 |
+
# Template kullanımı
|
| 614 |
+
print("\n Using pipeline template:")
|
| 615 |
+
template_pipeline = PipelineTemplate.feature_engineering_pipeline()
|
| 616 |
+
print(f" Pipeline: {template_pipeline.name}")
|
| 617 |
+
print(f" Steps: {[s['name'] for s in template_pipeline.steps]}")
|
| 618 |
+
|
| 619 |
+
|
| 620 |
+
print("\n" + "="*70)
|
| 621 |
+
print("4. DATA AUGMENTATION STRATEGIES")
|
| 622 |
+
print("="*70)
|
| 623 |
+
|
| 624 |
+
print("\n🎲 Data Augmentation Teknikleri:")
|
| 625 |
+
|
| 626 |
+
class DataAugmenter:
|
| 627 |
+
"""
|
| 628 |
+
Data augmentation methods
|
| 629 |
+
"""
|
| 630 |
+
def __init__(self, augmentation_prob=0.3):
|
| 631 |
+
self.augmentation_prob = augmentation_prob
|
| 632 |
+
|
| 633 |
+
def random_word_deletion(self, text: str, p=0.1) -> str:
|
| 634 |
+
"""Random kelime silme"""
|
| 635 |
+
words = text.split()
|
| 636 |
+
if len(words) <= 2:
|
| 637 |
+
return text
|
| 638 |
+
|
| 639 |
+
new_words = [w for w in words if random.random() > p]
|
| 640 |
+
|
| 641 |
+
# En az 1 kelime kalsın
|
| 642 |
+
if len(new_words) == 0:
|
| 643 |
+
new_words = [random.choice(words)]
|
| 644 |
+
|
| 645 |
+
return ' '.join(new_words)
|
| 646 |
+
|
| 647 |
+
def random_word_swap(self, text: str, n=1) -> str:
|
| 648 |
+
"""Random kelime yer değiştirme"""
|
| 649 |
+
words = text.split()
|
| 650 |
+
if len(words) < 2:
|
| 651 |
+
return text
|
| 652 |
+
|
| 653 |
+
for _ in range(n):
|
| 654 |
+
idx1, idx2 = random.sample(range(len(words)), 2)
|
| 655 |
+
words[idx1], words[idx2] = words[idx2], words[idx1]
|
| 656 |
+
|
| 657 |
+
return ' '.join(words)
|
| 658 |
+
|
| 659 |
+
def synonym_replacement(self, text: str, p=0.1) -> str:
|
| 660 |
+
"""
|
| 661 |
+
Synonym replacement (basitleştirilmiş)
|
| 662 |
+
Gerçek uygulamada WordNet veya embedding kullanılır
|
| 663 |
+
"""
|
| 664 |
+
synonyms = {
|
| 665 |
+
'good': ['great', 'excellent', 'nice'],
|
| 666 |
+
'bad': ['poor', 'terrible', 'awful'],
|
| 667 |
+
'big': ['large', 'huge', 'enormous'],
|
| 668 |
+
'small': ['tiny', 'little', 'mini']
|
| 669 |
+
}
|
| 670 |
+
|
| 671 |
+
words = text.split()
|
| 672 |
+
new_words = []
|
| 673 |
+
|
| 674 |
+
for word in words:
|
| 675 |
+
if word.lower() in synonyms and random.random() < p:
|
| 676 |
+
new_words.append(random.choice(synonyms[word.lower()]))
|
| 677 |
+
else:
|
| 678 |
+
new_words.append(word)
|
| 679 |
+
|
| 680 |
+
return ' '.join(new_words)
|
| 681 |
+
|
| 682 |
+
def augment_example(self, example: Dict) -> Dict:
|
| 683 |
+
"""Tek bir örneği augment et"""
|
| 684 |
+
if random.random() > self.augmentation_prob:
|
| 685 |
+
return example
|
| 686 |
+
|
| 687 |
+
text = example['text']
|
| 688 |
+
|
| 689 |
+
# Random augmentation seç
|
| 690 |
+
aug_method = random.choice([
|
| 691 |
+
self.random_word_deletion,
|
| 692 |
+
self.random_word_swap,
|
| 693 |
+
self.synonym_replacement
|
| 694 |
+
])
|
| 695 |
+
|
| 696 |
+
augmented_text = aug_method(text)
|
| 697 |
+
|
| 698 |
+
return {
|
| 699 |
+
**example,
|
| 700 |
+
'text_augmented': augmented_text,
|
| 701 |
+
'is_augmented': True
|
| 702 |
+
}
|
| 703 |
+
|
| 704 |
+
def augment_dataset(self, dataset: Dataset, num_augmentations=1) -> Dataset:
|
| 705 |
+
"""Dataset'i augment et"""
|
| 706 |
+
augmented_examples = []
|
| 707 |
+
|
| 708 |
+
for example in dataset:
|
| 709 |
+
# Original örneği ekle
|
| 710 |
+
augmented_examples.append({
|
| 711 |
+
**example,
|
| 712 |
+
'is_augmented': False,
|
| 713 |
+
'text_augmented': example['text']
|
| 714 |
+
})
|
| 715 |
+
|
| 716 |
+
# Augmented versiyonları ekle
|
| 717 |
+
for _ in range(num_augmentations):
|
| 718 |
+
aug_example = self.augment_example(example)
|
| 719 |
+
augmented_examples.append(aug_example)
|
| 720 |
+
|
| 721 |
+
# Dict of lists'e çevir
|
| 722 |
+
dict_data = defaultdict(list)
|
| 723 |
+
for example in augmented_examples:
|
| 724 |
+
for key, value in example.items():
|
| 725 |
+
dict_data[key].append(value)
|
| 726 |
+
|
| 727 |
+
return Dataset.from_dict(dict(dict_data))
|
| 728 |
+
|
| 729 |
+
|
| 730 |
+
print("\n1️⃣ Augmentation Examples:")
|
| 731 |
+
augmenter = DataAugmenter(augmentation_prob=1.0) # Her zaman augment et
|
| 732 |
+
|
| 733 |
+
test_texts = [
|
| 734 |
+
"This is a good example of text augmentation",
|
| 735 |
+
"The big dog ran fast in the park",
|
| 736 |
+
"Data augmentation is important for ML"
|
| 737 |
+
]
|
| 738 |
+
|
| 739 |
+
for i, text in enumerate(test_texts):
|
| 740 |
+
print(f"\n Original {i+1}: {text}")
|
| 741 |
+
print(f" Deletion: {augmenter.random_word_deletion(text, p=0.2)}")
|
| 742 |
+
print(f" Swap: {augmenter.random_word_swap(text, n=2)}")
|
| 743 |
+
print(f" Synonym: {augmenter.synonym_replacement(text, p=0.3)}")
|
| 744 |
+
|
| 745 |
+
|
| 746 |
+
print("\n2️⃣ Augmenting Dataset:")
|
| 747 |
+
small_dataset = create_sample_dataset(50)
|
| 748 |
+
|
| 749 |
+
print(f" Original dataset: {len(small_dataset)} examples")
|
| 750 |
+
|
| 751 |
+
# Augment et (her örnek için 2 augmented versiyon)
|
| 752 |
+
augmented_dataset = augmenter.augment_dataset(small_dataset, num_augmentations=2)
|
| 753 |
+
|
| 754 |
+
print(f" Augmented dataset: {len(augmented_dataset)} examples")
|
| 755 |
+
print(f" Augmented ratio: {len(augmented_dataset) / len(small_dataset):.1f}x")
|
| 756 |
+
|
| 757 |
+
# Augmented örnekleri göster
|
| 758 |
+
print(f"\n Sample augmentations:")
|
| 759 |
+
for i in range(3):
|
| 760 |
+
original_idx = i * 3 # Original
|
| 761 |
+
aug_idx = i * 3 + 1 # First augmentation
|
| 762 |
+
|
| 763 |
+
orig = augmented_dataset[original_idx]
|
| 764 |
+
aug = augmented_dataset[aug_idx]
|
| 765 |
+
|
| 766 |
+
print(f"\n Example {i+1}:")
|
| 767 |
+
print(f" Original: {orig['text'][:60]}...")
|
| 768 |
+
print(f" Augmented: {aug['text_augmented'][:60]}...")
|
| 769 |
+
print(f" Is augmented: {aug['is_augmented']}")
|
| 770 |
+
|
| 771 |
+
|
| 772 |
+
print("\n3️⃣ Smart Augmentation - Class Balancing:")
|
| 773 |
+
|
| 774 |
+
def smart_augment_for_balance(dataset, label_column='label', target_per_class=100):
|
| 775 |
+
"""
|
| 776 |
+
Class'ları balance etmek için smart augmentation
|
| 777 |
+
"""
|
| 778 |
+
augmenter = DataAugmenter(augmentation_prob=1.0)
|
| 779 |
+
|
| 780 |
+
# Label distribution'ı hesapla
|
| 781 |
+
labels = [ex[label_column] for ex in dataset]
|
| 782 |
+
label_counts = {label: labels.count(label) for label in set(labels)}
|
| 783 |
+
|
| 784 |
+
print(f"\n Original distribution:")
|
| 785 |
+
for label, count in sorted(label_counts.items()):
|
| 786 |
+
print(f" Label {label}: {count} examples")
|
| 787 |
+
|
| 788 |
+
# Balanced dataset oluştur
|
| 789 |
+
balanced_examples = []
|
| 790 |
+
|
| 791 |
+
for label in set(labels):
|
| 792 |
+
# Bu label'a ait örnekleri al
|
| 793 |
+
label_examples = [ex for ex in dataset if ex[label_column] == label]
|
| 794 |
+
current_count = len(label_examples)
|
| 795 |
+
|
| 796 |
+
# Original örnekleri ekle
|
| 797 |
+
for ex in label_examples:
|
| 798 |
+
balanced_examples.append({
|
| 799 |
+
**ex,
|
| 800 |
+
'is_augmented': False,
|
| 801 |
+
'text_augmented': ex['text']
|
| 802 |
+
})
|
| 803 |
+
|
| 804 |
+
# Eksik kısmı augmentation ile doldur
|
| 805 |
+
if current_count < target_per_class:
|
| 806 |
+
needed = target_per_class - current_count
|
| 807 |
+
|
| 808 |
+
for i in range(needed):
|
| 809 |
+
# Cycle through examples
|
| 810 |
+
source_ex = label_examples[i % len(label_examples)]
|
| 811 |
+
aug_ex = augmenter.augment_example(source_ex)
|
| 812 |
+
balanced_examples.append(aug_ex)
|
| 813 |
+
|
| 814 |
+
# Dataset'e çevir
|
| 815 |
+
dict_data = defaultdict(list)
|
| 816 |
+
for example in balanced_examples:
|
| 817 |
+
for key, value in example.items():
|
| 818 |
+
dict_data[key].append(value)
|
| 819 |
+
|
| 820 |
+
return Dataset.from_dict(dict(dict_data))
|
| 821 |
+
|
| 822 |
+
# Test smart augmentation
|
| 823 |
+
print("\n Applying smart augmentation for balance:")
|
| 824 |
+
balanced_dataset = smart_augment_for_balance(small_dataset, target_per_class=60)
|
| 825 |
+
|
| 826 |
+
print(f"\n Balanced distribution:")
|
| 827 |
+
balanced_labels = [ex['label'] for ex in balanced_dataset]
|
| 828 |
+
balanced_counts = {label: balanced_labels.count(label) for label in set(balanced_labels)}
|
| 829 |
+
for label, count in sorted(balanced_counts.items()):
|
| 830 |
+
print(f" Label {label}: {count} examples")
|
| 831 |
+
|
| 832 |
+
print(f"\n Total examples: {len(small_dataset)} → {len(balanced_dataset)}")
|
| 833 |
+
|
| 834 |
+
|
| 835 |
+
print("\n" + "="*70)
|
| 836 |
+
print("✅ BÖLÜM 3 TAMAMLANDI! (Devam ediyor...)")
|
| 837 |
+
print("="*70)
|
| 838 |
+
|
| 839 |
+
print("""
|
| 840 |
+
Bu bölümde öğrendikleriniz (1. Kısım):
|
| 841 |
+
✓ Custom Data Collators (3 tip)
|
| 842 |
+
✓ Advanced Feature Extraction
|
| 843 |
+
✓ Feature Transformation & Normalization
|
| 844 |
+
✓ Preprocessing Pipelines
|
| 845 |
+
✓ Data Augmentation Strategies
|
| 846 |
+
✓ Smart Class Balancing
|
| 847 |
+
|
| 848 |
+
📚 SONRAKI: Advanced Filtering & Sampling
|
| 849 |
+
- Complex filtering strategies
|
| 850 |
+
- Stratified sampling
|
| 851 |
+
- Active learning sampling
|
| 852 |
+
- Diversity sampling
|
| 853 |
+
""")
|
| 854 |
+
|
| 855 |
+
print("\n▶️ Devam ediyoruz...")
|
| 856 |
+
time.sleep(1)
|
space/modules/03_ileri_teknikler_part2.py
ADDED
|
@@ -0,0 +1,776 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
İLERİ TEKNİKLER - PART 2
|
| 3 |
+
========================
|
| 4 |
+
|
| 5 |
+
Bu modülde öğrenecekleriniz:
|
| 6 |
+
5. Advanced Filtering & Sampling
|
| 7 |
+
6. Dynamic Batching
|
| 8 |
+
7. Active Learning Integration
|
| 9 |
+
"""
|
| 10 |
+
|
| 11 |
+
from datasets import Dataset
|
| 12 |
+
import numpy as np
|
| 13 |
+
from typing import Dict, List, Any
|
| 14 |
+
import random
|
| 15 |
+
from collections import defaultdict, Counter
|
| 16 |
+
|
| 17 |
+
print("\n" + "="*70)
|
| 18 |
+
print("5. ADVANCED FILTERING & SAMPLING")
|
| 19 |
+
print("="*70)
|
| 20 |
+
|
| 21 |
+
# Dataset oluştur
|
| 22 |
+
def create_diverse_dataset(num_samples=1000):
|
| 23 |
+
def gen():
|
| 24 |
+
domains = ['science', 'tech', 'sports', 'politics', 'entertainment']
|
| 25 |
+
difficulties = ['easy', 'medium', 'hard']
|
| 26 |
+
|
| 27 |
+
for i in range(num_samples):
|
| 28 |
+
domain = np.random.choice(domains)
|
| 29 |
+
difficulty = np.random.choice(difficulties)
|
| 30 |
+
|
| 31 |
+
yield {
|
| 32 |
+
'id': i,
|
| 33 |
+
'text': f"Sample text {i} in {domain} " * np.random.randint(5, 20),
|
| 34 |
+
'domain': domain,
|
| 35 |
+
'difficulty': difficulty,
|
| 36 |
+
'score': np.random.random(),
|
| 37 |
+
'label': np.random.randint(0, 3),
|
| 38 |
+
'length': np.random.randint(50, 500),
|
| 39 |
+
'quality': np.random.choice(['high', 'medium', 'low'])
|
| 40 |
+
}
|
| 41 |
+
return Dataset.from_generator(gen)
|
| 42 |
+
|
| 43 |
+
dataset = create_diverse_dataset(1000)
|
| 44 |
+
print(f"✅ Dataset: {len(dataset)} örnekler")
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
print("\n1️⃣ Complex Multi-Condition Filtering:")
|
| 48 |
+
|
| 49 |
+
class AdvancedFilter:
|
| 50 |
+
"""
|
| 51 |
+
Complex filtering with multiple conditions
|
| 52 |
+
"""
|
| 53 |
+
@staticmethod
|
| 54 |
+
def filter_by_multiple_conditions(dataset, conditions: List[callable]):
|
| 55 |
+
"""
|
| 56 |
+
Birden fazla koşulu AND ile uygula
|
| 57 |
+
"""
|
| 58 |
+
def combined_filter(example):
|
| 59 |
+
return all(condition(example) for condition in conditions)
|
| 60 |
+
|
| 61 |
+
return dataset.filter(combined_filter, desc="Multi-condition filtering")
|
| 62 |
+
|
| 63 |
+
@staticmethod
|
| 64 |
+
def filter_by_score_percentile(dataset, percentile=75, column='score'):
|
| 65 |
+
"""
|
| 66 |
+
Belirli percentile'ın üstündeki örnekleri filtrele
|
| 67 |
+
"""
|
| 68 |
+
scores = [ex[column] for ex in dataset]
|
| 69 |
+
threshold = np.percentile(scores, percentile)
|
| 70 |
+
|
| 71 |
+
return dataset.filter(
|
| 72 |
+
lambda x: x[column] >= threshold,
|
| 73 |
+
desc=f"Filtering top {100-percentile}%"
|
| 74 |
+
)
|
| 75 |
+
|
| 76 |
+
@staticmethod
|
| 77 |
+
def filter_balanced_classes(dataset, label_column='label', samples_per_class=100):
|
| 78 |
+
"""
|
| 79 |
+
Her class'tan eşit sayıda örnek al
|
| 80 |
+
"""
|
| 81 |
+
# Label'lara göre grupla
|
| 82 |
+
label_groups = defaultdict(list)
|
| 83 |
+
for i, ex in enumerate(dataset):
|
| 84 |
+
label_groups[ex[label_column]].append(i)
|
| 85 |
+
|
| 86 |
+
# Her class'tan sample al
|
| 87 |
+
selected_indices = []
|
| 88 |
+
for label, indices in label_groups.items():
|
| 89 |
+
# Random sample
|
| 90 |
+
n_samples = min(samples_per_class, len(indices))
|
| 91 |
+
sampled = random.sample(indices, n_samples)
|
| 92 |
+
selected_indices.extend(sampled)
|
| 93 |
+
|
| 94 |
+
return dataset.select(sorted(selected_indices))
|
| 95 |
+
|
| 96 |
+
# Test filters
|
| 97 |
+
print("\n Complex filtering örneği:")
|
| 98 |
+
|
| 99 |
+
# Birden fazla koşul
|
| 100 |
+
conditions = [
|
| 101 |
+
lambda x: x['length'] > 100, # Uzun metinler
|
| 102 |
+
lambda x: x['score'] > 0.5, # Yüksek score
|
| 103 |
+
lambda x: x['quality'] == 'high' # Yüksek kalite
|
| 104 |
+
]
|
| 105 |
+
|
| 106 |
+
filtered = AdvancedFilter.filter_by_multiple_conditions(dataset, conditions)
|
| 107 |
+
print(f" Original: {len(dataset)} examples")
|
| 108 |
+
print(f" Filtered (length>100 AND score>0.5 AND quality=high): {len(filtered)} examples")
|
| 109 |
+
print(f" Kept: {len(filtered)/len(dataset)*100:.1f}%")
|
| 110 |
+
|
| 111 |
+
# Percentile filtering
|
| 112 |
+
print("\n Percentile filtering:")
|
| 113 |
+
top_25 = AdvancedFilter.filter_by_score_percentile(dataset, percentile=75)
|
| 114 |
+
print(f" Top 25% by score: {len(top_25)} examples")
|
| 115 |
+
|
| 116 |
+
# Balanced sampling
|
| 117 |
+
print("\n Balanced class sampling:")
|
| 118 |
+
balanced = AdvancedFilter.filter_balanced_classes(dataset, samples_per_class=100)
|
| 119 |
+
labels = [ex['label'] for ex in balanced]
|
| 120 |
+
label_dist = Counter(labels)
|
| 121 |
+
print(f" Total: {len(balanced)} examples")
|
| 122 |
+
print(f" Distribution: {dict(label_dist)}")
|
| 123 |
+
|
| 124 |
+
|
| 125 |
+
print("\n2️⃣ Stratified Sampling:")
|
| 126 |
+
|
| 127 |
+
class StratifiedSampler:
|
| 128 |
+
"""
|
| 129 |
+
Stratified sampling for representative splits
|
| 130 |
+
"""
|
| 131 |
+
@staticmethod
|
| 132 |
+
def stratified_split(dataset,
|
| 133 |
+
stratify_column='label',
|
| 134 |
+
train_ratio=0.8,
|
| 135 |
+
seed=42):
|
| 136 |
+
"""
|
| 137 |
+
Stratified train/test split
|
| 138 |
+
"""
|
| 139 |
+
random.seed(seed)
|
| 140 |
+
|
| 141 |
+
# Group by stratify column
|
| 142 |
+
groups = defaultdict(list)
|
| 143 |
+
for i, ex in enumerate(dataset):
|
| 144 |
+
groups[ex[stratify_column]].append(i)
|
| 145 |
+
|
| 146 |
+
train_indices = []
|
| 147 |
+
test_indices = []
|
| 148 |
+
|
| 149 |
+
# Split each group
|
| 150 |
+
for group_indices in groups.values():
|
| 151 |
+
random.shuffle(group_indices)
|
| 152 |
+
split_point = int(len(group_indices) * train_ratio)
|
| 153 |
+
train_indices.extend(group_indices[:split_point])
|
| 154 |
+
test_indices.extend(group_indices[split_point:])
|
| 155 |
+
|
| 156 |
+
train_dataset = dataset.select(sorted(train_indices))
|
| 157 |
+
test_dataset = dataset.select(sorted(test_indices))
|
| 158 |
+
|
| 159 |
+
return train_dataset, test_dataset
|
| 160 |
+
|
| 161 |
+
@staticmethod
|
| 162 |
+
def multi_stratified_split(dataset,
|
| 163 |
+
stratify_columns=['label', 'domain'],
|
| 164 |
+
train_ratio=0.8,
|
| 165 |
+
seed=42):
|
| 166 |
+
"""
|
| 167 |
+
Multiple column stratification
|
| 168 |
+
"""
|
| 169 |
+
random.seed(seed)
|
| 170 |
+
|
| 171 |
+
# Create combined stratification key
|
| 172 |
+
groups = defaultdict(list)
|
| 173 |
+
for i, ex in enumerate(dataset):
|
| 174 |
+
key = tuple(ex[col] for col in stratify_columns)
|
| 175 |
+
groups[key].append(i)
|
| 176 |
+
|
| 177 |
+
train_indices = []
|
| 178 |
+
test_indices = []
|
| 179 |
+
|
| 180 |
+
# Split each group
|
| 181 |
+
for group_indices in groups.values():
|
| 182 |
+
random.shuffle(group_indices)
|
| 183 |
+
split_point = int(len(group_indices) * train_ratio)
|
| 184 |
+
train_indices.extend(group_indices[:split_point])
|
| 185 |
+
test_indices.extend(group_indices[split_point:])
|
| 186 |
+
|
| 187 |
+
train_dataset = dataset.select(sorted(train_indices))
|
| 188 |
+
test_dataset = dataset.select(sorted(test_indices))
|
| 189 |
+
|
| 190 |
+
return train_dataset, test_dataset
|
| 191 |
+
|
| 192 |
+
# Test stratified sampling
|
| 193 |
+
print("\n Single column stratification (label):")
|
| 194 |
+
train, test = StratifiedSampler.stratified_split(dataset, stratify_column='label')
|
| 195 |
+
|
| 196 |
+
print(f" Train: {len(train)} examples")
|
| 197 |
+
train_labels = [ex['label'] for ex in train]
|
| 198 |
+
train_dist = Counter(train_labels)
|
| 199 |
+
print(f" Train distribution: {dict(train_dist)}")
|
| 200 |
+
|
| 201 |
+
print(f"\n Test: {len(test)} examples")
|
| 202 |
+
test_labels = [ex['label'] for ex in test]
|
| 203 |
+
test_dist = Counter(test_labels)
|
| 204 |
+
print(f" Test distribution: {dict(test_dist)}")
|
| 205 |
+
|
| 206 |
+
# Multi-column stratification
|
| 207 |
+
print("\n Multi-column stratification (label + domain):")
|
| 208 |
+
train_multi, test_multi = StratifiedSampler.multi_stratified_split(
|
| 209 |
+
dataset,
|
| 210 |
+
stratify_columns=['label', 'domain']
|
| 211 |
+
)
|
| 212 |
+
|
| 213 |
+
print(f" Train: {len(train_multi)} examples")
|
| 214 |
+
print(f" Test: {len(test_multi)} examples")
|
| 215 |
+
|
| 216 |
+
# Check distribution
|
| 217 |
+
train_combos = [(ex['label'], ex['domain']) for ex in train_multi.select(range(min(100, len(train_multi))))]
|
| 218 |
+
print(f" Sample combinations in train: {len(set(train_combos))} unique")
|
| 219 |
+
|
| 220 |
+
|
| 221 |
+
print("\n3️⃣ Diversity Sampling:")
|
| 222 |
+
|
| 223 |
+
class DiversitySampler:
|
| 224 |
+
"""
|
| 225 |
+
Sample diverse examples from dataset
|
| 226 |
+
"""
|
| 227 |
+
@staticmethod
|
| 228 |
+
def max_diversity_sampling(dataset,
|
| 229 |
+
n_samples=100,
|
| 230 |
+
feature_columns=['length', 'score'],
|
| 231 |
+
seed=42):
|
| 232 |
+
"""
|
| 233 |
+
Maksimum diversity için örnekleri seç
|
| 234 |
+
K-means benzeri algoritma
|
| 235 |
+
"""
|
| 236 |
+
random.seed(seed)
|
| 237 |
+
np.random.seed(seed)
|
| 238 |
+
|
| 239 |
+
# Feature matrix oluştur
|
| 240 |
+
features = []
|
| 241 |
+
for ex in dataset:
|
| 242 |
+
feat_vector = [ex[col] for col in feature_columns]
|
| 243 |
+
features.append(feat_vector)
|
| 244 |
+
features = np.array(features)
|
| 245 |
+
|
| 246 |
+
# Normalize
|
| 247 |
+
features = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)
|
| 248 |
+
|
| 249 |
+
# Greedy selection
|
| 250 |
+
selected_indices = []
|
| 251 |
+
|
| 252 |
+
# İlk örneği random seç
|
| 253 |
+
first_idx = random.randint(0, len(dataset) - 1)
|
| 254 |
+
selected_indices.append(first_idx)
|
| 255 |
+
|
| 256 |
+
# Kalan örnekleri seç
|
| 257 |
+
for _ in range(n_samples - 1):
|
| 258 |
+
max_dist = -1
|
| 259 |
+
best_idx = -1
|
| 260 |
+
|
| 261 |
+
# Her aday için min distance to selected hesapla
|
| 262 |
+
for candidate_idx in range(len(dataset)):
|
| 263 |
+
if candidate_idx in selected_indices:
|
| 264 |
+
continue
|
| 265 |
+
|
| 266 |
+
# Min distance to any selected point
|
| 267 |
+
min_dist = float('inf')
|
| 268 |
+
for sel_idx in selected_indices:
|
| 269 |
+
dist = np.linalg.norm(
|
| 270 |
+
features[candidate_idx] - features[sel_idx]
|
| 271 |
+
)
|
| 272 |
+
min_dist = min(min_dist, dist)
|
| 273 |
+
|
| 274 |
+
# En uzak olanı seç
|
| 275 |
+
if min_dist > max_dist:
|
| 276 |
+
max_dist = min_dist
|
| 277 |
+
best_idx = candidate_idx
|
| 278 |
+
|
| 279 |
+
if best_idx != -1:
|
| 280 |
+
selected_indices.append(best_idx)
|
| 281 |
+
|
| 282 |
+
return dataset.select(selected_indices)
|
| 283 |
+
|
| 284 |
+
@staticmethod
|
| 285 |
+
def coverage_based_sampling(dataset,
|
| 286 |
+
coverage_column='domain',
|
| 287 |
+
n_samples_per_value=20):
|
| 288 |
+
"""
|
| 289 |
+
Her category'den belirli sayıda örnek al (coverage)
|
| 290 |
+
"""
|
| 291 |
+
groups = defaultdict(list)
|
| 292 |
+
for i, ex in enumerate(dataset):
|
| 293 |
+
groups[ex[coverage_column]].append(i)
|
| 294 |
+
|
| 295 |
+
selected_indices = []
|
| 296 |
+
for group_indices in groups.values():
|
| 297 |
+
n = min(n_samples_per_value, len(group_indices))
|
| 298 |
+
sampled = random.sample(group_indices, n)
|
| 299 |
+
selected_indices.extend(sampled)
|
| 300 |
+
|
| 301 |
+
return dataset.select(sorted(selected_indices))
|
| 302 |
+
|
| 303 |
+
# Test diversity sampling
|
| 304 |
+
print("\n Max diversity sampling:")
|
| 305 |
+
diverse_sample = DiversitySampler.max_diversity_sampling(
|
| 306 |
+
dataset,
|
| 307 |
+
n_samples=100,
|
| 308 |
+
feature_columns=['length', 'score']
|
| 309 |
+
)
|
| 310 |
+
|
| 311 |
+
print(f" Selected: {len(diverse_sample)} diverse examples")
|
| 312 |
+
|
| 313 |
+
# Diversity ölçüsü
|
| 314 |
+
lengths = [ex['length'] for ex in diverse_sample]
|
| 315 |
+
scores = [ex['score'] for ex in diverse_sample]
|
| 316 |
+
print(f" Length range: {min(lengths)} - {max(lengths)}")
|
| 317 |
+
print(f" Length std: {np.std(lengths):.2f}")
|
| 318 |
+
print(f" Score range: {min(scores):.3f} - {max(scores):.3f}")
|
| 319 |
+
|
| 320 |
+
# Coverage sampling
|
| 321 |
+
print("\n Coverage-based sampling:")
|
| 322 |
+
coverage_sample = DiversitySampler.coverage_based_sampling(
|
| 323 |
+
dataset,
|
| 324 |
+
coverage_column='domain',
|
| 325 |
+
n_samples_per_value=20
|
| 326 |
+
)
|
| 327 |
+
|
| 328 |
+
print(f" Selected: {len(coverage_sample)} examples")
|
| 329 |
+
domains = [ex['domain'] for ex in coverage_sample]
|
| 330 |
+
domain_dist = Counter(domains)
|
| 331 |
+
print(f" Domain distribution: {dict(domain_dist)}")
|
| 332 |
+
|
| 333 |
+
|
| 334 |
+
print("\n4️⃣ Active Learning Sampling:")
|
| 335 |
+
|
| 336 |
+
class ActiveLearningSampler:
|
| 337 |
+
"""
|
| 338 |
+
Active learning için uncertainty-based sampling
|
| 339 |
+
"""
|
| 340 |
+
@staticmethod
|
| 341 |
+
def uncertainty_sampling(dataset,
|
| 342 |
+
uncertainty_scores: List[float],
|
| 343 |
+
n_samples=100,
|
| 344 |
+
strategy='least_confident'):
|
| 345 |
+
"""
|
| 346 |
+
Model uncertainty'ye göre sample
|
| 347 |
+
"""
|
| 348 |
+
if len(uncertainty_scores) != len(dataset):
|
| 349 |
+
raise ValueError("Uncertainty scores must match dataset size")
|
| 350 |
+
|
| 351 |
+
# Strategy'ye göre sırala
|
| 352 |
+
if strategy == 'least_confident':
|
| 353 |
+
# En düşük confidence (en yüksek uncertainty)
|
| 354 |
+
sorted_indices = np.argsort(uncertainty_scores)[::-1]
|
| 355 |
+
elif strategy == 'margin':
|
| 356 |
+
# En düşük margin
|
| 357 |
+
sorted_indices = np.argsort(uncertainty_scores)
|
| 358 |
+
else:
|
| 359 |
+
sorted_indices = np.argsort(uncertainty_scores)[::-1]
|
| 360 |
+
|
| 361 |
+
# Top n'i seç
|
| 362 |
+
selected_indices = sorted_indices[:n_samples].tolist()
|
| 363 |
+
|
| 364 |
+
return dataset.select(selected_indices)
|
| 365 |
+
|
| 366 |
+
@staticmethod
|
| 367 |
+
def diversity_uncertainty_sampling(dataset,
|
| 368 |
+
uncertainty_scores: List[float],
|
| 369 |
+
n_samples=100,
|
| 370 |
+
diversity_weight=0.5):
|
| 371 |
+
"""
|
| 372 |
+
Uncertainty + Diversity kombinasyonu
|
| 373 |
+
"""
|
| 374 |
+
# Simulated diversity scores (gerçekte embedding distance kullanılır)
|
| 375 |
+
diversity_scores = [random.random() for _ in range(len(dataset))]
|
| 376 |
+
|
| 377 |
+
# Combined score
|
| 378 |
+
combined_scores = [
|
| 379 |
+
(1 - diversity_weight) * uncertainty_scores[i] +
|
| 380 |
+
diversity_weight * diversity_scores[i]
|
| 381 |
+
for i in range(len(dataset))
|
| 382 |
+
]
|
| 383 |
+
|
| 384 |
+
# Top n
|
| 385 |
+
sorted_indices = np.argsort(combined_scores)[::-1]
|
| 386 |
+
selected_indices = sorted_indices[:n_samples].tolist()
|
| 387 |
+
|
| 388 |
+
return dataset.select(selected_indices)
|
| 389 |
+
|
| 390 |
+
# Test active learning sampling
|
| 391 |
+
print("\n Uncertainty-based sampling:")
|
| 392 |
+
|
| 393 |
+
# Simulate uncertainty scores (gerçekte model'den gelir)
|
| 394 |
+
uncertainty_scores = [random.random() for _ in range(len(dataset))]
|
| 395 |
+
|
| 396 |
+
uncertain_sample = ActiveLearningSampler.uncertainty_sampling(
|
| 397 |
+
dataset,
|
| 398 |
+
uncertainty_scores,
|
| 399 |
+
n_samples=50,
|
| 400 |
+
strategy='least_confident'
|
| 401 |
+
)
|
| 402 |
+
|
| 403 |
+
print(f" Selected: {len(uncertain_sample)} most uncertain examples")
|
| 404 |
+
selected_uncertainties = [uncertainty_scores[i] for i in range(50)]
|
| 405 |
+
print(f" Avg uncertainty: {np.mean(selected_uncertainties):.3f}")
|
| 406 |
+
|
| 407 |
+
# Diversity + Uncertainty
|
| 408 |
+
print("\n Diversity + Uncertainty sampling:")
|
| 409 |
+
diverse_uncertain = ActiveLearningSampler.diversity_uncertainty_sampling(
|
| 410 |
+
dataset,
|
| 411 |
+
uncertainty_scores,
|
| 412 |
+
n_samples=50,
|
| 413 |
+
diversity_weight=0.3 # 30% diversity, 70% uncertainty
|
| 414 |
+
)
|
| 415 |
+
|
| 416 |
+
print(f" Selected: {len(diverse_uncertain)} examples")
|
| 417 |
+
|
| 418 |
+
|
| 419 |
+
print("\n" + "="*70)
|
| 420 |
+
print("6. DYNAMIC BATCHING")
|
| 421 |
+
print("="*70)
|
| 422 |
+
|
| 423 |
+
print("\n📦 Dynamic Batching Strategies:")
|
| 424 |
+
|
| 425 |
+
class DynamicBatcher:
|
| 426 |
+
"""
|
| 427 |
+
Dynamic batching for efficient training
|
| 428 |
+
"""
|
| 429 |
+
def __init__(self, dataset, batch_size=32):
|
| 430 |
+
self.dataset = dataset
|
| 431 |
+
self.batch_size = batch_size
|
| 432 |
+
|
| 433 |
+
def length_based_batching(self, length_column='length', max_length_diff=50):
|
| 434 |
+
"""
|
| 435 |
+
Benzer uzunluktaki örnekleri aynı batch'te topla
|
| 436 |
+
"""
|
| 437 |
+
# Uzunluğa göre sırala
|
| 438 |
+
sorted_indices = sorted(
|
| 439 |
+
range(len(self.dataset)),
|
| 440 |
+
key=lambda i: self.dataset[i][length_column]
|
| 441 |
+
)
|
| 442 |
+
|
| 443 |
+
# Batch'leri oluştur
|
| 444 |
+
batches = []
|
| 445 |
+
for i in range(0, len(sorted_indices), self.batch_size):
|
| 446 |
+
batch_indices = sorted_indices[i:i + self.batch_size]
|
| 447 |
+
batches.append(self.dataset.select(batch_indices))
|
| 448 |
+
|
| 449 |
+
return batches
|
| 450 |
+
|
| 451 |
+
def bucket_batching(self, length_column='length', n_buckets=5):
|
| 452 |
+
"""
|
| 453 |
+
Bucket-based batching - uzunluklara göre gruplama
|
| 454 |
+
"""
|
| 455 |
+
lengths = [ex[length_column] for ex in self.dataset]
|
| 456 |
+
min_len, max_len = min(lengths), max(lengths)
|
| 457 |
+
|
| 458 |
+
# Bucket boundaries
|
| 459 |
+
bucket_size = (max_len - min_len) / n_buckets
|
| 460 |
+
buckets = [[] for _ in range(n_buckets)]
|
| 461 |
+
|
| 462 |
+
# Örnekleri bucket'lara ata
|
| 463 |
+
for i, ex in enumerate(self.dataset):
|
| 464 |
+
length = ex[length_column]
|
| 465 |
+
bucket_idx = min(int((length - min_len) / bucket_size), n_buckets - 1)
|
| 466 |
+
buckets[bucket_idx].append(i)
|
| 467 |
+
|
| 468 |
+
# Her bucket'tan batch'ler oluştur
|
| 469 |
+
all_batches = []
|
| 470 |
+
for bucket_indices in buckets:
|
| 471 |
+
random.shuffle(bucket_indices)
|
| 472 |
+
for i in range(0, len(bucket_indices), self.batch_size):
|
| 473 |
+
batch_indices = bucket_indices[i:i + self.batch_size]
|
| 474 |
+
all_batches.append(self.dataset.select(batch_indices))
|
| 475 |
+
|
| 476 |
+
return all_batches
|
| 477 |
+
|
| 478 |
+
def get_batch_statistics(self, batches, length_column='length'):
|
| 479 |
+
"""
|
| 480 |
+
Batch istatistiklerini hesapla
|
| 481 |
+
"""
|
| 482 |
+
stats = []
|
| 483 |
+
for i, batch in enumerate(batches):
|
| 484 |
+
lengths = [ex[length_column] for ex in batch]
|
| 485 |
+
stats.append({
|
| 486 |
+
'batch_id': i,
|
| 487 |
+
'size': len(batch),
|
| 488 |
+
'min_length': min(lengths),
|
| 489 |
+
'max_length': max(lengths),
|
| 490 |
+
'avg_length': np.mean(lengths),
|
| 491 |
+
'std_length': np.std(lengths)
|
| 492 |
+
})
|
| 493 |
+
return stats
|
| 494 |
+
|
| 495 |
+
# Test dynamic batching
|
| 496 |
+
print("\n1️⃣ Length-based Batching:")
|
| 497 |
+
batcher = DynamicBatcher(dataset, batch_size=50)
|
| 498 |
+
|
| 499 |
+
length_batches = batcher.length_based_batching(length_column='length')
|
| 500 |
+
print(f" Total batches: {len(length_batches)}")
|
| 501 |
+
|
| 502 |
+
# İlk 5 batch'in istatistikleri
|
| 503 |
+
stats = batcher.get_batch_statistics(length_batches[:5])
|
| 504 |
+
print(f"\n First 5 batch statistics:")
|
| 505 |
+
for stat in stats:
|
| 506 |
+
print(f" Batch {stat['batch_id']}: "
|
| 507 |
+
f"size={stat['size']}, "
|
| 508 |
+
f"length range=[{stat['min_length']}-{stat['max_length']}], "
|
| 509 |
+
f"std={stat['std_length']:.1f}")
|
| 510 |
+
|
| 511 |
+
# Padding efficiency
|
| 512 |
+
print(f"\n Padding efficiency:")
|
| 513 |
+
total_padding = sum(
|
| 514 |
+
(stat['max_length'] - stat['avg_length']) * stat['size']
|
| 515 |
+
for stat in stats
|
| 516 |
+
)
|
| 517 |
+
print(f" Average padding per example: {total_padding / sum(s['size'] for s in stats):.1f}")
|
| 518 |
+
|
| 519 |
+
|
| 520 |
+
print("\n2️⃣ Bucket Batching:")
|
| 521 |
+
bucket_batches = batcher.bucket_batching(n_buckets=5)
|
| 522 |
+
print(f" Total batches: {len(bucket_batches)}")
|
| 523 |
+
|
| 524 |
+
# Bucket istatistikleri
|
| 525 |
+
bucket_stats = batcher.get_batch_statistics(bucket_batches[:10])
|
| 526 |
+
print(f"\n Sample bucket statistics:")
|
| 527 |
+
for stat in bucket_stats[:5]:
|
| 528 |
+
print(f" Batch {stat['batch_id']}: "
|
| 529 |
+
f"size={stat['size']}, "
|
| 530 |
+
f"length range=[{stat['min_length']}-{stat['max_length']}]")
|
| 531 |
+
|
| 532 |
+
|
| 533 |
+
print("\n3️⃣ Smart Batch Composition:")
|
| 534 |
+
|
| 535 |
+
class SmartBatcher:
|
| 536 |
+
"""
|
| 537 |
+
Intelligent batch composition
|
| 538 |
+
"""
|
| 539 |
+
@staticmethod
|
| 540 |
+
def create_balanced_batches(dataset,
|
| 541 |
+
label_column='label',
|
| 542 |
+
batch_size=32):
|
| 543 |
+
"""
|
| 544 |
+
Her batch'te class balance sağla
|
| 545 |
+
"""
|
| 546 |
+
# Label'lara göre grupla
|
| 547 |
+
label_groups = defaultdict(list)
|
| 548 |
+
for i, ex in enumerate(dataset):
|
| 549 |
+
label_groups[ex[label_column]].append(i)
|
| 550 |
+
|
| 551 |
+
# Her label'dan eşit sayıda örnek al
|
| 552 |
+
n_labels = len(label_groups)
|
| 553 |
+
per_label = batch_size // n_labels
|
| 554 |
+
|
| 555 |
+
batches = []
|
| 556 |
+
max_iterations = max(len(indices) for indices in label_groups.values()) // per_label
|
| 557 |
+
|
| 558 |
+
for iteration in range(max_iterations):
|
| 559 |
+
batch_indices = []
|
| 560 |
+
|
| 561 |
+
for label, indices in label_groups.items():
|
| 562 |
+
start = iteration * per_label
|
| 563 |
+
end = start + per_label
|
| 564 |
+
if start < len(indices):
|
| 565 |
+
batch_indices.extend(indices[start:min(end, len(indices))])
|
| 566 |
+
|
| 567 |
+
if batch_indices:
|
| 568 |
+
random.shuffle(batch_indices)
|
| 569 |
+
batches.append(dataset.select(batch_indices))
|
| 570 |
+
|
| 571 |
+
return batches
|
| 572 |
+
|
| 573 |
+
@staticmethod
|
| 574 |
+
def create_diverse_batches(dataset,
|
| 575 |
+
diversity_column='domain',
|
| 576 |
+
batch_size=32):
|
| 577 |
+
"""
|
| 578 |
+
Her batch'te çeşitlilik sağla
|
| 579 |
+
"""
|
| 580 |
+
groups = defaultdict(list)
|
| 581 |
+
for i, ex in enumerate(dataset):
|
| 582 |
+
groups[ex[diversity_column]].append(i)
|
| 583 |
+
|
| 584 |
+
# Round-robin şeklinde batch oluştur
|
| 585 |
+
all_indices = list(range(len(dataset)))
|
| 586 |
+
random.shuffle(all_indices)
|
| 587 |
+
|
| 588 |
+
batches = []
|
| 589 |
+
for i in range(0, len(all_indices), batch_size):
|
| 590 |
+
batch_indices = all_indices[i:i + batch_size]
|
| 591 |
+
batches.append(dataset.select(batch_indices))
|
| 592 |
+
|
| 593 |
+
return batches
|
| 594 |
+
|
| 595 |
+
# Test smart batching
|
| 596 |
+
print("\n Balanced batches:")
|
| 597 |
+
balanced_batches = SmartBatcher.create_balanced_batches(dataset, batch_size=30)
|
| 598 |
+
print(f" Created: {len(balanced_batches)} batches")
|
| 599 |
+
|
| 600 |
+
# İlk batch'in label distribution'ı
|
| 601 |
+
first_batch_labels = [ex['label'] for ex in balanced_batches[0]]
|
| 602 |
+
label_dist = Counter(first_batch_labels)
|
| 603 |
+
print(f" First batch label distribution: {dict(label_dist)}")
|
| 604 |
+
|
| 605 |
+
print("\n Diverse batches:")
|
| 606 |
+
diverse_batches = SmartBatcher.create_diverse_batches(dataset, batch_size=30)
|
| 607 |
+
print(f" Created: {len(diverse_batches)} batches")
|
| 608 |
+
|
| 609 |
+
# İlk batch'in domain distribution'ı
|
| 610 |
+
first_batch_domains = [ex['domain'] for ex in diverse_batches[0]]
|
| 611 |
+
domain_dist = Counter(first_batch_domains)
|
| 612 |
+
print(f" First batch domain distribution: {dict(domain_dist)}")
|
| 613 |
+
|
| 614 |
+
|
| 615 |
+
print("\n" + "="*70)
|
| 616 |
+
print("7. PRODUCTION-READY PATTERNS")
|
| 617 |
+
print("="*70)
|
| 618 |
+
|
| 619 |
+
print("\n🎯 Real-World Integration Patterns:")
|
| 620 |
+
|
| 621 |
+
class DatasetManager:
|
| 622 |
+
"""
|
| 623 |
+
Production-ready dataset management
|
| 624 |
+
"""
|
| 625 |
+
def __init__(self, dataset, validation_rules=None):
|
| 626 |
+
self.dataset = dataset
|
| 627 |
+
self.validation_rules = validation_rules or []
|
| 628 |
+
self.statistics = {}
|
| 629 |
+
|
| 630 |
+
def validate(self):
|
| 631 |
+
"""Dataset'i validate et"""
|
| 632 |
+
print("\n Validating dataset...")
|
| 633 |
+
issues = []
|
| 634 |
+
|
| 635 |
+
# Temel validations
|
| 636 |
+
if len(self.dataset) == 0:
|
| 637 |
+
issues.append("Dataset is empty")
|
| 638 |
+
|
| 639 |
+
# Custom validation rules
|
| 640 |
+
for rule in self.validation_rules:
|
| 641 |
+
try:
|
| 642 |
+
result = rule(self.dataset)
|
| 643 |
+
if not result['valid']:
|
| 644 |
+
issues.append(result['message'])
|
| 645 |
+
except Exception as e:
|
| 646 |
+
issues.append(f"Validation error: {str(e)}")
|
| 647 |
+
|
| 648 |
+
if issues:
|
| 649 |
+
print(f" ⚠️ Found {len(issues)} issues:")
|
| 650 |
+
for issue in issues:
|
| 651 |
+
print(f" - {issue}")
|
| 652 |
+
return False
|
| 653 |
+
else:
|
| 654 |
+
print(" ✅ Validation passed")
|
| 655 |
+
return True
|
| 656 |
+
|
| 657 |
+
def compute_statistics(self):
|
| 658 |
+
"""İstatistikleri hesapla"""
|
| 659 |
+
print("\n Computing statistics...")
|
| 660 |
+
|
| 661 |
+
self.statistics = {
|
| 662 |
+
'size': len(self.dataset),
|
| 663 |
+
'columns': self.dataset.column_names,
|
| 664 |
+
'memory_size': len(str(self.dataset)), # Approximation
|
| 665 |
+
}
|
| 666 |
+
|
| 667 |
+
# Numeric column statistics
|
| 668 |
+
for col in self.dataset.column_names:
|
| 669 |
+
try:
|
| 670 |
+
values = [ex[col] for ex in self.dataset.select(range(min(100, len(self.dataset))))]
|
| 671 |
+
if all(isinstance(v, (int, float)) for v in values):
|
| 672 |
+
self.statistics[f'{col}_stats'] = {
|
| 673 |
+
'mean': np.mean(values),
|
| 674 |
+
'std': np.std(values),
|
| 675 |
+
'min': np.min(values),
|
| 676 |
+
'max': np.max(values)
|
| 677 |
+
}
|
| 678 |
+
except:
|
| 679 |
+
pass
|
| 680 |
+
|
| 681 |
+
print(f" ✅ Statistics computed")
|
| 682 |
+
return self.statistics
|
| 683 |
+
|
| 684 |
+
def summary(self):
|
| 685 |
+
"""Dataset özeti"""
|
| 686 |
+
print(f"\n📊 Dataset Summary:")
|
| 687 |
+
print(f" Size: {len(self.dataset):,} examples")
|
| 688 |
+
print(f" Columns: {len(self.dataset.column_names)}")
|
| 689 |
+
print(f" Column names: {', '.join(self.dataset.column_names[:5])}...")
|
| 690 |
+
|
| 691 |
+
# Test production patterns
|
| 692 |
+
print("\n Creating dataset manager:")
|
| 693 |
+
|
| 694 |
+
# Custom validation rules
|
| 695 |
+
def check_text_length(dataset):
|
| 696 |
+
lengths = [len(ex['text']) for ex in dataset.select(range(min(100, len(dataset))))]
|
| 697 |
+
avg_length = np.mean(lengths)
|
| 698 |
+
return {
|
| 699 |
+
'valid': avg_length > 10,
|
| 700 |
+
'message': f"Average text length too short: {avg_length:.1f}"
|
| 701 |
+
}
|
| 702 |
+
|
| 703 |
+
def check_label_distribution(dataset):
|
| 704 |
+
labels = [ex['label'] for ex in dataset]
|
| 705 |
+
label_counts = Counter(labels)
|
| 706 |
+
min_count = min(label_counts.values())
|
| 707 |
+
return {
|
| 708 |
+
'valid': min_count >= 10,
|
| 709 |
+
'message': f"Imbalanced labels: min count = {min_count}"
|
| 710 |
+
}
|
| 711 |
+
|
| 712 |
+
manager = DatasetManager(
|
| 713 |
+
dataset,
|
| 714 |
+
validation_rules=[check_text_length, check_label_distribution]
|
| 715 |
+
)
|
| 716 |
+
|
| 717 |
+
# Validate
|
| 718 |
+
manager.validate()
|
| 719 |
+
|
| 720 |
+
# Statistics
|
| 721 |
+
stats = manager.compute_statistics()
|
| 722 |
+
|
| 723 |
+
# Summary
|
| 724 |
+
manager.summary()
|
| 725 |
+
|
| 726 |
+
|
| 727 |
+
print("\n" + "="*70)
|
| 728 |
+
print("✅ BÖLÜM 3 TAMAMLANDI!")
|
| 729 |
+
print("="*70)
|
| 730 |
+
|
| 731 |
+
print(f"""
|
| 732 |
+
Bu bölümde öğrendikleriniz (Tam Liste):
|
| 733 |
+
|
| 734 |
+
PART 1:
|
| 735 |
+
✓ Custom Data Collators (3 tip: Simple, Padding, Advanced)
|
| 736 |
+
✓ Advanced Feature Extraction (10+ features)
|
| 737 |
+
✓ Feature Transformation & Normalization
|
| 738 |
+
✓ Interaction Features
|
| 739 |
+
✓ End-to-End Preprocessing Pipelines
|
| 740 |
+
✓ Pipeline Templates
|
| 741 |
+
✓ Data Augmentation (Word deletion, swap, synonym)
|
| 742 |
+
✓ Smart Class Balancing
|
| 743 |
+
|
| 744 |
+
PART 2:
|
| 745 |
+
✓ Complex Multi-Condition Filtering
|
| 746 |
+
✓ Percentile Filtering
|
| 747 |
+
✓ Stratified Sampling (Single & Multi-column)
|
| 748 |
+
✓ Diversity Sampling (Max diversity, Coverage-based)
|
| 749 |
+
✓ Active Learning Sampling (Uncertainty-based)
|
| 750 |
+
✓ Dynamic Batching (Length-based, Bucket-based)
|
| 751 |
+
✓ Smart Batch Composition (Balanced, Diverse)
|
| 752 |
+
✓ Production-Ready Dataset Management
|
| 753 |
+
|
| 754 |
+
📊 PERFORMANS KAZANIMLARI:
|
| 755 |
+
- Dynamic batching: Padding'i %40+ azaltır
|
| 756 |
+
- Stratified sampling: Balanced splits
|
| 757 |
+
- Diversity sampling: Daha representative data
|
| 758 |
+
- Smart augmentation: 3x veri artışı
|
| 759 |
+
|
| 760 |
+
🎯 KEY TAKEAWAYS:
|
| 761 |
+
- Collator'lar modele göre özelleştirilmeli
|
| 762 |
+
- Pipeline pattern code organization'ı kolaylaştırır
|
| 763 |
+
- Augmentation class imbalance'ı çözer
|
| 764 |
+
- Stratified sampling generalization'ı iyileştirir
|
| 765 |
+
- Dynamic batching training efficiency'yi artırır
|
| 766 |
+
|
| 767 |
+
📚 SONRAKI BÖLÜM: Özel Görevler İçin Datasets
|
| 768 |
+
- Question Answering (SQuAD, Natural Questions)
|
| 769 |
+
- Summarization (CNN/DailyMail)
|
| 770 |
+
- Named Entity Recognition
|
| 771 |
+
- Sentiment Analysis
|
| 772 |
+
- Text Classification
|
| 773 |
+
""")
|
| 774 |
+
|
| 775 |
+
print("\n🎉 Tebrikler! İleri teknikler modülünü tamamladınız!")
|
| 776 |
+
print("4. bölüme geçelim mi? (Özel Görevler)")
|
space/modules/04_ozel_gorevler.py
ADDED
|
@@ -0,0 +1,1039 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
ÖZEL GÖREVLER İÇİN DATASETS - İLERİ SEVİYE
|
| 3 |
+
==========================================
|
| 4 |
+
|
| 5 |
+
Bu modülde öğrenecekleriniz:
|
| 6 |
+
1. Question Answering (QA) Datasets
|
| 7 |
+
2. Summarization Datasets
|
| 8 |
+
3. Named Entity Recognition (NER)
|
| 9 |
+
4. Sentiment Analysis
|
| 10 |
+
5. Text Classification
|
| 11 |
+
6. Multi-Task Learning Datasets
|
| 12 |
+
"""
|
| 13 |
+
|
| 14 |
+
from datasets import Dataset, DatasetDict
|
| 15 |
+
import numpy as np
|
| 16 |
+
from typing import Dict, List, Any
|
| 17 |
+
import random
|
| 18 |
+
from collections import Counter, defaultdict
|
| 19 |
+
import json
|
| 20 |
+
|
| 21 |
+
print("="*70)
|
| 22 |
+
print("📚 ÖZEL GÖREVLER İÇİN DATASETS")
|
| 23 |
+
print("="*70)
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
print("\n" + "="*70)
|
| 27 |
+
print("1. QUESTION ANSWERING (QA) DATASETS")
|
| 28 |
+
print("="*70)
|
| 29 |
+
|
| 30 |
+
print("\n❓ Question Answering Dataset Yapısı:")
|
| 31 |
+
|
| 32 |
+
class QADatasetCreator:
|
| 33 |
+
"""
|
| 34 |
+
Question Answering dataset oluşturucu
|
| 35 |
+
"""
|
| 36 |
+
@staticmethod
|
| 37 |
+
def create_extractive_qa_dataset(num_samples=200):
|
| 38 |
+
"""
|
| 39 |
+
Extractive QA (SQuAD-style)
|
| 40 |
+
Cevap context'in içinden extract edilir
|
| 41 |
+
"""
|
| 42 |
+
contexts = [
|
| 43 |
+
"The Amazon rainforest, also known as Amazonia, is a moist broadleaf tropical rainforest. "
|
| 44 |
+
"It covers most of the Amazon basin of South America. The basin covers 7 million square kilometers. "
|
| 45 |
+
"The rainforest contains approximately 390 billion individual trees.",
|
| 46 |
+
|
| 47 |
+
"Python is a high-level programming language. It was created by Guido van Rossum in 1991. "
|
| 48 |
+
"Python emphasizes code readability with significant indentation. It supports multiple programming paradigms "
|
| 49 |
+
"including structured, object-oriented and functional programming.",
|
| 50 |
+
|
| 51 |
+
"The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. "
|
| 52 |
+
"It was designed by Gustave Eiffel and completed in 1889. Standing 330 meters tall, "
|
| 53 |
+
"it was the world's tallest man-made structure until 1930.",
|
| 54 |
+
|
| 55 |
+
"Artificial Intelligence is the simulation of human intelligence by machines. "
|
| 56 |
+
"AI research began in 1956 at Dartmouth College. Modern AI techniques include "
|
| 57 |
+
"machine learning, deep learning, and natural language processing."
|
| 58 |
+
]
|
| 59 |
+
|
| 60 |
+
qa_pairs = [
|
| 61 |
+
("What is the Amazon rainforest?", "a moist broadleaf tropical rainforest", 0),
|
| 62 |
+
("How many square kilometers does the Amazon basin cover?", "7 million square kilometers", 0),
|
| 63 |
+
("Who created Python?", "Guido van Rossum", 1),
|
| 64 |
+
("When was Python created?", "1991", 1),
|
| 65 |
+
("Where is the Eiffel Tower located?", "Paris, France", 2),
|
| 66 |
+
("How tall is the Eiffel Tower?", "330 meters", 2),
|
| 67 |
+
("When did AI research begin?", "1956", 3),
|
| 68 |
+
("Where did AI research begin?", "Dartmouth College", 3),
|
| 69 |
+
]
|
| 70 |
+
|
| 71 |
+
def gen():
|
| 72 |
+
for i in range(num_samples):
|
| 73 |
+
context_idx = i % len(contexts)
|
| 74 |
+
qa_idx = i % len(qa_pairs)
|
| 75 |
+
|
| 76 |
+
context = contexts[context_idx]
|
| 77 |
+
question, answer, expected_ctx = qa_pairs[qa_idx]
|
| 78 |
+
|
| 79 |
+
# Answer span'i bul
|
| 80 |
+
answer_start = context.find(answer) if context_idx == expected_ctx else -1
|
| 81 |
+
|
| 82 |
+
yield {
|
| 83 |
+
'id': f'qa_{i}',
|
| 84 |
+
'context': context,
|
| 85 |
+
'question': question,
|
| 86 |
+
'answers': {
|
| 87 |
+
'text': [answer],
|
| 88 |
+
'answer_start': [answer_start] if answer_start >= 0 else [-1]
|
| 89 |
+
},
|
| 90 |
+
'is_impossible': answer_start < 0
|
| 91 |
+
}
|
| 92 |
+
|
| 93 |
+
return Dataset.from_generator(gen)
|
| 94 |
+
|
| 95 |
+
@staticmethod
|
| 96 |
+
def create_multiple_choice_qa(num_samples=100):
|
| 97 |
+
"""
|
| 98 |
+
Multiple Choice QA
|
| 99 |
+
"""
|
| 100 |
+
questions = [
|
| 101 |
+
{
|
| 102 |
+
'question': 'What is the capital of France?',
|
| 103 |
+
'choices': ['London', 'Berlin', 'Paris', 'Madrid'],
|
| 104 |
+
'answer': 2
|
| 105 |
+
},
|
| 106 |
+
{
|
| 107 |
+
'question': 'Which planet is known as the Red Planet?',
|
| 108 |
+
'choices': ['Venus', 'Mars', 'Jupiter', 'Saturn'],
|
| 109 |
+
'answer': 1
|
| 110 |
+
},
|
| 111 |
+
{
|
| 112 |
+
'question': 'Who wrote Romeo and Juliet?',
|
| 113 |
+
'choices': ['Charles Dickens', 'William Shakespeare', 'Jane Austen', 'Mark Twain'],
|
| 114 |
+
'answer': 1
|
| 115 |
+
}
|
| 116 |
+
]
|
| 117 |
+
|
| 118 |
+
def gen():
|
| 119 |
+
for i in range(num_samples):
|
| 120 |
+
q = questions[i % len(questions)]
|
| 121 |
+
yield {
|
| 122 |
+
'id': f'mcqa_{i}',
|
| 123 |
+
'question': q['question'],
|
| 124 |
+
'choices': q['choices'],
|
| 125 |
+
'answer': q['answer'],
|
| 126 |
+
'answer_text': q['choices'][q['answer']]
|
| 127 |
+
}
|
| 128 |
+
|
| 129 |
+
return Dataset.from_generator(gen)
|
| 130 |
+
|
| 131 |
+
print("\n1️⃣ Extractive QA Dataset (SQuAD-style):")
|
| 132 |
+
qa_dataset = QADatasetCreator.create_extractive_qa_dataset(200)
|
| 133 |
+
|
| 134 |
+
print(f"✅ Dataset: {len(qa_dataset)} QA pairs")
|
| 135 |
+
print(f"\nÖrnek QA:")
|
| 136 |
+
sample = qa_dataset[0]
|
| 137 |
+
print(f" Context: {sample['context'][:100]}...")
|
| 138 |
+
print(f" Question: {sample['question']}")
|
| 139 |
+
print(f" Answer: {sample['answers']['text'][0]}")
|
| 140 |
+
print(f" Answer start: {sample['answers']['answer_start'][0]}")
|
| 141 |
+
print(f" Is impossible: {sample['is_impossible']}")
|
| 142 |
+
|
| 143 |
+
# İstatistikler
|
| 144 |
+
print(f"\n📊 QA Statistics:")
|
| 145 |
+
impossible_count = sum([1 for ex in qa_dataset if ex['is_impossible']])
|
| 146 |
+
print(f" Total questions: {len(qa_dataset)}")
|
| 147 |
+
print(f" Answerable: {len(qa_dataset) - impossible_count}")
|
| 148 |
+
print(f" Impossible: {impossible_count}")
|
| 149 |
+
|
| 150 |
+
# Answer length distribution
|
| 151 |
+
answerable = [ex for ex in qa_dataset if not ex['is_impossible']]
|
| 152 |
+
answer_lengths = [len(ex['answers']['text'][0].split()) for ex in answerable]
|
| 153 |
+
print(f" Avg answer length: {np.mean(answer_lengths):.1f} words")
|
| 154 |
+
|
| 155 |
+
|
| 156 |
+
print("\n2️⃣ Multiple Choice QA:")
|
| 157 |
+
mcqa_dataset = QADatasetCreator.create_multiple_choice_qa(100)
|
| 158 |
+
|
| 159 |
+
print(f"✅ Dataset: {len(mcqa_dataset)} questions")
|
| 160 |
+
print(f"\nÖrnek:")
|
| 161 |
+
sample = mcqa_dataset[0]
|
| 162 |
+
print(f" Question: {sample['question']}")
|
| 163 |
+
print(f" Choices:")
|
| 164 |
+
for i, choice in enumerate(sample['choices']):
|
| 165 |
+
marker = "✓" if i == sample['answer'] else " "
|
| 166 |
+
print(f" {marker} {i}. {choice}")
|
| 167 |
+
print(f" Correct answer: {sample['answer_text']}")
|
| 168 |
+
|
| 169 |
+
|
| 170 |
+
print("\n3️⃣ QA Preprocessing Pipeline:")
|
| 171 |
+
|
| 172 |
+
class QAPreprocessor:
|
| 173 |
+
"""
|
| 174 |
+
QA-specific preprocessing
|
| 175 |
+
"""
|
| 176 |
+
@staticmethod
|
| 177 |
+
def validate_qa_example(example):
|
| 178 |
+
"""
|
| 179 |
+
QA örneğini validate et
|
| 180 |
+
"""
|
| 181 |
+
if example['is_impossible']:
|
| 182 |
+
return True
|
| 183 |
+
|
| 184 |
+
answer = example['answers']['text'][0]
|
| 185 |
+
answer_start = example['answers']['answer_start'][0]
|
| 186 |
+
context = example['context']
|
| 187 |
+
|
| 188 |
+
# Answer context'te var mı?
|
| 189 |
+
if answer_start >= 0:
|
| 190 |
+
extracted = context[answer_start:answer_start + len(answer)]
|
| 191 |
+
return extracted == answer
|
| 192 |
+
return False
|
| 193 |
+
|
| 194 |
+
@staticmethod
|
| 195 |
+
def add_qa_features(example):
|
| 196 |
+
"""
|
| 197 |
+
QA feature'ları ekle
|
| 198 |
+
"""
|
| 199 |
+
result = {**example}
|
| 200 |
+
|
| 201 |
+
# Question type
|
| 202 |
+
question_lower = example['question'].lower()
|
| 203 |
+
if question_lower.startswith('what'):
|
| 204 |
+
q_type = 'what'
|
| 205 |
+
elif question_lower.startswith('who'):
|
| 206 |
+
q_type = 'who'
|
| 207 |
+
elif question_lower.startswith('when'):
|
| 208 |
+
q_type = 'when'
|
| 209 |
+
elif question_lower.startswith('where'):
|
| 210 |
+
q_type = 'where'
|
| 211 |
+
elif question_lower.startswith('how'):
|
| 212 |
+
q_type = 'how'
|
| 213 |
+
elif question_lower.startswith('why'):
|
| 214 |
+
q_type = 'why'
|
| 215 |
+
else:
|
| 216 |
+
q_type = 'other'
|
| 217 |
+
|
| 218 |
+
result['question_type'] = q_type
|
| 219 |
+
result['context_length'] = len(example['context'].split())
|
| 220 |
+
result['question_length'] = len(example['question'].split())
|
| 221 |
+
|
| 222 |
+
if not example['is_impossible']:
|
| 223 |
+
answer = example['answers']['text'][0]
|
| 224 |
+
result['answer_length'] = len(answer.split())
|
| 225 |
+
else:
|
| 226 |
+
result['answer_length'] = 0
|
| 227 |
+
|
| 228 |
+
return result
|
| 229 |
+
|
| 230 |
+
# Preprocessing uygula
|
| 231 |
+
print("\n Applying QA preprocessing:")
|
| 232 |
+
qa_processed = qa_dataset.map(
|
| 233 |
+
QAPreprocessor.add_qa_features,
|
| 234 |
+
desc="Adding QA features"
|
| 235 |
+
)
|
| 236 |
+
|
| 237 |
+
print(f"✅ Processed: {len(qa_processed)} examples")
|
| 238 |
+
print(f" New columns: {[c for c in qa_processed.column_names if c not in qa_dataset.column_names]}")
|
| 239 |
+
|
| 240 |
+
# Question type distribution
|
| 241 |
+
q_types = [ex['question_type'] for ex in qa_processed]
|
| 242 |
+
type_dist = Counter(q_types)
|
| 243 |
+
print(f"\n Question type distribution:")
|
| 244 |
+
for qtype, count in type_dist.most_common():
|
| 245 |
+
print(f" {qtype}: {count}")
|
| 246 |
+
|
| 247 |
+
|
| 248 |
+
print("\n" + "="*70)
|
| 249 |
+
print("2. SUMMARIZATION DATASETS")
|
| 250 |
+
print("="*70)
|
| 251 |
+
|
| 252 |
+
print("\n📝 Summarization Dataset Yapısı:")
|
| 253 |
+
|
| 254 |
+
class SummarizationDatasetCreator:
|
| 255 |
+
"""
|
| 256 |
+
Summarization dataset oluşturucu
|
| 257 |
+
"""
|
| 258 |
+
@staticmethod
|
| 259 |
+
def create_news_summarization(num_samples=100):
|
| 260 |
+
"""
|
| 261 |
+
News summarization (CNN/DailyMail style)
|
| 262 |
+
"""
|
| 263 |
+
article_templates = [
|
| 264 |
+
{
|
| 265 |
+
'article': "Scientists have made a breakthrough discovery in renewable energy. "
|
| 266 |
+
"Researchers at MIT developed a new solar panel technology that increases "
|
| 267 |
+
"efficiency by 40%. The innovation uses advanced nanomaterials. "
|
| 268 |
+
"This could revolutionize the solar energy industry. "
|
| 269 |
+
"The team published their findings in Nature Energy journal. "
|
| 270 |
+
"Commercial applications are expected within 5 years.",
|
| 271 |
+
'summary': "MIT researchers developed solar panels with 40% higher efficiency using nanomaterials."
|
| 272 |
+
},
|
| 273 |
+
{
|
| 274 |
+
'article': "The global tech conference concluded yesterday with major announcements. "
|
| 275 |
+
"Leading companies unveiled new AI technologies and products. "
|
| 276 |
+
"Attendance reached record numbers with over 50,000 participants. "
|
| 277 |
+
"Industry experts discussed future trends in artificial intelligence. "
|
| 278 |
+
"The conference featured 200 speakers from 30 countries.",
|
| 279 |
+
'summary': "Global tech conference featured AI announcements with record 50,000 attendees."
|
| 280 |
+
},
|
| 281 |
+
{
|
| 282 |
+
'article': "Climate change continues to impact global weather patterns. "
|
| 283 |
+
"Recent studies show increasing temperatures worldwide. "
|
| 284 |
+
"Scientists warn of more frequent extreme weather events. "
|
| 285 |
+
"International cooperation is needed to address the crisis. "
|
| 286 |
+
"Many countries are implementing new environmental policies.",
|
| 287 |
+
'summary': "Studies reveal climate change effects and call for international action."
|
| 288 |
+
}
|
| 289 |
+
]
|
| 290 |
+
|
| 291 |
+
def gen():
|
| 292 |
+
for i in range(num_samples):
|
| 293 |
+
template = article_templates[i % len(article_templates)]
|
| 294 |
+
|
| 295 |
+
yield {
|
| 296 |
+
'id': f'summ_{i}',
|
| 297 |
+
'article': template['article'],
|
| 298 |
+
'summary': template['summary'],
|
| 299 |
+
'article_length': len(template['article'].split()),
|
| 300 |
+
'summary_length': len(template['summary'].split()),
|
| 301 |
+
'compression_ratio': len(template['summary']) / len(template['article'])
|
| 302 |
+
}
|
| 303 |
+
|
| 304 |
+
return Dataset.from_generator(gen)
|
| 305 |
+
|
| 306 |
+
@staticmethod
|
| 307 |
+
def create_abstractive_summarization(num_samples=100):
|
| 308 |
+
"""
|
| 309 |
+
Abstractive summarization - yeni kelimeler içeren özetler
|
| 310 |
+
"""
|
| 311 |
+
def gen():
|
| 312 |
+
for i in range(num_samples):
|
| 313 |
+
article_length = np.random.randint(100, 500)
|
| 314 |
+
summary_length = np.random.randint(20, 50)
|
| 315 |
+
|
| 316 |
+
yield {
|
| 317 |
+
'id': f'abs_summ_{i}',
|
| 318 |
+
'article': f"Long article about topic {i}. " * (article_length // 5),
|
| 319 |
+
'summary': f"Brief summary of article {i}. " * (summary_length // 5),
|
| 320 |
+
'summary_type': 'abstractive',
|
| 321 |
+
'article_length': article_length,
|
| 322 |
+
'summary_length': summary_length
|
| 323 |
+
}
|
| 324 |
+
|
| 325 |
+
return Dataset.from_generator(gen)
|
| 326 |
+
|
| 327 |
+
print("\n1️⃣ News Summarization Dataset:")
|
| 328 |
+
summ_dataset = SummarizationDatasetCreator.create_news_summarization(100)
|
| 329 |
+
|
| 330 |
+
print(f"✅ Dataset: {len(summ_dataset)} article-summary pairs")
|
| 331 |
+
print(f"\nÖrnek:")
|
| 332 |
+
sample = summ_dataset[0]
|
| 333 |
+
print(f" Article ({sample['article_length']} words):")
|
| 334 |
+
print(f" {sample['article'][:150]}...")
|
| 335 |
+
print(f" Summary ({sample['summary_length']} words):")
|
| 336 |
+
print(f" {sample['summary']}")
|
| 337 |
+
print(f" Compression ratio: {sample['compression_ratio']:.2%}")
|
| 338 |
+
|
| 339 |
+
# Summarization statistics
|
| 340 |
+
print(f"\n📊 Summarization Statistics:")
|
| 341 |
+
avg_article_len = np.mean([ex['article_length'] for ex in summ_dataset])
|
| 342 |
+
avg_summary_len = np.mean([ex['summary_length'] for ex in summ_dataset])
|
| 343 |
+
avg_compression = np.mean([ex['compression_ratio'] for ex in summ_dataset])
|
| 344 |
+
|
| 345 |
+
print(f" Avg article length: {avg_article_len:.1f} words")
|
| 346 |
+
print(f" Avg summary length: {avg_summary_len:.1f} words")
|
| 347 |
+
print(f" Avg compression ratio: {avg_compression:.2%}")
|
| 348 |
+
|
| 349 |
+
|
| 350 |
+
print("\n2️⃣ Summarization Quality Metrics:")
|
| 351 |
+
|
| 352 |
+
class SummarizationMetrics:
|
| 353 |
+
"""
|
| 354 |
+
Summarization için quality metrics
|
| 355 |
+
"""
|
| 356 |
+
@staticmethod
|
| 357 |
+
def calculate_rouge_proxy(article, summary):
|
| 358 |
+
"""
|
| 359 |
+
Basitleştirilmiş ROUGE-like metric
|
| 360 |
+
Gerçek ROUGE için rouge-score library kullanılır
|
| 361 |
+
"""
|
| 362 |
+
article_words = set(article.lower().split())
|
| 363 |
+
summary_words = set(summary.lower().split())
|
| 364 |
+
|
| 365 |
+
# Overlap
|
| 366 |
+
overlap = len(article_words & summary_words)
|
| 367 |
+
|
| 368 |
+
# Precision, Recall, F1
|
| 369 |
+
precision = overlap / len(summary_words) if summary_words else 0
|
| 370 |
+
recall = overlap / len(article_words) if article_words else 0
|
| 371 |
+
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
|
| 372 |
+
|
| 373 |
+
return {
|
| 374 |
+
'precision': precision,
|
| 375 |
+
'recall': recall,
|
| 376 |
+
'f1': f1
|
| 377 |
+
}
|
| 378 |
+
|
| 379 |
+
@staticmethod
|
| 380 |
+
def add_quality_metrics(example):
|
| 381 |
+
"""
|
| 382 |
+
Quality metrics ekle
|
| 383 |
+
"""
|
| 384 |
+
metrics = SummarizationMetrics.calculate_rouge_proxy(
|
| 385 |
+
example['article'],
|
| 386 |
+
example['summary']
|
| 387 |
+
)
|
| 388 |
+
|
| 389 |
+
return {
|
| 390 |
+
**example,
|
| 391 |
+
'rouge_precision': metrics['precision'],
|
| 392 |
+
'rouge_recall': metrics['recall'],
|
| 393 |
+
'rouge_f1': metrics['f1']
|
| 394 |
+
}
|
| 395 |
+
|
| 396 |
+
# Metrics ekle
|
| 397 |
+
print("\n Adding quality metrics:")
|
| 398 |
+
summ_with_metrics = summ_dataset.map(
|
| 399 |
+
SummarizationMetrics.add_quality_metrics,
|
| 400 |
+
desc="Calculating metrics"
|
| 401 |
+
)
|
| 402 |
+
|
| 403 |
+
print(f"✅ Metrics added")
|
| 404 |
+
print(f"\nSample metrics:")
|
| 405 |
+
sample = summ_with_metrics[0]
|
| 406 |
+
print(f" ROUGE Precision: {sample['rouge_precision']:.3f}")
|
| 407 |
+
print(f" ROUGE Recall: {sample['rouge_recall']:.3f}")
|
| 408 |
+
print(f" ROUGE F1: {sample['rouge_f1']:.3f}")
|
| 409 |
+
|
| 410 |
+
|
| 411 |
+
print("\n" + "="*70)
|
| 412 |
+
print("3. NAMED ENTITY RECOGNITION (NER)")
|
| 413 |
+
print("="*70)
|
| 414 |
+
|
| 415 |
+
print("\n🏷️ NER Dataset Yapısı:")
|
| 416 |
+
|
| 417 |
+
class NERDatasetCreator:
|
| 418 |
+
"""
|
| 419 |
+
Named Entity Recognition dataset oluşturucu
|
| 420 |
+
"""
|
| 421 |
+
@staticmethod
|
| 422 |
+
def create_ner_dataset(num_samples=100):
|
| 423 |
+
"""
|
| 424 |
+
NER dataset (CoNLL format)
|
| 425 |
+
"""
|
| 426 |
+
templates = [
|
| 427 |
+
{
|
| 428 |
+
'tokens': ['John', 'Smith', 'works', 'at', 'Google', 'in', 'New', 'York'],
|
| 429 |
+
'ner_tags': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'O', 'B-LOC', 'I-LOC']
|
| 430 |
+
},
|
| 431 |
+
{
|
| 432 |
+
'tokens': ['Apple', 'announced', 'new', 'products', 'in', 'California'],
|
| 433 |
+
'ner_tags': ['B-ORG', 'O', 'O', 'O', 'O', 'B-LOC']
|
| 434 |
+
},
|
| 435 |
+
{
|
| 436 |
+
'tokens': ['Dr.', 'Jane', 'Brown', 'visited', 'Paris', 'last', 'Monday'],
|
| 437 |
+
'ner_tags': ['O', 'B-PER', 'I-PER', 'O', 'B-LOC', 'O', 'B-DATE']
|
| 438 |
+
}
|
| 439 |
+
]
|
| 440 |
+
|
| 441 |
+
# Tag to ID mapping
|
| 442 |
+
tag2id = {
|
| 443 |
+
'O': 0,
|
| 444 |
+
'B-PER': 1, 'I-PER': 2,
|
| 445 |
+
'B-ORG': 3, 'I-ORG': 4,
|
| 446 |
+
'B-LOC': 5, 'I-LOC': 6,
|
| 447 |
+
'B-DATE': 7, 'I-DATE': 8
|
| 448 |
+
}
|
| 449 |
+
|
| 450 |
+
def gen():
|
| 451 |
+
for i in range(num_samples):
|
| 452 |
+
template = templates[i % len(templates)]
|
| 453 |
+
|
| 454 |
+
yield {
|
| 455 |
+
'id': f'ner_{i}',
|
| 456 |
+
'tokens': template['tokens'],
|
| 457 |
+
'ner_tags': template['ner_tags'],
|
| 458 |
+
'ner_tag_ids': [tag2id[tag] for tag in template['ner_tags']],
|
| 459 |
+
'sentence': ' '.join(template['tokens'])
|
| 460 |
+
}
|
| 461 |
+
|
| 462 |
+
return Dataset.from_generator(gen), tag2id
|
| 463 |
+
|
| 464 |
+
print("\n1️⃣ NER Dataset:")
|
| 465 |
+
ner_dataset, tag2id = NERDatasetCreator.create_ner_dataset(100)
|
| 466 |
+
|
| 467 |
+
print(f"✅ Dataset: {len(ner_dataset)} sentences")
|
| 468 |
+
print(f" Tag vocabulary: {len(tag2id)} tags")
|
| 469 |
+
print(f" Tags: {list(tag2id.keys())}")
|
| 470 |
+
|
| 471 |
+
print(f"\nÖrnek:")
|
| 472 |
+
sample = ner_dataset[0]
|
| 473 |
+
print(f" Sentence: {sample['sentence']}")
|
| 474 |
+
print(f" Tokens: {sample['tokens']}")
|
| 475 |
+
print(f" NER tags: {sample['ner_tags']}")
|
| 476 |
+
print(f"\n Token-Tag pairs:")
|
| 477 |
+
for token, tag in zip(sample['tokens'], sample['ner_tags']):
|
| 478 |
+
if tag != 'O':
|
| 479 |
+
print(f" {token}: {tag}")
|
| 480 |
+
|
| 481 |
+
|
| 482 |
+
print("\n2️⃣ NER Statistics:")
|
| 483 |
+
|
| 484 |
+
class NERAnalyzer:
|
| 485 |
+
"""
|
| 486 |
+
NER dataset analizi
|
| 487 |
+
"""
|
| 488 |
+
@staticmethod
|
| 489 |
+
def analyze_entities(dataset):
|
| 490 |
+
"""
|
| 491 |
+
Entity statistics
|
| 492 |
+
"""
|
| 493 |
+
all_tags = []
|
| 494 |
+
entity_counts = defaultdict(int)
|
| 495 |
+
|
| 496 |
+
for ex in dataset:
|
| 497 |
+
tags = ex['ner_tags']
|
| 498 |
+
all_tags.extend(tags)
|
| 499 |
+
|
| 500 |
+
# Entity'leri say
|
| 501 |
+
for tag in tags:
|
| 502 |
+
if tag.startswith('B-'):
|
| 503 |
+
entity_type = tag.split('-')[1]
|
| 504 |
+
entity_counts[entity_type] += 1
|
| 505 |
+
|
| 506 |
+
tag_dist = Counter(all_tags)
|
| 507 |
+
|
| 508 |
+
return {
|
| 509 |
+
'tag_distribution': dict(tag_dist),
|
| 510 |
+
'entity_counts': dict(entity_counts),
|
| 511 |
+
'total_tokens': len(all_tags),
|
| 512 |
+
'entity_tokens': len([t for t in all_tags if t != 'O'])
|
| 513 |
+
}
|
| 514 |
+
|
| 515 |
+
analyzer = NERAnalyzer()
|
| 516 |
+
ner_stats = analyzer.analyze_entities(ner_dataset)
|
| 517 |
+
|
| 518 |
+
print(f"\n Total tokens: {ner_stats['total_tokens']}")
|
| 519 |
+
print(f" Entity tokens: {ner_stats['entity_tokens']} "
|
| 520 |
+
f"({ner_stats['entity_tokens']/ner_stats['total_tokens']*100:.1f}%)")
|
| 521 |
+
|
| 522 |
+
print(f"\n Entity type distribution:")
|
| 523 |
+
for entity_type, count in sorted(ner_stats['entity_counts'].items()):
|
| 524 |
+
print(f" {entity_type}: {count} entities")
|
| 525 |
+
|
| 526 |
+
print(f"\n Tag distribution:")
|
| 527 |
+
for tag, count in sorted(ner_stats['tag_distribution'].items(), key=lambda x: -x[1])[:5]:
|
| 528 |
+
print(f" {tag}: {count}")
|
| 529 |
+
|
| 530 |
+
|
| 531 |
+
print("\n3️⃣ NER Data Augmentation:")
|
| 532 |
+
|
| 533 |
+
class NERAugmenter:
|
| 534 |
+
"""
|
| 535 |
+
NER için data augmentation
|
| 536 |
+
"""
|
| 537 |
+
@staticmethod
|
| 538 |
+
def swap_entities(example, entity_bank):
|
| 539 |
+
"""
|
| 540 |
+
Entity'leri farklı entity'lerle değiştir
|
| 541 |
+
"""
|
| 542 |
+
tokens = example['tokens'].copy()
|
| 543 |
+
ner_tags = example['ner_tags'].copy()
|
| 544 |
+
|
| 545 |
+
# B-tags'i bul
|
| 546 |
+
for i, tag in enumerate(ner_tags):
|
| 547 |
+
if tag.startswith('B-'):
|
| 548 |
+
entity_type = tag.split('-')[1]
|
| 549 |
+
if entity_type in entity_bank and entity_bank[entity_type]:
|
| 550 |
+
# Random entity seç
|
| 551 |
+
new_entity = random.choice(entity_bank[entity_type])
|
| 552 |
+
tokens[i] = new_entity
|
| 553 |
+
|
| 554 |
+
return {
|
| 555 |
+
**example,
|
| 556 |
+
'tokens': tokens,
|
| 557 |
+
'sentence': ' '.join(tokens),
|
| 558 |
+
'is_augmented': True
|
| 559 |
+
}
|
| 560 |
+
|
| 561 |
+
# Entity bank oluştur
|
| 562 |
+
entity_bank = {
|
| 563 |
+
'PER': ['Alice', 'Bob', 'Charlie', 'Diana'],
|
| 564 |
+
'ORG': ['Microsoft', 'Amazon', 'Tesla', 'IBM'],
|
| 565 |
+
'LOC': ['London', 'Tokyo', 'Berlin', 'Sydney']
|
| 566 |
+
}
|
| 567 |
+
|
| 568 |
+
augmenter = NERAugmenter()
|
| 569 |
+
print("\n Entity swapping örneği:")
|
| 570 |
+
original = ner_dataset[0]
|
| 571 |
+
augmented = augmenter.swap_entities(original, entity_bank)
|
| 572 |
+
|
| 573 |
+
print(f" Original: {original['sentence']}")
|
| 574 |
+
print(f" Augmented: {augmented['sentence']}")
|
| 575 |
+
|
| 576 |
+
|
| 577 |
+
print("\n" + "="*70)
|
| 578 |
+
print("4. SENTIMENT ANALYSIS")
|
| 579 |
+
print("="*70)
|
| 580 |
+
|
| 581 |
+
print("\n😊 Sentiment Analysis Dataset Yapısı:")
|
| 582 |
+
|
| 583 |
+
class SentimentDatasetCreator:
|
| 584 |
+
"""
|
| 585 |
+
Sentiment analysis dataset oluşturucu
|
| 586 |
+
"""
|
| 587 |
+
@staticmethod
|
| 588 |
+
def create_sentiment_dataset(num_samples=200):
|
| 589 |
+
"""
|
| 590 |
+
Binary/Multi-class sentiment classification
|
| 591 |
+
"""
|
| 592 |
+
positive_texts = [
|
| 593 |
+
"This product is amazing! Highly recommended.",
|
| 594 |
+
"Excellent service and great quality.",
|
| 595 |
+
"I love this! Best purchase ever.",
|
| 596 |
+
"Fantastic experience, will buy again.",
|
| 597 |
+
"Outstanding quality and fast delivery."
|
| 598 |
+
]
|
| 599 |
+
|
| 600 |
+
negative_texts = [
|
| 601 |
+
"Terrible product, waste of money.",
|
| 602 |
+
"Very disappointed with the quality.",
|
| 603 |
+
"Poor customer service, never again.",
|
| 604 |
+
"Worst purchase I've ever made.",
|
| 605 |
+
"Completely unsatisfied with this."
|
| 606 |
+
]
|
| 607 |
+
|
| 608 |
+
neutral_texts = [
|
| 609 |
+
"It's okay, nothing special.",
|
| 610 |
+
"Average product, meets basic needs.",
|
| 611 |
+
"Neither good nor bad, just acceptable.",
|
| 612 |
+
"Standard quality for the price.",
|
| 613 |
+
"It works as described."
|
| 614 |
+
]
|
| 615 |
+
|
| 616 |
+
def gen():
|
| 617 |
+
for i in range(num_samples):
|
| 618 |
+
sentiment_choice = i % 3
|
| 619 |
+
|
| 620 |
+
if sentiment_choice == 0:
|
| 621 |
+
text = positive_texts[i % len(positive_texts)]
|
| 622 |
+
label = 2 # Positive
|
| 623 |
+
label_text = 'positive'
|
| 624 |
+
elif sentiment_choice == 1:
|
| 625 |
+
text = negative_texts[i % len(negative_texts)]
|
| 626 |
+
label = 0 # Negative
|
| 627 |
+
label_text = 'negative'
|
| 628 |
+
else:
|
| 629 |
+
text = neutral_texts[i % len(neutral_texts)]
|
| 630 |
+
label = 1 # Neutral
|
| 631 |
+
label_text = 'neutral'
|
| 632 |
+
|
| 633 |
+
# Simulated confidence score
|
| 634 |
+
confidence = np.random.uniform(0.7, 1.0)
|
| 635 |
+
|
| 636 |
+
yield {
|
| 637 |
+
'id': f'sent_{i}',
|
| 638 |
+
'text': text,
|
| 639 |
+
'label': label,
|
| 640 |
+
'label_text': label_text,
|
| 641 |
+
'confidence': confidence,
|
| 642 |
+
'text_length': len(text.split())
|
| 643 |
+
}
|
| 644 |
+
|
| 645 |
+
return Dataset.from_generator(gen)
|
| 646 |
+
|
| 647 |
+
@staticmethod
|
| 648 |
+
def create_aspect_based_sentiment(num_samples=100):
|
| 649 |
+
"""
|
| 650 |
+
Aspect-based sentiment analysis
|
| 651 |
+
Farklı aspect'ler için farklı sentiment'ler
|
| 652 |
+
"""
|
| 653 |
+
def gen():
|
| 654 |
+
aspects = ['quality', 'price', 'service', 'delivery']
|
| 655 |
+
|
| 656 |
+
for i in range(num_samples):
|
| 657 |
+
aspect_sentiments = {
|
| 658 |
+
aspect: {
|
| 659 |
+
'sentiment': random.choice(['positive', 'negative', 'neutral']),
|
| 660 |
+
'score': np.random.uniform(0, 1)
|
| 661 |
+
}
|
| 662 |
+
for aspect in aspects
|
| 663 |
+
}
|
| 664 |
+
|
| 665 |
+
yield {
|
| 666 |
+
'id': f'aspect_sent_{i}',
|
| 667 |
+
'text': f"Review text {i} discussing various aspects.",
|
| 668 |
+
'aspect_sentiments': aspect_sentiments
|
| 669 |
+
}
|
| 670 |
+
|
| 671 |
+
return Dataset.from_generator(gen)
|
| 672 |
+
|
| 673 |
+
print("\n1️⃣ Sentiment Classification Dataset:")
|
| 674 |
+
sentiment_dataset = SentimentDatasetCreator.create_sentiment_dataset(300)
|
| 675 |
+
|
| 676 |
+
print(f"✅ Dataset: {len(sentiment_dataset)} reviews")
|
| 677 |
+
|
| 678 |
+
# Label distribution
|
| 679 |
+
labels = [ex['label_text'] for ex in sentiment_dataset]
|
| 680 |
+
label_dist = Counter(labels)
|
| 681 |
+
print(f"\n📊 Label distribution:")
|
| 682 |
+
for label, count in label_dist.items():
|
| 683 |
+
pct = count / len(sentiment_dataset) * 100
|
| 684 |
+
print(f" {label}: {count} ({pct:.1f}%)")
|
| 685 |
+
|
| 686 |
+
# Örnekler
|
| 687 |
+
print(f"\nÖrnekler:")
|
| 688 |
+
for label in ['positive', 'negative', 'neutral']:
|
| 689 |
+
example = [ex for ex in sentiment_dataset if ex['label_text'] == label][0]
|
| 690 |
+
print(f"\n {label.capitalize()}:")
|
| 691 |
+
print(f" Text: {example['text']}")
|
| 692 |
+
print(f" Confidence: {example['confidence']:.2f}")
|
| 693 |
+
|
| 694 |
+
|
| 695 |
+
print("\n2️⃣ Aspect-Based Sentiment:")
|
| 696 |
+
aspect_dataset = SentimentDatasetCreator.create_aspect_based_sentiment(50)
|
| 697 |
+
|
| 698 |
+
print(f"✅ Dataset: {len(aspect_dataset)} reviews")
|
| 699 |
+
print(f"\nÖrnek aspect-based analysis:")
|
| 700 |
+
sample = aspect_dataset[0]
|
| 701 |
+
print(f" Text: {sample['text']}")
|
| 702 |
+
print(f" Aspect sentiments:")
|
| 703 |
+
for aspect, sentiment_info in sample['aspect_sentiments'].items():
|
| 704 |
+
print(f" {aspect}: {sentiment_info['sentiment']} (score: {sentiment_info['score']:.2f})")
|
| 705 |
+
|
| 706 |
+
|
| 707 |
+
print("\n3️⃣ Sentiment Feature Engineering:")
|
| 708 |
+
|
| 709 |
+
class SentimentFeatureEngineer:
|
| 710 |
+
"""
|
| 711 |
+
Sentiment için feature engineering
|
| 712 |
+
"""
|
| 713 |
+
@staticmethod
|
| 714 |
+
def extract_sentiment_features(example):
|
| 715 |
+
"""
|
| 716 |
+
Sentiment-specific features
|
| 717 |
+
"""
|
| 718 |
+
text = example['text'].lower()
|
| 719 |
+
|
| 720 |
+
# Sentiment keywords (simplified)
|
| 721 |
+
positive_words = ['great', 'excellent', 'amazing', 'love', 'best', 'fantastic']
|
| 722 |
+
negative_words = ['terrible', 'worst', 'poor', 'bad', 'disappointed', 'waste']
|
| 723 |
+
|
| 724 |
+
pos_count = sum([1 for word in positive_words if word in text])
|
| 725 |
+
neg_count = sum([1 for word in negative_words if word in text])
|
| 726 |
+
|
| 727 |
+
# Punctuation features
|
| 728 |
+
exclamation_count = text.count('!')
|
| 729 |
+
question_count = text.count('?')
|
| 730 |
+
|
| 731 |
+
# Capitalization
|
| 732 |
+
upper_count = sum([1 for c in example['text'] if c.isupper()])
|
| 733 |
+
|
| 734 |
+
return {
|
| 735 |
+
**example,
|
| 736 |
+
'positive_word_count': pos_count,
|
| 737 |
+
'negative_word_count': neg_count,
|
| 738 |
+
'exclamation_count': exclamation_count,
|
| 739 |
+
'question_count': question_count,
|
| 740 |
+
'upper_case_count': upper_count,
|
| 741 |
+
'sentiment_score': pos_count - neg_count # Simple score
|
| 742 |
+
}
|
| 743 |
+
|
| 744 |
+
feature_engineer = SentimentFeatureEngineer()
|
| 745 |
+
sentiment_featured = sentiment_dataset.map(
|
| 746 |
+
feature_engineer.extract_sentiment_features,
|
| 747 |
+
desc="Extracting sentiment features"
|
| 748 |
+
)
|
| 749 |
+
|
| 750 |
+
print(f"\n Feature extraction completed")
|
| 751 |
+
print(f" New features: positive_word_count, negative_word_count, sentiment_score, etc.")
|
| 752 |
+
|
| 753 |
+
print(f"\n Feature correlation with labels:")
|
| 754 |
+
for label_text in ['positive', 'negative', 'neutral']:
|
| 755 |
+
subset = [ex for ex in sentiment_featured if ex['label_text'] == label_text]
|
| 756 |
+
avg_score = np.mean([ex['sentiment_score'] for ex in subset])
|
| 757 |
+
avg_pos = np.mean([ex['positive_word_count'] for ex in subset])
|
| 758 |
+
avg_neg = np.mean([ex['negative_word_count'] for ex in subset])
|
| 759 |
+
|
| 760 |
+
print(f"\n {label_text.capitalize()}:")
|
| 761 |
+
print(f" Avg sentiment score: {avg_score:.2f}")
|
| 762 |
+
print(f" Avg positive words: {avg_pos:.2f}")
|
| 763 |
+
print(f" Avg negative words: {avg_neg:.2f}")
|
| 764 |
+
|
| 765 |
+
|
| 766 |
+
print("\n" + "="*70)
|
| 767 |
+
print("5. TEXT CLASSIFICATION")
|
| 768 |
+
print("="*70)
|
| 769 |
+
|
| 770 |
+
print("\n📊 General Text Classification:")
|
| 771 |
+
|
| 772 |
+
class TextClassificationDataset:
|
| 773 |
+
"""
|
| 774 |
+
Multi-class text classification
|
| 775 |
+
"""
|
| 776 |
+
@staticmethod
|
| 777 |
+
def create_topic_classification(num_samples=200):
|
| 778 |
+
"""
|
| 779 |
+
Topic/Category classification
|
| 780 |
+
"""
|
| 781 |
+
topics = {
|
| 782 |
+
'sports': [
|
| 783 |
+
"The team won the championship with a final score of 3-1.",
|
| 784 |
+
"Athletes trained hard for the upcoming Olympic games.",
|
| 785 |
+
"The basketball match was exciting until the last minute."
|
| 786 |
+
],
|
| 787 |
+
'technology': [
|
| 788 |
+
"The new smartphone features advanced AI capabilities.",
|
| 789 |
+
"Software update improves system performance significantly.",
|
| 790 |
+
"Researchers developed a breakthrough algorithm for data processing."
|
| 791 |
+
],
|
| 792 |
+
'politics': [
|
| 793 |
+
"The parliament voted on the new legislation today.",
|
| 794 |
+
"Government announces policy changes affecting citizens.",
|
| 795 |
+
"Election results show close competition between candidates."
|
| 796 |
+
],
|
| 797 |
+
'entertainment': [
|
| 798 |
+
"The movie premiere attracted thousands of fans.",
|
| 799 |
+
"New album breaks streaming records in first week.",
|
| 800 |
+
"Award ceremony celebrates best performances of the year."
|
| 801 |
+
]
|
| 802 |
+
}
|
| 803 |
+
|
| 804 |
+
topic_to_id = {topic: i for i, topic in enumerate(topics.keys())}
|
| 805 |
+
|
| 806 |
+
def gen():
|
| 807 |
+
for i in range(num_samples):
|
| 808 |
+
topic = list(topics.keys())[i % len(topics)]
|
| 809 |
+
text = topics[topic][i % len(topics[topic])]
|
| 810 |
+
|
| 811 |
+
yield {
|
| 812 |
+
'id': f'topic_{i}',
|
| 813 |
+
'text': text,
|
| 814 |
+
'label': topic_to_id[topic],
|
| 815 |
+
'label_text': topic
|
| 816 |
+
}
|
| 817 |
+
|
| 818 |
+
return Dataset.from_generator(gen), topic_to_id
|
| 819 |
+
|
| 820 |
+
print("\n1️⃣ Topic Classification Dataset:")
|
| 821 |
+
topic_dataset, topic_to_id = TextClassificationDataset.create_topic_classification(200)
|
| 822 |
+
|
| 823 |
+
print(f"✅ Dataset: {len(topic_dataset)} documents")
|
| 824 |
+
print(f" Topics: {list(topic_to_id.keys())}")
|
| 825 |
+
|
| 826 |
+
# Topic distribution
|
| 827 |
+
topics = [ex['label_text'] for ex in topic_dataset]
|
| 828 |
+
topic_dist = Counter(topics)
|
| 829 |
+
print(f"\n📊 Topic distribution:")
|
| 830 |
+
for topic, count in topic_dist.items():
|
| 831 |
+
print(f" {topic}: {count}")
|
| 832 |
+
|
| 833 |
+
# Örnekler
|
| 834 |
+
print(f"\nÖrnekler:")
|
| 835 |
+
for topic in list(topic_to_id.keys())[:3]:
|
| 836 |
+
example = [ex for ex in topic_dataset if ex['label_text'] == topic][0]
|
| 837 |
+
print(f"\n {topic.capitalize()}:")
|
| 838 |
+
print(f" {example['text']}")
|
| 839 |
+
|
| 840 |
+
|
| 841 |
+
print("\n" + "="*70)
|
| 842 |
+
print("6. MULTI-TASK LEARNING DATASETS")
|
| 843 |
+
print("="*70)
|
| 844 |
+
|
| 845 |
+
print("\n🎯 Multi-Task Dataset Yapısı:")
|
| 846 |
+
|
| 847 |
+
class MultiTaskDatasetCreator:
|
| 848 |
+
"""
|
| 849 |
+
Birden fazla task için unified dataset
|
| 850 |
+
"""
|
| 851 |
+
@staticmethod
|
| 852 |
+
def create_multitask_dataset(num_samples=100):
|
| 853 |
+
"""
|
| 854 |
+
Aynı text için multiple task annotations
|
| 855 |
+
"""
|
| 856 |
+
def gen():
|
| 857 |
+
for i in range(num_samples):
|
| 858 |
+
text = f"Sample text {i} with multiple annotations for various tasks."
|
| 859 |
+
|
| 860 |
+
yield {
|
| 861 |
+
'id': f'multi_{i}',
|
| 862 |
+
'text': text,
|
| 863 |
+
|
| 864 |
+
# Task 1: Sentiment
|
| 865 |
+
'sentiment': random.choice(['positive', 'negative', 'neutral']),
|
| 866 |
+
'sentiment_score': np.random.random(),
|
| 867 |
+
|
| 868 |
+
# Task 2: Topic
|
| 869 |
+
'topic': random.choice(['sports', 'tech', 'politics']),
|
| 870 |
+
'topic_confidence': np.random.random(),
|
| 871 |
+
|
| 872 |
+
# Task 3: Language quality
|
| 873 |
+
'grammar_score': np.random.uniform(0.5, 1.0),
|
| 874 |
+
'readability_score': np.random.uniform(0.5, 1.0),
|
| 875 |
+
|
| 876 |
+
# Metadata
|
| 877 |
+
'text_length': len(text.split())
|
| 878 |
+
}
|
| 879 |
+
|
| 880 |
+
return Dataset.from_generator(gen)
|
| 881 |
+
|
| 882 |
+
print("\n1️⃣ Multi-Task Dataset:")
|
| 883 |
+
multitask_dataset = MultiTaskDatasetCreator.create_multitask_dataset(100)
|
| 884 |
+
|
| 885 |
+
print(f"✅ Dataset: {len(multitask_dataset)} examples")
|
| 886 |
+
print(f" Tasks: sentiment, topic, grammar, readability")
|
| 887 |
+
|
| 888 |
+
print(f"\nÖrnek multi-task annotation:")
|
| 889 |
+
sample = multitask_dataset[0]
|
| 890 |
+
print(f" Text: {sample['text']}")
|
| 891 |
+
print(f"\n Task Annotations:")
|
| 892 |
+
print(f" Sentiment: {sample['sentiment']} (score: {sample['sentiment_score']:.2f})")
|
| 893 |
+
print(f" Topic: {sample['topic']} (confidence: {sample['topic_confidence']:.2f})")
|
| 894 |
+
print(f" Grammar score: {sample['grammar_score']:.2f}")
|
| 895 |
+
print(f" Readability: {sample['readability_score']:.2f}")
|
| 896 |
+
|
| 897 |
+
|
| 898 |
+
print("\n2️⃣ Task-Specific Data Loaders:")
|
| 899 |
+
|
| 900 |
+
class MultiTaskLoader:
|
| 901 |
+
"""
|
| 902 |
+
Multi-task dataset'i task-specific olarak yükle
|
| 903 |
+
"""
|
| 904 |
+
def __init__(self, dataset):
|
| 905 |
+
self.dataset = dataset
|
| 906 |
+
|
| 907 |
+
def get_task_dataset(self, task_name, task_columns):
|
| 908 |
+
"""
|
| 909 |
+
Belirli bir task için dataset al
|
| 910 |
+
"""
|
| 911 |
+
def extract_task_data(example):
|
| 912 |
+
result = {
|
| 913 |
+
'text': example['text'],
|
| 914 |
+
'id': example['id']
|
| 915 |
+
}
|
| 916 |
+
for col in task_columns:
|
| 917 |
+
result[col] = example[col]
|
| 918 |
+
return result
|
| 919 |
+
|
| 920 |
+
return self.dataset.map(
|
| 921 |
+
extract_task_data,
|
| 922 |
+
remove_columns=[c for c in self.dataset.column_names
|
| 923 |
+
if c not in ['text', 'id'] + task_columns],
|
| 924 |
+
desc=f"Loading {task_name} task"
|
| 925 |
+
)
|
| 926 |
+
|
| 927 |
+
loader = MultiTaskLoader(multitask_dataset)
|
| 928 |
+
|
| 929 |
+
# Task-specific datasets
|
| 930 |
+
print("\n Creating task-specific datasets:")
|
| 931 |
+
|
| 932 |
+
sentiment_task = loader.get_task_dataset(
|
| 933 |
+
'sentiment',
|
| 934 |
+
['sentiment', 'sentiment_score']
|
| 935 |
+
)
|
| 936 |
+
print(f" Sentiment task: {len(sentiment_task)} examples, columns: {sentiment_task.column_names}")
|
| 937 |
+
|
| 938 |
+
topic_task = loader.get_task_dataset(
|
| 939 |
+
'topic',
|
| 940 |
+
['topic', 'topic_confidence']
|
| 941 |
+
)
|
| 942 |
+
print(f" Topic task: {len(topic_task)} examples, columns: {topic_task.column_names}")
|
| 943 |
+
|
| 944 |
+
|
| 945 |
+
print("\n" + "="*70)
|
| 946 |
+
print("📚 BEST PRACTICES - TASK-SPECIFIC DATASETS")
|
| 947 |
+
print("="*70)
|
| 948 |
+
|
| 949 |
+
print("""
|
| 950 |
+
✅ QUESTION ANSWERING:
|
| 951 |
+
- SQuAD format: context, question, answer, answer_start
|
| 952 |
+
- Validate answer spans
|
| 953 |
+
- Handle impossible questions
|
| 954 |
+
- Question type classification
|
| 955 |
+
- Context length management
|
| 956 |
+
|
| 957 |
+
✅ SUMMARIZATION:
|
| 958 |
+
- Multiple reference summaries
|
| 959 |
+
- Compression ratio tracking
|
| 960 |
+
- ROUGE scores for validation
|
| 961 |
+
- Abstractive vs Extractive
|
| 962 |
+
- Length constraints
|
| 963 |
+
|
| 964 |
+
✅ NAMED ENTITY RECOGNITION:
|
| 965 |
+
- BIO/BIOES tagging scheme
|
| 966 |
+
- Entity type taxonomy
|
| 967 |
+
- Nested entities handling
|
| 968 |
+
- Cross-sentence entities
|
| 969 |
+
- Entity linking (optional)
|
| 970 |
+
|
| 971 |
+
✅ SENTIMENT ANALYSIS:
|
| 972 |
+
- Multi-level granularity (binary/3-class/5-class)
|
| 973 |
+
- Aspect-based sentiment
|
| 974 |
+
- Confidence scores
|
| 975 |
+
- Domain-specific lexicons
|
| 976 |
+
- Emotion detection
|
| 977 |
+
|
| 978 |
+
✅ TEXT CLASSIFICATION:
|
| 979 |
+
- Balanced classes
|
| 980 |
+
- Hierarchical categories
|
| 981 |
+
- Multi-label support
|
| 982 |
+
- Confidence calibration
|
| 983 |
+
- Class imbalance handling
|
| 984 |
+
|
| 985 |
+
✅ MULTI-TASK LEARNING:
|
| 986 |
+
- Consistent text preprocessing
|
| 987 |
+
- Task-specific heads
|
| 988 |
+
- Shared representations
|
| 989 |
+
- Task weighting strategies
|
| 990 |
+
- Auxiliary tasks
|
| 991 |
+
|
| 992 |
+
🎯 GENERAL PRINCIPLES:
|
| 993 |
+
- Clear annotation guidelines
|
| 994 |
+
- Inter-annotator agreement
|
| 995 |
+
- Quality control checks
|
| 996 |
+
- Regular dataset updates
|
| 997 |
+
- Version control
|
| 998 |
+
- Documentation
|
| 999 |
+
""")
|
| 1000 |
+
|
| 1001 |
+
|
| 1002 |
+
print("\n" + "="*70)
|
| 1003 |
+
print("✅ BÖLÜM 4 TAMAMLANDI!")
|
| 1004 |
+
print("="*70)
|
| 1005 |
+
|
| 1006 |
+
print(f"""
|
| 1007 |
+
Bu bölümde öğrendikleriniz:
|
| 1008 |
+
✓ Question Answering datasets (Extractive & Multiple Choice)
|
| 1009 |
+
✓ Summarization datasets (News & Abstractive)
|
| 1010 |
+
✓ Named Entity Recognition (BIO tagging)
|
| 1011 |
+
✓ Sentiment Analysis (Binary, Multi-class, Aspect-based)
|
| 1012 |
+
✓ Text Classification (Topic classification)
|
| 1013 |
+
✓ Multi-Task Learning datasets
|
| 1014 |
+
|
| 1015 |
+
📊 ÜRETİLEN DATASETS:
|
| 1016 |
+
- QA: 200 extractive + 100 multiple choice
|
| 1017 |
+
- Summarization: 100 news articles
|
| 1018 |
+
- NER: 100 annotated sentences
|
| 1019 |
+
- Sentiment: 300 reviews + 50 aspect-based
|
| 1020 |
+
- Topic: 200 documents
|
| 1021 |
+
- Multi-task: 100 multi-annotated examples
|
| 1022 |
+
|
| 1023 |
+
🎯 KEY LEARNINGS:
|
| 1024 |
+
- Her task farklı data format gerektirir
|
| 1025 |
+
- Quality metrics task-specific
|
| 1026 |
+
- Preprocessing task'a göre özelleştirilmeli
|
| 1027 |
+
- Multi-task learning verimli öğrenme sağlar
|
| 1028 |
+
- Annotation quality critical
|
| 1029 |
+
|
| 1030 |
+
📚 SERİ TAMAMLANDI!
|
| 1031 |
+
Tüm modüller başarıyla tamamlandı:
|
| 1032 |
+
✅ Bölüm 1: Büyük Ölçekli Datasets
|
| 1033 |
+
✅ Bölüm 2: Domain-Specific Datasets
|
| 1034 |
+
✅ Bölüm 3: İleri Teknikler
|
| 1035 |
+
✅ Bölüm 4: Özel Görevler İçin Datasets
|
| 1036 |
+
""")
|
| 1037 |
+
|
| 1038 |
+
print("\n🎉 Tebrikler! Tüm modülleri tamamladınız!")
|
| 1039 |
+
print("Şimdi bu bilgileri kendi projelerinizde kullanabilirsiniz! 🚀")
|