MEHMET TUĞRUL KAYA commited on
Commit
2e6a47d
·
1 Parent(s): e600950

Initial commit: Advanced Dataset Tutorial

Browse files
DEPLOYMENT.md ADDED
@@ -0,0 +1,191 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # 🚀 Hugging Face'e Yükleme Talimatları
2
+
3
+ Bu dosya, projeyi Hugging Face'e nasıl yükleyeceğinizi açıklar.
4
+
5
+ ## 📋 Ön Hazırlık
6
+
7
+ 1. **Hugging Face hesabı** oluşturun: https://huggingface.co/join
8
+ 2. **Access token** alın: https://huggingface.co/settings/tokens
9
+ 3. **Git LFS** kurun (büyük dosyalar için):
10
+ ```bash
11
+ git lfs install
12
+ ```
13
+
14
+ ## 🌐 Space Olarak Yükleme
15
+
16
+ ### 1. Yeni Space Oluştur
17
+
18
+ Hugging Face'te: https://huggingface.co/new-space
19
+
20
+ - **Space name**: `advanced-dataset-tutorial`
21
+ - **License**: MIT
22
+ - **SDK**: Gradio
23
+ - **Hardware**: CPU (basic)
24
+
25
+ ### 2. Repository'yi Clone Et
26
+
27
+ ```bash
28
+ git clone https://huggingface.co/spaces/YOUR-USERNAME/advanced-dataset-tutorial
29
+ cd advanced-dataset-tutorial
30
+ ```
31
+
32
+ ### 3. Dosyaları Kopyala
33
+
34
+ ```bash
35
+ # Proje dosyalarını kopyala
36
+ cp -r /path/to/advanced-dataset-tutorial/* .
37
+
38
+ # Yapı:
39
+ # .
40
+ # ├── README.md
41
+ # ├── requirements.txt
42
+ # ├── LICENSE
43
+ # ├── .gitignore
44
+ # ├── datasets/
45
+ # └── space/
46
+ # ├── app.py
47
+ # └── modules/
48
+ ```
49
+
50
+ ### 4. Push Et
51
+
52
+ ```bash
53
+ git add .
54
+ git commit -m "Initial commit: Advanced Dataset Tutorial"
55
+ git push
56
+ ```
57
+
58
+ ### 5. Space Otomatik Deploy Olacak! 🎉
59
+
60
+ Birkaç dakika içinde: `https://huggingface.co/spaces/YOUR-USERNAME/advanced-dataset-tutorial`
61
+
62
+ ## 📊 Dataset Olarak Yükleme (Opsiyonel)
63
+
64
+ ### 1. Dataset Repository Oluştur
65
+
66
+ ```bash
67
+ # Yeni dataset repository
68
+ huggingface-cli repo create advanced-dataset-tutorial --type dataset
69
+
70
+ # Clone
71
+ git clone https://huggingface.co/datasets/YOUR-USERNAME/advanced-dataset-tutorial
72
+ cd advanced-dataset-tutorial
73
+ ```
74
+
75
+ ### 2. Dataset Dosyalarını Hazırla
76
+
77
+ ```python
78
+ # create_datasets.py
79
+ from datasets import Dataset, DatasetDict
80
+
81
+ # Örnek dataset'leri oluştur
82
+ datasets = DatasetDict({
83
+ 'large_scale_examples': ...,
84
+ 'domain_specific_examples': ...,
85
+ 'advanced_techniques_examples': ...,
86
+ 'task_specific_examples': ...
87
+ })
88
+
89
+ # Kaydet
90
+ datasets.save_to_disk('dataset')
91
+ ```
92
+
93
+ ### 3. Push Dataset
94
+
95
+ ```bash
96
+ git add .
97
+ git commit -m "Add dataset examples"
98
+ git push
99
+ ```
100
+
101
+ ## 🔗 GitHub Integration (Opsiyonel)
102
+
103
+ ### 1. GitHub Repository Oluştur
104
+
105
+ ```bash
106
+ # GitHub'da repo oluştur, sonra:
107
+ git remote add github https://github.com/YOUR-USERNAME/advanced-dataset-tutorial
108
+ git push github main
109
+ ```
110
+
111
+ ### 2. Hugging Face ile Sync Et
112
+
113
+ Hugging Face Space settings'de:
114
+ - GitHub repository'yi bağla
115
+ - Auto-sync etkinleştir
116
+
117
+ ## 📝 Yükleme Sonrası Checklist
118
+
119
+ - [ ] Space çalışıyor mu? Test et
120
+ - [ ] README düzgün görünüyor mu?
121
+ - [ ] Gradio demo açılıyor mu?
122
+ - [ ] Tüm modüller yüklendi mi?
123
+ - [ ] License doğru mu?
124
+ - [ ] Tags eklendi mi?
125
+
126
+ ## 🎨 Customization
127
+
128
+ ### Space Settings
129
+
130
+ Settings'den düzenle:
131
+ - **Title**: Advanced Dataset Tutorial
132
+ - **Emoji**: 📚
133
+ - **Theme**: Soft (veya istediğiniz)
134
+ - **Hardware**: CPU Basic (ücretsiz)
135
+
136
+ ### README Metadata
137
+
138
+ README.md başındaki metadata'yı güncelleyin:
139
+ ```yaml
140
+ ---
141
+ title: Advanced Dataset Tutorial
142
+ emoji: 📚
143
+ colorFrom: blue
144
+ colorTo: purple
145
+ sdk: gradio
146
+ sdk_version: 4.44.0
147
+ app_file: space/app.py
148
+ ---
149
+ ```
150
+
151
+ ## 🐛 Sorun Giderme
152
+
153
+ ### Space Build Hatası
154
+
155
+ 1. Logs'u kontrol et
156
+ 2. `requirements.txt` doğru mu?
157
+ 3. `app.py` path'i doğru mu?
158
+
159
+ ### Import Hatası
160
+
161
+ ```python
162
+ # app.py'de path ekle
163
+ import sys
164
+ from pathlib import Path
165
+ sys.path.append(str(Path(__file__).parent / "modules"))
166
+ ```
167
+
168
+ ### Network Hatası
169
+
170
+ Hugging Face'te bazı URL'ler block'lanabilir. Lokal dataset'ler kullanın.
171
+
172
+ ## 📚 Kaynaklar
173
+
174
+ - [Hugging Face Spaces Docs](https://huggingface.co/docs/hub/spaces)
175
+ - [Gradio Docs](https://gradio.app/docs/)
176
+ - [Git LFS](https://git-lfs.github.com/)
177
+
178
+ ## ✅ Başarı!
179
+
180
+ Space'iniz hazır! Şimdi:
181
+
182
+ 1. 🌐 **Demo'yu paylaş**: Space URL'ini arkadaşlarınla paylaş
183
+ 2. ⭐ **Community**: Discussion'lar aç, feedback al
184
+ 3. 🔄 **Güncelle**: Düzenli olarak yeni örnekler ekle
185
+ 4. 📊 **İstatistik**: Space usage'i takip et
186
+
187
+ ---
188
+
189
+ **İyi şanslar! 🚀**
190
+
191
+ Sorularınız için: [@yourusername](https://huggingface.co/YOUR-USERNAME)
LICENSE ADDED
@@ -0,0 +1,21 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ MIT License
2
+
3
+ Copyright (c) 2024 Advanced Dataset Tutorial
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
README.md CHANGED
@@ -1,12 +1,293 @@
1
  ---
2
- title: Advanced Dataset Tutorial
3
- emoji: 🐠
4
- colorFrom: pink
5
- colorTo: gray
6
  sdk: gradio
7
- sdk_version: 5.49.1
8
- app_file: app.py
9
  pinned: false
 
 
 
 
 
 
 
 
10
  ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
+ title: Advanced Dataset Tutorial - Hugging Face Datasets İleri Seviye
3
+ emoji: 📚
4
+ colorFrom: blue
5
+ colorTo: purple
6
  sdk: gradio
7
+ sdk_version: 4.44.0
8
+ app_file: space/app.py
9
  pinned: false
10
+ license: mit
11
+ tags:
12
+ - datasets
13
+ - tutorial
14
+ - nlp
15
+ - machine-learning
16
+ - data-processing
17
+ - Turkish
18
  ---
19
 
20
+ # 📚 Advanced Dataset Tutorial - Hugging Face Datasets İleri Seviye
21
+
22
+ Hugging Face Datasets kütüphanesi ile ileri seviye veri işleme teknikleri için kapsamlı Türkçe eğitim materyali.
23
+
24
+ ## 🎯 Proje Hakkında
25
+
26
+ Bu proje, Hugging Face Datasets kütüphanesini profesyonel düzeyde kullanmak isteyenler için hazırlanmış kapsamlı bir eğitim serisidir. 4 ana modül ve 20+ pratik örnek içerir.
27
+
28
+ ## 📖 Modüller
29
+
30
+ ### 1️⃣ Büyük Ölçekli Datasets
31
+ - **Streaming ile büyük veri işleme** (750GB+ datasets)
32
+ - **Memory-efficient preprocessing**
33
+ - **Batch processing optimizasyonu** (2.3x hızlandırma)
34
+ - **Multi-process parallelization** (64x hızlandırma)
35
+ - **Cache yönetimi** (12.1x hızlandırma)
36
+ - **Dataset sharding ve distributed training**
37
+
38
+ **Performans Kazanımları:**
39
+ - ⚡ Batch processing: 2.3x daha hızlı
40
+ - 💾 Cache kullanımı: 12.1x daha hızlı
41
+ - 🚀 Multi-processing: 64x daha hızlı
42
+ - 📦 Generator pattern: Minimal RAM kullanımı
43
+
44
+ ### 2️⃣ Domain-Specific Datasets
45
+ - **Bilimsel makaleler** (arXiv, PubMed style)
46
+ - **Kod datasets** (6 programlama dili)
47
+ - **Finansal analiz** (sentiment + market data)
48
+ - **Tıbbi/sağlık** (PHI anonymization)
49
+ - **Cross-domain integration** (3 çözüm yöntemi)
50
+
51
+ **Üretilen Datasets:**
52
+ - 🔬 2,000 bilimsel makale
53
+ - 💻 2,000 kod örneği
54
+ - 💰 2,000 finansal kayıt
55
+ - 🏥 2,000 tıbbi kayıt
56
+
57
+ ### 3️⃣ İleri Teknikler
58
+ - **Custom Data Collators** (3 farklı tip)
59
+ - **Advanced Feature Extraction** (10+ feature)
60
+ - **Preprocessing Pipelines** (modular & reusable)
61
+ - **Data Augmentation** (3x veri artışı)
62
+ - **Stratified Sampling** (balanced splits)
63
+ - **Dynamic Batching** (40% padding azalması)
64
+ - **Active Learning integration**
65
+
66
+ **Teknikler:**
67
+ - 📦 Simple, Padding, Advanced Collators
68
+ - 🔧 Feature Engineering Pipeline
69
+ - 🎲 Smart Data Augmentation
70
+ - 📊 Diversity & Uncertainty Sampling
71
+
72
+ ### 4️⃣ Özel Görevler İçin Datasets
73
+ - **Question Answering** (SQuAD-style)
74
+ - **Summarization** (CNN/DailyMail)
75
+ - **Named Entity Recognition** (BIO tagging)
76
+ - **Sentiment Analysis** (aspect-based)
77
+ - **Text Classification** (multi-class)
78
+ - **Multi-Task Learning**
79
+
80
+ **Task-Specific Datasets:**
81
+ - ❓ 200 QA pairs + 100 multiple choice
82
+ - 📝 100 summarization pairs
83
+ - 🏷️ 100 NER annotated sentences
84
+ - 😊 300 sentiment reviews
85
+ - 📊 200 topic classification
86
+
87
+ ## 🚀 Hızlı Başlangıç
88
+
89
+ ### Online Demo (Gradio)
90
+ ```bash
91
+ # Space'i çalıştır
92
+ python space/app.py
93
+ ```
94
+
95
+ ### Manuel Kullanım
96
+ ```python
97
+ from datasets import load_dataset
98
+
99
+ # Örnek: Büyük dataset streaming
100
+ dataset = load_dataset("tugrulkaya/advanced-dataset-tutorial")
101
+ ```
102
+
103
+ ## 💻 Kurulum
104
+
105
+ ```bash
106
+ # Gerekli kütüphaneler
107
+ pip install datasets transformers numpy pandas
108
+
109
+ # Opsiyonel
110
+ pip install gradio # İnteraktif demo için
111
+ ```
112
+
113
+ ## 📂 Proje Yapısı
114
+
115
+ ```
116
+ advanced-dataset-tutorial/
117
+ ├── 📊 datasets/ # Örnek dataset'ler
118
+ │ ├── large_scale_example/ # Büyük ölçekli örnekler
119
+ │ ├── domain_specific_example/ # Domain-specific örnekler
120
+ │ ├── advanced_techniques_example/ # İleri teknik örnekleri
121
+ │ └── task_specific_example/ # Task-specific örnekler
122
+
123
+ ├── 🌐 space/ # Gradio Space
124
+ │ ├── app.py # Ana uygulama
125
+ │ ├── modules/ # Tüm modül scriptleri
126
+ │ │ ├── 01_buyuk_olcekli_datasets_complete.py
127
+ │ │ ├── 02_domain_specific_datasets.py
128
+ │ │ ├── 02b_cross_domain_fix.py
129
+ │ │ ├── 03_ileri_teknikler_part1.py
130
+ │ │ ├── 03_ileri_teknikler_part2.py
131
+ │ │ └── 04_ozel_gorevler.py
132
+ │ └── README.md
133
+
134
+ └── README.md # Bu dosya
135
+ ```
136
+
137
+ ## 🎓 Öğrenme Yolu
138
+
139
+ ### Başlangıç Seviyesi
140
+ 1. ✅ Bölüm 1: Büyük Ölçekli Datasets
141
+ - Streaming basics
142
+ - Batch processing
143
+ - Memory management
144
+
145
+ ### Orta Seviye
146
+ 2. ✅ Bölüm 2: Domain-Specific Datasets
147
+ - Scientific data
148
+ - Code datasets
149
+ - Cross-domain integration
150
+
151
+ ### İleri Seviye
152
+ 3. ✅ Bölüm 3: İleri Teknikler
153
+ - Custom collators
154
+ - Pipeline patterns
155
+ - Advanced sampling
156
+
157
+ ### Uzman Seviye
158
+ 4. ✅ Bölüm 4: Özel Görevler
159
+ - Task-specific preprocessing
160
+ - Quality metrics
161
+ - Multi-task learning
162
+
163
+ ## 📊 Performans Metrikleri
164
+
165
+ | Teknik | Performans Artışı | Kullanım Senaryosu |
166
+ |--------|-------------------|-------------------|
167
+ | Batch Processing | 2.3x daha hızlı | Tüm preprocessing |
168
+ | Cache Kullanımı | 12.1x daha hızlı | Tekrarlanan işlemler |
169
+ | Multi-Processing | 64x daha hızlı | CPU-intensive tasks |
170
+ | Dynamic Batching | 40% padding azalması | Training efficiency |
171
+ | Data Augmentation | 3x veri artışı | Class imbalance |
172
+
173
+ ## 🔧 Best Practices
174
+
175
+ ### Memory Efficiency
176
+ ```python
177
+ # ✅ DOĞRU: Streaming ile büyük veri
178
+ dataset = load_dataset("huge_dataset", streaming=True)
179
+
180
+ # ❌ YANLIŞ: Tüm veriyi RAM'e yükleme
181
+ dataset = load_dataset("huge_dataset") # 100GB RAM!
182
+ ```
183
+
184
+ ### Batch Processing
185
+ ```python
186
+ # ✅ DOĞRU: Batched operations
187
+ dataset.map(process_fn, batched=True, batch_size=1000)
188
+
189
+ # ❌ YANLIŞ: Tek tek işleme
190
+ dataset.map(process_fn, batched=False) # 10x-100x yavaş!
191
+ ```
192
+
193
+ ### Cross-Domain Integration
194
+ ```python
195
+ # ✅ DOĞRU: Ortak schema'ya normalize et
196
+ def normalize(example, domain):
197
+ return {
198
+ 'text': example.get('text') or example.get('content'),
199
+ 'domain': domain,
200
+ 'metadata': json.dumps(example.get('meta', {}))
201
+ }
202
+
203
+ # ❌ YANLIŞ: Farklı schema'ları direkt birleştirme
204
+ combined = concatenate_datasets([ds1, ds2]) # ArrowTypeError!
205
+ ```
206
+
207
+ ## 🎯 Kullanım Örnekleri
208
+
209
+ ### 1. Büyük Dataset İşleme
210
+ ```python
211
+ from datasets import load_dataset
212
+
213
+ # Streaming mode
214
+ dataset = load_dataset("c4", "en", split="train", streaming=True)
215
+
216
+ # İlk 1000 örneği işle
217
+ for i, example in enumerate(dataset.take(1000)):
218
+ process(example)
219
+ ```
220
+
221
+ ### 2. Custom Collator
222
+ ```python
223
+ class CustomCollator:
224
+ def __call__(self, batch):
225
+ texts = [ex['text'] for ex in batch]
226
+ labels = [ex['label'] for ex in batch]
227
+ return {'texts': texts, 'labels': labels}
228
+
229
+ # DataLoader ile kullan
230
+ collator = CustomCollator()
231
+ dataloader = DataLoader(dataset, collate_fn=collator)
232
+ ```
233
+
234
+ ### 3. Data Augmentation
235
+ ```python
236
+ def augment(example):
237
+ # Word deletion
238
+ words = example['text'].split()
239
+ augmented = ' '.join(random.sample(words, k=len(words)-2))
240
+ return {'text': augmented, 'label': example['label']}
241
+
242
+ augmented_dataset = dataset.map(augment)
243
+ ```
244
+
245
+ ## 📈 İstatistikler
246
+
247
+ - **Toplam Kod Satırı**: 5,000+
248
+ - **Örnek Sayısı**: 20,000+
249
+ - **Teknik Sayısı**: 50+
250
+ - **Best Practices**: 100+
251
+
252
+ ## 🤝 Katkıda Bulunma
253
+
254
+ Bu proje açık kaynaklıdır ve katkılara açıktır!
255
+
256
+ 1. Fork edin
257
+ 2. Feature branch oluşturun (`git checkout -b feature/amazing`)
258
+ 3. Commit edin (`git commit -m 'Add amazing feature'`)
259
+ 4. Push edin (`git push origin feature/amazing`)
260
+ 5. Pull Request açın
261
+
262
+ ## 📝 Lisans
263
+
264
+ MIT License - detaylar için [LICENSE](LICENSE) dosyasına bakın.
265
+
266
+ ## 👨‍💻 Yazar
267
+
268
+ Bu eğitim materyali, Hugging Face Datasets kullanıcıları için pratik ve uygulanabilir bilgi sağlamak amacıyla hazırlanmıştır.
269
+
270
+ ## 🙏 Teşekkürler
271
+
272
+ - Hugging Face ekibine harika `datasets` kütüphanesi için
273
+ - Açık kaynak topluluğuna sürekli katkıları için
274
+
275
+ ## 📚 Kaynaklar
276
+
277
+ - [Hugging Face Datasets Documentation](https://huggingface.co/docs/datasets)
278
+ - [Hugging Face Hub](https://huggingface.co/datasets)
279
+ - [Apache Arrow](https://arrow.apache.org/)
280
+
281
+ ## 🔗 Bağlantılar
282
+
283
+ - 🌐 [Hugging Face Space](https://huggingface.co/spaces/tugrulkaya/advanced-dataset-tutorial)
284
+ - 📊 [Datasets](https://huggingface.co/datasets/tugrulkaya/advanced-dataset-tutorial)
285
+ - 💬 [Discussions](https://huggingface.co/spaces/tugrulkaya/advanced-dataset-tutorial/discussions)
286
+
287
+ ---
288
+
289
+ **⭐ Beğendiyseniz yıldız vermeyi unutmayın!**
290
+
291
+ **🔄 Güncellemeler için takip edin!**
292
+
293
+ **💬 Sorularınız için Discussion açın!**
datasets/advanced_techniques_example/README.md ADDED
@@ -0,0 +1,131 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # İleri Teknikler Örnekleri
2
+
3
+ Bu klasör, advanced dataset processing teknikleri içerir.
4
+
5
+ ## Teknikler
6
+
7
+ ### 📦 Custom Data Collators
8
+
9
+ #### 1. Simple Collator
10
+ ```python
11
+ class SimpleCollator:
12
+ def __call__(self, batch):
13
+ texts = [ex['text'] for ex in batch]
14
+ labels = [ex['label'] for ex in batch]
15
+ return {'texts': texts, 'labels': labels}
16
+ ```
17
+
18
+ #### 2. Padding Collator
19
+ ```python
20
+ class PaddingCollator:
21
+ def __call__(self, batch):
22
+ # Dynamic padding
23
+ max_len = max(len(ex['text']) for ex in batch)
24
+ # Pad to max_len...
25
+ ```
26
+
27
+ #### 3. Advanced Collator
28
+ ```python
29
+ class AdvancedCollator:
30
+ def __call__(self, batch):
31
+ # Padding + normalization + stats
32
+ return {
33
+ 'input_ids': padded,
34
+ 'attention_mask': masks,
35
+ 'labels': labels,
36
+ 'batch_stats': {...}
37
+ }
38
+ ```
39
+
40
+ ### 🔧 Feature Engineering
41
+ - 10+ feature extraction
42
+ - Normalization (min-max, z-score)
43
+ - Interaction features
44
+ - Domain-specific features
45
+
46
+ ### 🎲 Data Augmentation
47
+ - Word deletion (random)
48
+ - Word swap
49
+ - Synonym replacement
50
+ - Class balancing (3x veri artışı)
51
+
52
+ ### 📊 Advanced Sampling
53
+
54
+ #### Stratified Sampling
55
+ ```python
56
+ # Balanced train/test splits
57
+ train, test = stratified_split(
58
+ dataset,
59
+ stratify_column='label',
60
+ train_ratio=0.8
61
+ )
62
+ ```
63
+
64
+ #### Diversity Sampling
65
+ ```python
66
+ # Maximum diversity
67
+ diverse = max_diversity_sampling(
68
+ dataset,
69
+ n_samples=100,
70
+ feature_columns=['length', 'score']
71
+ )
72
+ ```
73
+
74
+ #### Active Learning
75
+ ```python
76
+ # Uncertainty-based
77
+ uncertain = uncertainty_sampling(
78
+ dataset,
79
+ uncertainty_scores,
80
+ n_samples=100
81
+ )
82
+ ```
83
+
84
+ ### 📦 Dynamic Batching
85
+
86
+ #### Length-Based
87
+ ```python
88
+ # Benzer uzunlukları grupla
89
+ batches = length_based_batching(
90
+ dataset,
91
+ length_column='length'
92
+ )
93
+ # Result: 40% padding azalması
94
+ ```
95
+
96
+ #### Bucket Batching
97
+ ```python
98
+ # Bucket'lara ayır
99
+ batches = bucket_batching(
100
+ dataset,
101
+ n_buckets=5
102
+ )
103
+ ```
104
+
105
+ ## Pipeline Pattern
106
+
107
+ ```python
108
+ pipeline = DataPipeline("My Pipeline")
109
+ pipeline.add_step("clean", clean_fn)
110
+ pipeline.add_step("features", extract_features)
111
+ pipeline.add_step("normalize", normalize_fn)
112
+
113
+ result = pipeline.run(dataset)
114
+ ```
115
+
116
+ ## Performans
117
+
118
+ | Teknik | Artış | Use Case |
119
+ |--------|-------|----------|
120
+ | Batch Processing | 2.3x | Tüm işlemler |
121
+ | Dynamic Batching | 40% | Padding azalması |
122
+ | Data Augmentation | 3x | Veri artışı |
123
+ | Stratified Sampling | - | Balanced splits |
124
+
125
+ ## Best Practices
126
+
127
+ ✅ Collator'ı modele göre özelleştir
128
+ ✅ Pipeline pattern kullan
129
+ ✅ Augmentation ile balance et
130
+ ✅ Stratified sampling ile generalize et
131
+ ✅ Dynamic batching ile optimize et
datasets/domain_specific_example/README.md ADDED
@@ -0,0 +1,76 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Domain-Specific Datasets Örnekleri
2
+
3
+ Bu klasör, farklı domain'ler için özelleştirilmiş dataset örnekleri içerir.
4
+
5
+ ## Domain'ler
6
+
7
+ ### 🔬 Bilimsel Makaleler
8
+ - arXiv, PubMed style
9
+ - 2,000 örnek
10
+ - Citation tracking
11
+ - Abstract + full text
12
+
13
+ ### 💻 Kod Datasets
14
+ - 6 programlama dili
15
+ - 2,000 kod örneği
16
+ - Syntax parsing
17
+ - Docstring extraction
18
+
19
+ ### 💰 Finansal Veri
20
+ - Sentiment analysis
21
+ - Market data
22
+ - 2,000 kayıt
23
+ - Time series
24
+
25
+ ### 🏥 Tıbbi Veri
26
+ - PHI anonymization
27
+ - HIPAA compliance
28
+ - 2,000 kayıt
29
+ - Clinical notes
30
+
31
+ ## Cross-Domain Integration
32
+
33
+ ### Problem: Schema Mismatch
34
+ ```python
35
+ # ❌ Bu HATA verir
36
+ combined = concatenate_datasets([sci_ds, code_ds])
37
+ # ArrowTypeError: struct fields don't match
38
+ ```
39
+
40
+ ### Çözüm 1: Flatten Approach
41
+ ```python
42
+ # ✅ Ortak schema
43
+ def normalize(ex, domain):
44
+ return {
45
+ 'text': ex.get('text'),
46
+ 'domain': domain,
47
+ 'field1': ex.get('field1'),
48
+ 'field2': ex.get('field2'),
49
+ # ... tüm field'lar
50
+ }
51
+ ```
52
+
53
+ ### Çözüm 2: JSON Metadata
54
+ ```python
55
+ # ✅ Esnek yapı
56
+ def normalize(ex, domain):
57
+ return {
58
+ 'text': ex.get('text'),
59
+ 'domain': domain,
60
+ 'metadata_json': json.dumps(ex.get('meta', {}))
61
+ }
62
+ ```
63
+
64
+ ### Çözüm 3: Separate Tables
65
+ ```python
66
+ # ✅ Database-style
67
+ unified_table + metadata_tables
68
+ ```
69
+
70
+ ## Best Practices
71
+
72
+ ✅ Domain expertise kullan
73
+ ✅ Specialized tokenization
74
+ ✅ Quality filtering
75
+ ✅ Ethical guidelines
76
+ ✅ Schema normalization
datasets/large_scale_example/README.md ADDED
@@ -0,0 +1,56 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Büyük Ölçekli Datasets Örnekleri
2
+
3
+ Bu klasör, büyük ölçekli dataset işleme teknikleri için örnek kodlar içerir.
4
+
5
+ ## Teknikler
6
+
7
+ ### 1. Streaming
8
+ - 750GB+ veri işleme
9
+ - RAM kullanımı minimal
10
+ - Generator pattern
11
+
12
+ ### 2. Batch Processing
13
+ - 2.3x hızlandırma
14
+ - Vectorized operations
15
+ - Optimal batch size: 32-1000
16
+
17
+ ### 3. Multi-Processing
18
+ - 64x hızlandırma
19
+ - CPU parallelization
20
+ - num_proc optimization
21
+
22
+ ### 4. Cache Yönetimi
23
+ - 12.1x hızlandırma
24
+ - Disk caching
25
+ - Arrow format
26
+
27
+ ## Kullanım
28
+
29
+ ```python
30
+ # Streaming örneği
31
+ from datasets import load_dataset
32
+
33
+ dataset = load_dataset(
34
+ "c4",
35
+ "en",
36
+ split="train",
37
+ streaming=True
38
+ )
39
+
40
+ for example in dataset.take(1000):
41
+ process(example)
42
+ ```
43
+
44
+ ## Performans Metrikleri
45
+
46
+ - Batch processing: **2.3x** daha hızlı
47
+ - Cache: **12.1x** daha hızlı
48
+ - Multi-processing: **64x** daha hızlı
49
+
50
+ ## Best Practices
51
+
52
+ ✅ Her zaman `batched=True` kullan
53
+ ✅ Optimal batch_size seç (32-1000)
54
+ ✅ `num_proc` ile paralelize et
55
+ ✅ Cache stratejisi belirle
56
+ ✅ Streaming ile büyük veri işle
datasets/task_specific_example/README.md ADDED
@@ -0,0 +1,189 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Özel Görevler İçin Datasets
2
+
3
+ Bu klasör, specific NLP task'leri için dataset örnekleri içerir.
4
+
5
+ ## Task'ler
6
+
7
+ ### ❓ Question Answering
8
+
9
+ #### Extractive QA (SQuAD-style)
10
+ ```python
11
+ {
12
+ 'context': 'Paris is the capital of France...',
13
+ 'question': 'What is the capital of France?',
14
+ 'answers': {
15
+ 'text': ['Paris'],
16
+ 'answer_start': [0]
17
+ }
18
+ }
19
+ ```
20
+
21
+ #### Multiple Choice QA
22
+ ```python
23
+ {
24
+ 'question': 'What is 2+2?',
25
+ 'choices': ['3', '4', '5', '6'],
26
+ 'answer': 1 # Index of correct answer
27
+ }
28
+ ```
29
+
30
+ **Best Practices:**
31
+ - Validate answer spans
32
+ - Handle impossible questions
33
+ - Question type classification
34
+ - Context length management
35
+
36
+ ### 📝 Summarization
37
+
38
+ #### News Summarization
39
+ ```python
40
+ {
41
+ 'article': 'Long news article...',
42
+ 'summary': 'Brief summary...',
43
+ 'compression_ratio': 0.24
44
+ }
45
+ ```
46
+
47
+ **Metrics:**
48
+ - ROUGE scores
49
+ - Compression ratio (20-30% optimal)
50
+ - Abstractive vs Extractive
51
+
52
+ **Best Practices:**
53
+ - Multiple reference summaries
54
+ - Length constraints
55
+ - Quality validation
56
+
57
+ ### 🏷️ Named Entity Recognition
58
+
59
+ #### BIO Tagging
60
+ ```python
61
+ {
62
+ 'tokens': ['John', 'Smith', 'works', 'at', 'Google'],
63
+ 'ner_tags': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG']
64
+ }
65
+ ```
66
+
67
+ **Tag Schema:**
68
+ - B-PER, I-PER (Person)
69
+ - B-ORG, I-ORG (Organization)
70
+ - B-LOC, I-LOC (Location)
71
+ - O (Outside)
72
+
73
+ **Best Practices:**
74
+ - Consistent tagging scheme
75
+ - Entity type taxonomy
76
+ - Nested entities handling
77
+ - Entity linking (optional)
78
+
79
+ ### 😊 Sentiment Analysis
80
+
81
+ #### Binary/Multi-class
82
+ ```python
83
+ {
84
+ 'text': 'This product is amazing!',
85
+ 'label': 2, # 0: neg, 1: neutral, 2: pos
86
+ 'confidence': 0.95
87
+ }
88
+ ```
89
+
90
+ #### Aspect-Based
91
+ ```python
92
+ {
93
+ 'text': 'Great product but slow delivery',
94
+ 'aspect_sentiments': {
95
+ 'product': 'positive',
96
+ 'delivery': 'negative'
97
+ }
98
+ }
99
+ ```
100
+
101
+ **Best Practices:**
102
+ - Multi-level granularity
103
+ - Confidence scores
104
+ - Domain-specific lexicons
105
+ - Emotion detection
106
+
107
+ ### 📊 Text Classification
108
+
109
+ #### Topic Classification
110
+ ```python
111
+ {
112
+ 'text': 'Article text...',
113
+ 'label': 'technology',
114
+ 'label_id': 0
115
+ }
116
+ ```
117
+
118
+ **Best Practices:**
119
+ - Balanced classes
120
+ - Hierarchical categories
121
+ - Multi-label support
122
+ - Class imbalance handling
123
+
124
+ ### 🎯 Multi-Task Learning
125
+
126
+ #### Unified Format
127
+ ```python
128
+ {
129
+ 'text': 'Sample text...',
130
+ 'sentiment': 'positive',
131
+ 'topic': 'technology',
132
+ 'quality_score': 0.85
133
+ }
134
+ ```
135
+
136
+ **Best Practices:**
137
+ - Consistent preprocessing
138
+ - Task-specific heads
139
+ - Shared representations
140
+ - Task weighting
141
+
142
+ ## Dataset Statistics
143
+
144
+ | Task | Örnekler | Format |
145
+ |------|----------|--------|
146
+ | QA | 300 | Extractive + MC |
147
+ | Summarization | 100 | News articles |
148
+ | NER | 100 | BIO tagged |
149
+ | Sentiment | 350 | Multi-class + Aspect |
150
+ | Classification | 200 | Topic |
151
+ | Multi-Task | 100 | Unified |
152
+
153
+ ## Quality Metrics
154
+
155
+ ### QA
156
+ - Exact Match (EM)
157
+ - F1 Score
158
+ - Answer span accuracy
159
+
160
+ ### Summarization
161
+ - ROUGE-1, ROUGE-2, ROUGE-L
162
+ - Compression ratio
163
+ - Factual consistency
164
+
165
+ ### NER
166
+ - Precision, Recall, F1 per entity type
167
+ - Exact match
168
+ - Partial match
169
+
170
+ ### Sentiment
171
+ - Accuracy
172
+ - Macro/Micro F1
173
+ - Confusion matrix
174
+
175
+ ### Classification
176
+ - Accuracy
177
+ - Per-class F1
178
+ - Macro/Weighted F1
179
+
180
+ ## Best Practices (Genel)
181
+
182
+ ✅ Clear annotation guidelines
183
+ ✅ Inter-annotator agreement
184
+ ✅ Quality control checks
185
+ ✅ Regular dataset updates
186
+ ✅ Version control
187
+ ✅ Documentation
188
+ ✅ Ethical considerations
189
+ ✅ Bias analysis
requirements.txt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ datasets>=2.14.0
2
+ transformers>=4.30.0
3
+ gradio>=4.44.0
4
+ numpy>=1.24.0
5
+ pandas>=2.0.0
space/app.py ADDED
@@ -0,0 +1,493 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Advanced Dataset Tutorial - Interactive Gradio Demo
3
+ ===================================================
4
+
5
+ Hugging Face Datasets ile ileri seviye teknikler için interaktif demo
6
+ """
7
+
8
+ import gradio as gr
9
+ import sys
10
+ import os
11
+ from pathlib import Path
12
+
13
+ # Modülleri import edebilmek için path ekle
14
+ sys.path.append(str(Path(__file__).parent / "modules"))
15
+
16
+ # Demo için basit örnekler
17
+ DEMO_CODES = {
18
+ "Büyük Ölçekli - Streaming": """
19
+ from datasets import load_dataset
20
+
21
+ # Streaming mode - RAM'i patlatmadan büyük veri
22
+ dataset = load_dataset(
23
+ "c4",
24
+ "en",
25
+ split="train",
26
+ streaming=True # ✨ Anahtar parametre
27
+ )
28
+
29
+ # İlk 1000 örneği işle
30
+ for i, example in enumerate(dataset.take(1000)):
31
+ print(f"Example {i}: {example['text'][:100]}...")
32
+ """,
33
+
34
+ "Büyük Ölçekli - Batch Processing": """
35
+ from datasets import load_dataset
36
+
37
+ dataset = load_dataset("imdb", split="train")
38
+
39
+ # ❌ YAVAŞ: Tek tek işleme
40
+ def process_single(example):
41
+ return {'length': len(example['text'])}
42
+
43
+ slow = dataset.map(process_single)
44
+
45
+ # ✅ HIZLI: Batch processing
46
+ def process_batch(examples):
47
+ return {'length': [len(t) for t in examples['text']]}
48
+
49
+ fast = dataset.map(
50
+ process_batch,
51
+ batched=True, # 🚀 10x-100x daha hızlı!
52
+ batch_size=1000
53
+ )
54
+ """,
55
+
56
+ "Domain-Specific - Cross-Domain Fix": """
57
+ from datasets import Dataset, concatenate_datasets
58
+ import json
59
+
60
+ # ❌ PROBLEM: Farklı schema'lar
61
+ sci_data = Dataset.from_dict({
62
+ 'text': ['Scientific paper...'],
63
+ 'metadata': {'year': 2024, 'citations': 10}
64
+ })
65
+
66
+ code_data = Dataset.from_dict({
67
+ 'code': ['def hello(): pass'],
68
+ 'language': 'Python'
69
+ })
70
+
71
+ # Bu HATA verir! ArrowTypeError
72
+ # combined = concatenate_datasets([sci_data, code_data])
73
+
74
+ # ✅ ÇÖZÜM: JSON metadata approach
75
+ def normalize_to_json(example, domain):
76
+ return {
77
+ 'text': example.get('text') or example.get('code'),
78
+ 'domain': domain,
79
+ 'metadata_json': json.dumps(example.get('metadata', {}))
80
+ }
81
+
82
+ sci_norm = sci_data.map(lambda x: normalize_to_json(x, 'scientific'))
83
+ code_norm = code_data.map(lambda x: normalize_to_json(x, 'code'))
84
+
85
+ # Şimdi ÇALIŞIR! ✅
86
+ combined = concatenate_datasets([sci_norm, code_norm])
87
+ """,
88
+
89
+ "İleri Teknikler - Custom Collator": """
90
+ from datasets import Dataset
91
+
92
+ class AdvancedCollator:
93
+ def __init__(self, max_length=128, pad_token='[PAD]'):
94
+ self.max_length = max_length
95
+ self.pad_token = pad_token
96
+
97
+ def __call__(self, batch):
98
+ # Tokenize (basit örnek)
99
+ tokenized = [ex['text'].split()[:self.max_length]
100
+ for ex in batch]
101
+
102
+ # Dynamic padding - batch içindeki max length'e göre
103
+ max_len = max(len(tokens) for tokens in tokenized)
104
+
105
+ padded = []
106
+ masks = []
107
+ for tokens in tokenized:
108
+ pad_len = max_len - len(tokens)
109
+ padded.append(tokens + [self.pad_token] * pad_len)
110
+ masks.append([1] * len(tokens) + [0] * pad_len)
111
+
112
+ return {
113
+ 'input_tokens': padded,
114
+ 'attention_mask': masks,
115
+ 'labels': [ex['label'] for ex in batch]
116
+ }
117
+
118
+ # Kullanım
119
+ collator = AdvancedCollator()
120
+ batch = [
121
+ {'text': 'Short text', 'label': 0},
122
+ {'text': 'Much longer text here', 'label': 1}
123
+ ]
124
+ collated = collator(batch)
125
+ """,
126
+
127
+ "İleri Teknikler - Data Augmentation": """
128
+ from datasets import Dataset
129
+ import random
130
+
131
+ class DataAugmenter:
132
+ def augment(self, text):
133
+ words = text.split()
134
+
135
+ # Random word deletion
136
+ if random.random() < 0.3:
137
+ words = [w for w in words if random.random() > 0.1]
138
+
139
+ # Random word swap
140
+ if len(words) > 1 and random.random() < 0.3:
141
+ i, j = random.sample(range(len(words)), 2)
142
+ words[i], words[j] = words[j], words[i]
143
+
144
+ return ' '.join(words) if words else text
145
+
146
+ def augment_dataset(self, dataset, n_augmentations=2):
147
+ augmented = []
148
+
149
+ for example in dataset:
150
+ # Original
151
+ augmented.append({
152
+ **example,
153
+ 'is_augmented': False
154
+ })
155
+
156
+ # Augmented versions
157
+ for _ in range(n_augmentations):
158
+ augmented.append({
159
+ **example,
160
+ 'text': self.augment(example['text']),
161
+ 'is_augmented': True
162
+ })
163
+
164
+ return Dataset.from_list(augmented)
165
+
166
+ # Kullanım: 1 örnek → 3 örnek (1 original + 2 augmented)
167
+ augmenter = DataAugmenter()
168
+ original = Dataset.from_dict({'text': ['Hello world'], 'label': [0]})
169
+ augmented = augmenter.augment_dataset(original, n_augmentations=2)
170
+ print(f"Dataset boyutu: {len(original)} → {len(augmented)}")
171
+ """,
172
+
173
+ "Özel Görevler - Question Answering": """
174
+ from datasets import Dataset
175
+
176
+ # SQuAD-style QA dataset
177
+ qa_dataset = Dataset.from_dict({
178
+ 'context': [
179
+ 'The Eiffel Tower is in Paris. It was built in 1889.'
180
+ ],
181
+ 'question': [
182
+ 'Where is the Eiffel Tower?'
183
+ ],
184
+ 'answers': [{
185
+ 'text': ['Paris'],
186
+ 'answer_start': [23] # Character position
187
+ }]
188
+ })
189
+
190
+ # Preprocessing
191
+ def preprocess_qa(example):
192
+ # Answer'ı validate et
193
+ context = example['context']
194
+ answer = example['answers']['text'][0]
195
+ start = example['answers']['answer_start'][0]
196
+
197
+ # Extract ve kontrol et
198
+ extracted = context[start:start + len(answer)]
199
+ is_valid = extracted == answer
200
+
201
+ return {
202
+ **example,
203
+ 'is_valid': is_valid,
204
+ 'question_type': example['question'].split()[0].lower()
205
+ }
206
+
207
+ qa_processed = qa_dataset.map(preprocess_qa)
208
+ """,
209
+
210
+ "Özel Görevler - NER": """
211
+ from datasets import Dataset
212
+
213
+ # Named Entity Recognition (BIO tagging)
214
+ ner_dataset = Dataset.from_dict({
215
+ 'tokens': [
216
+ ['John', 'Smith', 'works', 'at', 'Google']
217
+ ],
218
+ 'ner_tags': [
219
+ ['B-PER', 'I-PER', 'O', 'O', 'B-ORG']
220
+ ]
221
+ })
222
+
223
+ # Tag to ID mapping
224
+ tag2id = {
225
+ 'O': 0,
226
+ 'B-PER': 1, 'I-PER': 2,
227
+ 'B-ORG': 3, 'I-ORG': 4,
228
+ 'B-LOC': 5, 'I-LOC': 6
229
+ }
230
+
231
+ # Convert tags to IDs
232
+ def convert_tags(example):
233
+ return {
234
+ **example,
235
+ 'ner_tag_ids': [tag2id[tag] for tag in example['ner_tags']],
236
+ 'sentence': ' '.join(example['tokens'])
237
+ }
238
+
239
+ ner_processed = ner_dataset.map(convert_tags)
240
+
241
+ # Entity statistics
242
+ def count_entities(dataset):
243
+ entity_types = {}
244
+ for ex in dataset:
245
+ for tag in ex['ner_tags']:
246
+ if tag.startswith('B-'):
247
+ entity_type = tag.split('-')[1]
248
+ entity_types[entity_type] = entity_types.get(entity_type, 0) + 1
249
+ return entity_types
250
+
251
+ print(count_entities(ner_processed))
252
+ """,
253
+
254
+ "Özel Görevler - Sentiment Analysis": """
255
+ from datasets import Dataset
256
+
257
+ # Sentiment classification dataset
258
+ sentiment_dataset = Dataset.from_dict({
259
+ 'text': [
260
+ 'This product is amazing!',
261
+ 'Terrible, waste of money.',
262
+ 'It\\'s okay, nothing special.'
263
+ ],
264
+ 'label': [2, 0, 1], # 0: negative, 1: neutral, 2: positive
265
+ 'label_text': ['positive', 'negative', 'neutral']
266
+ })
267
+
268
+ # Feature extraction
269
+ def extract_sentiment_features(example):
270
+ text = example['text'].lower()
271
+
272
+ positive_words = ['amazing', 'great', 'excellent', 'love']
273
+ negative_words = ['terrible', 'waste', 'bad', 'poor']
274
+
275
+ pos_count = sum(1 for word in positive_words if word in text)
276
+ neg_count = sum(1 for word in negative_words if word in text)
277
+
278
+ return {
279
+ **example,
280
+ 'positive_words': pos_count,
281
+ 'negative_words': neg_count,
282
+ 'sentiment_score': pos_count - neg_count,
283
+ 'has_exclamation': '!' in example['text']
284
+ }
285
+
286
+ sentiment_featured = sentiment_dataset.map(extract_sentiment_features)
287
+
288
+ # Class balancing with augmentation
289
+ def balance_classes(dataset, target_per_class=100):
290
+ from collections import defaultdict
291
+
292
+ # Group by label
293
+ by_label = defaultdict(list)
294
+ for ex in dataset:
295
+ by_label[ex['label']].append(ex)
296
+
297
+ # Augment minority classes
298
+ balanced = []
299
+ for label, examples in by_label.items():
300
+ balanced.extend(examples)
301
+
302
+ # Add augmented copies if needed
303
+ while len([e for e in balanced if e['label'] == label]) < target_per_class:
304
+ # Simple augmentation: copy with modified text
305
+ ex = examples[len(balanced) % len(examples)]
306
+ balanced.append({
307
+ **ex,
308
+ 'is_augmented': True
309
+ })
310
+
311
+ return Dataset.from_list(balanced)
312
+ """
313
+ }
314
+
315
+ BEST_PRACTICES = """
316
+ # 🎯 Best Practices Özeti
317
+
318
+ ## Memory Efficiency
319
+ ```python
320
+ # ✅ DOĞRU: Streaming
321
+ dataset = load_dataset("huge_data", streaming=True)
322
+
323
+ # ❌ YANLIŞ: Tüm veriyi RAM'e yükleme
324
+ dataset = load_dataset("huge_data") # 100GB RAM!
325
+ ```
326
+
327
+ ## Batch Processing
328
+ ```python
329
+ # ✅ DOĞRU: Batched=True
330
+ dataset.map(fn, batched=True, batch_size=1000)
331
+
332
+ # ❌ YANLIŞ: Tek tek
333
+ dataset.map(fn) # 10x-100x yavaş!
334
+ ```
335
+
336
+ ## Cross-Domain
337
+ ```python
338
+ # ✅ DOĞRU: Normalize et
339
+ def normalize(ex, domain):
340
+ return {'text': ex.get('text'), 'domain': domain}
341
+
342
+ # ❌ YANLIŞ: Direkt birleştir
343
+ concatenate_datasets([ds1, ds2]) # Error!
344
+ ```
345
+
346
+ ## Performans
347
+ - **Streaming**: RAM tasarrufu
348
+ - **Batched**: 10x-100x hız
349
+ - **num_proc**: CPU parallelization
350
+ - **Cache**: Tekrar kullanım
351
+ """
352
+
353
+ def show_code(module_name):
354
+ """Seçilen modül için kod göster"""
355
+ return DEMO_CODES.get(module_name, "Kod örneği yükleniyor...")
356
+
357
+ def show_best_practices():
358
+ """Best practices göster"""
359
+ return BEST_PRACTICES
360
+
361
+ # Gradio Interface
362
+ with gr.Blocks(title="Advanced Dataset Tutorial", theme=gr.themes.Soft()) as demo:
363
+ gr.Markdown("""
364
+ # 📚 Advanced Dataset Tutorial
365
+ ## Hugging Face Datasets - İleri Seviye Türkçe Eğitim
366
+
367
+ Bu interaktif demo, 4 modül ve 20+ teknik içeren kapsamlı dataset eğitiminin özetini sunar.
368
+ """)
369
+
370
+ with gr.Tabs():
371
+ with gr.Tab("🚀 Kod Örnekleri"):
372
+ gr.Markdown("### Her modülden pratik kod örnekleri")
373
+
374
+ module_dropdown = gr.Dropdown(
375
+ choices=list(DEMO_CODES.keys()),
376
+ label="Modül Seçin",
377
+ value=list(DEMO_CODES.keys())[0]
378
+ )
379
+
380
+ code_output = gr.Code(
381
+ label="Kod Örneği",
382
+ language="python",
383
+ value=DEMO_CODES[list(DEMO_CODES.keys())[0]]
384
+ )
385
+
386
+ module_dropdown.change(
387
+ fn=show_code,
388
+ inputs=[module_dropdown],
389
+ outputs=[code_output]
390
+ )
391
+
392
+ with gr.Tab("📖 Modüller"):
393
+ gr.Markdown("""
394
+ ## 4 Ana Modül
395
+
396
+ ### 1️⃣ Büyük Ölçekli Datasets
397
+ - ⚡ Streaming (750GB+ data)
398
+ - 💾 Batch processing (2.3x hızlı)
399
+ - 🚀 Multi-processing (64x hızlı)
400
+ - 📦 Cache (12.1x hızlı)
401
+
402
+ ### 2️⃣ Domain-Specific Datasets
403
+ - 🔬 Bilimsel makaleler (2,000 örnek)
404
+ - 💻 Kod datasets (6 dil, 2,000 örnek)
405
+ - 💰 Finansal veri (2,000 kayıt)
406
+ - 🏥 Tıbbi veri (PHI anonymization)
407
+
408
+ ### 3️⃣ İleri Teknikler
409
+ - 📦 Custom Collators (3 tip)
410
+ - 🔧 Feature Engineering (10+ feature)
411
+ - 🎲 Data Augmentation (3x veri)
412
+ - 📊 Advanced Sampling (diversity, stratified)
413
+
414
+ ### 4️⃣ Özel Görevler
415
+ - ❓ Question Answering (SQuAD)
416
+ - 📝 Summarization (ROUGE)
417
+ - 🏷️ NER (BIO tagging)
418
+ - 😊 Sentiment Analysis
419
+ - 📊 Multi-Task Learning
420
+ """)
421
+
422
+ with gr.Tab("🎯 Best Practices"):
423
+ gr.Code(
424
+ value=BEST_PRACTICES,
425
+ label="Best Practices",
426
+ language="python"
427
+ )
428
+
429
+ with gr.Tab("📊 Performans"):
430
+ gr.Markdown("""
431
+ ## Performans Metrikleri
432
+
433
+ | Teknik | Artış | Kullanım |
434
+ |--------|-------|----------|
435
+ | **Batch Processing** | 2.3x | Tüm preprocessing |
436
+ | **Cache** | 12.1x | Tekrar işlemler |
437
+ | **Multi-Processing** | 64x | CPU tasks |
438
+ | **Dynamic Batching** | 40% | Padding azalması |
439
+ | **Data Augmentation** | 3x | Veri artışı |
440
+
441
+ ## İstatistikler
442
+
443
+ - 📝 **5,000+** kod satırı
444
+ - 🔢 **20,000+** örnek dataset
445
+ - 🛠️ **50+** teknik
446
+ - ✅ **100+** best practice
447
+
448
+ ## Kazanımlar
449
+
450
+ ✅ Büyük ölçekli veri işleme
451
+ ✅ Domain-specific preprocessing
452
+ ✅ Production-ready pipelines
453
+ ✅ Task-specific optimization
454
+ ✅ Multi-task learning
455
+ """)
456
+
457
+ with gr.Tab("ℹ️ Hakkında"):
458
+ gr.Markdown("""
459
+ ## Proje Bilgileri
460
+
461
+ **Amaç:** Hugging Face Datasets kütüphanesini profesyonel düzeyde kullanmak isteyenler için kapsamlı Türkçe kaynak
462
+
463
+ **İçerik:**
464
+ - 4 ana modül
465
+ - 20+ pratik örnek
466
+ - 50+ teknik
467
+ - 100+ best practice
468
+
469
+ **Hedef Kitle:**
470
+ - NLP mühendisleri
471
+ - ML researchers
472
+ - Data scientists
473
+ - AI developers
474
+
475
+ **Lisans:** MIT
476
+
477
+ **Kaynaklar:**
478
+ - [Hugging Face Datasets Docs](https://huggingface.co/docs/datasets)
479
+ - [GitHub Repository](https://github.com/yourusername/advanced-dataset-tutorial)
480
+ - [Hugging Face Hub](https://huggingface.co/datasets)
481
+
482
+ ---
483
+
484
+ ⭐ **Beğendiyseniz yıldız vermeyi unutmayın!**
485
+ """)
486
+
487
+ gr.Markdown("""
488
+ ---
489
+ 💡 **Not:** Bu demo, tam eğitim materyalinin özeti içindir. Detaylı örnekler ve açıklamalar için modül scriptlerine bakın.
490
+ """)
491
+
492
+ if __name__ == "__main__":
493
+ demo.launch()
space/modules/01_buyuk_olcekli_datasets_complete.py ADDED
@@ -0,0 +1,617 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BÜYÜK ÖLÇEKLİ DATASETS - İLERİ SEVİYE HUGGING FACE
3
+ =================================================
4
+ Network bağımsız versiyon - Sentetik ve lokal örneklerle
5
+
6
+ Bu modülde öğrenecekleriniz:
7
+ 1. Streaming simülasyonu ve büyük veri prensipleri
8
+ 2. Dataset sharding ve chunk'lama
9
+ 3. Memory-efficient preprocessing
10
+ 4. Batch processing optimizasyonu
11
+ 5. Cache yönetimi
12
+ """
13
+
14
+ from datasets import Dataset, DatasetDict, IterableDataset
15
+ from datasets import concatenate_datasets
16
+ import numpy as np
17
+ from typing import Iterator, Dict, List
18
+ import time
19
+ from functools import partial
20
+ import sys
21
+
22
+ print("="*60)
23
+ print("1. STREAMING DATASET SİMÜLASYONU")
24
+ print("="*60)
25
+
26
+ # Büyük bir dataset simülasyonu
27
+ def generate_large_dataset(num_samples=100000):
28
+ """
29
+ Büyük dataset simülasyonu - Generator pattern kullanarak
30
+ Bu, gerçek streaming dataset'lerin çalışma prensibidir
31
+ """
32
+ def gen():
33
+ for i in range(num_samples):
34
+ yield {
35
+ "id": i,
36
+ "text": f"Bu {i}. örnek metindir. " * np.random.randint(10, 100),
37
+ "label": np.random.randint(0, 5),
38
+ "metadata": {
39
+ "source": f"source_{i % 10}",
40
+ "timestamp": i * 1000
41
+ }
42
+ }
43
+
44
+ return gen
45
+
46
+ # Iterable Dataset oluştur (streaming gibi çalışır)
47
+ print("\n📚 Büyük Dataset (100K örnekli) - Generator Pattern")
48
+ print("Normal yükleme = Tüm veri RAM'de")
49
+ print("Streaming/Generator = Sadece işlenen kısım RAM'de\n")
50
+
51
+ streaming_dataset = Dataset.from_generator(
52
+ generate_large_dataset(10000),
53
+ cache_dir=None
54
+ )
55
+
56
+ print(f"Dataset boyutu: {len(streaming_dataset)} örnek")
57
+ print(f"Bellek kullanımı: Minimal (generator pattern)")
58
+
59
+ # İlk 3 örneği al
60
+ print("\nİlk 3 örnek:")
61
+ for i in range(3):
62
+ example = streaming_dataset[i]
63
+ print(f"\n{i+1}. Örnek:")
64
+ print(f" ID: {example['id']}")
65
+ print(f" Text uzunluğu: {len(example['text'])} karakter")
66
+ print(f" Label: {example['label']}")
67
+ print(f" İlk 80 karakter: {example['text'][:80]}...")
68
+
69
+
70
+ print("\n" + "="*60)
71
+ print("2. DATASET SHARDING VE PARALEL İŞLEME")
72
+ print("="*60)
73
+
74
+ print("\n🔀 Dataset Sharding - Distributed Training için")
75
+
76
+ # Dataset'i parçalara böl
77
+ num_shards = 4
78
+ dataset_size = len(streaming_dataset)
79
+ shard_size = dataset_size // num_shards
80
+
81
+ print(f"\nToplam dataset: {dataset_size} örnek")
82
+ print(f"Shard sayısı: {num_shards}")
83
+ print(f"Her shard: ~{shard_size} örnek")
84
+
85
+ for shard_id in range(num_shards):
86
+ start_idx = shard_id * shard_size
87
+ end_idx = start_idx + shard_size if shard_id < num_shards - 1 else dataset_size
88
+
89
+ shard = streaming_dataset.select(range(start_idx, end_idx))
90
+
91
+ print(f"\n Shard {shard_id}:")
92
+ print(f" - İndeksler: {start_idx} - {end_idx}")
93
+ print(f" - Boyut: {len(shard)} örnek")
94
+ print(f" - İlk örnek ID: {shard[0]['id']}")
95
+ print(f" - Use case: GPU {shard_id} için")
96
+
97
+
98
+ print("\n" + "="*60)
99
+ print("3. BATCH PROCESSING - VERİMLİ PREPROCESSING")
100
+ print("="*60)
101
+
102
+ print("\n⚡ Batch vs Single Processing Karşılaştırması")
103
+
104
+ # Test dataset'i
105
+ test_dataset = streaming_dataset.select(range(1000))
106
+
107
+ # YÖNTEM 1: Tek tek işleme (YAVAŞ)
108
+ print("\n1️⃣ TEK TEK İŞLEME:")
109
+ start = time.time()
110
+
111
+ def process_single(example):
112
+ """Her örneği tek tek işle"""
113
+ example['text_length'] = len(example['text'])
114
+ example['word_count'] = len(example['text'].split())
115
+ example['label_squared'] = example['label'] ** 2
116
+ return example
117
+
118
+ processed_single = test_dataset.map(
119
+ process_single,
120
+ desc="Single processing"
121
+ )
122
+ time_single = time.time() - start
123
+ print(f" Süre: {time_single:.3f}s")
124
+ print(f" İşlem hızı: {len(test_dataset)/time_single:.0f} örnek/saniye")
125
+
126
+ # YÖNTEM 2: Batch processing (HIZLI)
127
+ print("\n2️⃣ BATCH İŞLEME:")
128
+ start = time.time()
129
+
130
+ def process_batch(examples):
131
+ """Batch'i bir arada işle - VECTORIZED!"""
132
+ examples['text_length'] = [len(text) for text in examples['text']]
133
+ examples['word_count'] = [len(text.split()) for text in examples['text']]
134
+ examples['label_squared'] = [label ** 2 for label in examples['label']]
135
+ return examples
136
+
137
+ processed_batch = test_dataset.map(
138
+ process_batch,
139
+ batched=True,
140
+ batch_size=100, # 100 örneği birlikte işle
141
+ desc="Batch processing"
142
+ )
143
+ time_batch = time.time() - start
144
+ print(f" Süre: {time_batch:.3f}s")
145
+ print(f" İşlem hızı: {len(test_dataset)/time_batch:.0f} örnek/saniye")
146
+ print(f"\n ⚡ HIZ ARTIŞI: {time_single/time_batch:.1f}x DAHA HIZLI!")
147
+
148
+ # Sonuçları kontrol et
149
+ print("\n✅ Sonuç kontrolü:")
150
+ print(f" İlk örnek - text_length: {processed_batch[0]['text_length']}")
151
+ print(f" İlk örnek - word_count: {processed_batch[0]['word_count']}")
152
+
153
+
154
+ print("\n" + "="*60)
155
+ print("4. MEMORY-EFFICIENT FILTERING")
156
+ print("="*60)
157
+
158
+ print("\n🔍 Büyük Dataset'te Filtreleme")
159
+
160
+ # Farklı filtre stratejileri
161
+ print("\n📊 Orijinal dataset:")
162
+ print(f" Toplam: {len(streaming_dataset)} örnek")
163
+
164
+ # Filtre 1: Kısa metinleri çıkar
165
+ filtered_1 = streaming_dataset.filter(
166
+ lambda x: len(x['text']) > 500,
167
+ desc="Filtering short texts"
168
+ )
169
+ print(f"\n1️⃣ Uzun metinler (>500 char): {len(filtered_1)} örnek")
170
+
171
+ # Filtre 2: Belirli label'ları al
172
+ filtered_2 = streaming_dataset.filter(
173
+ lambda x: x['label'] in [0, 1],
174
+ desc="Filtering by label"
175
+ )
176
+ print(f"2️⃣ Label 0 veya 1: {len(filtered_2)} örnek")
177
+
178
+ # Filtre 3: Kompleks filtre - BATCH ile daha hızlı
179
+ def complex_filter(examples):
180
+ """
181
+ Batch filtering - çok daha hızlı!
182
+ """
183
+ return [
184
+ len(text) > 300 and len(text) < 1000 and label < 3
185
+ for text, label in zip(examples['text'], examples['label'])
186
+ ]
187
+
188
+ start = time.time()
189
+ filtered_3 = streaming_dataset.filter(
190
+ complex_filter,
191
+ batched=True,
192
+ batch_size=1000,
193
+ desc="Complex batch filtering"
194
+ )
195
+ filter_time = time.time() - start
196
+ print(f"3️⃣ Kompleks filtre (300-1000 char, label<3): {len(filtered_3)} örnek")
197
+ print(f" Filtreleme süresi: {filter_time:.3f}s")
198
+
199
+
200
+ print("\n" + "="*60)
201
+ print("5. CHUNK-BASED PROCESSING")
202
+ print("="*60)
203
+
204
+ print("\n📦 Chunk Tabanlı İşleme - Çok Büyük Datasets için")
205
+
206
+ def process_in_chunks(dataset, chunk_size=2000, num_chunks=5):
207
+ """
208
+ Dataset'i chunk'lar halinde işle
209
+ Her chunk işlendikten sonra sonuçlar kaydedilir, bellek temizlenir
210
+ """
211
+ chunk_results = []
212
+ total_size = len(dataset)
213
+
214
+ print(f"\nToplam: {total_size} örnek")
215
+ print(f"Chunk boyutu: {chunk_size}")
216
+ print(f"İşlenecek chunk: {num_chunks}")
217
+
218
+ for chunk_id in range(num_chunks):
219
+ start_idx = chunk_id * chunk_size
220
+ end_idx = min(start_idx + chunk_size, total_size)
221
+
222
+ if start_idx >= total_size:
223
+ break
224
+
225
+ print(f"\n Chunk {chunk_id + 1}/{num_chunks} işleniyor...")
226
+
227
+ # Bir chunk al
228
+ chunk = dataset.select(range(start_idx, end_idx))
229
+
230
+ # İstatistikleri hesapla
231
+ lengths = [len(ex['text']) for ex in chunk]
232
+ labels = [ex['label'] for ex in chunk]
233
+
234
+ chunk_results.append({
235
+ 'chunk_id': chunk_id,
236
+ 'size': len(chunk),
237
+ 'avg_length': np.mean(lengths),
238
+ 'max_length': np.max(lengths),
239
+ 'min_length': np.min(lengths),
240
+ 'label_dist': {i: labels.count(i) for i in range(5)}
241
+ })
242
+
243
+ # Chunk işlendi, bellek temizleniyor
244
+ del chunk
245
+
246
+ return chunk_results
247
+
248
+ # Dataset'i chunk'lar halinde işle
249
+ results = process_in_chunks(streaming_dataset, chunk_size=2000, num_chunks=5)
250
+
251
+ print("\n📊 Chunk İstatistikleri:")
252
+ for result in results:
253
+ print(f"\n Chunk {result['chunk_id']}:")
254
+ print(f" Boyut: {result['size']:,} örnek")
255
+ print(f" Ortalama uzunluk: {result['avg_length']:.0f} karakter")
256
+ print(f" Min/Max: {result['min_length']:.0f} / {result['max_length']:.0f}")
257
+ print(f" Label dağılımı: {result['label_dist']}")
258
+
259
+
260
+ print("\n" + "="*60)
261
+ print("6. DATASET BİRLEŞTİRME VE KARMAŞIK İŞLEMLER")
262
+ print("="*60)
263
+
264
+ print("\n🔄 Birden Fazla Dataset'i Birleştirme")
265
+
266
+ # Farklı kaynaklardan dataset'ler oluştur
267
+ def create_dataset(name, size, label_shift=0):
268
+ def gen():
269
+ for i in range(size):
270
+ yield {
271
+ "text": f"Dataset {name}: Örnek {i}. " * np.random.randint(5, 20),
272
+ "label": (i % 3) + label_shift,
273
+ "source": name
274
+ }
275
+ return Dataset.from_generator(gen)
276
+
277
+ dataset_a = create_dataset("A", 1000, label_shift=0)
278
+ dataset_b = create_dataset("B", 1500, label_shift=2)
279
+ dataset_c = create_dataset("C", 800, label_shift=4)
280
+
281
+ print(f"Dataset A: {len(dataset_a)} örnek")
282
+ print(f"Dataset B: {len(dataset_b)} örnek")
283
+ print(f"Dataset C: {len(dataset_c)} örnek")
284
+
285
+ # Datasets'leri birleştir
286
+ combined = concatenate_datasets([dataset_a, dataset_b, dataset_c])
287
+ print(f"\n✅ Birleştirilmiş: {len(combined)} örnek")
288
+
289
+ # Her kaynaktan örnek sayıları
290
+ sources = [ex['source'] for ex in combined.select(range(min(100, len(combined))))]
291
+ print("\nİlk 100 örnekte kaynak dağılımı:")
292
+ for source in ['A', 'B', 'C']:
293
+ count = sources.count(source)
294
+ print(f" {source}: {count} örnek")
295
+
296
+
297
+ print("\n" + "="*60)
298
+ print("7. CACHE YÖNETİMİ VE OPTIMIZATION")
299
+ print("="*60)
300
+
301
+ print("\n💾 Cache Kullanımı - İşlem Hızlandırma")
302
+
303
+ # Ağır bir preprocessing işlemi simüle et
304
+ def heavy_preprocessing(examples):
305
+ """
306
+ Ağır işlem simülasyonu
307
+ """
308
+ time.sleep(0.0001) # Yapay gecikme
309
+ return {
310
+ 'processed_text': [text.lower()[:100] for text in examples['text']],
311
+ 'features': [[len(text), len(text.split())] for text in examples['text']]
312
+ }
313
+
314
+ test_set = streaming_dataset.select(range(1000))
315
+
316
+ # İlk işleme - cache oluştur
317
+ print("\n1️⃣ İlk işleme (cache oluşturuluyor):")
318
+ start = time.time()
319
+ processed_1 = test_set.map(
320
+ heavy_preprocessing,
321
+ batched=True,
322
+ batch_size=100,
323
+ desc="Processing with cache"
324
+ )
325
+ first_time = time.time() - start
326
+ print(f" Süre: {first_time:.3f}s")
327
+
328
+ # İkinci işleme - cache'den oku (aynı fonksiyon)
329
+ print("\n2️⃣ İkinci işleme (cache kullanılıyor):")
330
+ start = time.time()
331
+ processed_2 = test_set.map(
332
+ heavy_preprocessing,
333
+ batched=True,
334
+ batch_size=100,
335
+ desc="Using cache"
336
+ )
337
+ second_time = time.time() - start
338
+ print(f" Süre: {second_time:.3f}s")
339
+
340
+ if first_time > second_time:
341
+ speedup = first_time / second_time
342
+ print(f"\n ⚡ CACHE HIZ ARTIŞI: {speedup:.1f}x daha hızlı!")
343
+
344
+
345
+ print("\n" + "="*60)
346
+ print("8. MULTI-PROCESS PROCESSING")
347
+ print("="*60)
348
+
349
+ print("\n🚀 Paralel İşleme - Tüm CPU Çekirdeklerini Kullan")
350
+
351
+ import multiprocessing
352
+ num_cores = multiprocessing.cpu_count()
353
+
354
+ print(f"\nSistem CPU çekirdek sayısı: {num_cores}")
355
+
356
+ # CPU-intensive işlem
357
+ def cpu_intensive_processing(examples):
358
+ """
359
+ CPU-yoğun işlem simülasyonu
360
+ """
361
+ results = []
362
+ for text in examples['text']:
363
+ # Basit ama CPU kullanan işlem
364
+ result = sum(ord(c) for c in text[:1000])
365
+ results.append(result)
366
+ return {'computed_hash': results}
367
+
368
+ test_parallel = streaming_dataset.select(range(5000))
369
+
370
+ # Tek işlem
371
+ print("\n1️⃣ Tek işlem (num_proc=1):")
372
+ start = time.time()
373
+ processed_single_proc = test_parallel.map(
374
+ cpu_intensive_processing,
375
+ batched=True,
376
+ batch_size=100,
377
+ num_proc=1,
378
+ desc="Single process"
379
+ )
380
+ time_single_proc = time.time() - start
381
+ print(f" Süre: {time_single_proc:.3f}s")
382
+
383
+ # Çoklu işlem (mümkünse)
384
+ if num_cores > 1:
385
+ num_proc = min(4, num_cores)
386
+ print(f"\n2️⃣ Çoklu işlem (num_proc={num_proc}):")
387
+ start = time.time()
388
+ processed_multi_proc = test_parallel.map(
389
+ cpu_intensive_processing,
390
+ batched=True,
391
+ batch_size=100,
392
+ num_proc=num_proc,
393
+ desc="Multi process"
394
+ )
395
+ time_multi_proc = time.time() - start
396
+ print(f" Süre: {time_multi_proc:.3f}s")
397
+
398
+ if time_multi_proc < time_single_proc:
399
+ speedup = time_single_proc / time_multi_proc
400
+ print(f"\n ⚡ PARALEL HIZ ARTIŞI: {speedup:.1f}x daha hızlı!")
401
+
402
+
403
+ print("\n" + "="*60)
404
+ print("9. DATASET İSTATİSTİKLERİ - BÜYÜK VERİDE")
405
+ print("="*60)
406
+
407
+ print("\n📊 Comprehensive Dataset Analysis")
408
+
409
+ def compute_comprehensive_stats(dataset, sample_size=None):
410
+ """
411
+ Dataset için detaylı istatistikler
412
+ """
413
+ if sample_size and len(dataset) > sample_size:
414
+ print(f" Dataset çok büyük, {sample_size} örnek üzerinden analiz...")
415
+ dataset = dataset.select(range(sample_size))
416
+
417
+ # Text uzunlukları
418
+ lengths = [len(ex['text']) for ex in dataset]
419
+ word_counts = [len(ex['text'].split()) for ex in dataset]
420
+ labels = [ex['label'] for ex in dataset]
421
+
422
+ return {
423
+ 'num_examples': len(dataset),
424
+ 'text_length': {
425
+ 'mean': np.mean(lengths),
426
+ 'median': np.median(lengths),
427
+ 'std': np.std(lengths),
428
+ 'min': np.min(lengths),
429
+ 'max': np.max(lengths),
430
+ 'percentile_25': np.percentile(lengths, 25),
431
+ 'percentile_75': np.percentile(lengths, 75),
432
+ },
433
+ 'word_count': {
434
+ 'mean': np.mean(word_counts),
435
+ 'median': np.median(word_counts),
436
+ },
437
+ 'label_distribution': {
438
+ label: labels.count(label)
439
+ for label in set(labels)
440
+ }
441
+ }
442
+
443
+ stats = compute_comprehensive_stats(streaming_dataset, sample_size=5000)
444
+
445
+ print("\n📈 Dataset İstatistikleri (5000 örnek üzerinden):")
446
+ print(f"\n Toplam örnek: {stats['num_examples']:,}")
447
+
448
+ print("\n 📝 Text Uzunluğu:")
449
+ for key, value in stats['text_length'].items():
450
+ print(f" {key}: {value:.1f} karakter")
451
+
452
+ print("\n 📚 Kelime Sayısı:")
453
+ for key, value in stats['word_count'].items():
454
+ print(f" {key}: {value:.1f} kelime")
455
+
456
+ print("\n 🏷️ Label Dağılımı:")
457
+ total = sum(stats['label_distribution'].values())
458
+ for label, count in sorted(stats['label_distribution'].items()):
459
+ pct = (count / total) * 100
460
+ print(f" Label {label}: {count:,} ({pct:.1f}%)")
461
+
462
+
463
+ print("\n" + "="*60)
464
+ print("10. ADVANCED PATTERNS VE BEST PRACTICES")
465
+ print("="*60)
466
+
467
+ print("\n🎯 Memory-Efficient Data Pipeline")
468
+
469
+ class DataPipeline:
470
+ """
471
+ Production-ready data pipeline
472
+ """
473
+ def __init__(self, dataset, batch_size=32):
474
+ self.dataset = dataset
475
+ self.batch_size = batch_size
476
+ self.processed = None
477
+
478
+ def preprocess(self, keep_columns=None):
479
+ """Step 1: Preprocessing"""
480
+ def process(examples):
481
+ return {
482
+ 'text_clean': [text.lower().strip() for text in examples['text']],
483
+ 'length': [len(text) for text in examples['text']]
484
+ }
485
+
486
+ self.processed = self.dataset.map(
487
+ process,
488
+ batched=True,
489
+ batch_size=self.batch_size,
490
+ remove_columns=['metadata'] if keep_columns is None else None
491
+ )
492
+ return self
493
+
494
+ def filter_valid(self, min_length=100):
495
+ """Step 2: Filtering"""
496
+ self.processed = self.processed.filter(
497
+ lambda x: x['length'] >= min_length,
498
+ batched=False
499
+ )
500
+ return self
501
+
502
+ def get_stats(self):
503
+ """Step 3: Get statistics"""
504
+ lengths = [ex['length'] for ex in self.processed.select(range(min(1000, len(self.processed))))]
505
+ return {
506
+ 'count': len(self.processed),
507
+ 'avg_length': np.mean(lengths),
508
+ 'median_length': np.median(lengths)
509
+ }
510
+
511
+ # Pipeline kullanımı
512
+ print("\n🔧 Pipeline Örneği:")
513
+ pipeline = DataPipeline(streaming_dataset.select(range(5000)), batch_size=100)
514
+
515
+ print("\n Step 1: Preprocessing...")
516
+ pipeline.preprocess()
517
+
518
+ print(" Step 2: Filtering (min_length=400)...")
519
+ pipeline.filter_valid(min_length=400)
520
+
521
+ print(" Step 3: İstatistikler...")
522
+ stats = pipeline.get_stats()
523
+
524
+ print(f"\n ✅ Sonuçlar:")
525
+ print(f" Kalan örnek: {stats['count']:,}")
526
+ print(f" Ortalama uzunluk: {stats['avg_length']:.0f}")
527
+ print(f" Median uzunluk: {stats['median_length']:.0f}")
528
+
529
+
530
+ print("\n" + "="*60)
531
+ print("📚 ÖNEMLİ NOTLAR VE BEST PRACTICES")
532
+ print("="*60)
533
+
534
+ print("""
535
+ ✅ STREAMING / GENERATOR PATTERN:
536
+ - Dataset > 10GB → Streaming kullan
537
+ - Generator pattern ile memory efficient
538
+ - İterasyon sırasında veri üretilir
539
+ - Disk I/O'yu minimize et
540
+
541
+ ✅ BATCH PROCESSING:
542
+ - DAIMA batched=True kullan!
543
+ - Batch size: 32-1000 arası optimal
544
+ - List comprehension kullan (hızlı)
545
+ - Vectorization mümkünse tercih et
546
+
547
+ ✅ MULTI-PROCESSING:
548
+ - CPU-bound işlemler için num_proc kullan
549
+ - num_proc = min(8, cpu_count) genelde optimal
550
+ - I/O-bound işlemlerde fayda sağlamaz
551
+ - Batch size ile beraber ayarla
552
+
553
+ ✅ MEMORY YÖNETİMİ:
554
+ - Gereksiz kolonları erken remove_columns ile kaldır
555
+ - Chunk-based processing büyük veri için
556
+ - Cache stratejisi belirle (load_from_disk/save_to_disk)
557
+ - Generator pattern kullan
558
+
559
+ ✅ FILTERING:
560
+ - Filtreyi erken uygula (veri pipeline'ın başında)
561
+ - Batch filtering daha hızlı
562
+ - Kompleks filtreler için lambda yerine fonksiyon
563
+ - Filter chain'i yerine tek complex filter
564
+
565
+ ✅ PERFORMANCE İPUÇLARI:
566
+ - map() > iterate (her zaman)
567
+ - batched=True > batched=False (10x-100x hızlı)
568
+ - num_proc kullan ama oversubscribe etme
569
+ - Cache akıllıca kullan
570
+ - Arrow format (.arrow) kullan, pickle yerine
571
+
572
+ ✅ PRODUCTION PATTERNS:
573
+ - Pipeline pattern kullan (clean code)
574
+ - Error handling ekle
575
+ - Progress bars kullan (desc parameter)
576
+ - Logging ekle
577
+ - Validation adımları ekle
578
+ - Reproducibility için seed kullan
579
+
580
+ ✅ BENCHMARK VE PROFILE:
581
+ - time.time() ile zamanla
582
+ - memory_profiler kullan
583
+ - Farklı batch size'ları test et
584
+ - Farklı num_proc değerleri test et
585
+ - Optimal değerleri belirle
586
+ """)
587
+
588
+ print("\n" + "="*60)
589
+ print("✅ BÖLÜM 1 TAMAMLANDI!")
590
+ print("="*60)
591
+ print(f"""
592
+ Bu bölümde öğrendikleriniz:
593
+ ✓ Streaming/Generator pattern ile büyük veri
594
+ ✓ Memory-efficient preprocessing
595
+ ✓ Batch processing {time_single/time_batch:.1f}x hız artışı
596
+ ✓ Dataset sharding ve parallelization
597
+ ✓ Cache yönetimi
598
+ ✓ Chunk-based processing
599
+ ✓ Multi-process processing
600
+ ✓ Comprehensive statistics
601
+ ✓ Production-ready pipeline pattern
602
+
603
+ 📊 PERFORMANS KAZANIMLARI:
604
+ - Batch processing: {time_single/time_batch:.1f}x hızlandırma
605
+ - Multi-processing: {num_cores}x CPU çekirdeği
606
+ - Memory: Generator pattern ile minimal kullanım
607
+
608
+ 📚 SONRAKI BÖLÜM: Domain-Specific Datasets
609
+ - Bilimsel makaleler (arXiv, PubMed)
610
+ - Kod datasets (The Stack)
611
+ - Finansal veri
612
+ - Tıbbi datasets
613
+ - Özel domain adaptasyonu
614
+ """)
615
+
616
+ print("\n🚀 Harika! İlk bölümü tamamladık!")
617
+ print("Sonraki bölüme geçelim mi? (Evet yazın)")
space/modules/02_domain_specific_datasets.py ADDED
@@ -0,0 +1,870 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ DOMAIN-SPECIFIC DATASETS - İLERİ SEVİYE HUGGING FACE
3
+ ====================================================
4
+
5
+ Bu modülde öğrenecekleriniz:
6
+ 1. Bilimsel Makaleler (arXiv, PubMed) - Academic datasets
7
+ 2. Kod Datasets (The Stack, CodeParrot) - Programming datasets
8
+ 3. Finansal Analiz Datasets - Finance & Business
9
+ 4. Tıbbi/Sağlık Datasets - Medical & Healthcare
10
+ 5. Domain-specific preprocessing
11
+ 6. Custom tokenization
12
+ 7. Domain adaptation techniques
13
+ """
14
+
15
+ from datasets import Dataset, load_dataset, DatasetDict
16
+ import numpy as np
17
+ import json
18
+ from typing import Dict, List
19
+ import time
20
+ from collections import Counter
21
+ import re
22
+
23
+ print("="*70)
24
+ print("🔬 DOMAIN-SPECIFIC DATASETS - İLERİ SEVİYE")
25
+ print("="*70)
26
+
27
+ print("\n" + "="*70)
28
+ print("1. BİLİMSEL MAKALELER - ACADEMIC DATASETS")
29
+ print("="*70)
30
+
31
+ # Sentetik bilimsel makale dataset'i
32
+ def generate_scientific_papers(num_samples=1000):
33
+ """
34
+ Bilimsel makale formatında sentetik veri
35
+ """
36
+ domains = ['Physics', 'Computer Science', 'Biology', 'Mathematics', 'Chemistry']
37
+
38
+ def gen():
39
+ for i in range(num_samples):
40
+ domain = np.random.choice(domains)
41
+
42
+ # Makale yapısı
43
+ abstract = f"This paper presents a novel approach to {domain.lower()} research. " \
44
+ f"We propose a methodology that addresses key challenges in the field. " \
45
+ f"Our experimental results show significant improvements over baseline methods. " \
46
+ f"The proposed framework demonstrates applicability across multiple scenarios."
47
+
48
+ yield {
49
+ 'id': f'arxiv.{i:06d}',
50
+ 'title': f'Advanced Methods in {domain} Research: A Comprehensive Study {i}',
51
+ 'abstract': abstract,
52
+ 'authors': [f'Author {j}' for j in range(np.random.randint(2, 6))],
53
+ 'domain': domain,
54
+ 'year': np.random.randint(2015, 2025),
55
+ 'citations': np.random.randint(0, 500),
56
+ 'keywords': [f'keyword{j}' for j in range(np.random.randint(3, 8))],
57
+ 'full_text': abstract + " " + abstract * np.random.randint(5, 15)
58
+ }
59
+
60
+ return Dataset.from_generator(gen)
61
+
62
+ print("\n📚 Bilimsel Makale Dataset'i Oluşturuluyor...")
63
+ scientific_dataset = generate_scientific_papers(2000)
64
+
65
+ print(f"✅ {len(scientific_dataset)} bilimsel makale yüklendi")
66
+ print(f"\nÖrnek makale:")
67
+ sample = scientific_dataset[0]
68
+ print(f" ID: {sample['id']}")
69
+ print(f" Başlık: {sample['title']}")
70
+ print(f" Domain: {sample['domain']}")
71
+ print(f" Yazar sayısı: {len(sample['authors'])}")
72
+ print(f" Yıl: {sample['year']}")
73
+ print(f" Atıf sayısı: {sample['citations']}")
74
+ print(f" Abstract: {sample['abstract'][:150]}...")
75
+
76
+ # Domain bazlı istatistikler
77
+ print("\n📊 Domain Dağılımı:")
78
+ domains = [ex['domain'] for ex in scientific_dataset]
79
+ domain_counts = Counter(domains)
80
+ for domain, count in domain_counts.most_common():
81
+ pct = (count / len(scientific_dataset)) * 100
82
+ print(f" {domain}: {count} ({pct:.1f}%)")
83
+
84
+ # Yıllara göre analiz
85
+ print("\n📅 Yıllara Göre Yayın Sayısı:")
86
+ years = [ex['year'] for ex in scientific_dataset]
87
+ year_counts = Counter(years)
88
+ for year in sorted(year_counts.keys())[-5:]:
89
+ print(f" {year}: {year_counts[year]} makale")
90
+
91
+ # Atıf analizi
92
+ citations = [ex['citations'] for ex in scientific_dataset]
93
+ print(f"\n📈 Atıf İstatistikleri:")
94
+ print(f" Ortalama: {np.mean(citations):.1f}")
95
+ print(f" Median: {np.median(citations):.1f}")
96
+ print(f" En çok atıf: {np.max(citations)}")
97
+
98
+ # Preprocessing - Bilimsel text temizleme
99
+ print("\n🔧 Bilimsel Text Preprocessing:")
100
+
101
+ def preprocess_scientific_text(examples):
102
+ """
103
+ Bilimsel metin için özel preprocessing
104
+ """
105
+ processed = []
106
+
107
+ for text in examples['abstract']:
108
+ # Küçük harfe çevir
109
+ text = text.lower()
110
+
111
+ # Özel karakterleri temizle
112
+ text = re.sub(r'[^\w\s\.]', '', text)
113
+
114
+ # Fazla boşlukları temizle
115
+ text = ' '.join(text.split())
116
+
117
+ processed.append(text)
118
+
119
+ return {
120
+ 'abstract_clean': processed,
121
+ 'abstract_length': [len(t) for t in processed],
122
+ 'word_count': [len(t.split()) for t in processed]
123
+ }
124
+
125
+ scientific_processed = scientific_dataset.map(
126
+ preprocess_scientific_text,
127
+ batched=True,
128
+ batch_size=500,
129
+ desc="Preprocessing scientific texts"
130
+ )
131
+
132
+ print(f"✅ {len(scientific_processed)} makale işlendi")
133
+ print(f"\nÖrnek işlenmiş abstract:")
134
+ print(f" Original: {scientific_processed[0]['abstract'][:100]}...")
135
+ print(f" Cleaned: {scientific_processed[0]['abstract_clean'][:100]}...")
136
+ print(f" Word count: {scientific_processed[0]['word_count']}")
137
+
138
+
139
+ print("\n" + "="*70)
140
+ print("2. KOD DATASETS - PROGRAMMING & SOFTWARE")
141
+ print("="*70)
142
+
143
+ # Sentetik kod dataset'i
144
+ def generate_code_dataset(num_samples=1000):
145
+ """
146
+ Çeşitli programlama dilleri için kod örnekleri
147
+ """
148
+ languages = ['Python', 'JavaScript', 'Java', 'C++', 'Go', 'Rust']
149
+
150
+ code_templates = {
151
+ 'Python': '''def {func_name}({params}):
152
+ """
153
+ {docstring}
154
+ """
155
+ result = {body}
156
+ return result''',
157
+
158
+ 'JavaScript': '''function {func_name}({params}) {{
159
+ // {docstring}
160
+ const result = {body};
161
+ return result;
162
+ }}''',
163
+
164
+ 'Java': '''public {return_type} {func_name}({params}) {{
165
+ // {docstring}
166
+ {return_type} result = {body};
167
+ return result;
168
+ }}''',
169
+ }
170
+
171
+ def gen():
172
+ for i in range(num_samples):
173
+ lang = np.random.choice(languages)
174
+
175
+ # Kod özellikleri
176
+ func_name = f"process_data_{i}"
177
+ params = "data, config"
178
+ docstring = f"Process data using method {i}"
179
+ body = "data * 2 + config"
180
+
181
+ if lang in code_templates:
182
+ code = code_templates[lang].format(
183
+ func_name=func_name,
184
+ params=params,
185
+ docstring=docstring,
186
+ body=body,
187
+ return_type='int' if lang == 'Java' else ''
188
+ )
189
+ else:
190
+ code = f"// {lang} code example\n{func_name}({params})"
191
+
192
+ yield {
193
+ 'id': f'code_{i:06d}',
194
+ 'language': lang,
195
+ 'code': code,
196
+ 'func_name': func_name,
197
+ 'lines_of_code': len(code.split('\n')),
198
+ 'has_docstring': 'docstring' in code.lower(),
199
+ 'complexity': np.random.choice(['low', 'medium', 'high']),
200
+ 'repo': f'github.com/user/repo_{i % 100}',
201
+ 'stars': np.random.randint(0, 10000)
202
+ }
203
+
204
+ return Dataset.from_generator(gen)
205
+
206
+ print("\n💻 Kod Dataset'i Oluşturuluyor...")
207
+ code_dataset = generate_code_dataset(2000)
208
+
209
+ print(f"✅ {len(code_dataset)} kod örneği yüklendi")
210
+ print(f"\nÖrnek kod:")
211
+ code_sample = code_dataset[0]
212
+ print(f" ID: {code_sample['id']}")
213
+ print(f" Dil: {code_sample['language']}")
214
+ print(f" Satır sayısı: {code_sample['lines_of_code']}")
215
+ print(f" Karmaşıklık: {code_sample['complexity']}")
216
+ print(f"\n Kod:\n{code_sample['code']}\n")
217
+
218
+ # Dil dağılımı
219
+ print("\n📊 Programlama Dili Dağılımı:")
220
+ languages = [ex['language'] for ex in code_dataset]
221
+ lang_counts = Counter(languages)
222
+ for lang, count in lang_counts.most_common():
223
+ pct = (count / len(code_dataset)) * 100
224
+ print(f" {lang}: {count} ({pct:.1f}%)")
225
+
226
+ # Kod analizi
227
+ print("\n📈 Kod Metrikleri:")
228
+ loc_values = [ex['lines_of_code'] for ex in code_dataset]
229
+ print(f" Ortalama satır sayısı: {np.mean(loc_values):.1f}")
230
+ print(f" Median satır sayısı: {np.median(loc_values):.1f}")
231
+
232
+ has_docstring = sum([1 for ex in code_dataset if ex['has_docstring']])
233
+ print(f" Docstring oranı: {(has_docstring/len(code_dataset)*100):.1f}%")
234
+
235
+ # Kod preprocessing
236
+ print("\n🔧 Kod Preprocessing:")
237
+
238
+ def preprocess_code(examples):
239
+ """
240
+ Kod için özel preprocessing
241
+ """
242
+ def extract_functions(code):
243
+ # Fonksiyon isimlerini çıkar (basit regex)
244
+ funcs = re.findall(r'def\s+(\w+)|function\s+(\w+)|public\s+\w+\s+(\w+)', code)
245
+ return [f for group in funcs for f in group if f]
246
+
247
+ def count_comments(code):
248
+ # Yorum satırlarını say
249
+ return len(re.findall(r'#|//|/\*|\*/', code))
250
+
251
+ return {
252
+ 'functions': [extract_functions(code) for code in examples['code']],
253
+ 'comment_count': [count_comments(code) for code in examples['code']],
254
+ 'code_chars': [len(code) for code in examples['code']],
255
+ 'code_tokens': [len(code.split()) for code in examples['code']]
256
+ }
257
+
258
+ code_processed = code_dataset.map(
259
+ preprocess_code,
260
+ batched=True,
261
+ batch_size=500,
262
+ desc="Analyzing code"
263
+ )
264
+
265
+ print(f"✅ {len(code_processed)} kod örneği analiz edildi")
266
+ print(f"\nÖrnek analiz:")
267
+ print(f" Fonksiyonlar: {code_processed[0]['functions']}")
268
+ print(f" Yorum sayısı: {code_processed[0]['comment_count']}")
269
+ print(f" Token sayısı: {code_processed[0]['code_tokens']}")
270
+
271
+
272
+ print("\n" + "="*70)
273
+ print("3. FİNANSAL ANALİZ DATASETS")
274
+ print("="*70)
275
+
276
+ # Sentetik finansal veri
277
+ def generate_financial_dataset(num_samples=1000):
278
+ """
279
+ Finansal haber ve analiz dataset'i
280
+ """
281
+ companies = ['TechCorp', 'FinanceBank', 'RetailCo', 'EnergyInc', 'HealthMed']
282
+ sentiments = ['positive', 'negative', 'neutral']
283
+ categories = ['earnings', 'merger', 'product_launch', 'scandal', 'expansion']
284
+
285
+ def gen():
286
+ for i in range(num_samples):
287
+ company = np.random.choice(companies)
288
+ sentiment = np.random.choice(sentiments)
289
+ category = np.random.choice(categories)
290
+
291
+ # Finansal haber metni
292
+ if sentiment == 'positive':
293
+ text = f"{company} announces strong quarterly earnings, exceeding market expectations. " \
294
+ f"Stock prices surged following the announcement. Analysts remain optimistic."
295
+ elif sentiment == 'negative':
296
+ text = f"{company} faces challenges in the current market. " \
297
+ f"Quarterly results fell short of expectations. Investors express concern."
298
+ else:
299
+ text = f"{company} maintains steady performance in Q{i%4+1}. " \
300
+ f"Market reaction remains moderate. Company outlook unchanged."
301
+
302
+ yield {
303
+ 'id': f'fin_{i:06d}',
304
+ 'company': company,
305
+ 'text': text,
306
+ 'sentiment': sentiment,
307
+ 'category': category,
308
+ 'date': f'2024-{(i%12)+1:02d}-{(i%28)+1:02d}',
309
+ 'stock_change': np.random.uniform(-10, 10),
310
+ 'volume': np.random.randint(1000000, 10000000),
311
+ 'market_cap': np.random.uniform(1e9, 100e9),
312
+ 'sector': np.random.choice(['Tech', 'Finance', 'Retail', 'Energy', 'Healthcare'])
313
+ }
314
+
315
+ return Dataset.from_generator(gen)
316
+
317
+ print("\n💰 Finansal Dataset Oluşturuluyor...")
318
+ financial_dataset = generate_financial_dataset(2000)
319
+
320
+ print(f"✅ {len(financial_dataset)} finansal kayıt yüklendi")
321
+ print(f"\nÖrnek finansal kayıt:")
322
+ fin_sample = financial_dataset[0]
323
+ print(f" ID: {fin_sample['id']}")
324
+ print(f" Şirket: {fin_sample['company']}")
325
+ print(f" Sentiment: {fin_sample['sentiment']}")
326
+ print(f" Kategori: {fin_sample['category']}")
327
+ print(f" Hisse değişimi: {fin_sample['stock_change']:.2f}%")
328
+ print(f" Metin: {fin_sample['text'][:120]}...")
329
+
330
+ # Sentiment analizi
331
+ print("\n📊 Sentiment Dağılımı:")
332
+ sentiments = [ex['sentiment'] for ex in financial_dataset]
333
+ sent_counts = Counter(sentiments)
334
+ for sent, count in sent_counts.items():
335
+ pct = (count / len(financial_dataset)) * 100
336
+ print(f" {sent.capitalize()}: {count} ({pct:.1f}%)")
337
+
338
+ # Şirket bazlı analiz
339
+ print("\n🏢 Şirket Bazlı Analiz:")
340
+ companies = [ex['company'] for ex in financial_dataset]
341
+ company_counts = Counter(companies)
342
+ for company, count in company_counts.most_common():
343
+ avg_change = np.mean([ex['stock_change'] for ex in financial_dataset if ex['company'] == company])
344
+ print(f" {company}: {count} haber, ortalama değişim: {avg_change:+.2f}%")
345
+
346
+ # Finansal preprocessing
347
+ print("\n🔧 Finansal Text Preprocessing:")
348
+
349
+ def preprocess_financial_text(examples):
350
+ """
351
+ Finansal metin için özel preprocessing
352
+ """
353
+ def extract_numbers(text):
354
+ # Sayıları ve yüzdeleri çıkar
355
+ numbers = re.findall(r'\d+\.?\d*%?', text)
356
+ return numbers
357
+
358
+ def extract_financial_terms(text):
359
+ # Finansal terimleri say
360
+ terms = ['earnings', 'stock', 'market', 'quarterly', 'revenue',
361
+ 'profit', 'loss', 'growth', 'decline']
362
+ count = sum([1 for term in terms if term in text.lower()])
363
+ return count
364
+
365
+ return {
366
+ 'numbers_found': [extract_numbers(text) for text in examples['text']],
367
+ 'financial_term_count': [extract_financial_terms(text) for text in examples['text']],
368
+ 'text_length': [len(text) for text in examples['text']],
369
+ 'has_percentage': ['%' in text for text in examples['text']]
370
+ }
371
+
372
+ financial_processed = financial_dataset.map(
373
+ preprocess_financial_text,
374
+ batched=True,
375
+ batch_size=500,
376
+ desc="Processing financial texts"
377
+ )
378
+
379
+ print(f"✅ {len(financial_processed)} finansal kayıt işlendi")
380
+ print(f"\nÖrnek analiz:")
381
+ print(f" Sayılar: {financial_processed[0]['numbers_found']}")
382
+ print(f" Finansal terim sayısı: {financial_processed[0]['financial_term_count']}")
383
+ print(f" Yüzde var mı: {financial_processed[0]['has_percentage']}")
384
+
385
+
386
+ print("\n" + "="*70)
387
+ print("4. TIBBİ/SAĞLIK DATASETS")
388
+ print("="*70)
389
+
390
+ # Sentetik tıbbi veri
391
+ def generate_medical_dataset(num_samples=1000):
392
+ """
393
+ Tıbbi notlar ve tanılar
394
+ """
395
+ conditions = ['Diabetes', 'Hypertension', 'Asthma', 'Arthritis', 'Migraine']
396
+ treatments = ['Medication', 'Physical Therapy', 'Surgery', 'Lifestyle Changes']
397
+ severities = ['mild', 'moderate', 'severe']
398
+
399
+ def gen():
400
+ for i in range(num_samples):
401
+ condition = np.random.choice(conditions)
402
+ treatment = np.random.choice(treatments)
403
+ severity = np.random.choice(severities)
404
+
405
+ # Tıbbi not
406
+ note = f"Patient presents with {severity} {condition.lower()}. " \
407
+ f"Symptoms include relevant clinical findings. " \
408
+ f"Recommended treatment: {treatment}. " \
409
+ f"Follow-up scheduled. Patient advised on preventive measures."
410
+
411
+ yield {
412
+ 'id': f'med_{i:06d}',
413
+ 'patient_id': f'P{i:05d}',
414
+ 'condition': condition,
415
+ 'severity': severity,
416
+ 'treatment': treatment,
417
+ 'note': note,
418
+ 'age': np.random.randint(18, 90),
419
+ 'gender': np.random.choice(['M', 'F']),
420
+ 'visit_date': f'2024-{(i%12)+1:02d}-{(i%28)+1:02d}',
421
+ 'diagnosis_confidence': np.random.uniform(0.7, 1.0),
422
+ 'follow_up_required': np.random.choice([True, False])
423
+ }
424
+
425
+ return Dataset.from_generator(gen)
426
+
427
+ print("\n🏥 Tıbbi Dataset Oluşturuluyor...")
428
+ medical_dataset = generate_medical_dataset(2000)
429
+
430
+ print(f"✅ {len(medical_dataset)} tıbbi kayıt yüklendi")
431
+ print(f"\nÖrnek tıbbi kayıt:")
432
+ med_sample = medical_dataset[0]
433
+ print(f" ID: {med_sample['id']}")
434
+ print(f" Hasta ID: {med_sample['patient_id']}")
435
+ print(f" Durum: {med_sample['condition']}")
436
+ print(f" Şiddet: {med_sample['severity']}")
437
+ print(f" Tedavi: {med_sample['treatment']}")
438
+ print(f" Yaş: {med_sample['age']}")
439
+ print(f" Tanı güveni: {med_sample['diagnosis_confidence']:.2f}")
440
+ print(f" Not: {med_sample['note'][:100]}...")
441
+
442
+ # Durum dağılımı
443
+ print("\n📊 Tıbbi Durum Dağılımı:")
444
+ conditions = [ex['condition'] for ex in medical_dataset]
445
+ cond_counts = Counter(conditions)
446
+ for cond, count in cond_counts.most_common():
447
+ pct = (count / len(medical_dataset)) * 100
448
+ print(f" {cond}: {count} ({pct:.1f}%)")
449
+
450
+ # Şiddet analizi
451
+ print("\n⚠️ Şiddet Dağılımı:")
452
+ severities = [ex['severity'] for ex in medical_dataset]
453
+ sev_counts = Counter(severities)
454
+ for sev, count in sorted(sev_counts.items()):
455
+ pct = (count / len(medical_dataset)) * 100
456
+ print(f" {sev.capitalize()}: {count} ({pct:.1f}%)")
457
+
458
+ # Yaş grupları
459
+ print("\n👥 Yaş Grubu Analizi:")
460
+ ages = [ex['age'] for ex in medical_dataset]
461
+ age_groups = {
462
+ '18-30': sum([1 for age in ages if 18 <= age <= 30]),
463
+ '31-50': sum([1 for age in ages if 31 <= age <= 50]),
464
+ '51-70': sum([1 for age in ages if 51 <= age <= 70]),
465
+ '71+': sum([1 for age in ages if age > 70])
466
+ }
467
+ for group, count in age_groups.items():
468
+ pct = (count / len(ages)) * 100
469
+ print(f" {group}: {count} ({pct:.1f}%)")
470
+
471
+ # Tıbbi preprocessing
472
+ print("\n🔧 Tıbbi Text Preprocessing (PHI Removal):")
473
+
474
+ def preprocess_medical_text(examples):
475
+ """
476
+ Tıbbi metin için özel preprocessing
477
+ PHI (Protected Health Information) temizleme simülasyonu
478
+ """
479
+ def anonymize_text(text, patient_id):
480
+ # Hasta ID'lerini anonimleştir
481
+ text = text.replace(patient_id, '[PATIENT_ID]')
482
+
483
+ # Tarihleri anonimleştir
484
+ text = re.sub(r'\d{4}-\d{2}-\d{2}', '[DATE]', text)
485
+
486
+ return text
487
+
488
+ def extract_medical_entities(text):
489
+ # Tıbbi terimleri say (basit örnek)
490
+ terms = ['patient', 'symptoms', 'treatment', 'diagnosis',
491
+ 'medication', 'therapy', 'condition']
492
+ count = sum([1 for term in terms if term in text.lower()])
493
+ return count
494
+
495
+ return {
496
+ 'note_anonymized': [
497
+ anonymize_text(note, pid)
498
+ for note, pid in zip(examples['note'], examples['patient_id'])
499
+ ],
500
+ 'medical_entity_count': [extract_medical_entities(note) for note in examples['note']],
501
+ 'note_length': [len(note) for note in examples['note']],
502
+ 'requires_follow_up': examples['follow_up_required']
503
+ }
504
+
505
+ medical_processed = medical_dataset.map(
506
+ preprocess_medical_text,
507
+ batched=True,
508
+ batch_size=500,
509
+ desc="Anonymizing medical records"
510
+ )
511
+
512
+ print(f"✅ {len(medical_processed)} tıbbi kayıt anonimleştirildi")
513
+ print(f"\nÖrnek anonimleştirilmiş not:")
514
+ print(f" Orijinal: {medical_processed[0]['note'][:100]}...")
515
+ print(f" Anonimleştirilmiş: {medical_processed[0]['note_anonymized'][:100]}...")
516
+ print(f" Tıbbi entity sayısı: {medical_processed[0]['medical_entity_count']}")
517
+
518
+
519
+ print("\n" + "="*70)
520
+ print("5. DOMAIN-SPECIFIC TOKENIZATION")
521
+ print("="*70)
522
+
523
+ print("\n🔤 Domain-Specific Tokenization Stratejileri:")
524
+
525
+ # Bilimsel metin için
526
+ print("\n1️⃣ Bilimsel Metin Tokenization:")
527
+ scientific_sample = scientific_dataset[0]['abstract']
528
+ print(f" Orijinal: {scientific_sample[:80]}...")
529
+
530
+ # Basit word tokenization
531
+ words = scientific_sample.split()
532
+ print(f" Word tokens: {len(words)} kelime")
533
+ print(f" İlk 5 token: {words[:5]}")
534
+
535
+ # Sentence tokenization
536
+ sentences = scientific_sample.split('.')
537
+ print(f" Sentence tokens: {len([s for s in sentences if s.strip()])} cümle")
538
+
539
+ # Kod için
540
+ print("\n2️⃣ Kod Tokenization:")
541
+ code_sample = code_dataset[0]['code']
542
+ print(f" Kod:\n{code_sample}")
543
+
544
+ # Satır bazlı
545
+ lines = code_sample.split('\n')
546
+ print(f" Satır sayısı: {len(lines)}")
547
+
548
+ # Token bazlı (basit)
549
+ code_tokens = re.findall(r'\w+|[^\w\s]', code_sample)
550
+ print(f" Token sayısı: {len(code_tokens)}")
551
+ print(f" İlk 10 token: {code_tokens[:10]}")
552
+
553
+
554
+ print("\n" + "="*70)
555
+ print("6. CROSS-DOMAIN DATASET BİRLEŞTİRME")
556
+ print("="*70)
557
+
558
+ print("\n🔄 Farklı Domain'lerden Dataset Birleştirme:")
559
+
560
+ # Her domain'den küçük subset al
561
+ sci_subset = scientific_dataset.select(range(100))
562
+ code_subset = code_dataset.select(range(100))
563
+ fin_subset = financial_dataset.select(range(100))
564
+
565
+ # Ortak format'a çevir
566
+ def normalize_scientific(example):
567
+ return {
568
+ 'text': example['abstract'],
569
+ 'domain': 'scientific',
570
+ 'metadata': {
571
+ 'type': example['domain'],
572
+ 'year': example['year']
573
+ }
574
+ }
575
+
576
+ def normalize_code(example):
577
+ return {
578
+ 'text': example['code'],
579
+ 'domain': 'code',
580
+ 'metadata': {
581
+ 'language': example['language'],
582
+ 'lines': example['lines_of_code']
583
+ }
584
+ }
585
+
586
+ def normalize_financial(example):
587
+ return {
588
+ 'text': example['text'],
589
+ 'domain': 'financial',
590
+ 'metadata': {
591
+ 'sentiment': example['sentiment'],
592
+ 'company': example['company']
593
+ }
594
+ }
595
+
596
+ print("\n📦 Dataset'leri normalize ediyoruz...")
597
+ sci_norm = sci_subset.map(normalize_scientific, remove_columns=sci_subset.column_names)
598
+ code_norm = code_subset.map(normalize_code, remove_columns=code_subset.column_names)
599
+ fin_norm = fin_subset.map(normalize_financial, remove_columns=fin_subset.column_names)
600
+
601
+ # Birleştir
602
+ from datasets import concatenate_datasets
603
+ multi_domain = concatenate_datasets([sci_norm, code_norm, fin_norm])
604
+
605
+ print(f"✅ Multi-domain dataset: {len(multi_domain)} örnek")
606
+ print(f"\nDomain dağılımı:")
607
+ domains = [ex['domain'] for ex in multi_domain]
608
+ domain_dist = Counter(domains)
609
+ for domain, count in domain_dist.items():
610
+ print(f" {domain}: {count}")
611
+
612
+ print(f"\nÖrnek multi-domain kayıtlar:")
613
+ for i in range(3):
614
+ ex = multi_domain[i * 100] # Her domain'den birer örnek
615
+ print(f"\n {i+1}. Domain: {ex['domain']}")
616
+ print(f" Text: {ex['text'][:80]}...")
617
+ print(f" Metadata: {ex['metadata']}")
618
+
619
+
620
+ print("\n" + "="*70)
621
+ print("7. DOMAIN ADAPTATION TEKNİKLERİ")
622
+ print("="*70)
623
+
624
+ print("\n🎯 Domain Adaptation Stratejileri:")
625
+
626
+ # Örnek: Genel domain'den specific domain'e transfer
627
+ print("\n1️⃣ Domain-Specific Vocabulary Analysis:")
628
+
629
+ def analyze_domain_vocabulary(dataset, text_column, domain_name):
630
+ """
631
+ Domain-specific kelime dağarcığı analizi
632
+ """
633
+ all_words = []
634
+ for example in dataset:
635
+ words = example[text_column].lower().split()
636
+ all_words.extend(words)
637
+
638
+ vocab_counts = Counter(all_words)
639
+
640
+ return {
641
+ 'domain': domain_name,
642
+ 'total_words': len(all_words),
643
+ 'unique_words': len(vocab_counts),
644
+ 'top_10_words': vocab_counts.most_common(10)
645
+ }
646
+
647
+ # Her domain için vocabulary analizi
648
+ sci_vocab = analyze_domain_vocabulary(
649
+ scientific_dataset.select(range(500)),
650
+ 'abstract',
651
+ 'Scientific'
652
+ )
653
+ code_vocab = analyze_domain_vocabulary(
654
+ code_dataset.select(range(500)),
655
+ 'code',
656
+ 'Code'
657
+ )
658
+ fin_vocab = analyze_domain_vocabulary(
659
+ financial_dataset.select(range(500)),
660
+ 'text',
661
+ 'Financial'
662
+ )
663
+
664
+ print("\n📚 Domain Vocabulary İstatistikleri:")
665
+ for vocab in [sci_vocab, code_vocab, fin_vocab]:
666
+ print(f"\n {vocab['domain']}:")
667
+ print(f" Toplam kelime: {vocab['total_words']:,}")
668
+ print(f" Benzersiz kelime: {vocab['unique_words']:,}")
669
+ print(f" Vocabulary zenginliği: {vocab['unique_words']/vocab['total_words']:.3f}")
670
+ print(f" Top 5 kelime: {[w for w, c in vocab['top_10_words'][:5]]}")
671
+
672
+
673
+ print("\n2️⃣ Domain-Specific Data Augmentation:")
674
+
675
+ def augment_scientific_text(example):
676
+ """
677
+ Bilimsel metin için data augmentation
678
+ """
679
+ text = example['abstract']
680
+
681
+ # Synonym replacement (basit simülasyon)
682
+ augmented = text.replace('novel', 'innovative')
683
+ augmented = augmented.replace('propose', 'present')
684
+ augmented = augmented.replace('demonstrate', 'show')
685
+
686
+ return {
687
+ **example,
688
+ 'abstract_augmented': augmented
689
+ }
690
+
691
+ print("\n Bilimsel metin augmentation örneği:")
692
+ aug_sample = augment_scientific_text(scientific_dataset[0])
693
+ print(f" Original: {aug_sample['abstract'][:100]}...")
694
+ print(f" Augmented: {aug_sample['abstract_augmented'][:100]}...")
695
+
696
+
697
+ print("\n3️⃣ Domain-Specific Filtering:")
698
+
699
+ def filter_high_quality_scientific(example):
700
+ """
701
+ Yüksek kaliteli bilimsel makaleleri filtrele
702
+ """
703
+ return (
704
+ example['citations'] > 50 and # Çok atıf almış
705
+ example['year'] >= 2020 and # Son yıllarda yayınlanmış
706
+ len(example['abstract'].split()) > 100 # Detaylı abstract
707
+ )
708
+
709
+ high_quality_sci = scientific_dataset.filter(
710
+ filter_high_quality_scientific,
711
+ desc="Filtering high-quality papers"
712
+ )
713
+
714
+ print(f"\n Kaliteli makale filtreleme:")
715
+ print(f" Orijinal: {len(scientific_dataset)} makale")
716
+ print(f" Filtrelenmiş: {len(high_quality_sci)} makale")
717
+ print(f" Oran: {len(high_quality_sci)/len(scientific_dataset)*100:.1f}%")
718
+
719
+
720
+ print("\n" + "="*70)
721
+ print("8. DOMAIN-SPECIFIC EVALUATION METRİKLERİ")
722
+ print("="*70)
723
+
724
+ print("\n📊 Domain-Specific Kalite Metrikleri:")
725
+
726
+ def calculate_domain_metrics(dataset, domain_name):
727
+ """
728
+ Domain-specific kalite metrikleri
729
+ """
730
+ if domain_name == 'scientific':
731
+ # Bilimsel metrikler
732
+ avg_citations = np.mean([ex['citations'] for ex in dataset])
733
+ avg_authors = np.mean([len(ex['authors']) for ex in dataset])
734
+ recent_papers = sum([1 for ex in dataset if ex['year'] >= 2020])
735
+
736
+ return {
737
+ 'domain': domain_name,
738
+ 'avg_citations': avg_citations,
739
+ 'avg_authors': avg_authors,
740
+ 'recent_ratio': recent_papers / len(dataset)
741
+ }
742
+
743
+ elif domain_name == 'code':
744
+ # Kod metrikleri
745
+ avg_loc = np.mean([ex['lines_of_code'] for ex in dataset])
746
+ has_doc = sum([1 for ex in dataset if ex['has_docstring']])
747
+ high_stars = sum([1 for ex in dataset if ex['stars'] > 1000])
748
+
749
+ return {
750
+ 'domain': domain_name,
751
+ 'avg_lines_of_code': avg_loc,
752
+ 'documentation_ratio': has_doc / len(dataset),
753
+ 'popular_ratio': high_stars / len(dataset)
754
+ }
755
+
756
+ elif domain_name == 'financial':
757
+ # Finansal metrikler
758
+ sentiments = [ex['sentiment'] for ex in dataset]
759
+ sent_dist = Counter(sentiments)
760
+ avg_change = np.mean([ex['stock_change'] for ex in dataset])
761
+
762
+ return {
763
+ 'domain': domain_name,
764
+ 'sentiment_distribution': dict(sent_dist),
765
+ 'avg_stock_change': avg_change,
766
+ 'volatility': np.std([ex['stock_change'] for ex in dataset])
767
+ }
768
+
769
+ print("\n1️⃣ Scientific Metrics:")
770
+ sci_metrics = calculate_domain_metrics(scientific_dataset, 'scientific')
771
+ for key, value in sci_metrics.items():
772
+ print(f" {key}: {value}")
773
+
774
+ print("\n2️⃣ Code Metrics:")
775
+ code_metrics = calculate_domain_metrics(code_dataset, 'code')
776
+ for key, value in code_metrics.items():
777
+ print(f" {key}: {value}")
778
+
779
+ print("\n3️⃣ Financial Metrics:")
780
+ fin_metrics = calculate_domain_metrics(financial_dataset, 'financial')
781
+ for key, value in fin_metrics.items():
782
+ print(f" {key}: {value}")
783
+
784
+
785
+ print("\n" + "="*70)
786
+ print("9. BEST PRACTICES - DOMAIN-SPECIFIC DATASETS")
787
+ print("="*70)
788
+
789
+ print("""
790
+ ✅ BİLİMSEL DATASETS:
791
+ - Citation metadata ekle
792
+ - Abstract + full text ayrımı
793
+ - Domain/field classification
794
+ - Author disambiguation
795
+ - Reference parsing
796
+ - LaTeX formül handling
797
+
798
+ ✅ KOD DATASETS:
799
+ - Programlama dili ayrımı
800
+ - Syntax parsing
801
+ - Docstring extraction
802
+ - Repository metadata
803
+ - License bilgisi
804
+ - Code quality metrics (complexity, coverage)
805
+
806
+ ✅ FİNANSAL DATASETS:
807
+ - Sentiment annotation
808
+ - Entity recognition (companies, people)
809
+ - Temporal information
810
+ - Numerical data extraction
811
+ - Market data integration
812
+ - Real-time updates
813
+
814
+ ✅ TIBBİ DATASETS:
815
+ - PHI (Protected Health Information) removal
816
+ - HIPAA compliance
817
+ - Clinical terminology standardization
818
+ - ICD code mapping
819
+ - Anonymization
820
+ - Ethical considerations
821
+
822
+ ✅ GENEL PRENSİPLER:
823
+ - Domain expertise gerekir
824
+ - Specialized tokenization
825
+ - Domain-specific validation
826
+ - Quality filtering
827
+ - Ethical guidelines takip et
828
+ - License ve copyright kontrol et
829
+
830
+ ✅ DATA QUALITY:
831
+ - Domain experts ile validate et
832
+ - Inter-annotator agreement hesapla
833
+ - Bias analysis yap
834
+ - Coverage analysis
835
+ - Statistical validation
836
+ - Regular updates
837
+ """)
838
+
839
+
840
+ print("\n" + "="*70)
841
+ print("✅ BÖLÜM 2 TAMAMLANDI!")
842
+ print("="*70)
843
+ print(f"""
844
+ Bu bölümde öğrendikleriniz:
845
+ ✓ Bilimsel makale datasets ({len(scientific_dataset)} örnek)
846
+ ✓ Kod datasets ({len(code_dataset)} örnek)
847
+ ✓ Finansal analiz datasets ({len(financial_dataset)} örnek)
848
+ ✓ Tıbbi/sağlık datasets ({len(medical_dataset)} örnek)
849
+ ✓ Domain-specific preprocessing
850
+ ✓ Cross-domain dataset birleştirme
851
+ ✓ Domain adaptation teknikleri
852
+ ✓ Domain-specific evaluation metrikleri
853
+
854
+ 📊 ÜRETİLEN DATASETS:
855
+ - Scientific: {len(scientific_dataset):,} makale
856
+ - Code: {len(code_dataset):,} kod örneği
857
+ - Financial: {len(financial_dataset):,} finansal kayıt
858
+ - Medical: {len(medical_dataset):,} tıbbi kayıt
859
+ - Multi-domain: {len(multi_domain):,} birleştirilmiş örnek
860
+
861
+ 📚 SONRAKI BÖLÜM: İleri Teknikler
862
+ - Dataset streaming (büyük datasets için)
863
+ - Custom data collators
864
+ - Feature extraction ve transformation
865
+ - Dataset preprocessing pipelines
866
+ - Advanced filtering strategies
867
+ """)
868
+
869
+ print("\n🚀 Harika! İkinci bölümü tamamladık!")
870
+ print("Üçüncü bölüme (İleri Teknikler) geçelim mi?")
space/modules/02b_cross_domain_fix.py ADDED
@@ -0,0 +1,498 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ CROSS-DOMAIN DATASET BİRLEŞTİRME - DOĞRU YÖNTEM
3
+ ===============================================
4
+
5
+ Bu modül, farklı domain'lerden dataset'leri birleştirirken
6
+ karşılaşılan schema mismatch problemini çözer ve best practices gösterir.
7
+ """
8
+
9
+ from datasets import Dataset, concatenate_datasets
10
+ import numpy as np
11
+ import json
12
+
13
+ print("="*70)
14
+ print("🔧 CROSS-DOMAIN DATASET BİRLEŞTİRME - PROBLEM VE ÇÖZÜM")
15
+ print("="*70)
16
+
17
+ # Sentetik dataset'ler oluştur
18
+ def generate_scientific_papers(num_samples=100):
19
+ def gen():
20
+ for i in range(num_samples):
21
+ yield {
22
+ 'id': f'sci_{i}',
23
+ 'abstract': f'Scientific text {i}',
24
+ 'domain': 'Physics',
25
+ 'year': 2020 + (i % 5)
26
+ }
27
+ return Dataset.from_generator(gen)
28
+
29
+ def generate_code_dataset(num_samples=100):
30
+ def gen():
31
+ for i in range(num_samples):
32
+ yield {
33
+ 'id': f'code_{i}',
34
+ 'code': f'def func_{i}(): pass',
35
+ 'language': 'Python',
36
+ 'lines_of_code': 5
37
+ }
38
+ return Dataset.from_generator(gen)
39
+
40
+ def generate_financial_dataset(num_samples=100):
41
+ def gen():
42
+ for i in range(num_samples):
43
+ yield {
44
+ 'id': f'fin_{i}',
45
+ 'text': f'Company {i} reports earnings',
46
+ 'sentiment': 'positive',
47
+ 'company': f'Corp{i}'
48
+ }
49
+ return Dataset.from_generator(gen)
50
+
51
+ print("\n📚 Sample Datasets Oluşturuluyor...")
52
+ sci_dataset = generate_scientific_papers(100)
53
+ code_dataset = generate_code_dataset(100)
54
+ fin_dataset = generate_financial_dataset(100)
55
+
56
+ print(f"✅ Scientific: {len(sci_dataset)} örnekler")
57
+ print(f" Kolonlar: {sci_dataset.column_names}")
58
+ print(f"✅ Code: {len(code_dataset)} örnekler")
59
+ print(f" Kolonlar: {code_dataset.column_names}")
60
+ print(f"✅ Financial: {len(fin_dataset)} örnekler")
61
+ print(f" Kolonlar: {fin_dataset.column_names}")
62
+
63
+
64
+ print("\n" + "="*70)
65
+ print("❌ PROBLEM: YANLIŞ YÖNTEM")
66
+ print("="*70)
67
+
68
+ print("""
69
+ Hatalı Yaklaşım:
70
+ - Her dataset farklı metadata structure'ı
71
+ - Schema mismatch hatası
72
+ - Arrow type error
73
+
74
+ Örnek hatalı kod:
75
+ metadata: {'type': domain, 'year': year} # Scientific
76
+ metadata: {'language': lang, 'lines': loc} # Code
77
+ metadata: {'sentiment': sent, 'company': comp} # Financial
78
+
79
+ ❌ concatenate_datasets() çalışmaz!
80
+ """)
81
+
82
+
83
+ print("\n" + "="*70)
84
+ print("✅ ÇÖZÜM 1: ORTAK SCHEMA - FLATTEN APPROACH")
85
+ print("="*70)
86
+
87
+ print("\n🔧 Tüm alanları flatten edelim (en basit çözüm):")
88
+
89
+ def normalize_to_flat_schema(example, domain_type):
90
+ """
91
+ Tüm alanları ayrı kolonlara çıkar
92
+ Missing değerler için None kullan
93
+ """
94
+ base = {
95
+ 'id': example.get('id', ''),
96
+ 'text': '',
97
+ 'domain': domain_type,
98
+ # Scientific fields
99
+ 'abstract': None,
100
+ 'sci_domain': None,
101
+ 'year': None,
102
+ # Code fields
103
+ 'code': None,
104
+ 'language': None,
105
+ 'lines_of_code': None,
106
+ # Financial fields
107
+ 'sentiment': None,
108
+ 'company': None,
109
+ }
110
+
111
+ # Domain'e göre doldur
112
+ if domain_type == 'scientific':
113
+ base['text'] = example.get('abstract', '')
114
+ base['abstract'] = example.get('abstract', '')
115
+ base['sci_domain'] = example.get('domain', '')
116
+ base['year'] = example.get('year', None)
117
+ elif domain_type == 'code':
118
+ base['text'] = example.get('code', '')
119
+ base['code'] = example.get('code', '')
120
+ base['language'] = example.get('language', '')
121
+ base['lines_of_code'] = example.get('lines_of_code', None)
122
+ elif domain_type == 'financial':
123
+ base['text'] = example.get('text', '')
124
+ base['sentiment'] = example.get('sentiment', '')
125
+ base['company'] = example.get('company', '')
126
+
127
+ return base
128
+
129
+ # Her dataset'i normalize et
130
+ print(" Normalizing scientific dataset...")
131
+ sci_flat = sci_dataset.map(
132
+ lambda x: normalize_to_flat_schema(x, 'scientific'),
133
+ remove_columns=sci_dataset.column_names,
134
+ desc="Flattening scientific"
135
+ )
136
+
137
+ print(" Normalizing code dataset...")
138
+ code_flat = code_dataset.map(
139
+ lambda x: normalize_to_flat_schema(x, 'code'),
140
+ remove_columns=code_dataset.column_names,
141
+ desc="Flattening code"
142
+ )
143
+
144
+ print(" Normalizing financial dataset...")
145
+ fin_flat = fin_dataset.map(
146
+ lambda x: normalize_to_flat_schema(x, 'financial'),
147
+ remove_columns=fin_dataset.column_names,
148
+ desc="Flattening financial"
149
+ )
150
+
151
+ # Şimdi birleştir - ÇALIŞIR!
152
+ print("\n✅ Birleştiriliyor...")
153
+ multi_domain_flat = concatenate_datasets([sci_flat, code_flat, fin_flat])
154
+
155
+ print(f"\n🎉 BAŞARILI! Multi-domain dataset: {len(multi_domain_flat)} örnek")
156
+ print(f"Kolonlar: {multi_domain_flat.column_names}")
157
+
158
+ # Örnekleri göster
159
+ print("\n📊 Her domain'den örnek:")
160
+ print("\n1. Scientific örnek:")
161
+ sci_ex = multi_domain_flat[0]
162
+ print(f" Domain: {sci_ex['domain']}")
163
+ print(f" Text: {sci_ex['text'][:50]}...")
164
+ print(f" Year: {sci_ex['year']}")
165
+ print(f" Language: {sci_ex['language']}") # None olmalı
166
+
167
+ print("\n2. Code örnek:")
168
+ code_ex = multi_domain_flat[100]
169
+ print(f" Domain: {code_ex['domain']}")
170
+ print(f" Text: {code_ex['text'][:50]}...")
171
+ print(f" Language: {code_ex['language']}")
172
+ print(f" Year: {code_ex['year']}") # None olmalı
173
+
174
+ print("\n3. Financial örnek:")
175
+ fin_ex = multi_domain_flat[200]
176
+ print(f" Domain: {fin_ex['domain']}")
177
+ print(f" Text: {fin_ex['text'][:50]}...")
178
+ print(f" Sentiment: {fin_ex['sentiment']}")
179
+ print(f" Company: {fin_ex['company']}")
180
+
181
+
182
+ print("\n" + "="*70)
183
+ print("✅ ÇÖZÜM 2: JSON METADATA - FLEXIBLE APPROACH")
184
+ print("="*70)
185
+
186
+ print("\n🔧 Metadata'yı JSON string olarak sakla (daha esnek):")
187
+
188
+ def normalize_to_json_schema(example, domain_type):
189
+ """
190
+ Domain-specific metadata'yı JSON string olarak sakla
191
+ Bu yaklaşım daha esnek ve genişletilebilir
192
+ """
193
+ base = {
194
+ 'id': example.get('id', ''),
195
+ 'text': '',
196
+ 'domain': domain_type,
197
+ 'metadata_json': ''
198
+ }
199
+
200
+ metadata = {}
201
+
202
+ if domain_type == 'scientific':
203
+ base['text'] = example.get('abstract', '')
204
+ metadata = {
205
+ 'domain': example.get('domain', ''),
206
+ 'year': example.get('year', None)
207
+ }
208
+ elif domain_type == 'code':
209
+ base['text'] = example.get('code', '')
210
+ metadata = {
211
+ 'language': example.get('language', ''),
212
+ 'lines_of_code': example.get('lines_of_code', None)
213
+ }
214
+ elif domain_type == 'financial':
215
+ base['text'] = example.get('text', '')
216
+ metadata = {
217
+ 'sentiment': example.get('sentiment', ''),
218
+ 'company': example.get('company', '')
219
+ }
220
+
221
+ base['metadata_json'] = json.dumps(metadata)
222
+ return base
223
+
224
+ # Normalize
225
+ print(" Normalizing with JSON metadata...")
226
+ sci_json = sci_dataset.map(
227
+ lambda x: normalize_to_json_schema(x, 'scientific'),
228
+ remove_columns=sci_dataset.column_names
229
+ )
230
+ code_json = code_dataset.map(
231
+ lambda x: normalize_to_json_schema(x, 'code'),
232
+ remove_columns=code_dataset.column_names
233
+ )
234
+ fin_json = fin_dataset.map(
235
+ lambda x: normalize_to_json_schema(x, 'financial'),
236
+ remove_columns=fin_dataset.column_names
237
+ )
238
+
239
+ # Birleştir
240
+ multi_domain_json = concatenate_datasets([sci_json, code_json, fin_json])
241
+
242
+ print(f"\n✅ Multi-domain (JSON): {len(multi_domain_json)} örnek")
243
+ print(f"Kolonlar: {multi_domain_json.column_names}")
244
+
245
+ # Metadata'yı parse et
246
+ print("\n📊 JSON Metadata Örnekleri:")
247
+ for i, idx in enumerate([0, 100, 200]):
248
+ ex = multi_domain_json[idx]
249
+ metadata = json.loads(ex['metadata_json'])
250
+ print(f"\n{i+1}. {ex['domain'].capitalize()}:")
251
+ print(f" Text: {ex['text'][:50]}...")
252
+ print(f" Metadata: {metadata}")
253
+
254
+
255
+ print("\n" + "="*70)
256
+ print("✅ ÇÖZÜM 3: SEPARATE TABLES - DATABASE APPROACH")
257
+ print("="*70)
258
+
259
+ print("""
260
+ 🗄️ Database-style Approach:
261
+
262
+ Ana tablo (unified):
263
+ - id
264
+ - text
265
+ - domain
266
+ - reference_id
267
+
268
+ Domain-specific tablolar:
269
+ - scientific_metadata: reference_id -> {year, domain, ...}
270
+ - code_metadata: reference_id -> {language, lines, ...}
271
+ - financial_metadata: reference_id -> {sentiment, company, ...}
272
+
273
+ 장점:
274
+ ✓ Schema flexibility
275
+ ✓ Easy to extend
276
+ ✓ Efficient storage
277
+ ✓ Type safety
278
+
279
+ 단점:
280
+ ✗ Join gerekir
281
+ ✗ Daha kompleks
282
+ """)
283
+
284
+ # Simple implementation
285
+ def create_separated_tables(datasets_dict):
286
+ """
287
+ Ana tablo + ayrı metadata tabloları
288
+ """
289
+ # Ana tablo
290
+ unified = []
291
+ metadata_tables = {
292
+ 'scientific': [],
293
+ 'code': [],
294
+ 'financial': []
295
+ }
296
+
297
+ ref_id = 0
298
+
299
+ # Scientific
300
+ for ex in datasets_dict['scientific']:
301
+ unified.append({
302
+ 'id': ex['id'],
303
+ 'text': ex['abstract'],
304
+ 'domain': 'scientific',
305
+ 'reference_id': ref_id
306
+ })
307
+ metadata_tables['scientific'].append({
308
+ 'reference_id': ref_id,
309
+ 'sci_domain': ex['domain'],
310
+ 'year': ex['year']
311
+ })
312
+ ref_id += 1
313
+
314
+ # Code
315
+ for ex in datasets_dict['code']:
316
+ unified.append({
317
+ 'id': ex['id'],
318
+ 'text': ex['code'],
319
+ 'domain': 'code',
320
+ 'reference_id': ref_id
321
+ })
322
+ metadata_tables['code'].append({
323
+ 'reference_id': ref_id,
324
+ 'language': ex['language'],
325
+ 'lines_of_code': ex['lines_of_code']
326
+ })
327
+ ref_id += 1
328
+
329
+ # Financial
330
+ for ex in datasets_dict['financial']:
331
+ unified.append({
332
+ 'id': ex['id'],
333
+ 'text': ex['text'],
334
+ 'domain': 'financial',
335
+ 'reference_id': ref_id
336
+ })
337
+ metadata_tables['financial'].append({
338
+ 'reference_id': ref_id,
339
+ 'sentiment': ex['sentiment'],
340
+ 'company': ex['company']
341
+ })
342
+ ref_id += 1
343
+
344
+ return {
345
+ 'unified': Dataset.from_dict({k: [d[k] for d in unified] for k in unified[0].keys()}),
346
+ 'metadata': {k: Dataset.from_dict({k: [d[k] for d in v] for k in v[0].keys()})
347
+ for k, v in metadata_tables.items()}
348
+ }
349
+
350
+ print("\n🔧 Creating separated tables...")
351
+ separated = create_separated_tables({
352
+ 'scientific': sci_dataset,
353
+ 'code': code_dataset,
354
+ 'financial': fin_dataset
355
+ })
356
+
357
+ print(f"\n✅ Unified table: {len(separated['unified'])} records")
358
+ print(f" Columns: {separated['unified'].column_names}")
359
+
360
+ for domain, meta_table in separated['metadata'].items():
361
+ print(f"\n✅ {domain.capitalize()} metadata: {len(meta_table)} records")
362
+ print(f" Columns: {meta_table.column_names}")
363
+
364
+ # Join örneği
365
+ print("\n🔗 Join Example - Scientific record:")
366
+ unified_ex = separated['unified'][0]
367
+ ref_id = unified_ex['reference_id']
368
+ sci_meta = [ex for ex in separated['metadata']['scientific'] if ex['reference_id'] == ref_id][0]
369
+
370
+ print(f" Main table: {unified_ex}")
371
+ print(f" Metadata: {sci_meta}")
372
+
373
+
374
+ print("\n" + "="*70)
375
+ print("📚 BEST PRACTICES - CROSS-DOMAIN DATASETS")
376
+ print("="*70)
377
+
378
+ print("""
379
+ ✅ FLATTEN APPROACH:
380
+ 장점:
381
+ - En basit yöntem
382
+ - Hızlı erişim
383
+ - Tüm veriler bir yerde
384
+ 단점:
385
+ - Çok fazla None değer (sparse)
386
+ - Schema değişikliği zor
387
+ - Memory inefficient
388
+
389
+ Ne zaman kullan:
390
+ - Az sayıda domain
391
+ - Benzer field'lar
392
+ - Simple queries
393
+
394
+ ✅ JSON METADATA APPROACH:
395
+ 장점:
396
+ - Esnek schema
397
+ - Kolay extend
398
+ - Daha az None
399
+ 단점:
400
+ - Parse gerekir
401
+ - Type safety yok
402
+ - Query daha yavaş
403
+
404
+ Ne zaman kullan:
405
+ - Çok farklı domain'ler
406
+ - Sık schema değişikliği
407
+ - Prototype/exploration
408
+
409
+ ✅ SEPARATE TABLES APPROACH:
410
+ 장점:
411
+ - Temiz schema
412
+ - Type safe
413
+ - Efficient storage
414
+ - Professional approach
415
+ 단점:
416
+ - Join gerekir
417
+ - Daha kompleks
418
+ - Setup overhead
419
+
420
+ Ne zaman kullan:
421
+ - Production systems
422
+ - Çok domain
423
+ - Complex queries
424
+ - Large scale
425
+
426
+ ✅ HYBRID APPROACH:
427
+ - Common fields flatten
428
+ - Rare fields JSON
429
+ - Best of both worlds
430
+
431
+ Örnek:
432
+ {
433
+ 'id': string,
434
+ 'text': string,
435
+ 'domain': string,
436
+ 'common_field_1': value,
437
+ 'common_field_2': value,
438
+ 'extra_metadata_json': json_string
439
+ }
440
+
441
+ 🎯 RECOMMENDATION:
442
+ Small project → JSON approach
443
+ Medium project → Flatten approach
444
+ Large project → Separate tables
445
+ Research → Hybrid approach
446
+ """)
447
+
448
+
449
+ print("\n" + "="*70)
450
+ print("🔍 KARŞILAŞTIRMA - PERFORMANCE & STORAGE")
451
+ print("="*70)
452
+
453
+ import sys
454
+
455
+ print("\n📊 Memory Usage Comparison:")
456
+ print(f" Flatten: {sys.getsizeof(multi_domain_flat.data)} bytes")
457
+ print(f" JSON: {sys.getsizeof(multi_domain_json.data)} bytes")
458
+ print(f" Separated (unified): {sys.getsizeof(separated['unified'].data)} bytes")
459
+
460
+ print("\n🚀 Query Speed Simulation:")
461
+ print(" Flatten: O(1) - Direct column access")
462
+ print(" JSON: O(1) + parse overhead")
463
+ print(" Separated: O(log n) - Join required")
464
+
465
+ print("\n💾 Storage Efficiency:")
466
+ total_flat = len(multi_domain_flat) * len(multi_domain_flat.column_names)
467
+ total_json = len(multi_domain_json) * len(multi_domain_json.column_names)
468
+ total_sep = len(separated['unified']) + sum(len(t) for t in separated['metadata'].values())
469
+
470
+ print(f" Flatten: {total_flat} total fields")
471
+ print(f" JSON: {total_json} total fields")
472
+ print(f" Separated: {total_sep} total fields")
473
+
474
+
475
+ print("\n" + "="*70)
476
+ print("✅ ÇÖZÜM ÖZETİ")
477
+ print("="*70)
478
+
479
+ print("""
480
+ 🎯 Ana Sorun:
481
+ ArrowTypeError: struct fields don't match
482
+
483
+ 🔧 Çözümler:
484
+ 1. Flatten: Tüm field'ları ayrı kolonlara çıkar
485
+ 2. JSON: Metadata'yı JSON string olarak sakla
486
+ 3. Separated: Ana tablo + metadata tabloları
487
+
488
+ ✅ En İyi Yaklaşım:
489
+ - Küçük projeler: JSON
490
+ - Orta projeler: Flatten + JSON hybrid
491
+ - Büyük projeler: Separated tables
492
+
493
+ ⚡ Key Takeaway:
494
+ Farklı schema'ları birleştirmeden önce
495
+ ortak bir format'a normalize et!
496
+ """)
497
+
498
+ print("\n🎉 Problem çözüldü! Artık cross-domain dataset'leri güvenle birleştirebilirsiniz.")
space/modules/03_ileri_teknikler_part1.py ADDED
@@ -0,0 +1,856 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ İLERİ TEKNİKLER - HUGGING FACE DATASETS
3
+ ========================================
4
+
5
+ Bu modülde öğrenecekleriniz:
6
+ 1. Custom Data Collators
7
+ 2. Advanced Feature Extraction & Transformation
8
+ 3. Dataset Preprocessing Pipelines
9
+ 4. Data Augmentation Strategies
10
+ 5. Advanced Filtering & Sampling
11
+ 6. Dynamic Batching
12
+ 7. Feature Engineering
13
+ """
14
+
15
+ from datasets import Dataset, DatasetDict
16
+ import numpy as np
17
+ from typing import Dict, List, Any, Callable
18
+ import time
19
+ from collections import defaultdict
20
+ import random
21
+
22
+ print("="*70)
23
+ print("🚀 İLERİ TEKNİKLER - ADVANCED HUGGING FACE DATASETS")
24
+ print("="*70)
25
+
26
+
27
+ print("\n" + "="*70)
28
+ print("1. CUSTOM DATA COLLATORS")
29
+ print("="*70)
30
+
31
+ print("\n📦 Data Collator Nedir?")
32
+ print("""
33
+ Data Collator: Batch'teki örnekleri işleyip model input'una çevirir
34
+ - Padding ekler
35
+ - Tensor'lara çevirir
36
+ - Batch oluşturur
37
+ - Dynamic behavior
38
+ """)
39
+
40
+ # Örnek dataset
41
+ def create_sample_dataset(num_samples=100):
42
+ def gen():
43
+ for i in range(num_samples):
44
+ yield {
45
+ 'text': f"Sample text {i} " * np.random.randint(5, 20),
46
+ 'label': np.random.randint(0, 3),
47
+ 'length': np.random.randint(10, 100),
48
+ 'metadata': {'id': i, 'score': np.random.random()}
49
+ }
50
+ return Dataset.from_generator(gen)
51
+
52
+ dataset = create_sample_dataset(200)
53
+ print(f"\n✅ Dataset: {len(dataset)} örnek")
54
+ print(f"Örnek: {dataset[0]}")
55
+
56
+
57
+ print("\n1️⃣ Basit Collator - Text + Label:")
58
+
59
+ class SimpleCollator:
60
+ """
61
+ En basit collator - sadece text ve label'ı işler
62
+ """
63
+ def __init__(self, max_length=50):
64
+ self.max_length = max_length
65
+
66
+ def __call__(self, batch: List[Dict]) -> Dict[str, List]:
67
+ """
68
+ Batch'i işle
69
+ """
70
+ # Text'leri al ve truncate et
71
+ texts = []
72
+ for example in batch:
73
+ text = example['text']
74
+ words = text.split()[:self.max_length]
75
+ texts.append(' '.join(words))
76
+
77
+ # Label'ları al
78
+ labels = [example['label'] for example in batch]
79
+
80
+ # Length'leri hesapla
81
+ lengths = [len(text.split()) for text in texts]
82
+
83
+ return {
84
+ 'texts': texts,
85
+ 'labels': labels,
86
+ 'lengths': lengths
87
+ }
88
+
89
+ # Test
90
+ simple_collator = SimpleCollator(max_length=30)
91
+ batch = [dataset[i] for i in range(4)]
92
+
93
+ collated = simple_collator(batch)
94
+ print(f"\n✅ Collated batch:")
95
+ print(f" Texts: {len(collated['texts'])} samples")
96
+ print(f" Labels: {collated['labels']}")
97
+ print(f" Lengths: {collated['lengths']}")
98
+ print(f"\n İlk text: {collated['texts'][0][:80]}...")
99
+
100
+
101
+ print("\n2️⃣ Padding Collator - Dynamic Padding:")
102
+
103
+ class PaddingCollator:
104
+ """
105
+ Dynamic padding - batch içindeki max uzunluğa göre padding
106
+ """
107
+ def __init__(self, pad_token='[PAD]', max_length=None):
108
+ self.pad_token = pad_token
109
+ self.max_length = max_length
110
+
111
+ def __call__(self, batch: List[Dict]) -> Dict[str, Any]:
112
+ # Tokenize (basit - space split)
113
+ tokenized = []
114
+ for example in batch:
115
+ tokens = example['text'].split()
116
+ if self.max_length:
117
+ tokens = tokens[:self.max_length]
118
+ tokenized.append(tokens)
119
+
120
+ # Batch içindeki max length'i bul
121
+ max_len = max(len(tokens) for tokens in tokenized)
122
+
123
+ # Padding ekle
124
+ padded = []
125
+ attention_masks = []
126
+
127
+ for tokens in tokenized:
128
+ # Padding
129
+ padding_length = max_len - len(tokens)
130
+ padded_tokens = tokens + [self.pad_token] * padding_length
131
+
132
+ # Attention mask (1 = real token, 0 = padding)
133
+ mask = [1] * len(tokens) + [0] * padding_length
134
+
135
+ padded.append(padded_tokens)
136
+ attention_masks.append(mask)
137
+
138
+ labels = [ex['label'] for ex in batch]
139
+
140
+ return {
141
+ 'input_tokens': padded,
142
+ 'attention_mask': attention_masks,
143
+ 'labels': labels,
144
+ 'original_lengths': [len(tokens) for tokens in tokenized]
145
+ }
146
+
147
+ # Test
148
+ padding_collator = PaddingCollator(max_length=20)
149
+ batch = [dataset[i] for i in range(4)]
150
+
151
+ padded_batch = padding_collator(batch)
152
+ print(f"\n✅ Padded batch:")
153
+ print(f" Batch size: {len(padded_batch['input_tokens'])}")
154
+ print(f" Max length: {len(padded_batch['input_tokens'][0])}")
155
+ print(f" Original lengths: {padded_batch['original_lengths']}")
156
+ print(f"\n İlk örnek tokens: {padded_batch['input_tokens'][0][:15]}")
157
+ print(f" İlk örnek mask: {padded_batch['attention_mask'][0][:15]}")
158
+
159
+
160
+ print("\n3️⃣ Advanced Collator - Multiple Features:")
161
+
162
+ class AdvancedCollator:
163
+ """
164
+ Çoklu feature'ları handle eden advanced collator
165
+ """
166
+ def __init__(self,
167
+ pad_token='[PAD]',
168
+ max_length=50,
169
+ include_metadata=True,
170
+ normalize_scores=True):
171
+ self.pad_token = pad_token
172
+ self.max_length = max_length
173
+ self.include_metadata = include_metadata
174
+ self.normalize_scores = normalize_scores
175
+
176
+ def tokenize_and_pad(self, texts):
177
+ """Tokenize ve pad"""
178
+ tokenized = [text.split()[:self.max_length] for text in texts]
179
+ max_len = max(len(tokens) for tokens in tokenized)
180
+
181
+ padded = []
182
+ masks = []
183
+ for tokens in tokenized:
184
+ pad_len = max_len - len(tokens)
185
+ padded.append(tokens + [self.pad_token] * pad_len)
186
+ masks.append([1] * len(tokens) + [0] * pad_len)
187
+
188
+ return padded, masks
189
+
190
+ def __call__(self, batch: List[Dict]) -> Dict[str, Any]:
191
+ texts = [ex['text'] for ex in batch]
192
+ labels = [ex['label'] for ex in batch]
193
+ lengths = [ex['length'] for ex in batch]
194
+
195
+ # Tokenize and pad
196
+ padded_tokens, attention_masks = self.tokenize_and_pad(texts)
197
+
198
+ result = {
199
+ 'input_tokens': padded_tokens,
200
+ 'attention_mask': attention_masks,
201
+ 'labels': labels,
202
+ 'lengths': lengths
203
+ }
204
+
205
+ # Metadata ekle
206
+ if self.include_metadata:
207
+ ids = [ex['metadata']['id'] for ex in batch]
208
+ scores = [ex['metadata']['score'] for ex in batch]
209
+
210
+ if self.normalize_scores:
211
+ # Min-max normalization
212
+ min_score = min(scores)
213
+ max_score = max(scores)
214
+ if max_score > min_score:
215
+ scores = [(s - min_score) / (max_score - min_score)
216
+ for s in scores]
217
+
218
+ result['ids'] = ids
219
+ result['scores'] = scores
220
+
221
+ # Batch statistics
222
+ result['batch_stats'] = {
223
+ 'size': len(batch),
224
+ 'avg_length': np.mean(lengths),
225
+ 'max_length': max(lengths),
226
+ 'label_distribution': {
227
+ label: labels.count(label) for label in set(labels)
228
+ }
229
+ }
230
+
231
+ return result
232
+
233
+ # Test
234
+ advanced_collator = AdvancedCollator(
235
+ max_length=25,
236
+ include_metadata=True,
237
+ normalize_scores=True
238
+ )
239
+
240
+ batch = [dataset[i] for i in range(8)]
241
+ advanced_batch = advanced_collator(batch)
242
+
243
+ print(f"\n✅ Advanced collated batch:")
244
+ print(f" Input tokens shape: {len(advanced_batch['input_tokens'])} x {len(advanced_batch['input_tokens'][0])}")
245
+ print(f" Labels: {advanced_batch['labels']}")
246
+ print(f" Normalized scores: {[f'{s:.3f}' for s in advanced_batch['scores']]}")
247
+ print(f" Batch stats: {advanced_batch['batch_stats']}")
248
+
249
+
250
+ print("\n" + "="*70)
251
+ print("2. ADVANCED FEATURE EXTRACTION & TRANSFORMATION")
252
+ print("="*70)
253
+
254
+ print("\n🔧 Feature Engineering Pipeline:")
255
+
256
+ class FeatureExtractor:
257
+ """
258
+ Comprehensive feature extraction
259
+ """
260
+ def __init__(self):
261
+ self.features = []
262
+
263
+ def extract_text_features(self, text: str) -> Dict[str, Any]:
264
+ """Text'ten çeşitli feature'lar çıkar"""
265
+ words = text.split()
266
+
267
+ return {
268
+ # Basic features
269
+ 'word_count': len(words),
270
+ 'char_count': len(text),
271
+ 'avg_word_length': np.mean([len(w) for w in words]) if words else 0,
272
+
273
+ # Complexity features
274
+ 'unique_words': len(set(words)),
275
+ 'vocabulary_richness': len(set(words)) / len(words) if words else 0,
276
+
277
+ # Statistical features
278
+ 'word_length_std': np.std([len(w) for w in words]) if words else 0,
279
+ 'max_word_length': max([len(w) for w in words]) if words else 0,
280
+
281
+ # Pattern features
282
+ 'has_numbers': any(char.isdigit() for char in text),
283
+ 'uppercase_ratio': sum(1 for c in text if c.isupper()) / len(text) if text else 0,
284
+ 'punctuation_count': sum(1 for c in text if c in '.,!?;:')
285
+ }
286
+
287
+ def extract_all_features(self, example: Dict) -> Dict:
288
+ """Tüm feature'ları çıkar"""
289
+ text_features = self.extract_text_features(example['text'])
290
+
291
+ # Mevcut feature'ları koru
292
+ result = {**example}
293
+
294
+ # Yeni feature'ları ekle
295
+ for key, value in text_features.items():
296
+ result[f'feat_{key}'] = value
297
+
298
+ return result
299
+
300
+ # Test feature extraction
301
+ print("\n1️⃣ Basic Feature Extraction:")
302
+ extractor = FeatureExtractor()
303
+
304
+ sample_text = "This is a sample text for feature extraction! It has 123 numbers."
305
+ features = extractor.extract_text_features(sample_text)
306
+
307
+ print(f" Text: {sample_text}")
308
+ print(f"\n Extracted features:")
309
+ for key, value in features.items():
310
+ print(f" {key}: {value:.3f}" if isinstance(value, float) else f" {key}: {value}")
311
+
312
+ # Dataset'e uygula
313
+ print("\n2️⃣ Applying to Dataset:")
314
+ featured_dataset = dataset.map(
315
+ extractor.extract_all_features,
316
+ desc="Extracting features"
317
+ )
318
+
319
+ print(f"\n✅ Featured dataset:")
320
+ print(f" Original columns: {dataset.column_names}")
321
+ print(f" New columns: {featured_dataset.column_names}")
322
+ print(f" Total columns: {len(featured_dataset.column_names)}")
323
+
324
+ # Feature istatistikleri
325
+ print(f"\n📊 Feature Statistics:")
326
+ for col in ['feat_word_count', 'feat_vocabulary_richness', 'feat_punctuation_count']:
327
+ values = [ex[col] for ex in featured_dataset.select(range(100))]
328
+ print(f" {col}:")
329
+ print(f" Mean: {np.mean(values):.2f}")
330
+ print(f" Std: {np.std(values):.2f}")
331
+ print(f" Min/Max: {np.min(values):.2f} / {np.max(values):.2f}")
332
+
333
+
334
+ print("\n3️⃣ Advanced Transformations:")
335
+
336
+ class AdvancedTransformer:
337
+ """
338
+ Complex transformations
339
+ """
340
+ def __init__(self):
341
+ self.scaler_params = {}
342
+
343
+ def fit_scaler(self, dataset, columns):
344
+ """Scaling parameters'ı hesapla"""
345
+ print(" Fitting scaler...")
346
+ for col in columns:
347
+ values = [ex[col] for ex in dataset]
348
+ self.scaler_params[col] = {
349
+ 'mean': np.mean(values),
350
+ 'std': np.std(values),
351
+ 'min': np.min(values),
352
+ 'max': np.max(values)
353
+ }
354
+
355
+ def normalize(self, example, columns, method='minmax'):
356
+ """Feature normalization"""
357
+ result = {**example}
358
+
359
+ for col in columns:
360
+ value = example[col]
361
+ params = self.scaler_params.get(col, {})
362
+
363
+ if method == 'minmax':
364
+ # Min-max scaling [0, 1]
365
+ min_val = params.get('min', 0)
366
+ max_val = params.get('max', 1)
367
+ if max_val > min_val:
368
+ normalized = (value - min_val) / (max_val - min_val)
369
+ else:
370
+ normalized = 0
371
+ elif method == 'zscore':
372
+ # Z-score normalization
373
+ mean = params.get('mean', 0)
374
+ std = params.get('std', 1)
375
+ if std > 0:
376
+ normalized = (value - mean) / std
377
+ else:
378
+ normalized = 0
379
+ else:
380
+ normalized = value
381
+
382
+ result[f'{col}_normalized'] = normalized
383
+
384
+ return result
385
+
386
+ def create_interaction_features(self, example):
387
+ """Interaction features oluştur"""
388
+ result = {**example}
389
+
390
+ # Örnek: word_count * vocabulary_richness
391
+ if 'feat_word_count' in example and 'feat_vocabulary_richness' in example:
392
+ result['interaction_wc_vr'] = (
393
+ example['feat_word_count'] * example['feat_vocabulary_richness']
394
+ )
395
+
396
+ # Örnek: char_count / word_count (avg word length)
397
+ if 'feat_char_count' in example and 'feat_word_count' in example:
398
+ if example['feat_word_count'] > 0:
399
+ result['interaction_char_per_word'] = (
400
+ example['feat_char_count'] / example['feat_word_count']
401
+ )
402
+ else:
403
+ result['interaction_char_per_word'] = 0
404
+
405
+ return result
406
+
407
+ # Test transformations
408
+ transformer = AdvancedTransformer()
409
+
410
+ # Scaler fit et
411
+ numeric_features = ['feat_word_count', 'feat_char_count', 'feat_vocabulary_richness']
412
+ transformer.fit_scaler(featured_dataset, numeric_features)
413
+
414
+ print("\n Scaler parameters:")
415
+ for col, params in transformer.scaler_params.items():
416
+ print(f" {col}: μ={params['mean']:.2f}, σ={params['std']:.2f}")
417
+
418
+ # Normalize et
419
+ print("\n Normalizing features...")
420
+ normalized_dataset = featured_dataset.map(
421
+ lambda x: transformer.normalize(x, numeric_features, method='minmax'),
422
+ desc="Normalizing"
423
+ )
424
+
425
+ print(f"\n✅ Normalized dataset: {len(normalized_dataset)} examples")
426
+ print(f" New columns added: {[c for c in normalized_dataset.column_names if 'normalized' in c]}")
427
+
428
+ # Örnek normalized values
429
+ print(f"\n Sample normalized values:")
430
+ sample = normalized_dataset[0]
431
+ for col in numeric_features:
432
+ print(f" {col}: {sample[col]:.2f} → {sample[f'{col}_normalized']:.3f}")
433
+
434
+ # Interaction features
435
+ print("\n Creating interaction features...")
436
+ interaction_dataset = normalized_dataset.map(
437
+ transformer.create_interaction_features,
438
+ desc="Creating interactions"
439
+ )
440
+
441
+ print(f"\n✅ Interaction features added:")
442
+ print(f" interaction_wc_vr: {interaction_dataset[0]['interaction_wc_vr']:.3f}")
443
+ print(f" interaction_char_per_word: {interaction_dataset[0]['interaction_char_per_word']:.3f}")
444
+
445
+
446
+ print("\n" + "="*70)
447
+ print("3. DATASET PREPROCESSING PIPELINES")
448
+ print("="*70)
449
+
450
+ print("\n🔄 End-to-End Pipeline:")
451
+
452
+ class DataPipeline:
453
+ """
454
+ Modular preprocessing pipeline
455
+ """
456
+ def __init__(self, name="pipeline"):
457
+ self.name = name
458
+ self.steps = []
459
+ self.statistics = {}
460
+
461
+ def add_step(self, name: str, func: Callable, **kwargs):
462
+ """Pipeline'a step ekle"""
463
+ self.steps.append({
464
+ 'name': name,
465
+ 'func': func,
466
+ 'kwargs': kwargs
467
+ })
468
+ return self
469
+
470
+ def run(self, dataset: Dataset, verbose=True) -> Dataset:
471
+ """Pipeline'ı çalıştır"""
472
+ if verbose:
473
+ print(f"\n🚀 Running pipeline: {self.name}")
474
+ print(f" Input: {len(dataset)} examples, {len(dataset.column_names)} columns")
475
+
476
+ result = dataset
477
+
478
+ for i, step in enumerate(self.steps):
479
+ if verbose:
480
+ print(f"\n Step {i+1}/{len(self.steps)}: {step['name']}")
481
+
482
+ start_time = time.time()
483
+
484
+ # Step'i çalıştır
485
+ result = step['func'](result, **step['kwargs'])
486
+
487
+ elapsed = time.time() - start_time
488
+
489
+ if verbose:
490
+ print(f" ✓ Completed in {elapsed:.3f}s")
491
+ print(f" Output: {len(result)} examples, {len(result.column_names)} columns")
492
+
493
+ # İstatistikleri kaydet
494
+ self.statistics[step['name']] = {
495
+ 'elapsed_time': elapsed,
496
+ 'output_size': len(result),
497
+ 'output_columns': len(result.column_names)
498
+ }
499
+
500
+ if verbose:
501
+ print(f"\n✅ Pipeline completed!")
502
+ print(f" Total time: {sum(s['elapsed_time'] for s in self.statistics.values()):.3f}s")
503
+
504
+ return result
505
+
506
+ def get_statistics(self):
507
+ """Pipeline istatistiklerini al"""
508
+ return self.statistics
509
+
510
+
511
+ # Pipeline step'leri tanımla
512
+ def step_clean_text(dataset, min_length=10):
513
+ """Text temizleme step"""
514
+ def clean(example):
515
+ text = example['text'].strip()
516
+ text = ' '.join(text.split()) # Fazla boşlukları temizle
517
+ example['text_clean'] = text
518
+ return example
519
+
520
+ return dataset.map(clean, desc="Cleaning text")
521
+
522
+ def step_filter_short(dataset, min_words=5):
523
+ """Kısa metinleri filtrele"""
524
+ return dataset.filter(
525
+ lambda x: len(x['text'].split()) >= min_words,
526
+ desc="Filtering short texts"
527
+ )
528
+
529
+ def step_extract_features(dataset):
530
+ """Feature extraction"""
531
+ extractor = FeatureExtractor()
532
+ return dataset.map(
533
+ extractor.extract_all_features,
534
+ desc="Extracting features"
535
+ )
536
+
537
+ def step_normalize_features(dataset, columns):
538
+ """Feature normalization"""
539
+ transformer = AdvancedTransformer()
540
+ transformer.fit_scaler(dataset, columns)
541
+
542
+ return dataset.map(
543
+ lambda x: transformer.normalize(x, columns, method='minmax'),
544
+ desc="Normalizing features"
545
+ )
546
+
547
+ # Pipeline oluştur ve çalıştır
548
+ print("\n1️⃣ Creating Pipeline:")
549
+ pipeline = DataPipeline(name="Text Processing Pipeline")
550
+
551
+ pipeline.add_step("clean_text", step_clean_text, min_length=10)
552
+ pipeline.add_step("filter_short", step_filter_short, min_words=5)
553
+ pipeline.add_step("extract_features", step_extract_features)
554
+ pipeline.add_step("normalize_features", step_normalize_features,
555
+ columns=['feat_word_count', 'feat_char_count'])
556
+
557
+ # Yeni bir dataset oluştur
558
+ raw_dataset = create_sample_dataset(500)
559
+
560
+ # Pipeline'ı çalıştır
561
+ processed_dataset = pipeline.run(raw_dataset, verbose=True)
562
+
563
+ # Sonuçları göster
564
+ print(f"\n📊 Pipeline Results:")
565
+ print(f" Input examples: {len(raw_dataset)}")
566
+ print(f" Output examples: {len(processed_dataset)}")
567
+ print(f" Columns added: {len(processed_dataset.column_names) - len(raw_dataset.column_names)}")
568
+
569
+ # İstatistikler
570
+ print(f"\n📈 Step Statistics:")
571
+ for step_name, stats in pipeline.get_statistics().items():
572
+ print(f" {step_name}:")
573
+ print(f" Time: {stats['elapsed_time']:.3f}s")
574
+ print(f" Output size: {stats['output_size']}")
575
+
576
+
577
+ print("\n2️⃣ Reusable Pipeline Template:")
578
+
579
+ class PipelineTemplate:
580
+ """
581
+ Re-usable pipeline templates
582
+ """
583
+ @staticmethod
584
+ def basic_nlp_pipeline():
585
+ """Basic NLP preprocessing"""
586
+ pipeline = DataPipeline("Basic NLP")
587
+ pipeline.add_step("clean", step_clean_text)
588
+ pipeline.add_step("filter", step_filter_short, min_words=3)
589
+ return pipeline
590
+
591
+ @staticmethod
592
+ def feature_engineering_pipeline():
593
+ """Feature engineering pipeline"""
594
+ pipeline = DataPipeline("Feature Engineering")
595
+ pipeline.add_step("clean", step_clean_text)
596
+ pipeline.add_step("features", step_extract_features)
597
+ pipeline.add_step("normalize", step_normalize_features,
598
+ columns=['feat_word_count', 'feat_char_count',
599
+ 'feat_vocabulary_richness'])
600
+ return pipeline
601
+
602
+ @staticmethod
603
+ def full_pipeline():
604
+ """Complete preprocessing pipeline"""
605
+ pipeline = DataPipeline("Full Pipeline")
606
+ pipeline.add_step("clean", step_clean_text, min_length=10)
607
+ pipeline.add_step("filter", step_filter_short, min_words=5)
608
+ pipeline.add_step("features", step_extract_features)
609
+ pipeline.add_step("normalize", step_normalize_features,
610
+ columns=['feat_word_count', 'feat_char_count'])
611
+ return pipeline
612
+
613
+ # Template kullanımı
614
+ print("\n Using pipeline template:")
615
+ template_pipeline = PipelineTemplate.feature_engineering_pipeline()
616
+ print(f" Pipeline: {template_pipeline.name}")
617
+ print(f" Steps: {[s['name'] for s in template_pipeline.steps]}")
618
+
619
+
620
+ print("\n" + "="*70)
621
+ print("4. DATA AUGMENTATION STRATEGIES")
622
+ print("="*70)
623
+
624
+ print("\n🎲 Data Augmentation Teknikleri:")
625
+
626
+ class DataAugmenter:
627
+ """
628
+ Data augmentation methods
629
+ """
630
+ def __init__(self, augmentation_prob=0.3):
631
+ self.augmentation_prob = augmentation_prob
632
+
633
+ def random_word_deletion(self, text: str, p=0.1) -> str:
634
+ """Random kelime silme"""
635
+ words = text.split()
636
+ if len(words) <= 2:
637
+ return text
638
+
639
+ new_words = [w for w in words if random.random() > p]
640
+
641
+ # En az 1 kelime kalsın
642
+ if len(new_words) == 0:
643
+ new_words = [random.choice(words)]
644
+
645
+ return ' '.join(new_words)
646
+
647
+ def random_word_swap(self, text: str, n=1) -> str:
648
+ """Random kelime yer değiştirme"""
649
+ words = text.split()
650
+ if len(words) < 2:
651
+ return text
652
+
653
+ for _ in range(n):
654
+ idx1, idx2 = random.sample(range(len(words)), 2)
655
+ words[idx1], words[idx2] = words[idx2], words[idx1]
656
+
657
+ return ' '.join(words)
658
+
659
+ def synonym_replacement(self, text: str, p=0.1) -> str:
660
+ """
661
+ Synonym replacement (basitleştirilmiş)
662
+ Gerçek uygulamada WordNet veya embedding kullanılır
663
+ """
664
+ synonyms = {
665
+ 'good': ['great', 'excellent', 'nice'],
666
+ 'bad': ['poor', 'terrible', 'awful'],
667
+ 'big': ['large', 'huge', 'enormous'],
668
+ 'small': ['tiny', 'little', 'mini']
669
+ }
670
+
671
+ words = text.split()
672
+ new_words = []
673
+
674
+ for word in words:
675
+ if word.lower() in synonyms and random.random() < p:
676
+ new_words.append(random.choice(synonyms[word.lower()]))
677
+ else:
678
+ new_words.append(word)
679
+
680
+ return ' '.join(new_words)
681
+
682
+ def augment_example(self, example: Dict) -> Dict:
683
+ """Tek bir örneği augment et"""
684
+ if random.random() > self.augmentation_prob:
685
+ return example
686
+
687
+ text = example['text']
688
+
689
+ # Random augmentation seç
690
+ aug_method = random.choice([
691
+ self.random_word_deletion,
692
+ self.random_word_swap,
693
+ self.synonym_replacement
694
+ ])
695
+
696
+ augmented_text = aug_method(text)
697
+
698
+ return {
699
+ **example,
700
+ 'text_augmented': augmented_text,
701
+ 'is_augmented': True
702
+ }
703
+
704
+ def augment_dataset(self, dataset: Dataset, num_augmentations=1) -> Dataset:
705
+ """Dataset'i augment et"""
706
+ augmented_examples = []
707
+
708
+ for example in dataset:
709
+ # Original örneği ekle
710
+ augmented_examples.append({
711
+ **example,
712
+ 'is_augmented': False,
713
+ 'text_augmented': example['text']
714
+ })
715
+
716
+ # Augmented versiyonları ekle
717
+ for _ in range(num_augmentations):
718
+ aug_example = self.augment_example(example)
719
+ augmented_examples.append(aug_example)
720
+
721
+ # Dict of lists'e çevir
722
+ dict_data = defaultdict(list)
723
+ for example in augmented_examples:
724
+ for key, value in example.items():
725
+ dict_data[key].append(value)
726
+
727
+ return Dataset.from_dict(dict(dict_data))
728
+
729
+
730
+ print("\n1️⃣ Augmentation Examples:")
731
+ augmenter = DataAugmenter(augmentation_prob=1.0) # Her zaman augment et
732
+
733
+ test_texts = [
734
+ "This is a good example of text augmentation",
735
+ "The big dog ran fast in the park",
736
+ "Data augmentation is important for ML"
737
+ ]
738
+
739
+ for i, text in enumerate(test_texts):
740
+ print(f"\n Original {i+1}: {text}")
741
+ print(f" Deletion: {augmenter.random_word_deletion(text, p=0.2)}")
742
+ print(f" Swap: {augmenter.random_word_swap(text, n=2)}")
743
+ print(f" Synonym: {augmenter.synonym_replacement(text, p=0.3)}")
744
+
745
+
746
+ print("\n2️⃣ Augmenting Dataset:")
747
+ small_dataset = create_sample_dataset(50)
748
+
749
+ print(f" Original dataset: {len(small_dataset)} examples")
750
+
751
+ # Augment et (her örnek için 2 augmented versiyon)
752
+ augmented_dataset = augmenter.augment_dataset(small_dataset, num_augmentations=2)
753
+
754
+ print(f" Augmented dataset: {len(augmented_dataset)} examples")
755
+ print(f" Augmented ratio: {len(augmented_dataset) / len(small_dataset):.1f}x")
756
+
757
+ # Augmented örnekleri göster
758
+ print(f"\n Sample augmentations:")
759
+ for i in range(3):
760
+ original_idx = i * 3 # Original
761
+ aug_idx = i * 3 + 1 # First augmentation
762
+
763
+ orig = augmented_dataset[original_idx]
764
+ aug = augmented_dataset[aug_idx]
765
+
766
+ print(f"\n Example {i+1}:")
767
+ print(f" Original: {orig['text'][:60]}...")
768
+ print(f" Augmented: {aug['text_augmented'][:60]}...")
769
+ print(f" Is augmented: {aug['is_augmented']}")
770
+
771
+
772
+ print("\n3️⃣ Smart Augmentation - Class Balancing:")
773
+
774
+ def smart_augment_for_balance(dataset, label_column='label', target_per_class=100):
775
+ """
776
+ Class'ları balance etmek için smart augmentation
777
+ """
778
+ augmenter = DataAugmenter(augmentation_prob=1.0)
779
+
780
+ # Label distribution'ı hesapla
781
+ labels = [ex[label_column] for ex in dataset]
782
+ label_counts = {label: labels.count(label) for label in set(labels)}
783
+
784
+ print(f"\n Original distribution:")
785
+ for label, count in sorted(label_counts.items()):
786
+ print(f" Label {label}: {count} examples")
787
+
788
+ # Balanced dataset oluştur
789
+ balanced_examples = []
790
+
791
+ for label in set(labels):
792
+ # Bu label'a ait örnekleri al
793
+ label_examples = [ex for ex in dataset if ex[label_column] == label]
794
+ current_count = len(label_examples)
795
+
796
+ # Original örnekleri ekle
797
+ for ex in label_examples:
798
+ balanced_examples.append({
799
+ **ex,
800
+ 'is_augmented': False,
801
+ 'text_augmented': ex['text']
802
+ })
803
+
804
+ # Eksik kısmı augmentation ile doldur
805
+ if current_count < target_per_class:
806
+ needed = target_per_class - current_count
807
+
808
+ for i in range(needed):
809
+ # Cycle through examples
810
+ source_ex = label_examples[i % len(label_examples)]
811
+ aug_ex = augmenter.augment_example(source_ex)
812
+ balanced_examples.append(aug_ex)
813
+
814
+ # Dataset'e çevir
815
+ dict_data = defaultdict(list)
816
+ for example in balanced_examples:
817
+ for key, value in example.items():
818
+ dict_data[key].append(value)
819
+
820
+ return Dataset.from_dict(dict(dict_data))
821
+
822
+ # Test smart augmentation
823
+ print("\n Applying smart augmentation for balance:")
824
+ balanced_dataset = smart_augment_for_balance(small_dataset, target_per_class=60)
825
+
826
+ print(f"\n Balanced distribution:")
827
+ balanced_labels = [ex['label'] for ex in balanced_dataset]
828
+ balanced_counts = {label: balanced_labels.count(label) for label in set(balanced_labels)}
829
+ for label, count in sorted(balanced_counts.items()):
830
+ print(f" Label {label}: {count} examples")
831
+
832
+ print(f"\n Total examples: {len(small_dataset)} → {len(balanced_dataset)}")
833
+
834
+
835
+ print("\n" + "="*70)
836
+ print("✅ BÖLÜM 3 TAMAMLANDI! (Devam ediyor...)")
837
+ print("="*70)
838
+
839
+ print("""
840
+ Bu bölümde öğrendikleriniz (1. Kısım):
841
+ ✓ Custom Data Collators (3 tip)
842
+ ✓ Advanced Feature Extraction
843
+ ✓ Feature Transformation & Normalization
844
+ ✓ Preprocessing Pipelines
845
+ ✓ Data Augmentation Strategies
846
+ ✓ Smart Class Balancing
847
+
848
+ 📚 SONRAKI: Advanced Filtering & Sampling
849
+ - Complex filtering strategies
850
+ - Stratified sampling
851
+ - Active learning sampling
852
+ - Diversity sampling
853
+ """)
854
+
855
+ print("\n▶️ Devam ediyoruz...")
856
+ time.sleep(1)
space/modules/03_ileri_teknikler_part2.py ADDED
@@ -0,0 +1,776 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ İLERİ TEKNİKLER - PART 2
3
+ ========================
4
+
5
+ Bu modülde öğrenecekleriniz:
6
+ 5. Advanced Filtering & Sampling
7
+ 6. Dynamic Batching
8
+ 7. Active Learning Integration
9
+ """
10
+
11
+ from datasets import Dataset
12
+ import numpy as np
13
+ from typing import Dict, List, Any
14
+ import random
15
+ from collections import defaultdict, Counter
16
+
17
+ print("\n" + "="*70)
18
+ print("5. ADVANCED FILTERING & SAMPLING")
19
+ print("="*70)
20
+
21
+ # Dataset oluştur
22
+ def create_diverse_dataset(num_samples=1000):
23
+ def gen():
24
+ domains = ['science', 'tech', 'sports', 'politics', 'entertainment']
25
+ difficulties = ['easy', 'medium', 'hard']
26
+
27
+ for i in range(num_samples):
28
+ domain = np.random.choice(domains)
29
+ difficulty = np.random.choice(difficulties)
30
+
31
+ yield {
32
+ 'id': i,
33
+ 'text': f"Sample text {i} in {domain} " * np.random.randint(5, 20),
34
+ 'domain': domain,
35
+ 'difficulty': difficulty,
36
+ 'score': np.random.random(),
37
+ 'label': np.random.randint(0, 3),
38
+ 'length': np.random.randint(50, 500),
39
+ 'quality': np.random.choice(['high', 'medium', 'low'])
40
+ }
41
+ return Dataset.from_generator(gen)
42
+
43
+ dataset = create_diverse_dataset(1000)
44
+ print(f"✅ Dataset: {len(dataset)} örnekler")
45
+
46
+
47
+ print("\n1️⃣ Complex Multi-Condition Filtering:")
48
+
49
+ class AdvancedFilter:
50
+ """
51
+ Complex filtering with multiple conditions
52
+ """
53
+ @staticmethod
54
+ def filter_by_multiple_conditions(dataset, conditions: List[callable]):
55
+ """
56
+ Birden fazla koşulu AND ile uygula
57
+ """
58
+ def combined_filter(example):
59
+ return all(condition(example) for condition in conditions)
60
+
61
+ return dataset.filter(combined_filter, desc="Multi-condition filtering")
62
+
63
+ @staticmethod
64
+ def filter_by_score_percentile(dataset, percentile=75, column='score'):
65
+ """
66
+ Belirli percentile'ın üstündeki örnekleri filtrele
67
+ """
68
+ scores = [ex[column] for ex in dataset]
69
+ threshold = np.percentile(scores, percentile)
70
+
71
+ return dataset.filter(
72
+ lambda x: x[column] >= threshold,
73
+ desc=f"Filtering top {100-percentile}%"
74
+ )
75
+
76
+ @staticmethod
77
+ def filter_balanced_classes(dataset, label_column='label', samples_per_class=100):
78
+ """
79
+ Her class'tan eşit sayıda örnek al
80
+ """
81
+ # Label'lara göre grupla
82
+ label_groups = defaultdict(list)
83
+ for i, ex in enumerate(dataset):
84
+ label_groups[ex[label_column]].append(i)
85
+
86
+ # Her class'tan sample al
87
+ selected_indices = []
88
+ for label, indices in label_groups.items():
89
+ # Random sample
90
+ n_samples = min(samples_per_class, len(indices))
91
+ sampled = random.sample(indices, n_samples)
92
+ selected_indices.extend(sampled)
93
+
94
+ return dataset.select(sorted(selected_indices))
95
+
96
+ # Test filters
97
+ print("\n Complex filtering örneği:")
98
+
99
+ # Birden fazla koşul
100
+ conditions = [
101
+ lambda x: x['length'] > 100, # Uzun metinler
102
+ lambda x: x['score'] > 0.5, # Yüksek score
103
+ lambda x: x['quality'] == 'high' # Yüksek kalite
104
+ ]
105
+
106
+ filtered = AdvancedFilter.filter_by_multiple_conditions(dataset, conditions)
107
+ print(f" Original: {len(dataset)} examples")
108
+ print(f" Filtered (length>100 AND score>0.5 AND quality=high): {len(filtered)} examples")
109
+ print(f" Kept: {len(filtered)/len(dataset)*100:.1f}%")
110
+
111
+ # Percentile filtering
112
+ print("\n Percentile filtering:")
113
+ top_25 = AdvancedFilter.filter_by_score_percentile(dataset, percentile=75)
114
+ print(f" Top 25% by score: {len(top_25)} examples")
115
+
116
+ # Balanced sampling
117
+ print("\n Balanced class sampling:")
118
+ balanced = AdvancedFilter.filter_balanced_classes(dataset, samples_per_class=100)
119
+ labels = [ex['label'] for ex in balanced]
120
+ label_dist = Counter(labels)
121
+ print(f" Total: {len(balanced)} examples")
122
+ print(f" Distribution: {dict(label_dist)}")
123
+
124
+
125
+ print("\n2️⃣ Stratified Sampling:")
126
+
127
+ class StratifiedSampler:
128
+ """
129
+ Stratified sampling for representative splits
130
+ """
131
+ @staticmethod
132
+ def stratified_split(dataset,
133
+ stratify_column='label',
134
+ train_ratio=0.8,
135
+ seed=42):
136
+ """
137
+ Stratified train/test split
138
+ """
139
+ random.seed(seed)
140
+
141
+ # Group by stratify column
142
+ groups = defaultdict(list)
143
+ for i, ex in enumerate(dataset):
144
+ groups[ex[stratify_column]].append(i)
145
+
146
+ train_indices = []
147
+ test_indices = []
148
+
149
+ # Split each group
150
+ for group_indices in groups.values():
151
+ random.shuffle(group_indices)
152
+ split_point = int(len(group_indices) * train_ratio)
153
+ train_indices.extend(group_indices[:split_point])
154
+ test_indices.extend(group_indices[split_point:])
155
+
156
+ train_dataset = dataset.select(sorted(train_indices))
157
+ test_dataset = dataset.select(sorted(test_indices))
158
+
159
+ return train_dataset, test_dataset
160
+
161
+ @staticmethod
162
+ def multi_stratified_split(dataset,
163
+ stratify_columns=['label', 'domain'],
164
+ train_ratio=0.8,
165
+ seed=42):
166
+ """
167
+ Multiple column stratification
168
+ """
169
+ random.seed(seed)
170
+
171
+ # Create combined stratification key
172
+ groups = defaultdict(list)
173
+ for i, ex in enumerate(dataset):
174
+ key = tuple(ex[col] for col in stratify_columns)
175
+ groups[key].append(i)
176
+
177
+ train_indices = []
178
+ test_indices = []
179
+
180
+ # Split each group
181
+ for group_indices in groups.values():
182
+ random.shuffle(group_indices)
183
+ split_point = int(len(group_indices) * train_ratio)
184
+ train_indices.extend(group_indices[:split_point])
185
+ test_indices.extend(group_indices[split_point:])
186
+
187
+ train_dataset = dataset.select(sorted(train_indices))
188
+ test_dataset = dataset.select(sorted(test_indices))
189
+
190
+ return train_dataset, test_dataset
191
+
192
+ # Test stratified sampling
193
+ print("\n Single column stratification (label):")
194
+ train, test = StratifiedSampler.stratified_split(dataset, stratify_column='label')
195
+
196
+ print(f" Train: {len(train)} examples")
197
+ train_labels = [ex['label'] for ex in train]
198
+ train_dist = Counter(train_labels)
199
+ print(f" Train distribution: {dict(train_dist)}")
200
+
201
+ print(f"\n Test: {len(test)} examples")
202
+ test_labels = [ex['label'] for ex in test]
203
+ test_dist = Counter(test_labels)
204
+ print(f" Test distribution: {dict(test_dist)}")
205
+
206
+ # Multi-column stratification
207
+ print("\n Multi-column stratification (label + domain):")
208
+ train_multi, test_multi = StratifiedSampler.multi_stratified_split(
209
+ dataset,
210
+ stratify_columns=['label', 'domain']
211
+ )
212
+
213
+ print(f" Train: {len(train_multi)} examples")
214
+ print(f" Test: {len(test_multi)} examples")
215
+
216
+ # Check distribution
217
+ train_combos = [(ex['label'], ex['domain']) for ex in train_multi.select(range(min(100, len(train_multi))))]
218
+ print(f" Sample combinations in train: {len(set(train_combos))} unique")
219
+
220
+
221
+ print("\n3️⃣ Diversity Sampling:")
222
+
223
+ class DiversitySampler:
224
+ """
225
+ Sample diverse examples from dataset
226
+ """
227
+ @staticmethod
228
+ def max_diversity_sampling(dataset,
229
+ n_samples=100,
230
+ feature_columns=['length', 'score'],
231
+ seed=42):
232
+ """
233
+ Maksimum diversity için örnekleri seç
234
+ K-means benzeri algoritma
235
+ """
236
+ random.seed(seed)
237
+ np.random.seed(seed)
238
+
239
+ # Feature matrix oluştur
240
+ features = []
241
+ for ex in dataset:
242
+ feat_vector = [ex[col] for col in feature_columns]
243
+ features.append(feat_vector)
244
+ features = np.array(features)
245
+
246
+ # Normalize
247
+ features = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)
248
+
249
+ # Greedy selection
250
+ selected_indices = []
251
+
252
+ # İlk örneği random seç
253
+ first_idx = random.randint(0, len(dataset) - 1)
254
+ selected_indices.append(first_idx)
255
+
256
+ # Kalan örnekleri seç
257
+ for _ in range(n_samples - 1):
258
+ max_dist = -1
259
+ best_idx = -1
260
+
261
+ # Her aday için min distance to selected hesapla
262
+ for candidate_idx in range(len(dataset)):
263
+ if candidate_idx in selected_indices:
264
+ continue
265
+
266
+ # Min distance to any selected point
267
+ min_dist = float('inf')
268
+ for sel_idx in selected_indices:
269
+ dist = np.linalg.norm(
270
+ features[candidate_idx] - features[sel_idx]
271
+ )
272
+ min_dist = min(min_dist, dist)
273
+
274
+ # En uzak olanı seç
275
+ if min_dist > max_dist:
276
+ max_dist = min_dist
277
+ best_idx = candidate_idx
278
+
279
+ if best_idx != -1:
280
+ selected_indices.append(best_idx)
281
+
282
+ return dataset.select(selected_indices)
283
+
284
+ @staticmethod
285
+ def coverage_based_sampling(dataset,
286
+ coverage_column='domain',
287
+ n_samples_per_value=20):
288
+ """
289
+ Her category'den belirli sayıda örnek al (coverage)
290
+ """
291
+ groups = defaultdict(list)
292
+ for i, ex in enumerate(dataset):
293
+ groups[ex[coverage_column]].append(i)
294
+
295
+ selected_indices = []
296
+ for group_indices in groups.values():
297
+ n = min(n_samples_per_value, len(group_indices))
298
+ sampled = random.sample(group_indices, n)
299
+ selected_indices.extend(sampled)
300
+
301
+ return dataset.select(sorted(selected_indices))
302
+
303
+ # Test diversity sampling
304
+ print("\n Max diversity sampling:")
305
+ diverse_sample = DiversitySampler.max_diversity_sampling(
306
+ dataset,
307
+ n_samples=100,
308
+ feature_columns=['length', 'score']
309
+ )
310
+
311
+ print(f" Selected: {len(diverse_sample)} diverse examples")
312
+
313
+ # Diversity ölçüsü
314
+ lengths = [ex['length'] for ex in diverse_sample]
315
+ scores = [ex['score'] for ex in diverse_sample]
316
+ print(f" Length range: {min(lengths)} - {max(lengths)}")
317
+ print(f" Length std: {np.std(lengths):.2f}")
318
+ print(f" Score range: {min(scores):.3f} - {max(scores):.3f}")
319
+
320
+ # Coverage sampling
321
+ print("\n Coverage-based sampling:")
322
+ coverage_sample = DiversitySampler.coverage_based_sampling(
323
+ dataset,
324
+ coverage_column='domain',
325
+ n_samples_per_value=20
326
+ )
327
+
328
+ print(f" Selected: {len(coverage_sample)} examples")
329
+ domains = [ex['domain'] for ex in coverage_sample]
330
+ domain_dist = Counter(domains)
331
+ print(f" Domain distribution: {dict(domain_dist)}")
332
+
333
+
334
+ print("\n4️⃣ Active Learning Sampling:")
335
+
336
+ class ActiveLearningSampler:
337
+ """
338
+ Active learning için uncertainty-based sampling
339
+ """
340
+ @staticmethod
341
+ def uncertainty_sampling(dataset,
342
+ uncertainty_scores: List[float],
343
+ n_samples=100,
344
+ strategy='least_confident'):
345
+ """
346
+ Model uncertainty'ye göre sample
347
+ """
348
+ if len(uncertainty_scores) != len(dataset):
349
+ raise ValueError("Uncertainty scores must match dataset size")
350
+
351
+ # Strategy'ye göre sırala
352
+ if strategy == 'least_confident':
353
+ # En düşük confidence (en yüksek uncertainty)
354
+ sorted_indices = np.argsort(uncertainty_scores)[::-1]
355
+ elif strategy == 'margin':
356
+ # En düşük margin
357
+ sorted_indices = np.argsort(uncertainty_scores)
358
+ else:
359
+ sorted_indices = np.argsort(uncertainty_scores)[::-1]
360
+
361
+ # Top n'i seç
362
+ selected_indices = sorted_indices[:n_samples].tolist()
363
+
364
+ return dataset.select(selected_indices)
365
+
366
+ @staticmethod
367
+ def diversity_uncertainty_sampling(dataset,
368
+ uncertainty_scores: List[float],
369
+ n_samples=100,
370
+ diversity_weight=0.5):
371
+ """
372
+ Uncertainty + Diversity kombinasyonu
373
+ """
374
+ # Simulated diversity scores (gerçekte embedding distance kullanılır)
375
+ diversity_scores = [random.random() for _ in range(len(dataset))]
376
+
377
+ # Combined score
378
+ combined_scores = [
379
+ (1 - diversity_weight) * uncertainty_scores[i] +
380
+ diversity_weight * diversity_scores[i]
381
+ for i in range(len(dataset))
382
+ ]
383
+
384
+ # Top n
385
+ sorted_indices = np.argsort(combined_scores)[::-1]
386
+ selected_indices = sorted_indices[:n_samples].tolist()
387
+
388
+ return dataset.select(selected_indices)
389
+
390
+ # Test active learning sampling
391
+ print("\n Uncertainty-based sampling:")
392
+
393
+ # Simulate uncertainty scores (gerçekte model'den gelir)
394
+ uncertainty_scores = [random.random() for _ in range(len(dataset))]
395
+
396
+ uncertain_sample = ActiveLearningSampler.uncertainty_sampling(
397
+ dataset,
398
+ uncertainty_scores,
399
+ n_samples=50,
400
+ strategy='least_confident'
401
+ )
402
+
403
+ print(f" Selected: {len(uncertain_sample)} most uncertain examples")
404
+ selected_uncertainties = [uncertainty_scores[i] for i in range(50)]
405
+ print(f" Avg uncertainty: {np.mean(selected_uncertainties):.3f}")
406
+
407
+ # Diversity + Uncertainty
408
+ print("\n Diversity + Uncertainty sampling:")
409
+ diverse_uncertain = ActiveLearningSampler.diversity_uncertainty_sampling(
410
+ dataset,
411
+ uncertainty_scores,
412
+ n_samples=50,
413
+ diversity_weight=0.3 # 30% diversity, 70% uncertainty
414
+ )
415
+
416
+ print(f" Selected: {len(diverse_uncertain)} examples")
417
+
418
+
419
+ print("\n" + "="*70)
420
+ print("6. DYNAMIC BATCHING")
421
+ print("="*70)
422
+
423
+ print("\n📦 Dynamic Batching Strategies:")
424
+
425
+ class DynamicBatcher:
426
+ """
427
+ Dynamic batching for efficient training
428
+ """
429
+ def __init__(self, dataset, batch_size=32):
430
+ self.dataset = dataset
431
+ self.batch_size = batch_size
432
+
433
+ def length_based_batching(self, length_column='length', max_length_diff=50):
434
+ """
435
+ Benzer uzunluktaki örnekleri aynı batch'te topla
436
+ """
437
+ # Uzunluğa göre sırala
438
+ sorted_indices = sorted(
439
+ range(len(self.dataset)),
440
+ key=lambda i: self.dataset[i][length_column]
441
+ )
442
+
443
+ # Batch'leri oluştur
444
+ batches = []
445
+ for i in range(0, len(sorted_indices), self.batch_size):
446
+ batch_indices = sorted_indices[i:i + self.batch_size]
447
+ batches.append(self.dataset.select(batch_indices))
448
+
449
+ return batches
450
+
451
+ def bucket_batching(self, length_column='length', n_buckets=5):
452
+ """
453
+ Bucket-based batching - uzunluklara göre gruplama
454
+ """
455
+ lengths = [ex[length_column] for ex in self.dataset]
456
+ min_len, max_len = min(lengths), max(lengths)
457
+
458
+ # Bucket boundaries
459
+ bucket_size = (max_len - min_len) / n_buckets
460
+ buckets = [[] for _ in range(n_buckets)]
461
+
462
+ # Örnekleri bucket'lara ata
463
+ for i, ex in enumerate(self.dataset):
464
+ length = ex[length_column]
465
+ bucket_idx = min(int((length - min_len) / bucket_size), n_buckets - 1)
466
+ buckets[bucket_idx].append(i)
467
+
468
+ # Her bucket'tan batch'ler oluştur
469
+ all_batches = []
470
+ for bucket_indices in buckets:
471
+ random.shuffle(bucket_indices)
472
+ for i in range(0, len(bucket_indices), self.batch_size):
473
+ batch_indices = bucket_indices[i:i + self.batch_size]
474
+ all_batches.append(self.dataset.select(batch_indices))
475
+
476
+ return all_batches
477
+
478
+ def get_batch_statistics(self, batches, length_column='length'):
479
+ """
480
+ Batch istatistiklerini hesapla
481
+ """
482
+ stats = []
483
+ for i, batch in enumerate(batches):
484
+ lengths = [ex[length_column] for ex in batch]
485
+ stats.append({
486
+ 'batch_id': i,
487
+ 'size': len(batch),
488
+ 'min_length': min(lengths),
489
+ 'max_length': max(lengths),
490
+ 'avg_length': np.mean(lengths),
491
+ 'std_length': np.std(lengths)
492
+ })
493
+ return stats
494
+
495
+ # Test dynamic batching
496
+ print("\n1️⃣ Length-based Batching:")
497
+ batcher = DynamicBatcher(dataset, batch_size=50)
498
+
499
+ length_batches = batcher.length_based_batching(length_column='length')
500
+ print(f" Total batches: {len(length_batches)}")
501
+
502
+ # İlk 5 batch'in istatistikleri
503
+ stats = batcher.get_batch_statistics(length_batches[:5])
504
+ print(f"\n First 5 batch statistics:")
505
+ for stat in stats:
506
+ print(f" Batch {stat['batch_id']}: "
507
+ f"size={stat['size']}, "
508
+ f"length range=[{stat['min_length']}-{stat['max_length']}], "
509
+ f"std={stat['std_length']:.1f}")
510
+
511
+ # Padding efficiency
512
+ print(f"\n Padding efficiency:")
513
+ total_padding = sum(
514
+ (stat['max_length'] - stat['avg_length']) * stat['size']
515
+ for stat in stats
516
+ )
517
+ print(f" Average padding per example: {total_padding / sum(s['size'] for s in stats):.1f}")
518
+
519
+
520
+ print("\n2️⃣ Bucket Batching:")
521
+ bucket_batches = batcher.bucket_batching(n_buckets=5)
522
+ print(f" Total batches: {len(bucket_batches)}")
523
+
524
+ # Bucket istatistikleri
525
+ bucket_stats = batcher.get_batch_statistics(bucket_batches[:10])
526
+ print(f"\n Sample bucket statistics:")
527
+ for stat in bucket_stats[:5]:
528
+ print(f" Batch {stat['batch_id']}: "
529
+ f"size={stat['size']}, "
530
+ f"length range=[{stat['min_length']}-{stat['max_length']}]")
531
+
532
+
533
+ print("\n3️⃣ Smart Batch Composition:")
534
+
535
+ class SmartBatcher:
536
+ """
537
+ Intelligent batch composition
538
+ """
539
+ @staticmethod
540
+ def create_balanced_batches(dataset,
541
+ label_column='label',
542
+ batch_size=32):
543
+ """
544
+ Her batch'te class balance sağla
545
+ """
546
+ # Label'lara göre grupla
547
+ label_groups = defaultdict(list)
548
+ for i, ex in enumerate(dataset):
549
+ label_groups[ex[label_column]].append(i)
550
+
551
+ # Her label'dan eşit sayıda örnek al
552
+ n_labels = len(label_groups)
553
+ per_label = batch_size // n_labels
554
+
555
+ batches = []
556
+ max_iterations = max(len(indices) for indices in label_groups.values()) // per_label
557
+
558
+ for iteration in range(max_iterations):
559
+ batch_indices = []
560
+
561
+ for label, indices in label_groups.items():
562
+ start = iteration * per_label
563
+ end = start + per_label
564
+ if start < len(indices):
565
+ batch_indices.extend(indices[start:min(end, len(indices))])
566
+
567
+ if batch_indices:
568
+ random.shuffle(batch_indices)
569
+ batches.append(dataset.select(batch_indices))
570
+
571
+ return batches
572
+
573
+ @staticmethod
574
+ def create_diverse_batches(dataset,
575
+ diversity_column='domain',
576
+ batch_size=32):
577
+ """
578
+ Her batch'te çeşitlilik sağla
579
+ """
580
+ groups = defaultdict(list)
581
+ for i, ex in enumerate(dataset):
582
+ groups[ex[diversity_column]].append(i)
583
+
584
+ # Round-robin şeklinde batch oluştur
585
+ all_indices = list(range(len(dataset)))
586
+ random.shuffle(all_indices)
587
+
588
+ batches = []
589
+ for i in range(0, len(all_indices), batch_size):
590
+ batch_indices = all_indices[i:i + batch_size]
591
+ batches.append(dataset.select(batch_indices))
592
+
593
+ return batches
594
+
595
+ # Test smart batching
596
+ print("\n Balanced batches:")
597
+ balanced_batches = SmartBatcher.create_balanced_batches(dataset, batch_size=30)
598
+ print(f" Created: {len(balanced_batches)} batches")
599
+
600
+ # İlk batch'in label distribution'ı
601
+ first_batch_labels = [ex['label'] for ex in balanced_batches[0]]
602
+ label_dist = Counter(first_batch_labels)
603
+ print(f" First batch label distribution: {dict(label_dist)}")
604
+
605
+ print("\n Diverse batches:")
606
+ diverse_batches = SmartBatcher.create_diverse_batches(dataset, batch_size=30)
607
+ print(f" Created: {len(diverse_batches)} batches")
608
+
609
+ # İlk batch'in domain distribution'ı
610
+ first_batch_domains = [ex['domain'] for ex in diverse_batches[0]]
611
+ domain_dist = Counter(first_batch_domains)
612
+ print(f" First batch domain distribution: {dict(domain_dist)}")
613
+
614
+
615
+ print("\n" + "="*70)
616
+ print("7. PRODUCTION-READY PATTERNS")
617
+ print("="*70)
618
+
619
+ print("\n🎯 Real-World Integration Patterns:")
620
+
621
+ class DatasetManager:
622
+ """
623
+ Production-ready dataset management
624
+ """
625
+ def __init__(self, dataset, validation_rules=None):
626
+ self.dataset = dataset
627
+ self.validation_rules = validation_rules or []
628
+ self.statistics = {}
629
+
630
+ def validate(self):
631
+ """Dataset'i validate et"""
632
+ print("\n Validating dataset...")
633
+ issues = []
634
+
635
+ # Temel validations
636
+ if len(self.dataset) == 0:
637
+ issues.append("Dataset is empty")
638
+
639
+ # Custom validation rules
640
+ for rule in self.validation_rules:
641
+ try:
642
+ result = rule(self.dataset)
643
+ if not result['valid']:
644
+ issues.append(result['message'])
645
+ except Exception as e:
646
+ issues.append(f"Validation error: {str(e)}")
647
+
648
+ if issues:
649
+ print(f" ⚠️ Found {len(issues)} issues:")
650
+ for issue in issues:
651
+ print(f" - {issue}")
652
+ return False
653
+ else:
654
+ print(" ✅ Validation passed")
655
+ return True
656
+
657
+ def compute_statistics(self):
658
+ """İstatistikleri hesapla"""
659
+ print("\n Computing statistics...")
660
+
661
+ self.statistics = {
662
+ 'size': len(self.dataset),
663
+ 'columns': self.dataset.column_names,
664
+ 'memory_size': len(str(self.dataset)), # Approximation
665
+ }
666
+
667
+ # Numeric column statistics
668
+ for col in self.dataset.column_names:
669
+ try:
670
+ values = [ex[col] for ex in self.dataset.select(range(min(100, len(self.dataset))))]
671
+ if all(isinstance(v, (int, float)) for v in values):
672
+ self.statistics[f'{col}_stats'] = {
673
+ 'mean': np.mean(values),
674
+ 'std': np.std(values),
675
+ 'min': np.min(values),
676
+ 'max': np.max(values)
677
+ }
678
+ except:
679
+ pass
680
+
681
+ print(f" ✅ Statistics computed")
682
+ return self.statistics
683
+
684
+ def summary(self):
685
+ """Dataset özeti"""
686
+ print(f"\n📊 Dataset Summary:")
687
+ print(f" Size: {len(self.dataset):,} examples")
688
+ print(f" Columns: {len(self.dataset.column_names)}")
689
+ print(f" Column names: {', '.join(self.dataset.column_names[:5])}...")
690
+
691
+ # Test production patterns
692
+ print("\n Creating dataset manager:")
693
+
694
+ # Custom validation rules
695
+ def check_text_length(dataset):
696
+ lengths = [len(ex['text']) for ex in dataset.select(range(min(100, len(dataset))))]
697
+ avg_length = np.mean(lengths)
698
+ return {
699
+ 'valid': avg_length > 10,
700
+ 'message': f"Average text length too short: {avg_length:.1f}"
701
+ }
702
+
703
+ def check_label_distribution(dataset):
704
+ labels = [ex['label'] for ex in dataset]
705
+ label_counts = Counter(labels)
706
+ min_count = min(label_counts.values())
707
+ return {
708
+ 'valid': min_count >= 10,
709
+ 'message': f"Imbalanced labels: min count = {min_count}"
710
+ }
711
+
712
+ manager = DatasetManager(
713
+ dataset,
714
+ validation_rules=[check_text_length, check_label_distribution]
715
+ )
716
+
717
+ # Validate
718
+ manager.validate()
719
+
720
+ # Statistics
721
+ stats = manager.compute_statistics()
722
+
723
+ # Summary
724
+ manager.summary()
725
+
726
+
727
+ print("\n" + "="*70)
728
+ print("✅ BÖLÜM 3 TAMAMLANDI!")
729
+ print("="*70)
730
+
731
+ print(f"""
732
+ Bu bölümde öğrendikleriniz (Tam Liste):
733
+
734
+ PART 1:
735
+ ✓ Custom Data Collators (3 tip: Simple, Padding, Advanced)
736
+ ✓ Advanced Feature Extraction (10+ features)
737
+ ✓ Feature Transformation & Normalization
738
+ ✓ Interaction Features
739
+ ✓ End-to-End Preprocessing Pipelines
740
+ ✓ Pipeline Templates
741
+ ✓ Data Augmentation (Word deletion, swap, synonym)
742
+ ✓ Smart Class Balancing
743
+
744
+ PART 2:
745
+ ✓ Complex Multi-Condition Filtering
746
+ ✓ Percentile Filtering
747
+ ✓ Stratified Sampling (Single & Multi-column)
748
+ ✓ Diversity Sampling (Max diversity, Coverage-based)
749
+ ✓ Active Learning Sampling (Uncertainty-based)
750
+ ✓ Dynamic Batching (Length-based, Bucket-based)
751
+ ✓ Smart Batch Composition (Balanced, Diverse)
752
+ ✓ Production-Ready Dataset Management
753
+
754
+ 📊 PERFORMANS KAZANIMLARI:
755
+ - Dynamic batching: Padding'i %40+ azaltır
756
+ - Stratified sampling: Balanced splits
757
+ - Diversity sampling: Daha representative data
758
+ - Smart augmentation: 3x veri artışı
759
+
760
+ 🎯 KEY TAKEAWAYS:
761
+ - Collator'lar modele göre özelleştirilmeli
762
+ - Pipeline pattern code organization'ı kolaylaştırır
763
+ - Augmentation class imbalance'ı çözer
764
+ - Stratified sampling generalization'ı iyileştirir
765
+ - Dynamic batching training efficiency'yi artırır
766
+
767
+ 📚 SONRAKI BÖLÜM: Özel Görevler İçin Datasets
768
+ - Question Answering (SQuAD, Natural Questions)
769
+ - Summarization (CNN/DailyMail)
770
+ - Named Entity Recognition
771
+ - Sentiment Analysis
772
+ - Text Classification
773
+ """)
774
+
775
+ print("\n🎉 Tebrikler! İleri teknikler modülünü tamamladınız!")
776
+ print("4. bölüme geçelim mi? (Özel Görevler)")
space/modules/04_ozel_gorevler.py ADDED
@@ -0,0 +1,1039 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ ÖZEL GÖREVLER İÇİN DATASETS - İLERİ SEVİYE
3
+ ==========================================
4
+
5
+ Bu modülde öğrenecekleriniz:
6
+ 1. Question Answering (QA) Datasets
7
+ 2. Summarization Datasets
8
+ 3. Named Entity Recognition (NER)
9
+ 4. Sentiment Analysis
10
+ 5. Text Classification
11
+ 6. Multi-Task Learning Datasets
12
+ """
13
+
14
+ from datasets import Dataset, DatasetDict
15
+ import numpy as np
16
+ from typing import Dict, List, Any
17
+ import random
18
+ from collections import Counter, defaultdict
19
+ import json
20
+
21
+ print("="*70)
22
+ print("📚 ÖZEL GÖREVLER İÇİN DATASETS")
23
+ print("="*70)
24
+
25
+
26
+ print("\n" + "="*70)
27
+ print("1. QUESTION ANSWERING (QA) DATASETS")
28
+ print("="*70)
29
+
30
+ print("\n❓ Question Answering Dataset Yapısı:")
31
+
32
+ class QADatasetCreator:
33
+ """
34
+ Question Answering dataset oluşturucu
35
+ """
36
+ @staticmethod
37
+ def create_extractive_qa_dataset(num_samples=200):
38
+ """
39
+ Extractive QA (SQuAD-style)
40
+ Cevap context'in içinden extract edilir
41
+ """
42
+ contexts = [
43
+ "The Amazon rainforest, also known as Amazonia, is a moist broadleaf tropical rainforest. "
44
+ "It covers most of the Amazon basin of South America. The basin covers 7 million square kilometers. "
45
+ "The rainforest contains approximately 390 billion individual trees.",
46
+
47
+ "Python is a high-level programming language. It was created by Guido van Rossum in 1991. "
48
+ "Python emphasizes code readability with significant indentation. It supports multiple programming paradigms "
49
+ "including structured, object-oriented and functional programming.",
50
+
51
+ "The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. "
52
+ "It was designed by Gustave Eiffel and completed in 1889. Standing 330 meters tall, "
53
+ "it was the world's tallest man-made structure until 1930.",
54
+
55
+ "Artificial Intelligence is the simulation of human intelligence by machines. "
56
+ "AI research began in 1956 at Dartmouth College. Modern AI techniques include "
57
+ "machine learning, deep learning, and natural language processing."
58
+ ]
59
+
60
+ qa_pairs = [
61
+ ("What is the Amazon rainforest?", "a moist broadleaf tropical rainforest", 0),
62
+ ("How many square kilometers does the Amazon basin cover?", "7 million square kilometers", 0),
63
+ ("Who created Python?", "Guido van Rossum", 1),
64
+ ("When was Python created?", "1991", 1),
65
+ ("Where is the Eiffel Tower located?", "Paris, France", 2),
66
+ ("How tall is the Eiffel Tower?", "330 meters", 2),
67
+ ("When did AI research begin?", "1956", 3),
68
+ ("Where did AI research begin?", "Dartmouth College", 3),
69
+ ]
70
+
71
+ def gen():
72
+ for i in range(num_samples):
73
+ context_idx = i % len(contexts)
74
+ qa_idx = i % len(qa_pairs)
75
+
76
+ context = contexts[context_idx]
77
+ question, answer, expected_ctx = qa_pairs[qa_idx]
78
+
79
+ # Answer span'i bul
80
+ answer_start = context.find(answer) if context_idx == expected_ctx else -1
81
+
82
+ yield {
83
+ 'id': f'qa_{i}',
84
+ 'context': context,
85
+ 'question': question,
86
+ 'answers': {
87
+ 'text': [answer],
88
+ 'answer_start': [answer_start] if answer_start >= 0 else [-1]
89
+ },
90
+ 'is_impossible': answer_start < 0
91
+ }
92
+
93
+ return Dataset.from_generator(gen)
94
+
95
+ @staticmethod
96
+ def create_multiple_choice_qa(num_samples=100):
97
+ """
98
+ Multiple Choice QA
99
+ """
100
+ questions = [
101
+ {
102
+ 'question': 'What is the capital of France?',
103
+ 'choices': ['London', 'Berlin', 'Paris', 'Madrid'],
104
+ 'answer': 2
105
+ },
106
+ {
107
+ 'question': 'Which planet is known as the Red Planet?',
108
+ 'choices': ['Venus', 'Mars', 'Jupiter', 'Saturn'],
109
+ 'answer': 1
110
+ },
111
+ {
112
+ 'question': 'Who wrote Romeo and Juliet?',
113
+ 'choices': ['Charles Dickens', 'William Shakespeare', 'Jane Austen', 'Mark Twain'],
114
+ 'answer': 1
115
+ }
116
+ ]
117
+
118
+ def gen():
119
+ for i in range(num_samples):
120
+ q = questions[i % len(questions)]
121
+ yield {
122
+ 'id': f'mcqa_{i}',
123
+ 'question': q['question'],
124
+ 'choices': q['choices'],
125
+ 'answer': q['answer'],
126
+ 'answer_text': q['choices'][q['answer']]
127
+ }
128
+
129
+ return Dataset.from_generator(gen)
130
+
131
+ print("\n1️⃣ Extractive QA Dataset (SQuAD-style):")
132
+ qa_dataset = QADatasetCreator.create_extractive_qa_dataset(200)
133
+
134
+ print(f"✅ Dataset: {len(qa_dataset)} QA pairs")
135
+ print(f"\nÖrnek QA:")
136
+ sample = qa_dataset[0]
137
+ print(f" Context: {sample['context'][:100]}...")
138
+ print(f" Question: {sample['question']}")
139
+ print(f" Answer: {sample['answers']['text'][0]}")
140
+ print(f" Answer start: {sample['answers']['answer_start'][0]}")
141
+ print(f" Is impossible: {sample['is_impossible']}")
142
+
143
+ # İstatistikler
144
+ print(f"\n📊 QA Statistics:")
145
+ impossible_count = sum([1 for ex in qa_dataset if ex['is_impossible']])
146
+ print(f" Total questions: {len(qa_dataset)}")
147
+ print(f" Answerable: {len(qa_dataset) - impossible_count}")
148
+ print(f" Impossible: {impossible_count}")
149
+
150
+ # Answer length distribution
151
+ answerable = [ex for ex in qa_dataset if not ex['is_impossible']]
152
+ answer_lengths = [len(ex['answers']['text'][0].split()) for ex in answerable]
153
+ print(f" Avg answer length: {np.mean(answer_lengths):.1f} words")
154
+
155
+
156
+ print("\n2️⃣ Multiple Choice QA:")
157
+ mcqa_dataset = QADatasetCreator.create_multiple_choice_qa(100)
158
+
159
+ print(f"✅ Dataset: {len(mcqa_dataset)} questions")
160
+ print(f"\nÖrnek:")
161
+ sample = mcqa_dataset[0]
162
+ print(f" Question: {sample['question']}")
163
+ print(f" Choices:")
164
+ for i, choice in enumerate(sample['choices']):
165
+ marker = "✓" if i == sample['answer'] else " "
166
+ print(f" {marker} {i}. {choice}")
167
+ print(f" Correct answer: {sample['answer_text']}")
168
+
169
+
170
+ print("\n3️⃣ QA Preprocessing Pipeline:")
171
+
172
+ class QAPreprocessor:
173
+ """
174
+ QA-specific preprocessing
175
+ """
176
+ @staticmethod
177
+ def validate_qa_example(example):
178
+ """
179
+ QA örneğini validate et
180
+ """
181
+ if example['is_impossible']:
182
+ return True
183
+
184
+ answer = example['answers']['text'][0]
185
+ answer_start = example['answers']['answer_start'][0]
186
+ context = example['context']
187
+
188
+ # Answer context'te var mı?
189
+ if answer_start >= 0:
190
+ extracted = context[answer_start:answer_start + len(answer)]
191
+ return extracted == answer
192
+ return False
193
+
194
+ @staticmethod
195
+ def add_qa_features(example):
196
+ """
197
+ QA feature'ları ekle
198
+ """
199
+ result = {**example}
200
+
201
+ # Question type
202
+ question_lower = example['question'].lower()
203
+ if question_lower.startswith('what'):
204
+ q_type = 'what'
205
+ elif question_lower.startswith('who'):
206
+ q_type = 'who'
207
+ elif question_lower.startswith('when'):
208
+ q_type = 'when'
209
+ elif question_lower.startswith('where'):
210
+ q_type = 'where'
211
+ elif question_lower.startswith('how'):
212
+ q_type = 'how'
213
+ elif question_lower.startswith('why'):
214
+ q_type = 'why'
215
+ else:
216
+ q_type = 'other'
217
+
218
+ result['question_type'] = q_type
219
+ result['context_length'] = len(example['context'].split())
220
+ result['question_length'] = len(example['question'].split())
221
+
222
+ if not example['is_impossible']:
223
+ answer = example['answers']['text'][0]
224
+ result['answer_length'] = len(answer.split())
225
+ else:
226
+ result['answer_length'] = 0
227
+
228
+ return result
229
+
230
+ # Preprocessing uygula
231
+ print("\n Applying QA preprocessing:")
232
+ qa_processed = qa_dataset.map(
233
+ QAPreprocessor.add_qa_features,
234
+ desc="Adding QA features"
235
+ )
236
+
237
+ print(f"✅ Processed: {len(qa_processed)} examples")
238
+ print(f" New columns: {[c for c in qa_processed.column_names if c not in qa_dataset.column_names]}")
239
+
240
+ # Question type distribution
241
+ q_types = [ex['question_type'] for ex in qa_processed]
242
+ type_dist = Counter(q_types)
243
+ print(f"\n Question type distribution:")
244
+ for qtype, count in type_dist.most_common():
245
+ print(f" {qtype}: {count}")
246
+
247
+
248
+ print("\n" + "="*70)
249
+ print("2. SUMMARIZATION DATASETS")
250
+ print("="*70)
251
+
252
+ print("\n📝 Summarization Dataset Yapısı:")
253
+
254
+ class SummarizationDatasetCreator:
255
+ """
256
+ Summarization dataset oluşturucu
257
+ """
258
+ @staticmethod
259
+ def create_news_summarization(num_samples=100):
260
+ """
261
+ News summarization (CNN/DailyMail style)
262
+ """
263
+ article_templates = [
264
+ {
265
+ 'article': "Scientists have made a breakthrough discovery in renewable energy. "
266
+ "Researchers at MIT developed a new solar panel technology that increases "
267
+ "efficiency by 40%. The innovation uses advanced nanomaterials. "
268
+ "This could revolutionize the solar energy industry. "
269
+ "The team published their findings in Nature Energy journal. "
270
+ "Commercial applications are expected within 5 years.",
271
+ 'summary': "MIT researchers developed solar panels with 40% higher efficiency using nanomaterials."
272
+ },
273
+ {
274
+ 'article': "The global tech conference concluded yesterday with major announcements. "
275
+ "Leading companies unveiled new AI technologies and products. "
276
+ "Attendance reached record numbers with over 50,000 participants. "
277
+ "Industry experts discussed future trends in artificial intelligence. "
278
+ "The conference featured 200 speakers from 30 countries.",
279
+ 'summary': "Global tech conference featured AI announcements with record 50,000 attendees."
280
+ },
281
+ {
282
+ 'article': "Climate change continues to impact global weather patterns. "
283
+ "Recent studies show increasing temperatures worldwide. "
284
+ "Scientists warn of more frequent extreme weather events. "
285
+ "International cooperation is needed to address the crisis. "
286
+ "Many countries are implementing new environmental policies.",
287
+ 'summary': "Studies reveal climate change effects and call for international action."
288
+ }
289
+ ]
290
+
291
+ def gen():
292
+ for i in range(num_samples):
293
+ template = article_templates[i % len(article_templates)]
294
+
295
+ yield {
296
+ 'id': f'summ_{i}',
297
+ 'article': template['article'],
298
+ 'summary': template['summary'],
299
+ 'article_length': len(template['article'].split()),
300
+ 'summary_length': len(template['summary'].split()),
301
+ 'compression_ratio': len(template['summary']) / len(template['article'])
302
+ }
303
+
304
+ return Dataset.from_generator(gen)
305
+
306
+ @staticmethod
307
+ def create_abstractive_summarization(num_samples=100):
308
+ """
309
+ Abstractive summarization - yeni kelimeler içeren özetler
310
+ """
311
+ def gen():
312
+ for i in range(num_samples):
313
+ article_length = np.random.randint(100, 500)
314
+ summary_length = np.random.randint(20, 50)
315
+
316
+ yield {
317
+ 'id': f'abs_summ_{i}',
318
+ 'article': f"Long article about topic {i}. " * (article_length // 5),
319
+ 'summary': f"Brief summary of article {i}. " * (summary_length // 5),
320
+ 'summary_type': 'abstractive',
321
+ 'article_length': article_length,
322
+ 'summary_length': summary_length
323
+ }
324
+
325
+ return Dataset.from_generator(gen)
326
+
327
+ print("\n1️⃣ News Summarization Dataset:")
328
+ summ_dataset = SummarizationDatasetCreator.create_news_summarization(100)
329
+
330
+ print(f"✅ Dataset: {len(summ_dataset)} article-summary pairs")
331
+ print(f"\nÖrnek:")
332
+ sample = summ_dataset[0]
333
+ print(f" Article ({sample['article_length']} words):")
334
+ print(f" {sample['article'][:150]}...")
335
+ print(f" Summary ({sample['summary_length']} words):")
336
+ print(f" {sample['summary']}")
337
+ print(f" Compression ratio: {sample['compression_ratio']:.2%}")
338
+
339
+ # Summarization statistics
340
+ print(f"\n📊 Summarization Statistics:")
341
+ avg_article_len = np.mean([ex['article_length'] for ex in summ_dataset])
342
+ avg_summary_len = np.mean([ex['summary_length'] for ex in summ_dataset])
343
+ avg_compression = np.mean([ex['compression_ratio'] for ex in summ_dataset])
344
+
345
+ print(f" Avg article length: {avg_article_len:.1f} words")
346
+ print(f" Avg summary length: {avg_summary_len:.1f} words")
347
+ print(f" Avg compression ratio: {avg_compression:.2%}")
348
+
349
+
350
+ print("\n2️⃣ Summarization Quality Metrics:")
351
+
352
+ class SummarizationMetrics:
353
+ """
354
+ Summarization için quality metrics
355
+ """
356
+ @staticmethod
357
+ def calculate_rouge_proxy(article, summary):
358
+ """
359
+ Basitleştirilmiş ROUGE-like metric
360
+ Gerçek ROUGE için rouge-score library kullanılır
361
+ """
362
+ article_words = set(article.lower().split())
363
+ summary_words = set(summary.lower().split())
364
+
365
+ # Overlap
366
+ overlap = len(article_words & summary_words)
367
+
368
+ # Precision, Recall, F1
369
+ precision = overlap / len(summary_words) if summary_words else 0
370
+ recall = overlap / len(article_words) if article_words else 0
371
+ f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
372
+
373
+ return {
374
+ 'precision': precision,
375
+ 'recall': recall,
376
+ 'f1': f1
377
+ }
378
+
379
+ @staticmethod
380
+ def add_quality_metrics(example):
381
+ """
382
+ Quality metrics ekle
383
+ """
384
+ metrics = SummarizationMetrics.calculate_rouge_proxy(
385
+ example['article'],
386
+ example['summary']
387
+ )
388
+
389
+ return {
390
+ **example,
391
+ 'rouge_precision': metrics['precision'],
392
+ 'rouge_recall': metrics['recall'],
393
+ 'rouge_f1': metrics['f1']
394
+ }
395
+
396
+ # Metrics ekle
397
+ print("\n Adding quality metrics:")
398
+ summ_with_metrics = summ_dataset.map(
399
+ SummarizationMetrics.add_quality_metrics,
400
+ desc="Calculating metrics"
401
+ )
402
+
403
+ print(f"✅ Metrics added")
404
+ print(f"\nSample metrics:")
405
+ sample = summ_with_metrics[0]
406
+ print(f" ROUGE Precision: {sample['rouge_precision']:.3f}")
407
+ print(f" ROUGE Recall: {sample['rouge_recall']:.3f}")
408
+ print(f" ROUGE F1: {sample['rouge_f1']:.3f}")
409
+
410
+
411
+ print("\n" + "="*70)
412
+ print("3. NAMED ENTITY RECOGNITION (NER)")
413
+ print("="*70)
414
+
415
+ print("\n🏷️ NER Dataset Yapısı:")
416
+
417
+ class NERDatasetCreator:
418
+ """
419
+ Named Entity Recognition dataset oluşturucu
420
+ """
421
+ @staticmethod
422
+ def create_ner_dataset(num_samples=100):
423
+ """
424
+ NER dataset (CoNLL format)
425
+ """
426
+ templates = [
427
+ {
428
+ 'tokens': ['John', 'Smith', 'works', 'at', 'Google', 'in', 'New', 'York'],
429
+ 'ner_tags': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'O', 'B-LOC', 'I-LOC']
430
+ },
431
+ {
432
+ 'tokens': ['Apple', 'announced', 'new', 'products', 'in', 'California'],
433
+ 'ner_tags': ['B-ORG', 'O', 'O', 'O', 'O', 'B-LOC']
434
+ },
435
+ {
436
+ 'tokens': ['Dr.', 'Jane', 'Brown', 'visited', 'Paris', 'last', 'Monday'],
437
+ 'ner_tags': ['O', 'B-PER', 'I-PER', 'O', 'B-LOC', 'O', 'B-DATE']
438
+ }
439
+ ]
440
+
441
+ # Tag to ID mapping
442
+ tag2id = {
443
+ 'O': 0,
444
+ 'B-PER': 1, 'I-PER': 2,
445
+ 'B-ORG': 3, 'I-ORG': 4,
446
+ 'B-LOC': 5, 'I-LOC': 6,
447
+ 'B-DATE': 7, 'I-DATE': 8
448
+ }
449
+
450
+ def gen():
451
+ for i in range(num_samples):
452
+ template = templates[i % len(templates)]
453
+
454
+ yield {
455
+ 'id': f'ner_{i}',
456
+ 'tokens': template['tokens'],
457
+ 'ner_tags': template['ner_tags'],
458
+ 'ner_tag_ids': [tag2id[tag] for tag in template['ner_tags']],
459
+ 'sentence': ' '.join(template['tokens'])
460
+ }
461
+
462
+ return Dataset.from_generator(gen), tag2id
463
+
464
+ print("\n1️⃣ NER Dataset:")
465
+ ner_dataset, tag2id = NERDatasetCreator.create_ner_dataset(100)
466
+
467
+ print(f"✅ Dataset: {len(ner_dataset)} sentences")
468
+ print(f" Tag vocabulary: {len(tag2id)} tags")
469
+ print(f" Tags: {list(tag2id.keys())}")
470
+
471
+ print(f"\nÖrnek:")
472
+ sample = ner_dataset[0]
473
+ print(f" Sentence: {sample['sentence']}")
474
+ print(f" Tokens: {sample['tokens']}")
475
+ print(f" NER tags: {sample['ner_tags']}")
476
+ print(f"\n Token-Tag pairs:")
477
+ for token, tag in zip(sample['tokens'], sample['ner_tags']):
478
+ if tag != 'O':
479
+ print(f" {token}: {tag}")
480
+
481
+
482
+ print("\n2️⃣ NER Statistics:")
483
+
484
+ class NERAnalyzer:
485
+ """
486
+ NER dataset analizi
487
+ """
488
+ @staticmethod
489
+ def analyze_entities(dataset):
490
+ """
491
+ Entity statistics
492
+ """
493
+ all_tags = []
494
+ entity_counts = defaultdict(int)
495
+
496
+ for ex in dataset:
497
+ tags = ex['ner_tags']
498
+ all_tags.extend(tags)
499
+
500
+ # Entity'leri say
501
+ for tag in tags:
502
+ if tag.startswith('B-'):
503
+ entity_type = tag.split('-')[1]
504
+ entity_counts[entity_type] += 1
505
+
506
+ tag_dist = Counter(all_tags)
507
+
508
+ return {
509
+ 'tag_distribution': dict(tag_dist),
510
+ 'entity_counts': dict(entity_counts),
511
+ 'total_tokens': len(all_tags),
512
+ 'entity_tokens': len([t for t in all_tags if t != 'O'])
513
+ }
514
+
515
+ analyzer = NERAnalyzer()
516
+ ner_stats = analyzer.analyze_entities(ner_dataset)
517
+
518
+ print(f"\n Total tokens: {ner_stats['total_tokens']}")
519
+ print(f" Entity tokens: {ner_stats['entity_tokens']} "
520
+ f"({ner_stats['entity_tokens']/ner_stats['total_tokens']*100:.1f}%)")
521
+
522
+ print(f"\n Entity type distribution:")
523
+ for entity_type, count in sorted(ner_stats['entity_counts'].items()):
524
+ print(f" {entity_type}: {count} entities")
525
+
526
+ print(f"\n Tag distribution:")
527
+ for tag, count in sorted(ner_stats['tag_distribution'].items(), key=lambda x: -x[1])[:5]:
528
+ print(f" {tag}: {count}")
529
+
530
+
531
+ print("\n3️⃣ NER Data Augmentation:")
532
+
533
+ class NERAugmenter:
534
+ """
535
+ NER için data augmentation
536
+ """
537
+ @staticmethod
538
+ def swap_entities(example, entity_bank):
539
+ """
540
+ Entity'leri farklı entity'lerle değiştir
541
+ """
542
+ tokens = example['tokens'].copy()
543
+ ner_tags = example['ner_tags'].copy()
544
+
545
+ # B-tags'i bul
546
+ for i, tag in enumerate(ner_tags):
547
+ if tag.startswith('B-'):
548
+ entity_type = tag.split('-')[1]
549
+ if entity_type in entity_bank and entity_bank[entity_type]:
550
+ # Random entity seç
551
+ new_entity = random.choice(entity_bank[entity_type])
552
+ tokens[i] = new_entity
553
+
554
+ return {
555
+ **example,
556
+ 'tokens': tokens,
557
+ 'sentence': ' '.join(tokens),
558
+ 'is_augmented': True
559
+ }
560
+
561
+ # Entity bank oluştur
562
+ entity_bank = {
563
+ 'PER': ['Alice', 'Bob', 'Charlie', 'Diana'],
564
+ 'ORG': ['Microsoft', 'Amazon', 'Tesla', 'IBM'],
565
+ 'LOC': ['London', 'Tokyo', 'Berlin', 'Sydney']
566
+ }
567
+
568
+ augmenter = NERAugmenter()
569
+ print("\n Entity swapping örneği:")
570
+ original = ner_dataset[0]
571
+ augmented = augmenter.swap_entities(original, entity_bank)
572
+
573
+ print(f" Original: {original['sentence']}")
574
+ print(f" Augmented: {augmented['sentence']}")
575
+
576
+
577
+ print("\n" + "="*70)
578
+ print("4. SENTIMENT ANALYSIS")
579
+ print("="*70)
580
+
581
+ print("\n😊 Sentiment Analysis Dataset Yapısı:")
582
+
583
+ class SentimentDatasetCreator:
584
+ """
585
+ Sentiment analysis dataset oluşturucu
586
+ """
587
+ @staticmethod
588
+ def create_sentiment_dataset(num_samples=200):
589
+ """
590
+ Binary/Multi-class sentiment classification
591
+ """
592
+ positive_texts = [
593
+ "This product is amazing! Highly recommended.",
594
+ "Excellent service and great quality.",
595
+ "I love this! Best purchase ever.",
596
+ "Fantastic experience, will buy again.",
597
+ "Outstanding quality and fast delivery."
598
+ ]
599
+
600
+ negative_texts = [
601
+ "Terrible product, waste of money.",
602
+ "Very disappointed with the quality.",
603
+ "Poor customer service, never again.",
604
+ "Worst purchase I've ever made.",
605
+ "Completely unsatisfied with this."
606
+ ]
607
+
608
+ neutral_texts = [
609
+ "It's okay, nothing special.",
610
+ "Average product, meets basic needs.",
611
+ "Neither good nor bad, just acceptable.",
612
+ "Standard quality for the price.",
613
+ "It works as described."
614
+ ]
615
+
616
+ def gen():
617
+ for i in range(num_samples):
618
+ sentiment_choice = i % 3
619
+
620
+ if sentiment_choice == 0:
621
+ text = positive_texts[i % len(positive_texts)]
622
+ label = 2 # Positive
623
+ label_text = 'positive'
624
+ elif sentiment_choice == 1:
625
+ text = negative_texts[i % len(negative_texts)]
626
+ label = 0 # Negative
627
+ label_text = 'negative'
628
+ else:
629
+ text = neutral_texts[i % len(neutral_texts)]
630
+ label = 1 # Neutral
631
+ label_text = 'neutral'
632
+
633
+ # Simulated confidence score
634
+ confidence = np.random.uniform(0.7, 1.0)
635
+
636
+ yield {
637
+ 'id': f'sent_{i}',
638
+ 'text': text,
639
+ 'label': label,
640
+ 'label_text': label_text,
641
+ 'confidence': confidence,
642
+ 'text_length': len(text.split())
643
+ }
644
+
645
+ return Dataset.from_generator(gen)
646
+
647
+ @staticmethod
648
+ def create_aspect_based_sentiment(num_samples=100):
649
+ """
650
+ Aspect-based sentiment analysis
651
+ Farklı aspect'ler için farklı sentiment'ler
652
+ """
653
+ def gen():
654
+ aspects = ['quality', 'price', 'service', 'delivery']
655
+
656
+ for i in range(num_samples):
657
+ aspect_sentiments = {
658
+ aspect: {
659
+ 'sentiment': random.choice(['positive', 'negative', 'neutral']),
660
+ 'score': np.random.uniform(0, 1)
661
+ }
662
+ for aspect in aspects
663
+ }
664
+
665
+ yield {
666
+ 'id': f'aspect_sent_{i}',
667
+ 'text': f"Review text {i} discussing various aspects.",
668
+ 'aspect_sentiments': aspect_sentiments
669
+ }
670
+
671
+ return Dataset.from_generator(gen)
672
+
673
+ print("\n1️⃣ Sentiment Classification Dataset:")
674
+ sentiment_dataset = SentimentDatasetCreator.create_sentiment_dataset(300)
675
+
676
+ print(f"✅ Dataset: {len(sentiment_dataset)} reviews")
677
+
678
+ # Label distribution
679
+ labels = [ex['label_text'] for ex in sentiment_dataset]
680
+ label_dist = Counter(labels)
681
+ print(f"\n📊 Label distribution:")
682
+ for label, count in label_dist.items():
683
+ pct = count / len(sentiment_dataset) * 100
684
+ print(f" {label}: {count} ({pct:.1f}%)")
685
+
686
+ # Örnekler
687
+ print(f"\nÖrnekler:")
688
+ for label in ['positive', 'negative', 'neutral']:
689
+ example = [ex for ex in sentiment_dataset if ex['label_text'] == label][0]
690
+ print(f"\n {label.capitalize()}:")
691
+ print(f" Text: {example['text']}")
692
+ print(f" Confidence: {example['confidence']:.2f}")
693
+
694
+
695
+ print("\n2️⃣ Aspect-Based Sentiment:")
696
+ aspect_dataset = SentimentDatasetCreator.create_aspect_based_sentiment(50)
697
+
698
+ print(f"✅ Dataset: {len(aspect_dataset)} reviews")
699
+ print(f"\nÖrnek aspect-based analysis:")
700
+ sample = aspect_dataset[0]
701
+ print(f" Text: {sample['text']}")
702
+ print(f" Aspect sentiments:")
703
+ for aspect, sentiment_info in sample['aspect_sentiments'].items():
704
+ print(f" {aspect}: {sentiment_info['sentiment']} (score: {sentiment_info['score']:.2f})")
705
+
706
+
707
+ print("\n3️⃣ Sentiment Feature Engineering:")
708
+
709
+ class SentimentFeatureEngineer:
710
+ """
711
+ Sentiment için feature engineering
712
+ """
713
+ @staticmethod
714
+ def extract_sentiment_features(example):
715
+ """
716
+ Sentiment-specific features
717
+ """
718
+ text = example['text'].lower()
719
+
720
+ # Sentiment keywords (simplified)
721
+ positive_words = ['great', 'excellent', 'amazing', 'love', 'best', 'fantastic']
722
+ negative_words = ['terrible', 'worst', 'poor', 'bad', 'disappointed', 'waste']
723
+
724
+ pos_count = sum([1 for word in positive_words if word in text])
725
+ neg_count = sum([1 for word in negative_words if word in text])
726
+
727
+ # Punctuation features
728
+ exclamation_count = text.count('!')
729
+ question_count = text.count('?')
730
+
731
+ # Capitalization
732
+ upper_count = sum([1 for c in example['text'] if c.isupper()])
733
+
734
+ return {
735
+ **example,
736
+ 'positive_word_count': pos_count,
737
+ 'negative_word_count': neg_count,
738
+ 'exclamation_count': exclamation_count,
739
+ 'question_count': question_count,
740
+ 'upper_case_count': upper_count,
741
+ 'sentiment_score': pos_count - neg_count # Simple score
742
+ }
743
+
744
+ feature_engineer = SentimentFeatureEngineer()
745
+ sentiment_featured = sentiment_dataset.map(
746
+ feature_engineer.extract_sentiment_features,
747
+ desc="Extracting sentiment features"
748
+ )
749
+
750
+ print(f"\n Feature extraction completed")
751
+ print(f" New features: positive_word_count, negative_word_count, sentiment_score, etc.")
752
+
753
+ print(f"\n Feature correlation with labels:")
754
+ for label_text in ['positive', 'negative', 'neutral']:
755
+ subset = [ex for ex in sentiment_featured if ex['label_text'] == label_text]
756
+ avg_score = np.mean([ex['sentiment_score'] for ex in subset])
757
+ avg_pos = np.mean([ex['positive_word_count'] for ex in subset])
758
+ avg_neg = np.mean([ex['negative_word_count'] for ex in subset])
759
+
760
+ print(f"\n {label_text.capitalize()}:")
761
+ print(f" Avg sentiment score: {avg_score:.2f}")
762
+ print(f" Avg positive words: {avg_pos:.2f}")
763
+ print(f" Avg negative words: {avg_neg:.2f}")
764
+
765
+
766
+ print("\n" + "="*70)
767
+ print("5. TEXT CLASSIFICATION")
768
+ print("="*70)
769
+
770
+ print("\n📊 General Text Classification:")
771
+
772
+ class TextClassificationDataset:
773
+ """
774
+ Multi-class text classification
775
+ """
776
+ @staticmethod
777
+ def create_topic_classification(num_samples=200):
778
+ """
779
+ Topic/Category classification
780
+ """
781
+ topics = {
782
+ 'sports': [
783
+ "The team won the championship with a final score of 3-1.",
784
+ "Athletes trained hard for the upcoming Olympic games.",
785
+ "The basketball match was exciting until the last minute."
786
+ ],
787
+ 'technology': [
788
+ "The new smartphone features advanced AI capabilities.",
789
+ "Software update improves system performance significantly.",
790
+ "Researchers developed a breakthrough algorithm for data processing."
791
+ ],
792
+ 'politics': [
793
+ "The parliament voted on the new legislation today.",
794
+ "Government announces policy changes affecting citizens.",
795
+ "Election results show close competition between candidates."
796
+ ],
797
+ 'entertainment': [
798
+ "The movie premiere attracted thousands of fans.",
799
+ "New album breaks streaming records in first week.",
800
+ "Award ceremony celebrates best performances of the year."
801
+ ]
802
+ }
803
+
804
+ topic_to_id = {topic: i for i, topic in enumerate(topics.keys())}
805
+
806
+ def gen():
807
+ for i in range(num_samples):
808
+ topic = list(topics.keys())[i % len(topics)]
809
+ text = topics[topic][i % len(topics[topic])]
810
+
811
+ yield {
812
+ 'id': f'topic_{i}',
813
+ 'text': text,
814
+ 'label': topic_to_id[topic],
815
+ 'label_text': topic
816
+ }
817
+
818
+ return Dataset.from_generator(gen), topic_to_id
819
+
820
+ print("\n1️⃣ Topic Classification Dataset:")
821
+ topic_dataset, topic_to_id = TextClassificationDataset.create_topic_classification(200)
822
+
823
+ print(f"✅ Dataset: {len(topic_dataset)} documents")
824
+ print(f" Topics: {list(topic_to_id.keys())}")
825
+
826
+ # Topic distribution
827
+ topics = [ex['label_text'] for ex in topic_dataset]
828
+ topic_dist = Counter(topics)
829
+ print(f"\n📊 Topic distribution:")
830
+ for topic, count in topic_dist.items():
831
+ print(f" {topic}: {count}")
832
+
833
+ # Örnekler
834
+ print(f"\nÖrnekler:")
835
+ for topic in list(topic_to_id.keys())[:3]:
836
+ example = [ex for ex in topic_dataset if ex['label_text'] == topic][0]
837
+ print(f"\n {topic.capitalize()}:")
838
+ print(f" {example['text']}")
839
+
840
+
841
+ print("\n" + "="*70)
842
+ print("6. MULTI-TASK LEARNING DATASETS")
843
+ print("="*70)
844
+
845
+ print("\n🎯 Multi-Task Dataset Yapısı:")
846
+
847
+ class MultiTaskDatasetCreator:
848
+ """
849
+ Birden fazla task için unified dataset
850
+ """
851
+ @staticmethod
852
+ def create_multitask_dataset(num_samples=100):
853
+ """
854
+ Aynı text için multiple task annotations
855
+ """
856
+ def gen():
857
+ for i in range(num_samples):
858
+ text = f"Sample text {i} with multiple annotations for various tasks."
859
+
860
+ yield {
861
+ 'id': f'multi_{i}',
862
+ 'text': text,
863
+
864
+ # Task 1: Sentiment
865
+ 'sentiment': random.choice(['positive', 'negative', 'neutral']),
866
+ 'sentiment_score': np.random.random(),
867
+
868
+ # Task 2: Topic
869
+ 'topic': random.choice(['sports', 'tech', 'politics']),
870
+ 'topic_confidence': np.random.random(),
871
+
872
+ # Task 3: Language quality
873
+ 'grammar_score': np.random.uniform(0.5, 1.0),
874
+ 'readability_score': np.random.uniform(0.5, 1.0),
875
+
876
+ # Metadata
877
+ 'text_length': len(text.split())
878
+ }
879
+
880
+ return Dataset.from_generator(gen)
881
+
882
+ print("\n1️⃣ Multi-Task Dataset:")
883
+ multitask_dataset = MultiTaskDatasetCreator.create_multitask_dataset(100)
884
+
885
+ print(f"✅ Dataset: {len(multitask_dataset)} examples")
886
+ print(f" Tasks: sentiment, topic, grammar, readability")
887
+
888
+ print(f"\nÖrnek multi-task annotation:")
889
+ sample = multitask_dataset[0]
890
+ print(f" Text: {sample['text']}")
891
+ print(f"\n Task Annotations:")
892
+ print(f" Sentiment: {sample['sentiment']} (score: {sample['sentiment_score']:.2f})")
893
+ print(f" Topic: {sample['topic']} (confidence: {sample['topic_confidence']:.2f})")
894
+ print(f" Grammar score: {sample['grammar_score']:.2f}")
895
+ print(f" Readability: {sample['readability_score']:.2f}")
896
+
897
+
898
+ print("\n2️⃣ Task-Specific Data Loaders:")
899
+
900
+ class MultiTaskLoader:
901
+ """
902
+ Multi-task dataset'i task-specific olarak yükle
903
+ """
904
+ def __init__(self, dataset):
905
+ self.dataset = dataset
906
+
907
+ def get_task_dataset(self, task_name, task_columns):
908
+ """
909
+ Belirli bir task için dataset al
910
+ """
911
+ def extract_task_data(example):
912
+ result = {
913
+ 'text': example['text'],
914
+ 'id': example['id']
915
+ }
916
+ for col in task_columns:
917
+ result[col] = example[col]
918
+ return result
919
+
920
+ return self.dataset.map(
921
+ extract_task_data,
922
+ remove_columns=[c for c in self.dataset.column_names
923
+ if c not in ['text', 'id'] + task_columns],
924
+ desc=f"Loading {task_name} task"
925
+ )
926
+
927
+ loader = MultiTaskLoader(multitask_dataset)
928
+
929
+ # Task-specific datasets
930
+ print("\n Creating task-specific datasets:")
931
+
932
+ sentiment_task = loader.get_task_dataset(
933
+ 'sentiment',
934
+ ['sentiment', 'sentiment_score']
935
+ )
936
+ print(f" Sentiment task: {len(sentiment_task)} examples, columns: {sentiment_task.column_names}")
937
+
938
+ topic_task = loader.get_task_dataset(
939
+ 'topic',
940
+ ['topic', 'topic_confidence']
941
+ )
942
+ print(f" Topic task: {len(topic_task)} examples, columns: {topic_task.column_names}")
943
+
944
+
945
+ print("\n" + "="*70)
946
+ print("📚 BEST PRACTICES - TASK-SPECIFIC DATASETS")
947
+ print("="*70)
948
+
949
+ print("""
950
+ ✅ QUESTION ANSWERING:
951
+ - SQuAD format: context, question, answer, answer_start
952
+ - Validate answer spans
953
+ - Handle impossible questions
954
+ - Question type classification
955
+ - Context length management
956
+
957
+ ✅ SUMMARIZATION:
958
+ - Multiple reference summaries
959
+ - Compression ratio tracking
960
+ - ROUGE scores for validation
961
+ - Abstractive vs Extractive
962
+ - Length constraints
963
+
964
+ ✅ NAMED ENTITY RECOGNITION:
965
+ - BIO/BIOES tagging scheme
966
+ - Entity type taxonomy
967
+ - Nested entities handling
968
+ - Cross-sentence entities
969
+ - Entity linking (optional)
970
+
971
+ ✅ SENTIMENT ANALYSIS:
972
+ - Multi-level granularity (binary/3-class/5-class)
973
+ - Aspect-based sentiment
974
+ - Confidence scores
975
+ - Domain-specific lexicons
976
+ - Emotion detection
977
+
978
+ ✅ TEXT CLASSIFICATION:
979
+ - Balanced classes
980
+ - Hierarchical categories
981
+ - Multi-label support
982
+ - Confidence calibration
983
+ - Class imbalance handling
984
+
985
+ ✅ MULTI-TASK LEARNING:
986
+ - Consistent text preprocessing
987
+ - Task-specific heads
988
+ - Shared representations
989
+ - Task weighting strategies
990
+ - Auxiliary tasks
991
+
992
+ 🎯 GENERAL PRINCIPLES:
993
+ - Clear annotation guidelines
994
+ - Inter-annotator agreement
995
+ - Quality control checks
996
+ - Regular dataset updates
997
+ - Version control
998
+ - Documentation
999
+ """)
1000
+
1001
+
1002
+ print("\n" + "="*70)
1003
+ print("✅ BÖLÜM 4 TAMAMLANDI!")
1004
+ print("="*70)
1005
+
1006
+ print(f"""
1007
+ Bu bölümde öğrendikleriniz:
1008
+ ✓ Question Answering datasets (Extractive & Multiple Choice)
1009
+ ✓ Summarization datasets (News & Abstractive)
1010
+ ✓ Named Entity Recognition (BIO tagging)
1011
+ ✓ Sentiment Analysis (Binary, Multi-class, Aspect-based)
1012
+ ✓ Text Classification (Topic classification)
1013
+ ✓ Multi-Task Learning datasets
1014
+
1015
+ 📊 ÜRETİLEN DATASETS:
1016
+ - QA: 200 extractive + 100 multiple choice
1017
+ - Summarization: 100 news articles
1018
+ - NER: 100 annotated sentences
1019
+ - Sentiment: 300 reviews + 50 aspect-based
1020
+ - Topic: 200 documents
1021
+ - Multi-task: 100 multi-annotated examples
1022
+
1023
+ 🎯 KEY LEARNINGS:
1024
+ - Her task farklı data format gerektirir
1025
+ - Quality metrics task-specific
1026
+ - Preprocessing task'a göre özelleştirilmeli
1027
+ - Multi-task learning verimli öğrenme sağlar
1028
+ - Annotation quality critical
1029
+
1030
+ 📚 SERİ TAMAMLANDI!
1031
+ Tüm modüller başarıyla tamamlandı:
1032
+ ✅ Bölüm 1: Büyük Ölçekli Datasets
1033
+ ✅ Bölüm 2: Domain-Specific Datasets
1034
+ ✅ Bölüm 3: İleri Teknikler
1035
+ ✅ Bölüm 4: Özel Görevler İçin Datasets
1036
+ """)
1037
+
1038
+ print("\n🎉 Tebrikler! Tüm modülleri tamamladınız!")
1039
+ print("Şimdi bu bilgileri kendi projelerinizde kullanabilirsiniz! 🚀")