# Domain-Specific Dataset Examples

This folder contains example datasets tailored to different domains.

## Domains

### 🔬 Scientific Articles
- arXiv and PubMed style
- 2,000 examples
- Citation tracking
- Abstract + full text

### 💻 Code Datasets
- 6 programming languages
- 2,000 code examples
- Syntax parsing
- Docstring extraction

### 💰 Financial Data
- Sentiment analysis
- Market data
- 2,000 records
- Time series

### 🏥 Medical Data
- PHI anonymization
- HIPAA compliance
- 2,000 records
- Clinical notes

## Cross-Domain Integration

### Problem: Schema Mismatch
```python
from datasets import concatenate_datasets

# ❌ This raises an error
combined = concatenate_datasets([sci_ds, code_ds])
# ArrowTypeError: struct fields don't match
```
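
Before concatenating, it can help to see exactly which fields differ between two datasets. The following is a minimal sketch in plain Python, assuming each dataset can be iterated as dictionaries. The helper names `field_paths` and `schema_diff`, and the sample records, are hypothetical and only serve as illustration.

```python
def field_paths(record, prefix=""):
    """Yield dotted paths for every leaf field, descending into nested dicts."""
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from field_paths(value, prefix=f"{path}.")
        else:
            yield path


def schema_diff(left, right):
    """Return the field paths that appear on only one side."""
    left_paths = {p for rec in left for p in field_paths(rec)}
    right_paths = {p for rec in right for p in field_paths(rec)}
    return {
        "only_left": sorted(left_paths - right_paths),
        "only_right": sorted(right_paths - left_paths),
    }


# Hypothetical records standing in for sci_ds and code_ds
sci_records = [{"text": "An abstract", "meta": {"citations": 12}}]
code_records = [{"text": "def f(): pass", "meta": {"language": "python"}}]

diff = schema_diff(sci_records, code_records)
print(diff)
```

Both record sets share the top-level `text` and `meta` columns, yet the nested fields under `meta` differ, which is the kind of struct mismatch described above.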

### Solution 1: Flatten Approach
```python
# ✅ Shared schema
def normalize(ex, domain):
    return {
        'text': ex.get('text'),
        'domain': domain,
        'field1': ex.get('field1'),
        'field2': ex.get('field2'),
        # ... all fields
    }
```
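
As a worked illustration of the flatten approach, the sketch below projects every record onto one fixed field list and fills missing fields with `None`, so all rows end up with identical columns. The `fields` parameter, the `ALL_FIELDS` list, and the sample records are assumptions made for this example, not part of the original datasets.

```python
def normalize(ex, domain, fields):
    """Project a record onto a fixed field list, filling gaps with None."""
    row = {"text": ex.get("text"), "domain": domain}
    for name in fields:
        row[name] = ex.get(name)
    return row


# Hypothetical union of the domain-specific columns
ALL_FIELDS = ["citations", "language"]

sci = {"text": "An abstract", "citations": 12}
code = {"text": "def f(): pass", "language": "python"}

rows = [
    normalize(sci, "science", ALL_FIELDS),
    normalize(code, "code", ALL_FIELDS),
]

# Every row now exposes the same columns in the same order
for row in rows:
    print(list(row.keys()))
```

The trade-off is that each new domain widens the shared schema, and most columns stay empty for most rows. Assuming the Hugging Face `datasets` library, a function of this shape could be applied with `Dataset.map` before calling `concatenate_datasets`.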

### Solution 2: JSON Metadata
```python
import json

# ✅ Flexible structure
def normalize(ex, domain):
    return {
        'text': ex.get('text'),
        'domain': domain,
        'metadata_json': json.dumps(ex.get('meta', {}))
    }
```
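
The JSON approach only pays off if the metadata can be read back when needed. The sketch below shows the round trip: domain-specific fields are serialized into one string column and decoded on demand. The `read_metadata` helper and the sample records are hypothetical, and `sort_keys=True` is an optional choice made here to keep the serialized strings deterministic.

```python
import json


def normalize(ex, domain):
    """Keep shared columns as-is and pack everything else into a JSON string."""
    return {
        "text": ex.get("text"),
        "domain": domain,
        "metadata_json": json.dumps(ex.get("meta", {}), sort_keys=True),
    }


def read_metadata(row):
    """Decode the JSON column back into a dictionary."""
    return json.loads(row["metadata_json"])


sci = {"text": "An abstract", "meta": {"citations": 12}}
code = {"text": "def f(): pass", "meta": {"language": "python"}}

rows = [normalize(sci, "science"), normalize(code, "code")]

# All rows share three string-or-None columns, whatever the metadata holds
print(read_metadata(rows[0]))
print(read_metadata(rows[1]))
```

The schema stays fixed at three columns, at the cost of an extra decoding step whenever a metadata field is used for filtering or analysis.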

### Solution 3: Separate Tables
```python
# ✅ Database-style: a unified table plus per-domain metadata tables
# (schematic, not runnable code)
# unified_table + metadata_tables
```
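
One way to read the database-style layout is sketched below in plain Python: shared columns go into a single unified table, and each domain keeps its extra fields in its own metadata table, linked by a row ID. The `split_tables` function, the `id` key, and the sample records are assumptions for illustration only.

```python
def split_tables(records_by_domain):
    """Split records into one unified table and one metadata table per domain."""
    unified = []
    metadata = {}
    next_id = 0
    for domain, records in records_by_domain.items():
        metadata[domain] = []
        for rec in records:
            # Shared columns live in the unified table
            unified.append({"id": next_id, "text": rec.get("text"), "domain": domain})
            # Everything else stays in the domain's own table, keyed by the same id
            extra = {k: v for k, v in rec.items() if k not in ("id", "text")}
            metadata[domain].append({"id": next_id, **extra})
            next_id += 1
    return unified, metadata


# Hypothetical input records
records_by_domain = {
    "science": [{"text": "An abstract", "citations": 12}],
    "code": [{"text": "def f(): pass", "language": "python"}],
}

unified_table, metadata_tables = split_tables(records_by_domain)
print(unified_table)
print(metadata_tables)
```

Each metadata table keeps its own schema, so no domain has to carry empty columns for another, and the shared `id` allows a join back to the unified table when domain-specific fields are needed.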

## Best Practices

✅ Use domain expertise  
✅ Specialized tokenization  
✅ Quality filtering  
✅ Ethical guidelines  
✅ Schema normalization