MEHMET TUĞRUL KAYA
Initial commit: Advanced Dataset Tutorial
2e6a47d

Domain-Specific Dataset Examples

This folder contains example datasets tailored to different domains.

Domains

🔬 Scientific Papers

  • arXiv and PubMed style
  • 2,000 examples
  • Citation tracking
  • Abstract + full text

💻 Code Datasets

  • 6 programming languages
  • 2,000 code examples
  • Syntax parsing
  • Docstring extraction

💰 Financial Data

  • Sentiment analysis
  • Market data
  • 2,000 records
  • Time series

🏥 Medical Data

  • PHI anonymization
  • HIPAA compliance
  • 2,000 records
  • Clinical notes

Cross-Domain Integration

Problem: Schema Mismatch

# ❌ This raises an ERROR
from datasets import concatenate_datasets

combined = concatenate_datasets([sci_ds, code_ds])
# ArrowTypeError: struct fields don't match
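
Before concatenating, it can help to see exactly which columns differ between two domains. The following is a minimal sketch in plain Python, assuming each dataset can be read as a list of dicts. The `schema_diff` helper and the sample records are hypothetical and only stand in for `sci_ds` and `code_ds`.

```python
def schema_diff(rows_a, rows_b):
    """Return the column names that appear in only one of two record lists."""
    cols_a = set().union(*(row.keys() for row in rows_a))
    cols_b = set().union(*(row.keys() for row in rows_b))
    return {
        "only_in_a": sorted(cols_a - cols_b),
        "only_in_b": sorted(cols_b - cols_a),
    }


# Hypothetical records; real field names depend on your data.
sci_rows = [{"text": "A study of graph kernels", "citations": 12}]
code_rows = [{"text": "def f(): pass", "language": "python"}]

diff = schema_diff(sci_rows, code_rows)
print(diff)
```

Any name listed under `only_in_a` or `only_in_b` is a column that one of the solutions below has to reconcile.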

Solution 1: Flatten Approach

# ✅ Common schema
def normalize(ex, domain):
    return {
        'text': ex.get('text'),
        'domain': domain,
        'field1': ex.get('field1'),
        'field2': ex.get('field2'),
        # ... all fields
    }
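
The flatten approach can be sketched end to end with plain lists of dicts. This is an illustration under assumptions, not the tutorial's own implementation: `COMMON_FIELDS`, `normalize_flat`, and the sample records are hypothetical, and the real field list depends on your domains.

```python
# Hypothetical union of the fields used across all domains.
COMMON_FIELDS = ["text", "citations", "language"]


def normalize_flat(ex, domain):
    """Project a record onto the shared schema; missing fields become None."""
    row = {field: ex.get(field) for field in COMMON_FIELDS}
    row["domain"] = domain
    return row


# Hypothetical records standing in for two domain datasets.
sci_rows = [{"text": "A study of graph kernels", "citations": 12}]
code_rows = [{"text": "def f(): pass", "language": "python"}]

combined = (
    [normalize_flat(r, "science") for r in sci_rows]
    + [normalize_flat(r, "code") for r in code_rows]
)

for row in combined:
    print(row)
```

Every row now has the same keys, so the records can be stacked safely. The trade-off is that each new domain widens the schema and most cells end up empty. With Hugging Face `datasets`, a function of this shape could be applied through `Dataset.map` with `fn_kwargs={"domain": ...}` before concatenating.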

Solution 2: JSON Metadata

# ✅ Flexible structure
import json

def normalize(ex, domain):
    return {
        'text': ex.get('text'),
        'domain': domain,
        'metadata_json': json.dumps(ex.get('meta', {}))
    }
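
A short round trip shows how the JSON column is written and read back. This sketch varies the snippet above in one respect, which is an assumption: instead of reading a `meta` key, it packs every field other than `text` into the JSON string. The name `normalize_json` and the sample record are hypothetical.

```python
import json


def normalize_json(ex, domain):
    """Keep text and domain as columns; serialize everything else to JSON."""
    meta = {key: value for key, value in ex.items() if key != "text"}
    return {
        "text": ex.get("text"),
        "domain": domain,
        "metadata_json": json.dumps(meta, sort_keys=True),
    }


# Hypothetical code-domain record.
row = normalize_json(
    {"text": "def f(): pass", "language": "python", "lines": 1}, "code"
)

# Decode the metadata again when a domain-specific field is needed.
restored = json.loads(row["metadata_json"])
print(row["metadata_json"])
print(restored["language"])
```

All domains share the same three string columns, so the schema never changes. The cost is that metadata fields cannot be filtered on directly until the JSON has been decoded.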

Solution 3: Separate Tables

# ✅ Database-style
unified_table + metadata_tables
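
One way to read `unified_table + metadata_tables` is a shared table of common columns plus one metadata table per domain, linked by an id. The sketch below assumes that reading. The `split_tables` helper, the `id` key, and the sample records are hypothetical.

```python
def split_tables(rows, domain, start_id=0):
    """Split records into shared rows and domain-specific rows linked by id."""
    unified, metadata = [], []
    for offset, ex in enumerate(rows):
        row_id = start_id + offset
        unified.append({"id": row_id, "text": ex.get("text"), "domain": domain})
        extra = {key: value for key, value in ex.items() if key != "text"}
        # Write the id last so it always wins over any "id" key in the record.
        metadata.append({**extra, "id": row_id})
    return unified, metadata


# Hypothetical records standing in for two domain datasets.
sci_rows = [{"text": "A study of graph kernels", "citations": 12}]
code_rows = [{"text": "def f(): pass", "language": "python"}]

sci_unified, sci_meta = split_tables(sci_rows, "science")
code_unified, code_meta = split_tables(code_rows, "code", start_id=len(sci_unified))

unified_table = sci_unified + code_unified
metadata_tables = {"science": sci_meta, "code": code_meta}

print(unified_table)
print(metadata_tables)
```

The unified table has a fixed schema for cross-domain work, while each metadata table keeps its own columns and is joined back on `id` only when needed.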

Best Practices

✅ Use domain expertise
✅ Specialized tokenization
✅ Quality filtering
✅ Ethical guidelines
✅ Schema normalization