BERT Turkish Earthquake Tweet NER Model

A NER model for automatically extracting critical information from earthquake tweets, fine-tuned with supervised fine-tuning (SFT) + pseudo-labeling (stacking ensemble).

Base model: dbmdz/bert-base-turkish-cased
Training: 500 gold-standard samples → LLM pseudo-labeling → 17,000-sample stacked dataset
Dataset: tweets from the 6 February 2023 Turkey-Syria earthquake
Language: Turkish

Supported Labels

Label	Description	Example
`LOC`	Location / Address	Hatay Antakya Armutlu Sokak
`PER`	Person name	Murat Filazoğlu
`PHONE`	Phone number	0539 218 3976
`NEED`	Need / Request	battaniye, jeneratör (blanket, generator)
`ORG`	Institution / Organization	AFAD, Kızılay
`LINK`	URL	https://t.co/...
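
The inference code further down strips a two-character prefix from each predicted tag and checks for `B-` and `I-`, which implies a BIO tagging scheme over these six entity types. A minimal sketch of the resulting tag inventory follows. The ordering and the helper name `build_bio_labels` are assumptions for illustration; the authoritative mapping is whatever `model.config.id2label` contains.

```python
# Entity types from the label table.
ENTITY_TYPES = ["LOC", "PER", "PHONE", "NEED", "ORG", "LINK"]

def build_bio_labels(entity_types):
    """Return 'O' plus a B-/I- pair for every entity type."""
    labels = ["O"]
    for ent in entity_types:
        labels.append(f"B-{ent}")  # first token of an entity
        labels.append(f"I-{ent}")  # continuation token of the same entity
    return labels

# Hypothetical ordering; the real one lives in model.config.id2label.
bio_labels = build_bio_labels(ENTITY_TYPES)
print(len(bio_labels))  # 13
print(bio_labels[:3])   # ['O', 'B-LOC', 'I-LOC']
```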

⚠️ Important Note: Preprocessing (Tokenization)

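During training, texts were split into words with NLTK's TweetTokenizer rather than with the standard BERT tokenizer alone; the resulting words are then passed to the BERT tokenizer as pre-split input. As a result, expressions such as @kullanici_adi and https://... are handled as single units instead of being broken apart, as are phone numbers in formats the tokenizer recognizes. Numbers in other formats, such as 0533 123 45 67 in the example output below, come out as several tokens and are re-joined during entity merging.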
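
To illustrate why a tweet-aware word splitter matters, here is a minimal standard-library sketch contrasting a naive punctuation split with one that matches URLs and handles first. The two regular expressions are simplified stand-ins written for this example, not NLTK's actual TweetTokenizer rules, and the function names are hypothetical.

```python
import re

# Simplified stand-in for a tweet-aware tokenizer (NOT NLTK's real pattern):
# URLs and @handles are tried first, so they survive as single tokens.
TWEET_AWARE = re.compile(r"https?://\S+|@\w+|\w+|[^\w\s]")
# Naive splitter: every punctuation character becomes its own token.
NAIVE = re.compile(r"\w+|[^\w\s]")

def tweet_aware_split(text):
    return TWEET_AWARE.findall(text)

def naive_split(text):
    return NAIVE.findall(text)

sample = "@AFADTurkiye yardım: https://t.co/abc"
print(naive_split(sample))        # handle and URL are shredded into pieces
print(tweet_aware_split(sample))  # ['@AFADTurkiye', 'yardım', ':', 'https://t.co/abc']
```

With the naive split, a LINK or ORG entity would have to be reassembled from many fragments; with the tweet-aware split, each is one word and receives one label.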

Hugging Face's standard pipeline function splits text aggressively, which can fragment URLs and organization handles. For full, accurate performance, use the custom inference code below.
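
The custom inference code avoids that fragmentation by predicting at word level: the text is pre-split, and each word takes the label predicted for its first subword, so a URL broken into many subwords still yields exactly one label. A minimal sketch of that selection step, mirroring the loop in the Usage code; the `word_ids` and prediction lists are hand-written stand-ins for real tokenizer and model output, and `first_subword_labels` is a hypothetical helper name.

```python
def first_subword_labels(word_ids, subword_preds):
    """Keep one prediction per word: the one on its first subword."""
    labels = []
    prev = None
    for word_idx, pred in zip(word_ids, subword_preds):
        if word_idx is None:  # special tokens such as [CLS] / [SEP]
            continue
        if word_idx != prev:  # first subword of a new word
            labels.append(pred)
        prev = word_idx
    return labels

# Hypothetical alignment for three words: a handle split into 3 subwords,
# a plain word, and a URL split into 4 subwords.
word_ids = [None, 0, 0, 0, 1, 2, 2, 2, 2, None]
subword_preds = ["O", "B-ORG", "I-ORG", "O", "O",
                 "B-LINK", "I-LINK", "O", "I-LINK", "O"]

print(first_subword_labels(word_ids, subword_preds))
# ['B-ORG', 'O', 'B-LINK']
```

Noisy predictions on later subwords (the stray `O` inside the URL here) are ignored, which is one reason the word-level route can be more stable for long tokens.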

Usage

The following code processes text in the format the model was trained on, merges the predicted labels into entities, and returns them together with confidence scores (score):

import torch
from nltk.tokenize import TweetTokenizer
from transformers import AutoTokenizer, AutoModelForTokenClassification

# 1. Initialize NLTK's TweetTokenizer (needed to keep links and handles intact).
#    It is regex-based, so no NLTK data download (e.g. 'punkt') is required.
tknzr = TweetTokenizer(preserve_case=True, strip_handles=False, reduce_len=False)

# 2. Load the model and tokenizer from Hugging Face
repo_id = "nypgd/bert-turkish-deprem-ner"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)
model.eval()
id2label = model.config.id2label

# 3. Custom prediction function
def extract_entities(text):
    # Pre-split the text into words the same way the training data was prepared
    original_tokens = tknzr.tokenize(text)
    if not original_tokens:
        return []

    # truncation=True: words beyond the model's maximum length are dropped
    inputs = tokenizer(original_tokens, is_split_into_words=True, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits

    predictions = torch.argmax(logits, dim=2)[0]
    word_ids = inputs.word_ids()

    # Keep one tag per word: the prediction on its first subword
    token_tags = []
    token_logit_idx = []
    prev_word_idx = None

    for sub_pos, (pred_id, word_idx) in enumerate(zip(predictions, word_ids)):
        if word_idx is None:  # special tokens ([CLS], [SEP])
            continue
        if word_idx != prev_word_idx:
            token_tags.append(id2label[pred_id.item()])
            token_logit_idx.append(sub_pos)
        prev_word_idx = word_idx

    # Entity merging (punctuation tagged 'O' inside an open entity is skipped)
    SKIP_TOKENS = {',', '-', ':', '.', '/', '(', ')'}
    entities = []
    current_ent = None

    for i, (token, tag) in enumerate(zip(original_tokens, token_tags)):
        if tag == 'O' and current_ent and token in SKIP_TOKENS:
            continue
        
        if tag == 'O':
            if current_ent: entities.append(current_ent)
            current_ent = None
            continue

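        ent_type = tag[2:]  # strip the 'B-' / 'I-' prefix
        # Word confidence = top softmax probability at the word's first subword
        score_val = round(torch.softmax(logits[0, token_logit_idx[i]], dim=-1).max().item(), 4)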

        if tag.startswith('B-'):
            if current_ent: entities.append(current_ent)
            current_ent = {"entity_group": ent_type, "word": token, "score": score_val, "start": i, "end": i + 1}
        elif tag.startswith('I-') and current_ent and current_ent["entity_group"] == ent_type:
            current_ent["word"] += " " + token
            current_ent["end"] = i + 1
            current_ent["score"] = round(min(current_ent["score"], score_val), 4)
        else:
            if current_ent: entities.append(current_ent)
            current_ent = {"entity_group": ent_type, "word": token, "score": score_val, "start": i, "end": i + 1}

    if current_ent: entities.append(current_ent)
    return entities


# 4. Try it out
text = """
@AFADTurkiye ve Ahbap ekipleri, Gaziantep İslahiye Yeni Mahalle Karanfil Sokak No:5 adresinde 7 kişi enkaz altında mahsur kaldı. 
Acil ısıtıcı, çadır ve çocuk maması gerekiyor. Saha sorumlusu Ayşe Yurt iletişim: 0533 123 45 67.
 Konum ve detaylı bilgi için: https://t.co/yardimadresi
"""

results = extract_entities(text)
for ent in results:
    etiket = f"[{ent['entity_group']}]"
    print(f"{etiket:<9} {ent['word']}  (score: {ent['score']:.4f})")

print("[")
for i, ent in enumerate(results):
    virgul = "," if i < len(results) - 1 else ""
    print(f"  {ent}{virgul}")
print("]")


# Formatted output
───────────────────────────────────────────────────────────────────────────────────
[ORG]    @AFADTurkiye  (score: 0.9999)
[ORG]    Ahbap  (score: 0.9996)
[LOC]    Gaziantep İslahiye Yeni Mahalle Karanfil Sokak No 5  (score: 0.9999)
[NEED]   ısıtıcı  (score: 0.9999)
[NEED]   çadır  (score: 0.9999)
[NEED]   çocuk maması  (score: 0.9997)
[PER]    Ayşe Yurt  (score: 0.9993)
[PHONE]  0533 123 45 67  (score: 0.9998)
[LINK]   https://t.co/yardimadresi  (score: 0.9999)
───────────────────────────────────────────────────────────────────────────────────

# Raw output (list format, one entity per line)
[
  {'entity_group': 'ORG', 'word': '@AFADTurkiye', 'score': 0.9999, 'start': 0, 'end': 1},
  {'entity_group': 'ORG', 'word': 'Ahbap', 'score': 0.9996, 'start': 2, 'end': 3},
  {'entity_group': 'LOC', 'word': 'Gaziantep İslahiye Yeni Mahalle Karanfil Sokak No 5', 'score': 0.9999, 'start': 5, 'end': 14},
  {'entity_group': 'NEED', 'word': 'ısıtıcı', 'score': 0.9999, 'start': 23, 'end': 24},
  {'entity_group': 'NEED', 'word': 'çadır', 'score': 0.9999, 'start': 25, 'end': 26},
  {'entity_group': 'NEED', 'word': 'çocuk maması', 'score': 0.9997, 'start': 27, 'end': 29},
  {'entity_group': 'PER', 'word': 'Ayşe Yurt', 'score': 0.9993, 'start': 33, 'end': 35},
  {'entity_group': 'PHONE', 'word': '0533 123 45 67', 'score': 0.9998, 'start': 37, 'end': 41},
  {'entity_group': 'LINK', 'word': 'https://t.co/yardimadresi', 'score': 0.9999, 'start': 48, 'end': 49}
]
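
The merging step can be tried without downloading the model. The sketch below re-implements the same logic as a standalone function on hand-written tags and scores (not model output); `merge_entities` is a hypothetical name, and the per-token scores are invented purely to show how a span's score is derived.

```python
SKIP_TOKENS = {",", "-", ":", ".", "/", "(", ")"}

def merge_entities(tokens, tags, scores):
    """Merge word-level BIO tags into spans; a span's score is its weakest token."""
    entities, current = [], None
    for i, (token, tag, score) in enumerate(zip(tokens, tags, scores)):
        if tag == "O":
            # Punctuation inside an open span is skipped rather than closing it.
            if current and token in SKIP_TOKENS:
                continue
            if current:
                entities.append(current)
            current = None
            continue
        ent_type = tag[2:]
        if tag.startswith("I-") and current and current["entity_group"] == ent_type:
            # Continuation: extend the open span and keep the minimum score.
            current["word"] += " " + token
            current["end"] = i + 1
            current["score"] = min(current["score"], score)
        else:
            # 'B-' tag, or an 'I-' tag with no matching open span: start a new one.
            if current:
                entities.append(current)
            current = {"entity_group": ent_type, "word": token,
                       "score": score, "start": i, "end": i + 1}
    if current:
        entities.append(current)
    return entities

# Hand-written example: the ':' inside "No:5" is tagged 'O' but does not break the span.
tokens = ["Karanfil", "Sokak", "No", ":", "5", "adresinde"]
tags = ["B-LOC", "I-LOC", "I-LOC", "O", "I-LOC", "O"]
scores = [0.99, 0.98, 0.97, 0.60, 0.95, 0.99]

result = merge_entities(tokens, tags, scores)
print(result)
# [{'entity_group': 'LOC', 'word': 'Karanfil Sokak No 5', 'score': 0.95, 'start': 0, 'end': 5}]
```

This explains two details of the raw output: skipped punctuation is omitted from `word` (hence `No 5` for the input `No:5`), and `start` / `end` are end-exclusive indices into the pre-split token list, not character offsets.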

Model size: 0.1B params (Safetensors, tensor type F32)

Evaluation results (self-reported)

Metric	Value
F1 (Stacked 17k)	0.970
Precision	0.968
Recall	0.972
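
As a quick consistency check, F1 is the harmonic mean of precision and recall, and the reported precision and recall reproduce the reported F1 to three decimals. This sketch assumes the card's F1 is the standard (β = 1) measure computed with the same averaging as precision and recall, which the card does not state.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.968, 0.972  # self-reported values from this card
f1 = f1_score(precision, recall)
print(round(f1, 3))  # 0.97
```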