metadata
language:
  - ko
license: mit
tags:
  - spacy
  - token-classification
  - named-entity-recognition
  - korean
  - klue
  - roberta
model-index:
  - name: ner-kor-roberta_aihub_094_208_90k
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition
        metrics:
          - type: f1
            value: 0.9046
            name: F1 (overall)
          - type: precision
            value: 0.8795
            name: Precision (overall)
          - type: recall
            value: 0.9312
            name: Recall (overall)
base_model:
  - klue/roberta-base

ner-kor-roberta_aihub_094_208_90k

A Korean named entity recognition (NER) model. It uses KLUE RoBERTa-base as the backbone and was fine-tuned on the AIHub Korean NER dataset (about 900,000 sentences). Training used a spaCy 3.8 + spacy-transformers pipeline.


Supported labels

| Label | Meaning | Examples |
| --- | --- | --- |
| PER | Person | 이순신 (Yi Sun-sin), 홍길동 (Hong Gil-dong) |
| ORG | Organization or institution | 삼성전자 (Samsung Electronics), 국립중앙박물관 (National Museum of Korea) |
| LOC | Location or place name | 서울 (Seoul), 한강 (Han River), 여수 (Yeosu) |
| ADD | Address | 서울특별시 강남구 테헤란로 (Teheran-ro, Gangnam-gu, Seoul) |
| DAT | Date or period | 2024년 1월 (January 2024), 지난주 (last week) |
| TIM | Time | 오후 3시 (3 p.m.), 새벽 (dawn) |
| QT | Quantity or numeric value | 3kg, 100명 (100 people), 5천원 (5,000 won) |
| PHN | Phone number | 010-1234-5678 |
| URL | URL or email | www.example.com |

Training data example

{"text": "관광지명 38해변", "entities": [[5, 9, "LOC"]]}
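
The record above appears to use character offsets with an exclusive end index, since `text[5:9]` yields `38해변`. The following is a minimal sketch, assuming that convention, for sanity-checking records before training. The helper name `entity_texts` is hypothetical and not part of the original pipeline.

```python
import json


def entity_texts(record):
    """Return (surface_text, label) pairs for one training record.

    Assumes each entity is [start, end, label] with character offsets
    and an exclusive end index (inferred from the example record).
    """
    text = record["text"]
    pairs = []
    for start, end, label in record["entities"]:
        # Reject spans that fall outside the text or are empty
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span out of range: {start}-{end}")
        pairs.append((text[start:end], label))
    return pairs


line = '{"text": "관광지명 38해변", "entities": [[5, 9, "LOC"]]}'
record = json.loads(line)
print(entity_texts(record))  # [('38해변', 'LOC')]
```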

Performance (test set, 90,873 sentences)

| Label | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Overall | 0.8795 | 0.9312 | 0.9046 |
| ADD | 0.9990 | 0.9997 | 0.9994 |
| PHN | 0.9873 | 0.9915 | 0.9894 |
| URL | 0.9793 | 0.9833 | 0.9813 |
| TIM | 0.9202 | 0.9122 | 0.9162 |
| DAT | 0.8245 | 0.9659 | 0.8896 |
| QT | 0.8147 | 0.9163 | 0.8625 |
| LOC | 0.8182 | 0.8840 | 0.8498 |
| PER | 0.6778 | 0.7847 | 0.7273 |
| ORG | 0.6807 | 0.7338 | 0.7063 |
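
Each F1 value is consistent with the harmonic mean of its precision and recall, F1 = 2PR / (P + R). The short check below uses values copied from the rows above. Small differences in the fourth decimal are possible because the inputs are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the performance figures
rows = {
    "Overall": (0.8795, 0.9312, 0.9046),
    "PER": (0.6778, 0.7847, 0.7273),
    "ORG": (0.6807, 0.7338, 0.7063),
}

for label, (p, r, reported) in rows.items():
    print(f"{label}: computed {f1_score(p, r):.4f}, reported {reported:.4f}")
```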

Usage

Using the model directly with spaCy

import spacy

# Replace with a local model directory or an installed package name
nlp = spacy.load("path/or/model-name")

# Input: "Admiral Yi Sun-sin fought at Yeosu in Jeolla Province."
doc = nlp("이순신 장군은 전라도 여수에서 싸웠다.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# 이순신   PER
# 전라도   LOC
# 여수에서 LOC
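
For downstream use it can be convenient to keep character offsets alongside the labels. The following is a minimal sketch, assuming an already loaded `nlp` pipeline. The helpers `doc_to_record` and `annotate` are hypothetical names, and the output mirrors the `text`/`entities` layout of the training data example. The code relies only on the standard spaCy attributes `Doc.text`, `Doc.ents`, `Span.start_char`, `Span.end_char`, and `Span.label_`, plus `Language.pipe` for batching.

```python
def doc_to_record(doc):
    """Convert a spaCy Doc into {"text": ..., "entities": [[start, end, label], ...]}."""
    return {
        "text": doc.text,
        "entities": [[ent.start_char, ent.end_char, ent.label_] for ent in doc.ents],
    }


def annotate(nlp, texts, batch_size=32):
    """Run the pipeline over many texts with nlp.pipe and collect one record per text."""
    return [doc_to_record(doc) for doc in nlp.pipe(texts, batch_size=batch_size)]


# Usage (requires a loaded pipeline):
# records = annotate(nlp, ["이순신 장군은 전라도 여수에서 싸웠다."])
```

Using `nlp.pipe` rather than calling `nlp` in a loop lets spaCy batch the texts, which usually matters for transformer-based pipelines.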

Training information

| Item | Value |
| --- | --- |
| Backbone model | klue/roberta-base |
| Framework | spaCy 3.8 + spacy-transformers |
| Training data | AIHub Korean NER dataset |
| Training sentences | 726,972 |
| Validation sentences | 90,871 |
| Test sentences | 90,873 |
| Total training steps | 20,000 |
| Optimizer | Adam (lr=5e-5, warmup 250 steps) |
| Mixed precision | FP16 (mixed_precision = true) |
| Batch strategy | batch_by_padded (size=2000) |
| Gradient accumulation | 3 subbatches |
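
The settings above give a peak learning rate, a warmup length, and a total step count, but not the shape of the schedule. As an illustration only, the sketch below assumes a linear warmup followed by a linear decay to zero, which is a common choice in spaCy transformer configs. The function name `learning_rate` is hypothetical, and the actual run may have used a different schedule.

```python
def learning_rate(step, initial_rate=5e-5, warmup_steps=250, total_steps=20_000):
    """Linear warmup to initial_rate, then linear decay to zero.

    The schedule shape is an assumption. Only the peak rate, warmup
    length, and total step count come from the training settings.
    """
    if step < warmup_steps:
        # Ramp up from 0 to the peak rate over the warmup period
        factor = step / max(1, warmup_steps)
    else:
        # Decay linearly so the rate reaches 0 at total_steps
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return initial_rate * factor


for step in (0, 125, 250, 10_000, 20_000):
    print(step, f"{learning_rate(step):.2e}")
```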

Model file structure

This model is saved in spaCy format and can be loaded directly with spacy.load().

model-best/
├── config.cfg          # spaCy pipeline configuration
├── meta.json           # Model metadata and performance records
├── transformer/        # Fine-tuned klue/roberta-base weights (444MB)
├── ner/                # NER transition-based parser weights
├── doc_cleaner/        # Memory management component
└── vocab/              # Vocabulary
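
Before calling `spacy.load()`, it can help to confirm that a downloaded copy is complete. The following is a minimal sketch that checks for the entries listed in the tree above and reads `meta.json`. The helper names `missing_entries` and `read_meta` are hypothetical, and the `pipeline` key is assumed to be present as in a typical spaCy `meta.json`.

```python
import json
from pathlib import Path

# Entries listed in the directory tree above
EXPECTED = ["config.cfg", "meta.json", "transformer", "ner", "doc_cleaner", "vocab"]


def missing_entries(model_dir):
    """Return the expected entries that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED if not (root / name).exists()]


def read_meta(model_dir):
    """Load meta.json from the model directory as a dict."""
    with open(Path(model_dir) / "meta.json", encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    model_dir = "model-best"  # adjust to where the model was downloaded
    missing = missing_entries(model_dir)
    if missing:
        print("Missing entries:", missing)
    else:
        # .get() is used because the key is an assumption about meta.json
        print("Pipeline:", read_meta(model_dir).get("pipeline"))
```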

License

MIT