Economic and Financial Article Classifier for Children's Educational Suitability

maninglearchine/kobert-article-classifier is a binary classification model that automatically determines whether a Korean economic or financial article is suitable for educational use by children aged 13 and under.

It was fine-tuned from klue/bert-base on a balanced sample drawn from 5,000 synthetic Korean economic and financial articles labeled with GPT-4o-mini.


Label scheme

Label                   ID  Description                                 Example keywords
적절 (appropriate)      1   Articles suitable for children's learning   saving, allowance, prices, taxes, trade, cooperatives
부적절 (inappropriate)  0   Articles unsuitable for children            ELS, leverage, short selling, DSR, derivatives, forced liquidation

Criteria for 적절 (appropriate)

  • Basic economic concepts (saving, prices, supply and demand, the role of taxes, etc.)
  • Company growth stories and startup stories
  • Environmental economics, fair trade, social enterprises
  • Content that explains everyday economic principles in simple terms

Criteria for 부적절 (inappropriate)

  • Derivatives (ELS, DLS, CFD, futures and options)
  • Speculative trading such as leverage, short selling, and margin calls
  • Complex financial regulation (Basel III, IFRS 17, DSR, etc.)
  • Corporate restructuring, bankruptcy, and court receivership

Usage

Quick start (pipeline)

from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="maninglearchine/kobert-article-classifier",
)

samples = [
    # "First steps in saving, learned through pocket money: inside a children's economics class"
    "용돈으로 배우는 저축의 첫걸음…어린이 경제교실 현장",
    # "ELS enters knock-in range: forced liquidations mount for leveraged investors"
    "ELS 녹인 구간 진입…레버리지 투자자 강제청산 속출",
]

for text in samples:
    result = clf(text)
    print(f"{result[0]['label']} ({result[0]['score']:.3f}) | {text[:30]}")

# Example output:
# 적절 (0.999) | 용돈으로 배우는 저축의 첫걸음…어린이 경제교실 현장
# 부적절 (0.998) | ELS 녹인 구간 진입…레버리지 투자자 강제청산 속출

Fine-grained control (AutoModel)

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id  = "maninglearchine/kobert-article-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

def classify(text: str, max_length: int = 64) -> dict:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding="max_length",
        max_length=max_length,
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    probs  = torch.softmax(logits, dim=-1)[0]
    label  = model.config.id2label[logits.argmax(-1).item()]
    return {
        "label":       label,
        "score":       probs.max().item(),
        # Index 0 = 부적절 (inappropriate), index 1 = 적절 (appropriate)
        "부적절_prob": probs[0].item(),
        "적절_prob":   probs[1].item(),
    }

texts = [
    # "Why do we pay taxes? The secret behind how schools, roads, and fire stations get built"
    "세금은 왜 내야 할까?…학교·도로·소방서가 만들어지는 비밀",
    # "Short-selling balance surges: rising volatility in targeted stocks heightens margin-call risk"
    "공매도 잔고 급증…타깃 종목 변동성 심화로 반대매매 위험 고조",
    # "What are supply and demand? How market prices are determined"
    "수요와 공급이란 무엇인가?…시장 가격이 결정되는 원리",
    # "Basel III capital rules tightened: banks face trillions of won in additional reserves"
    "바젤Ⅲ 자본규제 강화…은행권 수조원 추가 적립 부담",
]

for text in texts:
    r = classify(text)
    print(f"[{r['label']}] score={r['score']:.3f} | {text[:35]}")
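
If a false 적절 is considered the costlier mistake for a children's service, one option is to accept an article only when the 적절 probability clears a threshold, instead of taking the argmax. The sketch below is an illustration, not part of the released model: the helper name `is_suitable`, the 0.9 threshold, and the example logits are all assumptions, and softmax is reimplemented in plain Python so the snippet runs without downloading the model.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def is_suitable(logits, threshold=0.9):
    """Return True only if P(적절) >= threshold.

    Assumes index 0 = 부적절 and index 1 = 적절, as in the label scheme.
    The 0.9 default is a hypothetical value to tune on real articles.
    """
    return softmax(logits)[1] >= threshold

# Hypothetical logits in [부적절, 적절] order
print(is_suitable([-3.0, 3.0]))   # confident 적절, accepted
print(is_suitable([0.2, 0.5]))    # near the decision boundary, rejected at 0.9
```

The same check can be applied to the 적절 probability returned by `classify`.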

Batch processing (Excel file)

from transformers import pipeline
import pandas as pd

clf = pipeline(
    "text-classification",
    model="maninglearchine/kobert-article-classifier",
    batch_size=32,
    device=-1,   # CPU; use device=0 for the first GPU
)

df = pd.read_excel("articles.xlsx")
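# "기사본문" is the article-body column; truncate inputs to the 64-token training length
results = clf(df["기사본문"].tolist(), truncation=True, max_length=64)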

df["label"]      = [r["label"] for r in results]
df["confidence"] = [r["score"] for r in results]
df.to_excel("classified_articles.xlsx", index=False)

Training information

Dataset

Item                Details
Source data         5,000 synthetic Korean economic and financial articles
Labeling method     Automatic labeling with GPT-4o-mini
Class distribution  적절 (1): 2,500 / 부적절 (0): 2,500 (balanced 50:50)
Training sample     Balanced sample of 2,000 (1,000 per class)
Data split          Train 1,600 / Val 200 / Test 200 (8:1:1, stratified)
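
An 8:1:1 stratified split like the one listed above can be sketched as follows. This is a minimal stdlib-only illustration, not the author's script: the function name `stratified_split`, the seed, and the dummy records are assumptions. In practice `sklearn.model_selection.train_test_split` with `stratify=` would be the more usual choice.

```python
import random
from collections import defaultdict

def stratified_split(records, ratios=(8, 1, 1), seed=42):
    """Split (text, label) records into train/val/test, preserving class balance."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[1]].append(rec)

    total = sum(ratios)
    train, val, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = len(items) * ratios[0] // total
        n_val = len(items) * ratios[1] // total
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]   # remainder goes to test
    return train, val, test

# Dummy stand-in for the balanced 2,000-item sample (1,000 per class)
records = [(f"article {i}", i % 2) for i in range(2000)]
train, val, test = stratified_split(records)
print(len(train), len(val), len(test))   # 1600 200 200
```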

Training configuration

Base Model   : klue/bert-base
Max Length   : 64 tokens
Batch Size   : 32
Epochs       : 3
LR           : 3e-5
Warmup       : 10% (warmup_ratio=0.1)
Weight Decay : 0.01
Optimizer    : AdamW
Eval Strategy: epoch (best model by F1 Macro)
Device       : CPU
Train Time   : approx. 26.6 min (1,598 s)
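
For orientation, the schedule these settings imply can be worked out by hand. The sketch assumes 1,600 training examples (the Train portion of the data split), a single device, and no gradient accumulation. The helper name `schedule` is hypothetical, and the exact warmup step count inside `Trainer` may differ by a step due to rounding.

```python
def schedule(n_train, batch_size, epochs, warmup_percent):
    """Return (steps_per_epoch, total_steps, warmup_steps) using integer math."""
    steps_per_epoch = -(-n_train // batch_size)              # ceiling division
    total_steps = steps_per_epoch * epochs
    warmup_steps = -(-total_steps * warmup_percent // 100)   # ceiling of the warmup share
    return steps_per_epoch, total_steps, warmup_steps

steps_per_epoch, total_steps, warmup_steps = schedule(1600, 32, 3, 10)
print(steps_per_epoch, total_steps, warmup_steps)   # 50 150 15
```

Spread over the reported 1,598 seconds, 150 optimizer steps come to roughly 10 to 11 seconds per step on CPU, evaluation included.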

Per-epoch training log

Epoch  Train Loss  Val Loss  Val Accuracy  Val F1 Macro
1      0.1241      0.000409  1.0000        1.0000
2      0.0089      0.000233  1.0000        1.0000
3      0.0004      0.000206  1.0000        1.0000

Performance evaluation

Test set results (200 items)

Metric             Value
Accuracy           1.0000 (100%)
F1 Macro           1.0000
Precision (Macro)  1.0000
Recall (Macro)     1.0000
F1 (부적절)        1.0000
F1 (적절)          1.0000

Confusion Matrix

                  Predicted: 부적절   Predicted: 적절
Actual: 부적절          100                 0
Actual: 적절              0               100
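
The reported metrics follow directly from this confusion matrix. The sketch below recomputes them in plain Python. The function name `macro_metrics` is an assumption, and rows are taken as actual classes and columns as predicted classes, in [부적절, 적절] order.

```python
def macro_metrics(cm):
    """Compute accuracy and macro precision/recall/F1 from a square confusion matrix.

    cm[i][j] = number of examples with actual class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total

    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))   # column sum
        actual_k = sum(cm[k])                           # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    return {
        "accuracy": accuracy,
        "precision_macro": sum(precisions) / n,
        "recall_macro": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
    }

# Test-set confusion matrix: rows = actual, columns = predicted
print(macro_metrics([[100, 0], [0, 100]]))
```

With no off-diagonal entries, every metric evaluates to 1.0, matching the table of test set results.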

Comparison with TF-IDF models

Model                F1 Macro  Training time  Context understanding  Further training
LR + Word TF-IDF     1.0000    0.3 s          No                     Difficult
LinearSVC + Word     1.0000    0.2 s          No                     Difficult
RandomForest         1.0000    1.1 s          No                     Difficult
KoBERT (this model)  1.0000    1,598 s        Yes                    Fine-tunable

Because the synthetic data separates cleanly by vocabulary, every model reached 100%, so these scores do not distinguish the models. On real news articles, which mix topics and rely on context-dependent expressions, KoBERT is expected to hold a clear advantage.


Limitations and caveats

  • Trained on synthetic data: The model was trained on template-based synthetic data, not real news articles. Additional fine-tuning on real article data is recommended before production use.
  • Maximum length of 64 tokens: The model is optimized for short text. For long articles, input only the title and the first paragraph.
  • Binary classification only: Only the two classes 적절 and 부적절 are supported; finer-grained classification by age group is not.
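
The advice for long articles, passing only the title and the first paragraph, might look like the helper below. It is a sketch under assumptions: the name `build_input`, the blank-line paragraph delimiter, and the sample article are hypothetical, and the tokenizer still truncates anything beyond 64 tokens.

```python
def build_input(title: str, body: str) -> str:
    """Join the title with the first non-empty paragraph of the body.

    Paragraphs are assumed to be separated by blank lines.
    """
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    first = paragraphs[0] if paragraphs else ""
    return f"{title.strip()} {first}".strip()

# Hypothetical article: only the first paragraph should survive
title = "세금은 왜 내야 할까?"
body = "세금은 학교와 도로를 만드는 데 쓰입니다.\n\n두 번째 단락은 입력에서 제외됩니다."
print(build_input(title, body))
# 세금은 왜 내야 할까? 세금은 학교와 도로를 만드는 데 쓰입니다.
```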

Further fine-tuning

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments

model_id  = "maninglearchine/kobert-article-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = AutoModelForSequenceClassification.from_pretrained(model_id)

training_args = TrainingArguments(
    output_dir="./finetuned",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    learning_rate=1e-5,   # low LR for gradual fine-tuning
    eval_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_train_dataset,   # placeholder: your tokenized training set
    eval_dataset=your_eval_dataset,     # placeholder: your tokenized evaluation set
)
trainer.train()
model.push_to_hub("maninglearchine/kobert-article-classifier-v2")

License

MIT License: free to use, modify, and distribute.
