MindE Civil Complaint Classifier (bert-v9)

A KLUE BERT-based model that automatically classifies Korean public civil complaints into 11 categories.

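Categories (11)

ID	Category (label)	Per-class F1
1	교통 (Transportation)	0.882
2	건축 (Building)	0.755
3	행정 (Administration)	0.812
4	보건위생 (Public health and sanitation)	0.911
5	환경 (Environment)	0.874
6	문화_여가 (Culture and leisure)	0.825
7	농축산 (Agriculture and livestock)	0.909
8	복지 (Welfare)	0.866
9	세무 (Taxation)	0.974
10	상하수도 (Water and sewerage)	0.921
11	경제 (Economy)	0.874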

Test set (20,788 samples)

Accuracy: 0.871
Macro F1: 0.873
Weighted F1: 0.871
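
As a quick consistency check, the macro F1 reported here should equal the unweighted mean of the eleven per-class F1 scores in the category table. A minimal sketch using only values from this card (the variable names are illustrative):

```python
# Per-class F1 scores copied from the category table (IDs 1-11).
per_class_f1 = [0.882, 0.755, 0.812, 0.911, 0.874, 0.825,
                0.909, 0.866, 0.974, 0.921, 0.874]

# Macro F1 is the unweighted mean over classes.
macro_f1 = sum(per_class_f1) / len(per_class_f1)
print(round(macro_f1, 3))  # 0.873
```

Weighted F1 cannot be recomputed the same way, because the per-class test support is not listed in this card.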

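Training data

AI Hub dataset No. 143, "민원 업무 효율, 자동화를 위한 언어 AI 학습데이터" (language AI training data for civil complaint work efficiency and automation): ~860k samples, 18 categories mapped to 11
8:1:1 split at the group_id level, plus a cap of 20k train samples per category
Masking tokens (#@주소#, etc.) replaced with special tokens ([ADDR], etc.)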
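
The preprocessing steps listed here could look roughly like the sketch below. It is not the project's actual pipeline: the function names, the `category` row field, and the hash-based split are assumptions for illustration, and only the `#@주소#` to `[ADDR]` pair is taken from this card (the other mask tokens are not listed). The cap keeps the first rows it sees, so in practice the rows would probably be shuffled first.

```python
import hashlib
from collections import defaultdict

# Only this pair is named in the card; any further entries would be guesses.
MASK_TO_SPECIAL = {"#@주소#": "[ADDR]"}

def replace_masks(text, mapping=MASK_TO_SPECIAL):
    """Swap dataset masking tokens for tokenizer special tokens."""
    for mask, special in mapping.items():
        text = text.replace(mask, special)
    return text

def assign_split(group_id):
    """Map a group_id to train/valid/test at roughly 8:1:1.

    Hashing the group_id (an assumed approach) keeps every sample of a
    group in the same split, which is the point of a group-level split.
    """
    digest = hashlib.md5(str(group_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10
    if bucket < 8:
        return "train"
    return "valid" if bucket == 8 else "test"

def cap_per_category(rows, cap=20_000):
    """Keep at most `cap` rows per category, in the order given."""
    counts = defaultdict(int)
    kept = []
    for row in rows:
        if counts[row["category"]] < cap:
            counts[row["category"]] += 1
            kept.append(row)
    return kept
```

For the new special tokens to be treated as single tokens, they would also need to be registered with the tokenizer (for example via `tokenizer.add_special_tokens`) and the model's embeddings resized with `model.resize_token_embeddings(len(tokenizer))`.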

Training configuration

Base: klue/bert-base
max_length: 128
batch_size: 32
epochs: 3
learning_rate: 2e-5
warmup_ratio: 0.1
weight_decay: 0.01
Training time: ~45 min (RTX 4060 Ti)
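
To get a feel for what these settings imply, the sketch below derives an upper bound on the number of optimizer steps and warmup steps. It assumes every category reaches the 20k train cap, a single device, and no gradient accumulation; none of these is confirmed by this card, so the numbers are a ceiling rather than the actual run.

```python
import math

# Values taken from this card.
BATCH_SIZE = 32
EPOCHS = 3
WARMUP_RATIO = 0.1
NUM_CATEGORIES = 11
TRAIN_CAP_PER_CATEGORY = 20_000

# Assumption: every category hits the cap, so this is an upper bound.
max_train_examples = NUM_CATEGORIES * TRAIN_CAP_PER_CATEGORY

# Assumption: one device, no gradient accumulation.
steps_per_epoch = math.ceil(max_train_examples / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

print(steps_per_epoch, total_steps, warmup_steps)  # 6875 20625 2063
```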

Usage example

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("atti433/minde-classifier")
model = AutoModelForSequenceClassification.from_pretrained("atti433/minde-classifier")

# "A car keeps parking illegally in front of my house, which is very inconvenient."
text = "집 앞에 차가 자꾸 불법주차해서 너무 불편합니다."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)
# Labels are listed in category-ID order (class index 0 corresponds to ID 1).
labels = ['교통','건축','행정','보건위생','환경','문화_여가','농축산','복지','세무','상하수도','경제']
pred = labels[probs.argmax().item()]
# Print the predicted category and its probability.
print(pred, probs.max().item())

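Alternatively, use this project's chatbot_service.classify_complaint().

Limitations

The training data (AI Hub 143) consists mainly of Changwon City complaints, so the model may be biased toward regional vocabulary.
The "건축" (Building) category has the lowest F1 (0.755). This reflects label noise: the 안전건설과 (Safety and Construction Division) raw_category had road and facility complaints mixed in.
Homonyms and short texts (e.g., "신호등", "traffic light") get low confidence. Taking the top-3 predictions and letting an LLM decide is recommended.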
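
The top-3 recommendation can be sketched as follows. This is an illustration rather than the project's implementation: the helper names, the 0.6 confidence threshold, and the example logits are all made up, and the softmax is written in plain Python so the snippet runs without a model.

```python
import math

LABELS = ['교통', '건축', '행정', '보건위생', '환경', '문화_여가',
          '농축산', '복지', '세무', '상하수도', '경제']

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, labels=LABELS, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(zip(labels, softmax(logits)),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

def needs_llm_review(candidates, threshold=0.6):
    """Flag low-confidence predictions; the threshold is a hypothetical value."""
    return candidates[0][1] < threshold

# Made-up logits for illustration only, not real model output.
example_logits = [2.0, 1.5, 0.1, -1.0, 1.8, 0.0, -0.5, 0.2, -2.0, 0.3, 0.4]
candidates = top_k(example_logits)
print([label for label, _ in candidates])  # ['교통', '환경', '건축']
print(needs_llm_review(candidates))        # True
```

With the usage example in this card, the input would be `model(**inputs).logits[0].tolist()`, and the three candidates could then be passed to an LLM together with the complaint text.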

Model size: 0.1B params
Tensor type: F32 (Safetensors)
