atti433/minde-classifier

Text Classification

public-administration

text-embeddings-inference


atti433 committed 8 days ago

Commit 2b2fbde · verified · 1 Parent(s): 6ca8609

Add model card

Files changed (1)

README.md +85 -0

README.md ADDED

	@@ -0,0 +1,85 @@

+---
+language:
+- ko
+license: other
+library_name: transformers
+pipeline_tag: text-classification
+base_model: klue/bert-base
+tags:
+- bert
+- klue
+- korean
+- text-classification
+- minwon
+- complaint
+- public-administration
+---
+# MindE Civil Complaint Classifier (bert-v9)
+A KLUE BERT-based model that automatically classifies Korean public civil complaints (민원, minwon) into **11 categories**.
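+## Categories (11)
+| ID | Category | Per-class F1 |
+|---:|---|---:|
+| 1 | 교통 (traffic) | 0.882 |
+| 2 | 건축 (construction) | 0.755 |
+| 3 | 행정 (administration) | 0.812 |
+| 4 | 보건위생 (health and sanitation) | 0.911 |
+| 5 | 환경 (environment) | 0.874 |
+| 6 | 문화_여가 (culture and leisure) | 0.825 |
+| 7 | 농축산 (agriculture and livestock) | 0.909 |
+| 8 | 복지 (welfare) | 0.866 |
+| 9 | 세무 (tax) | 0.974 |
+| 10 | 상하수도 (water supply and sewerage) | 0.921 |
+| 11 | 경제 (economy) | 0.874 |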
+**Test set (20,788 samples)**
+- Accuracy: **0.871**
+- Macro F1: **0.873**
+- Weighted F1: 0.871
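
As a consistency check that is not part of the committed README, the reported macro F1 can be reproduced as the unweighted mean of the per-class F1 values in the category table. The sketch below uses only the numbers from that table. The weighted F1 cannot be recomputed the same way, because the per-class test counts are not given in the card.

```python
# Per-class F1 scores copied from the category table in the model card.
per_class_f1 = {
    "교통": 0.882,
    "건축": 0.755,
    "행정": 0.812,
    "보건위생": 0.911,
    "환경": 0.874,
    "문화_여가": 0.825,
    "농축산": 0.909,
    "복지": 0.866,
    "세무": 0.974,
    "상하수도": 0.921,
    "경제": 0.874,
}


def macro_f1(scores):
    """Unweighted mean of per-class F1 scores."""
    return sum(scores.values()) / len(scores)


macro = macro_f1(per_class_f1)
print(f"macro F1 = {macro:.3f}")  # prints "macro F1 = 0.873"
```
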
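+## Training data
+- AI Hub No. 143, "민원 업무 효율, 자동화를 위한 언어 AI 학습데이터" (language AI training data for civil complaint work efficiency and automation): ~860,000 records, 18 categories mapped to 11
+- 8:1:1 split at the `group_id` level, plus a cap of 20k train samples per category
+- Masking tokens (e.g. `#@주소#`) replaced with special tokens (e.g. `[ADDR]`)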
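
The preprocessing steps listed above could look roughly like the following. This is a minimal sketch, not the author's pipeline and not part of the committed README. The hash-based bucketing, the split names `train`/`val`/`test`, the `category` field name, and all function names are assumptions made for illustration. Only the `#@주소#` to `[ADDR]` pair is stated in the card, so the other masking-token pairs are deliberately left out.

```python
import hashlib
from collections import Counter

# Only this pair is given in the model card; the remaining pairs are not listed there.
MASK_TO_SPECIAL = {"#@주소#": "[ADDR]"}


def replace_mask_tokens(text):
    """Replace dataset masking tokens with the tokenizer's special tokens."""
    for mask, special in MASK_TO_SPECIAL.items():
        text = text.replace(mask, special)
    return text


def assign_split(group_id):
    """Map a group to a split in roughly 8:1:1, so one group never spans splits."""
    digest = hashlib.md5(group_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10  # deterministic bucket in 0..9
    if bucket < 8:
        return "train"
    return "val" if bucket == 8 else "test"


def cap_per_category(rows, cap=20_000):
    """Keep at most `cap` rows per category, in input order."""
    counts = Counter()
    kept = []
    for row in rows:
        if counts[row["category"]] < cap:
            counts[row["category"]] += 1
            kept.append(row)
    return kept


print(replace_mask_tokens("#@주소# 앞 도로가 파손되었습니다."))
print(assign_split("group-0001"))
```

Splitting on `group_id` rather than on individual rows keeps related records out of different splits, which would otherwise inflate the test scores.
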
+## Training configuration
+- Base: `klue/bert-base`
+- max_length: 128
+- batch_size: 32
+- epochs: 3
+- learning_rate: 2e-5
+- warmup_ratio: 0.1
+- weight_decay: 0.01
+- Training time: ~45 min (RTX 4060 Ti)
+## Usage example
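+```python
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+# Load the fine-tuned tokenizer and classifier from the Hub.
+tokenizer = AutoTokenizer.from_pretrained("atti433/minde-classifier")
+model = AutoModelForSequenceClassification.from_pretrained("atti433/minde-classifier")
+# Sample complaint: "Cars keep parking illegally in front of my house, which is very inconvenient."
+text = "집 앞에 차가 자꾸 불법주차해서 너무 불편합니다."
+# Truncate to the max_length used during training.
+inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
+with torch.no_grad():
+    logits = model(**inputs).logits
+probs = torch.softmax(logits, dim=-1)
+# Category names in table order (list index = table ID - 1).
+labels = ['교통','건축','행정','보건위생','환경','문화_여가','농축산','복지','세무','상하수도','경제']
+pred = labels[probs.argmax().item()]
+# Print the predicted category and its probability.
+print(pred, probs.max().item())
+```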
+Alternatively, use this project's `chatbot_service.classify_complaint()`.
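+## Limitations
+- The training data (AI Hub 143) is centered on Changwon City complaints, so the model may be biased toward regional vocabulary.
+- The "건축" (construction) category has the lowest F1 (0.755), attributed to label noise: road and facility complaints were mixed into the 안전건설과 (Safety and Construction Division) raw_category.
+- Homonyms and short texts (e.g. "신호등", "traffic light") get low confidence. Retrieving the top-3 predictions and letting an LLM make the final decision is recommended.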
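
The top-3 recommendation in the last limitation can be sketched as follows. This is an illustration that is not part of the committed README: it uses only the standard library, the logits are made-up values rather than real model output, and `top_k` is a hypothetical helper name. How the candidates are then handed to an LLM is not specified in the card and is left out.

```python
import math

# Category names in the order used by the model card's usage example.
LABELS = ['교통', '건축', '행정', '보건위생', '환경', '문화_여가',
          '농축산', '복지', '세무', '상하수도', '경제']


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(LABELS, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up logits for an ambiguous short input; not real model output.
example_logits = [2.1, 0.3, 1.8, -0.5, 1.9, 0.0, -1.2, 0.1, -0.8, 0.2, -0.3]
for label, prob in top_k(example_logits):
    print(f"{label}: {prob:.3f}")
```

With the tensors from the usage example, `torch.topk(probs[0], k=3)` should yield the same ranking directly as values and indices.
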