---
language:
- ko
license: other
library_name: transformers
pipeline_tag: text-classification
base_model: klue/bert-base
tags:
- bert
- klue
- korean
- text-classification
- minwon
- complaint
- public-administration
---

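# MindE Civil Complaint Classifier (bert-v9)

A KLUE BERT-based model that automatically classifies Korean public-sector civil complaints (민원, minwon) into 11 categories.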

## Categories (11)

| ID | Category (label) | English | Per-class F1 |
|---:|---|---|---:|
| 1 | 교통 | Traffic | 0.882 |
| 2 | 건축 | Construction | 0.755 |
| 3 | 행정 | Administration | 0.812 |
| 4 | 보건위생 | Health and sanitation | 0.911 |
| 5 | 환경 | Environment | 0.874 |
| 6 | 문화_여가 | Culture and leisure | 0.825 |
| 7 | 농축산 | Agriculture and livestock | 0.909 |
| 8 | 복지 | Welfare | 0.866 |
| 9 | 세무 | Taxation | 0.974 |
| 10 | 상하수도 | Water and sewerage | 0.921 |
| 11 | 경제 | Economy | 0.874 |

Test set (20,788 samples):
- Accuracy: 0.871
- Macro F1: 0.873
- Weighted F1: 0.871
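As a quick consistency check, macro F1 is the unweighted mean of the per-class F1 scores, so it can be recomputed from the category table. The sketch below only re-uses the numbers reported above; weighted F1 cannot be reproduced this way because the per-class test counts are not listed.

```python
# Per-class F1 scores copied from the category table.
per_class_f1 = {
    "교통": 0.882, "건축": 0.755, "행정": 0.812, "보건위생": 0.911,
    "환경": 0.874, "문화_여가": 0.825, "농축산": 0.909, "복지": 0.866,
    "세무": 0.974, "상하수도": 0.921, "경제": 0.874,
}

# Macro F1 = plain average over classes, ignoring class frequency.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# The weakest class is the one with the lowest F1.
worst_label = min(per_class_f1, key=per_class_f1.get)

print(round(macro_f1, 3), worst_label)  # 0.873 건축
```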

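## Training data

- AI Hub dataset No. 143, "민원 업무 효율, 자동화를 위한 언어 AI 학습데이터" (language AI training data for efficient, automated civil complaint handling): ~860,000 records; 18 source categories mapped to 11
- 8:1:1 split at the `group_id` level, plus a cap of 20k training samples per category
- Masking tokens (`#@주소#`, etc.) are replaced with special tokens (`[ADDR]`, etc.)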
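The preprocessing steps above could look roughly like the following. This is a minimal sketch, not the project's actual pipeline: the function names, the `val` split name, the seed, and the record layout (dicts with `group_id` and `label` keys) are assumptions for illustration, and only the `#@주소#` to `[ADDR]` pair is documented here, so the mapping would need to be extended with the dataset's other masking tokens.

```python
import random
import re
from collections import defaultdict

# Only this pair is documented in the model card; add the others as needed.
MASK_TO_SPECIAL = {"#@주소#": "[ADDR]"}
_MASK_PATTERN = re.compile("|".join(re.escape(m) for m in MASK_TO_SPECIAL))


def replace_mask_tokens(text):
    """Swap dataset masking tokens for tokenizer special tokens."""
    return _MASK_PATTERN.sub(lambda m: MASK_TO_SPECIAL[m.group(0)], text)


def split_by_group(records, seed=42, ratios=(0.8, 0.1)):
    """8:1:1 split over group IDs, so no group_id appears in two splits."""
    group_ids = sorted({r["group_id"] for r in records})
    random.Random(seed).shuffle(group_ids)
    n_train = int(len(group_ids) * ratios[0])
    n_val = int(len(group_ids) * ratios[1])
    split_of = {}
    for i, gid in enumerate(group_ids):
        if i < n_train:
            split_of[gid] = "train"
        elif i < n_train + n_val:
            split_of[gid] = "val"
        else:
            split_of[gid] = "test"
    splits = defaultdict(list)
    for r in records:
        splits[split_of[r["group_id"]]].append(r)
    return splits


def cap_per_category(train_records, cap=20_000, seed=42):
    """Keep at most `cap` randomly chosen training records per label."""
    by_label = defaultdict(list)
    for r in train_records:
        by_label[r["label"]].append(r)
    rng = random.Random(seed)
    capped = []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        capped.extend(items[:cap])
    return capped
```

Splitting on `group_id` rather than on individual records keeps related complaints out of both train and test, which would otherwise inflate the test scores.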

## Training configuration

- Base: `klue/bert-base`
- max_length: 128
- batch_size: 32
- epochs: 3
- learning_rate: 2e-5
- warmup_ratio: 0.1
- weight_decay: 0.01
- Training time: ~45 min (RTX 4060 Ti)
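To get a feel for the schedule these settings imply, the sketch below estimates the number of optimizer steps. It rests on assumptions not stated in this card: that all 11 categories reach the 20k train cap, that training runs on a single GPU, and that there is no gradient accumulation. If any category has fewer than 20k training samples, the real counts are lower.

```python
import math

NUM_CATEGORIES = 11
TRAIN_CAP_PER_CATEGORY = 20_000  # assumption: every category hits the cap
BATCH_SIZE = 32
EPOCHS = 3
WARMUP_RATIO = 0.1

# Upper bound on the training set size under the per-category cap.
max_train_samples = NUM_CATEGORIES * TRAIN_CAP_PER_CATEGORY

# One optimizer step per batch (no gradient accumulation assumed).
steps_per_epoch = math.ceil(max_train_samples / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

# Approximate number of steps spent ramping the learning rate up to 2e-5.
warmup_steps = int(total_steps * WARMUP_RATIO)

print(steps_per_epoch, total_steps, warmup_steps)  # 6875 20625 2062
```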

## Usage example

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("atti433/minde-classifier")
model = AutoModelForSequenceClassification.from_pretrained("atti433/minde-classifier")

# "Cars keep parking illegally in front of my house, which is very inconvenient."
text = "집 앞에 차가 자꾸 불법주차해서 너무 불편합니다."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():  # inference only, so skip gradient tracking
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)
# Labels are listed in the same order as the category table above.
labels = ['교통','건축','행정','보건위생','환경','문화_여가','농축산','복지','세무','상하수도','경제']
pred = labels[probs.argmax().item()]
print(pred, probs.max().item())  # predicted category and its probability
```

Alternatively, use this project's `chatbot_service.classify_complaint()`.

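## Limitations

- The training data (AI Hub 143) consists mainly of complaints from Changwon City, so the model may be biased toward regional vocabulary.
- The "건축" (construction) category has the lowest F1 at 0.755. This is attributed to label noise: road and facility complaints were mixed into the 안전건설과 (Safety and Construction Division) `raw_category`.
- Homonyms and short texts (e.g. "신호등", "traffic light") receive low confidence. Taking the top-3 predictions and letting an LLM make the final call is recommended.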
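The top-3 recommendation above could be implemented along these lines. This is a sketch under stated assumptions: the helper names and the 0.6 confidence threshold are hypothetical, and the probabilities are made-up illustrative values, not model output. With the usage example's tensors, `probs[0].tolist()` would supply the input list.

```python
def top_k_predictions(probs, labels, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def needs_llm_review(probs, threshold=0.6):
    """True when the best probability is below the (hypothetical) threshold."""
    return max(probs) < threshold


labels = ['교통', '건축', '행정', '보건위생', '환경', '문화_여가',
          '농축산', '복지', '세무', '상하수도', '경제']

# Illustrative probabilities for an ambiguous short input (not real output).
example_probs = [0.41, 0.22, 0.09, 0.02, 0.05, 0.03, 0.01, 0.04, 0.01, 0.02, 0.10]

candidates = top_k_predictions(example_probs, labels)
if needs_llm_review(example_probs):
    # Hand the shortlist to a downstream LLM instead of trusting the top-1.
    print(candidates)
```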