---
language:
- ko
- en
tags:
- text-classification
- regression
- commit-priority
- issue-priority
license: apache-2.0
datasets:
- custom
metrics:
- mae
- rmse
- spearman
---
# Issue Priority Predictor (Korean)
**A Korean/English model that automatically predicts the priority of commits and issues**
## Model Details
This is a multilingual model that predicts a priority score from GitHub commit text.
It is based on distilbert-base-multilingual-cased and fine-tuned on commit data written in Korean and English.
The model outputs a continuous score in the 0–1 range for each input text; a higher score indicates a relatively higher priority.
The final priority class (HIGH / MED / LOW) is meant to be decided by a post-processing policy tailored to the deployment environment.
## Evaluation Metrics
The metrics below were computed on priority scores scaled to the 0–1 range.
- Loss: 0.0045
- MAE (mean absolute error): 0.0122
- RMSE (root mean squared error): 0.0150
- Spearman correlation: 0.8473
**Note**
This model does not classify priorities directly. It is designed so that domain policies (security, payments, outages, documentation changes, etc.) are applied as post-processing on top of the predicted score.
## πŸš€ Quick Start
### Model Prediction (score only)
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import json

# Load the model
model_name = "your-username/issue-priority-ko"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Predict (score only)
text = "둜그인 μ•ˆλ¨, 토큰 만료 처리 ν•„μš”"  # "Login broken, needs token-expiry handling"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    score_raw = model(**inputs).logits.item()  # score in the 0–1 range

# Restore the original scale
with open("score_thresholds.json", "r", encoding="utf-8") as f:
    thresholds = json.load(f)
score = score_raw * (thresholds["train_max"] - thresholds["train_min"]) + thresholds["train_min"]
print(f"Predicted Score: {score:.4f}")
```
### Score β†’ Class Conversion (post-processing)
```python
# Method 1: use the to_priority function (recommended)
from postprocess.to_priority import to_priority

# Basic conversion (no post-processing rules)
priority = to_priority(score=score, text=text)
print(f"Priority: {priority}")

# With post-processing rules (optional)
priority = to_priority(score=score, text=text, use_rules=True)
print(f"Priority (with rules): {priority}")
```
```python
# Method 2: direct conversion
if score >= thresholds["q_high"]:
    priority = "HIGH"
elif score <= thresholds["q_low"]:
    priority = "LOW"
else:
    priority = "MED"
```
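The repository's actual `to_priority()` implementation is not reproduced in this card. As a rough illustration only, a minimal version combining the threshold logic above with optional keyword rules might look like the sketch below; the default threshold values and the exact keyword lists are assumptions, not the shipped implementation:

```python
def to_priority(score, text, use_rules=False, q_high=0.7, q_low=0.3):
    """Map a priority score to HIGH / MED / LOW, optionally applying keyword rules."""
    # Threshold-based conversion
    if score >= q_high:
        priority = "HIGH"
    elif score <= q_low:
        priority = "LOW"
    else:
        priority = "MED"

    if use_rules:
        lowered = text.lower()
        # Hypothetical keyword rules mirroring the examples in this card
        if any(k in lowered for k in ("readme", "typo", "λ¬Έμ„œ")):
            priority = "LOW"   # force LOW for documentation-only changes
        elif any(k in lowered for k in ("데이터 손싀", "critical")):
            priority = "HIGH"  # boost severe issues to HIGH
        elif priority == "LOW" and any(k in lowered for k in ("μž₯μ• ", "μ—λŸ¬", "둜그인", "결제")):
            priority = "MED"   # guarantee at least MED for operational keywords
    return priority
```

Note that the rules run after the thresholds, so a keyword match can override the score-based class.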
## πŸ“‹ Model Information
| Item | Details |
|------|------|
| **Base model** | `distilbert-base-multilingual-cased` |
| **Task type** | Regression |
| **Input** | Commit/issue title + body text |
| **Output** | Priority score (float) |
| **Class conversion** | Done in post-processing (the `to_priority()` function) |
| **Languages** | Korean, English |
| **Max length** | 256 tokens |
> **Important**: The model outputs a score only. Use the `to_priority()` function to convert it to a HIGH/MED/LOW class.
## 🎯 Key Features
1. **Multilingual support**: handles both Korean and English commits/issues
2. **Keyword-based post-processing**: customize rules via `postprocess/priority_rules.yaml`
3. **Relative ranking within a batch**: compare multiple issues together for more accurate prioritization
4. **Lightweight model**: fast inference based on DistilBERT
## πŸ“ Folder Structure
```
issue-priority-ko/
β”œβ”€β”€ README.md                 # this file
β”œβ”€β”€ config.json               # model configuration
β”œβ”€β”€ model.safetensors         # model weights
β”œβ”€β”€ tokenizer.json            # tokenizer
β”œβ”€β”€ tokenizer_config.json
β”œβ”€β”€ vocab.txt
β”œβ”€β”€ score_thresholds.json     # thresholds for priority conversion
β”‚
β”œβ”€β”€ postprocess/              # post-processing rules (optional)
β”‚   β”œβ”€β”€ to_priority.py        # score β†’ class conversion function
β”‚   β”œβ”€β”€ priority_rules.yaml   # keyword-based rules (optional)
β”‚   └── README.md             # post-processing docs
β”‚
β”œβ”€β”€ examples/                 # usage examples
β”‚   β”œβ”€β”€ input.json
β”‚   └── output.json
β”‚
└── requirements.txt          # dependency packages
```
## πŸ”„ Score β†’ Class Conversion
### Using the `to_priority()` function
```python
from postprocess.to_priority import to_priority

# Basic conversion (threshold-based)
priority = to_priority(score=0.82, text="둜그인 μ—λŸ¬ λ°œμƒ")  # "login error occurred"

# With post-processing rules (optional)
priority = to_priority(score=0.82, text="둜그인 μ—λŸ¬ λ°œμƒ", use_rules=True)

# Batch conversion
from postprocess.to_priority import to_priority_batch
scores = [0.82, 0.75, 0.90]
texts = ["둜그인 μ—λŸ¬", "README μˆ˜μ •", "μ„œλ²„ λ‹€μš΄"]  # login error, README fix, server down
priorities = to_priority_batch(scores, texts, use_rules=True)
```
### Post-processing Rules (optional)
Keyword-based rules can be applied via `postprocess/priority_rules.yaml`.
**Rule examples:**
- **Force LOW**: `readme`, `typo`, `λ¬Έμ„œ` (docs) β†’ always LOW
- **Guarantee at least MED**: `μž₯μ• ` (outage), `μ—λŸ¬` (error), `둜그인` (login), `결제` (payment) β†’ at least MED
- **Boost to HIGH**: `데이터 손싀` (data loss), `λ¬΄ν•œ` (infinite), `critical` β†’ HIGH
See [`postprocess/README.md`](postprocess/README.md) for details.
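The rules file itself is not reproduced in this card. A `priority_rules.yaml` expressing the examples above might look roughly like the following sketch; the key names (`force_low`, `min_med`, `boost_high`) are assumptions, not the actual schema:

```yaml
# Hypothetical schema illustrating the rule types described above
force_low:      # always map to LOW
  - readme
  - typo
  - λ¬Έμ„œ
min_med:        # guarantee at least MED
  - μž₯μ• 
  - μ—λŸ¬
  - 둜그인
  - 결제
boost_high:     # boost to HIGH
  - 데이터 손싀
  - λ¬΄ν•œ
  - critical
```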
## πŸ“Š Performance
| Metric | Value |
|------|-----|
| **MAE** | 0.009 (on scaled values) |
| **RMSE** | 0.015 (on scaled values) |
| **Spearman Correlation** | 0.85 |
> **Note**: The model is better suited to relative ranking. Comparing scores within a batch is recommended over relying on absolute scores.
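To evaluate the model on your own data in the same rank-oriented spirit, the Spearman correlation between predicted and reference scores can be computed with scipy; the score arrays below are purely illustrative:

```python
from scipy.stats import spearmanr

# Illustrative predicted vs. reference priority scores
predicted = [0.82, 0.75, 0.90, 0.30, 0.55]
reference = [0.70, 0.80, 0.95, 0.25, 0.60]

# Spearman compares rank order, which matches the model's
# relative-ranking use case better than absolute-error metrics
rho, p_value = spearmanr(predicted, reference)
print(f"Spearman rho: {rho:.4f}")
```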
## πŸ’‘ Usage Tips
### 1. Single prediction
```python
# Model prediction
text = "둜그인 μ•ˆλ¨"  # "login broken"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    score_raw = model(**inputs).logits.item()

# Restore the original scale
score = score_raw * (thresholds["train_max"] - thresholds["train_min"]) + thresholds["train_min"]

# Convert to a class
from postprocess.to_priority import to_priority
priority = to_priority(score=score, text=text, use_rules=True)
```
### 2. Batch prediction (recommended)
```python
import numpy as np
from scipy.stats import rankdata

texts = ["issue 1", "issue 2", "issue 3"]
inputs = tokenizer(texts, return_tensors="pt", truncation=True, max_length=256, padding=True)
with torch.no_grad():
    scores_raw = model(**inputs).logits.squeeze(-1).numpy()

# Restore the original scale
scores = scores_raw * (thresholds["train_max"] - thresholds["train_min"]) + thresholds["train_min"]

# Relative ranking within the batch (quantile-based)
normalized = rankdata(scores, method='average') / len(scores)

# Top 30% = HIGH, bottom 30% = LOW
q_high = np.percentile(normalized, 70)
q_low = np.percentile(normalized, 30)
priorities = ["HIGH" if n >= q_high else "LOW" if n <= q_low else "MED" for n in normalized]
```
### 3. Batch prediction + class conversion
```python
# Batch prediction
texts = ["issue 1", "issue 2", "issue 3"]
inputs = tokenizer(texts, return_tensors="pt", truncation=True, max_length=256, padding=True)
with torch.no_grad():
    scores_raw = model(**inputs).logits.squeeze(-1).numpy()

# Restore the original scale
scores = scores_raw * (thresholds["train_max"] - thresholds["train_min"]) + thresholds["train_min"]

# Batch class conversion
from postprocess.to_priority import to_priority_batch
priorities = to_priority_batch(scores, texts, use_rules=True)
for text, score, priority in zip(texts, scores, priorities):
    print(f"{priority}: {score:.4f} - {text}")
```
## ⚠️ Caveats
1. **Model output**: the model outputs a score only (regression); use the `to_priority()` function for class conversion
2. **Scale restoration required**: raw model outputs are in the 0–1 range; restore the original scale with `score_thresholds.json`
3. **Relative ranking**: comparison within a batch is more accurate than absolute scores
4. **Post-processing rules**: `priority_rules.yaml` is optional; use it only when needed
5. **Domain adaptation**: retraining or fine-tuning is recommended for new domains
## πŸ“š Examples
See the [`examples/`](examples/) folder for real usage examples.
- `input.json`: example input
- `output.json`: example output
## πŸ”— Related Files
- **Conversion function**: [`postprocess/to_priority.py`](postprocess/to_priority.py) - score β†’ class conversion
- **Post-processing rules (optional)**: [`postprocess/priority_rules.yaml`](postprocess/priority_rules.yaml)
- **Post-processing docs**: [`postprocess/README.md`](postprocess/README.md)
## πŸ“„ License
- Apache 2.0