---
language:
- ko
- en
tags:
- text-classification
- regression
- commit-priority
- issue-priority
license: apache-2.0
datasets:
- custom
metrics:
- mae
- rmse
- spearman
---
# Issue Priority Predictor (Korean)
**A Korean/English model that automatically predicts the priority of commits and issues**
## Model Details
This is a multilingual model that predicts a priority score from GitHub commit text.
It is based on distilbert-base-multilingual-cased and fine-tuned on commit data written in Korean and English.
The model outputs a continuous score in the 0–1 range for each input text; a higher score indicates a relatively higher priority.
The final priority class (HIGH / MED / LOW) is meant to be decided by a post-processing policy tailored to the deployment environment.
## Evaluation Metrics
The metrics below were computed on priority scores scaled to the 0–1 range.
- Loss: 0.0045
- MAE (mean absolute error): 0.0122
- RMSE (root mean squared error): 0.0150
- Spearman correlation: 0.8473
**Note**
This model does not classify priorities directly. It is designed so that domain policies (security, payments, outages, documentation changes, etc.) are applied as post-processing on top of the predicted score.
## πŸš€ Quick Start
### Model Prediction (score only)
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import json

# Load the model
model_name = "your-username/issue-priority-ko"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Predict (score only)
text = "둜그인 μ•ˆλ¨, 토큰 만료 처리 ν•„μš”"  # "Login broken, needs token-expiry handling"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    score_raw = model(**inputs).logits.item()  # score in the 0–1 range

# Restore the original scale
with open("score_thresholds.json", "r", encoding="utf-8") as f:
    thresholds = json.load(f)
score = score_raw * (thresholds["train_max"] - thresholds["train_min"]) + thresholds["train_min"]
print(f"Predicted Score: {score:.4f}")
```
### Score β†’ Class Conversion (post-processing)
```python
# Method 1: use the to_priority function (recommended)
from postprocess.to_priority import to_priority

# Basic conversion (no post-processing rules)
priority = to_priority(score=score, text=text)
print(f"Priority: {priority}")

# With post-processing rules (optional)
priority = to_priority(score=score, text=text, use_rules=True)
print(f"Priority (with rules): {priority}")
```
```python
# Method 2: direct conversion
if score >= thresholds["q_high"]:
    priority = "HIGH"
elif score <= thresholds["q_low"]:
    priority = "LOW"
else:
    priority = "MED"
```
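The repository's actual `to_priority()` implementation is not reproduced in this card. As a rough illustration only, a minimal version combining the threshold logic above with optional keyword rules might look like the sketch below; the default threshold values and the exact keyword lists are assumptions, not the shipped implementation:

```python
def to_priority(score, text, use_rules=False, q_high=0.7, q_low=0.3):
    """Map a priority score to HIGH / MED / LOW, optionally applying keyword rules."""
    # Threshold-based conversion
    if score >= q_high:
        priority = "HIGH"
    elif score <= q_low:
        priority = "LOW"
    else:
        priority = "MED"

    if use_rules:
        lowered = text.lower()
        # Hypothetical keyword rules mirroring the examples in this card
        if any(k in lowered for k in ("readme", "typo", "λ¬Έμ„œ")):
            priority = "LOW"   # force LOW for documentation-only changes
        elif any(k in lowered for k in ("데이터 손싀", "critical")):
            priority = "HIGH"  # boost severe issues to HIGH
        elif priority == "LOW" and any(k in lowered for k in ("μž₯μ• ", "μ—λŸ¬", "둜그인", "결제")):
            priority = "MED"   # guarantee at least MED for operational keywords
    return priority
```

Note that the rules run after the thresholds, so a keyword match can override the score-based class.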
## πŸ“‹ Model Information
| Item | Details |
|------|------|
| **Base model** | `distilbert-base-multilingual-cased` |
| **Task type** | Regression |
| **Input** | Commit/issue title + body text |
| **Output** | Priority score (float) |
| **Class conversion** | Done in post-processing (the `to_priority()` function) |
| **Languages** | Korean, English |
| **Max length** | 256 tokens |
> **Important**: The model outputs a score only. Use the `to_priority()` function to convert it to a HIGH/MED/LOW class.
## 🎯 Key Features
1. **Multilingual support**: handles both Korean and English commits/issues
2. **Keyword-based post-processing**: customize rules via `postprocess/priority_rules.yaml`
3. **Relative ranking within a batch**: compare multiple issues together for more accurate prioritization
4. **Lightweight model**: fast inference based on DistilBERT
## πŸ“ Folder Structure
```
issue-priority-ko/
β”œβ”€β”€ README.md                 # this file
β”œβ”€β”€ config.json               # model configuration
β”œβ”€β”€ model.safetensors         # model weights
β”œβ”€β”€ tokenizer.json            # tokenizer
β”œβ”€β”€ tokenizer_config.json
β”œβ”€β”€ vocab.txt
β”œβ”€β”€ score_thresholds.json     # thresholds for priority conversion
β”‚
β”œβ”€β”€ postprocess/              # post-processing rules (optional)
β”‚   β”œβ”€β”€ to_priority.py        # score β†’ class conversion function
β”‚   β”œβ”€β”€ priority_rules.yaml   # keyword-based rules (optional)
β”‚   └── README.md             # post-processing docs
β”‚
β”œβ”€β”€ examples/                 # usage examples
β”‚   β”œβ”€β”€ input.json
β”‚   └── output.json
β”‚
└── requirements.txt          # dependency packages
```
## πŸ”„ Score β†’ Class Conversion
### Using the `to_priority()` function
```python
from postprocess.to_priority import to_priority

# Basic conversion (threshold-based)
priority = to_priority(score=0.82, text="둜그인 μ—λŸ¬ λ°œμƒ")  # "login error occurred"

# With post-processing rules (optional)
priority = to_priority(score=0.82, text="둜그인 μ—λŸ¬ λ°œμƒ", use_rules=True)

# Batch conversion
from postprocess.to_priority import to_priority_batch
scores = [0.82, 0.75, 0.90]
texts = ["둜그인 μ—λŸ¬", "README μˆ˜μ •", "μ„œλ²„ λ‹€μš΄"]  # login error, README fix, server down
priorities = to_priority_batch(scores, texts, use_rules=True)
```
### Post-processing Rules (optional)
Keyword-based rules can be applied via `postprocess/priority_rules.yaml`.
**Rule examples:**
- **Force LOW**: `readme`, `typo`, `λ¬Έμ„œ` (docs) β†’ always LOW
- **Guarantee at least MED**: `μž₯μ• ` (outage), `μ—λŸ¬` (error), `둜그인` (login), `결제` (payment) β†’ at least MED
- **Boost to HIGH**: `데이터 손싀` (data loss), `λ¬΄ν•œ` (infinite), `critical` β†’ HIGH
See [`postprocess/README.md`](postprocess/README.md) for details.
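The rules file itself is not reproduced in this card. A `priority_rules.yaml` expressing the examples above might look roughly like the following sketch; the key names (`force_low`, `min_med`, `boost_high`) are assumptions, not the actual schema:

```yaml
# Hypothetical schema illustrating the rule types described above
force_low:      # always map to LOW
  - readme
  - typo
  - λ¬Έμ„œ
min_med:        # guarantee at least MED
  - μž₯μ• 
  - μ—λŸ¬
  - 둜그인
  - 결제
boost_high:     # boost to HIGH
  - 데이터 손싀
  - λ¬΄ν•œ
  - critical
```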
## πŸ“Š Performance
| Metric | Value |
|------|-----|
| **MAE** | 0.009 (on scaled values) |
| **RMSE** | 0.015 (on scaled values) |
| **Spearman Correlation** | 0.85 |
> **Note**: The model is better suited to relative ranking. Comparing scores within a batch is recommended over relying on absolute scores.
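To evaluate the model on your own data in the same rank-oriented spirit, the Spearman correlation between predicted and reference scores can be computed with scipy; the score arrays below are purely illustrative:

```python
from scipy.stats import spearmanr

# Illustrative predicted vs. reference priority scores
predicted = [0.82, 0.75, 0.90, 0.30, 0.55]
reference = [0.70, 0.80, 0.95, 0.25, 0.60]

# Spearman compares rank order, which matches the model's
# relative-ranking use case better than absolute-error metrics
rho, p_value = spearmanr(predicted, reference)
print(f"Spearman rho: {rho:.4f}")
```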
## πŸ’‘ Usage Tips
### 1. Single prediction
```python
# Model prediction
text = "둜그인 μ•ˆλ¨"  # "login broken"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    score_raw = model(**inputs).logits.item()

# Restore the original scale
score = score_raw * (thresholds["train_max"] - thresholds["train_min"]) + thresholds["train_min"]

# Convert to a class
from postprocess.to_priority import to_priority
priority = to_priority(score=score, text=text, use_rules=True)
```
### 2. Batch prediction (recommended)
```python
import numpy as np
from scipy.stats import rankdata

texts = ["issue 1", "issue 2", "issue 3"]
inputs = tokenizer(texts, return_tensors="pt", truncation=True, max_length=256, padding=True)
with torch.no_grad():
    scores_raw = model(**inputs).logits.squeeze(-1).numpy()

# Restore the original scale
scores = scores_raw * (thresholds["train_max"] - thresholds["train_min"]) + thresholds["train_min"]

# Relative ranking within the batch (quantile-based)
normalized = rankdata(scores, method='average') / len(scores)

# Top 30% = HIGH, bottom 30% = LOW
q_high = np.percentile(normalized, 70)
q_low = np.percentile(normalized, 30)
priorities = ["HIGH" if n >= q_high else "LOW" if n <= q_low else "MED" for n in normalized]
```
### 3. Batch prediction + class conversion
```python
# Batch prediction
texts = ["issue 1", "issue 2", "issue 3"]
inputs = tokenizer(texts, return_tensors="pt", truncation=True, max_length=256, padding=True)
with torch.no_grad():
    scores_raw = model(**inputs).logits.squeeze(-1).numpy()

# Restore the original scale
scores = scores_raw * (thresholds["train_max"] - thresholds["train_min"]) + thresholds["train_min"]

# Batch class conversion
from postprocess.to_priority import to_priority_batch
priorities = to_priority_batch(scores, texts, use_rules=True)
for text, score, priority in zip(texts, scores, priorities):
    print(f"{priority}: {score:.4f} - {text}")
```
## ⚠️ Caveats
1. **Model output**: the model outputs a score only (regression); use the `to_priority()` function for class conversion
2. **Scale restoration required**: raw model outputs are in the 0–1 range; restore the original scale with `score_thresholds.json`
3. **Relative ranking**: comparison within a batch is more accurate than absolute scores
4. **Post-processing rules**: `priority_rules.yaml` is optional; use it only when needed
5. **Domain adaptation**: retraining or fine-tuning is recommended for new domains
## πŸ“š Examples
See the [`examples/`](examples/) folder for real usage examples.
- `input.json`: example input
- `output.json`: example output
## πŸ”— Related Files
- **Conversion function**: [`postprocess/to_priority.py`](postprocess/to_priority.py) - score β†’ class conversion
- **Post-processing rules (optional)**: [`postprocess/priority_rules.yaml`](postprocess/priority_rules.yaml)
- **Post-processing docs**: [`postprocess/README.md`](postprocess/README.md)
## πŸ“„ License
- Apache 2.0