---
language:
  - ko
  - en
tags:
  - text-classification
  - regression
  - commit-priority
  - issue-priority
license: apache-2.0
datasets:
  - custom
metrics:
  - mae
  - rmse
  - spearman
---

# Issue Priority Predictor (Korean)

**A Korean/English model that automatically predicts the priority of commits and issues**

## Model Details

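This multilingual model predicts a priority score from GitHub commit text.

It is based on `distilbert-base-multilingual-cased` and was fine-tuned on commit data written in Korean and English.

For each input text, the model outputs a continuous score in the 0 to 1 range; a higher score means a relatively higher priority.
The final priority class (HIGH / MED / LOW) is expected to be decided by a post-processing policy suited to the service environment.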

## Evaluation Metrics

The metrics below were computed on priority scores scaled to the 0 to 1 range.

- Loss: 0.0045
- MAE (mean absolute error): 0.0122
- RMSE (root mean squared error): 0.0150
- Spearman correlation: 0.8473
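
For readers who want to compute the same three metrics on their own held-out data, the following is a minimal standard-library sketch. The function names and the toy values are illustrative assumptions, not the evaluation code or data behind the numbers above; `scipy.stats.spearmanr` gives the same correlation if SciPy is available.

```python
import math

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are tied
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and Spearman correlation for scaled scores."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    # Spearman is the Pearson correlation of the rank-transformed values.
    spearman = pearson(average_ranks(y_true), average_ranks(y_pred))
    return {"mae": mae, "rmse": rmse, "spearman": spearman}

# Toy values for illustration only (not the model's evaluation data).
toy_true = [0.10, 0.40, 0.35, 0.80]
toy_pred = [0.12, 0.38, 0.41, 0.79]
metrics = regression_metrics(toy_true, toy_pred)
print(metrics)
```

MAE and RMSE measure how far the predicted score is from the label, while Spearman only looks at ordering, which is the property that matters most when scores are later turned into a ranking.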

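**Note**
This model does not classify priority directly. It is designed so that post-processing reflecting
domain policy (security, payments, outages, documentation changes, and so on) is applied to the predicted score.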

## 🚀 Quick Start

### Model prediction (score only)

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import json

# Load the model
model_name = "your-username/issue-priority-ko"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Predict (score only)
text = "로그인 안됨, 토큰 만료 처리 필요"  # "Login fails, token expiry handling needed"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    score_raw = model(**inputs).logits.item()  # score in the 0 to 1 range

# Restore the original scale
with open("score_thresholds.json", "r", encoding="utf-8") as f:
    thresholds = json.load(f)

score = score_raw * (thresholds["train_max"] - thresholds["train_min"]) + thresholds["train_min"]

print(f"Predicted Score: {score:.4f}")
```
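
The scale-restoration step above is the inverse of min-max scaling. As a small sketch, it can be wrapped in a helper; `restore_scale` and its `clip` option are hypothetical names introduced here, not part of the repository. Clipping is an optional design choice: a linear regression head is unbounded, so a raw output can fall slightly outside 0 to 1.

```python
def restore_scale(score_raw, train_min, train_max, clip=True):
    """Map a min-max scaled score back to the original training scale."""
    if train_max <= train_min:
        raise ValueError("train_max must be greater than train_min")
    if clip:
        # Clamp to the scaled range before inverting the min-max transform.
        score_raw = min(1.0, max(0.0, score_raw))
    return score_raw * (train_max - train_min) + train_min

# Placeholder bounds for illustration; real values come from score_thresholds.json.
example = restore_scale(0.5, train_min=2.0, train_max=10.0)
print(example)  # 6.0
```

With the Quick Start variables, the call would be `restore_scale(score_raw, thresholds["train_min"], thresholds["train_max"])`.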

### Score → class conversion (post-processing)

```python
# Option 1: use the to_priority function (recommended)
from postprocess.to_priority import to_priority

# Basic conversion (no post-processing rules)
priority = to_priority(score=score, text=text)
print(f"Priority: {priority}")

# With post-processing rules (optional)
priority = to_priority(score=score, text=text, use_rules=True)
print(f"Priority (with rules): {priority}")
```

```python
# Option 2: convert directly using the thresholds loaded from score_thresholds.json
if score >= thresholds["q_high"]:
    priority = "HIGH"
elif score <= thresholds["q_low"]:
    priority = "LOW"
else:
    priority = "MED"
```

## 📋 Model Information

| Item | Details |
|------|------|
| **Base model** | `distilbert-base-multilingual-cased` |
| **Task type** | Regression |
| **Input** | Commit/issue title + body text |
| **Output** | Priority score (float) |
| **Class conversion** | Done in post-processing (`to_priority()` function) |
| **Languages** | Korean, English |
| **Max length** | 256 tokens |

> **Important**: The model outputs a score only. Use the `to_priority()` function to convert it to a HIGH/MED/LOW class.

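## 🎯 Key Features

1. **Multilingual support**: handles commits and issues written in Korean or English
2. **Keyword-based post-processing**: rules can be customized in `postprocess/priority_rules.yaml`
3. **Relative ordering within a batch**: comparing several issues together yields more accurate priorities
4. **Lightweight model**: fast inference thanks to the DistilBERT base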

## 📁 Folder Structure

```
issue-priority-ko/
├── README.md                # This file
├── config.json              # Model configuration
├── model.safetensors        # Model weights
├── tokenizer.json           # Tokenizer
├── tokenizer_config.json
├── vocab.txt
├── score_thresholds.json    # Thresholds for priority conversion
│
├── postprocess/             # Post-processing rules (optional)
│   ├── to_priority.py       # Score → class conversion function
│   ├── priority_rules.yaml  # Keyword-based rules (optional)
│   └── README.md            # Post-processing documentation
│
├── examples/                # Usage examples
│   ├── input.json
│   └── output.json
│
└── requirements.txt         # Package dependencies
```

## 🔄 Score → Class Conversion

### Using the `to_priority()` function

```python
from postprocess.to_priority import to_priority

# Basic conversion (threshold-based)
priority = to_priority(score=0.82, text="로그인 에러 발생")  # "Login error occurred"

# With post-processing rules (optional)
priority = to_priority(score=0.82, text="로그인 에러 발생", use_rules=True)

# Batch conversion
from postprocess.to_priority import to_priority_batch
scores = [0.82, 0.75, 0.90]
texts = ["로그인 에러", "README 수정", "서버 다운"]  # "Login error", "README edit", "Server down"
priorities = to_priority_batch(scores, texts, use_rules=True)
```

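### Post-processing rules (optional)

Keyword-based rules can be applied through `postprocess/priority_rules.yaml`.

**Example rules:**
- **Force LOW**: `readme`, `typo`, `문서` ("docs") → always LOW
- **Guarantee at least MED**: `장애` ("outage"), `에러` ("error"), `로그인` ("login"), `결제` ("payment") → at least MED
- **Boost to HIGH**: `데이터 손실` ("data loss"), `무한` ("infinite"), `critical` → HIGH

See [`postprocess/README.md`](postprocess/README.md) for details.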
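
To illustrate how the three kinds of rules could interact, here is a minimal sketch. It is not the code in `postprocess/to_priority.py`: the `RULES` dictionary, its key names, the function `apply_keyword_rules`, and the precedence (force LOW first, then boost to HIGH, then the MED floor) are all assumptions, and the real `priority_rules.yaml` schema may differ.

```python
ORDER = {"LOW": 0, "MED": 1, "HIGH": 2}

# Hypothetical rule table mirroring the examples above; the real
# priority_rules.yaml schema may differ.
RULES = {
    "force_low": ["readme", "typo", "문서"],
    "min_med": ["장애", "에러", "로그인", "결제"],
    "boost_high": ["데이터 손실", "무한", "critical"],
}

def apply_keyword_rules(priority, text, rules=RULES):
    """Adjust a threshold-based class using case-insensitive keyword matches."""
    lowered = text.lower()

    def hit(key):
        return any(keyword in lowered for keyword in rules[key])

    if hit("force_low"):
        return "LOW"  # assumed to win over every other rule
    if hit("boost_high"):
        return "HIGH"
    if hit("min_med") and ORDER[priority] < ORDER["MED"]:
        return "MED"  # raise LOW to MED, never lower an existing HIGH
    return priority

print(apply_keyword_rules("HIGH", "README 수정"))      # LOW
print(apply_keyword_rules("LOW", "로그인 에러 발생"))   # MED
```

Simple substring matching keeps the sketch short; a production rule set would likely need word-boundary handling so that, for example, a keyword inside an unrelated longer word does not trigger a rule.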

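## 📊 Performance Metrics

| Metric | Value |
|------|-----|
| **MAE** | 0.009 (on scaled values) |
| **RMSE** | 0.015 (on scaled values) |
| **Spearman Correlation** | 0.85 |

> **Note**: The model is better suited to predicting relative rank. Comparing issues within a batch is recommended over relying on absolute scores.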
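
Since relative rank is the more reliable signal, one simple way to use the scores is to sort a batch into a triage queue. The helper below is a sketch with a hypothetical name (`rank_issues`); the texts and scores reuse the placeholder values from the batch conversion example and are not real model output.

```python
def rank_issues(texts, scores):
    """Return (rank, score, text) tuples ordered from highest to lowest score."""
    ordered = sorted(zip(scores, texts), key=lambda pair: pair[0], reverse=True)
    return [(rank, score, text) for rank, (score, text) in enumerate(ordered, start=1)]

# Placeholder inputs for illustration only.
texts = ["로그인 에러", "README 수정", "서버 다운"]
scores = [0.82, 0.75, 0.90]

queue = rank_issues(texts, scores)
for rank, score, text in queue:
    print(rank, f"{score:.2f}", text)
```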

## 💡 Usage Tips

### 1. Single prediction
```python
# Model prediction (tokenizer, model, and thresholds are loaded as in Quick Start)
text = "로그인 안됨"  # "Cannot log in"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    score_raw = model(**inputs).logits.item()

# Restore the original scale
score = score_raw * (thresholds["train_max"] - thresholds["train_min"]) + thresholds["train_min"]

# Convert to a class
from postprocess.to_priority import to_priority
priority = to_priority(score=score, text=text, use_rules=True)
```

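### 2. Batch prediction (recommended)
```python
import numpy as np
from scipy.stats import rankdata

# tokenizer, model, and thresholds are loaded as in Quick Start
texts = ["이슈1", "이슈2", "이슈3"]
inputs = tokenizer(texts, return_tensors="pt", truncation=True, max_length=256, padding=True)

with torch.no_grad():
    scores_raw = model(**inputs).logits.squeeze(-1).numpy()

# Restore the original scale
scores = scores_raw * (thresholds["train_max"] - thresholds["train_min"]) + thresholds["train_min"]

# Relative ordering within the batch (quantile-based): rank / n lies in (0, 1]
normalized = rankdata(scores, method="average") / len(scores)

# Top 30% = HIGH, bottom 30% = LOW, everything else = MED
q_high = np.percentile(normalized, 70)
q_low = np.percentile(normalized, 30)
priorities = np.where(normalized >= q_high, "HIGH", np.where(normalized <= q_low, "LOW", "MED"))
```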

### 3. Batch prediction + class conversion
```python
# Batch prediction
texts = ["이슈1", "이슈2", "이슈3"]
inputs = tokenizer(texts, return_tensors="pt", truncation=True, max_length=256, padding=True)

with torch.no_grad():
    scores_raw = model(**inputs).logits.squeeze(-1).numpy()

# Restore the original scale
scores = scores_raw * (thresholds["train_max"] - thresholds["train_min"]) + thresholds["train_min"]

# Batch class conversion
from postprocess.to_priority import to_priority_batch
priorities = to_priority_batch(scores, texts, use_rules=True)

for text, score, priority in zip(texts, scores, priorities):
    print(f"{priority}: {score:.4f} - {text}")
```

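## ⚠️ Cautions

1. **Model output**: the model outputs a score only (it is a regression model). Use the `to_priority()` function for class conversion.
2. **Scale restoration is required**: the model output is in the 0 to 1 range. Restore the original scale with the values in `score_thresholds.json`.
3. **Relative rank**: relative comparison within a batch is more accurate than absolute scores.
4. **Post-processing rules**: `priority_rules.yaml` is optional. Use it only when needed.
5. **Domain adaptation**: retraining or fine-tuning is recommended for a new domain.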

## 📚 Examples

See the [`examples/`](examples/) folder for real usage examples.

- `input.json`: example input
- `output.json`: example output

## 🔗 Related Resources

- **Conversion function**: [`postprocess/to_priority.py`](postprocess/to_priority.py) - score → class conversion
- **Post-processing rules (optional)**: [`postprocess/priority_rules.yaml`](postprocess/priority_rules.yaml)
- **Post-processing documentation**: [`postprocess/README.md`](postprocess/README.md)

## 📄 License

- Apache 2.0