Smishing Detection RoBERTa Base 🛡️📱

📑 Model Description

This model is a fine-tuned version of klue/roberta-base for detecting smishing (SMS phishing) messages in real time.
It analyzes the context of a Korean text message and classifies it as either a normal conversation or a malicious smishing attempt.

It was developed as part of the "Smishing Forecast: Self-Evolving AI-Powered Smishing Defense System" project. Its performance was refined through adversarial training between attack scenarios based on recent news (Red Team) and the defense system that responds to them (Blue Team).

  • Developed by: Donghyun Hwang (and Smishing Forecast Team)
  • Model Type: Text Classification (Binary)
  • Language: Korean
  • Base Model: klue/roberta-base

🎯 Intended Uses & Limitations

Intended Use

  • Smishing detection: determine whether text received via SMS, messenger apps, and similar channels is smishing
  • Security applications: backend model for mobile security apps and spam filtering systems
  • Financial fraud prevention: detection of bank impersonation, loan fraud, KakaoTalk acquaintance impersonation, and similar scams

Limitations

  • Data bias: most of the training data is synthetic data generated with GPT-4. Performance on real-world ("wild") data may therefore be somewhat lower (possible overfitting).
  • New attack types: the detection rate may be low for novel attack patterns that were not seen during training.

📚 Training Data

The training data consists of a synthetic dataset of more than 3,000 samples generated with GPT-4.

  • Normal (Label 0): everyday conversation, parcel delivery notifications, card payment messages, Korea Meteorological Administration alerts, etc.
  • Smishing (Label 1):
    • Government agency impersonation (subsidy applications, etc.)
    • Family/acquaintance impersonation (broken phone screen, urgent money requests)
    • Financial institution impersonation (low-interest loans, fake payment approvals)
    • Family event impersonation (mobile wedding invitations, obituary notices)

📊 Evaluation Results

Performance on a synthetic test dataset (100 samples). (Note: these results are optimized for synthetic data and may differ from performance in real-world environments.)

Metric      Score
Precision   1.00
Recall      1.00
F1-Score    1.00
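
For readers who want to recompute these metrics on their own labeled messages, the following is a minimal sketch of the standard precision, recall, and F1 definitions for the smishing class (Label 1). The function name `precision_recall_f1` and the label lists are hypothetical and for illustration only; they are not the model's actual test set or the project's evaluation script.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive (smishing) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical labels for illustration only (0 = Normal, 1 = Smishing)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"Precision {precision:.2f}  Recall {recall:.2f}  F1-Score {f1:.2f}")
```

With 3 true positives, 1 false positive, and 1 false negative, all three values come out to 0.75 in this toy case. A score of 1.00 on all three means no false positives and no false negatives on the evaluated set.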

🚀 How to Use

The model can be used easily through the Python transformers library.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import re

# 1. Load the model and tokenizer
model_name = "donghyun95/smishing-detection-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# 2. Preprocessing function (removing special characters is recommended)
def preprocess(text):
    text = re.sub(r'[^가-힣a-zA-Z0-9\s]', '', text)  # Remove special characters
    return text.strip()

# 3. Prediction function
def predict_smishing(text):
    clean_text = preprocess(text)
    inputs = tokenizer(clean_text, return_tensors="pt", truncation=True, max_length=128)

    with torch.no_grad():
        outputs = model(**inputs)

    probs = torch.softmax(outputs.logits, dim=1)
    smishing_prob = probs[0][1].item()  # Label 1 is smishing

    return smishing_prob

# 4. Test
# Sample (Korean): "Mom, my phone broke so I sent it in for repair. Text me at this number."
sample_text = "엄마 나 폰 고장나서 수리맡겼어. 이 번호로 문자줘."
probability = predict_smishing(sample_text)

print(f"Smishing probability: {probability * 100:.2f}%")
if probability > 0.7:
    print("🚨 Suspected smishing message!")
else:
    print("✅ Normal message.")

โš ๏ธ Disclaimer

์ด ๋ชจ๋ธ์€ ์—ฐ๊ตฌ ๋ฐ ๊ต์œก ๋ชฉ์ ์œผ๋กœ ๊ฐœ๋ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์‹ค์ œ ๊ธˆ์œต ๊ฑฐ๋ž˜๋‚˜ ๋ณด์•ˆ ์‹œ์Šคํ…œ์— ๋‹จ๋…์œผ๋กœ ์˜์กดํ•˜์—ฌ ์‚ฌ์šฉํ•˜๊ธฐ์—๋Š” ์œ„ํ—˜์ด ๋”ฐ๋ฅผ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๋ณด์กฐ์ ์ธ ์ˆ˜๋‹จ์œผ๋กœ ํ™œ์šฉํ•˜๋Š” ๊ฒƒ์„ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ–Š๏ธ Citation

If you use this model in your research or project, please cite it as follows:

BibTeX:

@misc{smishing-forecast-2026,
  author = {Hwang, Donghyun and Cho, Eunkyung and Ahn, Seongmin and Hwang, Sunwoo},
  title = {Smishing Forecast: Self-Evolving AI-Powered Smishing Defense System},
  year = {2026},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/DongHyun925/SmishingForecast}}
}

APA: Hwang, D., Cho, E., Ahn, S., & Hwang, S. (2026). Smishing Forecast: Self-Evolving AI-Powered Smishing Defense System. GitHub. https://github.com/DongHyun925/SmishingForecast

📜 License

MIT License
