blockenters committed · verified
Commit 82707f6 · 1 Parent(s): a9e8324

Update README.md

Files changed (1):
  1. README.md +33 -33

README.md CHANGED
@@ -14,73 +14,73 @@ metrics:
  license: apache-2.0
  ---
 
- # SMS Spam Classifier
 
- This is a fine-tuned **BERT-based multilingual model** designed for SMS spam detection. The model can classify SMS messages as either **ham (non-spam)** or **spam**. It was trained using the **`bert-base-multilingual-cased`** model from Hugging Face Transformers library.
 
  ---
 
- ## Model Details
 
- - **Base Model**: `bert-base-multilingual-cased`
- - **Task**: Sequence Classification
- - **Languages Supported**: Multilingual
- - **Number of Labels**: 2 (`ham`, `spam`)
- - **Dataset**: A cleaned SMS spam dataset.
 
  ---
 
- ## Dataset
 
- The dataset used for training and evaluation contains SMS messages labeled as `ham` (non-spam) or `spam`. The dataset was preprocessed for tokenization and split into training and evaluation subsets:
- - **Training Set**: 80%
- - **Evaluation Set**: 20%
 
  ---
 
- ## Training Configuration
 
- - **Learning Rate**: 2e-5
- - **Batch Size**: 8 (per device)
- - **Epochs**: 1
- - **Evaluation Strategy**: Per epoch
- - **Tokenizer**: `bert-base-multilingual-cased`
 
- The model was trained using the Hugging Face `Trainer` API for efficient fine-tuning.
 
  ---
 
- ## Evaluation Results
 
- The model achieved the following performance metrics during evaluation:
 
- - **Evaluation Loss**: `<add_results_here>`
- - **Accuracy**: `<add_results_here>`
- - **F1 Score**: `<add_results_here>`
 
- (Note: Replace `<add_results_here>` with actual values from the `trainer.evaluate()` results.)
 
  ---
 
- ## How to Use
 
- You can use this model directly with the Hugging Face Transformers library:
 
  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
 
- # Load the model and tokenizer
  tokenizer = AutoTokenizer.from_pretrained("blockenters/sms-spam-classifier")
  model = AutoModelForSequenceClassification.from_pretrained("blockenters/sms-spam-classifier")
 
- # Sample input
- text = "Congratulations! You've won a free ticket to Bali. Reply WIN to claim."
 
- # Tokenize and predict
  inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
  outputs = model(**inputs)
  predictions = outputs.logits.argmax(dim=-1)
 
- # Decode prediction
  label_map = {0: "ham", 1: "spam"}
- print(f"Prediction: {label_map[predictions.item()]}")
 
  license: apache-2.0
  ---
 
+ # SMS Spam Classifier
 
+ This model is a **BERT-based multilingual model** fine-tuned for SMS spam detection. It can classify SMS messages as **ham (non-spam)** or **spam**. It was trained on the **`bert-base-multilingual-cased`** model from the Hugging Face Transformers library.
 
  ---
 
+ ## Model Details
 
+ - **Base Model**: `bert-base-multilingual-cased`
+ - **Task**: Sequence Classification
+ - **Supported Languages**: Multilingual
+ - **Number of Labels**: 2 (`ham`, `spam`)
+ - **Dataset**: A cleaned SMS spam dataset
 
  ---
 
+ ## Dataset
 
+ The dataset used for training and evaluation contains SMS messages labeled `ham` (non-spam) or `spam`. After preprocessing, the data was split as follows:
+ - **Training Set**: 80%
+ - **Validation Set**: 20%
 
  ---
 
+ ## Training Configuration
 
+ - **Learning Rate**: 2e-5
+ - **Batch Size**: 8 (per device)
+ - **Epochs**: 1
+ - **Evaluation Strategy**: Per epoch
+ - **Tokenizer**: `bert-base-multilingual-cased`
 
+ The model was fine-tuned efficiently using the Hugging Face `Trainer` API.
 
  ---
 
+ ## Evaluation Results
 
+ The model showed the following performance on the validation data:
 
+ - **Evaluation Loss**: `<add_results_here>`
+ - **Accuracy**: `<add_results_here>`
+ - **F1 Score**: `<add_results_here>`
 
+ (Note: Fill in the `<add_results_here>` placeholders with the results of `trainer.evaluate()`.)
 
  ---
 
+ ## How to Use
 
+ You can use this model directly with the Hugging Face Transformers library:
 
  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
 
+ # Load the model and tokenizer
  tokenizer = AutoTokenizer.from_pretrained("blockenters/sms-spam-classifier")
  model = AutoModelForSequenceClassification.from_pretrained("blockenters/sms-spam-classifier")
 
+ # Sample input (a Korean SMS)
+ text = "์ถ•ํ•˜ํ•ฉ๋‹ˆ๋‹ค! ๋ฌด๋ฃŒ ๋ฐœ๋ฆฌ ์—ฌํ–‰ ํ‹ฐ์ผ“์„ ๋ฐ›์œผ์…จ์Šต๋‹ˆ๋‹ค. WIN์ด๋ผ๊ณ  ํšŒ์‹ ํ•˜์„ธ์š”."
 
+ # Tokenize and predict
  inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
  outputs = model(**inputs)
  predictions = outputs.logits.argmax(dim=-1)
 
+ # Decode the prediction
  label_map = {0: "ham", 1: "spam"}
+ print(f"Prediction: {label_map[predictions.item()]}")
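The usage example in the README prints only the argmax label. If a confidence score is also wanted, the raw logits can be converted to probabilities with a softmax. A minimal, dependency-free sketch on hypothetical logits (the values below are illustrative, not actual model output):

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical (ham, spam) logits, as outputs.logits[0].tolist() might return
logits = [-1.2, 2.3]
probs = softmax(logits)

label_map = {0: "ham", 1: "spam"}
pred = max(range(len(probs)), key=probs.__getitem__)
print(f"Prediction: {label_map[pred]} (confidence {probs[pred]:.3f})")
```

Because the argmax of the probabilities equals the argmax of the logits, this only adds information (the confidence) without changing the predicted label.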
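The README's 80/20 train/evaluation split can be sketched with the standard library alone; `train_eval_split` and the toy data below are illustrative stand-ins, not the actual preprocessing code:

```python
import random

def train_eval_split(examples, eval_fraction=0.2, seed=42):
    # Shuffle a copy, then take the first eval_fraction as the eval set.
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]

# Toy labeled SMS data standing in for the real dataset
data = [(f"message {i}", "spam" if i % 5 == 0 else "ham") for i in range(100)]
train_set, eval_set = train_eval_split(data)
print(len(train_set), len(eval_set))  # 80 20
```

Fixing the seed keeps the split reproducible across runs, which matters when the evaluation numbers above are meant to be comparable.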