pipeline_tag: text-generation
---

<p align="left">
  <img src="https://huggingface.co/Devocean-06/Spam_Filter-gemma/resolve/main/skitty.png" width="50%"/>
</p>

# Devocean-06/Spam_Filter-gemma

**Model Developers**: SK Devocean-06 on-device LLM

## Model Information

Skitty is an explainable small language model (sLLM) designed to classify various types of spam messages and provide concise reasoning for its decisions. Instead of only labeling text as "spam" or "not spam", the model outputs a short natural-language explanation describing why a message was identified as spam.

---

## 🧠 Description

Skitty was trained on an updated 2025 spam-message dataset collected through the Smart Police Big Data Platform in South Korea. The model combines deduplication, curriculum sampling, and off-policy distillation to improve both classification accuracy and interpretability.

### Data and Preprocessing

- Data source: 2025 Smart Police Big Data Platform spam-message dataset
- Deduplication: near-duplicate removal using SimHash filtering
- Sampling strategy: curriculum-based sampling to control difficulty and improve generalization
- Labeling: hard-label supervision after label-confidence refinement
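The SimHash deduplication step can be sketched as below. This is an illustrative sketch only: the whitespace tokenizer, 64-bit hash width, and Hamming-distance threshold are assumptions, not the released pipeline.

```python
# Minimal SimHash near-duplicate filter (illustrative; tokenizer, hash
# width, and threshold are assumptions, not the production pipeline).
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """64-bit SimHash over whitespace tokens."""
    v = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

def dedupe(messages, max_distance: int = 3):
    """Keep a message only if no kept message is within max_distance bits."""
    kept, sigs = [], []
    for m in messages:
        s = simhash(m)
        if all(hamming(s, t) > max_distance for t in sigs):
            kept.append(m)
            sigs.append(s)
    return kept
```

Near-duplicates (for example, two copies of the same template with one word swapped) land within a few bits of each other and are dropped, while unrelated messages stay far apart in Hamming distance.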

### Training and Distillation

- Off-policy distillation compresses the decision process of a large teacher LLM into a smaller student model
- Rather than directly mimicking the teacher's text generation, the student distills the teacher's reasoning trace for spam detection
- Curriculum learning is combined with hard-label distillation to balance accuracy, interpretability, and generalization
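The curriculum-sampling idea above can be sketched as a difficulty-sorted dataset with a growing exposure window. The difficulty signal and the linear pacing schedule here are assumptions for illustration, not the published training recipe.

```python
# Curriculum-sampling sketch (assumed schedule, not the released recipe):
# sort examples easiest-first, then expose a growing prefix during training.

def curriculum_order(examples, difficulty):
    """Sort training examples easiest-first by a difficulty score in [0, 1]."""
    return sorted(examples, key=difficulty)

def linear_pacing(step: int, total_steps: int, dataset_size: int) -> int:
    """How many of the easiest examples are available at a given step
    (starts with the easiest 20%, grows linearly to the full dataset)."""
    frac = min(1.0, 0.2 + 0.8 * step / max(1, total_steps))
    return max(1, int(frac * dataset_size))
```

A training loop would draw each batch from the first `linear_pacing(step, ...)` examples of the sorted dataset, so early updates see only easy cases and harder ones arrive gradually.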

### Key Features

| Category | Description |
|----------|-------------|
| Model Type | sLLM (small language model for spam classification & explanation) |
| Main Function | Spam / non-spam classification with reasoning |
| Training Approach | Off-policy knowledge distillation + curriculum sampling |
| Data Cleaning | SimHash-based deduplication and quality filtering |
| Objective | A model that not only classifies spam but also explains its rationale |

---

## 🚀 Quick Start

The snippet below is a minimal usage sketch assuming the model follows the standard `transformers` causal-LM API; the prompt wording and generation settings are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Devocean-06/Spam_Filter-gemma"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

message = "Congratulations! You won a free prize. Click the link to claim it."
prompt = f"Classify the following message as spam or not spam, and explain why:\n{message}"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
# Print only the newly generated tokens (the explanation), not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```