Update README.md
README.md
CHANGED
@@ -5,7 +5,7 @@ language:
 pipeline_tag: text-generation
 ---
 <p align="left">
-<img src="https://huggingface.co/Devocean-06/Spam_Filter-gemma/
+<img src="https://huggingface.co/Devocean-06/Spam_Filter-gemma/resolve/main/skitty.png" width="50%"/>
 </p>

 # Devocean-06/Spam_Filter-gemma
@@ -30,37 +30,24 @@ pipeline_tag: text-generation
 **Model Developers**: SK Devocean-06 On-device LLM

 ## Model Information
-
-Skitty is an explainable small language model (sLLM) designed to classify various types of spam messages and provide concise reasoning for its decisions.
-Instead of only labeling text as "spam" or "not spam", the model outputs short natural-language explanations describing why the message was identified as spam.
-
+Skitty is an explainable small language model (sLLM) that classifies spam messages and provides brief reasoning for each decision.
 ---

 ## 🧠 Description
 Skitty was trained on an updated 2025 spam message dataset collected through the Smart Police Big Data Platform in South Korea.
 The model leverages deduplication, curriculum sampling, and off-policy distillation to improve both classification accuracy and interpretability.

-
+## Data and Preprocessing
 - Data source: 2025 Smart Police Big Data Platform spam message dataset
 - Deduplication: Performed near-duplicate removal using SimHash filtering
 - Sampling strategy: Applied curriculum-based sampling to control difficulty and improve generalization
 - Labeling: Trained using hard-label supervision after label confidence refinement

-
+## Training and Distillation
 - Utilized off-policy distillation to compress the decision process of a large teacher LLM into a smaller student model
 - Instead of directly mimicking the teacher’s text generation, the model distills the reasoning trace for spam detection
 - Combined curriculum learning with hard-label distillation to balance accuracy, interpretability, and generalization

-
-### Key Features
-
-| Category | Description |
-|-----------|-------------|
-| Model Type | sLLM (Small Language Model for Spam Classification & Explanation) |
-| Main Function | Spam / Non-spam classification with reasoning |
-| Training Approach | Off-policy knowledge distillation + curriculum sampling |
-| Data Cleaning | SimHash-based deduplication and quality filtering |
-| Objective | Build a model that not only classifies spam but also explains its rationale |
-
 ---

 ## 🚀 Quick Start
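The curriculum-based sampling bullet can be sketched similarly. The staged, easy-to-hard growing pool below is an assumed scheme for illustration; the difficulty score, stage count, and batch size are not taken from the model's documentation.

```python
import random

def curriculum_batches(examples, difficulty, num_stages=3, batch_size=2, seed=0):
    # Rank examples by an assumed scalar difficulty score, then sample
    # each stage's batch from a pool that grows to include harder items.
    rng = random.Random(seed)
    ranked = sorted(examples, key=difficulty)
    for stage in range(1, num_stages + 1):
        pool = ranked[: max(batch_size, len(ranked) * stage // num_stages)]
        yield [rng.choice(pool) for _ in range(batch_size)]

# Toy usage: message length stands in as a crude difficulty proxy.
messages = ["hi", "free prize!!", "ok", "meeting at 3",
            "claim your reward now", "lunch?"]
batches = list(curriculum_batches(messages, difficulty=len))
```

Early stages draw only from the easiest slice of the data, so the model sees clear-cut examples before ambiguous ones; later stages sample from the full range.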