nebchi commited on
Commit
286eeb4
·
verified ·
1 Parent(s): 8d956e9

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +100 -22
README.md CHANGED
@@ -3,53 +3,113 @@ license: gemma
3
  language:
4
  - ko
5
  pipeline_tag: text-generation
 
 
 
 
 
 
 
6
  ---
 
7
  <p align="left">
8
  <img src="https://huggingface.co/Devocean-06/Spam_Filter-gemma/resolve/main/skitty.png" width="50%"/>
9
  </p>
10
 
11
  # Devocean-06/Spam_Filter-gemma
 
12
  > Update @ 2025.10.19: First release of Spam filter XAI
 
13
  <!-- Provide a quick summary of what the model is/does. -->
14
 
15
  **Resources and Technical Documentation**:
16
  * [Gemma3 Model](https://huggingface.co/google/gemma-3-4b-it)
 
17
 
18
  **Model Developers**: SK Devoceon-06 On device LLM
19
 
20
  ## Model Information
 
21
  - Skitty is an explainable small language model (sLLM) that classifies spam messages and provides brief reasoning for each decision.
 
22
  ---
23
 
24
  ## Description
 
25
  - Skitty was trained on an updated 2025 spam message dataset collected through the Smart Police Big Data Platform in South Korea.
26
  - The model leverages deduplication, curriculum sampling, and off-policy distillation to improve both classification accuracy and interpretability.
27
 
28
  ## Data and Preprocessing
29
- - Data source: 2025 Smart Police Big Data Platform spam message dataset
30
- - Deduplication: Performed near-duplicate removal using SimHash filtering
31
- - Sampling strategy: Applied curriculum-based sampling to control difficulty and improve generalization
32
- - Labeling: Trained using hard-label supervision after label confidence refinement
 
 
 
33
 
34
  ## Training and Distillation
 
35
  - Utilized off-policy distillation to compress the decision process of a large teacher LLM into a smaller student model
36
- - Instead of directly mimicking the teachers text generation, the model distills the reasoning trace for spam detection
37
  - Combined curriculum learning with hard-label distillation to balance accuracy, interpretability, and generalization
38
 
39
  ---
40
 
41
- ## Running with the ```pipeline``` API
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
42
 
43
- You can initialize the model and processor for inference with ```pipeline``` as follows.
44
 
45
  ```python
46
  from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
47
 
48
  MODEL_ID = "Devocean-06/Spam_Filter-gemma"
49
-
50
  tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
51
  model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
52
-
53
  pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
54
 
55
  text = "무��� 쿠폰 지급! 지금 바로 클릭하세요 👉 https://spam.link 해당 문자 스팸인가요?"
@@ -57,27 +117,45 @@ result = pipe(text, top_k=2)
57
  print(result)
58
  ```
59
 
60
- ## Runnig with the vLLM
 
61
  ```sh
62
  vllm serve Devocean-06/Spam_Filter-gemma
63
  ```
64
 
65
- **Citation**
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
66
 
67
  ```bibtex
68
- @misc {Devocean-06/Spam_Filter-gemma,
69
- author = { {SK Devoceon-06 On device LLM} },
70
- title = { Spam filter & XAI },
71
- year = 2025,
72
- url = { https://huggingface.co/Devocean-06/Spam_Filter-gemma },
73
- publisher = { Hugging Face }
74
  }
75
  ```
76
 
77
- ## Software
78
- Training was conducted using the Axolotl framework, a flexible and efficient fine-tuning system designed for large language models.
79
 
80
- Axolotl enables seamless configuration and execution of full fine-tuning, LoRA, and DPO pipelines through simple YAML-based workflows.
81
- It integrates with PyTorch and Hugging Face Transformers, supporting distributed strategies such as FSDP and DeepSpeed for optimized performance on multi-GPU environments.
82
 
83
- This framework streamlines experimentation and scaling by allowing researchers to define training parameters, datasets, and model behaviors declaratively — reducing boilerplate and ensuring reproducible results across setups.
 
3
  language:
4
  - ko
5
  pipeline_tag: text-generation
6
+ tags:
7
+ - spam-detection
8
+ - explainable-ai
9
+ - on-device
10
+ - korean
11
+ datasets:
12
+ - Devocean-06/Spam_QA-Corpus
13
  ---
14
+
15
  <p align="left">
16
  <img src="https://huggingface.co/Devocean-06/Spam_Filter-gemma/resolve/main/skitty.png" width="50%"/>
17
  </p>
18
 
19
  # Devocean-06/Spam_Filter-gemma
20
+
21
  > Update @ 2025.10.19: First release of Spam filter XAI
22
+
23
  <!-- Provide a quick summary of what the model is/does. -->
24
 
25
  **Resources and Technical Documentation**:
26
  * [Gemma3 Model](https://huggingface.co/google/gemma-3-4b-it)
27
+ * [Training Dataset](https://huggingface.co/datasets/Devocean-06/Spam_QA-Corpus)
28
 
29
  **Model Developers**: SK Devoceon-06 On device LLM
30
 
31
  ## Model Information
32
+
33
  - Skitty is an explainable small language model (sLLM) that classifies spam messages and provides brief reasoning for each decision.
34
+
35
  ---
36
 
37
  ## Description
38
+
39
  - Skitty was trained on an updated 2025 spam message dataset collected through the Smart Police Big Data Platform in South Korea.
40
  - The model leverages deduplication, curriculum sampling, and off-policy distillation to improve both classification accuracy and interpretability.
41
 
42
  ## Data and Preprocessing
43
+
44
+ - **Data source**: 2025 Smart Police Big Data Platform spam message dataset
45
+ - **Dataset**: [Devocean-06/Spam_QA-Corpus](https://huggingface.co/datasets/Devocean-06/Spam_QA-Corpus)
46
+ - **Format**: Alpaca instruction format (instruction, input, output)
47
+ - **Deduplication**: Performed near-duplicate removal using SimHash filtering
48
+ - **Sampling strategy**: Applied curriculum-based sampling to control difficulty and improve generalization
49
+ - **Labeling**: Trained using hard-label supervision after label confidence refinement
50
 
51
  ## Training and Distillation
52
+
53
  - Utilized off-policy distillation to compress the decision process of a large teacher LLM into a smaller student model
54
+ - Instead of directly mimicking the teacher's text generation, the model distills the reasoning trace for spam detection
55
  - Combined curriculum learning with hard-label distillation to balance accuracy, interpretability, and generalization
56
 
57
  ---
58
 
59
+ ## Training Configuration
60
+
61
+ ### Base Model
62
+ - **Base Model**: [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)
63
+ - **Training Framework**: [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)
64
+ - **Fine-tuning Method**: QLoRA (Quantized Low-Rank Adaptation)
65
+
66
+ ### Hyperparameters
67
+
68
+ | Parameter | Value | Description |
69
+ |-----------|-------|-------------|
70
+ | **Quantization** | 4-bit | Load pretrained model in 4-bit |
71
+ | **Adapter** | QLoRA | Low-rank adaptation method |
72
+ | **LoRA Rank (r)** | 16 | Rank of low-rank matrices |
73
+ | **LoRA Alpha** | 32 | Scaling factor for LoRA |
74
+ | **LoRA Dropout** | 0.05 | Dropout rate for LoRA layers |
75
+ | **Target Modules** | attention + MLP | Applied to q,k,v,o,up,down,gate projections |
76
+ | **Sequence Length** | 1500 | Maximum input sequence length |
77
+ | **Sample Packing** | True | Pack multiple samples into one sequence |
78
+ | **Micro Batch Size** | 10 | Batch size per GPU |
79
+ | **Gradient Accumulation** | 15 | Effective batch size: 150 |
80
+ | **Number of Epochs** | 5 | Total training epochs |
81
+ | **Learning Rate** | 2e-5 | Peak learning rate |
82
+ | **LR Scheduler** | Cosine | Cosine annealing schedule |
83
+ | **Warmup Steps** | 10 | Learning rate warmup steps |
84
+ | **Optimizer** | AdamW (8-bit) | 8-bit quantized AdamW |
85
+ | **Weight Decay** | 0.0 | L2 regularization |
86
+ | **Precision** | BF16 | Brain floating point 16 |
87
+ | **Gradient Checkpointing** | True | Save memory by recomputing gradients |
88
+ | **Flash Attention** | True | Optimized attention kernel |
89
+
90
+ ### Training Monitoring
91
+ - **Logging Steps**: 100
92
+ - **Evaluation Steps**: 50
93
+ - **Save Steps**: 50
94
+ - **Evaluation Strategy**: Steps-based
95
+ - **Tracking**: Weights & Biases (wandb)
96
+
97
+ ### Compute Resources
98
+ - Distributed training support via FSDP and DeepSpeed
99
+ - Multi-GPU optimization with DDP (Distributed Data Parallel)
100
+
101
+ ---
102
+
103
+ ## Running with the `pipeline` API
104
 
105
+ You can initialize the model and processor for inference with `pipeline` as follows.
106
 
107
  ```python
108
  from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
109
 
110
  MODEL_ID = "Devocean-06/Spam_Filter-gemma"
 
111
  tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
112
  model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
 
113
  pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
114
 
115
  text = "무��� 쿠폰 지급! 지금 바로 클릭하세요 👉 https://spam.link 해당 문자 스팸인가요?"
 
117
  print(result)
118
  ```
119
 
120
+ ## Running with vLLM
121
+
122
  ```sh
123
  vllm serve Devocean-06/Spam_Filter-gemma
124
  ```
125
 
126
+ ---
127
+
128
+ ## Software
129
+
130
+ Training was conducted using the **Axolotl framework**, a flexible and efficient fine-tuning system designed for large language models.
131
+
132
+ Axolotl enables seamless configuration and execution of full fine-tuning, LoRA, and DPO pipelines through simple YAML-based workflows. It integrates with PyTorch and Hugging Face Transformers, supporting distributed strategies such as FSDP and DeepSpeed for optimized performance on multi-GPU environments.
133
+
134
+ This framework streamlines experimentation and scaling by allowing researchers to define training parameters, datasets, and model behaviors declaratively — reducing boilerplate and ensuring reproducible results across setups.
135
+
136
+ **Key Features Used:**
137
+ - QLoRA for parameter-efficient fine-tuning
138
+ - 4-bit quantization during training
139
+ - Flash Attention for faster training
140
+ - Gradient checkpointing for memory efficiency
141
+ - Alpaca dataset format support
142
+
143
+ ---
144
+
145
+ ## Citation
146
 
147
  ```bibtex
148
+ @misc{Devocean-06/Spam_Filter-gemma,
149
+ author = { {SK Devoceon-06 On device LLM} },
150
+ title = { Spam filter & XAI },
151
+ year = 2025,
152
+ url = { https://huggingface.co/Devocean-06/Spam_Filter-gemma },
153
+ publisher = { Hugging Face }
154
  }
155
  ```
156
 
157
+ ---
 
158
 
159
+ ## License
 
160
 
161
+ This model is released under the Gemma license. Please refer to the original [Gemma license](https://ai.google.dev/gemma/terms) for usage terms and conditions.