---
license: gemma
language:
- ko
pipeline_tag: text-generation
tags:
- spam-detection
- explainable-ai
- on-device
- korean
datasets:
- Devocean-06/Spam_QA-Corpus
---

<p align="left">
  <img src="https://huggingface.co/Devocean-06/Spam_Filter-gemma/resolve/main/skitty.png" width="50%"/>
</p>

# Devocean-06/Spam_Filter-gemma

> Update @ 2025.10.19: First release of Spam filter XAI

<!-- Provide a quick summary of what the model is/does. -->

**Resources and Technical Documentation**:
* [Gemma3 Model](https://huggingface.co/google/gemma-3-4b-it)
* [Training Dataset](https://huggingface.co/datasets/Devocean-06/Spam_QA-Corpus)

**Model Developers**: SK Devocean-06 On-device LLM

## Model Information

- Skitty is an explainable small language model (sLLM) that classifies spam messages and provides brief reasoning for each decision.

---

## Description

- Skitty was trained on an updated 2025 spam message dataset collected through the Smart Police Big Data Platform in South Korea.
- The model leverages deduplication, curriculum sampling, and off-policy distillation to improve both classification accuracy and interpretability.

## Data and Preprocessing

- **Data source**: 2025 Smart Police Big Data Platform spam message dataset
- **Dataset**: [Devocean-06/Spam_QA-Corpus](https://huggingface.co/datasets/Devocean-06/Spam_QA-Corpus)
- **Format**: Alpaca instruction format (instruction, input, output)
- **Deduplication**: Near-duplicate removal using SimHash filtering (a sketch follows this list)
- **Sampling strategy**: Curriculum-based sampling to control difficulty and improve generalization
- **Labeling**: Hard-label supervision after label-confidence refinement
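
The preprocessing code itself is not published in this repository. The following is a minimal sketch of SimHash-based near-duplicate filtering; the hash width, distance threshold, and helper names are illustrative assumptions, not the actual pipeline.

```python
import hashlib
import re

def simhash(text: str, bits: int = 64) -> int:
    """Compute a simple SimHash fingerprint over whitespace-delimited tokens."""
    weights = [0] * bits
    for token in re.findall(r"\S+", text):
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def dedup(messages, max_distance: int = 3):
    """Keep a message only if its fingerprint is far from every kept one."""
    kept, fingerprints = [], []
    for msg in messages:
        fp = simhash(msg)
        if all(hamming(fp, prev) > max_distance for prev in fingerprints):
            kept.append(msg)
            fingerprints.append(fp)
    return kept

print(dedup([
    "무료 쿠폰 지급! 지금 바로 클릭하세요",
    "무료 쿠폰  지급! 지금 바로 클릭하세요",  # same tokens -> identical fingerprint, dropped
    "내일 회의 자료 공유드립니다",            # unrelated message, kept
]))
```

The pairwise comparison above is quadratic; on a large corpus, fingerprints are typically bucketed by bit prefixes before comparison.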

## Training and Distillation

- Utilized off-policy distillation to compress the decision process of a large teacher LLM into a smaller student model
- Instead of directly mimicking the teacher's text generation, the model distills the reasoning trace for spam detection
- Combined curriculum learning with hard-label distillation to balance accuracy, interpretability, and generalization
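
Concretely, hard-label distillation here amounts to supervised fine-tuning on teacher-produced labels and short rationales packaged in the Alpaca format described above. The field wording below is an illustrative assumption, not the exact prompt used to build Spam_QA-Corpus.

```python
import json

def to_alpaca_example(message: str, label: str, rationale: str) -> dict:
    """Package a teacher label and rationale as an Alpaca-style training record."""
    return {
        "instruction": "다음 문자가 스팸인지 판단하고 근거를 한 문장으로 설명하세요.",
        "input": message,
        "output": f"{label}. {rationale}",
    }

example = to_alpaca_example(
    message="무료 쿠폰 지급! 지금 바로 클릭하세요 👉 https://spam.link",
    label="스팸",
    rationale="무료 혜택을 미끼로 외부 링크 클릭을 유도하고 있습니다.",
)
print(json.dumps(example, ensure_ascii=False, indent=2))
```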

---

## Training Configuration

### Base Model

- **Base Model**: [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)
- **Training Framework**: [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)
- **Fine-tuning Method**: QLoRA (Quantized Low-Rank Adaptation)

### Hyperparameters

| Parameter | Value | Description |
|-----------|-------|-------------|
| **Quantization** | 4-bit | Load pretrained model in 4-bit |
| **Adapter** | QLoRA | Low-rank adaptation method |
| **LoRA Rank (r)** | 16 | Rank of low-rank matrices |
| **LoRA Alpha** | 32 | Scaling factor for LoRA |
| **LoRA Dropout** | 0.05 | Dropout rate for LoRA layers |
| **Target Modules** | attention + MLP | Applied to q, k, v, o, up, down, gate projections |
| **Sequence Length** | 1500 | Maximum input sequence length |
| **Sample Packing** | True | Pack multiple samples into one sequence |
| **Micro Batch Size** | 10 | Batch size per GPU |
| **Gradient Accumulation** | 15 | Effective batch size: 150 |
| **Number of Epochs** | 5 | Total training epochs |
| **Learning Rate** | 2e-5 | Peak learning rate |
| **LR Scheduler** | Cosine | Cosine annealing schedule |
| **Warmup Steps** | 10 | Learning rate warmup steps |
| **Optimizer** | AdamW (8-bit) | 8-bit quantized AdamW |
| **Weight Decay** | 0.0 | L2 regularization |
| **Precision** | BF16 | Brain floating point 16 |
| **Gradient Checkpointing** | True | Save memory by recomputing activations |
| **Flash Attention** | True | Optimized attention kernel |

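Training was driven by an Axolotl YAML config that is not reproduced here. As a rough Python equivalent of the table above, and assuming the standard Gemma projection-module names, the quantization and LoRA settings would look roughly like this with bitsandbytes and PEFT:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit base-model loading with BF16 compute, matching the Quantization/Precision rows
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# QLoRA adapter settings from the table: r=16, alpha=32, dropout=0.05,
# applied to the attention and MLP projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "down_proj", "gate_proj",
    ],
    task_type="CAUSAL_LM",
)
```
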
### Training Monitoring

- **Logging Steps**: 100
- **Evaluation Steps**: 50
- **Save Steps**: 50
- **Evaluation Strategy**: Steps-based
- **Tracking**: Weights & Biases (wandb)

### Compute Resources

- Distributed training support via FSDP and DeepSpeed
- Multi-GPU optimization with DDP (Distributed Data Parallel)

---

## Running with the `pipeline` API

You can initialize the model and tokenizer for inference with `pipeline` as follows.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

MODEL_ID = "Devocean-06/Spam_Filter-gemma"

# Load the fine-tuned spam classifier and its tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

text = "무료 쿠폰 지급! 지금 바로 클릭하세요 👉 https://spam.link 해당 문자 스팸인가요?"
result = pipe(text, top_k=2)

print(result)
```

## Running with vLLM

```sh
vllm serve Devocean-06/Spam_Filter-gemma
```
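
Once the server is up, it exposes vLLM's OpenAI-compatible API (on port 8000 by default). The sketch below is a minimal query against that endpoint; the prompt wording is illustrative and should match the instruction format the model was trained on.

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Devocean-06/Spam_Filter-gemma",
        "messages": [{
            "role": "user",
            "content": "무료 쿠폰 지급! 지금 바로 클릭하세요 👉 https://spam.link 해당 문자 스팸인가요?",
        }],
        "max_tokens": 128,
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```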

---

## Software

Training was conducted using the **Axolotl framework**, a flexible and efficient fine-tuning system designed for large language models.

Axolotl enables configuration and execution of full fine-tuning, LoRA, and DPO pipelines through simple YAML-based workflows. It integrates with PyTorch and Hugging Face Transformers, and supports distributed strategies such as FSDP and DeepSpeed for multi-GPU training.

The framework lets researchers define training parameters, datasets, and model behaviors declaratively, reducing boilerplate and keeping results reproducible across setups.

**Key Features Used:**

- QLoRA for parameter-efficient fine-tuning
- 4-bit quantization during training
- Flash Attention for faster training
- Gradient checkpointing for memory efficiency
- Alpaca dataset format support

---

## Citation

```bibtex
@misc{Devocean-06/Spam_Filter-gemma,
  author    = { {SK Devocean-06 On-device LLM} },
  title     = { Spam filter \& XAI },
  year      = 2025,
  url       = { https://huggingface.co/Devocean-06/Spam_Filter-gemma },
  publisher = { Hugging Face }
}
```

---

## License

This model is released under the Gemma license. Please refer to the original [Gemma license](https://ai.google.dev/gemma/terms) for usage terms and conditions.