---
language:
- en
- ru
- zh
- de
- es
- fr
- ja
- it
- pt
- ar
tags:
- text-classification
- spam-detection
- content-filtering
- security
- nlp
- efficiency
license: apache-2.0
base_model: FacebookAI/xlm-roberta-base
metrics:
- accuracy
- latency
library_name: transformers
---

# 🦅 Nickup Swallow (v2) - Optimized Edition

> **"Focused Filtering for Efficient Deployment."**

**Nickup Swallow v2** is a refined, optimized version of our multilingual text classification model. While many classification models exist, V2 focuses specifically on **reducing memory footprint and inference latency** for production environments where resource allocation is critical.

This model is ideal as a robust **Gatekeeper**: it filters aggressive spam, promotional content, and digital junk before data reaches larger Language Models (LLMs).
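
As a rough illustration of this gatekeeper pattern, the sketch below pre-filters incoming text before an LLM call. The repository path, the `call_llm` helper, and the exact label strings are assumptions for the example (class 0 is the USELESS/spam class, which may surface as `LABEL_0` depending on the checkpoint config), not a fixed API.

```python
from transformers import pipeline

# Assumed repository path; swap in the checkpoint you actually deploy.
gatekeeper = pipeline("text-classification", model="NickupAI/Nickup-Swallow-v2")

def call_llm(text: str) -> str:
    """Placeholder for whatever downstream LLM client you use."""
    return f"(LLM answer for: {text[:40]}...)"

def answer_if_useful(text: str, threshold: float = 0.90) -> str:
    top = gatekeeper(text, truncation=True)[0]
    # Block only when the gatekeeper is highly confident the text is junk
    # (class 0, which may be labeled "USELESS" or "LABEL_0" depending on config).
    if top["label"] in ("USELESS", "LABEL_0") and top["score"] > threshold:
        return "⛔ Filtered before reaching the LLM"
    return call_llm(text)

print(answer_if_useful("BUY CRYPTO NOW! Click this link to get rich: https://scam-link.net"))
print(answer_if_useful("The Sun is roughly 330,000 times more massive than the Earth."))
```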

## ✨ Key Advantages

* **📏 Resource Reduction:** Achieves a **50% reduction in model size** (270M parameters) compared to the original V1 (550M).
* **🌍 Multilingual Coverage:** Built on the strong multilingual foundation of the `XLM-RoBERTa-Base` architecture.
* **🎯 Enhanced Robustness:** Training led to significant functional improvements, in particular high confidence on verifiable spam while maintaining stable judgment on ambiguous content.
* **⏱️ Lower Latency:** The compact size yields faster inference on standard CPU and mobile hardware (a rough timing sketch follows this list).
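
To sanity-check the latency claim on your own hardware, here is a minimal CPU timing sketch. The repository path is an assumption (see the Usage section below), and absolute numbers will vary with hardware, sequence length, and library versions.

```python
import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "NickupAI/Nickup-Swallow-v2"  # assumed repository path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

text = "The Sun is roughly 330,000 times more massive than the Earth."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

runs = 50
with torch.no_grad():
    model(**inputs)  # warm-up pass to exclude one-time setup costs
    start = time.perf_counter()
    for _ in range(runs):
        model(**inputs)
elapsed_ms = (time.perf_counter() - start) / runs * 1000
print(f"Mean CPU latency per request: {elapsed_ms:.1f} ms")
```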

## 📊 Performance Comparison

| Metric | V1 (Large) | **V2 (Optimized)** | Notes |
| :--- | :---: | :---: | :--- |
| **Model Size** | 550M params | **270M params** | Substantially reduced memory requirement. |
| **Accuracy (Est.)** | 89.32% | **~90.5%** | Achieved comparable or better accuracy on the downstream task. |
| **Base Architecture** | XLM-RoBERTa-Large | **XLM-RoBERTa-Base** | |

## 🧪 Comparative Analysis (Functionality Check)

We compare V2's performance against V1 on critical filtering cases (a sketch for reproducing these checks follows the table):

| Input Text | V1 Verdict (550M) | **V2 Verdict (270M)** | V2 USELESS Confidence | Functional Result |
| :--- | :---: | :---: | :---: | :--- |
| *"Срочно! Уникальный товар: https://tinyurl.com/sale_forever..."* ("Urgent! Unique product: ...") | LABEL_0 (0.25%) | **🗑️ USELESS** | **99.51%** | **V2 Superiority:** Near-perfect confidence on malicious spam. |
| *"98523498230578509375023957029578239057239057"* | LABEL_0 (1.30%) | **🗑️ USELESS** | **98.56%** | Correctly flags raw digital noise as high-priority junk. |
| *"Привет, как дела? Что ешь?"* ("Hi, how are you? What are you eating?") | LABEL_1 (65.45%) | **🗑️ USELESS** | **86.34%** | **Pragmatic Filtering:** Correctly categorizes conversational filler as non-factual (USELESS). |
| *"Солнце в 330 тысяч раз массивнее Земли..."* ("The Sun is 330,000 times more massive than the Earth...") | LABEL_1 (98.99%) | **✅ USEFUL** | **99.77%** | Both models confidently preserve valuable facts. |

## 💻 Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch.nn.functional as F
import torch

# Load from Hugging Face
model_name = "NickupAI/Nickup-Swallow-v2"  # Recommended path

# Load the model and tokenizer (V2 uses clear labels: 0=USELESS, 1=USEFUL)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()


def classify(text, threshold=0.90):
    """Classifies text and returns verdict based on a defined confidence threshold."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
        probs = F.softmax(outputs.logits, dim=-1)

    # Label 0 = USELESS/Spam (the target class for filtering)
    useless_prob = probs[0][0].item()
    useful_prob = probs[0][1].item()

    # Applying the pragmatic filtering threshold (90% confidence required to block)
    if useless_prob > threshold:
        return f"⛔ Blocked (Useless Confidence: {useless_prob:.2%})"
    else:
        return f"✅ Allowed (Useful Confidence: {useful_prob:.2%})"


# Example usage
text_spam = "BUY CRYPTO NOW! Click this link to get rich: https://scam-link.net"
text_fact = "The most popular Linux distribution used for servers is generally Ubuntu or CentOS."

print(classify(text_spam))
print(classify(text_fact))
```
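
For higher throughput, the same threshold logic can be applied to many texts in a single forward pass. The sketch below is a minimal batched variant that reuses `tokenizer`, `model`, `device`, `F`, and the example texts defined in the snippet above; the padding strategy and batch contents are only illustrative.

```python
def classify_batch(texts, threshold=0.90):
    """Scores a list of texts in one forward pass and applies the same blocking threshold."""
    inputs = tokenizer(
        texts, return_tensors="pt", truncation=True, padding=True, max_length=512
    ).to(device)
    with torch.no_grad():
        probs = F.softmax(model(**inputs).logits, dim=-1)

    verdicts = []
    for text, (useless_prob, useful_prob) in zip(texts, probs.tolist()):
        if useless_prob > threshold:
            verdicts.append(f"⛔ Blocked ({useless_prob:.2%}) - {text[:40]}")
        else:
            verdicts.append(f"✅ Allowed ({useful_prob:.2%}) - {text[:40]}")
    return verdicts


for verdict in classify_batch([text_spam, text_fact]):
    print(verdict)
```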