---
language:
- en
- ru
- zh
- de
- es
- fr
- ja
- it
- pt
- ar
tags:
- text-classification
- spam-detection
- content-filtering
- security
- nlp
- efficiency
license: apache-2.0
base_model: FacebookAI/xlm-roberta-base
metrics:
- accuracy
- latency
library_name: transformers
---
# 🦅 Nickup Swallow (v2) - Optimized Edition
> **"Focused Filtering for Efficient Deployment."**
**Nickup Swallow v2** is a refined, optimized version of our multilingual text classification model. While many classification models exist, V2 focuses specifically on **reducing memory footprint and inference latency** for production environments where resource allocation is critical.
This model is ideal as a robust **Gatekeeper** that filters aggressive spam, promotional content, and digital junk before data reaches larger Language Models (LLMs).
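The Gatekeeper pattern can be sketched in plain Python: a cheap classifier screens each incoming text, and only texts it does not flag are forwarded to the expensive LLM. The `classifier` and `llm_call` callables below are stand-in stubs, not part of this model's API; swap in the real `classify` function from the Usage section.

```python
# Minimal sketch of the Gatekeeper pattern: a cheap classifier screens
# incoming texts and only non-spam texts reach the expensive LLM.
# `classifier` and `llm_call` are illustrative placeholders.

def gatekeep(texts, classifier, llm_call, threshold=0.90):
    """Forward a text to the LLM only if its spam probability is below threshold."""
    results = []
    for text in texts:
        useless_prob = classifier(text)  # probability the text is spam/junk
        if useless_prob > threshold:
            results.append((text, "blocked"))
        else:
            results.append((text, llm_call(text)))
    return results

# Stub components, for illustration only.
fake_classifier = lambda t: 0.99 if "http" in t else 0.05
fake_llm = lambda t: f"LLM answer for: {t[:20]}"

out = gatekeep(["BUY NOW http://spam-link.net", "What is the capital of France?"],
               fake_classifier, fake_llm)
# out[0] is blocked; out[1] is passed through to the (fake) LLM
```

In production the stub classifier would be replaced by a call into this model, and `llm_call` by your downstream LLM endpoint.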
## ✨ Key Advantages
* **📏 Resource Reduction:** Roughly **50% smaller** (270M parameters) than the original V1 (550M), substantially lowering memory requirements.
* **🌍 Multilingual Coverage:** Based on the strong, multilingual foundation of the `XLM-RoBERTa-Base` architecture.
* **🎯 Enhanced Robustness:** The training process led to significant functional improvements, particularly in achieving high confidence on verifiable spam while maintaining stable judgment on ambiguous content.
* **⏱️ Lower Latency:** Faster inference on standard CPU and mobile hardware thanks to its compact size.
## 📊 Performance Comparison
| Metric | V1 (Large) | **V2 (Optimized)** | Notes |
| :--- | :---: | :---: | :--- |
| **Model Size** | 550M params | **270M params** | Substantially reduced memory requirement. |
| **Accuracy (Est.)** | 89.32% | **~90.5%** | Achieved comparable or better accuracy on the downstream task. |
| **Base Architecture** | XLM-RoBERTa-Large | **XLM-RoBERTa-Base** | |
## 🧪 Comparative Analysis (Functionality Check)
We compare V2's performance against V1 on critical filtering cases:
| Input Text | V1 Verdict (550M) | **V2 Verdict (270M)** | V2 Confidence | Functional Result |
| :--- | :---: | :---: | :---: | :--- |
| *"Срочно! Уникальный товар: https://tinyurl.com/sale_forever..."* (RU: "Urgent! Unique product: ...") | LABEL_0 (0.25%) | **🗑️ USELESS** | **99.51%** | **V2 Superiority:** Near-perfect confidence on malicious spam. |
| *"98523498230578509375023957029578239057239057"* | LABEL_0 (1.30%) | **🗑️ USELESS** | **98.56%** | Correctly flags raw digital noise as high-priority junk. |
| *"Привет, как дела? Что ешь?"* (RU: "Hi, how are you? What are you eating?") | LABEL_1 (65.45%) | **🗑️ USELESS** | **86.34%** | **Pragmatic Filtering:** Correctly categorizes conversational filler as non-factual (USELESS). |
| *"Солнце в 330 тысяч раз массивнее Земли..."* (RU: "The Sun is 330,000 times more massive than Earth...") | LABEL_1 (98.99%) | **✅ USEFUL** | **99.77%** | Both models confidently preserve valuable facts. |
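The verdicts above come from applying a softmax over the model's two output logits and thresholding the USELESS probability. A minimal sketch of that mapping (the logit values below are illustrative, not outputs of the real model):

```python
# Sketch: how two raw logits map to the USELESS/USEFUL verdicts.
# Index 0 = USELESS (spam/junk), index 1 = USEFUL; logits are made up.
import math

def verdict(logits, threshold=0.90):
    """Softmax over [useless, useful] logits, then threshold the USELESS probability."""
    exps = [math.exp(x) for x in logits]
    useless_prob = exps[0] / sum(exps)
    return "USELESS" if useless_prob > threshold else "USEFUL"

print(verdict([5.3, -0.1]))  # spam-like logits  -> "USELESS"
print(verdict([-2.0, 4.0]))  # useful-like logits -> "USEFUL"
```

The 90% threshold is the same "pragmatic filtering" cutoff used in the Usage section below; raising it makes the filter more conservative about blocking.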
## 💻 Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch.nn.functional as F
import torch
# Load from Hugging Face
model_name = "NickupAI/Nickup-Swallow-v2" # Recommended path
# Load the model and tokenizer (V2 uses clear labels: 0=USELESS, 1=USEFUL)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()
def classify(text, threshold=0.90):
    """Classify text and return a verdict based on a confidence threshold."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=-1)

    # Label 0 = USELESS/spam (the target class for filtering)
    useless_prob = probs[0][0].item()
    useful_prob = probs[0][1].item()

    # Pragmatic filtering threshold: 90% confidence required to block
    if useless_prob > threshold:
        return f"⛔ Blocked (Useless Confidence: {useless_prob:.2%})"
    else:
        return f"✅ Allowed (Useful Confidence: {useful_prob:.2%})"
# Example usage
text_spam = "BUY CRYPTO NOW! Click this link to get rich: https://scam-link.net"
text_fact = "The most popular Linux distribution used for servers is generally Ubuntu or CentOS."
print(classify(text_spam))
print(classify(text_fact))
```
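For bulk filtering, batching inputs amortizes per-call tokenizer and model overhead. A minimal chunking helper (pure Python) that you could pair with the `classify` function above; to run a whole chunk through the model at once, pass the chunk to the tokenizer with `padding=True` alongside `truncation=True`:

```python
# Simple batching helper for bulk filtering. Pairing it with the model
# (e.g. tokenizer(chunk, padding=True, truncation=True, ...)) is left to
# the caller; this sketch only shows the chunking.

def batched(items, batch_size):
    """Yield successive fixed-size chunks of a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

texts = [f"message {i}" for i in range(10)]
chunks = list(batched(texts, 4))
# 10 texts with batch_size=4 -> chunks of sizes 4, 4, 2
```

Batch size is a latency/throughput trade-off: larger batches improve throughput on GPU, while on CPU or mobile hardware smaller batches usually keep per-request latency low.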