---
language:
- en
- ru
- zh
- de
- es
- fr
- ja
- it
- pt
- ar
tags:
- text-classification
- spam-detection
- content-filtering
- security
- nlp
- efficiency
license: apache-2.0
base_model: FacebookAI/xlm-roberta-base
metrics:
- accuracy
- latency
library_name: transformers
---

# 🦅 Nickup Swallow (v2) - Optimized Edition

> **"Focused Filtering for Efficient Deployment."**

**Nickup Swallow v2** is a refined, optimized version of our multilingual text classification model. While many classification models exist, v2 focuses specifically on **reducing memory footprint and inference latency** for production environments where resource allocation is critical.

This model is designed to act as a robust **Gatekeeper**, filtering aggressive spam, promotional content, and digital junk before data reaches larger Language Models (LLMs).

## ✨ Key Advantages

* **📏 Resource Reduction:** A **~50% reduction in model size** (270M parameters vs. 550M in V1).
* **🌍 Multilingual Coverage:** Built on the strong multilingual foundation of the `XLM-RoBERTa-Base` architecture.
* **🎯 Enhanced Robustness:** Training yielded high confidence on verifiable spam while maintaining stable judgment on ambiguous content.
* **⏱️ Lower Latency:** Faster inference on standard CPU and mobile hardware thanks to the compact size.

## 📊 Performance Comparison

| Metric | V1 (Large) | **V2 (Optimized)** | Notes |
| :--- | :---: | :---: | :--- |
| **Model Size** | 550M params | **270M params** | Substantially reduced memory requirement. |
| **Accuracy (est.)** | 89.32% | **~90.5%** | Comparable or better accuracy on the downstream task. |
| **Base Architecture** | XLM-RoBERTa-Large | **XLM-RoBERTa-Base** | |

## 🧪 Comparative Analysis (Functionality Check)

We compare V2's performance against V1 on critical filtering cases (English glosses of the Russian inputs are given in parentheses):

| Input Text | V1 Verdict (550M) | **V2 Verdict (270M)** | V2 USELESS Confidence | Functional Result |
| :--- | :---: | :---: | :---: | :--- |
| *"Срочно! Уникальный товар: https://tinyurl.com/sale_forever..."* ("Urgent! Unique product: ...") | LABEL_0 (0.25%) | **🗑️ USELESS** | **99.51%** | **V2 Superiority:** Near-perfect confidence on malicious spam. |
| *"98523498230578509375023957029578239057239057"* | LABEL_0 (1.30%) | **🗑️ USELESS** | **98.56%** | Correctly flags raw digital noise as high-priority junk. |
| *"Привет, как дела? Что ешь?"* ("Hi, how are you? What are you eating?") | LABEL_1 (65.45%) | **🗑️ USELESS** | **86.34%** | **Pragmatic Filtering:** Correctly categorizes conversational filler as non-factual (USELESS). |
| *"Солнце в 330 тысяч раз массивнее Земли..."* ("The Sun is 330,000 times more massive than Earth...") | LABEL_1 (98.99%) | **✅ USEFUL** | **99.77%** | Both models confidently preserve valuable facts. |
## 💻 Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load from Hugging Face
model_name = "NickupAI/Nickup-Swallow-v2"  # Recommended path

# Load the model and tokenizer (V2 uses clear labels: 0=USELESS, 1=USEFUL)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

def classify(text, threshold=0.90):
    """Classify text and return a verdict based on a confidence threshold."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=-1)

    # Label 0 = USELESS/Spam (the target class for filtering)
    useless_prob = probs[0][0].item()
    useful_prob = probs[0][1].item()

    # Pragmatic filtering threshold: 90% confidence required to block
    if useless_prob > threshold:
        return f"⛔ Blocked (Useless Confidence: {useless_prob:.2%})"
    return f"✅ Allowed (Useful Confidence: {useful_prob:.2%})"

# Example usage
text_spam = "BUY CRYPTO NOW! Click this link to get rich: https://scam-link.net"
text_fact = "The most popular Linux distributions for servers are generally Ubuntu or CentOS."

print(classify(text_spam))
print(classify(text_fact))
```
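In a gatekeeper deployment you typically score many texts at once and route only the survivors to the downstream LLM. The sketch below shows one way to partition a batch given per-text USELESS probabilities; it is model-agnostic, and the function name `partition_by_gatekeeper` is illustrative, not part of this model's API. The probabilities would come from the softmax over the model's logits as in the snippet above.

```python
def partition_by_gatekeeper(texts, useless_probs, threshold=0.90):
    """Split inputs into (allowed, blocked) lists.

    texts         -- raw input strings
    useless_probs -- P(label 0 = USELESS) for each text, e.g. taken from
                     the softmax over the classifier's logits
    threshold     -- confidence required to block (0.90 per this card);
                     ambiguous texts fall through to 'allowed' rather
                     than being silently dropped
    """
    allowed, blocked = [], []
    for text, p in zip(texts, useless_probs):
        (blocked if p > threshold else allowed).append(text)
    return allowed, blocked

# Example with hypothetical probabilities:
texts = [
    "BUY CRYPTO NOW! Click this link to get rich: https://scam-link.net",
    "The Sun is about 330,000 times more massive than Earth.",
]
allowed, blocked = partition_by_gatekeeper(texts, [0.995, 0.002])
```

Keeping the blocking threshold high (0.90) reflects the card's pragmatic-filtering stance: only confidently verifiable junk is dropped before it ever reaches the LLM.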