---
language:
- en
- ru
- zh
- de
- es
- fr
- ja
- it
- pt
- ar
tags:
- text-classification
- spam-detection
- content-filtering
- security
- nlp
license: apache-2.0
base_model: FacebookAI/xlm-roberta-large
metrics:
- accuracy
library_name: transformers
---

# 🦅 Nickup Swallow (v1)

> **"Swallows spam, leaves the essence."**

**Nickup Swallow** is a high-performance, multilingual text classification model engineered to act as a **gatekeeper** for modern AI search engines and browsers. It filters out aggressive spam, SEO content farms, adult content, and scams, ensuring that downstream LLMs (such as GPT or T5) process only high-quality, relevant data.

## ✨ Key Features

* **🌍 True Multilingualism:** Built on `XLM-RoBERTa-Large`.
  * **Verified & tested:** English, Russian, Chinese, German, Spanish, French, Japanese.
  * **Supported:** 100+ additional languages covered by the base architecture.
* **🏎️ Blazing Fast:** Optimized for low-latency inference; ideal for real-time filtering layers.
* **💎 Exclusive Dataset:** Trained on a custom-parsed dataset of **230,000+** search snippets, collected and labeled with knowledge-distillation techniques specifically for this project.
* **🛡️ High-Recall Philosophy:** Tuned to be a strict filter against "hard spam" (casinos, malware, adult content) while remaining lenient enough to preserve valuable information.

## 📊 Model Performance

| Metric | Value | Notes |
| :--- | :--- | :--- |
| **Accuracy** | **89.32%** | On a strict, balanced validation set |
| **Training Time** | ~3 hours | NVIDIA T4 (free Google Colab tier) |
| **Base Model** | XLM-RoBERTa-Large | 550M parameters |

## 🧪 Examples (Real-World Tests)

The model confidently distinguishes academic and technical content from low-quality spam.

| Input Text | Language | Verdict | Confidence |
| :--- | :---: | :---: | :---: |
| *"Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability."* | EN 🇺🇸 | **✅ Useful** | **99.9%** |
| *"История правления Петра I: краткая биография и реформы"* ("The reign of Peter I: a brief biography and his reforms") | RU 🇷🇺 | **✅ Useful** | **97.4%** |
| *"Die Relativitätstheorie beschäftigt sich mit der Struktur von Raum und Zeit."* ("The theory of relativity deals with the structure of space and time.") | DE 🇩🇪 | **✅ Useful** | **99.7%** |
| *"人工智能是计算机科学的一个分支。"* ("Artificial intelligence is a branch of computer science.") | ZH 🇨🇳 | **✅ Useful** | **99.6%** |
| *"BUY VIAGRA!!! BEST CASINO 100% FREE SPINS CLICK HERE"* | EN 🇺🇸 | **🗑️ Spam** | **99.4%** |
| *"СКАЧАТЬ БЕСПЛАТНО БЕЗ СМС РЕГИСТРАЦИИ КЛЮЧИ АКТИВАЦИИ"* ("DOWNLOAD FREE NO SMS NO REGISTRATION ACTIVATION KEYS") | RU 🇷🇺 | **🗑️ Spam** | **97.5%** |
| *"ALARGA TU PENE 5 CM EN UNA SEMANA"* ("ENLARGE YOUR PENIS 5 CM IN ONE WEEK") | ES 🇪🇸 | **🗑️ Spam** | **92.3%** |

## 💻 Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load from Hugging Face
model_name = "Nickup-Swallow-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def classify(text):
    """Return the probability that `text` is useful (label 1)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=-1)
    # Label 0 = Spam, Label 1 = Useful
    return probs[0][1].item()

# Try it out
text = "Download free cracked software no virus 2024"
score = classify(text)

if score < 0.15:  # Lower the threshold to let more content through; raise it to filter harder
    print(f"⛔ Blocked (Confidence: {1 - score:.2%})")
else:
    print(f"✅ Allowed (Confidence: {score:.2%})")
```
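The gatekeeping decision itself is independent of the model: it is just a softmax over two logits plus a threshold on the "useful" probability. Below is a minimal, self-contained sketch of that filtering layer with the classifier stubbed out by fixed logits; the `gatekeep`, `useful_probability`, and `stub_score` helpers are illustrative names, not part of the model's API. In practice you would replace the stub with `classify` from the snippet above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def useful_probability(logits):
    """Label 0 = Spam, Label 1 = Useful (the model card's convention)."""
    return softmax(logits)[1]

def gatekeep(snippets, score_fn, threshold=0.15):
    """Keep only snippets whose 'useful' probability clears the threshold.

    Lowering `threshold` admits more content (higher recall for useful
    pages); raising it makes the spam filter stricter.
    """
    return [s for s in snippets if score_fn(s) >= threshold]

# Stub standing in for the real model: maps text to hand-picked logits.
FAKE_LOGITS = {
    "Python is a programming language.": [-2.0, 3.0],    # clearly useful
    "BUY VIAGRA!!! FREE SPINS CLICK HERE": [4.0, -3.0],  # clearly spam
}

def stub_score(text):
    return useful_probability(FAKE_LOGITS[text])

kept = gatekeep(list(FAKE_LOGITS), stub_score)
print(kept)  # only the useful snippet survives the filter
```

Because the threshold is applied outside the model, it can be tuned per deployment (e.g., stricter for child-safe search, looser for research crawls) without retraining.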