---
language:
- en
- ru
- zh
- de
- es
- fr
- ja
- it
- pt
- ar
tags:
- text-classification
- spam-detection
- content-filtering
- security
- nlp
- efficiency
license: apache-2.0
base_model: FacebookAI/xlm-roberta-base
metrics:
- accuracy
- latency
library_name: transformers
---

# 🦅 Nickup Swallow (v2) - Optimized Edition

> **"Focused Filtering for Efficient Deployment."**

**Nickup Swallow v2** is a refined, optimized version of our multilingual text classification model. While many classification models exist, V2 focuses specifically on **reducing memory footprint and inference latency** for production environments where resource allocation is critical.

This model is ideal as a robust **Gatekeeper**: it filters aggressive spam, promotional content, and digital junk before data reaches larger Language Models (LLMs).
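
As a rough illustration of this gatekeeper pattern, the sketch below pre-filters incoming text before an LLM call. The repository path, the `call_llm` helper, and the exact label strings are assumptions for the example (class 0 is the USELESS/spam class, which may surface as `LABEL_0` depending on the checkpoint config), not a fixed API.

```python
from transformers import pipeline

# Assumed repository path; swap in the checkpoint you actually deploy.
gatekeeper = pipeline("text-classification", model="NickupAI/Nickup-Swallow-v2")

def call_llm(text: str) -> str:
    """Placeholder for whatever downstream LLM client you use."""
    return f"(LLM answer for: {text[:40]}...)"

def answer_if_useful(text: str, threshold: float = 0.90) -> str:
    top = gatekeeper(text, truncation=True)[0]
    # Block only when the gatekeeper is highly confident the text is junk
    # (class 0, which may be labeled "USELESS" or "LABEL_0" depending on config).
    if top["label"] in ("USELESS", "LABEL_0") and top["score"] > threshold:
        return "⛔ Filtered before reaching the LLM"
    return call_llm(text)

print(answer_if_useful("BUY CRYPTO NOW! Click this link to get rich: https://scam-link.net"))
print(answer_if_useful("The Sun is roughly 330,000 times more massive than the Earth."))
```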

## ✨ Key Advantages

* **📏 Resource Reduction:** Achieves a **50% reduction in model size** (270M parameters) compared to the original V1 (550M).
* **🌍 Multilingual Coverage:** Built on the strong multilingual foundation of the `XLM-RoBERTa-Base` architecture.
* **🎯 Enhanced Robustness:** Training led to significant functional improvements, in particular high confidence on verifiable spam while maintaining stable judgment on ambiguous content.
* **⏱️ Lower Latency:** The compact size yields faster inference on standard CPU and mobile hardware (a rough timing sketch follows this list).
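
To sanity-check the latency claim on your own hardware, here is a minimal CPU timing sketch. The repository path is an assumption (see the Usage section below), and absolute numbers will vary with hardware, sequence length, and library versions.

```python
import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "NickupAI/Nickup-Swallow-v2"  # assumed repository path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

text = "The Sun is roughly 330,000 times more massive than the Earth."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

runs = 50
with torch.no_grad():
    model(**inputs)  # warm-up pass to exclude one-time setup costs
    start = time.perf_counter()
    for _ in range(runs):
        model(**inputs)
elapsed_ms = (time.perf_counter() - start) / runs * 1000
print(f"Mean CPU latency per request: {elapsed_ms:.1f} ms")
```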

## 📊 Performance Comparison

| Metric | V1 (Large) | **V2 (Optimized)** | Notes |
| :--- | :---: | :---: | :--- |
| **Model Size** | 550M params | **270M params** | Substantially reduced memory requirement. |
| **Accuracy (Est.)** | 89.32% | **~90.5%** | Achieved comparable or better accuracy on the downstream task. |
| **Base Architecture** | XLM-RoBERTa-Large | **XLM-RoBERTa-Base** | |

## 🧪 Comparative Analysis (Functionality Check)

We compare V2's performance against V1 on critical filtering cases (a sketch for reproducing these checks follows the table):

| Input Text | V1 Verdict (550M) | **V2 Verdict (270M)** | V2 USELESS Confidence | Functional Result |
| :--- | :---: | :---: | :---: | :--- |
| *"Срочно! Уникальный товар: https://tinyurl.com/sale_forever..."* ("Urgent! Unique product: ...") | LABEL_0 (0.25%) | **🗑️ USELESS** | **99.51%** | **V2 Superiority:** Near-perfect confidence on malicious spam. |
| *"98523498230578509375023957029578239057239057"* | LABEL_0 (1.30%) | **🗑️ USELESS** | **98.56%** | Correctly flags raw digital noise as high-priority junk. |
| *"Привет, как дела? Что ешь?"* ("Hi, how are you? What are you eating?") | LABEL_1 (65.45%) | **🗑️ USELESS** | **86.34%** | **Pragmatic Filtering:** Correctly categorizes conversational filler as non-factual (USELESS). |
| *"Солнце в 330 тысяч раз массивнее Земли..."* ("The Sun is 330,000 times more massive than the Earth...") | LABEL_1 (98.99%) | **✅ USEFUL** | **99.77%** | Both models confidently preserve valuable facts. |

## 💻 Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch.nn.functional as F
import torch

# Load from Hugging Face
model_name = "NickupAI/Nickup-Swallow-v2"  # Recommended path

# Load the model and tokenizer (V2 uses clear labels: 0=USELESS, 1=USEFUL)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()


def classify(text, threshold=0.90):
    """Classifies text and returns verdict based on a defined confidence threshold."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
        probs = F.softmax(outputs.logits, dim=-1)

    # Label 0 = USELESS/Spam (the target class for filtering)
    useless_prob = probs[0][0].item()
    useful_prob = probs[0][1].item()

    # Applying the pragmatic filtering threshold (90% confidence required to block)
    if useless_prob > threshold:
        return f"⛔ Blocked (Useless Confidence: {useless_prob:.2%})"
    else:
        return f"✅ Allowed (Useful Confidence: {useful_prob:.2%})"


# Example usage
text_spam = "BUY CRYPTO NOW! Click this link to get rich: https://scam-link.net"
text_fact = "The most popular Linux distribution used for servers is generally Ubuntu or CentOS."

print(classify(text_spam))
print(classify(text_fact))
```
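
For higher throughput, the same threshold logic can be applied to many texts in a single forward pass. The sketch below is a minimal batched variant that reuses `tokenizer`, `model`, `device`, `F`, and the example texts defined in the snippet above; the padding strategy and batch contents are only illustrative.

```python
def classify_batch(texts, threshold=0.90):
    """Scores a list of texts in one forward pass and applies the same blocking threshold."""
    inputs = tokenizer(
        texts, return_tensors="pt", truncation=True, padding=True, max_length=512
    ).to(device)
    with torch.no_grad():
        probs = F.softmax(model(**inputs).logits, dim=-1)

    verdicts = []
    for text, (useless_prob, useful_prob) in zip(texts, probs.tolist()):
        if useless_prob > threshold:
            verdicts.append(f"⛔ Blocked ({useless_prob:.2%}) - {text[:40]}")
        else:
            verdicts.append(f"✅ Allowed ({useful_prob:.2%}) - {text[:40]}")
    return verdicts


for verdict in classify_batch([text_spam, text_fact]):
    print(verdict)
```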