---
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- semantic-search
- feature-extraction
- sentence-similarity
- cybersecurity
pipeline_tag: sentence-similarity
---

# MrZaper/LiteModel

**MrZaper/LiteModel** is a lightweight [sentence-transformers](https://www.SBERT.net) model fine-tuned for **semantic search and retrieval of academic articles in cybersecurity**. It maps queries and article phrases into a 384-dimensional dense vector space for similarity search, clustering, and semantic matching.

This model is trained specifically for the journal **Cybersecurity: Education, Science, Technology**.

Website: [https://csecurity.kubg.edu.ua](https://csecurity.kubg.edu.ua/index.php/journal)

# What does it do?

Given a query in **English, Ukrainian, or any other language** that Google Translate can handle, the model pipeline:

- Translates the query to English (using Google Translate).
- Encodes the translated query into a dense embedding using Sentence-BERT.
- Computes cosine similarity between the query embedding and the **precomputed article embeddings**.
- Returns the top **unique article codes** with the highest similarity scores.

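
The ranking and de-duplication steps above can be sketched with NumPy alone. This is a minimal illustration, not the repository's code: `top_unique_codes` is a hypothetical helper, the 3-dimensional vectors are toy stand-ins for the 384-dimensional embeddings, and it assumes that one article may contribute several embedded phrases (which is why repeated codes are filtered out).

```python
import numpy as np

def top_unique_codes(query_emb, article_embs, codes, top_n=5):
    """Rank articles by cosine similarity and keep the best hit per code."""
    # Normalize so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    a = article_embs / np.linalg.norm(article_embs, axis=1, keepdims=True)
    sims = a @ q

    results, seen = [], set()
    for idx in np.argsort(sims)[::-1]:  # highest similarity first
        code = codes[idx]
        if code in seen:
            continue  # a higher-scoring phrase from this article was already kept
        seen.add(code)
        results.append((code, float(sims[idx])))
        if len(results) >= top_n:
            break
    return results

# Toy data (assumed for illustration only).
article_embs = np.array([
    [1.0, 0.0, 0.0],  # phrase from article 560
    [0.9, 0.1, 0.0],  # another phrase from article 560
    [0.0, 1.0, 0.0],  # phrase from article 532
])
codes = ["560", "560", "532"]
query_emb = np.array([1.0, 0.0, 0.0])

ranked = top_unique_codes(query_emb, article_embs, codes, top_n=2)
print(ranked)
```

Because candidates are visited in descending order of similarity, the first occurrence of each code is also its best-scoring one, so a single pass with a `seen` set is enough.
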
Returned article codes can be viewed at:

```
https://csecurity.kubg.edu.ua/index.php/journal/article/view/{CODE}
```

For example: `560` → [https://csecurity.kubg.edu.ua/index.php/journal/article/view/560](https://csecurity.kubg.edu.ua/index.php/journal/article/view/560)

---

# Model Files

The repository includes:

- `LiteModel` – SBERT-based semantic encoder
- `sbert_embeddings.npy` – Precomputed embeddings for articles
- `sbert_labels.pkl` – Corresponding article codes (e.g., `560`, `532`)

---

# Usage (Sentence-Transformers)

Install the required packages:

```bash
pip install -U sentence-transformers deep-translator huggingface-hub scikit-learn
```

Example usage:

```python
import os
import pickle

import numpy as np
from deep_translator import GoogleTranslator
from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Download the model repository from Hugging Face
model_name = 'MrZaper/LiteModel'
model_dir = snapshot_download(repo_id=model_name)

# Load SBERT model
sbert_model = SentenceTransformer(model_dir)

# Load precomputed article embeddings
embeddings = np.load(os.path.join(model_dir, "sbert_embeddings.npy"))

# Load article codes (labels); only unpickle files from sources you trust
with open(os.path.join(model_dir, "sbert_labels.pkl"), 'rb') as f:
    labels = pickle.load(f)


def preprocess_query(query: str) -> str:
    """Translate the query to English using Google Translate."""
    try:
        return GoogleTranslator(source="auto", target="en").translate(query)
    except Exception as e:
        # Fall back to the untranslated query if translation fails
        print(f"Translation error: {e}")
        return query


def predict_semantic(query, model, embeddings, labels, top_n=5):
    """Find top-N most semantically similar unique article codes."""
    query_emb = model.encode([preprocess_query(query)])
    similarities = cosine_similarity(query_emb, embeddings)[0]

    seen_keys = set()
    results = []

    # Walk through the indices in order of descending similarity
    sorted_indices = np.argsort(similarities)[::-1]
    for idx in sorted_indices:
        label = labels[idx]
        sim = similarities[idx]
        # Keep only the best-scoring entry for each article code
        if label not in seen_keys:
            seen_keys.add(label)
            results.append({
                "article_code": label,
                "similarity": float(sim)
            })
            if len(results) >= top_n:
                break
    return results


# Example query
query = "sql injection in websites"
results = predict_semantic(query, sbert_model, embeddings, labels)

for res in results:
    print(f"📄 Article {res['article_code']} – similarity: {res['similarity'] * 100:.2f}%")
```

# Example Output

📄 Article 560 – similarity: 92.15%

📄 Article 532 – similarity: 89.34%

📄 Article 475 – similarity: 85.22%

Corresponding links:

```
https://csecurity.kubg.edu.ua/index.php/journal/article/view/560
https://csecurity.kubg.edu.ua/index.php/journal/article/view/532
https://csecurity.kubg.edu.ua/index.php/journal/article/view/475
```
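
To turn returned codes into links like the ones above, a small helper along these lines may be convenient. It is only a sketch: `article_url` and `ARTICLE_URL_TEMPLATE` are hypothetical names that are not part of the repository, and the only thing taken from this card is the URL pattern itself.

```python
# URL pattern documented in this model card; {code} is the article code.
ARTICLE_URL_TEMPLATE = "https://csecurity.kubg.edu.ua/index.php/journal/article/view/{code}"

def article_url(code) -> str:
    """Build the journal URL for an article code (accepts int or str)."""
    return ARTICLE_URL_TEMPLATE.format(code=str(code).strip())

# Example with result dictionaries shaped like those returned by predict_semantic
example_results = [
    {"article_code": 560, "similarity": 0.9215},
    {"article_code": "532", "similarity": 0.8934},
]
links = [article_url(res["article_code"]) for res in example_results]
for link in links:
    print(link)
```

Converting the code with `str()` keeps the helper usable whether the labels stored in `sbert_labels.pkl` are integers or strings.
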