@@ -3,98 +3,139 @@ license: apache-2.0
 library_name: sentence-transformers
 tags:
 - sentence-transformers
 - feature-extraction
 - sentence-similarity
-- transformers
 pipeline_tag: sentence-similarity
 ---
-# sentence-transformers/paraphrase-MiniLM-L6-v2
-This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.
-## Usage (Sentence-Transformers)
-Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
-```
-pip install -U sentence-transformers
-```
-Then you can use the model like this:
-```python
-from sentence_transformers import SentenceTransformer
-sentences = ["This is an example sentence", "Each sentence is converted"]
-model = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L6-v2')
-embeddings = model.encode(sentences)
-print(embeddings)
 ```
-## Usage (HuggingFace Transformers)
-Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
-```python
-from transformers import AutoTokenizer, AutoModel
-import torch
-#Mean Pooling - Take attention mask into account for correct averaging
-def mean_pooling(model_output, attention_mask):
-    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
-    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
-# Sentences we want sentence embeddings for
-sentences = ['This is an example sentence', 'Each sentence is converted']
-# Load model from HuggingFace Hub
-tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-MiniLM-L6-v2')
-model = AutoModel.from_pretrained('sentence-transformers/paraphrase-MiniLM-L6-v2')
-# Tokenize sentences
-encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-# Compute token embeddings
-with torch.no_grad():
-    model_output = model(**encoded_input)
-# Perform pooling. In this case, max pooling.
-sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
-print("Sentence embeddings:")
-print(sentence_embeddings)
 ```
-## Full Model Architecture
-```
-SentenceTransformer(
-  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
-  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
-)
 ```
-## Citing & Authors
-This model was trained by [sentence-transformers](https://www.sbert.net/).
-If you find this model helpful, feel free to cite our publication [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084):
-```bibtex
-@inproceedings{reimers-2019-sentence-bert,
-    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
-    author = "Reimers, Nils and Gurevych, Iryna",
-    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
-    month = "11",
-    year = "2019",
-    publisher = "Association for Computational Linguistics",
-    url = "http://arxiv.org/abs/1908.10084",
-}
-```

 library_name: sentence-transformers
 tags:
 - sentence-transformers
+- semantic-search
 - feature-extraction
 - sentence-similarity
+- cybersecurity
 pipeline_tag: sentence-similarity
 ---
+# MrZaper/LiteModel
+**MrZaper/LiteModel** is a lightweight [sentence-transformers](https://www.SBERT.net) model fine-tuned for **semantic search and retrieval of academic articles in cybersecurity**.
+It maps queries and article phrases into a 384-dimensional dense vector space for similarity search, clustering, and semantic matching.
+This model is trained specifically for the journal **Cybersecurity: Education, Science, Technology**.
+Website: [https://csecurity.kubg.edu.ua](https://csecurity.kubg.edu.ua/index.php/journal)
+# What does it do?
+Given a query in **English, Ukrainian, or any other language that Google Translate can detect**, the search pipeline built around the model:
+- Translates the query to English (using Google Translate).
+- Encodes the query into a dense embedding using Sentence-BERT.
+- Computes cosine similarity between the query embedding and **precomputed article embeddings**.
+- Returns the top-N **unique article codes** (5 by default) with the highest similarity scores.
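The ranking and de-duplication steps listed above can be sketched with NumPy alone. This is a minimal illustration, not the repository's code: the helper name `rank_unique_codes` is hypothetical, the toy vectors are 3-dimensional stand-ins for the real 384-dimensional embeddings, and the article codes are assumed to be strings.

```python
import numpy as np


def rank_unique_codes(query_emb, embeddings, labels, top_n=5):
    """Return up to top_n (code, cosine similarity) pairs, one per article code."""
    # Normalize both sides so that a dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    m = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarities = m @ q

    results, seen = [], set()
    # Walk the rows from most to least similar, keeping the first hit per code.
    for idx in np.argsort(similarities)[::-1]:
        code = labels[idx]
        if code in seen:
            continue
        seen.add(code)
        results.append((code, float(similarities[idx])))
        if len(results) >= top_n:
            break
    return results


# Toy data (assumption): two phrases from article "560", one from article "532".
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
])
toy_labels = ["560", "560", "532"]
toy_query = np.array([1.0, 0.0, 0.0])

ranked = rank_unique_codes(toy_query, toy_embeddings, toy_labels, top_n=2)
print(ranked)
```

Because each article can contribute several phrase embeddings, keeping only the first (highest-scoring) row per code is what makes the returned codes unique.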
+Returned article codes can be viewed at:
+```
+https://csecurity.kubg.edu.ua/index.php/journal/article/view/{CODE}
 ```
+For example:
+`560` → [https://csecurity.kubg.edu.ua/index.php/journal/article/view/560](https://csecurity.kubg.edu.ua/index.php/journal/article/view/560)
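The code-to-link mapping shown above can be wrapped in a small helper. This is only a sketch: `article_url` and `BASE_URL` are illustrative names, not part of the repository.

```python
# URL pattern taken from the model card; the names below are hypothetical.
BASE_URL = "https://csecurity.kubg.edu.ua/index.php/journal/article/view"


def article_url(code):
    """Build the journal link for an article code given as int or str."""
    return f"{BASE_URL}/{code}"


print(article_url(560))
# https://csecurity.kubg.edu.ua/index.php/journal/article/view/560
```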
+---
+# Model Files
+The repository includes:
+- `LiteModel` – SBERT-based semantic encoder
+- `sbert_embeddings.npy` – Precomputed embeddings for articles
+- `sbert_labels.pkl` – Corresponding article codes (e.g., `560`, `532`)
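Since `sbert_labels.pkl` holds the codes corresponding to the rows of `sbert_embeddings.npy`, a quick consistency check after loading can catch a mismatched download. The sketch below uses synthetic arrays in place of the real files; `check_alignment` is a hypothetical helper, and the string codes are an assumption.

```python
import numpy as np


def check_alignment(embeddings, labels, expected_dim=384):
    """Raise ValueError if the embedding matrix and label list do not line up.

    Returns the number of distinct article codes on success.
    """
    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {embeddings.shape}")
    if embeddings.shape[1] != expected_dim:
        raise ValueError(
            f"expected {expected_dim}-dimensional vectors, got {embeddings.shape[1]}"
        )
    if embeddings.shape[0] != len(labels):
        raise ValueError(
            f"{embeddings.shape[0]} embedding rows but {len(labels)} labels"
        )
    return len(set(labels))


# Synthetic stand-ins (assumption) for the arrays loaded from the repository.
demo_embeddings = np.zeros((4, 384), dtype=np.float32)
demo_labels = ["560", "560", "532", "475"]

n_articles = check_alignment(demo_embeddings, demo_labels)
print(f"{len(demo_labels)} phrases covering {n_articles} unique articles")
```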
+---
+# Usage (Sentence-Transformers)
+Install the required packages:
+```bash
+pip install -U sentence-transformers deep-translator huggingface-hub scikit-learn
+```
+Example usage:
+```python
+from sentence_transformers import SentenceTransformer
+import numpy as np
+import pickle
+from huggingface_hub import snapshot_download
+from deep_translator import GoogleTranslator
+import os
+from sklearn.metrics.pairwise import cosine_similarity
+# Load model and data from Hugging Face
+model_name = 'MrZaper/LiteModel'
+model_dir = snapshot_download(repo_id=model_name)
+# Load SBERT model
+sbert_model = SentenceTransformer(model_dir)
+# Load precomputed article embeddings
+embeddings = np.load(os.path.join(model_dir, "sbert_embeddings.npy"))
+# Load article codes (labels)
+with open(os.path.join(model_dir, "sbert_labels.pkl"), 'rb') as f:
+    labels = pickle.load(f)
+def preprocess_query(query: str) -> str:
+    """Translate the query to English using Google Translate."""
+    try:
+        return GoogleTranslator(source="auto", target="en").translate(query)
+    except Exception as e:
+        print(f"Translation error: {e}")
+        return query
+def predict_semantic(query, model, embeddings, labels, top_n=5):
+    """Find top-N most semantically similar unique article codes."""
+    query_emb = model.encode([preprocess_query(query)])
+    similarities = cosine_similarity(query_emb, embeddings)[0]
+    seen_keys = set()
+    results = []
+    # Sort results by similarity (descending)
+    sorted_indices = np.argsort(similarities)[::-1]
+    for idx in sorted_indices:
+        label = labels[idx]
+        sim = similarities[idx]
+        if label not in seen_keys:
+            seen_keys.add(label)
+            results.append({
+                "article_code": label,
+                "similarity": float(sim)
+            })
+            print(f"📄 Article {label} – similarity: {sim * 100:.2f}%")
+        if len(results) >= top_n:
+            break
+    return results
+# Example query
+query = "sql injection in websites"
+results = predict_semantic(query, sbert_model, embeddings, labels)
+print("\nTop article codes:")
+for res in results:
+    print(f"Article {res['article_code']} – similarity: {res['similarity']*100:.2f}%")
 ```
+# Example Output
+📄 Article 560 – similarity: 92.15%
+📄 Article 532 – similarity: 89.34%
+📄 Article 475 – similarity: 85.22%
+Corresponding links:
+```
+https://csecurity.kubg.edu.ua/index.php/journal/article/view/560
+https://csecurity.kubg.edu.ua/index.php/journal/article/view/532
+https://csecurity.kubg.edu.ua/index.php/journal/article/view/475
 ```