---
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- semantic-search
- feature-extraction
- sentence-similarity
- cybersecurity
pipeline_tag: sentence-similarity
---

# MrZaper/LiteModel

**MrZaper/LiteModel** is a lightweight [sentence-transformers](https://www.SBERT.net) model fine-tuned for **semantic search and retrieval of academic articles in cybersecurity**.
It maps queries and article phrases into a 384-dimensional dense vector space for similarity search, clustering, and semantic matching.

The model is trained specifically for the journal **Cybersecurity: Education, Science, Technology**.

Website: [https://csecurity.kubg.edu.ua](https://csecurity.kubg.edu.ua/index.php/journal)

# What does it do?

Given a query in **English, Ukrainian, or any other language**, the search pipeline built around the model:

- Translates the query to English (using Google Translate).
- Encodes the translated query into a dense embedding using Sentence-BERT.
- Computes the cosine similarity between the query embedding and the **precomputed article embeddings**.
- Returns the top **unique article codes** with the highest similarity scores.

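
The ranking and de-duplication steps can be sketched in plain Python, without loading the model. The vectors and article codes below are made-up toy values, and `cosine` and `top_unique_codes` are hypothetical helper names, not part of the repository. The sketch assumes that several stored embeddings can share one article code, which is why only the best-scoring row per code is kept.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_unique_codes(query_vec, vectors, codes, top_n=2):
    """Rank rows by similarity and keep the best row for each article code."""
    scored = sorted(
        ((cosine(query_vec, vec), code) for vec, code in zip(vectors, codes)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    seen, results = set(), []
    for score, code in scored:
        if code in seen:
            continue  # a better-scoring row for this article was already kept
        seen.add(code)
        results.append((code, score))
        if len(results) == top_n:
            break
    return results


# Toy 3-dimensional stand-ins for the real 384-dimensional embeddings.
vectors = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 1.0, 0.0]]
codes = ["560", "560", "532"]  # two rows belong to the same article

ranked = top_unique_codes([1.0, 0.0, 0.0], vectors, codes)
print(ranked)  # [('560', 1.0), ('532', 0.0)]
```

The second row for article `560` scores about 0.8 but is skipped, so the next distinct article takes the second slot.
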
Returned article codes can be viewed at:

```
https://csecurity.kubg.edu.ua/index.php/journal/article/view/{CODE}
```

For example:

`560` → [https://csecurity.kubg.edu.ua/index.php/journal/article/view/560](https://csecurity.kubg.edu.ua/index.php/journal/article/view/560)

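
Turning a returned code into a link is a simple string substitution. A minimal sketch follows; `BASE_URL` and `article_url` are illustrative names introduced here, and the URL pattern is the one shown above.

```python
BASE_URL = "https://csecurity.kubg.edu.ua/index.php/journal/article/view/"


def article_url(code):
    """Return the journal page for an article code such as 560 or "560"."""
    return f"{BASE_URL}{code}"


for code in (560, 532):
    print(article_url(code))
```
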
---

# Model Files

The repository includes:

- `LiteModel`: SBERT-based semantic encoder
- `sbert_embeddings.npy`: Precomputed embeddings for articles
- `sbert_labels.pkl`: Corresponding article codes (e.g., `560`, `532`)

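
Since codes are looked up by row position, row `i` of `sbert_embeddings.npy` presumably corresponds to entry `i` of `sbert_labels.pkl`. A quick sanity check after loading might look like the sketch below. `check_alignment` is a hypothetical helper, not something shipped in the repository, and the toy lists stand in for the loaded files; the function only relies on `len()`, so it should work on a 2-D NumPy array as well.

```python
def check_alignment(embeddings, labels, dim=384):
    """Raise ValueError if rows and labels disagree; return the row count."""
    if len(embeddings) != len(labels):
        raise ValueError(
            f"{len(embeddings)} embeddings but {len(labels)} labels"
        )
    for i, row in enumerate(embeddings):
        if len(row) != dim:
            raise ValueError(
                f"row {i} has {len(row)} dimensions, expected {dim}"
            )
    return len(embeddings)


# Toy data standing in for the contents of the two files.
toy_embeddings = [[0.0] * 384 for _ in range(3)]
toy_labels = ["560", "532", "475"]

print(check_alignment(toy_embeddings, toy_labels))  # prints 3
```
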
---

# Usage (Sentence-Transformers)

Install the required packages:

```bash
pip install -U sentence-transformers deep-translator huggingface-hub scikit-learn
```

Example usage:

```python
import os
import pickle

import numpy as np
from deep_translator import GoogleTranslator
from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Download the repository (model weights and data files) from Hugging Face
model_name = "MrZaper/LiteModel"
model_dir = snapshot_download(repo_id=model_name)

# Load SBERT model
sbert_model = SentenceTransformer(model_dir)

# Load precomputed article embeddings
embeddings = np.load(os.path.join(model_dir, "sbert_embeddings.npy"))

# Load article codes (labels); labels[i] is the code for embeddings[i]
with open(os.path.join(model_dir, "sbert_labels.pkl"), "rb") as f:
    labels = pickle.load(f)


def preprocess_query(query: str) -> str:
    """Translate the query to English using Google Translate."""
    try:
        return GoogleTranslator(source="auto", target="en").translate(query)
    except Exception as e:
        # Fall back to the untranslated query if translation fails
        print(f"Translation error: {e}")
        return query


def predict_semantic(query, model, embeddings, labels, top_n=5):
    """Find top-N most semantically similar unique article codes."""
    query_emb = model.encode([preprocess_query(query)])
    similarities = cosine_similarity(query_emb, embeddings)[0]

    seen_keys = set()
    results = []

    # Sort results by similarity (descending)
    sorted_indices = np.argsort(similarities)[::-1]

    for idx in sorted_indices:
        label = labels[idx]
        sim = similarities[idx]

        # Keep only the best-scoring entry for each article code
        if label not in seen_keys:
            seen_keys.add(label)
            results.append({
                "article_code": label,
                "similarity": float(sim)
            })

        if len(results) >= top_n:
            break

    return results


# Example query
query = "sql injection in websites"
results = predict_semantic(query, sbert_model, embeddings, labels)

print("\nTop article codes:")
for res in results:
    print(f"Article {res['article_code']} - similarity: {res['similarity']*100:.2f}%")
```

# Example Output

Article 560 - similarity: 92.15%

Article 532 - similarity: 89.34%

Article 475 - similarity: 85.22%

Corresponding links:

```
https://csecurity.kubg.edu.ua/index.php/journal/article/view/560
https://csecurity.kubg.edu.ua/index.php/journal/article/view/532
https://csecurity.kubg.edu.ua/index.php/journal/article/view/475
```