---
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- semantic-search
- feature-extraction
- sentence-similarity
- cybersecurity
pipeline_tag: sentence-similarity
---
# MrZaper/LiteModel
**MrZaper/LiteModel** is a lightweight [sentence-transformers](https://www.SBERT.net) model fine-tuned for **semantic search and retrieval of academic articles in cybersecurity**.
It maps queries and article phrases into a 384-dimensional dense vector space for similarity search, clustering, and semantic matching.
It is trained specifically for the journal **Cybersecurity: Education, Science, Technology**.
Website: [https://csecurity.kubg.edu.ua](https://csecurity.kubg.edu.ua/index.php/journal)
# What does it do?
Given a query in **English, Ukrainian, or another language that Google Translate can detect**, the search pipeline:
- Translates the query to English (using Google Translate).
- Encodes the translated query into a dense embedding with the Sentence-BERT model.
- Computes the cosine similarity between the query embedding and the **precomputed article embeddings**.
- Returns the top **unique article codes** with the highest similarity scores.
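The ranking and de-duplication steps can be sketched on toy data without downloading the model. This is a minimal illustration, not the repository's code: it assumes hand-made 3-dimensional vectors in place of the real 384-dimensional embeddings, and `top_unique_codes` is a hypothetical helper name. It also assumes that one article can own several embedding rows (for example one per phrase), which is why the same code may appear more than once in the raw ranking.

```python
import numpy as np

def top_unique_codes(query_emb, embeddings, labels, top_n=5):
    """Rank rows of `embeddings` by cosine similarity to `query_emb`,
    keeping only the best-scoring row for each label."""
    # Normalise to unit length so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q

    results, seen = [], set()
    for idx in np.argsort(sims)[::-1]:  # indices from highest to lowest score
        code = labels[idx]
        if code in seen:
            continue  # a better-scoring row for this article was already kept
        seen.add(code)
        results.append((code, float(sims[idx])))
        if len(results) >= top_n:
            break
    return results

# Toy data (assumed): two rows belong to article "560", one to "532".
toy_embeddings = np.array([[1.0, 0.0, 0.0],
                           [0.9, 0.1, 0.0],
                           [0.0, 1.0, 0.0]])
toy_labels = ["560", "560", "532"]
toy_query = np.array([1.0, 0.0, 0.0])

ranked = top_unique_codes(toy_query, toy_embeddings, toy_labels, top_n=2)
print(ranked)
```

Article `560` appears once in the result even though two of its rows match the query, because only the first (highest-scoring) row per code is kept.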
Returned article codes can be viewed at:
```
https://csecurity.kubg.edu.ua/index.php/journal/article/view/{CODE}
```
For example:
`560` → [https://csecurity.kubg.edu.ua/index.php/journal/article/view/560](https://csecurity.kubg.edu.ua/index.php/journal/article/view/560)
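Turning a returned code into its article link is plain string formatting. The helper below is a small sketch; `article_url` and `BASE_URL` are hypothetical names introduced here, and the only thing taken from the journal site is the URL pattern shown above.

```python
# URL pattern from the journal site; {code} is the article code.
BASE_URL = "https://csecurity.kubg.edu.ua/index.php/journal/article/view/{code}"

def article_url(code) -> str:
    """Build the article page URL for a returned article code (int or str)."""
    return BASE_URL.format(code=str(code).strip())

print(article_url(560))
```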
---
# Model Files
The repository includes:
- `LiteModel` – SBERT-based semantic encoder
- `sbert_embeddings.npy` – Precomputed embeddings for articles
- `sbert_labels.pkl` – Corresponding article codes (e.g., `560`, `532`)
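The embeddings and labels are only useful together if row `i` of the matrix corresponds to label `i`. A quick sanity check after loading can catch a mismatched pair of files. The sketch below is an assumption about how such a check might look: `check_alignment` is a hypothetical helper, and it runs on zero-filled stand-in data instead of the real files. The expected width of 384 comes from the embedding dimension stated above.

```python
import numpy as np

def check_alignment(embeddings, labels, expected_dim=384):
    """Raise ValueError if the embedding matrix and label list do not line up."""
    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {embeddings.ndim}-D")
    if embeddings.shape[1] != expected_dim:
        raise ValueError(
            f"expected {expected_dim} columns, got {embeddings.shape[1]}"
        )
    if embeddings.shape[0] != len(labels):
        raise ValueError(
            f"{embeddings.shape[0]} embedding rows but {len(labels)} labels"
        )
    return embeddings.shape

# Stand-in data (assumed); replace with the arrays loaded from the repository.
fake_embeddings = np.zeros((3, 384), dtype=np.float32)
fake_labels = ["560", "560", "532"]

shape = check_alignment(fake_embeddings, fake_labels)
print(shape)
```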
---
# Usage (Sentence-Transformers)
Install the required packages:
```bash
pip install -U sentence-transformers deep-translator huggingface-hub scikit-learn
```
Example usage:
```python
from sentence_transformers import SentenceTransformer
import numpy as np
import pickle
from huggingface_hub import snapshot_download
from deep_translator import GoogleTranslator
import os
from sklearn.metrics.pairwise import cosine_similarity

# Load model and data from Hugging Face
model_name = 'MrZaper/LiteModel'
model_dir = snapshot_download(repo_id=model_name)

# Load SBERT model
sbert_model = SentenceTransformer(model_dir)

# Load precomputed article embeddings
embeddings = np.load(os.path.join(model_dir, "sbert_embeddings.npy"))

# Load article codes (labels)
with open(os.path.join(model_dir, "sbert_labels.pkl"), 'rb') as f:
    labels = pickle.load(f)


def preprocess_query(query: str) -> str:
    """Translate the query to English using Google Translate."""
    try:
        return GoogleTranslator(source="auto", target="en").translate(query)
    except Exception as e:
        print(f"Translation error: {e}")
        return query


def predict_semantic(query, model, embeddings, labels, top_n=5):
    """Find top-N most semantically similar unique article codes."""
    query_emb = model.encode([preprocess_query(query)])
    similarities = cosine_similarity(query_emb, embeddings)[0]
    seen_keys = set()
    results = []
    # Sort results by similarity (descending)
    sorted_indices = np.argsort(similarities)[::-1]
    for idx in sorted_indices:
        label = labels[idx]
        sim = similarities[idx]
        if label not in seen_keys:
            seen_keys.add(label)
            results.append({
                "article_code": label,
                "similarity": float(sim)
            })
            print(f"📄 Article {label} – similarity: {sim * 100:.2f}%")
        if len(results) >= top_n:
            break
    return results


# Example query
query = "sql injection in websites"
results = predict_semantic(query, sbert_model, embeddings, labels)
print("\nTop article codes:")
for res in results:
    print(f"Article {res['article_code']} – similarity: {res['similarity']*100:.2f}%")
```
# Example Output
📄 Article 560 – similarity: 92.15%
📄 Article 532 – similarity: 89.34%
📄 Article 475 – similarity: 85.22%
Corresponding links:
```
https://csecurity.kubg.edu.ua/index.php/journal/article/view/560
https://csecurity.kubg.edu.ua/index.php/journal/article/view/532
https://csecurity.kubg.edu.ua/index.php/journal/article/view/475
```