---
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- semantic-search
- feature-extraction
- sentence-similarity
- cybersecurity
pipeline_tag: sentence-similarity
---

# MrZaper/LiteModel

**MrZaper/LiteModel** is a lightweight [sentence-transformers](https://www.SBERT.net) model fine-tuned for **semantic search and retrieval of academic articles in cybersecurity**.  
It maps queries and article phrases into a 384-dimensional dense vector space for similarity search, clustering, and semantic matching.

This model is trained specifically for the journal **Cybersecurity: Education, Science, Technology**.  
Website: [https://csecurity.kubg.edu.ua/index.php/journal](https://csecurity.kubg.edu.ua/index.php/journal)

# What does it do?

Given a query in **English, Ukrainian, or any other language**, the model:

- Translates the query to English (using Google Translate).
- Encodes the query into a dense embedding using Sentence-BERT.
- Computes cosine similarity between the query embedding and **precomputed article embeddings**.
- Returns the top **unique article codes** with highest similarity scores.
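
The ranking step above can be sketched with plain NumPy, using random vectors as stand-ins for the real query and article embeddings (the actual model produces 384-dimensional vectors; the helper name `l2_normalize` is ours, not part of the repository):

```python
import numpy as np

rng = np.random.default_rng(0)
article_embs = rng.normal(size=(100, 384))  # stand-in for precomputed article embeddings
query_emb = rng.normal(size=(1, 384))       # stand-in for an encoded query

def l2_normalize(x):
    """Normalize rows to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Cosine similarity between the query and every article, then keep the top 5
sims = (l2_normalize(query_emb) @ l2_normalize(article_embs).T)[0]
top5 = np.argsort(sims)[::-1][:5]
print(top5, sims[top5])
```

With real data, `article_embs` comes from `sbert_embeddings.npy` and each index in `top5` maps to an article code via `sbert_labels.pkl`, as the full example below shows.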

Returned article codes can be viewed at:

```
https://csecurity.kubg.edu.ua/index.php/journal/article/view/{CODE}
```


For example:

`560` → [https://csecurity.kubg.edu.ua/index.php/journal/article/view/560](https://csecurity.kubg.edu.ua/index.php/journal/article/view/560)
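
Building such a link is a simple string substitution; a minimal helper (the function name `article_url` is ours, not part of the repository):

```python
BASE = "https://csecurity.kubg.edu.ua/index.php/journal/article/view/{code}"

def article_url(code):
    """Return the journal page URL for an article code such as 560."""
    return BASE.format(code=code)

print(article_url(560))
# https://csecurity.kubg.edu.ua/index.php/journal/article/view/560
```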

---

# Model Files

The repository includes:

- `LiteModel` – SBERT-based semantic encoder
- `sbert_embeddings.npy` – Precomputed embeddings for articles
- `sbert_labels.pkl` – Corresponding article codes (e.g., `560`, `532`)

---

# Usage (Sentence-Transformers)

Install the required package:

```bash
pip install -U sentence-transformers deep-translator huggingface-hub scikit-learn
```
Example usage:
```python
from sentence_transformers import SentenceTransformer
import numpy as np
import pickle
from huggingface_hub import snapshot_download
from deep_translator import GoogleTranslator
import os
from sklearn.metrics.pairwise import cosine_similarity

# Load model and data from Hugging Face
model_name = 'MrZaper/LiteModel'
model_dir = snapshot_download(repo_id=model_name)

# Load SBERT model
sbert_model = SentenceTransformer(model_dir)

# Load precomputed article embeddings
embeddings = np.load(os.path.join(model_dir, "sbert_embeddings.npy"))

# Load article codes (labels)
with open(os.path.join(model_dir, "sbert_labels.pkl"), 'rb') as f:
    labels = pickle.load(f)

def preprocess_query(query: str) -> str:
    """Translate the query to English using Google Translate."""
    try:
        return GoogleTranslator(source="auto", target="en").translate(query)
    except Exception as e:
        print(f"Translation error: {e}")
        return query

def predict_semantic(query, model, embeddings, labels, top_n=5):
    """Find top-N most semantically similar unique article codes."""
    query_emb = model.encode([preprocess_query(query)])
    similarities = cosine_similarity(query_emb, embeddings)[0]

    seen_keys = set()
    results = []

    # Sort results by similarity (descending)
    sorted_indices = np.argsort(similarities)[::-1]

    for idx in sorted_indices:
        label = labels[idx]
        sim = similarities[idx]

        if label not in seen_keys:
            seen_keys.add(label)
            results.append({
                "article_code": label,
                "similarity": float(sim)
            })
            print(f"📄 Article {label} – similarity: {sim * 100:.2f}%")

        if len(results) >= top_n:
            break

    return results

# Example query
query = "sql injection in websites"
results = predict_semantic(query, sbert_model, embeddings, labels)

print("\nTop article codes:")
for res in results:
    print(f"Article {res['article_code']} – similarity: {res['similarity']*100:.2f}%")
```

# Example Output

```
📄 Article 560 – similarity: 92.15%
📄 Article 532 – similarity: 89.34%
📄 Article 475 – similarity: 85.22%
```

Corresponding links:
```
https://csecurity.kubg.edu.ua/index.php/journal/article/view/560
https://csecurity.kubg.edu.ua/index.php/journal/article/view/532
https://csecurity.kubg.edu.ua/index.php/journal/article/view/475
```