MrZaper committed on
Commit 604991a · verified · 1 Parent(s): e2f75cf

Update README.md

  1. README.md +106 -65
README.md CHANGED
@@ -3,98 +3,139 @@ license: apache-2.0
  library_name: sentence-transformers
  tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
- - transformers
  pipeline_tag: sentence-similarity
  ---
 
- # sentence-transformers/paraphrase-MiniLM-L6-v2

- This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

- ## Usage (Sentence-Transformers)

- Using this model is easy once you have [sentence-transformers](https://www.SBERT.net) installed:

- ```
- pip install -U sentence-transformers
- ```

- Then you can use the model like this:
-
- ```python
- from sentence_transformers import SentenceTransformer
- sentences = ["This is an example sentence", "Each sentence is converted"]

- model = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L6-v2')
- embeddings = model.encode(sentences)
- print(embeddings)
  ```
 
- ## Usage (HuggingFace Transformers)
- Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
-
- ```python
- from transformers import AutoTokenizer, AutoModel
- import torch
-

- # Mean pooling: take the attention mask into account for correct averaging
- def mean_pooling(model_output, attention_mask):
-     token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
-     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

- # Sentences we want sentence embeddings for
- sentences = ['This is an example sentence', 'Each sentence is converted']

- # Load model from HuggingFace Hub
- tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-MiniLM-L6-v2')
- model = AutoModel.from_pretrained('sentence-transformers/paraphrase-MiniLM-L6-v2')

- # Tokenize sentences
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

- # Compute token embeddings
- with torch.no_grad():
-     model_output = model(**encoded_input)

- # Perform pooling. In this case, mean pooling.
- sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

- print("Sentence embeddings:")
- print(sentence_embeddings)
  ```

-
- ## Full Model Architecture
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
-   (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
- )
  ```
 
- ## Citing & Authors
-
- This model was trained by [sentence-transformers](https://www.sbert.net/).
-
- If you find this model helpful, feel free to cite our publication [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084):
- ```bibtex
- @inproceedings{reimers-2019-sentence-bert,
-     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
-     author = "Reimers, Nils and Gurevych, Iryna",
-     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
-     month = "11",
-     year = "2019",
-     publisher = "Association for Computational Linguistics",
-     url = "http://arxiv.org/abs/1908.10084",
- }
- ```
 
  library_name: sentence-transformers
  tags:
  - sentence-transformers
+ - semantic-search
  - feature-extraction
  - sentence-similarity
+ - cybersecurity
  pipeline_tag: sentence-similarity
  ---
 
+ # MrZaper/LiteModel

+ **MrZaper/LiteModel** is a lightweight [sentence-transformers](https://www.SBERT.net) model fine-tuned for **semantic search and retrieval of academic articles in cybersecurity**.
+ It maps queries and article phrases into a 384-dimensional dense vector space for similarity search, clustering, and semantic matching.

+ This model is trained specifically for the journal **Cybersecurity: Education, Science, Technology**.
+ Website: [https://csecurity.kubg.edu.ua](https://csecurity.kubg.edu.ua/index.php/journal)

+ # What does it do?

+ Given a query in **English, Ukrainian, or any other language**, the search pipeline built around the model:

+ - Translates the query to English (using Google Translate).
+ - Encodes the query into a dense embedding using Sentence-BERT.
+ - Computes cosine similarity between the query embedding and **precomputed article embeddings**.
+ - Returns the top **unique article codes** with the highest similarity scores.

+ Returned article codes can be viewed at:

+ ```
+ https://csecurity.kubg.edu.ua/index.php/journal/article/view/{CODE}
  ```

+ For example:

+ `560` → [https://csecurity.kubg.edu.ua/index.php/journal/article/view/560](https://csecurity.kubg.edu.ua/index.php/journal/article/view/560)

+ ---
 
+ # Model Files

+ The repository includes:

+ - `LiteModel` – SBERT-based semantic encoder
+ - `sbert_embeddings.npy` – Precomputed embeddings for articles
+ - `sbert_labels.pkl` – Corresponding article codes (e.g., `560`, `532`)

+ ---
 
+ # Usage (Sentence-Transformers)

+ Install the required packages:

+ ```bash
+ pip install -U sentence-transformers deep-translator huggingface-hub scikit-learn
+ ```
+ Example usage:
+ ```python
+ from sentence_transformers import SentenceTransformer
+ import numpy as np
+ import pickle
+ from huggingface_hub import snapshot_download
+ from deep_translator import GoogleTranslator
+ import os
+ from sklearn.metrics.pairwise import cosine_similarity
+
+ # Load model and data from Hugging Face
+ model_name = 'MrZaper/LiteModel'
+ model_dir = snapshot_download(repo_id=model_name)
+
+ # Load SBERT model
+ sbert_model = SentenceTransformer(model_dir)
+
+ # Load precomputed article embeddings
+ embeddings = np.load(os.path.join(model_dir, "sbert_embeddings.npy"))
+
+ # Load article codes (labels)
+ with open(os.path.join(model_dir, "sbert_labels.pkl"), 'rb') as f:
+     labels = pickle.load(f)
+
+ def preprocess_query(query: str) -> str:
+     """Translate the query to English using Google Translate."""
+     try:
+         return GoogleTranslator(source="auto", target="en").translate(query)
+     except Exception as e:
+         # Fall back to the untranslated query if translation fails
+         print(f"Translation error: {e}")
+         return query
+
+ def predict_semantic(query, model, embeddings, labels, top_n=5):
+     """Find top-N most semantically similar unique article codes."""
+     query_emb = model.encode([preprocess_query(query)])
+     similarities = cosine_similarity(query_emb, embeddings)[0]
+
+     seen_keys = set()
+     results = []
+
+     # Sort results by similarity (descending)
+     sorted_indices = np.argsort(similarities)[::-1]
+
+     for idx in sorted_indices:
+         label = labels[idx]
+         sim = similarities[idx]
+
+         if label not in seen_keys:
+             seen_keys.add(label)
+             results.append({
+                 "article_code": label,
+                 "similarity": float(sim)
+             })
+             print(f"📄 Article {label} – similarity: {sim * 100:.2f}%")
+
+         if len(results) >= top_n:
+             break
+
+     return results
+
+ # Example query
+ query = "sql injection in websites"
+ results = predict_semantic(query, sbert_model, embeddings, labels)
+
+ print("\nTop article codes:")
+ for res in results:
+     print(f"Article {res['article_code']} – similarity: {res['similarity']*100:.2f}%")
  ```
 
+ # Example Output
+ 📄 Article 560 – similarity: 92.15%
+ 📄 Article 532 – similarity: 89.34%
+ 📄 Article 475 – similarity: 85.22%

+ Corresponding links:
+ ```
+ https://csecurity.kubg.edu.ua/index.php/journal/article/view/560
+ https://csecurity.kubg.edu.ua/index.php/journal/article/view/532
+ https://csecurity.kubg.edu.ua/index.php/journal/article/view/475
  ```
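
Review note (not part of the commit): the ranking step described in the added README (cosine similarity against precomputed embeddings, then deduplication by article code) can be sketched with NumPy alone. This is a minimal sketch under stated assumptions, not the author's implementation: the function name `top_unique_codes` and the toy 2-dimensional embeddings are hypothetical, the real embeddings are 384-dimensional, and all vectors are assumed to be non-zero. Only the URL pattern is taken from the README.

```python
import numpy as np

# URL pattern quoted in the README; {code} is the article code.
ARTICLE_URL = "https://csecurity.kubg.edu.ua/index.php/journal/article/view/{code}"


def top_unique_codes(query_emb, embeddings, labels, top_n=5):
    """Rank article codes by cosine similarity, keeping each code once.

    Hypothetical helper: assumes `query_emb` has shape (d,), `embeddings`
    has shape (n, d), `labels` has length n, and no vector is all zeros.
    """
    # L2-normalize both sides so a dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    m = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = m @ q

    results, seen = [], set()
    # Walk the indices from most to least similar.
    for idx in np.argsort(sims)[::-1]:
        code = labels[idx]
        if code in seen:
            continue  # a higher-scoring phrase of this article was already kept
        seen.add(code)
        results.append((code, float(sims[idx]), ARTICLE_URL.format(code=code)))
        if len(results) >= top_n:
            break
    return results


# Toy data (hypothetical): two phrases belong to article 560.
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
toy_labels = ["560", "560", "532", "475"]

ranked = top_unique_codes(np.array([1.0, 0.0]), toy_embeddings, toy_labels, top_n=2)
for code, sim, url in ranked:
    print(code, f"{sim:.2f}", url)
```

Because several embedding rows can map to the same article code, deduplicating after sorting keeps only the best-scoring phrase per article, which matches the "unique article codes" behavior the README describes.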