---
library_name: gguf
license: other
base_model: google/gemma-4-e4b-it
tags:
- gguf
- gemma4
- gemma
- unsloth
- social-engineering
- cybersecurity
- phishing
- red-team
- conversational
- fine-tuned
- llama.cpp
pipeline_tag: text-generation
language:
- en
- fa
datasets:
- smd20/social-engineering-qa-english
- smd20/social-engineering-qa-persian
---
# Social Engineering Specialist — Gemma 4 E4B (GGUF)
**`smd20/socialengineering`** is a domain-specialized conversational model for **social engineering,
phishing awareness, and red-team education**, fine-tuned from **Google Gemma 4 E4B**
using [Unsloth](https://github.com/unslothai/unsloth) and exported as **BF16 GGUF**
for efficient local deployment with `llama.cpp`, Ollama, LM Studio, and related runtimes.
The model was trained on a bilingual Q&A corpus (3,330 English and 3,330 Persian
records) derived from eight social-engineering reference books, covering definitions,
attack techniques (phishing, vishing, pretexting, baiting, tailgating), case studies,
and defensive strategies.
---
## Model Summary
| Property | Value |
| --- | --- |
| **Base architecture** | Gemma 4 (E4B instruction-tuned variant) |
| **Parameters** | ~8B |
| **Precision / format** | BF16 GGUF |
| **Primary weight file** | `unsloth-gemma-4-E4B-it.BF16.gguf` |
| **Multimodal projector** | `unsloth-gemma-4-E4B-it.BF16-mmproj.gguf` |
| **Fine-tuning framework** | [Unsloth](https://github.com/unslothai/unsloth) |
| **Domain** | Social engineering, phishing, red-team awareness |
| **Languages** | English, Persian (Farsi) |
| **Context length (training)** | 2,048 tokens |
| **Repository** | [smd20/socialengineering](https://huggingface.co/smd20/socialengineering) |
---
## Intended Use
### Primary use cases
- Organizational **security-awareness chatbots**
- **Phishing and social-engineering education** for analysts and end users
- **Red-team / blue-team training** scenarios in controlled environments
- Local, privacy-preserving Q&A over social-engineering concepts
### Out-of-scope / misuse
This model is **not** a substitute for legal, operational, or incident-response
authority. It must **not** be used to conduct unauthorized attacks, harvest credentials,
or deceive individuals outside approved training and research contexts.
---
## Training Procedure
Fine-tuning was performed in **Unsloth Studio** on top of **`gemma-4-E4B`**, using a
bilingual social-engineering Q&A corpus built from structured knowledge articles
extracted from eight reference books.
### Training hyperparameters
| Setting | Value |
| --- | --- |
| Epochs | 30 |
| Learning rate | `2.0e-4` |
| Context length | 2,048 |
| LoRA rank | 16 |
| LoRA dropout | 0.16 |
| LoRA target modules | All enabled (`Enable LoRA`) |
| Optimizer | AdamW 8-bit |
| LR scheduler | Linear |
| Weight decay | 0.001 |
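
To make the schedule concrete, the sketch below computes the learning rate a linear scheduler would produce at a given optimizer step. It is only an illustration: the effective batch size (16), the absence of warmup, and the use of all 6,660 records for training are assumptions that are not stated in this card.

```python
import math

BASE_LR = 2.0e-4      # from the hyperparameter table
EPOCHS = 30           # from the hyperparameter table
NUM_RECORDS = 6_660   # assumption: every EN + FA record is used for training


def steps_per_epoch(num_records: int, effective_batch_size: int) -> int:
    """Optimizer steps needed to see every record once."""
    return math.ceil(num_records / effective_batch_size)


def linear_lr(step: int, total_steps: int, base_lr: float = BASE_LR) -> float:
    """Linear decay from base_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


# effective_batch_size=16 is a hypothetical value chosen for illustration only.
total_steps = EPOCHS * steps_per_epoch(NUM_RECORDS, effective_batch_size=16)

for step in (0, total_steps // 2, total_steps):
    print(f"step {step:>6}: lr = {linear_lr(step, total_steps):.2e}")
```

With a different batch size or a warmup phase, the step count and the early part of the curve would change accordingly.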
### Export configuration
| Setting | Value |
| --- | --- |
| Training run | `gemma-4-E4B` |
| Export method | GGUF (quantized export path) |
| Published precision | BF16 |
| Main artifact | `unsloth-gemma-4-E4B-it.BF16.gguf` |
The published checkpoint preserves the merged fine-tuned weights in GGUF form for
deployment with `llama.cpp`-compatible runtimes.
---
## Training Data
The model was trained on conversational Q&A pairs grounded in curated social-engineering
knowledge. The underlying datasets are publicly released on Hugging Face:
| Dataset | URL | Records |
| --- | --- | ---: |
| English Q&A | [https://huggingface.co/datasets/smd20/social-engineering-qa-english](https://huggingface.co/datasets/smd20/social-engineering-qa-english) | 3,330 |
| Persian Q&A | [https://huggingface.co/datasets/smd20/social-engineering-qa-persian](https://huggingface.co/datasets/smd20/social-engineering-qa-persian) | 3,330 |
### Reference corpora
Knowledge articles were derived from the following legally acquired books:
- Deep Insight into Social Engineering
- ESET Social Engineering Handbook
- Learn Social Engineering: Learn the Art of Human Hacking (Erdal Ozkaya)
- Social Engineering: How Crowdmasters, Phreaks, Hackers, and Trolls Created a New Form of Manipulative Communication (Gehl & Lawson)
- Social Engineering in Cybersecurity: Threats and Defenses (Gururaj et al.)
- Social Engineering: The Science of Human Hacking (Christopher Hadnagy)
- Social Engineering: The Art of Human Hacking (Christopher Hadnagy)
- Sefreta: Zero to Hundred Social Engineering (Persian)
### Corpus construction pipeline
1. Controlled segmentation of reference books
2. Schema-driven knowledge article generation (JSONL)
3. Grounded bilingual Q&A generation with strict source constraints
4. Global deduplication and bilingual split
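
Step 4 of the pipeline above can be sketched roughly as follows. This is not the author's pipeline code: the record field names (`question_en`, `answer_en`, `question_fa`, `answer_fa`) and the normalization rule used to detect duplicates are hypothetical choices made for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    """Normalize Unicode, case, and whitespace so near-identical questions collide."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(text.split())


def dedupe_and_split(records):
    """Drop records whose English question was already seen, then split by language.

    `records` is an iterable of dicts with the hypothetical keys
    question_en, answer_en, question_fa, answer_fa.
    Returns (english_rows, persian_rows, skipped_count).
    """
    seen = set()
    english, persian = [], []
    skipped = 0
    for rec in records:
        key = normalize(rec["question_en"])
        if key in seen:
            skipped += 1
            continue
        seen.add(key)
        english.append({"question": rec["question_en"], "answer": rec["answer_en"]})
        persian.append({"question": rec["question_fa"], "answer": rec["answer_fa"]})
    return english, persian, skipped
```

Keying on the English question keeps the two splits aligned, so each surviving record contributes exactly one row to each language.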
### Training Corpus Overview
| Metric | Value |
| --- | ---: |
| English Q&A records | 3,330 |
| Persian Q&A records | 3,330 |
| Bilingual question units | 3,330 |
| Total bilingual records (EN + FA) | 6,660 |
| Structured knowledge articles | 1,165 |
| Article coverage | 1,163 / 1,165 (99.8%) |
| Reference books | 8 |
| Duplicates skipped during deduplication (v1) | 159 |
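
As a quick sanity check, the derived figures in this table can be recomputed from the raw counts. The snippet uses only numbers from the table; the "pairs per covered article" ratio is an extra quantity computed here for illustration and is not reported in the card.

```python
english_records = 3_330
persian_records = 3_330
articles_total = 1_165
articles_covered = 1_163

# Total bilingual records (EN + FA)
total_records = english_records + persian_records

# Share of knowledge articles that produced at least one Q&A record
coverage_pct = round(100 * articles_covered / articles_total, 1)

# Average number of bilingual question units per covered article
pairs_per_covered_article = english_records / articles_covered

print(total_records, coverage_pct, round(pairs_per_covered_article, 2))
```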
### Character-Length Statistics
| Split | Field | Mean | Median | Std. Dev. | Min | Max |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| English | Question | 96.56 | 95.0 | 21.98 | 23 | 199 |
| English | Answer | 180.12 | 171.0 | 80.13 | 3 | 827 |
| Persian | Question | 81.08 | 80.0 | 21.76 | 12 | 181 |
| Persian | Answer | 163.48 | 153.0 | 74.06 | 3 | 481 |
| Combined (EN+FA) | Question | 88.82 | 88.0 | 23.2 | 12 | 199 |
| Combined (EN+FA) | Answer | 171.8 | 161.0 | 77.6 | 3 | 827 |
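
Statistics of this kind could be reproduced with a helper along these lines. It is a sketch under two assumptions: length is measured with Python's `len()` (Unicode code points, which matters for Persian text), and the standard deviation is the population form. The card does not say which variant was used.

```python
import statistics


def length_stats(texts):
    """Character-length summary for a list of strings."""
    lengths = [len(t) for t in texts]
    return {
        "mean": round(statistics.mean(lengths), 2),
        "median": statistics.median(lengths),
        "std": round(statistics.pstdev(lengths), 2),  # population std. dev. (assumption)
        "min": min(lengths),
        "max": max(lengths),
    }


# Because both splits contain the same number of records, the combined mean
# is the plain average of the per-language means reported in the table.
combined_question_mean = round((96.56 + 81.08) / 2, 2)
combined_answer_mean = round((180.12 + 163.48) / 2, 2)
print(combined_question_mean, combined_answer_mean)
```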
### Knowledge Articles per Reference Book
| Reference Book (internal ID) | Knowledge Articles |
| --- | ---: |
| Learn-Social-Engineering-Learn-the-Art-of-Human-Hacking-Dr.-Erdal-Ozkaya-_-WeLib.org-__FULL | 397 |
| Social-Engineering-Science-Hacking-Hadnagy_FULL | 239 |
| Social-Engineering-Cybersecurity-Gururaj_FULL | 212 |
| Social-Engineering-Crowdmasters-Gehl-Lawson_FULL | 206 |
| Sefreta-Social-Engineering_FULL | 55 |
| ESET-Social_engineering_handbook_FULL | 28 |
| Social-Engineering-Art-Hacking-Hadnagy_FULL | 21 |
| deep-insight-into-social-engineering_FULL | 7 |
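
The per-book counts can be cross-checked against the total of 1,165 structured knowledge articles and turned into shares of the corpus. The short dictionary keys below are abbreviations introduced here for readability; they are not identifiers used in the datasets.

```python
articles_per_book = {
    "ozkaya_learn_se": 397,
    "hadnagy_science": 239,
    "gururaj_cybersecurity": 212,
    "gehl_lawson_crowdmasters": 206,
    "sefreta": 55,
    "eset_handbook": 28,
    "hadnagy_art": 21,
    "deep_insight": 7,
}

total_articles = sum(articles_per_book.values())

# Percentage of all knowledge articles contributed by each book
shares = {
    book: round(100 * count / total_articles, 1)
    for book, count in articles_per_book.items()
}

for book, pct in shares.items():
    print(f"{book:<26} {pct:>5.1f}%")
```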
---
## Evaluation & Limitations
- The model inherits base-model limitations and may **hallucinate** on out-of-domain queries.
- The training data was generated with LLM assistance and should be reviewed by
humans before high-stakes deployments.
- Copyright of source books remains with publishers; released datasets contain **derived
annotations only**.
- BF16 GGUF requires approximately **15.1 GB** VRAM/RAM for full-precision loading.
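
The memory figure follows from BF16 storing each weight in 2 bytes. The back-of-the-envelope sketch below shows the relationship. It assumes decimal gigabytes and ignores GGUF metadata, the KV cache, and runtime buffers, so actual usage will be higher than the weight size alone.

```python
BYTES_PER_BF16_PARAM = 2


def weights_size_gb(num_params: float, bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Approximate size of the raw weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def params_from_size_gb(size_gb: float, bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Invert the estimate: how many parameters fit in a file of this size."""
    return size_gb * 1e9 / bytes_per_param


# A 15.1 GB BF16 file corresponds to roughly 7.55 billion parameters,
# which is consistent with the "~8B" figure in the model summary.
print(f"{params_from_size_gb(15.1) / 1e9:.2f}B parameters")
```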
---
## How to Download from Hugging Face
### Option 1 — `huggingface_hub` (recommended)
```python
from huggingface_hub import hf_hub_download

repo_id = "smd20/socialengineering"
# None falls back to the cached login or the HF_TOKEN environment variable;
# pass a token string explicitly if the repo is private.
token = None

# Main model weights
model_path = hf_hub_download(
    repo_id=repo_id,
    filename="unsloth-gemma-4-E4B-it.BF16.gguf",
    token=token,
)

# Multimodal projector (used for multimodal input)
mmproj_path = hf_hub_download(
    repo_id=repo_id,
    filename="unsloth-gemma-4-E4B-it.BF16-mmproj.gguf",
    token=token,
)

print("Model:", model_path)
print("MMProj:", mmproj_path)
```
### Option 2 — Snapshot download
```python
from huggingface_hub import snapshot_download

# Download every GGUF file in the repository into the local cache
local_dir = snapshot_download(
    repo_id="smd20/socialengineering",
    allow_patterns=["*.gguf"],
)

print("Downloaded to:", local_dir)
```
### Option 3 — CLI
```bash
huggingface-cli download smd20/socialengineering \
  unsloth-gemma-4-E4B-it.BF16.gguf \
  unsloth-gemma-4-E4B-it.BF16-mmproj.gguf
```
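
After downloading, it can be useful to confirm that a file really is a GGUF container before pointing a runtime at it. GGUF files begin with the 4-byte magic `GGUF` followed by a 32-bit version number. The helper below is a minimal sketch: `read_gguf_version` is a name introduced here, and it assumes the common little-endian layout.

```python
import struct


def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Bytes 4..8 hold the format version as an unsigned 32-bit integer
    # (little-endian assumed here).
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Example usage (the path is whatever the download step returned):
# print(read_gguf_version(model_path))
```

A truncated or interrupted download typically still passes this check, so it complements rather than replaces a file-size or checksum comparison.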
---
## Inference Examples
### `llama.cpp`
```bash
llama-cli -hf smd20/socialengineering:BF16 --jinja
```
For multimodal usage:
```bash
llama-mtmd-cli -hf smd20/socialengineering:BF16 --jinja
```
### `llama-cpp-python`
```python
from llama_cpp import Llama

# Load the main model weights (not the mmproj projector file)
llm = Llama.from_pretrained(
    repo_id="smd20/socialengineering",
    filename="unsloth-gemma-4-E4B-it.BF16.gguf",
)

# Send a single-turn chat request
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is pretexting in social engineering, and how does it differ from impersonation?",
        }
    ],
)

print(response["choices"][0]["message"]["content"])
```
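
For multi-turn use, for example in a security-awareness chatbot, the conversation history has to be passed back on every call. The helper below is one possible sketch: `ask` is a name introduced here, and the `max_tokens=256` default is an arbitrary choice. It works with any object that exposes `create_chat_completion`, such as the `llm` instance created above.

```python
def ask(llm, history, question, max_tokens=256):
    """Append a user turn, query the model, and record the assistant reply.

    `history` is a list of {"role": ..., "content": ...} dicts that is
    mutated in place so later calls see the earlier turns.
    """
    history.append({"role": "user", "content": question})
    response = llm.create_chat_completion(messages=history, max_tokens=max_tokens)
    answer = response["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": answer})
    return answer


# Example usage with the `llm` object from the previous block:
# history = []
# print(ask(llm, history, "What is vishing?"))
# print(ask(llm, history, "How can employees defend against it?"))
```

Since the model was fine-tuned with a 2,048-token context, long histories may need to be trimmed or summarized before they are sent.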
### Ollama
```bash
ollama run hf.co/smd20/socialengineering:BF16
```
---
## Authorship, Ownership, and Legal Notice
**Legal owner and maintainer:** **Samad Sohrab** — PhD Student in Artificial Intelligence.
This model checkpoint, its associated training configuration, and the derived Q&A
datasets released under the `smd20` Hugging Face namespace are authored and
maintained by **Samad Sohrab**. All rights in the model card, training pipeline
documentation, and derived dataset annotations are reserved by the author unless
otherwise stated in the repository license.
Source-book copyrights remain with their respective publishers. This repository
distributes **fine-tuned model weights** and **derived instructional annotations** only.
---
## Acknowledgments
This work was conducted under the research supervision of **Dr. Amir Nezami Safa**,
who served as academic advisor throughout dataset construction, model fine-tuning,
and publication. His guidance on methodology, reproducibility, and scientific rigor
was instrumental to this release.
Training infrastructure used [Unsloth](https://github.com/unslothai/unsloth) for
efficient Gemma 4 fine-tuning and GGUF export.
---
## Citation
If you use this model or the associated datasets in academic work, please cite:
```bibtex
@misc{sohrab2026socialengineering,
author = {Sohrab, Samad and Nazami Saffa, Amir},
title = {Social Engineering Specialist: Fine-Tuned Gemma 4 E4B (GGUF)},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/smd20/socialengineering}},
note = {PhD research release. Advisor: Dr. Amir Nazami Saffa}
}
```
---
## Dataset Citations
```bibtex
@misc{sohrab2026seqaen,
author = {Sohrab, Samad},
title = {Social Engineering Q\&A Dataset (English)},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/smd20/social-engineering-qa-english}}
}
@misc{sohrab2026seqafa,
author = {Sohrab, Samad},
title = {Social Engineering Q\&A Dataset (Persian)},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/smd20/social-engineering-qa-persian}}
}
```
---
*Model card last updated: 2026-06-21T12:56:17.859588+00:00*