|
|
--- |
|
|
base_model: microsoft/phi-2 |
|
|
library_name: peft |
|
|
license: mit |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- phi-2 |
|
|
- lora |
|
|
- kali-linux |
|
|
- penetration-testing |
|
|
- security |
|
|
- fine-tuned |
|
|
--- |
|
|
|
|
|
# Phi-2 Fine-tuned on Kali Linux Documentation |
|
|
|
|
|
A Microsoft Phi-2 (2.7B) model fine-tuned with LoRA adapters for Kali Linux and penetration-testing Q&A.
|
|
|
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
This is a LoRA-adapted Phi-2 model fine-tuned on Kali Linux documentation for answering cybersecurity and penetration testing questions. |
|
|
|
|
|
- **Developed by:** Mithun Kumar |
|
|
- **Model type:** Causal Language Model |
|
|
- **Language(s) (NLP):** English |
|
|
- **License:** MIT |
|
|
- **Finetuned from model:** [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) |
|
|
- **Fine-tuning method:** LoRA (Low-Rank Adaptation) |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** [GitHub](https://github.com/yourusername/phi2-kali-linux) |
|
|
- **Demo:** [Hugging Face Space](https://huggingface.co/spaces/Mithun-999/phi2-kali-linux) |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
This model is designed to answer questions related to: |
|
|
- **Kali Linux tools and commands** |
|
|
- **Penetration testing methodologies** |
|
|
- **Cybersecurity concepts** |
|
|
- **Linux administration and troubleshooting** |
|
|
|
|
|
The model can be used for: |
|
|
- Chatbots and Q&A systems |
|
|
- Educational tools for cybersecurity training |
|
|
- Documentation lookup and explanation |
|
|
- Penetration testing knowledge base |
|
|
|
|
|
### Intended Use |
|
|
|
|
|
**Best practices for using this model:** |
|
|
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    device_map="cpu",
    torch_dtype=torch.float32,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "Mithun-999/phi2-kali-linux-finetuned")

# Generate a response
prompt = "What is the purpose of nmap in Kali Linux?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,                   # cap on newly generated tokens
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,  # Phi-2 has no pad token; reuse EOS
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
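
For extended use, the LoRA weights can optionally be merged into the base model so inference runs without the PEFT wrapper. A minimal sketch using PEFT's `merge_and_unload()` (the save path is illustrative):

```python
# Optional: merge the LoRA weights into the base model for faster inference.
# After merging, the model behaves like a plain Transformers model and no
# longer requires the peft package at inference time.
merged_model = model.merge_and_unload()

# Save for adapter-free reloading (path is illustrative)
merged_model.save_pretrained("phi2-kali-merged")
tokenizer.save_pretrained("phi2-kali-merged")
```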
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
This model is **NOT** intended for: |
|
|
- Illegal hacking or unauthorized system access |
|
|
- Bypassing security measures on systems you don't own |
|
|
- Creating malware or exploits for malicious purposes |
|
|
- Any activity that violates laws or ethical guidelines |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
**Potential Biases:** |
|
|
- Training data reflects tool documentation and may inherit biases present in the original Kali Linux materials


- The heuristic Q&A generation may favor common scenarios over edge cases
|
|
|
|
|
**Known Risks:** |
|
|
- Responses may contain outdated information, depending on the publication dates of the source PDFs


- Generated answers may be incomplete and can require manual verification


- The model can generate misleading information when prompted outside its training domain
|
|
|
|
|
**Limitations:** |
|
|
- Performance degrades on topics outside Kali Linux documentation |
|
|
- Single-epoch training may limit depth of learning |
|
|
- CPU inference is significantly slower than GPU |
|
|
|
|
|
### Recommendations |
|
|
|
|
|
Users should: |
|
|
- Verify all security-related advice with official documentation |
|
|
- Only use on systems you own or have explicit authorization to test |
|
|
- Treat output as supplementary information, not absolute truth |
|
|
- Follow responsible disclosure practices if discovering vulnerabilities |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
Try the interactive demo: [Kali Linux Q&A Space](https://huggingface.co/spaces/Mithun-999/phi2-kali-linux) |
|
|
|
|
|
Or run locally with the code example provided in the "Intended Use" section above. |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was fine-tuned on extracted text from **5 Kali Linux PDF documents**: |
|
|
|
|
|
| Document | Size | Content Focus | |
|
|
|----------|------|---| |
|
|
| PDF 1 | Large | Kali Linux fundamentals, tools overview | |
|
|
| PDF 2 | Large | Network penetration testing techniques | |
|
|
| PDF 3 | Medium | Web application penetration testing | |
|
|
| PDF 4 | Medium | Post-exploitation and privilege escalation | |
|
|
| PDF 5 | Medium | Linux system hardening and defense | |
|
|
|
|
|
**Data Extraction Summary:** |
|
|
- **Total Characters Extracted:** 2,300,000+ characters |
|
|
- **Total Words:** 336,000+ words |
|
|
- **Extraction Method:** PyPDF2 text extraction from PDF documents |
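
As an illustration of the extraction step, here is a minimal PyPDF2 sketch; the file names are hypothetical stand-ins for the five source documents, and the actual extraction script is not part of this card:

```python
from PyPDF2 import PdfReader

def extract_text(pdf_path: str) -> str:
    """Concatenate the text of every page in a PDF."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Hypothetical file names standing in for the 5 source PDFs
corpus = "\n".join(extract_text(path) for path in [
    "kali_fundamentals.pdf",
    "network_pentest.pdf",
    "web_app_pentest.pdf",
    "post_exploitation.pdf",
    "system_hardening.pdf",
])
print(f"{len(corpus):,} characters extracted")
```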
|
|
|
|
|
### Training Data Processing |
|
|
|
|
|
The training dataset was generated using a **heuristic question-answer generation approach** (a condensed sketch follows the steps below):
|
|
|
|
|
1. **Text Chunking:** PDF text split into 512-character chunks with 128-character overlap (sliding window) |
|
|
2. **Sentence Extraction:** Chunks processed to extract meaningful sentences using the NLTK sentence tokenizer
|
|
3. **Q&A Pairing:** Question-answer pairs generated by: |
|
|
- Extracting sentences as answers |
|
|
- Creating relevant questions from answer content |
|
|
- Using keyword extraction and pattern matching |
|
|
- Filtering for quality and relevance |
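
The following condensed sketch illustrates the pipeline; the real heuristics are more involved, and the question template here is a placeholder assumption:

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)

def chunk_text(text: str, size: int = 512, overlap: int = 128) -> list[str]:
    """Sliding window: 512-character chunks with 128-character overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def make_qa_pairs(text: str) -> list[dict]:
    """Simplified heuristic: each substantial sentence becomes an answer,
    paired with a templated question built from its leading words."""
    pairs = []
    for chunk in chunk_text(text):
        for sentence in sent_tokenize(chunk):
            if len(sentence.split()) < 8:  # crude quality filter
                continue
            topic = " ".join(sentence.split()[:4])
            pairs.append({
                "question": f"What does the documentation say about {topic}?",
                "answer": sentence,
            })
    return pairs
```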
|
|
|
|
|
**Dataset Statistics:** |
|
|
|
|
|
| Split | Count | Percentage | |
|
|
|-------|-------|-----------| |
|
|
| Training | 23,776 | 80% | |
|
|
| Validation | 2,972 | 10% | |
|
|
| Testing | 2,972 | 10% | |
|
|
| **Total** | **29,720** | **100%** | |
|
|
|
|
|
**Dataset Format:** JSONL and CSV |
|
|
- **Average Question Length:** 15-25 tokens |
|
|
- **Average Answer Length:** 40-100 tokens |
|
|
- **Total Training Examples:** 23,776 Q&A pairs |
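
An illustrative JSONL record (the field names are assumptions based on the format described above):

```json
{"question": "What is the purpose of nmap in Kali Linux?", "answer": "Nmap is a network scanner used to discover hosts, open ports, and running services during reconnaissance."}
```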
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
**Training Environment:** |
|
|
- **Platform:** Kaggle GPU (NVIDIA Tesla T4 × 2)
|
|
- **Framework:** PyTorch 2.0.1 + Transformers 4.40.0 + PEFT 0.8.2 |
|
|
- **Precision:** float32 (full precision for stability) |
|
|
- **Device:** cuda |
|
|
|
|
|
**Hyperparameters:** |
|
|
|
|
|
| Parameter | Value | |
|
|
|-----------|-------| |
|
|
| Learning Rate | 0.00005 | |
|
|
| Batch Size | 1 | |
|
|
| Epochs | 1 | |
|
|
| Max Sequence Length | 256 tokens | |
|
|
| Gradient Clipping Norm | 1.0 | |
|
|
| Optimizer | AdamW | |
|
|
| Weight Decay | 0.01 | |
|
|
|
|
|
**LoRA Configuration:** |
|
|
|
|
|
| Parameter | Value | |
|
|
|-----------|-------| |
|
|
| LoRA Rank (r) | 8 | |
|
|
| LoRA Alpha | 16 | |
|
|
| LoRA Dropout | 0.05 | |
|
|
| Target Modules | ["q_proj", "v_proj"] | |
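
The table above corresponds roughly to the following PEFT setup; the `bias` and `task_type` values are assumptions not stated in this card:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # LoRA rank
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # Phi-2 attention projections
    bias="none",                          # assumption
    task_type="CAUSAL_LM",                # assumption
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports a small fraction of trainable weights
```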
|
|
|
|
|
**Training Time & Resources:** |
|
|
- **Estimated Duration:** ~2 hours on Kaggle GPU |
|
|
- **Dataset Size:** 23,776 training examples |
|
|
- **Total Tokens Processed:** ~6.1M tokens |
|
|
- **Model Adapter Size:** 13.2 MB (99.76% reduction from the 5.5 GB base model)
|
|
|
|
|
**Key Optimizations:** |
|
|
- Memory optimization: `torch.cuda.empty_cache()` after each batch |
|
|
- Gradient accumulation: not needed; batch_size=1 with the reduced max_length kept memory usage low


- Precision: full float32 rather than fp16, to prevent NaN losses
|
|
- Real-time progress tracking with ETA calculation |
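
A minimal training-loop sketch combining the hyperparameters and optimizations listed above; the DataLoader and `train_dataset` are illustrative, not the actual training script:

```python
import torch
from torch.utils.data import DataLoader

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
loader = DataLoader(train_dataset, batch_size=1, shuffle=True)
device = next(model.parameters()).device

model.train()
for batch in loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss                               # causal-LM loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.empty_cache()                                 # per-batch memory cleanup
```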
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Results |
|
|
|
|
|
**Model Performance:** |
|
|
|
|
|
- ✅ Successfully trained on 23,776 Q&A pairs without errors


- ✅ Loss converged during training (float32 precision prevented NaN)


- ✅ Model inference works on both GPU and CPU


- ✅ LoRA adapter reduced the weights to **13.2 MB** (vs the 5.5 GB base model)


- ✅ Inference latency: **~2-5 seconds on GPU**, ~30-60 seconds on CPU
|
|
|
|
|
**Task Completion:** |
|
|
- ✅ Dataset generated: 29,720 Q&A pairs from 5 PDFs


- ✅ Fine-tuning: successfully completed on Kaggle GPU


- ✅ Model deployment: live on Hugging Face Hub and Spaces


- ✅ Inference interface: Gradio-based web UI available (minimal sketch below)
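
A minimal Gradio interface along these lines might look as follows; this is a sketch reusing the `model` and `tokenizer` from the earlier example, not the Space's actual code:

```python
import gradio as gr

def answer(question: str) -> str:
    inputs = tokenizer(question, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

gr.Interface(fn=answer, inputs="text", outputs="text",
             title="Phi-2 Kali Linux Q&A").launch()
```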
|
|
|
|
|
#### Summary |
|
|
|
|
|
The model successfully learned Kali Linux documentation and can generate contextually relevant responses to penetration testing and cybersecurity questions. The lightweight LoRA adapter (13.2MB) makes deployment feasible on resource-constrained platforms. |
|
|
|
|
|
|
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
**Hardware Type:** NVIDIA Tesla T4 GPU (2x on Kaggle) |
|
|
**Hours used:** ~2 hours |
|
|
**Cloud Provider:** Kaggle |
|
|
**Compute Region:** Cloud (exact region not specified) |
|
|
**Carbon Emitted:** Estimated low (~0.1-0.5 kg CO2eq for 2-hour GPU training) |
|
|
|
|
|
Training focused on efficiency with reduced parameters (LoRA) and single epoch. |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Model Architecture and Objective |
|
|
|
|
|
- **Architecture:** Transformer-based causal language model (Phi-2, 2.7B parameters) |
|
|
- **Objective:** Next-token prediction with LoRA fine-tuning |
|
|
- **Maximum Sequence Length:** 256 tokens |
|
|
- **Vocabulary Size:** 50,257 tokens (tokenizer from microsoft/phi-2) |
|
|
|
|
|
### Compute Infrastructure |
|
|
|
|
|
#### Hardware |
|
|
|
|
|
- **Training:** NVIDIA Tesla T4 GPU × 2 (16 GB VRAM each)
|
|
- **Inference:** CPU or GPU capable (tested on both) |
|
|
|
|
|
#### Software |
|
|
|
|
|
**Training Stack:** |
|
|
- PyTorch 2.0.1 |
|
|
- Transformers 4.40.0 |
|
|
- PEFT 0.8.2 |
|
|
- Accelerate 0.27.0 |
|
|
- PyPDF2 (data extraction) |
|
|
|
|
|
**Deployment Stack:** |
|
|
- Transformers 4.41.2 (HF Space) |
|
|
- PEFT 0.11.1 (HF Space) |
|
|
- Gradio 4.0+ |
|
|
- SafeTensors |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex
@misc{phi2-kali-linux,
  author       = {Kumar, Mithun},
  title        = {Phi-2 Fine-tuned on Kali Linux Documentation},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Mithun-999/phi2-kali-linux-finetuned}},
  license      = {MIT}
}
```
|
|
|
|
|
**APA Citation:** |
|
|
|
|
|
Kumar, M. (2024). Phi-2 fine-tuned on Kali Linux documentation [Model]. Hugging Face. https://huggingface.co/Mithun-999/phi2-kali-linux-finetuned
|
|
|
|
|
## Glossary |
|
|
|
|
|
- **LoRA:** Low-Rank Adaptation - a technique to fine-tune large models with minimal parameters |
|
|
- **Adapter:** LoRA weights that modify base model behavior (13.2 MB in this case) |
|
|
- **Tokenizer:** Converts text to numerical tokens for model input |
|
|
- **Gradient Clipping:** Prevents gradient explosion during training |
|
|
- **SafeTensors:** Safe serialization format for model weights |
|
|
|
|
|
## More Information |
|
|
|
|
|
**GitHub Repository:** [phi2-kali-linux](https://github.com/yourusername/phi2-kali-linux) |
|
|
|
|
|
**Related Resources:** |
|
|
- [Phi-2 Model](https://huggingface.co/microsoft/phi-2) |
|
|
- [LoRA Paper (Hu et al., 2021)](https://arxiv.org/abs/2106.09685)
|
|
- [PEFT Library](https://github.com/huggingface/peft) |
|
|
- [Kali Linux Official](https://www.kali.org/) |
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
- **Primary Author:** Mithun Kumar |
|
|
- **Framework & Methodology:** PyTorch, HuggingFace Transformers, PEFT |
|
|
- **Platform:** Kaggle (training), HuggingFace Hub (deployment) |
|
|
|
|
|
## Ethical Considerations & Responsible Use |
|
|
|
|
|
**This model is intended for educational and authorized security purposes only.** |
|
|
|
|
|
### Permitted Uses: |
|
|
- ✅ Learning penetration testing on systems you own or have permission to test


- ✅ Cybersecurity education and training


- ✅ Defensive security research


- ✅ Documentation lookup for Kali Linux tools
|
|
|
|
|
### Prohibited Uses: |
|
|
- ❌ Unauthorized access to systems


- ❌ Malware creation or distribution


- ❌ Violating laws (CFAA, GDPR, etc.)


- ❌ Privacy violations or data theft


- ❌ Targeting systems without explicit authorization
|
|
|
|
|
**Users must comply with all applicable laws and ethical guidelines.** |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
- **Author:** Mithun Kumar |
|
|
- **GitHub Issues:** [Report issues here](https://github.com/yourusername/phi2-kali-linux/issues) |
|
|
- **Discussion Space:** [HuggingFace Space](https://huggingface.co/spaces/Mithun-999/phi2-kali-linux) |
|
|
|
|
|
--- |
|
|
|
|
|
**Last Updated:** January 2024 |
|
|
**License:** MIT |
|
|
**Base Model:** [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) |
|
|
|