---
base_model: microsoft/phi-2
library_name: peft
license: mit
language:
- en
tags:
- phi-2
- lora
- kali-linux
- penetration-testing
- security
- fine-tuned
---
# Phi-2 Fine-tuned on Kali Linux Documentation
A Microsoft Phi-2 (2.7B) model fine-tuned with LoRA adapters for Kali Linux and penetration-testing Q&A.
## Model Details
### Model Description
This is a LoRA-adapted Phi-2 model fine-tuned on Kali Linux documentation for answering cybersecurity and penetration testing questions.
- **Developed by:** Mithun Kumar
- **Model type:** Causal Language Model
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** [microsoft/phi-2](https://huggingface.co/microsoft/phi-2)
- **Fine-tuning method:** LoRA (Low-Rank Adaptation)
### Model Sources
- **Repository:** [GitHub](https://github.com/yourusername/phi2-kali-linux)
- **Demo:** [Hugging Face Space](https://huggingface.co/spaces/Mithun-999/phi2-kali-linux)
## Uses
### Direct Use
This model is designed to answer questions related to:
- **Kali Linux tools and commands**
- **Penetration testing methodologies**
- **Cybersecurity concepts**
- **Linux administration and troubleshooting**
The model can be used for:
- Chatbots and Q&A systems
- Educational tools for cybersecurity training
- Documentation lookup and explanation
- Penetration testing knowledge base
### Intended Use
**Best practices for using this model:**
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    device_map="cpu",
    torch_dtype=torch.float32,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

# Attach the LoRA adapter
model = PeftModel.from_pretrained(base_model, "Mithun-999/phi2-kali-linux-finetuned")
model.eval()

# Generate a response
prompt = "What is the purpose of nmap in Kali Linux?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=256,  # prompt + generated tokens
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,  # phi-2 has no dedicated pad token
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Out-of-Scope Use
This model is **NOT** intended for:
- Illegal hacking or unauthorized system access
- Bypassing security measures on systems you don't own
- Creating malware or exploits for malicious purposes
- Any activity that violates laws or ethical guidelines
## Bias, Risks, and Limitations
**Potential Biases:**
- Training data reflects tool documentation and may carry any biases present in the original Kali Linux materials
- The Q&A generation heuristic may favor common scenarios over edge cases
**Known Risks:**
- Responses may contain outdated information (bounded by the publication dates of the source PDFs)
- Generated answers may be incomplete and should be verified manually
- The model can produce misleading information when prompted outside its training domain
**Limitations:**
- Performance degrades on topics outside Kali Linux documentation
- Single-epoch training may limit depth of learning
- CPU inference is significantly slower than GPU
### Recommendations
Users should:
- Verify all security-related advice with official documentation
- Only use on systems you own or have explicit authorization to test
- Treat output as supplementary information, not absolute truth
- Follow responsible disclosure practices if discovering vulnerabilities
## How to Get Started with the Model
Try the interactive demo: [Kali Linux Q&A Space](https://huggingface.co/spaces/Mithun-999/phi2-kali-linux)
Or run locally with the code example provided in the "Intended Use" section above.
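If you prefer a standalone checkpoint over a base-model-plus-adapter pair, the LoRA weights can be folded in with PEFT's `merge_and_unload`. A minimal sketch continuing from the loading example above (the output directory name is illustrative):

```python
# Continuing from the loading example above (`model`, `tokenizer`)
merged = model.merge_and_unload()            # fold LoRA weights into the base model
merged.save_pretrained("phi2-kali-merged")   # illustrative output directory
tokenizer.save_pretrained("phi2-kali-merged")
```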
## Training Details
### Training Data
The model was fine-tuned on extracted text from **5 Kali Linux PDF documents**:
| Document | Size | Content Focus |
|----------|------|---------------|
| PDF 1 | Large | Kali Linux fundamentals, tools overview |
| PDF 2 | Large | Network penetration testing techniques |
| PDF 3 | Medium | Web application penetration testing |
| PDF 4 | Medium | Post-exploitation and privilege escalation |
| PDF 5 | Medium | Linux system hardening and defense |
**Data Extraction Summary:**
- **Total Characters Extracted:** 2,300,000+ characters
- **Total Words:** 336,000+ words
- **Extraction Method:** PyPDF2 text extraction from PDF documents
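As an illustration of the extraction step, a minimal PyPDF2 loop might look like this (a sketch; the file names are hypothetical, and PyPDF2 3.x exposes `PdfReader` where older releases used `PdfFileReader`):

```python
from PyPDF2 import PdfReader  # PyPDF2 3.x API

# Hypothetical file names; the five source PDFs are not listed in this card.
pdf_paths = [f"kali_doc_{i}.pdf" for i in range(1, 6)]

raw_text = ""
for path in pdf_paths:
    reader = PdfReader(path)
    for page in reader.pages:
        # extract_text() can return None for image-only pages
        raw_text += (page.extract_text() or "") + "\n"

print(f"Extracted {len(raw_text):,} characters")
```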
### Training Data Processing
The training dataset was generated using a **heuristic question-answer generation approach**:
1. **Text Chunking:** PDF text split into 512-character chunks with 128-character overlap (sliding window; see the sketch after this list)
2. **Sentence Extraction:** Chunks processed to extract meaningful sentences using NLTK sentence tokenizer
3. **Q&A Pairing:** Question-answer pairs generated by:
- Extracting sentences as answers
- Creating relevant questions from answer content
- Using keyword extraction and pattern matching
- Filtering for quality and relevance
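Steps 1 and 2 might be sketched as follows (a minimal illustration, not the exact pipeline; `raw_text` stands in for the extracted PDF text):

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # "punkt_tab" on newer NLTK releases

def chunk_text(text, size=512, overlap=128):
    """Sliding-window chunking: fixed-size chunks with the given overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

raw_text = "Nmap is a network scanner preinstalled on Kali Linux. It maps hosts and open ports."
chunks = chunk_text(raw_text)
# Overlapping chunks can repeat sentences; the real pipeline filters for quality.
sentences = [s for chunk in chunks for s in sent_tokenize(chunk)]
```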
**Dataset Statistics:**
| Split | Count | Percentage |
|-------|-------|-----------|
| Training | 23,776 | 80% |
| Validation | 2,972 | 10% |
| Testing | 2,972 | 10% |
| **Total** | **29,720** | **100%** |
**Dataset Format:** JSONL and CSV (see the loading sketch below)
- **Average Question Length:** 15-25 tokens
- **Average Answer Length:** 40-100 tokens
- **Total Training Examples:** 23,776 Q&A pairs
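The JSONL files can be read line by line with the standard library. The record shape below is an assumption; check the released files for the actual field names:

```python
import json

# Assumed schema, e.g. {"question": "...", "answer": "..."}
with open("train.jsonl", "r", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

print(len(examples), examples[0])
```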
### Training Procedure
**Training Environment:**
- **Platform:** Kaggle GPU (NVIDIA Tesla T4 × 2)
- **Framework:** PyTorch 2.0.1 + Transformers 4.40.0 + PEFT 0.8.2
- **Precision:** float32 (full precision for stability)
- **Device:** cuda
**Hyperparameters:**
| Parameter | Value |
|-----------|-------|
| Learning Rate | 0.00005 |
| Batch Size | 1 |
| Epochs | 1 |
| Max Sequence Length | 256 tokens |
| Gradient Clipping Norm | 1.0 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
**LoRA Configuration:**
| Parameter | Value |
|-----------|-------|
| LoRA Rank (r) | 8 |
| LoRA Alpha | 16 |
| LoRA Dropout | 0.05 |
| Target Modules | ["q_proj", "v_proj"] |
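These settings map directly onto a PEFT `LoraConfig`. In the sketch below, `bias` and `task_type` are assumptions (the card does not state them), the rest comes from the table, and `base_model` is assumed to be the loaded microsoft/phi-2:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",            # assumption: bias terms left frozen
    task_type="CAUSAL_LM",  # assumption: causal-LM task type
)
model = get_peft_model(base_model, lora_config)  # base_model: loaded microsoft/phi-2
model.print_trainable_parameters()  # should report well under 1% trainable
```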
**Training Time & Resources:**
- **Estimated Duration:** ~2 hours on Kaggle GPU
- **Dataset Size:** 23,776 training examples
- **Total Tokens Processed:** ~6.1M tokens
- **Model Adapter Size:** 13.2 MB (99.76% reduction from 5.5GB base)
**Key Optimizations:**
- Memory management: `torch.cuda.empty_cache()` called after each batch (see the training-step sketch below)
- Gradient accumulation: not needed with batch_size=1 and the reduced max sequence length
- Precision: full float32 (not fp16) to prevent NaN losses
- Real-time progress tracking with ETA calculation
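Putting the hyperparameters and optimizations together, a single manual training step might look like this (a sketch only; the actual Kaggle notebook is not included in this card, and `train_loader` is assumed to yield tokenized batches for the PEFT-wrapped `model`):

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

model.train()
for batch in train_loader:  # batch_size=1, sequences truncated to 256 tokens
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.empty_cache()  # free cached GPU memory after each batch
```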
## Evaluation
### Results
**Model Performance:**
- ✅ Successfully trained on 23,776 Q&A pairs without errors
- ✅ Loss converged during training (float32 precision prevented NaN)
- ✅ Model inference working on both GPU and CPU
- ✅ LoRA adapter reduced parameters to **13.2 MB** (vs 5.5GB base model)
- ✅ Inference latency: **~2-5 seconds on GPU**, ~30-60 seconds on CPU
**Task Completion:**
- ✅ Dataset generated: 29,720 Q&A pairs from 5 PDFs
- ✅ Fine-tuning: Successfully completed on Kaggle GPU
- ✅ Model deployment: Live on HuggingFace Hub and Spaces
- ✅ Inference interface: Gradio-based web UI available
#### Summary
The model successfully learned Kali Linux documentation and can generate contextually relevant responses to penetration testing and cybersecurity questions. The lightweight LoRA adapter (13.2MB) makes deployment feasible on resource-constrained platforms.
## Environmental Impact
**Hardware Type:** NVIDIA Tesla T4 GPU (2x on Kaggle)
**Hours used:** ~2 hours
**Cloud Provider:** Kaggle
**Compute Region:** Cloud (exact region not specified)
**Carbon Emitted:** Estimated low (~0.1-0.5 kg CO2eq for 2-hour GPU training)
Training focused on efficiency with reduced parameters (LoRA) and single epoch.
## Technical Specifications
### Model Architecture and Objective
- **Architecture:** Transformer-based causal language model (Phi-2, 2.7B parameters)
- **Objective:** Next-token prediction with LoRA fine-tuning
- **Maximum Sequence Length:** 256 tokens
- **Vocabulary Size:** 50,257 tokens (tokenizer from microsoft/phi-2)
### Compute Infrastructure
#### Hardware
- **Training:** NVIDIA Tesla T4 GPU × 2 (16GB VRAM each)
- **Inference:** CPU or GPU capable (tested on both)
#### Software
**Training Stack:**
- PyTorch 2.0.1
- Transformers 4.40.0
- PEFT 0.8.2
- Accelerate 0.27.0
- PyPDF2 (data extraction)
**Deployment Stack:**
- Transformers 4.41.2 (HF Space)
- PEFT 0.11.1 (HF Space)
- Gradio 4.0+
- SafeTensors
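A minimal Gradio wrapper in the spirit of the Space's UI might look like this (a sketch reusing `model` and `tokenizer` from the loading example above; the actual Space code may differ):

```python
import gradio as gr

def answer(question: str) -> str:
    inputs = tokenizer(question, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

gr.Interface(fn=answer, inputs="text", outputs="text",
             title="Kali Linux Q&A").launch()
```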
## Citation
If you use this model, please cite:
```bibtex
@misc{phi2-kali-linux,
  author       = {Kumar, Mithun},
  title        = {Phi-2 Fine-tuned on Kali Linux Documentation},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Mithun-999/phi2-kali-linux-finetuned}},
  license      = {MIT}
}
```
**APA Citation:**
Kumar, M. (2024). Phi-2 fine-tuned on Kali Linux documentation [Model]. HuggingFace. https://huggingface.co/Mithun-999/phi2-kali-linux-finetuned
## Glossary
- **LoRA:** Low-Rank Adaptation - a technique to fine-tune large models with minimal parameters
- **Adapter:** LoRA weights that modify base model behavior (13.2 MB in this case)
- **Tokenizer:** Converts text to numerical tokens for model input
- **Gradient Clipping:** Prevents gradient explosion during training
- **SafeTensors:** Safe serialization format for model weights
## More Information
**GitHub Repository:** [phi2-kali-linux](https://github.com/yourusername/phi2-kali-linux)
**Related Resources:**
- [Phi-2 Model](https://huggingface.co/microsoft/phi-2)
- [LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)](https://arxiv.org/abs/2106.09685)
- [PEFT Library](https://github.com/huggingface/peft)
- [Kali Linux Official](https://www.kali.org/)
## Model Card Authors
- **Primary Author:** Mithun Kumar
- **Framework & Methodology:** PyTorch, HuggingFace Transformers, PEFT
- **Platform:** Kaggle (training), HuggingFace Hub (deployment)
## Ethical Considerations & Responsible Use
**This model is intended for educational and authorized security purposes only.**
### Permitted Uses:
✅ Learning penetration testing on systems you own or have permission to test
✅ Cybersecurity education and training
✅ Defensive security research
✅ Documentation lookup for Kali Linux tools
### Prohibited Uses:
❌ Unauthorized access to systems
❌ Malware creation or distribution
❌ Violating laws (CFAA, GDPR, etc.)
❌ Privacy violations or data theft
❌ Targeting systems without explicit authorization
**Users must comply with all applicable laws and ethical guidelines.**
## Model Card Contact
- **Author:** Mithun Kumar
- **GitHub Issues:** [Report issues here](https://github.com/yourusername/phi2-kali-linux/issues)
- **Discussion Space:** [HuggingFace Space](https://huggingface.co/spaces/Mithun-999/phi2-kali-linux)
---
**Last Updated:** January 2024
**License:** MIT
**Base Model:** [microsoft/phi-2](https://huggingface.co/microsoft/phi-2)