---
base_model: microsoft/phi-2
library_name: peft
license: mit
language:
- en
tags:
- phi-2
- lora
- kali-linux
- penetration-testing
- security
- fine-tuned
---

# Phi-2 Fine-tuned on Kali Linux Documentation

A Microsoft Phi-2 (2.7B) model fine-tuned with LoRA adapters for Kali Linux and penetration-testing Q&A.

## Model Details

### Model Description

This is a LoRA-adapted Phi-2 model fine-tuned on Kali Linux documentation for answering cybersecurity and penetration-testing questions.

- **Developed by:** Mithun Kumar
- **Model type:** Causal Language Model
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** [microsoft/phi-2](https://huggingface.co/microsoft/phi-2)
- **Fine-tuning method:** LoRA (Low-Rank Adaptation)

### Model Sources

- **Repository:** [GitHub](https://github.com/yourusername/phi2-kali-linux)
- **Demo:** [Hugging Face Space](https://huggingface.co/spaces/Mithun-999/phi2-kali-linux)

## Uses

### Direct Use

This model is designed to answer questions related to:

- **Kali Linux tools and commands**
- **Penetration testing methodologies**
- **Cybersecurity concepts**
- **Linux administration and troubleshooting**

The model can be used for:

- Chatbots and Q&A systems
- Educational tools for cybersecurity training
- Documentation lookup and explanation
- Penetration-testing knowledge bases

### Intended Use

**Best practices for using this model:**

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    device_map="cpu",
    torch_dtype=torch.float32,
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "Mithun-999/phi2-kali-linux-finetuned")

# Generate a response
prompt = "What is the purpose of nmap in Kali Linux?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Out-of-Scope Use

This model is **NOT** intended for:

- Illegal hacking or unauthorized system access
- Bypassing security measures on systems you don't own
- Creating malware or exploits for malicious purposes
- Any activity that violates laws or ethical guidelines

## Bias, Risks, and Limitations

**Potential Biases:**

- The training data reflects tool documentation and may carry any biases present in the original Kali Linux materials
- The heuristic Q&A generation may favor common scenarios over edge cases

**Known Risks:**

- Responses may contain outdated information (depending on the dates of the source PDFs)
- Generated answers may be incomplete and require manual verification
- The model can generate misleading information when prompted outside its training domain

**Limitations:**

- Performance degrades on topics outside the Kali Linux documentation
- Single-epoch training may limit depth of learning
- CPU inference is significantly slower than GPU inference

### Recommendations

Users should:

- Verify all security-related advice against official documentation
- Only test systems they own or have explicit authorization to test
- Treat output as supplementary information, not absolute truth
- Follow responsible-disclosure practices when discovering vulnerabilities

## How to Get Started with the Model

Try the interactive demo: [Kali Linux Q&A Space](https://huggingface.co/spaces/Mithun-999/phi2-kali-linux)

Or run locally with the code example provided in the "Intended Use" section above.
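When running locally, note that decoding the full output sequence of a causal LM like Phi-2 returns the prompt followed by the completion. A small helper (hypothetical, not part of this repository) can strip the echoed prompt before display:

```python
def strip_prompt(prompt: str, response: str) -> str:
    """Remove the echoed prompt from a decoded causal-LM generation, if present."""
    if response.startswith(prompt):
        return response[len(prompt):].lstrip()
    return response

# Example with a hypothetical decoded output
decoded = "What is the purpose of nmap in Kali Linux? Nmap is a network scanner used for host discovery."
print(strip_prompt("What is the purpose of nmap in Kali Linux?", decoded))
# → Nmap is a network scanner used for host discovery.
```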
## Training Details

### Training Data

The model was fine-tuned on text extracted from **5 Kali Linux PDF documents**:

| Document | Size | Content Focus |
|----------|------|---------------|
| PDF 1 | Large | Kali Linux fundamentals, tools overview |
| PDF 2 | Large | Network penetration testing techniques |
| PDF 3 | Medium | Web application penetration testing |
| PDF 4 | Medium | Post-exploitation and privilege escalation |
| PDF 5 | Medium | Linux system hardening and defense |

**Data Extraction Summary:**

- **Total Characters Extracted:** 2,300,000+
- **Total Words:** 336,000+
- **Extraction Method:** PyPDF2 text extraction from PDF documents

### Training Data Processing

The training dataset was generated using a **heuristic question-answer generation approach**:

1. **Text Chunking:** PDF text split into 512-character chunks with 128-character overlap (sliding window)
2. **Sentence Extraction:** Chunks processed with the NLTK sentence tokenizer to extract meaningful sentences
3. **Q&A Pairing:** Question-answer pairs generated by:
   - Extracting sentences as answers
   - Creating relevant questions from answer content
   - Using keyword extraction and pattern matching
   - Filtering for quality and relevance

**Dataset Statistics:**

| Split | Count | Percentage |
|-------|-------|------------|
| Training | 23,776 | 80% |
| Validation | 2,972 | 10% |
| Testing | 2,972 | 10% |
| **Total** | **29,720** | **100%** |

**Dataset Format:** JSONL and CSV

- **Average Question Length:** 15-25 tokens
- **Average Answer Length:** 40-100 tokens
- **Total Training Examples:** 23,776 Q&A pairs

### Training Procedure

**Training Environment:**

- **Platform:** Kaggle GPU (NVIDIA Tesla T4 × 2)
- **Framework:** PyTorch 2.0.1 + Transformers 4.40.0 + PEFT 0.8.2
- **Precision:** float32 (full precision for stability)
- **Device:** cuda

**Hyperparameters:**

| Parameter | Value |
|-----------|-------|
| Learning Rate | 0.00005 |
| Batch Size | 1 |
| Epochs | 1 |
| Max Sequence Length | 256 tokens |
| Gradient Clipping Norm | 1.0 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |

**LoRA Configuration:**

| Parameter | Value |
|-----------|-------|
| LoRA Rank (r) | 8 |
| LoRA Alpha | 16 |
| LoRA Dropout | 0.05 |
| Target Modules | ["q_proj", "v_proj"] |

**Training Time & Resources:**

- **Estimated Duration:** ~2 hours on Kaggle GPU
- **Dataset Size:** 23,776 training examples
- **Total Tokens Processed:** ~6.1M tokens
- **Model Adapter Size:** 13.2 MB (a 99.76% reduction from the 5.5 GB base model)

**Key Optimizations:**

- Memory optimization: `torch.cuda.empty_cache()` called after each batch
- Gradient accumulation: not needed with batch_size=1 and a small max_length
- Mixed precision: float32 used (not fp16) to prevent NaN losses
- Real-time progress tracking with ETA calculation

## Evaluation

### Results

**Model Performance:**

- ✅ Successfully trained on 23,776 Q&A pairs without errors
- ✅ Loss converged during training (float32 precision prevented NaN)
- ✅ Model inference working on both GPU and CPU
- ✅ LoRA adapter reduced on-disk size to **13.2 MB** (vs the 5.5 GB base model)
- ✅ Inference latency: **~2-5 seconds on GPU**, ~30-60 seconds on CPU

**Task Completion:**

- ✅ Dataset generated: 29,720 Q&A pairs from 5 PDFs
- ✅ Fine-tuning: successfully completed on Kaggle GPU
- ✅ Model deployment: live on the HuggingFace Hub and Spaces
- ✅ Inference interface: Gradio-based web UI available

#### Summary

The model successfully learned Kali Linux documentation and can generate contextually relevant responses to penetration-testing and cybersecurity questions. The lightweight LoRA adapter (13.2 MB) makes deployment feasible on resource-constrained platforms.

## Environmental Impact

- **Hardware Type:** NVIDIA Tesla T4 GPU (×2 on Kaggle)
- **Hours Used:** ~2 hours
- **Cloud Provider:** Kaggle
- **Compute Region:** Cloud (exact region not specified)
- **Carbon Emitted:** Estimated low (~0.1-0.5 kg CO2eq for the 2-hour GPU training run)

Training focused on efficiency through reduced trainable parameters (LoRA) and a single epoch.
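As a sanity check on the adapter size reported under Training Procedure, the trainable-parameter count implied by the LoRA configuration (r=8, targeting `q_proj` and `v_proj`) can be estimated by hand. The sketch below assumes Phi-2's published architecture (hidden size 2560, 32 decoder layers); treat it as a back-of-envelope estimate rather than an exact accounting of the saved file:

```python
# Back-of-envelope count of LoRA trainable parameters for r=8,
# targeting q_proj and v_proj, assuming Phi-2's hidden size of 2560
# and 32 decoder layers.
hidden = 2560
layers = 32
r = 8
targets_per_layer = 2  # q_proj and v_proj

# Each adapted linear layer gains two low-rank matrices:
# A (r x d_in) and B (d_out x r).
params_per_module = r * (hidden + hidden)
total_params = params_per_module * targets_per_layer * layers
size_mb = total_params * 4 / 1e6  # float32 is 4 bytes per parameter

print(total_params)        # → 2621440 (~2.6M trainable parameters)
print(round(size_mb, 1))   # → 10.5 (MB, same order as the 13.2 MB adapter)
```

The estimate lands in the same ballpark as the 13.2 MB adapter; the difference is plausibly serialization metadata and bookkeeping tensors.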
## Technical Specifications

### Model Architecture and Objective

- **Architecture:** Transformer-based causal language model (Phi-2, 2.7B parameters)
- **Objective:** Next-token prediction with LoRA fine-tuning
- **Maximum Sequence Length:** 256 tokens
- **Vocabulary Size:** 50,257 tokens (tokenizer from microsoft/phi-2)

### Compute Infrastructure

#### Hardware

- **Training:** NVIDIA Tesla T4 GPU × 2 (16 GB VRAM each)
- **Inference:** CPU or GPU (tested on both)

#### Software

**Training Stack:**

- PyTorch 2.0.1
- Transformers 4.40.0
- PEFT 0.8.2
- Accelerate 0.27.0
- PyPDF2 (data extraction)

**Deployment Stack:**

- Transformers 4.41.2 (HF Space)
- PEFT 0.11.1 (HF Space)
- Gradio 4.0+
- SafeTensors

## Citation

If you use this model, please cite:

```bibtex
@misc{phi2-kali-linux,
  author       = {Kumar, Mithun},
  title        = {Phi-2 Fine-tuned on Kali Linux Documentation},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Mithun-999/phi2-kali-linux-finetuned}},
  license      = {MIT}
}
```

**APA Citation:**

Kumar, M. (2024). *Phi-2 fine-tuned on Kali Linux documentation* [Model]. Hugging Face. https://huggingface.co/Mithun-999/phi2-kali-linux-finetuned

## Glossary

- **LoRA:** Low-Rank Adaptation, a technique for fine-tuning large models by training a small number of additional parameters
- **Adapter:** The LoRA weights that modify the base model's behavior (13.2 MB in this case)
- **Tokenizer:** Converts text to numerical tokens for model input
- **Gradient Clipping:** Caps gradient norms to prevent gradient explosion during training
- **SafeTensors:** A safe serialization format for model weights

## More Information

**GitHub Repository:** [phi2-kali-linux](https://github.com/yourusername/phi2-kali-linux)

**Related Resources:**

- [Phi-2 Model](https://huggingface.co/microsoft/phi-2)
- [LoRA Paper (Hu et al., 2021)](https://arxiv.org/abs/2106.09685)
- [PEFT Library](https://github.com/huggingface/peft)
- [Kali Linux Official](https://www.kali.org/)

## Model Card Authors

- **Primary Author:** Mithun Kumar
- **Framework & Methodology:** PyTorch, HuggingFace Transformers, PEFT
- **Platform:** Kaggle (training), HuggingFace Hub (deployment)

## Ethical Considerations & Responsible Use

**This model is intended for educational and authorized security purposes only.**

### Permitted Uses

- ✅ Learning penetration testing on systems you own or have permission to test
- ✅ Cybersecurity education and training
- ✅ Defensive security research
- ✅ Documentation lookup for Kali Linux tools

### Prohibited Uses

- ❌ Unauthorized access to systems
- ❌ Malware creation or distribution
- ❌ Violating laws (CFAA, GDPR, etc.)
- ❌ Privacy violations or data theft
- ❌ Targeting systems without explicit authorization

**Users must comply with all applicable laws and ethical guidelines.**

## Model Card Contact

- **Author:** Mithun Kumar
- **GitHub Issues:** [Report issues here](https://github.com/yourusername/phi2-kali-linux/issues)
- **Discussion Space:** [HuggingFace Space](https://huggingface.co/spaces/Mithun-999/phi2-kali-linux)

---

**Last Updated:** January 2024
**License:** MIT
**Base Model:** [microsoft/phi-2](https://huggingface.co/microsoft/phi-2)