|
|
--- |
|
|
base_model: microsoft/phi-2 |
|
|
library_name: peft |
|
|
license: mit |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- phi-2 |
|
|
- lora |
|
|
- kali-linux |
|
|
- penetration-testing |
|
|
- security |
|
|
- fine-tuned |
|
|
--- |
|
|
|
|
|
# Phi-2 Fine-tuned on Kali Linux Documentation |
|
|
|
|
|
A Microsoft Phi-2 (2.7B) model fine-tuned with LoRA adapters for Kali Linux and penetration-testing Q&A.
|
|
|
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
This is a LoRA-adapted Phi-2 model fine-tuned on Kali Linux documentation for answering cybersecurity and penetration testing questions. |
|
|
|
|
|
- **Developed by:** Mithun Kumar |
|
|
- **Model type:** Causal Language Model |
|
|
- **Language(s) (NLP):** English |
|
|
- **License:** MIT |
|
|
- **Finetuned from model:** [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) |
|
|
- **Fine-tuning method:** LoRA (Low-Rank Adaptation) |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** [GitHub](https://github.com/yourusername/phi2-kali-linux) |
|
|
- **Demo:** [Hugging Face Space](https://huggingface.co/spaces/Mithun-999/phi2-kali-linux) |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
This model is designed to answer questions related to: |
|
|
- **Kali Linux tools and commands** |
|
|
- **Penetration testing methodologies** |
|
|
- **Cybersecurity concepts** |
|
|
- **Linux administration and troubleshooting** |
|
|
|
|
|
The model can be used for: |
|
|
- Chatbots and Q&A systems |
|
|
- Educational tools for cybersecurity training |
|
|
- Documentation lookup and explanation |
|
|
- Penetration testing knowledge base |
|
|
|
|
|
### Intended Use |
|
|
|
|
|
**Best practices for using this model:** |
|
|
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    device_map="cpu",
    torch_dtype=torch.float32,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "Mithun-999/phi2-kali-linux-finetuned")

# Generate a response
prompt = "What is the purpose of nmap in Kali Linux?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,                   # cap on newly generated tokens
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,  # Phi-2 has no pad token; reuse EOS
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
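
For extended use, the LoRA weights can optionally be merged into the base model so inference runs without the PEFT wrapper. A minimal sketch using PEFT's `merge_and_unload()` (the save path is illustrative):

```python
# Optional: merge the LoRA weights into the base model for faster inference.
# After merging, the model behaves like a plain Transformers model and no
# longer requires the peft package at inference time.
merged_model = model.merge_and_unload()

# Save for adapter-free reloading (path is illustrative)
merged_model.save_pretrained("phi2-kali-merged")
tokenizer.save_pretrained("phi2-kali-merged")
```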
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
This model is **NOT** intended for: |
|
|
- Illegal hacking or unauthorized system access |
|
|
- Bypassing security measures on systems you don't own |
|
|
- Creating malware or exploits for malicious purposes |
|
|
- Any activity that violates laws or ethical guidelines |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
**Potential Biases:** |
|
|
- Training data reflects tool documentation and may inherit biases present in the original Kali Linux materials


- The heuristic Q&A generation may favor common scenarios over edge cases
|
|
|
|
|
**Known Risks:** |
|
|
- Responses may contain outdated information, depending on the publication dates of the source PDFs


- Generated answers may be incomplete and can require manual verification


- The model can generate misleading information when prompted outside its training domain
|
|
|
|
|
**Limitations:** |
|
|
- Performance degrades on topics outside Kali Linux documentation |
|
|
- Single-epoch training may limit depth of learning |
|
|
- CPU inference is significantly slower than GPU |
|
|
|
|
|
### Recommendations |
|
|
|
|
|
Users should: |
|
|
- Verify all security-related advice with official documentation |
|
|
- Only use on systems you own or have explicit authorization to test |
|
|
- Treat output as supplementary information, not absolute truth |
|
|
- Follow responsible disclosure practices if discovering vulnerabilities |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
Try the interactive demo: [Kali Linux Q&A Space](https://huggingface.co/spaces/Mithun-999/phi2-kali-linux) |
|
|
|
|
|
Or run locally with the code example provided in the "Intended Use" section above. |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was fine-tuned on extracted text from **5 Kali Linux PDF documents**: |
|
|
|
|
|
| Document | Size | Content Focus | |
|
|
|----------|------|---| |
|
|
| PDF 1 | Large | Kali Linux fundamentals, tools overview | |
|
|
| PDF 2 | Large | Network penetration testing techniques | |
|
|
| PDF 3 | Medium | Web application penetration testing | |
|
|
| PDF 4 | Medium | Post-exploitation and privilege escalation | |
|
|
| PDF 5 | Medium | Linux system hardening and defense | |
|
|
|
|
|
**Data Extraction Summary:** |
|
|
- **Total Characters Extracted:** 2,300,000+ characters |
|
|
- **Total Words:** 336,000+ words |
|
|
- **Extraction Method:** PyPDF2 text extraction from PDF documents |
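
As an illustration of the extraction step, here is a minimal PyPDF2 sketch; the file names are hypothetical stand-ins for the five source documents, and the actual extraction script is not part of this card:

```python
from PyPDF2 import PdfReader

def extract_text(pdf_path: str) -> str:
    """Concatenate the text of every page in a PDF."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Hypothetical file names standing in for the 5 source PDFs
corpus = "\n".join(extract_text(path) for path in [
    "kali_fundamentals.pdf",
    "network_pentest.pdf",
    "web_app_pentest.pdf",
    "post_exploitation.pdf",
    "system_hardening.pdf",
])
print(f"{len(corpus):,} characters extracted")
```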
|
|
|
|
|
### Training Data Processing |
|
|
|
|
|
The training dataset was generated using a **heuristic question-answer generation approach** (a condensed sketch follows the steps below):
|
|
|
|
|
1. **Text Chunking:** PDF text split into 512-character chunks with 128-character overlap (sliding window) |
|
|
2. **Sentence Extraction:** Chunks processed to extract meaningful sentences using the NLTK sentence tokenizer
|
|
3. **Q&A Pairing:** Question-answer pairs generated by: |
|
|
- Extracting sentences as answers |
|
|
- Creating relevant questions from answer content |
|
|
- Using keyword extraction and pattern matching |
|
|
- Filtering for quality and relevance |
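
The following condensed sketch illustrates the pipeline; the real heuristics are more involved, and the question template here is a placeholder assumption:

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)

def chunk_text(text: str, size: int = 512, overlap: int = 128) -> list[str]:
    """Sliding window: 512-character chunks with 128-character overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def make_qa_pairs(text: str) -> list[dict]:
    """Simplified heuristic: each substantial sentence becomes an answer,
    paired with a templated question built from its leading words."""
    pairs = []
    for chunk in chunk_text(text):
        for sentence in sent_tokenize(chunk):
            if len(sentence.split()) < 8:  # crude quality filter
                continue
            topic = " ".join(sentence.split()[:4])
            pairs.append({
                "question": f"What does the documentation say about {topic}?",
                "answer": sentence,
            })
    return pairs
```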
|
|
|
|
|
**Dataset Statistics:** |
|
|
|
|
|
| Split | Count | Percentage | |
|
|
|-------|-------|-----------| |
|
|
| Training | 23,776 | 80% | |
|
|
| Validation | 2,972 | 10% | |
|
|
| Testing | 2,972 | 10% | |
|
|
| **Total** | **29,720** | **100%** | |
|
|
|
|
|
**Dataset Format:** JSONL and CSV |
|
|
- **Average Question Length:** 15-25 tokens |
|
|
- **Average Answer Length:** 40-100 tokens |
|
|
- **Total Training Examples:** 23,776 Q&A pairs |
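
An illustrative JSONL record (the field names are assumptions based on the format described above):

```json
{"question": "What is the purpose of nmap in Kali Linux?", "answer": "Nmap is a network scanner used to discover hosts, open ports, and running services during reconnaissance."}
```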
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
**Training Environment:** |
|
|
- **Platform:** Kaggle GPU (NVIDIA Tesla T4 × 2)
|
|
- **Framework:** PyTorch 2.0.1 + Transformers 4.40.0 + PEFT 0.8.2 |
|
|
- **Precision:** float32 (full precision for stability) |
|
|
- **Device:** cuda |
|
|
|
|
|
**Hyperparameters:** |
|
|
|
|
|
| Parameter | Value | |
|
|
|-----------|-------| |
|
|
| Learning Rate | 0.00005 | |
|
|
| Batch Size | 1 | |
|
|
| Epochs | 1 | |
|
|
| Max Sequence Length | 256 tokens | |
|
|
| Gradient Clipping Norm | 1.0 | |
|
|
| Optimizer | AdamW | |
|
|
| Weight Decay | 0.01 | |
|
|
|
|
|
**LoRA Configuration:** |
|
|
|
|
|
| Parameter | Value | |
|
|
|-----------|-------| |
|
|
| LoRA Rank (r) | 8 | |
|
|
| LoRA Alpha | 16 | |
|
|
| LoRA Dropout | 0.05 | |
|
|
| Target Modules | ["q_proj", "v_proj"] | |
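
The table above corresponds roughly to the following PEFT setup; the `bias` and `task_type` values are assumptions not stated in this card:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # LoRA rank
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # Phi-2 attention projections
    bias="none",                          # assumption
    task_type="CAUSAL_LM",                # assumption
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports a small fraction of trainable weights
```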
|
|
|
|
|
**Training Time & Resources:** |
|
|
- **Estimated Duration:** ~2 hours on Kaggle GPU |
|
|
- **Dataset Size:** 23,776 training examples |
|
|
- **Total Tokens Processed:** ~6.1M tokens |
|
|
- **Model Adapter Size:** 13.2 MB (99.76% reduction from the 5.5 GB base model)
|
|
|
|
|
**Key Optimizations:** |
|
|
- Memory optimization: `torch.cuda.empty_cache()` after each batch |
|
|
- Gradient accumulation: not needed; batch_size=1 with the reduced max_length kept memory usage low


- Precision: full float32 rather than fp16, to prevent NaN losses
|
|
- Real-time progress tracking with ETA calculation |
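
A minimal training-loop sketch combining the hyperparameters and optimizations listed above; the DataLoader and `train_dataset` are illustrative, not the actual training script:

```python
import torch
from torch.utils.data import DataLoader

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
loader = DataLoader(train_dataset, batch_size=1, shuffle=True)
device = next(model.parameters()).device

model.train()
for batch in loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss                               # causal-LM loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.empty_cache()                                 # per-batch memory cleanup
```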
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Results |
|
|
|
|
|
**Model Performance:** |
|
|
|
|
|
- ✅ Successfully trained on 23,776 Q&A pairs without errors


- ✅ Loss converged during training (float32 precision prevented NaN)


- ✅ Model inference works on both GPU and CPU


- ✅ LoRA adapter reduced the weights to **13.2 MB** (vs the 5.5 GB base model)


- ✅ Inference latency: **~2-5 seconds on GPU**, ~30-60 seconds on CPU
|
|
|
|
|
**Task Completion:** |
|
|
- ✅ Dataset generated: 29,720 Q&A pairs from 5 PDFs


- ✅ Fine-tuning: successfully completed on Kaggle GPU


- ✅ Model deployment: live on Hugging Face Hub and Spaces


- ✅ Inference interface: Gradio-based web UI available (minimal sketch below)
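
A minimal Gradio interface along these lines might look as follows; this is a sketch reusing the `model` and `tokenizer` from the earlier example, not the Space's actual code:

```python
import gradio as gr

def answer(question: str) -> str:
    inputs = tokenizer(question, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

gr.Interface(fn=answer, inputs="text", outputs="text",
             title="Phi-2 Kali Linux Q&A").launch()
```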
|
|
|
|
|
#### Summary |
|
|
|
|
|
The model successfully learned Kali Linux documentation and can generate contextually relevant responses to penetration testing and cybersecurity questions. The lightweight LoRA adapter (13.2MB) makes deployment feasible on resource-constrained platforms. |
|
|
|
|
|
|
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
**Hardware Type:** NVIDIA Tesla T4 GPU (2x on Kaggle) |
|
|
**Hours used:** ~2 hours |
|
|
**Cloud Provider:** Kaggle |
|
|
**Compute Region:** Cloud (exact region not specified) |
|
|
**Carbon Emitted:** Estimated low (~0.1-0.5 kg CO2eq for 2-hour GPU training) |
|
|
|
|
|
Training focused on efficiency with reduced parameters (LoRA) and single epoch. |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Model Architecture and Objective |
|
|
|
|
|
- **Architecture:** Transformer-based causal language model (Phi-2, 2.7B parameters) |
|
|
- **Objective:** Next-token prediction with LoRA fine-tuning |
|
|
- **Maximum Sequence Length:** 256 tokens |
|
|
- **Vocabulary Size:** 50,257 tokens (tokenizer from microsoft/phi-2) |
|
|
|
|
|
### Compute Infrastructure |
|
|
|
|
|
#### Hardware |
|
|
|
|
|
- **Training:** NVIDIA Tesla T4 GPU × 2 (16 GB VRAM each)
|
|
- **Inference:** CPU or GPU capable (tested on both) |
|
|
|
|
|
#### Software |
|
|
|
|
|
**Training Stack:** |
|
|
- PyTorch 2.0.1 |
|
|
- Transformers 4.40.0 |
|
|
- PEFT 0.8.2 |
|
|
- Accelerate 0.27.0 |
|
|
- PyPDF2 (data extraction) |
|
|
|
|
|
**Deployment Stack:** |
|
|
- Transformers 4.41.2 (HF Space) |
|
|
- PEFT 0.11.1 (HF Space) |
|
|
- Gradio 4.0+ |
|
|
- SafeTensors |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex
@misc{phi2-kali-linux,
  author       = {Kumar, Mithun},
  title        = {Phi-2 Fine-tuned on Kali Linux Documentation},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Mithun-999/phi2-kali-linux-finetuned}},
  license      = {MIT}
}
```
|
|
|
|
|
**APA Citation:** |
|
|
|
|
|
Kumar, M. (2024). Phi-2 fine-tuned on Kali Linux documentation [Model]. Hugging Face. https://huggingface.co/Mithun-999/phi2-kali-linux-finetuned
|
|
|
|
|
## Glossary |
|
|
|
|
|
- **LoRA:** Low-Rank Adaptation - a technique to fine-tune large models with minimal parameters |
|
|
- **Adapter:** LoRA weights that modify base model behavior (13.2 MB in this case) |
|
|
- **Tokenizer:** Converts text to numerical tokens for model input |
|
|
- **Gradient Clipping:** Prevents gradient explosion during training |
|
|
- **SafeTensors:** Safe serialization format for model weights |
|
|
|
|
|
## More Information |
|
|
|
|
|
**GitHub Repository:** [phi2-kali-linux](https://github.com/yourusername/phi2-kali-linux) |
|
|
|
|
|
**Related Resources:** |
|
|
- [Phi-2 Model](https://huggingface.co/microsoft/phi-2) |
|
|
- [LoRA Paper (Hu et al., 2021)](https://arxiv.org/abs/2106.09685)
|
|
- [PEFT Library](https://github.com/huggingface/peft) |
|
|
- [Kali Linux Official](https://www.kali.org/) |
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
- **Primary Author:** Mithun Kumar |
|
|
- **Framework & Methodology:** PyTorch, HuggingFace Transformers, PEFT |
|
|
- **Platform:** Kaggle (training), HuggingFace Hub (deployment) |
|
|
|
|
|
## Ethical Considerations & Responsible Use |
|
|
|
|
|
**This model is intended for educational and authorized security purposes only.** |
|
|
|
|
|
### Permitted Uses: |
|
|
- ✅ Learning penetration testing on systems you own or have permission to test


- ✅ Cybersecurity education and training


- ✅ Defensive security research


- ✅ Documentation lookup for Kali Linux tools
|
|
|
|
|
### Prohibited Uses: |
|
|
- ❌ Unauthorized access to systems


- ❌ Malware creation or distribution


- ❌ Violating laws (CFAA, GDPR, etc.)


- ❌ Privacy violations or data theft


- ❌ Targeting systems without explicit authorization
|
|
|
|
|
**Users must comply with all applicable laws and ethical guidelines.** |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
- **Author:** Mithun Kumar |
|
|
- **GitHub Issues:** [Report issues here](https://github.com/yourusername/phi2-kali-linux/issues) |
|
|
- **Discussion Space:** [HuggingFace Space](https://huggingface.co/spaces/Mithun-999/phi2-kali-linux) |
|
|
|
|
|
--- |
|
|
|
|
|
**Last Updated:** January 2024 |
|
|
**License:** MIT |
|
|
**Base Model:** [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) |
|
|
|