---
base_model: microsoft/phi-2
library_name: peft
license: mit
language:
- en
tags:
- phi-2
- lora
- kali-linux
- penetration-testing
- security
- fine-tuned
---

# Phi-2 Fine-tuned on Kali Linux Documentation

A Microsoft Phi-2 (2.7B) model fine-tuned with LoRA adapters for Kali Linux and penetration-testing Q&A.

## Model Details

### Model Description

This is a LoRA-adapted Phi-2 model fine-tuned on Kali Linux documentation for answering cybersecurity and penetration-testing questions.

- **Developed by:** Mithun Kumar
- **Model type:** Causal Language Model
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** [microsoft/phi-2](https://huggingface.co/microsoft/phi-2)
- **Fine-tuning method:** LoRA (Low-Rank Adaptation)

### Model Sources

- **Repository:** [GitHub](https://github.com/yourusername/phi2-kali-linux)
- **Demo:** [Hugging Face Space](https://huggingface.co/spaces/Mithun-999/phi2-kali-linux)

## Uses

### Direct Use

This model is designed to answer questions related to:

- **Kali Linux tools and commands**
- **Penetration testing methodologies**
- **Cybersecurity concepts**
- **Linux administration and troubleshooting**

The model can be used for:

- Chatbots and Q&A systems
- Educational tools for cybersecurity training
- Documentation lookup and explanation
- Penetration-testing knowledge bases

### Intended Use

**Best practices for using this model:**

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    device_map="cpu",
    torch_dtype=torch.float32,
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "Mithun-999/phi2-kali-linux-finetuned")

# Generate a response
prompt = "What is the purpose of nmap in Kali Linux?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Out-of-Scope Use

This model is **NOT** intended for:

- Illegal hacking or unauthorized system access
- Bypassing security measures on systems you don't own
- Creating malware or exploits for malicious purposes
- Any activity that violates laws or ethical guidelines

## Bias, Risks, and Limitations

**Potential Biases:**

- The training data reflects tool documentation and may carry any biases present in the original Kali Linux materials
- The heuristic Q&A generation may favor common scenarios over edge cases

**Known Risks:**

- Responses may contain outdated information (depending on the dates of the source PDFs)
- Generated answers may be incomplete and require manual verification
- The model can generate misleading information when prompted outside its training domain

**Limitations:**

- Performance degrades on topics outside the Kali Linux documentation
- Single-epoch training may limit depth of learning
- CPU inference is significantly slower than GPU inference

### Recommendations

Users should:

- Verify all security-related advice against official documentation
- Only test systems they own or have explicit authorization to test
- Treat output as supplementary information, not absolute truth
- Follow responsible-disclosure practices when discovering vulnerabilities

## How to Get Started with the Model

Try the interactive demo: [Kali Linux Q&A Space](https://huggingface.co/spaces/Mithun-999/phi2-kali-linux)

Or run locally with the code example provided in the "Intended Use" section above.
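When running locally, note that decoding the full output sequence of a causal LM like Phi-2 returns the prompt followed by the completion. A small helper (hypothetical, not part of this repository) can strip the echoed prompt before display:

```python
def strip_prompt(prompt: str, response: str) -> str:
    """Remove the echoed prompt from a decoded causal-LM generation, if present."""
    if response.startswith(prompt):
        return response[len(prompt):].lstrip()
    return response

# Example with a hypothetical decoded output
decoded = "What is the purpose of nmap in Kali Linux? Nmap is a network scanner used for host discovery."
print(strip_prompt("What is the purpose of nmap in Kali Linux?", decoded))
# → Nmap is a network scanner used for host discovery.
```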
## Training Details

### Training Data

The model was fine-tuned on text extracted from **5 Kali Linux PDF documents**:

| Document | Size | Content Focus |
|----------|------|---------------|
| PDF 1 | Large | Kali Linux fundamentals, tools overview |
| PDF 2 | Large | Network penetration testing techniques |
| PDF 3 | Medium | Web application penetration testing |
| PDF 4 | Medium | Post-exploitation and privilege escalation |
| PDF 5 | Medium | Linux system hardening and defense |

**Data Extraction Summary:**

- **Total Characters Extracted:** 2,300,000+
- **Total Words:** 336,000+
- **Extraction Method:** PyPDF2 text extraction from PDF documents

### Training Data Processing

The training dataset was generated using a **heuristic question-answer generation approach**:

1. **Text Chunking:** PDF text split into 512-character chunks with 128-character overlap (sliding window)
2. **Sentence Extraction:** Chunks processed with the NLTK sentence tokenizer to extract meaningful sentences
3. **Q&A Pairing:** Question-answer pairs generated by:
   - Extracting sentences as answers
   - Creating relevant questions from answer content
   - Using keyword extraction and pattern matching
   - Filtering for quality and relevance

**Dataset Statistics:**

| Split | Count | Percentage |
|-------|-------|------------|
| Training | 23,776 | 80% |
| Validation | 2,972 | 10% |
| Testing | 2,972 | 10% |
| **Total** | **29,720** | **100%** |

**Dataset Format:** JSONL and CSV

- **Average Question Length:** 15-25 tokens
- **Average Answer Length:** 40-100 tokens
- **Total Training Examples:** 23,776 Q&A pairs

### Training Procedure

**Training Environment:**

- **Platform:** Kaggle GPU (NVIDIA Tesla T4 × 2)
- **Framework:** PyTorch 2.0.1 + Transformers 4.40.0 + PEFT 0.8.2
- **Precision:** float32 (full precision for stability)
- **Device:** cuda

**Hyperparameters:**

| Parameter | Value |
|-----------|-------|
| Learning Rate | 0.00005 |
| Batch Size | 1 |
| Epochs | 1 |
| Max Sequence Length | 256 tokens |
| Gradient Clipping Norm | 1.0 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |

**LoRA Configuration:**

| Parameter | Value |
|-----------|-------|
| LoRA Rank (r) | 8 |
| LoRA Alpha | 16 |
| LoRA Dropout | 0.05 |
| Target Modules | ["q_proj", "v_proj"] |

**Training Time & Resources:**

- **Estimated Duration:** ~2 hours on Kaggle GPU
- **Dataset Size:** 23,776 training examples
- **Total Tokens Processed:** ~6.1M tokens
- **Model Adapter Size:** 13.2 MB (a 99.76% reduction from the 5.5 GB base model)

**Key Optimizations:**

- Memory optimization: `torch.cuda.empty_cache()` called after each batch
- Gradient accumulation: not needed with batch_size=1 and a small max_length
- Mixed precision: float32 used (not fp16) to prevent NaN losses
- Real-time progress tracking with ETA calculation

## Evaluation

### Results

**Model Performance:**

- ✅ Successfully trained on 23,776 Q&A pairs without errors
- ✅ Loss converged during training (float32 precision prevented NaN)
- ✅ Model inference working on both GPU and CPU
- ✅ LoRA adapter reduced on-disk size to **13.2 MB** (vs the 5.5 GB base model)
- ✅ Inference latency: **~2-5 seconds on GPU**, ~30-60 seconds on CPU

**Task Completion:**

- ✅ Dataset generated: 29,720 Q&A pairs from 5 PDFs
- ✅ Fine-tuning: successfully completed on Kaggle GPU
- ✅ Model deployment: live on the HuggingFace Hub and Spaces
- ✅ Inference interface: Gradio-based web UI available

#### Summary

The model successfully learned Kali Linux documentation and can generate contextually relevant responses to penetration-testing and cybersecurity questions. The lightweight LoRA adapter (13.2 MB) makes deployment feasible on resource-constrained platforms.

## Environmental Impact

- **Hardware Type:** NVIDIA Tesla T4 GPU (×2 on Kaggle)
- **Hours Used:** ~2 hours
- **Cloud Provider:** Kaggle
- **Compute Region:** Cloud (exact region not specified)
- **Carbon Emitted:** Estimated low (~0.1-0.5 kg CO2eq for the 2-hour GPU training run)

Training focused on efficiency through reduced trainable parameters (LoRA) and a single epoch.
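As a sanity check on the adapter size reported under Training Procedure, the trainable-parameter count implied by the LoRA configuration (r=8, targeting `q_proj` and `v_proj`) can be estimated by hand. The sketch below assumes Phi-2's published architecture (hidden size 2560, 32 decoder layers); treat it as a back-of-envelope estimate rather than an exact accounting of the saved file:

```python
# Back-of-envelope count of LoRA trainable parameters for r=8,
# targeting q_proj and v_proj, assuming Phi-2's hidden size of 2560
# and 32 decoder layers.
hidden = 2560
layers = 32
r = 8
targets_per_layer = 2  # q_proj and v_proj

# Each adapted linear layer gains two low-rank matrices:
# A (r x d_in) and B (d_out x r).
params_per_module = r * (hidden + hidden)
total_params = params_per_module * targets_per_layer * layers
size_mb = total_params * 4 / 1e6  # float32 is 4 bytes per parameter

print(total_params)        # → 2621440 (~2.6M trainable parameters)
print(round(size_mb, 1))   # → 10.5 (MB, same order as the 13.2 MB adapter)
```

The estimate lands in the same ballpark as the 13.2 MB adapter; the difference is plausibly serialization metadata and bookkeeping tensors.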
## Technical Specifications

### Model Architecture and Objective

- **Architecture:** Transformer-based causal language model (Phi-2, 2.7B parameters)
- **Objective:** Next-token prediction with LoRA fine-tuning
- **Maximum Sequence Length:** 256 tokens
- **Vocabulary Size:** 50,257 tokens (tokenizer from microsoft/phi-2)

### Compute Infrastructure

#### Hardware

- **Training:** NVIDIA Tesla T4 GPU × 2 (16 GB VRAM each)
- **Inference:** CPU or GPU (tested on both)

#### Software

**Training Stack:**

- PyTorch 2.0.1
- Transformers 4.40.0
- PEFT 0.8.2
- Accelerate 0.27.0
- PyPDF2 (data extraction)

**Deployment Stack:**

- Transformers 4.41.2 (HF Space)
- PEFT 0.11.1 (HF Space)
- Gradio 4.0+
- SafeTensors

## Citation

If you use this model, please cite:

```bibtex
@misc{phi2-kali-linux,
  author       = {Kumar, Mithun},
  title        = {Phi-2 Fine-tuned on Kali Linux Documentation},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Mithun-999/phi2-kali-linux-finetuned}},
  license      = {MIT}
}
```

**APA Citation:**

Kumar, M. (2024). *Phi-2 fine-tuned on Kali Linux documentation* [Model]. Hugging Face. https://huggingface.co/Mithun-999/phi2-kali-linux-finetuned

## Glossary

- **LoRA:** Low-Rank Adaptation, a technique for fine-tuning large models by training a small number of additional parameters
- **Adapter:** The LoRA weights that modify the base model's behavior (13.2 MB in this case)
- **Tokenizer:** Converts text to numerical tokens for model input
- **Gradient Clipping:** Caps gradient norms to prevent gradient explosion during training
- **SafeTensors:** A safe serialization format for model weights

## More Information

**GitHub Repository:** [phi2-kali-linux](https://github.com/yourusername/phi2-kali-linux)

**Related Resources:**

- [Phi-2 Model](https://huggingface.co/microsoft/phi-2)
- [LoRA Paper (Hu et al., 2021)](https://arxiv.org/abs/2106.09685)
- [PEFT Library](https://github.com/huggingface/peft)
- [Kali Linux Official](https://www.kali.org/)

## Model Card Authors

- **Primary Author:** Mithun Kumar
- **Framework & Methodology:** PyTorch, HuggingFace Transformers, PEFT
- **Platform:** Kaggle (training), HuggingFace Hub (deployment)

## Ethical Considerations & Responsible Use

**This model is intended for educational and authorized security purposes only.**

### Permitted Uses

- ✅ Learning penetration testing on systems you own or have permission to test
- ✅ Cybersecurity education and training
- ✅ Defensive security research
- ✅ Documentation lookup for Kali Linux tools

### Prohibited Uses

- ❌ Unauthorized access to systems
- ❌ Malware creation or distribution
- ❌ Violating laws (CFAA, GDPR, etc.)
- ❌ Privacy violations or data theft
- ❌ Targeting systems without explicit authorization

**Users must comply with all applicable laws and ethical guidelines.**

## Model Card Contact

- **Author:** Mithun Kumar
- **GitHub Issues:** [Report issues here](https://github.com/yourusername/phi2-kali-linux/issues)
- **Discussion Space:** [HuggingFace Space](https://huggingface.co/spaces/Mithun-999/phi2-kali-linux)

---

**Last Updated:** January 2024
**License:** MIT
**Base Model:** [microsoft/phi-2](https://huggingface.co/microsoft/phi-2)