--- base_model: unsloth/llama-3-8b-Instruct-bnb-4bit library_name: peft pipeline_tag: text-generation tags: - base_model:adapter:unsloth/llama-3-8b-Instruct-bnb-4bit - grpo - lora - sft - transformers - trl - unsloth - cybersecurity - web-pentesting - autonomous-agent --- # SENTINEL — Llama-3-8B QLoRA SFT + GRPO Fine-tune This repository contains the **Final (QLoRA SFT + GRPO)** adapter weights for **SENTINEL**, an autonomous web-exploitation agent. ## Model Overview SENTINEL is trained to autonomously navigate, analyze, and exploit web vulnerabilities using a structured JSON-based reasoning and action schema . This repository represents the completion of **both Stage 1 and Stage 2**: * **Stage 1:** QLoRA SFT on 415 SENTINEL trajectory pairs. This stage teaches the agent the correct JSON schema, action vocabulary, and basic reasoning patterns for web exploitation. * **Stage 2:** GRPO (Generative Reward Proximal Policy Optimization). This stage reinforces successful exploitation pathways, heavily penalizing JSON formatting errors and hallucinated actions, while rewarding verified vulnerability exploitation. ## Training Details The model was fine-tuned using [Unsloth](https://github.com/unslothai/unsloth) for optimized, 2x faster training. ### Hardware & Configuration - **GPU:** 1x Tesla T4 (16GB VRAM) - **Precision:** FP16 (bf16=False) - **Base Model:** `unsloth/llama-3-8b-Instruct-bnb-4bit` - **Training Time:** ~45 minutes ### Hyperparameters - **Num Examples:** 415 - **Epochs:** 2 - **Total Steps:** 104 - **Batch Size per Device:** 2 - **Gradient Accumulation Steps:** 4 - **Total Effective Batch Size:** 8 - **Trainable Parameters:** 41,943,040 of 8,072,204,288 (0.52% trained) ### Results - **Final Training Loss:** 1.0764 - **Final Validation Loss:** 1.1465 *Loss curve during training:* | Step | Training Loss | Validation Loss | |------|---------------|-----------------| | 20 | 1.333500 | 1.492488 | | 40 | 1.127200 | 1.263251 | | 60 | 0.689200 | 1.210084 | | 80 | 0.724800 | 1.163273 | | 100 | 0.763000 | 1.146508 | ## Dataset The model was trained on a custom dataset (`train_llama3.jsonl`) consisting of **415 SENTINEL trajectory pairs**. These trajectories represent successful pentesting workflows, teaching the model how to target vulnerability sinks (form actions, hidden fields, query parameters, JSON bodies, etc.), infer backend technologies, and deliver appropriate payloads. ## Usage Because this is a PEFT (Parameter-Efficient Fine-Tuning) adapter, you must load the base model (`unsloth/llama-3-8b-Instruct-bnb-4bit` or the standard `meta-llama/Meta-Llama-3-8B-Instruct`) and apply these LoRA weights on top using the `peft` library. ### Framework Versions - PEFT 0.19.1 - Transformers - Unsloth