---
base_model: unsloth/llama-3-8b-Instruct
library_name: peft
pipeline_tag: text-generation
tags:
- base_model:adapter:unsloth/llama-3-8b-Instruct
- grpo
- lora
- transformers
- trl
- unsloth
---

# CLI Agent — Llama 3 8B GRPO Fine-tune (GPU 1 / lr=5e-6)

A LoRA adapter fine-tuned from Meta-Llama-3-8B-Instruct using GRPO (Group Relative Policy Optimization) to generate correct Linux shell commands from natural language task descriptions.

This is the GPU 1 run, trained at lr=5e-6. See also [jalva182/cli-agent-model](https://huggingface.co/jalva182/cli-agent-model) for the GPU 0 run at lr=3e-6.

## Model Details

### Model Description

- **Developed by:** Jose Alvarez, Carson Chiem, Prisha Bhattacharyya, Vishal Tyagi
- **Model type:** Causal Language Model (LoRA adapter)
- **Language(s) (NLP):** English
- **License:** Meta Llama 3 Community License
- **Finetuned from model:** unsloth/llama-3-8b-Instruct

### Model Sources

- **Repository:** https://github.com/Alvarez-Jose/unsloth-grpo-project

## Uses

### Direct Use

Given a natural language description of a CLI task, the model outputs the correct shell command with no explanation, no markdown, and no backticks.
Example:

- Input: "Count the number of lines in /tmp/data/log.txt"
- Output: `wc -l /tmp/data/log.txt`

### Out-of-Scope Use

- Not intended for general conversation
- Not suitable for tasks outside Linux CLI command generation
- Should not be used to generate destructive or malicious shell commands

## Bias, Risks, and Limitations

- The model may generate incorrect or harmful shell commands; always review a command before executing it
- Trained on a limited set of ~60 task types, so it may not generalize to all CLI scenarios
- Performance degrades on complex multi-step tasks

## How to Get Started with the Model

```python
from unsloth import FastLanguageModel

# Load the base model with the LoRA adapter in 4-bit.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="jalva182/cli-agent-model-gpu1",
    max_seq_length=512,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {"role": "system", "content": "You are a CLI expert. Given a task, output exactly the shell commands required. No explanation, no markdown, no backticks."},
    {"role": "user", "content": "Count the number of lines in /tmp/data/log.txt"},
]
# add_generation_prompt=True appends the assistant header so the model
# starts generating the answer instead of continuing the user turn.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Training Data

60 validated CLI tasks covering file operations, text processing (grep, awk, sed), sorting, archives, system info, permissions, and environment variables. Each task includes setup commands, an expected output, and a reward function for GRPO training.
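The tasks' reward function is not reproduced in this card; the following is a minimal sketch of one consistent with the scoring described under Evaluation (+5 exact output match, +3 success with partial match, -2 failure). The function name and structure are assumptions for illustration, not the project's actual code:

```python
# Hypothetical per-task reward in the style described under Evaluation.
# Not the project's actual implementation.
import subprocess

def score_command(command: str, expected_output: str, timeout: int = 10) -> float:
    """Run a candidate shell command and score it:
    +5 for an exact output match, +3 for success with a partial match,
    -2 for failure or wrong output (maximum combined reward: 8)."""
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return -2.0
    if result.returncode != 0:
        return -2.0
    actual = result.stdout.strip()
    expected = expected_output.strip()
    reward = 0.0
    if actual == expected:
        reward += 5.0
    if expected in actual or actual in expected:
        reward += 3.0
    return reward if reward > 0 else -2.0
```

During GRPO training, a reward in this shape is evaluated per generated completion; the trainer then normalizes rewards within each group of generations to compute relative advantages.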
### Training Hyperparameters

- **Training regime:** bf16 mixed precision
- **Method:** GRPO (Group Relative Policy Optimization)
- **Learning rate:** 5e-6 with linear scheduler
- **Warmup ratio:** 0.1
- **Batch size:** 2 (per device)
- **Gradient accumulation steps:** 2
- **Total steps:** 10000
- **LoRA rank:** 32, alpha: 64
- **KL coefficient:** 0.05
- **Number of generations:** 4
- **Max sequence length:** 512

### Speeds, Sizes, Times

- **Training time:** ~4h 7min
- **Checkpoint size:** ~524MB (LoRA adapter only)
- **Final train loss:** 0.0188
- **Final reward:** 8.0/8.0 on the final steps

## Evaluation

### Metrics

Reward function scoring, up to 8 points per task:

- +5 for an exact output match
- +3 for command success with a partial match
- -2 for command failure or wrong output

### Results

- **Best reward:** 8.0
- **Average reward (final steps):** ~6.0
- **Train loss:** 0.0188

## Comparison with GPU 0 Run

| | GPU 0 (cli-agent-model) | GPU 1 (cli-agent-model-gpu1) |
|---|---|---|
| Learning rate | 3e-6 | 5e-6 |
| Train loss | 0.0141 | 0.0188 |
| Final reward | 8.0 | 8.0 |
| Runtime | 3h 13min | 4h 7min |
| Recommendation | ✅ Primary | Secondary |

GPU 0 achieved a lower train loss and is recommended as the primary model.

## Environmental Impact

- **Hardware Type:** H100 SXM 80GB
- **Hours used:** ~4.5 hours
- **Cloud Provider:** Vast.ai

## Technical Specifications

### Model Architecture

- Base: Meta-Llama-3-8B-Instruct
- Adapter: LoRA (rank=32, alpha=64, dropout=0.05)
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

### Software

- unsloth 2026.3.3
- trl 0.24.0
- transformers 4.56.1
- torch 2.6.0+cu124
- PEFT 0.18.1

## Model Card Authors

Jose Alvarez

## Model Card Contact

https://github.com/Alvarez-Jose/unsloth-grpo-project

### Framework versions

- PEFT 0.18.1
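Since the model is prompted to emit bare commands but may occasionally wrap them in markdown anyway, a light post-processing step can help before a human reviews the command. A hypothetical helper, not part of this repository:

```python
# Hypothetical cleanup helper (assumption, not part of this repo): strip a
# surrounding markdown fence or inline backticks from a generated command.
import re

def clean_command(text: str) -> str:
    text = text.strip()
    # Remove a surrounding ```...``` fence, with or without a language tag.
    match = re.match(r"^```[a-zA-Z]*\n(.*?)\n?```$", text, re.DOTALL)
    if match:
        text = match.group(1)
    # Remove stray inline backticks.
    return text.strip().strip("`").strip()
```

Even after cleanup, generated commands should be reviewed before execution, per the limitations above.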