---
base_model: unsloth/llama-3-8b-Instruct
library_name: peft
pipeline_tag: text-generation
tags:
- base_model:adapter:unsloth/llama-3-8b-Instruct
- grpo
- lora
- transformers
- trl
- unsloth
license: llama3
language:
- en
---
# CLI Agent — Llama 3 8B GRPO Fine-tune
A LoRA adapter fine-tuned on Meta-Llama-3-8B-Instruct using GRPO (Group Relative Policy Optimization) to generate correct Linux shell commands from natural language task descriptions.
## Model Details
### Model Description
- **Developed by:** Jose Alvarez, Carson Chiem, Prisha Bhattacharyya, Vishal Tyagi
- **Model type:** Causal Language Model (LoRA adapter)
- **Language(s) (NLP):** English
- **License:** Meta Llama 3 Community License
- **Finetuned from model:** unsloth/llama-3-8b-Instruct
### Model Sources
- **Repository:** https://github.com/Alvarez-Jose/unsloth-grpo-project
## Uses
### Direct Use
Given a natural language description of a CLI task, the model outputs the correct shell command with no explanation, no markdown, and no backticks.
Example:
- Input: "Count the number of lines in /tmp/data/log.txt"
- Output: `wc -l /tmp/data/log.txt`
### Out-of-Scope Use
- Not intended for general conversation
- Not suitable for tasks outside Linux CLI command generation
- Should not be used for destructive or malicious shell commands
## Bias, Risks, and Limitations
- Model may generate incorrect or harmful shell commands — always review before executing
- Trained on a limited set of ~60 task types, so it may not generalize to all CLI scenarios
- Performance degrades on complex multi-step tasks
## How to Get Started with the Model
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="jalva182/cli-agent-model",
    max_seq_length=512,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable inference mode

messages = [
    {"role": "system", "content": "You are a CLI expert. Given a task, output exactly the shell commands required. No explanation, no markdown, no backticks."},
    {"role": "user", "content": "Count the number of lines in /tmp/data/log.txt"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Details
### Training Data
60 validated CLI tasks covering file operations, text processing (grep, awk, sed), sorting, archives, system info, permissions, and environment variables. Each task includes setup commands, expected output, and a reward function for GRPO training.
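To make the task format concrete, here is a minimal sketch of what one such entry could look like (field names are illustrative, not the repository's actual schema):

```python
# Hypothetical shape of one training task; field names are illustrative.
task = {
    "prompt": "Count the number of lines in /tmp/data/log.txt",
    # Setup commands run before the generated command is executed.
    "setup": ["mkdir -p /tmp/data", "printf 'a\\nb\\nc\\n' > /tmp/data/log.txt"],
    # Expected output the reward function checks the command's result against.
    "expected_output": "3",
}
```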
### Training Hyperparameters
- **Training regime:** bf16 mixed precision
- **Method:** GRPO (Group Relative Policy Optimization)
- **Learning rate:** 3e-6 with linear scheduler
- **Warmup ratio:** 0.1
- **Batch size:** 2 (per device)
- **Gradient accumulation steps:** 2
- **Total steps:** 10000
- **LoRA rank:** 32, alpha: 64
- **KL coefficient:** 0.05
- **Number of generations:** 4
- **Max sequence length:** 512
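The hyperparameters above can be sketched as a TRL `GRPOConfig` (this assumes TRL's GRPO API; it is not the repository's verbatim training script):

```python
from trl import GRPOConfig

# Sketch of a GRPOConfig matching the hyperparameters listed above.
config = GRPOConfig(
    learning_rate=3e-6,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    max_steps=10_000,
    num_generations=4,   # completions sampled per prompt
    beta=0.05,           # KL coefficient
    bf16=True,
)
```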
### Speeds, Sizes, Times
- **Training time:** ~3h 13min
- **Checkpoint size:** ~524MB (LoRA adapter only)
- **Final train loss:** 0.0141
- **Final reward:** 8.0/8.0 on easy tasks, ~6.0 average
## Evaluation
### Metrics
Reward function scoring 0-8 per task:
- +5 for correct output match
- +3 for command success with partial match
- -2 for command failure or wrong output
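A hypothetical implementation of this rubric (the function and argument names are illustrative; the actual reward code lives in the repository):

```python
def score_command(exit_code: int, stdout: str, expected: str) -> float:
    """Illustrative reward mirroring the 0-8 rubric above."""
    reward = 0.0
    matched = expected.strip() in stdout
    if exit_code == 0 and stdout.strip() == expected.strip():
        reward += 5.0  # correct output match
    if exit_code == 0 and matched:
        reward += 3.0  # command succeeded with at least a partial match
    if exit_code != 0 or not matched:
        reward -= 2.0  # command failure or wrong output
    return reward
```

Under this sketch a perfect match scores 5 + 3 = 8, a partial match scores 3, and a failed command scores -2, consistent with the best and average rewards reported below.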
### Results
- **Best reward:** 8.0
- **Average reward (final steps):** ~6.0
- **Train loss:** 0.0141
## Environmental Impact
- **Hardware Type:** H100 SXM 80GB
- **Hours used:** ~3.5 hours
- **Cloud Provider:** Vast.ai
## Technical Specifications
### Model Architecture
- Base: Meta-Llama-3-8B-Instruct
- Adapter: LoRA (rank=32, alpha=64, dropout=0.05)
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
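The adapter configuration above can be expressed with PEFT's `LoraConfig` (a sketch assuming the PEFT API, not the repository's exact code):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```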
### Software
- unsloth 2026.3.3
- trl 0.24.0
- transformers 4.56.1
- torch 2.6.0+cu124
- PEFT 0.18.1
## Model Card Authors
Jose Alvarez
## Model Card Contact
https://github.com/Alvarez-Jose/unsloth-grpo-project
### Framework versions
- PEFT 0.18.1