---

base_model: unsloth/llama-3-8b-Instruct
library_name: peft
pipeline_tag: text-generation
tags:
- base_model:adapter:unsloth/llama-3-8b-Instruct
- grpo
- lora
- transformers
- trl
- unsloth
license: apache-2.0
language:
- en
---


# CLI Agent — Llama 3 8B GRPO Fine-tune

A LoRA adapter fine-tuned on Meta-Llama-3-8B-Instruct using GRPO (Group Relative Policy Optimization) to generate correct Linux shell commands from natural language task descriptions.

## Model Details

### Model Description

- **Developed by:** Jose Alvarez, Carson Chiem, Prisha Bhattacharyya, Vishal Tyagi
- **Model type:** Causal Language Model (LoRA adapter)
- **Language(s) (NLP):** English
- **License:** Meta Llama 3 Community License
- **Finetuned from model:** unsloth/llama-3-8b-Instruct

### Model Sources

- **Repository:** https://github.com/Alvarez-Jose/unsloth-grpo-project

## Uses

### Direct Use

Given a natural language description of a CLI task, the model outputs the correct shell command with no explanation, no markdown, and no backticks.

Example:
- Input: "Count the number of lines in /tmp/data/log.txt"
- Output: `wc -l /tmp/data/log.txt`

### Out-of-Scope Use

- Not intended for general conversation
- Not suitable for tasks outside Linux CLI command generation
- Should not be used for destructive or malicious shell commands

## Bias, Risks, and Limitations

- The model may generate incorrect or harmful shell commands; always review a command before executing it
- Trained on a limited set of ~60 task types; it may not generalize to all CLI scenarios
- Performance degrades on complex multi-step tasks

## How to Get Started with the Model

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="jalva182/cli-agent-model",
    max_seq_length=512,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch to inference mode

messages = [
    {"role": "system", "content": "You are a CLI expert. Given a task, output exactly the shell commands required. No explanation, no markdown, no backticks."},
    {"role": "user", "content": "Count the number of lines in /tmp/data/log.txt"},
]

# add_generation_prompt=True appends the assistant header so the model
# generates a reply rather than continuing the user turn
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Training Data

60 validated CLI tasks covering file operations, text processing (grep, awk, sed), sorting, archives, system info, permissions, and environment variables. Each task includes setup commands, expected output, and a reward function for GRPO training.
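
A task definition along these lines might look like the following sketch. The field names and exact values are illustrative assumptions, not the repository's actual format:

```python
# Hypothetical task-definition schema: each task bundles the shell setup
# commands, the natural-language prompt, and the expected output that the
# reward function compares against.
task = {
    "setup": "mkdir -p /tmp/data && printf 'a\\nb\\nc\\n' > /tmp/data/log.txt",
    "prompt": "Count the number of lines in /tmp/data/log.txt",
    "expected_output": "3 /tmp/data/log.txt",
}
```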

### Training Hyperparameters

- **Training regime:** bf16 mixed precision
- **Method:** GRPO (Group Relative Policy Optimization)
- **Learning rate:** 3e-6 with linear scheduler
- **Warmup ratio:** 0.1
- **Batch size:** 2 (per device)
- **Gradient accumulation steps:** 2
- **Total steps:** 10000
- **LoRA rank:** 32, alpha: 64
- **KL coefficient:** 0.05
- **Number of generations:** 4
- **Max sequence length:** 512
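
For reference, these settings roughly correspond to the following trl `GRPOConfig` arguments. This is a sketch reconstructed from the values above, not the authors' actual training script; note that `beta` is trl's name for the KL coefficient:

```python
from trl import GRPOConfig

# Hypothetical reconstruction of the GRPO training configuration
# from the hyperparameters listed in this card.
config = GRPOConfig(
    learning_rate=3e-6,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    max_steps=10000,
    beta=0.05,          # KL coefficient
    num_generations=4,  # completions sampled per prompt
    bf16=True,
)
```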

### Speeds, Sizes, Times

- **Training time:** ~3h 13min
- **Checkpoint size:** ~524MB (LoRA adapter only)
- **Final train loss:** 0.0141
- **Final reward:** 8.0/8.0 on easy tasks, ~6.0 average

## Evaluation

### Metrics

Each task is scored by a reward function on a scale from -2 to 8:
- +5 for correct output match
- +3 for command success with partial match
- -2 for command failure or wrong output
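
One plausible reading of this rubric, under the assumption that the components sum (+3 for a command that exits successfully, +5 more for an exact output match, -2 on failure), can be sketched as follows. This is an illustration, not the repository's actual reward code:

```python
import subprocess

def reward(command: str, expected_output: str) -> int:
    """Score a generated shell command on the card's -2..8 scale (sketch)."""
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=10
        )
    except subprocess.TimeoutExpired:
        return -2
    if result.returncode != 0:
        return -2                    # command failed
    score = 3                        # command succeeded
    if result.stdout.strip() == expected_output.strip():
        score += 5                   # exact output match
    return score
```

For example, `reward("echo hi", "hi")` would score the maximum of 8 under this reading, while a failing command scores -2.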

### Results

- **Best reward:** 8.0
- **Average reward (final steps):** ~6.0
- **Train loss:** 0.0141

## Environmental Impact

- **Hardware Type:** H100 SXM 80GB
- **Hours used:** ~3.5 hours
- **Cloud Provider:** Vast.ai

## Technical Specifications

### Model Architecture

- Base: Meta-Llama-3-8B-Instruct
- Adapter: LoRA (rank=32, alpha=64, dropout=0.05)
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
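
These adapter settings map onto a peft `LoraConfig` roughly as below. This is a sketch from the values listed in this card; the task type and exact construction are assumptions:

```python
from peft import LoraConfig

# Reconstruction of the adapter configuration described above;
# this mirrors, not reproduces, the authors' training script.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```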

### Software

- unsloth 2026.3.3
- trl 0.24.0
- transformers 4.56.1
- torch 2.6.0+cu124
- PEFT 0.18.1

## Model Card Authors

Jose Alvarez

## Model Card Contact

https://github.com/Alvarez-Jose/unsloth-grpo-project
