---
base_model: unsloth/llama-3-8b-Instruct
library_name: peft
pipeline_tag: text-generation
tags:
- base_model:adapter:unsloth/llama-3-8b-Instruct
- grpo
- lora
- transformers
- trl
- unsloth
---
# CLI Agent — Llama 3 8B GRPO Fine-tune (GPU 1 / lr=5e-6)
A LoRA adapter for Meta-Llama-3-8B-Instruct, fine-tuned with GRPO (Group Relative Policy Optimization) to generate correct Linux shell commands from natural language task descriptions. This is the GPU 1 run, trained at lr=5e-6. See also [jalva182/cli-agent-model](https://huggingface.co/jalva182/cli-agent-model) for the GPU 0 run at lr=3e-6.
## Model Details
### Model Description
- **Developed by:** Jose Alvarez, Carson Chiem, Prisha Bhattacharyya, Vishal Tyagi
- **Model type:** Causal Language Model (LoRA adapter)
- **Language(s) (NLP):** English
- **License:** Meta Llama 3 Community License
- **Finetuned from model:** unsloth/llama-3-8b-Instruct
### Model Sources
- **Repository:** https://github.com/Alvarez-Jose/unsloth-grpo-project
## Uses
### Direct Use
Given a natural language description of a CLI task, the model outputs the correct shell command with no explanation, no markdown, and no backticks.
Example:
- Input: "Count the number of lines in /tmp/data/log.txt"
- Output: `wc -l /tmp/data/log.txt`
### Out-of-Scope Use
- Not intended for general conversation
- Not suitable for tasks outside Linux CLI command generation
- Should not be used for destructive or malicious shell commands
## Bias, Risks, and Limitations
- The model may generate incorrect or harmful shell commands; always review a command before executing it
- Trained on a limited set of ~60 task types, so it may not generalize to all CLI scenarios
- Performance degrades on complex multi-step tasks
## How to Get Started with the Model
```python
from unsloth import FastLanguageModel

# Load the base model with the LoRA adapter applied, in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="jalva182/cli-agent-model-gpu1",
    max_seq_length=512,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch to inference mode

messages = [
    {"role": "system", "content": "You are a CLI expert. Given a task, output exactly the shell commands required. No explanation, no markdown, no backticks."},
    {"role": "user", "content": "Count the number of lines in /tmp/data/log.txt"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Details
### Training Data
60 validated CLI tasks covering file operations, text processing (grep, awk, sed), sorting, archives, system info, permissions, and environment variables. Each task includes setup commands, expected output, and a reward function for GRPO training.
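A single task therefore bundles a prompt, setup commands, and an expected result. A hypothetical entry might look like the sketch below; the field names and values are illustrative, not the actual dataset schema:

```python
# Hypothetical shape of one training task (field names are illustrative).
task = {
    "prompt": "Count the number of lines in /tmp/data/log.txt",
    "setup": [
        "mkdir -p /tmp/data",
        "printf 'a\\nb\\nc\\n' > /tmp/data/log.txt",
    ],
    "expected_output": "3 /tmp/data/log.txt",
}

# During GRPO training, a reward hook would run the generated command
# after `setup` and compare its stdout against `expected_output`.
```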
### Training Hyperparameters
- **Training regime:** bf16 mixed precision
- **Method:** GRPO (Group Relative Policy Optimization)
- **Learning rate:** 5e-6 with linear scheduler
- **Warmup ratio:** 0.1
- **Batch size:** 2 (per device)
- **Gradient accumulation steps:** 2
- **Total steps:** 10000
- **LoRA rank:** 32, alpha: 64
- **KL coefficient:** 0.05
- **Number of generations:** 4
- **Max sequence length:** 512
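Assuming trl's `GRPOTrainer` was used (trl 0.24.0 is listed under Software), the hyperparameters above roughly map onto a `GRPOConfig` like the following. This is a sketch, not the actual training script; argument names follow trl's API and `max_completion_length` is an assumed mapping of the 512-token sequence limit:

```python
from trl import GRPOConfig

config = GRPOConfig(
    learning_rate=5e-6,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    max_steps=10_000,
    num_generations=4,  # completions sampled per prompt
    beta=0.05,          # KL coefficient
    bf16=True,
    max_completion_length=512,
)
```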
### Speeds, Sizes, Times
- **Training time:** ~4h 7min
- **Checkpoint size:** ~524MB (LoRA adapter only)
- **Final train loss:** 0.0188
- **Final reward:** 8.0/8.0 on final steps
## Evaluation
### Metrics
Reward function scoring 0-8 per task:
- +5 for correct output match
- +3 for command success with partial match
- -2 for command failure or wrong output
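One way to read that rubric is the toy function below. It is a minimal illustration only: the real reward function executes commands in a sandbox, and clamping at 0 is an assumption made here so rewards stay in the stated 0-8 range:

```python
def score(command_succeeded: bool, output: str, expected: str) -> int:
    """Toy version of the 0-8 reward rubric described above."""
    exact = output.strip() == expected.strip()
    partial = command_succeeded and expected.strip() in output
    reward = 0
    if exact:
        reward += 5  # correct output match
    if partial:
        reward += 3  # command succeeded with (at least) a partial match
    if not command_succeeded or not (exact or partial):
        reward -= 2  # command failure or wrong output
    return max(reward, 0)  # assumed clamp to the stated 0-8 range
```

Under this reading, an exact match from a successful command earns the full 8 (+5 and +3), a partial match earns 3, and a failed or wrong command bottoms out at 0.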
### Results
- **Best reward:** 8.0
- **Average reward (final steps):** ~6.0
- **Train loss:** 0.0188
## Comparison with GPU 0 Run
| | GPU 0 (cli-agent-model) | GPU 1 (cli-agent-model-gpu1) |
|---|---|---|
| Learning rate | 3e-6 | 5e-6 |
| Train loss | 0.0141 | 0.0188 |
| Final reward | 8.0 | 8.0 |
| Runtime | 3h 13min | 4h 7min |
| Recommendation | ✅ Primary | Secondary |
GPU 0 achieved lower train loss and is recommended as the primary model.
## Environmental Impact
- **Hardware Type:** H100 SXM 80GB
- **Hours used:** ~4.5 hours
- **Cloud Provider:** Vast.ai
## Technical Specifications
### Model Architecture
- Base: Meta-Llama-3-8B-Instruct
- Adapter: LoRA (rank=32, alpha=64, dropout=0.05)
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
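The adapter settings above correspond to a peft `LoraConfig` along these lines (a sketch; in practice Unsloth's `get_peft_model` wrapper was likely used to attach the adapter):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```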
### Software
- unsloth 2026.3.3
- trl 0.24.0
- transformers 4.56.1
- torch 2.6.0+cu124
- PEFT 0.18.1
## Model Card Authors
Jose Alvarez
## Model Card Contact
https://github.com/Alvarez-Jose/unsloth-grpo-project