---
base_model: unsloth/llama-3-8b-Instruct
library_name: peft
pipeline_tag: text-generation
tags:
- base_model:adapter:unsloth/llama-3-8b-Instruct
- grpo
- lora
- transformers
- trl
- unsloth
---

# CLI Agent — Llama 3 8B GRPO Fine-tune (GPU 1 / lr=5e-6)

A LoRA adapter fine-tuned on Meta-Llama-3-8B-Instruct using GRPO (Group Relative Policy Optimization) to generate correct Linux shell commands from natural-language task descriptions. This is the GPU 1 run, trained at lr=5e-6. See also [jalva182/cli-agent-model](https://huggingface.co/jalva182/cli-agent-model) for the GPU 0 run at lr=3e-6.

## Model Details

### Model Description

- **Developed by:** Jose Alvarez, Carson Chiem, Prisha Bhattacharyya, Vishal Tyagi
- **Model type:** Causal language model (LoRA adapter)
- **Language(s) (NLP):** English
- **License:** Meta Llama 3 Community License
- **Finetuned from model:** unsloth/llama-3-8b-Instruct

### Model Sources

- **Repository:** https://github.com/Alvarez-Jose/unsloth-grpo-project

## Uses

### Direct Use

Given a natural-language description of a CLI task, the model outputs the corresponding shell command with no explanation, no markdown, and no backticks.

Example:
- Input: "Count the number of lines in /tmp/data/log.txt"
- Output: `wc -l /tmp/data/log.txt`

### Out-of-Scope Use

- Not intended for general conversation
- Not suitable for tasks outside Linux CLI command generation
- Should not be used for destructive or malicious shell commands

## Bias, Risks, and Limitations

- The model may generate incorrect or harmful shell commands; always review them before executing
- It was trained on a limited set of ~60 task types, so it may not generalize to all CLI scenarios
- Performance degrades on complex multi-step tasks

## How to Get Started with the Model

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="jalva182/cli-agent-model-gpu1",
    max_seq_length=512,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch the model into inference mode

messages = [
    {"role": "system", "content": "You are a CLI expert. Given a task, output exactly the shell commands required. No explanation, no markdown, no backticks."},
    {"role": "user", "content": "Count the number of lines in /tmp/data/log.txt"},
]

# add_generation_prompt=True appends the assistant header so the model
# begins a fresh response instead of continuing the user turn
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Training Data

60 validated CLI tasks covering file operations, text processing (grep, awk, sed), sorting, archives, system info, permissions, and environment variables. Each task includes setup commands, an expected output, and a reward function for GRPO training.

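One task record might look like the following sketch (field names such as `setup` and `expected_output` are illustrative assumptions, not the repo's actual schema):

```python
# Hypothetical shape of one training task; the real dataset's field
# names and setup commands may differ.
task = {
    "prompt": "Count the number of lines in /tmp/data/log.txt",
    "setup": "mkdir -p /tmp/data && printf 'a\\nb\\nc\\n' > /tmp/data/log.txt",
    "expected_output": "3 /tmp/data/log.txt",  # what `wc -l` prints for 3 lines
}
```
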
### Training Hyperparameters

- **Training regime:** bf16 mixed precision
- **Method:** GRPO (Group Relative Policy Optimization)
- **Learning rate:** 5e-6 with a linear scheduler
- **Warmup ratio:** 0.1
- **Batch size:** 2 (per device)
- **Gradient accumulation steps:** 2
- **Total steps:** 10,000
- **LoRA rank:** 32, alpha: 64
- **KL coefficient:** 0.05
- **Number of generations:** 4
- **Max sequence length:** 512

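These settings map roughly onto trl's `GRPOConfig`; the sketch below is a hypothetical reconstruction, not the repo's actual training script (argument names follow trl 0.24 and may differ in other versions):

```python
from trl import GRPOConfig

# Sketch of the run's settings as a trl GRPOConfig (assumed mapping)
config = GRPOConfig(
    learning_rate=5e-6,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    max_steps=10_000,
    num_generations=4,  # GRPO group size: completions sampled per prompt
    beta=0.05,          # KL coefficient against the reference model
    bf16=True,
)
```
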
### Speeds, Sizes, Times

- **Training time:** ~4 h 7 min
- **Checkpoint size:** ~524 MB (LoRA adapter only)
- **Final train loss:** 0.0188
- **Final reward:** 8.0/8.0 on the final steps

## Evaluation

### Metrics

Reward function scoring 0-8 per task:

- +5 for a correct output match
- +3 for command success with a partial match
- -2 for command failure or wrong output

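A minimal sketch of such a reward function, assuming the bonuses stack so that a successful command with an exact output match earns the full 8 (the actual implementation lives in the project repo):

```python
def score_task(succeeded: bool, output: str, expected: str) -> int:
    """Hypothetical sketch of the 0-8 per-task reward described above."""
    out, exp = output.strip(), expected.strip()
    if not succeeded or exp not in out:
        return -2        # command failed or produced the wrong output
    reward = 3           # command succeeded with at least a partial match
    if out == exp:
        reward += 5      # exact output match -> maximum score of 8
    return reward
```
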
### Results

- **Best reward:** 8.0
- **Average reward (final steps):** ~6.0
- **Train loss:** 0.0188

## Comparison with GPU 0 Run

| | GPU 0 (cli-agent-model) | GPU 1 (cli-agent-model-gpu1) |
|---|---|---|
| Learning rate | 3e-6 | 5e-6 |
| Train loss | 0.0141 | 0.0188 |
| Final reward | 8.0 | 8.0 |
| Runtime | 3 h 13 min | 4 h 7 min |
| Recommendation | ✅ Primary | Secondary |

GPU 0 achieved a lower train loss and is recommended as the primary model.

## Environmental Impact

- **Hardware Type:** H100 SXM 80 GB
- **Hours used:** ~4.5
- **Cloud Provider:** Vast.ai

## Technical Specifications

### Model Architecture

- Base: Meta-Llama-3-8B-Instruct
- Adapter: LoRA (rank=32, alpha=64, dropout=0.05)
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

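The adapter corresponds to a peft `LoraConfig` along these lines (a sketch built from the values above, not the repo's exact configuration):

```python
from peft import LoraConfig

# Assumed reconstruction of the adapter settings listed above
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```
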
### Software

- unsloth 2026.3.3
- trl 0.24.0
- transformers 4.56.1
- torch 2.6.0+cu124
- peft 0.18.1

## Model Card Authors

Jose Alvarez

## Model Card Contact

https://github.com/Alvarez-Jose/unsloth-grpo-project

### Framework versions

- PEFT 0.18.1