---
base_model: unsloth/llama-3-8b-Instruct
library_name: peft
pipeline_tag: text-generation
tags:
- base_model:adapter:unsloth/llama-3-8b-Instruct
- grpo
- lora
- transformers
- trl
- unsloth
---

# CLI Agent — Llama 3 8B GRPO Fine-tune (GPU 1 / lr=5e-6)

A LoRA adapter fine-tuned on Meta-Llama-3-8B-Instruct using GRPO (Group Relative Policy Optimization) to generate correct Linux shell commands from natural-language task descriptions. This is the GPU 1 run, trained at lr=5e-6. See also [jalva182/cli-agent-model](https://huggingface.co/jalva182/cli-agent-model) for the GPU 0 run at lr=3e-6.

## Model Details

### Model Description

- **Developed by:** Jose Alvarez, Carson Chiem, Prisha Bhattacharyya, Vishal Tyagi
- **Model type:** Causal language model (LoRA adapter)
- **Language(s) (NLP):** English
- **License:** Meta Llama 3 Community License
- **Finetuned from model:** unsloth/llama-3-8b-Instruct

### Model Sources

- **Repository:** https://github.com/Alvarez-Jose/unsloth-grpo-project

## Uses

### Direct Use

Given a natural-language description of a CLI task, the model outputs the corresponding shell command with no explanation, no markdown, and no backticks.

Example:
- Input: "Count the number of lines in /tmp/data/log.txt"
- Output: `wc -l /tmp/data/log.txt`

### Out-of-Scope Use

- Not intended for general conversation
- Not suitable for tasks outside Linux CLI command generation
- Should not be used for destructive or malicious shell commands

## Bias, Risks, and Limitations

- The model may generate incorrect or harmful shell commands; always review them before executing
- It was trained on a limited set of ~60 task types, so it may not generalize to all CLI scenarios
- Performance degrades on complex multi-step tasks

## How to Get Started with the Model

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="jalva182/cli-agent-model-gpu1",
    max_seq_length=512,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch the model into inference mode

messages = [
    {"role": "system", "content": "You are a CLI expert. Given a task, output exactly the shell commands required. No explanation, no markdown, no backticks."},
    {"role": "user", "content": "Count the number of lines in /tmp/data/log.txt"},
]

# add_generation_prompt=True appends the assistant header so the model
# begins a fresh response instead of continuing the user turn
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Training Data

60 validated CLI tasks covering file operations, text processing (grep, awk, sed), sorting, archives, system info, permissions, and environment variables. Each task includes setup commands, an expected output, and a reward function for GRPO training.

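One task record might look like the following sketch (field names such as `setup` and `expected_output` are illustrative assumptions, not the repo's actual schema):

```python
# Hypothetical shape of one training task; the real dataset's field
# names and setup commands may differ.
task = {
    "prompt": "Count the number of lines in /tmp/data/log.txt",
    "setup": "mkdir -p /tmp/data && printf 'a\\nb\\nc\\n' > /tmp/data/log.txt",
    "expected_output": "3 /tmp/data/log.txt",  # what `wc -l` prints for 3 lines
}
```
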
### Training Hyperparameters

- **Training regime:** bf16 mixed precision
- **Method:** GRPO (Group Relative Policy Optimization)
- **Learning rate:** 5e-6 with a linear scheduler
- **Warmup ratio:** 0.1
- **Batch size:** 2 (per device)
- **Gradient accumulation steps:** 2
- **Total steps:** 10,000
- **LoRA rank:** 32, alpha: 64
- **KL coefficient:** 0.05
- **Number of generations:** 4
- **Max sequence length:** 512

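These settings map roughly onto trl's `GRPOConfig`; the sketch below is a hypothetical reconstruction, not the repo's actual training script (argument names follow trl 0.24 and may differ in other versions):

```python
from trl import GRPOConfig

# Sketch of the run's settings as a trl GRPOConfig (assumed mapping)
config = GRPOConfig(
    learning_rate=5e-6,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    max_steps=10_000,
    num_generations=4,  # GRPO group size: completions sampled per prompt
    beta=0.05,          # KL coefficient against the reference model
    bf16=True,
)
```
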
### Speeds, Sizes, Times

- **Training time:** ~4 h 7 min
- **Checkpoint size:** ~524 MB (LoRA adapter only)
- **Final train loss:** 0.0188
- **Final reward:** 8.0/8.0 on the final steps

## Evaluation

### Metrics

Reward function scoring 0-8 per task:

- +5 for a correct output match
- +3 for command success with a partial match
- -2 for command failure or wrong output

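A minimal sketch of such a reward function, assuming the bonuses stack so that a successful command with an exact output match earns the full 8 (the actual implementation lives in the project repo):

```python
def score_task(succeeded: bool, output: str, expected: str) -> int:
    """Hypothetical sketch of the 0-8 per-task reward described above."""
    out, exp = output.strip(), expected.strip()
    if not succeeded or exp not in out:
        return -2        # command failed or produced the wrong output
    reward = 3           # command succeeded with at least a partial match
    if out == exp:
        reward += 5      # exact output match -> maximum score of 8
    return reward
```
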
### Results

- **Best reward:** 8.0
- **Average reward (final steps):** ~6.0
- **Train loss:** 0.0188

## Comparison with GPU 0 Run

| | GPU 0 (cli-agent-model) | GPU 1 (cli-agent-model-gpu1) |
|---|---|---|
| Learning rate | 3e-6 | 5e-6 |
| Train loss | 0.0141 | 0.0188 |
| Final reward | 8.0 | 8.0 |
| Runtime | 3 h 13 min | 4 h 7 min |
| Recommendation | ✅ Primary | Secondary |

GPU 0 achieved a lower train loss and is recommended as the primary model.

## Environmental Impact

- **Hardware Type:** H100 SXM 80 GB
- **Hours used:** ~4.5
- **Cloud Provider:** Vast.ai

## Technical Specifications

### Model Architecture

- Base: Meta-Llama-3-8B-Instruct
- Adapter: LoRA (rank=32, alpha=64, dropout=0.05)
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

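The adapter corresponds to a peft `LoraConfig` along these lines (a sketch built from the values above, not the repo's exact configuration):

```python
from peft import LoraConfig

# Assumed reconstruction of the adapter settings listed above
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```
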
### Software

- unsloth 2026.3.3
- trl 0.24.0
- transformers 4.56.1
- torch 2.6.0+cu124
- peft 0.18.1

## Model Card Authors

Jose Alvarez

## Model Card Contact

https://github.com/Alvarez-Jose/unsloth-grpo-project

### Framework versions

- PEFT 0.18.1