# T-Bench Qwen SFT Multi-Task NAT v9

## Model Description

This is a Qwen3-8B model fine-tuned on Terminal-Bench tasks using Negative-Aware Training (NAT) v9. The model is trained to avoid common failure patterns when executing terminal commands.
## Training Details
- Base Model: Qwen/Qwen3-8B
- Training Method: Negative-Aware Training (NAT) v9
- Tasks: 5 terminal bench tasks
- Epochs: 300
- Learning Rate: 5e-5
- Max Length: 16384 tokens
## Dataset Composition
The training dataset includes 45 samples per epoch:
- Positive Examples: 20 (4 per task)
- Negative Examples: 25 (5 per task)
- Negative Ratio: 55.6%
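The per-epoch mix above can be reproduced with a short sketch. The counts come from this card; the record layout (`task`/`label` dicts) is an illustrative assumption:

```python
# Sketch of the per-epoch dataset mix described on this card.
# Sample counts come from the card; the record layout is illustrative.
TASKS = [
    "fix-git",
    "log-summary-date-ranges",
    "pypi-server",
    "regex-log",
    "cancel-async-tasks",
]
POS_PER_TASK = 4
NEG_PER_TASK = 5

def build_epoch():
    samples = []
    for task in TASKS:
        samples += [{"task": task, "label": "positive"}] * POS_PER_TASK
        samples += [{"task": task, "label": "negative"}] * NEG_PER_TASK
    return samples

epoch = build_epoch()
negatives = sum(s["label"] == "negative" for s in epoch)
# 5 tasks x (4 + 5) = 45 samples, 25 of which are negative (~55.6%)
```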
## Tasks Included
- fix-git (4 pos, 5 neg)
- log-summary-date-ranges (4 pos, 5 neg)
- pypi-server (4 pos, 5 neg)
- regex-log (4 pos, 5 neg)
- cancel-async-tasks (4 pos, 5 neg)
## NAT v9 Improvements
Based on failure analysis of v8 (44% success rate), v9 includes 5 types of negative examples:
### Negative Types
- Hallucinated Arguments: adding non-existent parameters such as `message_title`, `message_description`, `message_attachment`
- Asking User for Help: responding with "Would you like me to..." instead of executing autonomously
- Excessive Exploration: running many commands without taking decisive action
- Wrong Block Mode: using `block=False` for quick commands that should be blocking
- Looping Behavior: repeating commands after task completion
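As an illustration, a hallucinated-argument negative might sit alongside its clean positive counterpart like this. The record schema and the `send_message` tool name are hypothetical, not this card's actual data format:

```python
# Hypothetical sample records; field and tool names are illustrative only.
positive_sample = {
    "task": "fix-git",
    "label": "positive",
    # Clean tool call: only parameters the tool actually defines.
    "action": {"tool": "send_message", "args": {"message": "Done."}},
}

negative_sample = {
    "task": "fix-git",
    "label": "negative",
    "failure_type": "hallucinated_arguments",
    # Invented parameters like message_title do not exist on the tool.
    "action": {
        "tool": "send_message",
        "args": {
            "message": "Done.",
            "message_title": "Result",
            "message_attachment": "log.txt",
        },
    },
}

# The hallucinated parameters are exactly those absent from the clean call.
hallucinated = set(negative_sample["action"]["args"]) - set(
    positive_sample["action"]["args"]
)
```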
### Key Improvements
- 55.6% negative ratio (vs 23% in v8)
- Task-specific negatives for each failure pattern
- Clean positive examples with no hallucinated parameters
- Enhanced system prompt emphasizing autonomous execution
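The card does not spell out the NAT objective itself. One common way to make a token-level loss "negative-aware" is unlikelihood-style weighting: positive traces are trained toward high probability, negative traces toward low probability. A toy sketch of that idea, not necessarily the v9 implementation:

```python
import math

def nat_token_loss(p_token: float, is_negative: bool, neg_weight: float = 1.0) -> float:
    """Toy negative-aware loss for a single token probability.

    Positive samples use the usual NLL, -log(p); negative samples use an
    unlikelihood term, -log(1 - p), which pushes probability down.
    neg_weight scales the penalty on negatives. This is a sketch of the
    general technique, not the card's actual training code.
    """
    if is_negative:
        return -neg_weight * math.log(max(1.0 - p_token, 1e-9))
    return -math.log(max(p_token, 1e-9))

# A confident token on a positive trace is cheap; the same confidence
# on a negative trace is expensive.
pos_loss = nat_token_loss(0.9, is_negative=False)
neg_loss = nat_token_loss(0.9, is_negative=True)
```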
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Aznaur/tbench-qwen-sft-multitask-nat-v9",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "Aznaur/tbench-qwen-sft-multitask-nat-v9",
    trust_remote_code=True,
)
```
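For inference, Qwen chat models expect ChatML-formatted prompts; in practice you would call `tokenizer.apply_chat_template(messages, add_generation_prompt=True)`. The underlying layout looks roughly like this (sketched without loading the model; the system prompt text is a placeholder):

```python
# Sketch of the ChatML layout Qwen chat templates produce. In real use,
# prefer tokenizer.apply_chat_template(messages, add_generation_prompt=True).
def chatml_prompt(messages):
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")  # generation prompt
    return "".join(parts)

prompt = chatml_prompt([
    {"role": "system", "content": "Execute terminal tasks autonomously."},
    {"role": "user", "content": "List the files in /tmp."},
])
# `prompt` would then be tokenized and passed to model.generate(...)
```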
## Performance
This model addresses the key failure patterns identified in v8:
- Eliminates hallucinated tool parameters
- Prevents asking the user for help
- Encourages decisive action over excessive exploration
- Uses correct block modes for commands
- Avoids looping behavior
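Looping behavior in particular is easy to check for at evaluation time. A simple heuristic is to flag any trace that repeats the same command several times in a row; the threshold here is an illustrative choice, not part of this card's evaluation:

```python
def flags_looping(commands, repeat_threshold=3):
    """Flag a command trace that repeats one command too many times in a row.

    A crude proxy for the 'looping behavior' failure pattern; the
    consecutive-repeat criterion and threshold are illustrative.
    """
    run = 1
    for prev, cur in zip(commands, commands[1:]):
        run = run + 1 if cur == prev else 1
        if run >= repeat_threshold:
            return True
    return False

# e.g. a model that keeps re-running `ls` after the task is done
looping = flags_looping(["git status", "ls", "ls", "ls"])
```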
## Limitations

- Trained on only 5 tasks, so it may not generalize to other terminal tasks
- Negative examples based on observed failure patterns only
- Model may still fail on edge cases not covered in training
## Training Pipeline

- Dataset Creation: `create_multitask_nat_v9.py`
- Training Config: `experiment_multitask_nat_v9.yaml`
- Failure Analysis: `MULTITASK_NAT_FAILURE_ANALYSIS.md`
## License
This model inherits the license from the base Qwen3-8B model.