
T-Bench Qwen SFT Multi-Task NAT v9

Model Description

This is a Qwen3-8B model fine-tuned on five Terminal-Bench tasks using Negative-Aware Training (NAT) v9. The model is trained to avoid common failure patterns when executing terminal commands.

Training Details

  • Base Model: Qwen/Qwen3-8B
  • Training Method: Negative-Aware Training (NAT) v9
  • Tasks: 5 Terminal-Bench tasks
  • Epochs: 300
  • Learning Rate: 5e-5
  • Max Length: 16384 tokens
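In config form, the hyperparameters above might be laid out as follows. This is a hypothetical sketch: the field names are assumptions, since the actual schema of experiment_multitask_nat_v9.yaml is not shown in this card; only the values come from the list above.

```yaml
# Hypothetical config sketch; field names are assumptions,
# values are taken from the Training Details section.
model:
  base: Qwen/Qwen3-8B
training:
  method: nat_v9
  epochs: 300
  learning_rate: 5.0e-5
  max_length: 16384
data:
  tasks: 5
  samples_per_epoch: 45
```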

Dataset Composition

The training dataset includes 45 samples per epoch:

  • Positive Examples: 20 (4 per task)
  • Negative Examples: 25 (5 per task)
  • Negative Ratio: 55.6%

Tasks Included

  1. fix-git (4 pos, 5 neg)
  2. log-summary-date-ranges (4 pos, 5 neg)
  3. pypi-server (4 pos, 5 neg)
  4. regex-log (4 pos, 5 neg)
  5. cancel-async-tasks (4 pos, 5 neg)
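As a sanity check, the per-epoch composition above can be assembled programmatically. This is a minimal sketch only; the real dataset builder is create_multitask_nat_v9.py, whose internals are not shown here, and the sample contents below are placeholders.

```python
# Minimal sketch of the per-epoch dataset composition described above.
# Sample contents are placeholders; only the counts come from the card.
TASKS = [
    "fix-git",
    "log-summary-date-ranges",
    "pypi-server",
    "regex-log",
    "cancel-async-tasks",
]

POS_PER_TASK = 4
NEG_PER_TASK = 5

def build_epoch():
    """Return (task, label) pairs for one training epoch."""
    samples = []
    for task in TASKS:
        samples += [(task, "positive")] * POS_PER_TASK
        samples += [(task, "negative")] * NEG_PER_TASK
    return samples

epoch = build_epoch()
negatives = sum(1 for _, label in epoch if label == "negative")
print(len(epoch))                               # 45 samples per epoch
print(round(100 * negatives / len(epoch), 1))   # 55.6 (% negative)
```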

NAT v9 Improvements

Based on failure analysis of v8 (44% success rate), v9 includes 5 types of negative examples:

Negative Types

  1. Hallucinated Arguments: Adding non-existent parameters like message_title, message_description, message_attachment
  2. Asking User for Help: "Would you like me to..." instead of executing autonomously
  3. Excessive Exploration: Running many commands without taking decisive action
  4. Wrong Block Mode: Using block=False for quick commands that should be blocking
  5. Looping Behavior: Repeating commands after task completion
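The card does not spell out the NAT objective itself. One common way to exploit negative examples in fine-tuning is unlikelihood training, where the loss pushes probability mass away from tokens in negative trajectories instead of toward them. The toy sketch below illustrates that idea at the single-token level; it is an assumption for illustration, not necessarily what NAT v9 implements.

```python
import math

def nat_style_token_loss(p_target: float, is_negative: bool) -> float:
    """Toy per-token loss illustrating negative-aware training.

    Positive examples use standard cross-entropy, -log p(target);
    negative examples use an unlikelihood term, -log(1 - p(target)),
    which penalizes assigning high probability to a bad token.
    Illustration only -- not the actual NAT v9 objective.
    """
    eps = 1e-9  # avoid log(0)
    if is_negative:
        return -math.log(max(1.0 - p_target, eps))
    return -math.log(max(p_target, eps))

# Confidence on a positive token yields a small loss...
print(nat_style_token_loss(0.9, is_negative=False))
# ...while the same confidence on a negative token is heavily penalized.
print(nat_style_token_loss(0.9, is_negative=True))
```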

Key Improvements

  • 55.6% negative ratio (vs 23% in v8)
  • Task-specific negatives for each failure pattern
  • Clean positive examples with no hallucinated parameters
  • Enhanced system prompt emphasizing autonomous execution

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Aznaur/tbench-qwen-sft-multitask-nat-v9",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "Aznaur/tbench-qwen-sft-multitask-nat-v9",
    trust_remote_code=True
)
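A generation call might then look like the sketch below. The chat message and sampling settings are illustrative assumptions (the training-time system prompt is not reproduced in this card), and running it requires downloading the model weights.

```python
# Continuing from the loading snippet above; the prompt and
# generation settings here are illustrative assumptions.
messages = [
    {"role": "user", "content": "List all files modified in the last git commit."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```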

Performance

This model addresses the key failure patterns identified in v8:

  • Eliminates hallucinated tool parameters
  • Prevents asking user for help
  • Encourages decisive action over excessive exploration
  • Uses correct block modes for commands
  • Avoids looping behavior

Limitations

  • Trained on only 5 tasks, so it may not generalize to all terminal tasks
  • Negative examples cover only the failure patterns observed in v8
  • The model may still fail on edge cases not covered in training

Training Pipeline

  1. Dataset Creation: create_multitask_nat_v9.py
  2. Training Config: experiment_multitask_nat_v9.yaml
  3. Failure Analysis: MULTITASK_NAT_FAILURE_ANALYSIS.md

License

This model inherits the license from the base Qwen3-8B model.
