
T-Bench Qwen SFT Multi-Task NAT v9

Model Description

This is a Qwen3-8B model fine-tuned on five Terminal-Bench tasks using Negative-Aware Training (NAT) v9. The model is trained to avoid common failure patterns when executing terminal commands.

Training Details

  • Base Model: Qwen/Qwen3-8B
  • Training Method: Negative-Aware Training (NAT) v9
  • Tasks: 5 Terminal-Bench tasks
  • Epochs: 300
  • Learning Rate: 5e-5
  • Max Length: 16384 tokens
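In config form, the hyperparameters above might be laid out as follows. This is a hypothetical sketch: the field names are assumptions, since the actual schema of experiment_multitask_nat_v9.yaml is not shown in this card; only the values come from the list above.

```yaml
# Hypothetical config sketch; field names are assumptions,
# values are taken from the Training Details section.
model:
  base: Qwen/Qwen3-8B
training:
  method: nat_v9
  epochs: 300
  learning_rate: 5.0e-5
  max_length: 16384
data:
  tasks: 5
  samples_per_epoch: 45
```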

Dataset Composition

The training dataset includes 45 samples per epoch:

  • Positive Examples: 20 (4 per task)
  • Negative Examples: 25 (5 per task)
  • Negative Ratio: 55.6%

Tasks Included

  1. fix-git (4 pos, 5 neg)
  2. log-summary-date-ranges (4 pos, 5 neg)
  3. pypi-server (4 pos, 5 neg)
  4. regex-log (4 pos, 5 neg)
  5. cancel-async-tasks (4 pos, 5 neg)
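As a sanity check, the per-epoch composition above can be assembled programmatically. This is a minimal sketch only; the real dataset builder is create_multitask_nat_v9.py, whose internals are not shown here, and the sample contents below are placeholders.

```python
# Minimal sketch of the per-epoch dataset composition described above.
# Sample contents are placeholders; only the counts come from the card.
TASKS = [
    "fix-git",
    "log-summary-date-ranges",
    "pypi-server",
    "regex-log",
    "cancel-async-tasks",
]

POS_PER_TASK = 4
NEG_PER_TASK = 5

def build_epoch():
    """Return (task, label) pairs for one training epoch."""
    samples = []
    for task in TASKS:
        samples += [(task, "positive")] * POS_PER_TASK
        samples += [(task, "negative")] * NEG_PER_TASK
    return samples

epoch = build_epoch()
negatives = sum(1 for _, label in epoch if label == "negative")
print(len(epoch))                               # 45 samples per epoch
print(round(100 * negatives / len(epoch), 1))   # 55.6 (% negative)
```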

NAT v9 Improvements

Based on failure analysis of v8 (44% success rate), v9 includes 5 types of negative examples:

Negative Types

  1. Hallucinated Arguments: Adding non-existent parameters like message_title, message_description, message_attachment
  2. Asking User for Help: "Would you like me to..." instead of executing autonomously
  3. Excessive Exploration: Running many commands without taking decisive action
  4. Wrong Block Mode: Using block=False for quick commands that should be blocking
  5. Looping Behavior: Repeating commands after task completion
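The card does not spell out the NAT objective itself. One common way to exploit negative examples in fine-tuning is unlikelihood training, where the loss pushes probability mass away from tokens in negative trajectories instead of toward them. The toy sketch below illustrates that idea at the single-token level; it is an assumption for illustration, not necessarily what NAT v9 implements.

```python
import math

def nat_style_token_loss(p_target: float, is_negative: bool) -> float:
    """Toy per-token loss illustrating negative-aware training.

    Positive examples use standard cross-entropy, -log p(target);
    negative examples use an unlikelihood term, -log(1 - p(target)),
    which penalizes assigning high probability to a bad token.
    Illustration only -- not the actual NAT v9 objective.
    """
    eps = 1e-9  # avoid log(0)
    if is_negative:
        return -math.log(max(1.0 - p_target, eps))
    return -math.log(max(p_target, eps))

# Confidence on a positive token yields a small loss...
print(nat_style_token_loss(0.9, is_negative=False))
# ...while the same confidence on a negative token is heavily penalized.
print(nat_style_token_loss(0.9, is_negative=True))
```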

Key Improvements

  • 55.6% negative ratio (vs 23% in v8)
  • Task-specific negatives for each failure pattern
  • Clean positive examples with no hallucinated parameters
  • Enhanced system prompt emphasizing autonomous execution

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Aznaur/tbench-qwen-sft-multitask-nat-v9",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "Aznaur/tbench-qwen-sft-multitask-nat-v9",
    trust_remote_code=True
)
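A generation call might then look like the sketch below. The chat message and sampling settings are illustrative assumptions (the training-time system prompt is not reproduced in this card), and running it requires downloading the model weights.

```python
# Continuing from the loading snippet above; the prompt and
# generation settings here are illustrative assumptions.
messages = [
    {"role": "user", "content": "List all files modified in the last git commit."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```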

Performance

This model addresses the key failure patterns identified in v8:

  • Eliminates hallucinated tool parameters
  • Prevents asking user for help
  • Encourages decisive action over excessive exploration
  • Uses correct block modes for commands
  • Avoids looping behavior

Limitations

  • Trained on only 5 tasks, so it may not generalize to all terminal tasks
  • Negative examples cover only the failure patterns observed in v8
  • The model may still fail on edge cases not covered in training

Training Pipeline

  1. Dataset Creation: create_multitask_nat_v9.py
  2. Training Config: experiment_multitask_nat_v9.yaml
  3. Failure Analysis: MULTITASK_NAT_FAILURE_ANALYSIS.md

License

This model inherits the license from the base Qwen3-8B model.
