# T-Bench Qwen SFT Multi-Task NAT v8

## Model Description
This model is fine-tuned from Qwen3-8B using enhanced Negative-Aware Training (NAT) on multiple terminal bench tasks.
## Training Details
- Base Model: Qwen/Qwen3-8B
- Training Method: Enhanced Negative-Aware Training (NAT v8)
- Tasks: 5 tasks (fix-git, log-summary-date-ranges, pypi-server, regex-log, cancel-async-tasks)
- Epochs: 200
- Learning Rate: 5e-5
- Batch Size: 2
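The hyperparameters above can be collected into a plain config dict for quick reference. This is only a summary of the values listed in this card; the actual training framework and script are not specified here:

```python
# Summary of the training configuration listed in this card.
# Key names are illustrative; only the values come from the card.
config = {
    "base_model": "Qwen/Qwen3-8B",
    "method": "Enhanced Negative-Aware Training (NAT v8)",
    "num_tasks": 5,
    "epochs": 200,
    "learning_rate": 5e-5,
    "batch_size": 2,
}
```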
## Dataset Composition
- Total samples: 30 per epoch
- Positive examples: 20 (4 per task)
- Negative examples: 10 (2 per task, task-specific negatives)
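The per-epoch mix above (4 positives and 2 negatives for each of the 5 tasks, 30 samples total) can be sketched as a small builder. Function and sample-ID names are hypothetical, not from the training code:

```python
# Hypothetical sketch of assembling one epoch's training mix.
# Task names come from the card; everything else is illustrative.
TASKS = ["fix-git", "log-summary-date-ranges", "pypi-server",
         "regex-log", "cancel-async-tasks"]

def build_epoch(positives_per_task=4, negatives_per_task=2):
    """Return (sample_id, task, label) triples for one epoch."""
    epoch = []
    for task in TASKS:
        for i in range(positives_per_task):
            epoch.append((f"{task}-pos-{i}", task, "positive"))
        for i in range(negatives_per_task):
            epoch.append((f"{task}-neg-{i}", task, "negative"))
    return epoch

epoch = build_epoch()
print(len(epoch))  # 5 tasks * (4 + 2) = 30 samples
```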
## NAT v8 Enhancements
Negative examples teach task-specific anti-patterns:
- Hallucinated arguments: adding invented parameters such as `message_title` and `message_description`
- Looping behavior: repeating commands after the task is already complete
- Wrong command format: emitting an `id` instead of the actual command
- Task-specific failures: Customized negative patterns for each task
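One common way to integrate such negative examples into SFT is to keep them in the loss but down-weight them relative to positives. The sketch below is a minimal illustration of that idea; the weights and function name are assumptions, not the values or code used for this model:

```python
# Minimal sketch of a negative-aware loss: average per-token NLL with
# negative-example tokens down-weighted. Weights are assumed, not from the card.
POS_WEIGHT = 1.0   # assumed weight for tokens from positive trajectories
NEG_WEIGHT = 0.25  # assumed down-weight for tokens from negative trajectories

def nat_loss(token_nlls, labels):
    """Weighted average of per-token negative log-likelihoods.

    token_nlls: per-token NLL values
    labels:     parallel "positive"/"negative" tags
    """
    total, weight_sum = 0.0, 0.0
    for nll, label in zip(token_nlls, labels):
        w = POS_WEIGHT if label == "positive" else NEG_WEIGHT
        total += w * nll
        weight_sum += w
    return total / weight_sum
```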
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Aznaur/tbench-qwen-sft-multitask-nat-v8")
tokenizer = AutoTokenizer.from_pretrained("Aznaur/tbench-qwen-sft-multitask-nat-v8")
```
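A generation example might look like the following. The prompt, sampling settings, and the assumption that the tokenizer ships a chat template (as Qwen3 tokenizers generally do) are illustrative:

```python
# Illustrative inference sketch; the prompt and generation settings are
# assumptions, not recommendations from the model authors.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Aznaur/tbench-qwen-sft-multitask-nat-v8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "List the files in the current directory."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```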
## Performance

The model was trained for 200 epochs with enhanced NAT to improve tool usage and to avoid the task-specific failure patterns listed above.
## Paper Reference

Based on "Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents" (arXiv:2402.11651).
## Model Checkpoint
- Epoch: 199
- Global Step: 7799
- Training completed successfully