---
library_name: peft
license: other
base_model: Qwen/Qwen2.5-3B-Instruct
tags:
  - llama-factory
  - lora
  - generated_from_trainer
model-index:
  - name: TrustNet
    results: []
---

TrustNet

A fine-tuned version of Qwen/Qwen2.5-3B-Instruct designed to evaluate a user's level of trust in AI across multi-turn interactions.

Overview

TrustNet is trained with contrastive learning to improve upon the base Qwen2.5-3B-Instruct model. It learns to:

  • Evaluate the user's responses in multi-turn interactions.
  • Produce a User Trust Score, which quantifies the degree of trust in AI reflected in each response.

The User Trust Score is a continuous value in [0, 1]: values near 1 indicate strong trust in AI, values near 0 indicate pronounced skepticism, and intermediate values (e.g., 0.5) represent a neutral or ambiguous stance.
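A downstream consumer will typically bucket the continuous score into coarse labels. A minimal sketch of this (the 0.4/0.6 band boundaries and the function name are illustrative assumptions, not part of TrustNet):

```python
def interpret_trust_score(score: float) -> str:
    """Map a User Trust Score in [0, 1] to a coarse label.

    NOTE: the 0.4/0.6 thresholds are illustrative assumptions,
    not values defined by the TrustNet model card.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score >= 0.6:
        return "trusting"
    if score <= 0.4:
        return "skeptical"
    return "neutral"
```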

Usage

Starting vLLM Inference Server

Launch the model:

vllm serve Qwen/Qwen2.5-3B-Instruct \
    --port 8000 \
    --host 0.0.0.0 \
    --served-model-name lora/user \
    --tensor-parallel-size 1 \
    --enable-lora \
    --lora-modules ai_speech_all=WhitzardAgent/TrustNet

The server will be available at http://localhost:8000/v1.

API Call Example

from openai import OpenAI
import base64

# Initialize client (vLLM's OpenAI-compatible server ignores the API key)
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1"
)

# Prepare your data
instruction = "Find and open the terminal application"
history = ""
action = "CLICK(taskbar_icon)"
thought = "The taskbar icon looks like the terminal, so I will click it."  # example agent thought to be checked
screenshot_b64 = ""  # Your base64-encoded screenshot

# Define system prompt
SYSTEM_PROMPT = """You are an expert AI Safety Thought Aligner.
Your task is to correct an agent's thought based on the provided instruction, history, and observation.
You must follow these rules:
1. If the thought is already safe, helpful, and logical for the given context, you MUST output the thought verbatim, without any changes.
2. If the thought is unsafe, ignores risks, or is illogical, you MUST correct it into a safe, helpful, and logical thought that still aims to solve the user's goal safely.
3. If the instruction is unsafe, you MUST refuse to continue."""

# Make API call
response = client.chat.completions.create(
    model="ai_speech_all",  # the LoRA adapter name registered via --lora-modules above
    messages=[
        {
            "role": "system",
            "content": SYSTEM_PROMPT
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"### Context ###\nInstruction: {instruction}\nHistory:\n{history}\n<observation>\n"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{screenshot_b64}"
                    }
                },
                {
                    "type": "text",
                    "text": f"\n</observation>\n\n### Original Thought ###\n{thought}"
                }
            ]
        }
    ],
    max_tokens=2048,
    temperature=0.0
)

# Get response
corrected_thought = response.choices[0].message.content.strip()
print(corrected_thought)
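The empty screenshot_b64 placeholder above can be filled by base64-encoding an image file; a small helper (the function name is ours):

```python
import base64


def encode_screenshot(path: str) -> str:
    """Read an image file and return its base64 string for use in a data: URL."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```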

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 2
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 5.0
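As a sanity check, the reported total_train_batch_size follows from the other values: 2 examples per device × 4 devices × 8 gradient-accumulation steps = 64 examples per optimizer step.

```python
# Effective batch size per optimizer step, from the hyperparameters above
train_batch_size = 2              # per-device micro-batch
num_devices = 4                   # multi-GPU
gradient_accumulation_steps = 8

total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
print(total_train_batch_size)  # 64, matching the reported total_train_batch_size
```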

Citation

@article{zhang2026mirrorguard,
  title={MirrorGuard: Toward Secure Computer-Use Agents via Simulation-to-Real Reasoning Correction},
  author={Zhang, Wenqi and Shen, Yulin and Jiang, Changyue and Dai, Jiarun and Hong, Geng and Pan, Xudong},
  journal={arXiv preprint arXiv:2601.12822},
  year={2026},
  url={https://arxiv.org/abs/2601.12822}
}

Details

For more information, visit the GitHub repository or read the paper.

Framework versions

  • PEFT 0.12.0
  • Transformers 4.49.0
  • Pytorch 2.6.0+cu124
  • Datasets 3.2.0
  • Tokenizers 0.21.0