TrustSafeAI

community

https://sites.google.com/site/pinyuchenpage/home

AI & ML interests

Research Demos and Tools for Trustworthy and Safe AI Development and Deployment

Recent Activity

gregH submitted a paper 22 days ago

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

gregH authored a paper 22 days ago

RADAR: Robust AI-Text Detection via Adversarial Learning

gregH authored a paper 22 days ago

Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes

submitted a paper to Daily Papers 22 days ago

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

Paper • 2604.10866 • Published 25 days ago • 65

authored 6 papers 22 days ago

RADAR: Robust AI-Text Detection via Adversarial Learning

Paper • 2307.03838 • Published Jul 7, 2023 • 1

Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes

Paper • 2403.00867 • Published Mar 1, 2024

Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models

Paper • 2412.18171 • Published Dec 24, 2024

Qwen3Guard Technical Report

Paper • 2510.14276 • Published Oct 16, 2025 • 15

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Paper • 2511.04570 • Published Nov 6, 2025 • 242

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

Paper • 2604.10866 • Published 25 days ago • 65

authored a paper about 1 month ago

Emergent Social Intelligence Risks in Generative Multi-Agent Systems

Paper • 2603.27771 • Published Mar 29 • 52

authored 3 papers 4 months ago

Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets

Paper • 2506.05346 • Published Jun 5, 2025

Spectral Insights into Data-Oblivious Critical Layers in Large Language Models

Paper • 2506.00382 • Published May 31, 2025

NCTV: Neural Clamping Toolkit and Visualization for Neural Network Calibration

Paper • 2211.16274 • Published Nov 29, 2022

updated a Space 4 months ago

NCTV: Neural Clamping Toolkit and Visualization

Model-agnostic Toolkit for Neural Network Calibration

updated a Space 7 months ago

Test Time Calibration

Test-time calibration for improving test-time reasoning

published a Space 7 months ago

Test Time Calibration

Test-time calibration for improving test-time reasoning

updated a Space 7 months ago

LLM Physical Safety

LLM benchmark for Physical Safety

authored a paper 8 months ago

GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs

Paper • 2411.14133 • Published Nov 21, 2024 • 1

updated a Space 10 months ago

README

updated a collection 10 months ago

DivEye: Diversity-Driven AI Text Detector

https://openreview.net/forum?id=QuDDXJ47nq • 1 item • Updated Jul 15, 2025

updated a model 11 months ago

TrustSafeAI/AudioDeepfakeDetectors

Updated Jun 18, 2025

updated a Space 11 months ago

CoP Agentic Red-teaming

Generate jailbreak prompts for LLMs using principles