Inquiry regarding translation pipelines and label consistency for sensitive/adversarial datasets

by HeYujie - opened Jan 21

Jan 21

Good work!
I am currently building a multilingual dataset that includes sensitive and adversarial content, and I have encountered several challenges. I would appreciate it if you could share some insights into your methodology:

Translation: Did you use LLMs for the translation? Most safety-aligned models refuse to process such samples. How do you ensure the model responds?And how to maintain high translation quality and fluency?
Aegis: I noticed that you utilize Aegis for validation. But it was not translated into Chinese in current releases. Is there a specific technical reason for not translating it?
Labels: Do you fully trust the original labels provided by the source datasets? In my experience, safety standards vary significantly between different sources; the same content might be labeled as "safe" in one dataset but "unsafe" in another. This conflict can cause significant noise during training. Do you have a specific pipeline for cross-dataset label alignment or de-noising?

I understand if some of your internal processes are proprietary, but any high-level guidance or best practices you could share would be incredibly helpful!
Best regards

thomaslwang

OpenGuardrails org Jan 24

Thanks for the thoughtful questions — these are exactly the kinds of challenges we also ran into while building the dataset. I’ll try to answer at a high level.

Translation
Yes, translating adversarial and toxic samples with LLMs is genuinely difficult — this is something you only fully appreciate once you’ve tried it. Many safety-aligned models will refuse or partially sanitize such content.
Our approach is actually quite simple: since we have prior experience researching prompt injection, we leverage prompt-injection techniques to intentionally steer the LLM into performing faithful translations of these samples. This allows us to bypass unnecessary refusals while preserving the original semantics, toxicity, and attack intent, which is critical for safety evaluation.

Aegis
There was no specific technical or strategic reason for not translating Aegis into Chinese in the current release. It was primarily a matter of prioritization and resource constraints, rather than a fundamental limitation.

Labels
Yes — we do not blindly trust the original labels across datasets. As you pointed out, safety standards differ significantly between sources, and this inconsistency can introduce a lot of noise.
Our solution is to configure dataset-specific strategies for each test set, aligning the labeling and evaluation logic with that dataset’s original standard. In other words, instead of forcing a single global label schema, we adapt the strategy so that it remains consistent with the intent and definitions of each source dataset.

We understand that some of these details can get quite implementation-specific, but hopefully these high-level ideas are useful. Happy to discuss further if you’d like to dive deeper.

Best regards

LiF

Feb 12

•

edited Feb 12

不知怎么滴就搜到你了，还看到了你的T4晋升申请，还有云上业务治理Hackathon。就像问一句，你干这个赚钱吗？

HeYujie

Feb 12

Hope so

HeYujie changed discussion status to closed Feb 12

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment