Inquiry regarding translation pipelines and label consistency for sensitive/adversarial datasets

#2
by HeYujie - opened

Good work!
I am currently building a multilingual dataset that includes sensitive and adversarial content, and I have encountered several challenges. I would appreciate it if you could share some insights into your methodology:

  1. Translation: Did you use LLMs for the translation? Most safety-aligned models refuse to process such samples. How do you ensure the model responds?And how to maintain high translation quality and fluency?

  2. Aegis: I noticed that you utilize Aegis for validation. But it was not translated into Chinese in current releases. Is there a specific technical reason for not translating it?

  3. Labels: Do you fully trust the original labels provided by the source datasets? In my experience, safety standards vary significantly between different sources; the same content might be labeled as "safe" in one dataset but "unsafe" in another. This conflict can cause significant noise during training. Do you have a specific pipeline for cross-dataset label alignment or de-noising?

I understand if some of your internal processes are proprietary, but any high-level guidance or best practices you could share would be incredibly helpful!
Best regards

OpenGuardrails org

Thanks for the thoughtful questions — these are exactly the kinds of challenges we also ran into while building the dataset. I’ll try to answer at a high level.

Translation
Yes, translating adversarial and toxic samples with LLMs is genuinely difficult — this is something you only fully appreciate once you’ve tried it. Many safety-aligned models will refuse or partially sanitize such content.
Our approach is actually quite simple: since we have prior experience researching prompt injection, we leverage prompt-injection techniques to intentionally steer the LLM into performing faithful translations of these samples. This allows us to bypass unnecessary refusals while preserving the original semantics, toxicity, and attack intent, which is critical for safety evaluation.

Aegis
There was no specific technical or strategic reason for not translating Aegis into Chinese in the current release. It was primarily a matter of prioritization and resource constraints, rather than a fundamental limitation.

Labels
Yes — we do not blindly trust the original labels across datasets. As you pointed out, safety standards differ significantly between sources, and this inconsistency can introduce a lot of noise.
Our solution is to configure dataset-specific strategies for each test set, aligning the labeling and evaluation logic with that dataset’s original standard. In other words, instead of forcing a single global label schema, we adapt the strategy so that it remains consistent with the intent and definitions of each source dataset.

We understand that some of these details can get quite implementation-specific, but hopefully these high-level ideas are useful. Happy to discuss further if you’d like to dive deeper.

Best regards

Sign up or log in to comment