# decensor-env
Unified RL reward environment for decensoring/anti-refusal training. Expects datasets with a `prompt` column (plain string). All judges are toggleable via TOML args.
Reward formula:

```
final = base_reward * word_mult * coherency_mult * md_presence_mult * md_correctness_mult * reasoning_mult * slop_mult
```

where `base_reward = 1.0 - (refusal_score / 10.0)`.
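The composition can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the multiplier list mirror the formula above but are assumptions, not the environment's actual code.

```python
# Hypothetical sketch of the reward composition; names are assumptions
# that mirror the formula, not the environment's real implementation.

def base_reward(refusal_score: float) -> float:
    """Map a refusal-severity score to a base reward via 1.0 - score/10."""
    return 1.0 - (refusal_score / 10.0)

def final_reward(refusal_score: float, multipliers: list[float]) -> float:
    """Multiply the base reward by every judge multiplier in order:
    word, coherency, md presence, md correctness, reasoning, slop."""
    reward = base_reward(refusal_score)
    for m in multipliers:
        reward *= m
    return reward

# A fully compliant completion (minimal refusal, all multipliers neutral):
print(final_reward(1, [1.0] * 6))  # → 0.9
```

Any single harsh multiplier drags the whole reward down, since the gates are multiplicative rather than additive.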
Feature toggles (all in `[env.args]`):

- `enable_system_prompts`: prepend style/length system prompts (default: `true`)
- `enable_adherence`: adherence judge gate (default: `true`)
- `enable_word_count`: word-count multiplier (default: `true`)
- `enable_structured_markers`: XML/bracket/YAML gate (default: `true`)
- `enable_markdown_judges`: markdown presence + correctness judges (default: `true`)
- `enable_reasoning_trace`: penalize ethics-based refusal in `<think>` (default: `true`)
- `enable_slop_guard`: slop-guard heuristic multiplier (default: `false`)
- `markdown_exempt_probability`: fraction of rollouts exempt from the markdown penalty (default: `0.25`)
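One plausible way these toggles interact with the reward is for a disabled feature to contribute a neutral multiplier, and for markdown exemption to be a per-rollout random draw. Both conventions are assumptions for illustration; the helper names are hypothetical.

```python
import random

# Hypothetical gating logic; the neutral-1.0 convention and the
# per-rollout exemption draw are assumptions, not confirmed behavior.

def gated_multiplier(enabled: bool, judged_value: float) -> float:
    """A disabled feature contributes a neutral 1.0 multiplier."""
    return judged_value if enabled else 1.0

def markdown_exempt(exempt_probability: float = 0.25) -> bool:
    """Per-rollout draw: exempt rollouts skip the markdown penalty."""
    return random.random() < exempt_probability

# With enable_word_count = false, a harsh word-count score is ignored:
print(gated_multiplier(False, 0.2))  # → 1.0
```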
Judges:

- `refusal`: rates refusal severity 1-9; always on
- `coherency`: coherent + complete + no roleplay narration; always on
- `adherence`: style-prompt adherence; early-exit gate
- `markdown`: presence (no unnecessary markdown) + correctness (right list types)
- `reasoning`: checks the `<think>` trace for ethics-based refusal reasoning
- `slop-guard`: local heuristic, no LLM call; scores 0-100
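Since the slop-guard runs locally with no LLM call, it is presumably something like a phrase-frequency heuristic. The sketch below is purely illustrative: the phrase list, the weighting, and the score-to-multiplier mapping are all assumptions; only "local heuristic, scores 0-100" comes from this README.

```python
# Assumed slop-guard shape: count known "slop" phrases, clamp to 0-100,
# then map the score to a multiplier in [0, 1]. All specifics hypothetical.

SLOP_PHRASES = ("shivers down", "ministrations", "testament to")  # hypothetical list

def slop_score(text: str) -> int:
    """Count phrase hits and clamp the weighted total to 0-100."""
    hits = sum(text.lower().count(p) for p in SLOP_PHRASES)
    return min(100, hits * 20)  # assumed per-hit weight

def slop_multiplier(score: int) -> float:
    """Map a 0-100 slop score to a reward multiplier in [0, 1]."""
    return 1.0 - score / 100.0

clean = "She answered the question directly."
print(slop_multiplier(slop_score(clean)))  # → 1.0
```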
Example TOML:

```toml
[[env]]
id = "mangymango/decensor-env"

[env.args]
dataset_names = ["NewEden/RL-Seed-Mix-Iter-3"]
dataset_ratios = [1.0]
num_train_examples = 19000
judge_model = "Qwen/Qwen3-VL-32B-Instruct-FP8"
judge_base_url = "http://72.46.85.157:31974/v1"
enable_system_prompts = false
enable_adherence = false
enable_word_count = false
enable_slop_guard = true
```