# decensor-env
Unified RL reward environment for decensoring/anti-refusal training. Expects datasets with a `prompt` column (plain string). All judges are toggleable via TOML args.
Reward formula:

```
final = base_reward * word_mult * coherency_mult * md_presence_mult * md_correctness_mult * reasoning_mult * slop_mult
```

where `base_reward = 1.0 - (refusal_score / 10.0)`.
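The composition can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the multiplier list mirror the formula above but are assumptions, not the environment's actual code.

```python
# Hypothetical sketch of the reward composition; names are assumptions
# that mirror the formula, not the environment's real implementation.

def base_reward(refusal_score: float) -> float:
    """Map a refusal-severity score to a base reward via 1.0 - score/10."""
    return 1.0 - (refusal_score / 10.0)

def final_reward(refusal_score: float, multipliers: list[float]) -> float:
    """Multiply the base reward by every judge multiplier in order:
    word, coherency, md presence, md correctness, reasoning, slop."""
    reward = base_reward(refusal_score)
    for m in multipliers:
        reward *= m
    return reward

# A fully compliant completion (minimal refusal, all multipliers neutral):
print(final_reward(1, [1.0] * 6))  # → 0.9
```

Any single harsh multiplier drags the whole reward down, since the gates are multiplicative rather than additive.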
Feature toggles (all in `[env.args]`):

- `enable_system_prompts`: prepend style/length system prompts (default: `true`)
- `enable_adherence`: adherence judge gate (default: `true`)
- `enable_word_count`: word-count multiplier (default: `true`)
- `enable_structured_markers`: XML/bracket/YAML gate (default: `true`)
- `enable_markdown_judges`: markdown presence + correctness judges (default: `true`)
- `enable_reasoning_trace`: penalize ethics-based refusal in `<think>` (default: `true`)
- `enable_slop_guard`: slop-guard heuristic multiplier (default: `false`)
- `markdown_exempt_probability`: fraction of rollouts exempt from the markdown penalty (default: `0.25`)
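One plausible way these toggles interact with the reward is for a disabled feature to contribute a neutral multiplier, and for markdown exemption to be a per-rollout random draw. Both conventions are assumptions for illustration; the helper names are hypothetical.

```python
import random

# Hypothetical gating logic; the neutral-1.0 convention and the
# per-rollout exemption draw are assumptions, not confirmed behavior.

def gated_multiplier(enabled: bool, judged_value: float) -> float:
    """A disabled feature contributes a neutral 1.0 multiplier."""
    return judged_value if enabled else 1.0

def markdown_exempt(exempt_probability: float = 0.25) -> bool:
    """Per-rollout draw: exempt rollouts skip the markdown penalty."""
    return random.random() < exempt_probability

# With enable_word_count = false, a harsh word-count score is ignored:
print(gated_multiplier(False, 0.2))  # → 1.0
```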
Judges:

- `refusal`: rates refusal severity 1-9; always on
- `coherency`: coherent + complete + no roleplay narration; always on
- `adherence`: style-prompt adherence; early-exit gate
- `markdown`: presence (no unnecessary markdown) + correctness (right list types)
- `reasoning`: checks the `<think>` trace for ethics-based refusal reasoning
- `slop-guard`: local heuristic, no LLM call; scores 0-100
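Since the slop-guard runs locally with no LLM call, it is presumably something like a phrase-frequency heuristic. The sketch below is purely illustrative: the phrase list, the weighting, and the score-to-multiplier mapping are all assumptions; only "local heuristic, scores 0-100" comes from this README.

```python
# Assumed slop-guard shape: count known "slop" phrases, clamp to 0-100,
# then map the score to a multiplier in [0, 1]. All specifics hypothetical.

SLOP_PHRASES = ("shivers down", "ministrations", "testament to")  # hypothetical list

def slop_score(text: str) -> int:
    """Count phrase hits and clamp the weighted total to 0-100."""
    hits = sum(text.lower().count(p) for p in SLOP_PHRASES)
    return min(100, hits * 20)  # assumed per-hit weight

def slop_multiplier(score: int) -> float:
    """Map a 0-100 slop score to a reward multiplier in [0, 1]."""
    return 1.0 - score / 100.0

clean = "She answered the question directly."
print(slop_multiplier(slop_score(clean)))  # → 1.0
```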
Example TOML:

```toml
[[env]]
id = "mangymango/decensor-env"

[env.args]
dataset_names = ["NewEden/RL-Seed-Mix-Iter-3"]
dataset_ratios = [1.0]
num_train_examples = 19000
judge_model = "Qwen/Qwen3-VL-32B-Instruct-FP8"
judge_base_url = "http://72.46.85.157:31974/v1"
enable_system_prompts = false
enable_adherence = false
enable_word_count = false
enable_slop_guard = true
```