
Teaching an LLM to behave like a compliance desk: MetaGuard, OpenEnv, and GRPO

Authors: Parth Singhal, Mehakveer Kaur, Kartik Goyal

Single-shot classifiers are a poor match for how ad policy and compliance actually work. In production, decisions are procedural: pull the right policy, check the advertiser, inspect the creative, write an audit record, and only then approve or reject—with traceability when something goes wrong. We built MetaGuard, an OpenEnv-compatible reinforcement-learning environment that forces a language model to do that multi-step dance across real HTTP microservices, not a static rubric in the prompt.

This post is the narrative companion to the repo: github.com/Parth380/meta-ad-policy-sandbox.


The problem we are solving

Enterprise moderation breaks when the model is treated as a one-call oracle:

  • No traceability — you cannot reconstruct why a decision was made.
  • No cross-system context — policy, CRM, and audit live in different systems; the model never learns to orchestrate them.
  • No resilience — APIs flake; agents that do not retry or recover fail in ways that look like “random bad days” in logs.

We wanted an environment where reward comes from following a believable compliance procedure under partial observability and noisy tools—closer to a POMDP over APIs than to a single JSON schema.


What the agent sees, does, and is rewarded for

MetaGuard is a partially observable workflow: the hub on port 8000 orchestrates state and rewards; three FastAPI “enterprise” apps sit on 8001–8003 (regulatory lookup, advertiser CRM, audit log). The agent issues structured actions (query_regulations, analyze_image, check_advertiser_history, request_landing_page, request_id_verification, submit_audit, then approve or reject). External-style APIs fail randomly about 10% of the time, so recovery behavior is part of the game, not an afterthought.
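
To make the loop concrete, here is a minimal Python client sketch against the hub. The action names are the environment's own; the exact endpoint paths and JSON payload shapes below are assumptions for illustration, and demo.py in the repo is the authoritative client.

```python
import requests

HUB = "http://localhost:8000"  # hub: orchestrates state and rewards

# Hypothetical payload shapes: the action names are the env's, but the exact
# JSON schema and endpoint paths here are illustrative, not the repo's contract.
obs = requests.post(f"{HUB}/reset", json={"task": "task_1_healthcare"}).json()

episode = [
    {"action": "query_regulations", "args": {"category": "healthcare"}},
    {"action": "check_advertiser_history", "args": {"advertiser_id": obs.get("advertiser_id")}},
    {"action": "submit_audit", "args": {"summary": "unsubstantiated medical claims; prior strikes"}},
    {"action": "reject", "args": {"reason": "policy violation: unapproved health claims"}},
]

for action in episode:
    step = requests.post(f"{HUB}/step", json=action).json()
    print(step.get("reward"), step.get("done"))
    if step.get("done"):  # terminal decision (approve/reject) ends the episode
        break
```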

Rewards are shaped so shortcuts hurt: wrong phase ordering, skipping audit before a terminal decision, approving high-risk combinations, ignoring ambiguity without gathering signals, padding steps, and blowing the step cap—all of that is penalized. Correct terminal decisions and sensible recovery earn positive signal. Ten task families (task_1_healthcare through task_10_failure) stress healthcare copy, financial scams, multimodal hints, targeting, ambiguity, adversarial text, and deterministic failure scenarios.
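
As a rough mental model of that shaping (not the repo's actual reward code; the flags and magnitudes below are invented for illustration):

```python
# Illustrative shaping only: the real magnitudes and flags live in the hub's
# reward code; these numbers are invented to show why shortcuts score badly.
def shaped_reward(trace: dict, decision_correct: bool) -> float:
    r = 1.0 if decision_correct else -1.0      # terminal decision dominates
    if trace.get("phase_order_violation"):
        r -= 0.5                               # e.g. deciding before querying regulations
    if not trace.get("audit_submitted"):
        r -= 0.5                               # no audit record before approve/reject
    if trace.get("approved_high_risk"):
        r -= 1.0                               # worst case: high-risk combination approved
    r -= 0.05 * trace.get("padding_steps", 0)  # pointless extra calls
    if trace.get("exceeded_step_cap"):
        r -= 0.5
    if trace.get("recovered_from_api_failure"):
        r += 0.2                               # sensible retry/recovery earns signal
    return r
```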

If you only try one thing locally after git clone, run the three enterprise services plus the hub, then use demo.py or inference.py to see the trace shape the organizers care about.


How we trained (Unsloth + TRL GRPO on the live env)

We did not distill a teacher model from offline logs. We used GRPO from Hugging Face TRL with Unsloth for efficient LoRA on unsloth/Llama-3.1-8B-Instruct, and we wired the reward function to the running MetaGuard HTTP API (reset / step on the core env). The same idea is packaged as a script (grpo_train.py) and as a Colab-friendly notebook in the repo.
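
Conceptually, the reward function TRL sees replays each sampled completion against the live environment and returns one scalar per completion. The sketch below follows TRL's reward-function interface, but the parsing convention (one JSON action per line) and the reset/step payloads are assumptions; grpo_train.py has the real wiring.

```python
import json
import requests

HUB = "http://localhost:8000"  # MetaGuard hub (reset / step on the core env)

def metaguard_reward(completions, **kwargs):
    """Score each sampled completion by replaying it against the live env.

    Sketch only: assumes plain-text completions with one JSON action per line
    and the reset/step payloads shown earlier; grpo_train.py has the real wiring.
    """
    rewards = []
    for text in completions:
        requests.post(f"{HUB}/reset", json={})
        total = 0.0
        try:
            for line in text.strip().splitlines():
                action = json.loads(line)  # malformed JSON ends the rollout
                step = requests.post(f"{HUB}/step", json=action).json()
                total += step.get("reward", 0.0)
                if step.get("done"):
                    break
        except (json.JSONDecodeError, requests.RequestException):
            total -= 1.0  # penalize unparseable or broken rollouts
        rewards.append(total)
    return rewards
```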

Artifacts you can open today

| What | Link |
| --- | --- |
| Source & README | github.com/Parth380/meta-ad-policy-sandbox |
| Train in Colab (notebook in repo) | Open grpo_train.ipynb in Colab |
| Fine-tuned weights | parth-1/metaguard-policy-agent-v1 |
| Runnable Space (environment) | MetaGuard Space |
| Training Space (GRPO / Unsloth) | MetaGuard-Train Space |

Note: GRPO + Unsloth expects a GPU. If a CPU-only Space fails at import time, treat Colab + GPU as the judge-rerunnable path and keep the Space for the env UI or a slimmer entrypoint.


What changed after training

  • Baseline: Before RL fine-tuning, the base meta-llama/Meta-Llama-3.1-8B-Instruct model was largely incompatible with the environment's strict constraints, recording a mean initial reward of -0.30, driven primarily by API rejections. Typical failure modes: malformed JSON (e.g., unclosed brackets), ignoring the required phase order (often attempting to submit audits before querying regulations), and schema violations that led to frequent timeouts during execution.

  • After GRPO: Mean reward moved from roughly -0.5 early in training to a consistently positive ~0.45, settling in the +0.35 to +0.40 range by step 75. Initial runs showed policy instability around step 80 (a 4× spike in GRPO loss and a crash in mean return); subsequent tuning flattened those spikes, yielding a more monotonic reward curve and fewer relapses into low-reward behavior. Success rate stabilized after step 40, with the model consistently hitting the reward ceiling despite injected variability. Qualitatively, the model internalized the workflow constraints and mapped states to valid API calls: 100% JSON syntactic compliance with zero parsing errors in later epochs, strict adherence to the mandated pipeline (query_regulations → gather_signals → submit_audit), and recovery from injected API failures by requesting alternative verifications without breaking the operational loop.

  • Repro: Training ran on an NVIDIA L4 (24 GB VRAM) via Hugging Face Spaces, using the unsloth/Llama-3.1-8B-Instruct base model loaded in 4-bit and trained in bfloat16. LoRA used rank r = 32 and alpha = 64, targeting all linear layers. Key GRPOConfig hyperparameters: learning rate 2e-5 over 3 epochs, effective batch size 8 (per-device batch size 1 with 8 gradient accumulation steps), and max completion length 128. A minimal config sketch follows below.
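
A minimal configuration sketch matching those numbers, assuming recent Unsloth and TRL releases (argument names can shift between versions). metaguard_reward is the env-backed reward sketched above, and the one-row prompt dataset is a placeholder for the repo's task prompts:

```python
from unsloth import FastLanguageModel
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# 4-bit base model wrapped with a LoRA adapter (r=32, alpha=64, all linear layers).
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.1-8B-Instruct", max_seq_length=2048, load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=32, lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

config = GRPOConfig(
    output_dir="metaguard-grpo",
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size 8
    max_completion_length=128,
    num_generations=4,               # completions compared per prompt group
    bf16=True,
)

# Placeholder dataset: in the repo, prompts describe each MetaGuard task family.
prompt_dataset = Dataset.from_list([{"prompt": "Review this healthcare ad under Meta policy..."}])

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=metaguard_reward,   # env-backed reward sketched above
    args=config,
    train_dataset=prompt_dataset,
)
trainer.train()
```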

Figures: training curves (training_first, Training_third).

Why this matters

Trust & safety, ads integrity, and any team shipping tool-using agents need benchmarks where procedure and reliability matter as much as final accuracy. MetaGuard is a small but honest stress test: multiple apps, stochastic failures, explicit audit requirements, and rewards that punish the shortcuts a single prompt would happily paper over.

If this environment helps the community standardize “multi-app RL for enterprise workflows,” we will consider the project a success.


Try it

git clone https://github.com/Parth380/meta-ad-policy-sandbox.git
cd meta-ad-policy-sandbox
pip install -e .
# Launch regulatory :8001, CRM :8002, audit :8003, then hub :8000 — see README
python grpo_train.py   # CUDA machine with env up

Hackathon framing: Theme 3.1 (professional tasks, multi-step reasoning & policy compliance) and Scaler AI Labs bonus track (multi-app RL for enterprise workflows).


License

Project code: MIT (see repository). Hub model card may list a different license for the weights; follow parth-1/metaguard-policy-agent-v1 for the released artifact.