Teaching LLMs When Not To Trust Context

Modern LLM applications are increasingly built around retrieval: give the model more documents, more search results, more memory, more context. That works beautifully until one of those retrieved snippets is wrong.

Then a strange thing happens. The model may already "know" the correct answer, but the moment a document says something false with enough confidence, the model often follows the document. In real systems, that failure mode is not academic. It looks like a support bot following an outdated policy, a research assistant citing a fabricated source, or an enterprise agent trusting the wrong internal note.

The problem is not that the model cannot read context. The problem is that it does not know when context deserves suspicion.

The Environment

ContextCorruption-Env turns that behavior into a trainable RL environment.

Each episode gives the agent:

  • one factual question;
  • eight retrieved documents;
  • one to four deliberately corrupted documents;
  • a limited interaction budget.
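To make the episode layout concrete, here is a minimal sketch of how one episode could be represented. The class names, field names, the `make_episode` helper, and the default budget of 6 are all illustrative assumptions, not the environment's actual code.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Document:
    doc_id: int
    text: str
    is_corrupted: bool = False  # ground-truth label, hidden from the agent


@dataclass
class Episode:
    question: str
    gold_answer: str
    documents: List[Document]
    budget: int  # remaining interaction budget (hypothetical unit: actions)


def make_episode(
    question: str,
    gold_answer: str,
    clean_texts: List[str],
    corrupt: Callable[[str], str],
    rng: random.Random,
    budget: int = 6,  # assumed value; the post does not state the real budget
) -> Episode:
    """Build an episode with eight documents, one to four of them corrupted."""
    if len(clean_texts) != 8:
        raise ValueError("an episode uses exactly eight retrieved documents")
    n_corrupt = rng.randint(1, 4)  # randint is inclusive on both ends
    corrupted_ids = set(rng.sample(range(8), n_corrupt))
    documents = [
        Document(i, corrupt(text) if i in corrupted_ids else text, i in corrupted_ids)
        for i, text in enumerate(clean_texts)
    ]
    return Episode(question, gold_answer, documents, budget)


episode = make_episode(
    question="In what year did Apollo 11 land on the Moon?",
    gold_answer="1969",
    clean_texts=[f"Source {i}: Apollo 11 landed on the Moon in 1969." for i in range(8)],
    corrupt=lambda text: text.replace("1969", "1971"),
    rng=random.Random(0),
)
n_bad = sum(d.is_corrupted for d in episode.documents)
print(n_bad, "of", len(episode.documents), "documents are corrupted")
```

Keeping the `is_corrupted` label on the document but out of the agent's observation is what lets the flags be scored without a judge model.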

Some corruptions are simple: a number changes, or a year is shifted. Others are harder: an entity is swapped, a claim is inverted, or a fake authoritative sentence is inserted with a convincing institution and year.
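As a rough illustration of these corruption types, the sketch below implements three of them as plain string transforms. The function names, the regular expression, and the sentence template are assumptions made for this example; the environment's real corruption logic may differ.

```python
import re


def shift_year(text: str, delta: int = 2) -> str:
    """Shift the first four-digit year (1000-2999) found in the text by `delta`."""
    return re.sub(
        r"\b([12]\d{3})\b",
        lambda m: str(int(m.group(1)) + delta),
        text,
        count=1,
    )


def swap_entity(text: str, original: str, replacement: str) -> str:
    """Replace one named entity with another, leaving the rest of the text intact."""
    return text.replace(original, replacement)


def insert_fake_authority(text: str, institution: str, year: int, claim: str) -> str:
    """Append a fabricated but authoritative-sounding sentence."""
    return f"{text} According to a {year} report by {institution}, {claim}."


clean = "Apollo 11 landed on the Moon in 1969."
print(shift_year(clean))
print(swap_entity(clean, "Apollo 11", "Apollo 12"))
print(insert_fake_authority(clean, "the Example Space Institute", 2021,
                            "the landing took place in 1971"))
```

The simple corruptions change a single token, while the fake-authority variant adds fluent new text, which is part of why it is harder to catch by surface inspection alone.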

The agent has to do two jobs at once:

  1. answer the question;
  2. identify which documents are suspicious.

This makes the task more realistic than plain QA. The agent is not only predicting an answer. It is modeling a small information world where some sources are reliable and some are not.

What The Agent Can Do

The action space is intentionally small:

  • read_doc: inspect a document and spend budget;
  • flag_suspicious: mark a document as corrupted;
  • unflag_doc: remove a flag;
  • submit_answer: finish with an answer and confidence.
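The four actions can be sketched as a small state transition function. The action names come from the list above; the dictionary format, the `EpisodeState` fields, and the rule that only read_doc consumes budget are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class EpisodeState:
    budget: int
    read: Set[int] = field(default_factory=set)
    flagged: Set[int] = field(default_factory=set)
    answer: Optional[str] = None
    confidence: Optional[float] = None
    done: bool = False


def apply_action(state: EpisodeState, action: dict) -> EpisodeState:
    """Apply one agent action in place and return the updated state."""
    if state.done:
        raise ValueError("episode already finished")
    kind = action["type"]
    if kind == "read_doc":
        if state.budget <= 0:
            raise ValueError("no budget left to read another document")
        state.budget -= 1  # assumption: each read costs one unit of budget
        state.read.add(action["doc_id"])
    elif kind == "flag_suspicious":
        state.flagged.add(action["doc_id"])
    elif kind == "unflag_doc":
        state.flagged.discard(action["doc_id"])  # no error if it was not flagged
    elif kind == "submit_answer":
        state.answer = action["answer"]
        state.confidence = float(action["confidence"])
        state.done = True
    else:
        raise ValueError(f"unknown action type: {kind!r}")
    return state


state = EpisodeState(budget=2)
for action in [
    {"type": "read_doc", "doc_id": 3},
    {"type": "flag_suspicious", "doc_id": 3},
    {"type": "flag_suspicious", "doc_id": 5},
    {"type": "unflag_doc", "doc_id": 5},
    {"type": "submit_answer", "answer": "1969", "confidence": 0.8},
]:
    apply_action(state, action)
print(state)
```

Having unflag_doc as its own action lets a policy revise an early suspicion after reading more evidence, instead of committing to its first guess.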

A weak policy can guess. A better policy has to notice contradictions, avoid over-flagging clean sources, and calibrate confidence when evidence conflicts.

Reward Design

The reward is deterministic. There is no LLM judge.

It combines:

  • answer correctness;
  • recall over corrupted documents;
  • precision, so flagging everything is punished;
  • confidence calibration;
  • a small efficiency bonus.

Combining these terms matters because the reward then teaches the full behavior we care about. The agent does not get maximum reward for being right by accident. It needs to be right, skeptical, and appropriately confident.
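One way such a deterministic reward could be assembled is sketched below. The weights, the Brier-style calibration term, and the function signature are assumptions chosen for illustration; they are not the environment's actual coefficients.

```python
from typing import Set


def episode_reward(
    answer_correct: bool,
    flagged: Set[int],
    corrupted: Set[int],
    confidence: float,
    budget_used: int,
    budget_total: int,
) -> float:
    """Combine the five reward terms into one scalar in [0, 1]."""
    true_positives = len(flagged & corrupted)
    recall = true_positives / len(corrupted) if corrupted else 1.0
    # Precision drops when clean documents are flagged, so flagging everything is punished.
    precision = true_positives / len(flagged) if flagged else 0.0
    # Brier-style calibration: high confidence is rewarded only when the answer is right.
    calibration = 1.0 - (confidence - float(answer_correct)) ** 2
    efficiency = 1.0 - budget_used / budget_total if budget_total else 0.0
    # Hypothetical weights that sum to 1.0.
    return (
        0.40 * float(answer_correct)
        + 0.20 * recall
        + 0.20 * precision
        + 0.15 * calibration
        + 0.05 * efficiency
    )


corrupted = {1, 4}
precise = episode_reward(True, {1, 4}, corrupted, 0.9, 3, 6)
flag_all = episode_reward(True, set(range(8)), corrupted, 0.9, 3, 6)
lucky = episode_reward(True, set(), corrupted, 0.9, 3, 6)
print(precise > flag_all > lucky)
```

Under these assumed weights, a correct answer with no flags scores below a correct answer with indiscriminate flags, which in turn scores below a correct answer with exact flags.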

Training

We trained Qwen2-1.5B-Instruct with GRPO using Unsloth and Hugging Face TRL. The run was sized for hackathon constraints, so this is not meant to be a final production model. It is evidence that the environment produces a real learning signal.
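For readers unfamiliar with GRPO, its core idea is to score each sampled completion relative to the other completions for the same prompt. The sketch below shows that group-relative normalization in plain Python. It is a simplified illustration and not the TRL code path; the function name, the `eps` value, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may use the sample std
    # eps keeps the division finite when every completion earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical environment rewards for completions of one prompt.
advantages = group_relative_advantages([0.1, 0.3, 0.3, 0.5])
print(advantages)
```

Completions that beat their group's mean get a positive advantage and are reinforced; a group where every completion earns the same reward contributes no gradient signal, which is one reason a reward with several graded terms is useful here.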

The random baseline averaged 0.1302 reward over 100 episodes.

The trained GRPO run reached a final logged reward of 0.3289, roughly 2.5 times the random baseline, and the reward curve rises above that baseline during training.

[Figure: reward curve]

[Figure: loss curve]

The trained LoRA adapter has been pushed to the Hugging Face Hub, and the hosted Space loads it through /model/infer.

We exported the training evidence from WandB into the repository so reviewers do not need access to a private project. The exported artifacts include scalar history, raw output logs, completion tables, and the charts shown here.

Additional local charts:

Completion samples from the run are available in assets/completions_samples.md.

Why It Matters

RAG systems are becoming the default interface between LLMs and the real world. If agents are going to operate over company documents, scientific papers, policies, calendars, emails, and tool outputs, they need a learned habit of source skepticism.

ContextCorruption-Env is a compact version of that problem. It asks a simple question:

Can we train a model not just to use context, but to distrust bad context?

That capability matters for enterprise search, research workflows, legal and policy review, support automation, and personal assistant systems where the cost of blindly trusting the wrong document can be high.

Links