---
title: RLM Model Evaluation
sdk: docker
hardware: t4-small
---

# RLM Model Evaluation

Evaluates the trained needle-in-haystack model against the base model.

## Models

- Base: Qwen/Qwen3-0.6B-Base
- Trained: mindchain/qwen3-0.6b-rlm-needle

## Expected Results

- Base: ~25% accuracy (random guessing)
- Trained: 50-75% accuracy (after GRPO training)
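
The accuracy figures above can be read as exact-match rates over a set of needle-in-haystack prompts. The following is a minimal sketch of that comparison, not the evaluation code used by this Space. The helper names `accuracy` and `chance_level`, the exact-match scoring rule, and the four-option answer format (inferred from the ~25% random-guess baseline) are all assumptions, and the predictions are toy placeholders rather than real model outputs.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answer.

    Assumption: scoring is exact match after stripping whitespace.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one example")
    correct = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return correct / len(answers)


def chance_level(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing over num_choices options."""
    return 1.0 / num_choices


# Hypothetical toy data for illustration only; not real model outputs.
answers = ["A", "C", "B", "D"]
base_preds = ["A", "B", "D", "C"]
trained_preds = ["A", "C", "B", "A"]

base_acc = accuracy(base_preds, answers)
trained_acc = accuracy(trained_preds, answers)

print(f"chance:  {chance_level(4):.0%}")
print(f"base:    {base_acc:.0%}")
print(f"trained: {trained_acc:.0%}")
```

In a real run, `base_preds` and `trained_preds` would be the answers generated by Qwen/Qwen3-0.6B-Base and mindchain/qwen3-0.6b-rlm-needle on the same prompts, so that the two accuracies are directly comparable.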