Search-Gen-V-4b

Search-Gen-V is not only capable of efficiently determining whether text segments meet various evaluation criteria, but it can also output its reasoning process, thereby enhancing the transparency and interpretability of the evaluation results. At the same time, it supports batch verification across multiple evaluation criteria and significantly improves reasoning efficiency and resource utilization by constraining the generated tokens, maintaining high accuracy while reducing computational costs.

Model Details

Base model: [Qwen3-4B]
Fine-tuning method: [SFT+DAPO]
Training framework: [VeRL]

Usage Example

MODEL_PATH=$1
python3 -m sglang.launch_server \
    --model-path ${MODEL_PATH} \
    --tp-size 8 \
    --max-running-requests 160 \
    --cuda-graph-max-bs 160 \
    --mem-fraction-static 0.8 \
    --reasoning-parser qwen3 \
    --tool-call-parser llama3 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

Result

Table 1. Results on the eval set

Verifier Model	Rubric Precision	Rubric Recall	Rubric F1	Sample Precision	Sample Recall	Sample F1	Avg. F1
Qwen3-1.7B	0.41	0.49	0.34	0.48	0.40	0.32	0.33
Qwen2.5-3B	0.42	0.47	0.43	0.49	0.46	0.43	0.43
Qwen3-4B	0.56	0.62	0.57	0.61	0.58	0.58	0.58
Qwen3-8B	0.54	0.66	0.55	0.62	0.61	0.57	0.56
LLaMA-3.1-8B	0.45	0.54	0.42	0.34	0.41	0.32	0.37
Qwen3-30B-A3B	0.56	0.66	0.56	0.63	0.62	0.62	0.58
Qwen2.5-32B-Instruct	0.60	0.67	0.60	0.67	0.68	0.64	0.62
Search-Gen-V-1.7B (SFT)	0.63	0.62	0.62	0.66	0.66	0.66	0.64
Search-Gen-V-4B (SFT)	0.70	0.66	0.68	0.72	0.72	0.71	0.70
Search-Gen-V-4B (SFT+RL)	0.71	0.68	0.70	0.74	0.74	0.73	0.72
Qwen3-235B-A22B-Instruct-2507	0.72	0.73	0.73	0.76	0.76	0.76	0.74

Table 2. Accuracy comparison on verifying rubrics in longform answers from DeepResearch Bench

Verifier Model Precision Recall F1

Qwen3-4B 0.42 0.56 0.42

Search-Gen-V-4B 0.59 0.57 0.57

Qwen3-235B-A22B 0.57 0.67 0.61

Verifier Model	Precision	Recall	F1
Qwen3-4B	0.42	0.56	0.42
Search-Gen-V-4B	0.59	0.57	0.57
Qwen3-235B-A22B	0.57	0.67	0.61

Table 3. Results on the short-form workload, HotpotQA

Verifier Model	Precision	Recall	F1
EM	0.84	0.80	0.82
Qwen3-4B	0.83	0.70	0.71
Search-Gen-V-4B	0.86	0.76	0.77
Qwen3-235B-A22B	0.87	0.78	0.80
EM + Qwen3-4B	0.94	0.92	0.93
EM + Search-Gen-V-4B	0.95	0.93	0.94
EM + Qwen3-235B-A22B	0.96	0.94	0.95

Citation

@article{ma2025searchgenv,
  title={AN EFFICIENT RUBRIC-BASED GENERATIVE VERIFIER FOR SEARCH-AUGMENTED LLMS},
  author={Ma, Linyue and Xu, Yilong and Long, Xiang and Zheng, Zhi},
  journal={arXiv preprint arXiv:2510.14660},
  year={2025},
  url={https://arxiv.org/abs/2510.14660}
}

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Paper for lnm1p/search-gen-v-4b

An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs

Paper • 2510.14660 • Published Oct 16, 2025 • 1

lnm1p
/

search-gen-v-4b