Instructions for using FlagEval/flageval_judgemodel with libraries and local apps.

Libraries

How to use FlagEval/flageval_judgemodel with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="FlagEval/flageval_judgemodel")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("FlagEval/flageval_judgemodel")
model = AutoModelForCausalLM.from_pretrained("FlagEval/flageval_judgemodel")

Local Apps

vLLM

How to use FlagEval/flageval_judgemodel with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "FlagEval/flageval_judgemodel"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FlagEval/flageval_judgemodel",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
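
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are hypothetical, and the defaults simply mirror the curl example above (port 8000, `max_tokens` 512, `temperature` 0.5).

```python
import json
import urllib.request

MODEL_ID = "FlagEval/flageval_judgemodel"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("Once upon a time,")` returns the generated text. Passing `base_url="http://localhost:30000"` targets a server listening on port 30000 instead.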

Use Docker

docker model run hf.co/FlagEval/flageval_judgemodel

SGLang

How to use FlagEval/flageval_judgemodel with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "FlagEval/flageval_judgemodel" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FlagEval/flageval_judgemodel",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "FlagEval/flageval_judgemodel" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FlagEval/flageval_judgemodel",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use FlagEval/flageval_judgemodel with Docker Model Runner:
```
docker model run hf.co/FlagEval/flageval_judgemodel
```

flageval_judgemodel Card

Model Details

flageval_judgemodel is a judge LLM (also called a GenRM, or generative reward model) developed by the FlagEval team (https://flageval.baai.ac.cn/#/home).

Developed by: FlagEval, BAAI
Model type: An auto-regressive language model based on the transformer architecture.
License: Non-commercial license
Finetuned from model: Vicuna.

Uses

flageval_judgemodel is designed to evaluate the performance of large language models on the CLCC dataset (https://huggingface.co/datasets/eyuansu71/CLCC_v1), the Chinese Linguistics & Cognition Challenge. It aims to provide automated evaluation, potentially replacing human judgment in assessing model outputs.

Quickstart

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def promptify(prompt, pred, gold):
    sys = "We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above.\nPlease rate the helpfulness, relevance, accuracy, level of details of their responses. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.\nPlease first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space. In the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment."
    prompt_template = f"You are a helpful and precise assistant for checking the quality of the answer.\n[Question]\n{prompt}\n\n[The Start of Assistant 1's Answer]\n{gold}\n\n[The End of Assistant 1's Answer]\n\n[The Start of Assistant 2's Answer]\n{pred}\n\n[The End of Assistant 2's Answer]\n\n[System]\n{sys}\n\n### Response:10"

    return prompt_template

# attn_implementation="flash_attention_2" requires the flash-attn package and a compatible GPU.
model = AutoModelForCausalLM.from_pretrained(
    "FlagEval/flageval_judgemodel",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
).cuda()
tokenizer = AutoTokenizer.from_pretrained("FlagEval/flageval_judgemodel")

prompt, pred, gold = '1、约翰喜欢看电影，玛丽也喜欢。\n2、约翰也喜欢看足球比赛。\n请问以上两句话是否是一个意思？', "不一样", "不一样"

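with torch.no_grad():
    data_sample = promptify(prompt, pred, gold)
    input_ids = tokenizer(data_sample, return_tensors="pt").input_ids.cuda()
    output_ids = model.generate(
        input_ids,
        max_new_tokens=128,
    )
    # Decode only the newly generated tokens. Slicing the decoded text by
    # len(data_sample) is fragile because decoding may not reproduce the prompt exactly.
    new_token_ids = output_ids[0][input_ids.shape[1]:]
    ans = tokenizer.decode(new_token_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True).strip()
    # Read the first whitespace-separated token as the judge's score;
    # int(ans) would raise ValueError if an explanation follows the score.
    fields = ans.split()
    pred_label = 1 if fields and fields[0] == "10" else 0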
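
When judging many samples, the generated text has to be turned into a label for each one and then aggregated. The sketch below shows one way to do that without loading the model. The helper names (`parse_judge_score`, `to_label`, `accuracy`) are hypothetical, and it assumes, as in the Quickstart, that a score of 10 marks the prediction as correct and that the score is the first integer in the generated text.

```python
import re


def parse_judge_score(generated_text):
    """Return the first integer found in the judge's generated text, or None."""
    match = re.search(r"\d+", generated_text)
    return int(match.group()) if match else None


def to_label(generated_text, full_score=10):
    """Map a judge output to 1 (full score) or 0 (anything else, including no score)."""
    return 1 if parse_judge_score(generated_text) == full_score else 0


def accuracy(generated_texts):
    """Fraction of judge outputs that received the full score."""
    labels = [to_label(text) for text in generated_texts]
    return sum(labels) / len(labels) if labels else 0.0


# Illustrative judge outputs (made up for this example, not real model output).
outputs = ["10", " 3\nThe answer misses the point.", ""]
print([to_label(o) for o in outputs])  # [1, 0, 0]
print(accuracy(outputs))
```

Treating an unparseable output as 0 is a conservative choice; a stricter pipeline might instead log and skip such samples.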

Safetensors

Model size: 33B params
Tensor type: BF16