Instructions for using Aman/selfrag-zh_baichuan2_7b_chat with libraries, notebooks, and local apps.
- Libraries
- Transformers
How to use Aman/selfrag-zh_baichuan2_7b_chat with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Aman/selfrag-zh_baichuan2_7b_chat", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Aman/selfrag-zh_baichuan2_7b_chat", trust_remote_code=True, dtype="auto")
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Aman/selfrag-zh_baichuan2_7b_chat with vLLM:
Install from pip and serve the model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Aman/selfrag-zh_baichuan2_7b_chat"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Aman/selfrag-zh_baichuan2_7b_chat",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
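The same OpenAI-compatible endpoint can also be called from Python. This is a minimal sketch, assuming the openai package is installed and the server started above is listening on port 8000:

from openai import OpenAI

# Point the OpenAI client at the local vLLM server (no real API key is needed)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.completions.create(
    model="Aman/selfrag-zh_baichuan2_7b_chat",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(response.choices[0].text)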
- SGLang
How to use Aman/selfrag-zh_baichuan2_7b_chat with SGLang:
Install from pip and serve the model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Aman/selfrag-zh_baichuan2_7b_chat" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Aman/selfrag-zh_baichuan2_7b_chat",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'

Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path "Aman/selfrag-zh_baichuan2_7b_chat" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Aman/selfrag-zh_baichuan2_7b_chat",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
- Docker Model Runner
How to use Aman/selfrag-zh_baichuan2_7b_chat with Docker Model Runner:
docker model run hf.co/Aman/selfrag-zh_baichuan2_7b_chat
This model is a 7B Chinese version of Self-RAG.
It is fine-tuned from Baichuan2-7B-Chat on a sample of BELLE SFT data, interleaved with retrieved passages from Chinese Wikipedia (zhwiki). The reflection tokens are aligned with the original (English) version of Self-RAG, so the usage is the same. Hope you enjoy it.
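As in the original Self-RAG, the answer is interleaved with reflection tokens such as [Retrieval], [No Retrieval], [Relevant], and [Utility:5] (see the example predictions at the end of this card). Below is a minimal, illustrative sketch for post-processing the raw output; the token list and regular expressions are assumptions based on those examples, not an official API:

import re

# Reflection tokens emitted by the model, following the original Self-RAG convention
REFLECTION_TOKENS = ["[Retrieval]", "[No Retrieval]", "[Relevant]", "[Irrelevant]"]

def needs_retrieval(raw_output: str) -> bool:
    # If the model emits [Retrieval], a passage should be retrieved and added to the prompt
    return "[Retrieval]" in raw_output

def strip_reflection_tokens(raw_output: str) -> str:
    # Remove reflection/utility tokens and the <paragraph> span to get a clean answer for display
    text = re.sub(r"<paragraph>.*?</paragraph>", "", raw_output, flags=re.S)
    text = re.sub(r"\[Utility:\d\]", "", text)
    for token in REFLECTION_TOKENS:
        text = text.replace(token, "")
    return text.replace("</s>", "").strip()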
Data
The data used to train the model is also available (FINAL_OUTPUT_4w.jsonl); it is constructed from BELLE SFT data and Chinese Wikipedia documents.
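Each line of FINAL_OUTPUT_4w.jsonl is one JSON object. A minimal sketch for loading and inspecting it; the field names are not documented in this card, so the snippet only prints the keys of the first record:

import json

# Load the released training data (one JSON object per line)
with open("FINAL_OUTPUT_4w.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records))
print(records[0].keys())  # inspect the (undocumented) field names first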
Usage
The Critic Model
The critic model is released in the critic/ folder. However, due to the limited quantity and quality of the critic training data, its performance is still some distance from perfect.
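If the critic/ folder contains a standard Transformers checkpoint, one way to load it is via the subfolder argument. This is only a sketch under that assumption; adjust the paths if the folder is organized differently:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: critic/ holds a full checkpoint inside this repository
critic_tokenizer = AutoTokenizer.from_pretrained(
    "Aman/selfrag-zh_baichuan2_7b_chat", subfolder="critic", trust_remote_code=True
)
critic_model = AutoModelForCausalLM.from_pretrained(
    "Aman/selfrag-zh_baichuan2_7b_chat",
    subfolder="critic",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)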
The Generator
I found some output errors when adopting vLLM to accelerate generation, and I am not sure whether they come from precision issues or from vLLM's implementation. Therefore, the example below uses the original generate method from Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is required for Baichuan2's custom model and tokenizer code
tokenizer = AutoTokenizer.from_pretrained(YOUR_TOKENIZER_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    YOUR_MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)
### set your retriever if necessary
retriever = setup_retriever(YOUR_RETRIEVER_PATH)
def format_prompt(input, paragraph=None):
    prompt = "### Instruction:\n{0}\n\n### Response:".format(input)
    if paragraph is not None:
        prompt += "[Retrieval]<paragraph>{0}</paragraph>".format(paragraph)
    return prompt
while True:
    query = input("[Human]: ")
    prompt = format_prompt(query)
    sequences = model.generate(
        **tokenizer(prompt, return_tensors='pt').to(model.device),
        do_sample=False,
        num_beams=5,
        # top_k=10,
        # top_p=0.8,
        temperature=0.9,  # has no effect while do_sample=False
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=1024,
        min_new_tokens=1,
        repetition_penalty=1.5,
    )
    for seq in sequences:
        print(f"[Model]: {tokenizer.decode(seq, skip_special_tokens=False)}")
        print("-"*50)
    print("="*50)
# query_1 = "你好呀"  ("Hi there")
# Model prediction: [No Retrieval] 你好!有什么我可以帮你解答的问题吗? [Utility:5] </s>
#   ("Hello! Is there any question I can help you with?")
# query_2 = "故宫三大殿是哪些?"  ("What are the three great halls of the Forbidden City?")
# Model prediction: [Retrieval] <paragraph> ... (this query requires factual grounding, so call a retriever) </paragraph> [Relevant] 太和殿、中和殿、保和殿 [Utility:5] </s>
#   ("the Hall of Supreme Harmony, the Hall of Central Harmony, and the Hall of Preserving Harmony")
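When the model predicts [Retrieval] (as in query_2 above), the retriever configured earlier can supply a passage, and the query can be re-run with that passage inserted through format_prompt. The following is only an illustrative sketch on top of the code above; retrieve_top1 is a hypothetical stand-in for whatever interface your retriever exposes:

def answer_with_retrieval(query):
    # First pass without a passage: check whether the model asks for retrieval
    inputs = tokenizer(format_prompt(query), return_tensors="pt").to(model.device)
    first = tokenizer.decode(
        model.generate(**inputs, do_sample=False, num_beams=5,
                       max_new_tokens=1024, repetition_penalty=1.5)[0],
        skip_special_tokens=False,
    )
    if "[Retrieval]" not in first:
        return first
    # Second pass: insert a retrieved passage into the prompt and regenerate
    paragraph = retriever.retrieve_top1(query)  # hypothetical retriever call
    inputs = tokenizer(format_prompt(query, paragraph), return_tensors="pt").to(model.device)
    return tokenizer.decode(
        model.generate(**inputs, do_sample=False, num_beams=5,
                       max_new_tokens=1024, repetition_penalty=1.5)[0],
        skip_special_tokens=False,
    )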