How to use phatvo/Meta-Llama3.1-8B-Instruct-RAFT with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="phatvo/Meta-Llama3.1-8B-Instruct-RAFT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("phatvo/Meta-Llama3.1-8B-Instruct-RAFT")
model = AutoModelForCausalLM.from_pretrained("phatvo/Meta-Llama3.1-8B-Instruct-RAFT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

How to use phatvo/Meta-Llama3.1-8B-Instruct-RAFT with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "phatvo/Meta-Llama3.1-8B-Instruct-RAFT"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "phatvo/Meta-Llama3.1-8B-Instruct-RAFT",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
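
The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; it assumes the vLLM server is running on `localhost:8000` as started above and that the response follows the usual OpenAI chat-completions layout. The helper names `build_chat_request` and `ask` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "phatvo/Meta-Llama3.1-8B-Instruct-RAFT"


def build_chat_request(question, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": question}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, base_url="http://localhost:8000"):
    """Send the request and return the assistant's reply text.

    Assumes the OpenAI-style response shape: choices[0].message.content.
    """
    request = build_chat_request(question, base_url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(ask("What is the capital of France?"))
```

For the SGLang server, the same sketch should work with `base_url="http://localhost:30000"`, since it exposes the same endpoint path.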
How to use phatvo/Meta-Llama3.1-8B-Instruct-RAFT with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "phatvo/Meta-Llama3.1-8B-Instruct-RAFT" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "phatvo/Meta-Llama3.1-8B-Instruct-RAFT",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'

# Or start the SGLang server with Docker:
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "phatvo/Meta-Llama3.1-8B-Instruct-RAFT" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "phatvo/Meta-Llama3.1-8B-Instruct-RAFT",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'

How to use phatvo/Meta-Llama3.1-8B-Instruct-RAFT with Docker Model Runner:
docker model run hf.co/phatvo/Meta-Llama3.1-8B-Instruct-RAFT
LoRA adapters for meta-llama/Meta-Llama-3.1-8B-Instruct, trained on 100 context samples from the HotpotQA dataset using the RAFT method. The adapters help the model reason through the provided context and return more accurate answers.
Evaluated on the full validation set of HotpotQA.
| type | exact_match | f1 | precision | recall |
|---|---|---|---|---|
| pretrained | 0.2980 | 0.3979 | 0.4116 | 0.5263 |
| finetuned | 0.3606 | 0.4857 | 0.4989 | 0.5318 |
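
The evaluation script behind the table is not shown here. As a rough sketch, the four columns can be computed per example with the token-overlap scoring commonly used for extractive QA benchmarks; the normalization steps below (lowercasing, dropping punctuation and English articles) are an assumption, and `normalize_answer` and `score` are illustrative names.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def score(prediction, gold):
    """Return exact match, F1, precision and recall for one example."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    exact_match = float(pred_tokens == gold_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return {"exact_match": exact_match, "f1": 0.0,
                "precision": 0.0, "recall": 0.0}

    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"exact_match": exact_match, "f1": f1,
            "precision": precision, "recall": recall}
```

If each metric is averaged over examples in this way, the mean F1 can end up below both the mean precision and the mean recall, which is consistent with the numbers in the table.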
The fine-tuned version improves F1 by about 22% (relative) and the mean of the four metrics by about 15% (relative).
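
The quoted improvements can be checked against the table. This short sketch assumes that both percentages are relative changes and that "on average" refers to the mean of the four metric columns, which is the reading that reproduces the stated figure.

```python
# Scores copied from the evaluation table.
pretrained = {"exact_match": 0.2980, "f1": 0.3979,
              "precision": 0.4116, "recall": 0.5263}
finetuned = {"exact_match": 0.3606, "f1": 0.4857,
             "precision": 0.4989, "recall": 0.5318}


def relative_gain(before, after):
    """Relative change from `before` to `after`, as a fraction."""
    return (after - before) / before


f1_gain = relative_gain(pretrained["f1"], finetuned["f1"])

# Assumption: "on average" means the mean of the four metrics.
mean_before = sum(pretrained.values()) / len(pretrained)
mean_after = sum(finetuned.values()) / len(finetuned)
mean_gain = relative_gain(mean_before, mean_after)

print(f"F1: {f1_gain:.1%}, mean of metrics: {mean_gain:.1%}")
# F1: 22.1%, mean of metrics: 14.9%
```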
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
model_id = "phatvo/Meta-Llama3.1-8B-Instruct-RAFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", revision="main", trust_remote_code=True
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
inst = "Given the question and context below, thinking in logical reasoning way for your answer.\
Please provide only your answer in this format: CoT Answer: {reason} <ANSWER>: {answer}."
context = ""
question = ""
prompt = f"{context}\n{question}"
chat = [
    {"role": "system", "content": inst},
    {"role": "user", "content": prompt},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False)
output = pipe(
    prompt,
    temperature=0.001,
    max_new_tokens=1024,  # recommended to set this above 800
    return_full_text=False,
    do_sample=True,
)
print(output[0]["generated_text"])
# CoT Answer: thoughts... <ANSWER>: final_answer...
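
Since the model is asked to reply in the form `CoT Answer: {reason} <ANSWER>: {answer}`, the final answer usually has to be separated from the reasoning before scoring or display. The helper below is a minimal sketch of that step; `split_cot_answer` is an illustrative name, and the fallback for replies without the `<ANSWER>:` tag is an assumption rather than documented behavior.

```python
import re

ANSWER_TAG = "<ANSWER>:"


def split_cot_answer(generated_text):
    """Split a reply into (reasoning, answer) at the last <ANSWER>: tag."""
    reasoning, tag, answer = generated_text.rpartition(ANSWER_TAG)
    if not tag:
        # Assumption: with no tag present, treat the whole reply as the answer.
        return "", generated_text.strip()
    # Drop the leading "CoT Answer:" label if the model emitted it.
    reasoning = re.sub(r"^\s*CoT Answer:\s*", "", reasoning)
    return reasoning.strip(), answer.strip()


reasoning, answer = split_cot_answer(
    "CoT Answer: Paris is the capital of France. <ANSWER>: Paris"
)
print(answer)
# Paris
```

Splitting at the last occurrence of the tag keeps the helper usable when the reasoning itself happens to mention `<ANSWER>:`.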