Libraries

How to use Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering")
model = AutoModelForCausalLM.from_pretrained("Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
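
The curl call above sends a generic question. For evidence filtering, the request would more plausibly carry the training instruction as the system message and the query-passage pair as the user message, mirroring the Quick Start prompt. The following is a minimal sketch of such a client using only the Python standard library. The helper names `build_payload` and `judge`, the `max_tokens=8` cap, and the default URL are illustrative assumptions, not part of the model card; `instruction` is expected to be the `INSTRUCTION` string from the Quick Start section.

```python
import json
import urllib.request

MODEL_ID = "Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering"


def build_payload(instruction, query, passage, model=MODEL_ID):
    """Build an OpenAI-style chat payload for one query-passage pair."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": instruction},
            {"role": "user", "content": f"Question: {query}\nPassage: {passage}"},
        ],
        "temperature": 0.0,  # deterministic label
        "max_tokens": 8,     # assumption: the label is a single short word
    }


def judge(instruction, query, passage,
          url="http://localhost:8000/v1/chat/completions"):
    """POST the payload to a running server and return the raw label text."""
    data = json.dumps(build_payload(instruction, query, passage)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"].strip()
```

Calling `judge(INSTRUCTION, query, doc)` requires the server started above to be running; `build_payload` can be inspected on its own without a server.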

SGLang

How to use Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering with Docker Model Runner:
```
docker model run hf.co/Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering
```

Model Description

This model is a fine-tuned version of meta-llama/Llama-3.1-8B-Instruct, trained for 🚀 evidence relevance classification (also called evidence filtering) 🚀 in medical retrieval-augmented generation (RAG) pipelines.
Given a clinical query and a candidate passage, the model outputs “Yes” if the passage contains supporting evidence for the query and “No” otherwise.

This lightweight classifier is designed to help researchers:

- Improve retrieval quality in medical RAG systems.
- Filter irrelevant passages before generation.
- Build more reliable, interpretable RAG pipelines for medical QA.
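
As a rough illustration of where such a classifier sits in a pipeline, the sketch below filters retrieved passages before they are handed to a generator. The function `filter_passages` and the stand-in `keyword_judge` are hypothetical names introduced here; in practice the judge would call the model (for example, the Quick Start code) and return its "Yes"/"No" output.

```python
from typing import Callable, List


def filter_passages(query: str, passages: List[str],
                    judge: Callable[[str, str], str]) -> List[str]:
    """Keep only the passages that the judge labels "Yes" for this query."""
    kept = []
    for passage in passages:
        label = judge(query, passage).strip().lower()
        if label.startswith("yes"):
            kept.append(passage)
    return kept


# Stand-in judge for illustration only; replace it with a call to the model.
def keyword_judge(query: str, passage: str) -> str:
    return "Yes" if "glaucoma" in passage.lower() else "No"


retrieved = [
    "Acute angle-closure glaucoma requires immediate treatment with topical "
    "beta-blockers, alpha agonists, and systemic carbonic anhydrase inhibitors.",
    "Seasonal influenza vaccination is recommended annually.",
]
context = filter_passages(
    "What is the first-line treatment for acute angle-closure glaucoma?",
    retrieved,
    keyword_judge,
)
print(context)  # only the glaucoma passage is passed on to generation
```

Passing the judge as a callable keeps the filtering step independent of how the model is served (local Transformers, vLLM, or SGLang).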

For additional context, methodology, and full experimental details, please refer to our paper below.

📄 Paper: Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Instruction used during training
INSTRUCTION = (
    "Given a query and a text passage, determine whether the passage contains supporting evidence for the query. "
    "Supporting evidence means that the passage provides clear, relevant, and factual information that directly backs or justifies the answer to the query.\n\n"
    "Respond with one of the following labels:\n\"Yes\" if the passage contains supporting evidence for the query.\n"
    "\"No\" if the passage does not contain supporting evidence.\n"
    "You should respond with only the label (Yes or No) without any additional explanation."
)

# Example query + retrieved passage
query = "What is the first-line treatment for acute angle-closure glaucoma?"
doc = "Acute angle-closure glaucoma requires immediate treatment with topical beta-blockers, alpha agonists, and systemic carbonic anhydrase inhibitors."

# Build chat-style prompt
content = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": INSTRUCTION},
        {"role": "user", "content": f"Question: {query}\nPassage: {doc}"}
    ],
    add_generation_prompt=True,
    tokenize=False,
)

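# Tokenize (the chat template already inserts the BOS token, so do not add it a second time)
input_ids = tokenizer(content, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)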

# Define stopping tokens (Llama-3 style)
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Generate evidence-filtering judgment (greedy decoding; temperature is not used when do_sample=False)
outputs = model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=False,
)

# Decode model response
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
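
The decoded response is free text, so downstream code usually needs to map it to a boolean. A minimal sketch of that step follows, assuming the model mostly emits a bare `Yes` or `No` but may occasionally add quotes, punctuation, or trailing words; `parse_label` is a hypothetical helper, not part of the model card.

```python
import re

# Match a leading "yes" or "no" as a whole word, ignoring case and any
# leading quotes or whitespace.
_LABEL_PATTERN = re.compile(r"\W*(yes|no)\b", re.IGNORECASE)


def parse_label(text: str):
    """Return True for "Yes", False for "No", and None if neither is found."""
    match = _LABEL_PATTERN.match(text)
    if match is None:
        return None
    return match.group(1).lower() == "yes"
```

Returning `None` for unparseable output lets the caller decide whether to drop the passage, keep it, or retry, rather than silently treating it as "No".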

Training Setup

Dataset: 3,200 query–passage pairs with expert-provided Yes/No labels (dataset to be released in a future update).
Task: Given a query and a candidate passage, the model generates "Yes" if the passage contains supporting evidence and "No" otherwise.
Objective: Causal language modeling (cross-entropy next-token loss).
Prompt: See the Quick Start section for an example usage prompt.
Hyperparameter Tuning: Five-fold cross-validation.
Final Hyperparameters:
- Learning rate: 2e-6
- Batch size: 8
- Epochs: 3
Training Framework: LLaMA-Factory.
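
For readers unfamiliar with the procedure, five-fold cross-validation over 3,200 pairs means each fold holds out 640 pairs and trains on the remaining 2,560. The sketch below shows one way to produce such splits with the standard library. The shuffling, the seed, and the function name `k_fold_indices` are assumptions for illustration; the model card does not state how the folds were drawn or whether they were stratified.

```python
import random


def k_fold_indices(n_examples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        train = [idx for j, fold in enumerate(folds) if j != held_out
                 for idx in fold]
        yield train, validation


for fold_id, (train_idx, val_idx) in enumerate(k_fold_indices(3200)):
    print(fold_id, len(train_idx), len(val_idx))
```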

Performance

Evaluation was conducted on 3,200 expert-annotated query–passage pairs using five-fold cross-validation.

| Model | Precision | Recall | F1 |
|---|---|---|---|
| Llama-3.1-8B (zero-shot) | 0.483 | 0.566 | 0.521 |
| GPT-4o (zero-shot) | 0.697 | 0.324 | 0.442 |
| Llama-3.1-8B (fine-tuned, ours) | 0.592 | 0.657 | 0.623 |

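🔥 Fine-tuning raises F1 to 0.623, compared with 0.521 for zero-shot Llama-3.1-8B and 0.442 for zero-shot GPT-4o, and gives the highest recall (0.657). GPT-4o retains the highest precision (0.697), but at a much lower recall (0.324).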
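
For reference, the metrics in the table treat "Yes" as the positive class. The sketch below computes precision, recall, and F1 from gold and predicted labels and checks that the reported F1 values match the harmonic mean of the reported precision and recall. The function names and the toy labels are illustrative assumptions; how the scores were aggregated across folds is not stated in the model card.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 with True ("Yes") as the positive class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_from(precision, recall) if precision + recall else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only (not from the evaluation set).
gold = [True, True, False, False, True]
pred = [True, False, True, False, True]
print(precision_recall_f1(gold, pred))

# Consistency check against the fine-tuned row of the table.
print(round(f1_from(0.592, 0.657), 3))  # 0.623
```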

Intended Use

This model is intended for research purposes only.

Reference

To cite our paper, please use the BibTeX entry below.

@article{kim2025rethinking,
  title={Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights},
  author={Kim, Hyunjae and Sohn, Jiwoong and Gilson, Aidan and Cochran-Caggiano, Nicholas and Applebaum, Serina and Jin, Heeju and Park, Seihee and Park, Yujin and Park, Jiyeong and Choi, Seoyoung and others},
  journal={arXiv preprint arXiv:2511.06738},
  year={2025}
}