---
layout: default
title: Inference Guide
permalink: /inference/
---
# Inference Guide

This guide shows how to use CLaRa models for inference at each training stage.
## Loading Models
CLaRa models can be loaded using the standard AutoModel interface:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/model",
    trust_remote_code=True
).to('cuda')
```
## Stage 1: Compression Pretraining Model
Generate paraphrases from compressed document representations.
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage1/model",
    trust_remote_code=True
).to('cuda')

# Example documents: one list of documents per question
documents = [
    [
        "Document 1 content...",
        "Document 2 content...",
        "Document 3 content..."
    ]
]
# Stage 1 does not use questions, so pass an empty string per document list
questions = ["" for _ in range(len(documents))]

# Generate a paraphrase from the compressed representations
output = model.generate_from_paraphrase(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)
print('Generated paraphrase:', output[0])
```
## Stage 2: Compression Instruction Tuning Model
Generate answers from compressed representations for QA tasks.
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage2/model",
    trust_remote_code=True
).to('cuda')

# Example documents and question: one list of documents per question
documents = [
    [
        "Document 1 content...",
        "Document 2 content...",
        "Document 3 content..."
    ]
]
questions = ["Your question here"]

# Generate an answer from the compressed representations
output = model.generate_from_text(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)
print('Generated answer:', output[0])
```
## Stage 3: End-to-End (CLaRa) Model
Generate answers with retrieval and reranking using joint optimization.
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage3/model",
    trust_remote_code=True
).to('cuda')

# Example documents and question
# Note: Stage 3 supports retrieval over multiple candidate documents
documents = [
    ["Document 1 content..." for _ in range(20)]  # 20 candidate documents
]
questions = ["Your question here"]

# Generate an answer with retrieval and reranking
# The top-k is set by generation_top_k in config.json
output, topk_indices = model.generate_from_questions(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)
print('Generated answer:', output[0])
print('Top-k selected document indices:', topk_indices)
```
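The returned `topk_indices` can be mapped back to the candidate texts that were retrieved. A minimal sketch, assuming `topk_indices` holds one list of integer indices per question (the example values below stand in for real model outputs):

```python
# Hypothetical values standing in for model outputs
documents = [["candidate document %d" % i for i in range(20)]]
topk_indices = [[7, 2, 14]]  # assumed shape: one index list per question

# Recover the selected candidate texts for each question
selected = [
    [cands[i] for i in idxs]
    for cands, idxs in zip(documents, topk_indices)
]
print(selected[0])
```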
## Key Parameters
- `max_new_tokens`: Maximum number of tokens to generate (default: 128)
- `generation_top_k`: Number of top documents to select (configured in the model's `config.json`)
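Because `generation_top_k` lives in the model's `config.json`, changing it means editing that file before loading the model. A minimal sketch using only the standard library (the temporary path and the default value of 5 are illustrative, not taken from CLaRa):

```python
import json
import tempfile
from pathlib import Path

# Illustrative config.json; in practice this sits in the model directory
cfg_path = Path(tempfile.mkdtemp()) / "config.json"
cfg_path.write_text(json.dumps({"generation_top_k": 5}))

# Read, update, and write back the setting
config = json.loads(cfg_path.read_text())
config["generation_top_k"] = 3  # select the top 3 candidates at generation time
cfg_path.write_text(json.dumps(config))

print(json.loads(cfg_path.read_text())["generation_top_k"])  # 3
```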
## Model Methods
- `generate_from_paraphrase()` (Stage 1): generate paraphrases from compressed documents
- `generate_from_text()` (Stage 2): generate answers from compressed documents
- `generate_from_questions()` (Stage 3): generate answers with retrieval and reranking