layout: default
title: Inference Guide
permalink: /inference/

Inference Guide

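This guide shows how to run inference with CLaRa models from each of the three training stages: compression pretraining, compression instruction tuning, and end-to-end training.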

Loading Models

CLaRa models can be loaded using the standard AutoModel interface:

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/model",
    trust_remote_code=True
).to('cuda')
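The snippet above hard-codes `'cuda'`, which fails on a machine without a CUDA GPU. A minimal sketch of a fallback, assuming PyTorch is the backend; the helper name `pick_device` is hypothetical and not part of CLaRa:

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
# Then load as above, replacing .to('cuda') with .to(device):
# model = AutoModel.from_pretrained("path/to/model", trust_remote_code=True).to(device)
```

Running on CPU is likely to be much slower, so this is mainly useful for smoke tests.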

Stage 1: Compression Pretraining Model

Generate paraphrases from compressed document representations.

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage1/model",
    trust_remote_code=True
).to('cuda')

# Example documents
documents = [
    [
        "Document 1 content...",
        "Document 2 content...",
        "Document 3 content..."
    ]
]

# Paraphrasing needs no question; pass one empty string per document list
questions = [""] * len(documents)

# Generate paraphrase from compressed representations
output = model.generate_from_paraphrase(
    questions=questions, 
    documents=documents, 
    max_new_tokens=64
)

print('Generated paraphrase:', output[0])
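The examples in this guide pass `documents` as a list of document lists and `questions` as a list with one entry per document list. A small check like the following can catch shape mistakes before a model call. It is a sketch based only on the shapes shown here, and `check_batch` is a hypothetical helper, not a CLaRa API.

```python
def check_batch(questions, documents):
    """Raise ValueError if the inputs do not match the shapes used in this guide."""
    if len(questions) != len(documents):
        raise ValueError(
            "expected one question per document list, "
            f"got {len(questions)} questions and {len(documents)} document lists"
        )
    for i, docs in enumerate(documents):
        # A bare string here usually means the outer list was forgotten.
        if isinstance(docs, str) or not all(isinstance(d, str) for d in docs):
            raise ValueError(f"documents[{i}] must be a list of strings")


# Stage 1 style input: empty questions, one list of documents per example.
documents = [["Document 1 content...", "Document 2 content..."]]
questions = [""] * len(documents)
check_batch(questions, documents)  # passes silently
```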

Stage 2: Compression Instruction Tuning Model

Generate answers from compressed representations for QA tasks.

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage2/model",
    trust_remote_code=True
).to('cuda')

# Example documents and question
documents = [
    [
        "Document 1 content...",
        "Document 2 content...",
        "Document 3 content..."
    ]
]

questions = ["Your question here"]

# Generate answer from compressed representations
output = model.generate_from_text(
    questions=questions, 
    documents=documents, 
    max_new_tokens=64
)

print('Generated answer:', output[0])
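To answer several questions in one call, the same nested layout can presumably be extended to one document list per question. The sketch below only builds the inputs. Whether `generate_from_text` accepts batches larger than one, or document lists of different lengths, is an assumption to verify against the model code. `build_batch` is a hypothetical helper.

```python
def build_batch(examples):
    """Split (question, docs) pairs into parallel `questions` and `documents` lists."""
    questions = [question for question, _ in examples]
    documents = [list(docs) for _, docs in examples]
    return questions, documents


examples = [
    ("Your first question here", ["Document 1 content...", "Document 2 content..."]),
    ("Your second question here", ["Document 3 content...", "Document 4 content..."]),
]
questions, documents = build_batch(examples)
# output = model.generate_from_text(
#     questions=questions, documents=documents, max_new_tokens=64
# )
```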

Stage 3: End-to-End (CLaRa) Model

Retrieve and rerank candidate documents, then generate an answer, using the jointly optimized end-to-end model.

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage3/model",
    trust_remote_code=True
).to('cuda')

# Example documents and question
# Note: Stage 3 supports retrieval with multiple candidate documents
documents = [
    [f"Document {i + 1} content..." for i in range(20)]  # 20 candidate documents
]

questions = ["Your question here"]

# Generate answer with retrieval and reranking
# The number of documents kept (top-k) is set by generation_top_k in config.json
output, topk_indices = model.generate_from_questions(
    questions=questions, 
    documents=documents, 
    max_new_tokens=64
)

print('Generated answer:', output[0])
print('Top-k selected document indices:', topk_indices)
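To see which candidates were selected, the returned indices can be mapped back to the document text. The return type of `topk_indices` is not documented here, so this sketch assumes one sequence of integer positions per question (a tensor would first need `.tolist()`). `selected_documents` is a hypothetical helper.

```python
def selected_documents(documents, topk_indices):
    """Return, for each question, the candidate documents at the selected positions."""
    return [
        [docs[int(i)] for i in indices]
        for docs, indices in zip(documents, topk_indices)
    ]


# Stand-in values; in practice these come from generate_from_questions().
documents = [[f"Document {i + 1} content..." for i in range(20)]]
topk_indices = [[3, 0, 7]]

for rank, text in enumerate(selected_documents(documents, topk_indices)[0], start=1):
    print(rank, text)
```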

Key Parameters

  • max_new_tokens: Maximum number of tokens to generate (default: 128)
  • generation_top_k: Number of top documents to select (configured in model config)
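Since `generation_top_k` lives in `config.json`, it can be inspected before loading the model. A minimal sketch, assuming a local model directory with the key at the top level of the file; `read_generation_top_k` is a hypothetical helper.

```python
import json
from pathlib import Path


def read_generation_top_k(model_dir):
    """Return generation_top_k from <model_dir>/config.json, or None if absent."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("generation_top_k")


# Example (the path is a placeholder):
# print(read_generation_top_k("path/to/stage3/model"))
```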

Model Methods

  • generate_from_paraphrase() - Stage 1: Generate paraphrases
  • generate_from_text() - Stage 2: Generate answers from compressed docs
  • generate_from_questions() - Stage 3: Generate with retrieval and reranking; also returns the selected document indices
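When one script has to serve checkpoints from different stages, the mapping above can be kept in a small lookup table. This is a sketch: `STAGE_METHODS` and `generate_for_stage` are hypothetical names, and only the three method names and their keyword arguments come from this guide.

```python
STAGE_METHODS = {
    1: "generate_from_paraphrase",
    2: "generate_from_text",
    3: "generate_from_questions",
}


def generate_for_stage(model, stage, questions, documents, max_new_tokens=64):
    """Call the generation method that matches the checkpoint's training stage."""
    if stage not in STAGE_METHODS:
        raise ValueError(f"unknown stage {stage!r}; expected 1, 2, or 3")
    method = getattr(model, STAGE_METHODS[stage])
    return method(
        questions=questions, documents=documents, max_new_tokens=max_new_tokens
    )
```

Note that, per the Stage 3 example, `generate_from_questions()` returns a pair (answers and top-k indices), while the other two examples unpack a single value, so callers need to handle both return shapes.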