---
layout: default
title: Inference Guide
permalink: /inference/
---

# Inference Guide

This guide shows how to use CLaRa models for inference at different stages.

## Loading Models

CLaRa models can be loaded using the standard `AutoModel` interface:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/model",
    trust_remote_code=True
).to('cuda')
```

## Stage 1: Compression Pretraining Model

Generate paraphrases from compressed document representations.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage1/model",
    trust_remote_code=True
).to('cuda')

# Example documents
documents = [
    [
        "Document 1 content...",
        "Document 2 content...",
        "Document 3 content..."
    ]
]
questions = ["" for _ in range(len(documents))]

# Generate paraphrase from compressed representations
output = model.generate_from_paraphrase(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)
print('Generated paraphrase:', output[0])
```

## Stage 2: Compression Instruction Tuning Model

Generate answers from compressed representations for QA tasks.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage2/model",
    trust_remote_code=True
).to('cuda')

# Example documents and question
documents = [
    [
        "Document 1 content...",
        "Document 2 content...",
        "Document 3 content..."
    ]
]
questions = ["Your question here"]

# Generate answer from compressed representations
output = model.generate_from_text(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)
print('Generated answer:', output[0])
```

## Stage 3: End-to-End (CLaRa) Model

Generate answers with retrieval and reranking using joint optimization.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage3/model",
    trust_remote_code=True
).to('cuda')

# Example documents and question
# Note: Stage 3 supports retrieval with multiple candidate documents
documents = [
    ["Document 1 content..." for _ in range(20)]  # 20 candidate documents
]
questions = ["Your question here"]

# Generate answer with retrieval and reranking
# The top-k is set by generation_top_k in config.json
output, topk_indices = model.generate_from_questions(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)
print('Generated answer:', output[0])
print('Top-k selected document indices:', topk_indices)
```

## Key Parameters

- `max_new_tokens`: Maximum number of tokens to generate (default: 128)
- `generation_top_k`: Number of top documents to select (configured in the model config)

## Model Methods

- `generate_from_paraphrase()` - Stage 1: generate paraphrases
- `generate_from_text()` - Stage 2: generate answers from compressed documents
- `generate_from_questions()` - Stage 3: generate with retrieval and reranking
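
## Inspecting Retrieved Documents

The Stage 3 method returns `topk_indices` alongside the generated text. A minimal sketch of mapping those indices back to the candidate texts, assuming `topk_indices` is a per-question list of integer positions into the candidate list (the exact return type may differ in the actual model code):

```python
# Hypothetical post-processing of a generate_from_questions() call.
# `documents` and `topk_indices` here are stand-in values, not real model output.
documents = [[f"Document {i + 1} content..." for i in range(20)]]
topk_indices = [[3, 0, 7]]  # e.g. generation_top_k = 3 in config.json

for q, indices in enumerate(topk_indices):
    # Recover the document texts the model selected for question q
    selected = [documents[q][i] for i in indices]
    print(f"Question {q} selected documents: {selected}")
```

This kind of mapping is useful for spot-checking whether the reranker is surfacing relevant candidates before trusting the generated answer.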