---
layout: default
title: Inference Guide
permalink: /inference/
---
# Inference Guide
This guide shows how to run inference with CLaRa models at each training stage.
## Loading Models
CLaRa models can be loaded using the standard `AutoModel` interface:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/model",
    trust_remote_code=True
).to('cuda')
```
## Stage 1: Compression Pretraining Model
Generate paraphrases from compressed document representations.
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage1/model",
    trust_remote_code=True
).to('cuda')

# Example documents
documents = [
    [
        "Document 1 content...",
        "Document 2 content...",
        "Document 3 content..."
    ]
]
questions = ["" for _ in range(len(documents))]

# Generate paraphrase from compressed representations
output = model.generate_from_paraphrase(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)
print('Generated paraphrase:', output[0])
```
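Stage 1 takes no real questions, but the `questions` list must stay aligned one-to-one with the outer `documents` list, as in the snippet above. A minimal sketch of preparing a batch of several document sets (the document strings are placeholders):

```python
# Two independent document sets to paraphrase in one batch
documents = [
    ["Set A, document 1...", "Set A, document 2..."],
    ["Set B, document 1...", "Set B, document 2..."],
]

# Stage 1 ignores the question text, but the list length
# must match the number of document sets
questions = ["" for _ in range(len(documents))]
print(len(questions))  # → 2
```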
## Stage 2: Compression Instruction Tuning Model
Generate answers from compressed representations for QA tasks.
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage2/model",
    trust_remote_code=True
).to('cuda')

# Example documents and question
documents = [
    [
        "Document 1 content...",
        "Document 2 content...",
        "Document 3 content..."
    ]
]
questions = ["Your question here"]

# Generate answer from compressed representations
output = model.generate_from_text(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)
print('Generated answer:', output[0])
```
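The outer `documents` list runs parallel to `questions`, so batching several QA pairs presumably comes down to passing equal-length lists. A hedged sketch with placeholder content:

```python
questions = [
    "What is discussed in set A?",
    "What is discussed in set B?",
]
documents = [
    ["Set A, document 1...", "Set A, document 2..."],
    ["Set B, document 1...", "Set B, document 2..."],
]

# Each question is answered against its own document set,
# so the two lists must have the same length
assert len(questions) == len(documents)
```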
## Stage 3: End-to-End (CLaRa) Model
Generate answers with retrieval and reranking using joint optimization.
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage3/model",
    trust_remote_code=True
).to('cuda')

# Example documents and question
# Note: Stage 3 supports retrieval with multiple candidate documents
documents = [
    ["Document 1 content..." for _ in range(20)]  # 20 candidate documents
]
questions = ["Your question here"]

# Generate answer with retrieval and reranking
# The top-k is decided by generation_top_k in config.json
output, topk_indices = model.generate_from_questions(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)
print('Generated answer:', output[0])
print('Top-k selected document indices:', topk_indices)
```
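Since `topk_indices` indexes into each question's candidate pool, the selected passages can be recovered by indexing back into `documents`. A sketch assuming the indices come back as one list of ints per question (the exact return shape may differ; the index values here are hypothetical):

```python
# Hypothetical return value: one list of selected indices per question
topk_indices = [[3, 7, 12]]

# A 20-document candidate pool of distinct placeholders
documents = [[f"Document {i + 1} content..." for i in range(20)]]

# Map the indices back to the selected document texts
selected = [
    [documents[q][i] for i in idxs]
    for q, idxs in enumerate(topk_indices)
]
print(selected[0][0])  # → Document 4 content...
```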
## Key Parameters
- `max_new_tokens`: Maximum number of tokens to generate (default: 128)
- `generation_top_k`: Number of top documents to select (configured in model config)
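Since `generation_top_k` lives in the model's `config.json`, changing how many documents are selected means editing that file before loading the model. A minimal fragment (the value `5` is only an example, and the real config contains other fields omitted here):

```json
{
  "generation_top_k": 5
}
```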
## Model Methods
- `generate_from_paraphrase()` - Stage 1: Generate paraphrases
- `generate_from_text()` - Stage 2: Generate answers from compressed docs
- `generate_from_questions()` - Stage 3: Generate with retrieval and reranking