---
layout: default
title: Inference Guide
permalink: /inference/
---

# Inference Guide

This guide shows how to run inference with CLaRa models at each training stage.

## Loading Models

CLaRa models can be loaded using the standard `AutoModel` interface:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/model",
    trust_remote_code=True
).to('cuda')
```

## Stage 1: Compression Pretraining Model

Generate paraphrases from compressed document representations.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage1/model",
    trust_remote_code=True
).to('cuda')

# Example documents
documents = [
    [
        "Document 1 content...",
        "Document 2 content...",
        "Document 3 content..."
    ]
]

# Stage 1 paraphrase generation does not condition on a question, so pass empty strings
questions = ["" for _ in range(len(documents))]

# Generate paraphrase from compressed representations
output = model.generate_from_paraphrase(
    questions=questions, 
    documents=documents, 
    max_new_tokens=64
)

print('Generated paraphrase:', output[0])
```

## Stage 2: Compression Instruction Tuning Model

Generate answers from compressed representations for QA tasks.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage2/model",
    trust_remote_code=True
).to('cuda')

# Example documents and question
documents = [
    [
        "Document 1 content...",
        "Document 2 content...",
        "Document 3 content..."
    ]
]

questions = ["Your question here"]

# Generate answer from compressed representations
output = model.generate_from_text(
    questions=questions, 
    documents=documents, 
    max_new_tokens=64
)

print('Generated answer:', output[0])
```

## Stage 3: End-to-End (CLaRa) Model

Generate answers with retrieval and reranking using joint optimization.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage3/model",
    trust_remote_code=True
).to('cuda')

# Example documents and question
# Note: Stage 3 supports retrieval with multiple candidate documents
documents = [
    ["Document 1 content..." for _ in range(20)]  # 20 candidate documents
]

questions = ["Your question here"]

# Generate answer with retrieval and reranking
# The number of top-ranked documents kept is set by generation_top_k in config.json
output, topk_indices = model.generate_from_questions(
    questions=questions, 
    documents=documents, 
    max_new_tokens=64
)

print('Generated answer:', output[0])
print('Top-k selected document indices:', topk_indices)
```

## Key Parameters

- `max_new_tokens`: Maximum number of tokens to generate (default: 128)
- `generation_top_k`: Number of top documents to select (configured in model config)
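Because `generation_top_k` is read from the model's `config.json` rather than passed at call time, changing it means editing the config before loading. As a hedged illustration (the value `5` and the surrounding fields are assumptions, not taken from an actual CLaRa config), the relevant entry might look like:

```json
{
  "generation_top_k": 5,
  "max_new_tokens": 128
}
```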

## Model Methods

- `generate_from_paraphrase()` - Stage 1: Generate paraphrases
- `generate_from_text()` - Stage 2: Generate answers from compressed docs
- `generate_from_questions()` - Stage 3: Generate with retrieval and reranking
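All three methods take parallel lists: `questions[i]` is answered against the candidate list `documents[i]`, so batching several queries just means extending both lists in step. A minimal sketch of the expected input shape (the strings are placeholder data, not from the CLaRa repo):

```python
# Two independent queries, each paired with its own candidate document list.
# The i-th question is always answered against the i-th document list.
questions = [
    "Who wrote the report?",
    "When was the policy adopted?",
]
documents = [
    ["The report was authored by the analytics team.", "An unrelated memo."],
    ["The policy was adopted in 2021.", "Minutes from the March meeting."],
]

assert len(questions) == len(documents)  # parallel lists, one entry per query
```

The single-query examples above are this same shape with one entry in each list.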