---
layout: default
title: Inference Guide
permalink: /inference/
---
# Inference Guide
This guide shows how to run inference with CLaRa models from each of the three training stages.
## Loading Models
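CLaRa models load through the standard `AutoModel` interface. Pass `trust_remote_code=True` so that `transformers` can load the custom model code shipped with the checkpoint: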
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/model",
    trust_remote_code=True
).to('cuda')
```
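The snippets in this guide hard-code `'cuda'`, which fails on a machine without a GPU. A minimal sketch of a fallback, assuming PyTorch is the backend; `pick_device` is a hypothetical helper, not part of CLaRa:

```python
def pick_device() -> str:
    """Return 'cuda' when a CUDA GPU is usable, otherwise 'cpu'."""
    try:
        import torch  # imported lazily so the helper also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
# The device string can then replace the hard-coded 'cuda':
# model = AutoModel.from_pretrained("path/to/model", trust_remote_code=True).to(device)
```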
## Stage 1: Compression Pretraining Model
Generate paraphrases from compressed document representations.
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage1/model",
    trust_remote_code=True
).to('cuda')

# Example documents: one inner list of documents per query
documents = [
    [
        "Document 1 content...",
        "Document 2 content...",
        "Document 3 content..."
    ]
]
# Stage 1 uses empty questions, one per inner document list
questions = [""] * len(documents)

# Generate paraphrase from compressed representations
output = model.generate_from_paraphrase(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)
print('Generated paraphrase:', output[0])
```
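In the example above, `documents` is a list holding one inner list of document strings per query, and `questions` has one entry per inner list. A small sketch of a pre-flight check for that layout; `check_batch` is a hypothetical helper, and the shape rule is inferred from the examples in this guide rather than from the model code:

```python
from typing import List


def check_batch(questions: List[str], documents: List[List[str]]) -> int:
    """Validate the nested-list layout used in this guide; return the batch size."""
    if len(questions) != len(documents):
        raise ValueError(
            f"{len(questions)} questions but {len(documents)} document lists"
        )
    for i, docs in enumerate(documents):
        # A bare string here usually means the outer list was forgotten
        if isinstance(docs, str) or not all(isinstance(d, str) for d in docs):
            raise ValueError(f"documents[{i}] must be a list of strings")
    return len(questions)


documents = [["Document 1 content...", "Document 2 content..."]]
questions = [""] * len(documents)
print(check_batch(questions, documents))
```

Running the check before calling a `generate_from_*` method surfaces a mismatched batch as a readable error rather than a failure inside the model.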
## Stage 2: Compression Instruction Tuning Model
Generate answers from compressed representations for QA tasks.
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage2/model",
    trust_remote_code=True
).to('cuda')

# Example documents and question
documents = [
    [
        "Document 1 content...",
        "Document 2 content...",
        "Document 3 content..."
    ]
]
questions = ["Your question here"]

# Generate answer from compressed representations
output = model.generate_from_text(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)
print('Generated answer:', output[0])
```
## Stage 3: End-to-End (CLaRa) Model
Generate answers with retrieval and reranking: the jointly trained model selects the top-k candidate documents and answers from them.
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/stage3/model",
    trust_remote_code=True
).to('cuda')

# Example documents and question
# Note: Stage 3 supports retrieval with multiple candidate documents
documents = [
    [f"Document {i + 1} content..." for i in range(20)]  # 20 candidate documents
]
questions = ["Your question here"]

# Generate answer with retrieval and reranking
# The top-k is decided by generation_top_k in config.json
output, topk_indices = model.generate_from_questions(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)
print('Generated answer:', output[0])
print('Top-k selected document indices:', topk_indices)
```
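To inspect which passages were used, the returned indices can be mapped back to the candidate texts. A sketch, assuming `topk_indices` holds one sequence of integer positions per question, each indexing into the matching inner list of `documents`; the exact return type is not documented in this guide, and `selected_documents` is a hypothetical helper:

```python
def selected_documents(documents, topk_indices):
    """Look up the text of each selected document, one list per question."""
    return [
        [docs[int(i)] for i in indices]  # int() so integer-like values also work
        for docs, indices in zip(documents, topk_indices)
    ]


# Stand-in values; with a real model these come from generate_from_questions()
documents = [[f"Document {i + 1} content..." for i in range(20)]]
topk_indices = [[4, 0, 17]]
print(selected_documents(documents, topk_indices))
```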
## Key Parameters
- `max_new_tokens`: Maximum number of tokens to generate (default: 128)
- `generation_top_k`: Number of top documents to select (configured in model config)
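
Because `generation_top_k` lives in the model config rather than in the call, it can be checked before running Stage 3. A sketch, assuming a local checkpoint directory whose `config.json` stores `generation_top_k` as a top-level key; `read_generation_top_k` is a hypothetical helper:

```python
import json
from pathlib import Path
from typing import Optional


def read_generation_top_k(model_dir: str) -> Optional[int]:
    """Return generation_top_k from <model_dir>/config.json, or None if absent."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("generation_top_k")


# Usage (the path is a placeholder):
# print(read_generation_top_k("path/to/stage3/model"))
```
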
## Model Methods
- `generate_from_paraphrase()` - Stage 1: Generate paraphrases
- `generate_from_text()` - Stage 2: Generate answers from compressed docs
- `generate_from_questions()` - Stage 3: Generate with retrieval and reranking
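
When a script has to handle checkpoints from more than one stage, the mapping above can be written as a small dispatch table. A sketch using only the three method names listed in this guide; `STAGE_METHODS` and `generate_for_stage` are hypothetical names, and the keyword arguments mirror the examples above:

```python
STAGE_METHODS = {
    1: "generate_from_paraphrase",
    2: "generate_from_text",
    3: "generate_from_questions",
}


def generate_for_stage(model, stage, questions, documents, max_new_tokens=64):
    """Call the generation method that matches the checkpoint's training stage."""
    try:
        method_name = STAGE_METHODS[stage]
    except KeyError:
        raise ValueError(f"stage must be 1, 2, or 3, got {stage!r}") from None
    method = getattr(model, method_name)
    return method(
        questions=questions,
        documents=documents,
        max_new_tokens=max_new_tokens,
    )
```

In the examples above, Stage 3 returns an `(output, topk_indices)` pair while Stages 1 and 2 return the output alone, so callers still need to unpack the result per stage.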