	---
	layout: default
	title: Inference Guide
	permalink: /inference/
	---

	# Inference Guide

	This guide shows how to run inference with the CLaRa model produced by each of the three training stages.

	## Loading Models

	CLaRa models can be loaded using the standard `AutoModel` interface:

	```python
	from transformers import AutoModel

	model = AutoModel.from_pretrained(
	    "path/to/model",
	    trust_remote_code=True
	).to('cuda')
	```
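The snippet above hard-codes `'cuda'`. As a minimal sketch, the device string could be chosen at runtime instead. `pick_device` is a hypothetical helper, not part of CLaRa, and whether the model's remote code actually runs on CPU is not stated in this guide.

```python
def pick_device() -> str:
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so no GPU can be used from here.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("Selected device:", device)
```

The result can then be passed to `.to(device)` in place of the literal `'cuda'`.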

	## Stage 1: Compression Pretraining Model

	Generate paraphrases from compressed document representations.

	```python
	from transformers import AutoModel

	# trust_remote_code=True lets transformers load the custom model code shipped with the checkpoint
	model = AutoModel.from_pretrained(
	    "path/to/stage1/model",
	    trust_remote_code=True
	).to('cuda')

	# Example documents: one inner list of documents per example in the batch
	documents = [
	    [
	        "Document 1 content...",
	        "Document 2 content...",
	        "Document 3 content..."
	    ]
	]

	# Paraphrasing uses no question, so pass one empty string per example
	questions = [""] * len(documents)

	# Generate paraphrase from compressed representations
	output = model.generate_from_paraphrase(
	    questions=questions,
	    documents=documents,
	    max_new_tokens=64
	)

	print('Generated paraphrase:', output[0])
	```
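The Stage 1 example passes `documents` as a list of document groups and one empty question per group. A minimal sketch of checking that layout before calling the model might look like the following. `check_document_batch` and `empty_questions` are hypothetical helpers introduced for illustration, not part of the CLaRa API.

```python
from typing import List


def check_document_batch(documents: List[List[str]]) -> int:
    """Validate the nested-list layout and return the batch size."""
    if not isinstance(documents, list) or not documents:
        raise ValueError("documents must be a non-empty list of document groups")
    for i, group in enumerate(documents):
        if not isinstance(group, list) or not group:
            raise ValueError(f"documents[{i}] must be a non-empty list of strings")
        if not all(isinstance(doc, str) for doc in group):
            raise ValueError(f"documents[{i}] must contain only strings")
    return len(documents)


def empty_questions(documents: List[List[str]]) -> List[str]:
    """Build one empty question per document group, as the Stage 1 example does."""
    return [""] * check_document_batch(documents)
```

For example, `empty_questions([["doc a", "doc b"], ["doc c"]])` yields two empty strings, one per group.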

	## Stage 2: Compression Instruction Tuning Model

	Generate answers from compressed representations for QA tasks.

	```python
	from transformers import AutoModel

	model = AutoModel.from_pretrained(
	    "path/to/stage2/model",
	    trust_remote_code=True
	).to('cuda')

	# Example documents and question
	documents = [
	    [
	        "Document 1 content...",
	        "Document 2 content...",
	        "Document 3 content..."
	    ]
	]

	questions = ["Your question here"]

	# Generate answer from compressed representations
	output = model.generate_from_text(
	    questions=questions,
	    documents=documents,
	    max_new_tokens=64
	)

	print('Generated answer:', output[0])
	```
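The example above uses a single question. Judging by the parallel `questions` and `documents` lists, several questions can presumably be batched by keeping the two lists aligned by index. The sketch below only builds those lists. `build_batch` is a hypothetical helper, and batch sizes above one are an assumption not confirmed by this guide.

```python
from typing import List, Sequence, Tuple


def build_batch(
    examples: Sequence[Tuple[str, Sequence[str]]],
) -> Tuple[List[str], List[List[str]]]:
    """Split (question, documents) pairs into two index-aligned lists."""
    questions: List[str] = []
    documents: List[List[str]] = []
    for question, docs in examples:
        questions.append(question)
        documents.append(list(docs))
    return questions, documents


questions, documents = build_batch([
    ("First question?", ["Doc A", "Doc B"]),
    ("Second question?", ["Doc C", "Doc D"]),
])
```

Here `questions[i]` is answered from `documents[i]`, which is the same layout `generate_from_text()` receives in the example above.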

	## Stage 3: End-to-End (CLaRa) Model

	Generate answers with built-in retrieval and reranking, using a model whose retrieval and generation were optimized jointly.

	```python
	from transformers import AutoModel

	model = AutoModel.from_pretrained(
	    "path/to/stage3/model",
	    trust_remote_code=True
	).to('cuda')

	# Example documents and question
	# Note: Stage 3 supports retrieval with multiple candidate documents
	documents = [
	    [f"Document {i + 1} content..." for i in range(20)]  # 20 candidate documents
	]

	questions = ["Your question here"]

	# Generate answer with retrieval and reranking
	# The number of documents kept (top-k) is set by generation_top_k in config.json
	output, topk_indices = model.generate_from_questions(
	    questions=questions,
	    documents=documents,
	    max_new_tokens=64
	)

	print('Generated answer:', output[0])
	print('Top-k selected document indices:', topk_indices)
	```
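To see which candidates were selected, the returned indices can be mapped back to the document texts. The exact type and shape of `topk_indices` is not documented in this guide, so the sketch below assumes one row of integer indices per question, each indexing into that question's candidate list. `selected_documents` is a hypothetical helper.

```python
from typing import List, Sequence


def selected_documents(
    documents: Sequence[Sequence[str]],
    topk_indices,
) -> List[List[str]]:
    """Return the selected document texts for each question in the batch."""
    # Tensors and NumPy arrays expose tolist(); plain nested lists pass through.
    rows = topk_indices.tolist() if hasattr(topk_indices, "tolist") else topk_indices
    return [
        [group[int(i)] for i in row]
        for group, row in zip(documents, rows)
    ]


# Illustrative values only, not real model output
candidates = [["doc 0", "doc 1", "doc 2", "doc 3"]]
picked = selected_documents(candidates, [[2, 0]])
print(picked)
```

If the real return value has a different layout, the indexing in the list comprehension would need to be adapted.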

	## Key Parameters

	- `max_new_tokens`: Maximum number of tokens to generate (default: 128)
	- `generation_top_k`: Number of top-ranked documents to select (set in the model's `config.json`)
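Since `generation_top_k` lives in `config.json`, its current value can be checked before running Stage 3. The following is a minimal sketch that assumes the model path is a local directory containing `config.json`. `read_generation_top_k` is a hypothetical helper, and a Hub model ID would first need to be downloaded.

```python
import json
from pathlib import Path
from typing import Optional


def read_generation_top_k(model_dir: str) -> Optional[int]:
    """Read generation_top_k from <model_dir>/config.json, or None if the key is absent."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("generation_top_k")
```

Calling `read_generation_top_k("path/to/stage3/model")` returns whatever value the checkpoint's config stores under that key.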

	## Model Methods

	- `generate_from_paraphrase()` - Stage 1: Generate paraphrases
	- `generate_from_text()` - Stage 2: Generate answers from compressed documents
	- `generate_from_questions()` - Stage 3: Generate with retrieval and reranking
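The stage-to-method mapping above can be captured in a small dispatch table. This is a sketch only: `STAGE_METHODS` and `generate_for_stage` are hypothetical names, and the `max_new_tokens=64` default simply mirrors the examples in this guide.

```python
STAGE_METHODS = {
    1: "generate_from_paraphrase",
    2: "generate_from_text",
    3: "generate_from_questions",
}


def generate_for_stage(model, stage, questions, documents, max_new_tokens=64):
    """Call the generation method that matches the given training stage."""
    try:
        method_name = STAGE_METHODS[stage]
    except KeyError:
        raise ValueError(f"stage must be 1, 2, or 3, got {stage!r}") from None
    method = getattr(model, method_name)
    # Stages 1 and 2 return the generated outputs; in the examples above,
    # Stage 3 returns an (output, topk_indices) pair instead.
    return method(
        questions=questions,
        documents=documents,
        max_new_tokens=max_new_tokens,
    )
```

Callers still need to unpack the Stage 3 result differently from the other two stages.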