| | --- |
| | layout: default |
| | title: Inference Guide |
| | permalink: /inference/ |
| | --- |
| | |
| | # Inference Guide |
| |
|
| | This guide shows how to use CLaRa models for inference at different stages. |
| |
|
| | ## Loading Models |
| |
|
| | CLaRa models can be loaded using the standard `AutoModel` interface: |
| |
|
| | ```python |
| | from transformers import AutoModel |
| | |
| | model = AutoModel.from_pretrained( |
| | "path/to/model", |
| | trust_remote_code=True |
| | ).to('cuda') |
| | ``` |
| |
|
| | ## Stage 1: Compression Pretraining Model |
| |
|
| | Generate paraphrases from compressed document representations. |
| |
|
| | ```python |
| | from transformers import AutoModel |
| | |
| | model = AutoModel.from_pretrained( |
| | "path/to/stage1/model", |
| | trust_remote_code=True |
| | ).to('cuda') |
| | |
| | # Example documents |
| | documents = [ |
| | [ |
| | "Document 1 content...", |
| | "Document 2 content...", |
| | "Document 3 content..." |
| | ] |
| | ] |
| | |
| | questions = ["" for _ in range(len(documents))] |
| | |
| | # Generate paraphrase from compressed representations |
| | output = model.generate_from_paraphrase( |
| | questions=questions, |
| | documents=documents, |
| | max_new_tokens=64 |
| | ) |
| | |
| | print('Generated paraphrase:', output[0]) |
| | ``` |
| |
|
| | ## Stage 2: Compression Instruction Tuning Model |
| |
|
| | Generate answers from compressed representations for QA tasks. |
| |
|
| | ```python |
| | from transformers import AutoModel |
| | |
| | model = AutoModel.from_pretrained( |
| | "path/to/stage2/model", |
| | trust_remote_code=True |
| | ).to('cuda') |
| | |
| | # Example documents and question |
| | documents = [ |
| | [ |
| | "Document 1 content...", |
| | "Document 2 content...", |
| | "Document 3 content..." |
| | ] |
| | ] |
| | |
| | questions = ["Your question here"] |
| | |
| | # Generate answer from compressed representations |
| | output = model.generate_from_text( |
| | questions=questions, |
| | documents=documents, |
| | max_new_tokens=64 |
| | ) |
| | |
| | print('Generated answer:', output[0]) |
| | ``` |
| |
|
| | ## Stage 3: End-to-End (CLaRa) Model |
| |
|
| | Generate answers with retrieval and reranking using joint optimization. |
| |
|
| | ```python |
| | from transformers import AutoModel |
| | |
| | model = AutoModel.from_pretrained( |
| | "path/to/stage3/model", |
| | trust_remote_code=True |
| | ).to('cuda') |
| | |
| | # Example documents and question |
| | # Note: Stage 3 supports retrieval with multiple candidate documents |
| | documents = [ |
| | ["Document 1 content..." for _ in range(20)] # 20 candidate documents |
| | ] |
| | |
| | questions = ["Your question here"] |
| | |
| | # Generate answer with retrieval and reranking |
| | # The top-k is decided by generation_top_k in config.json |
| | output, topk_indices = model.generate_from_questions( |
| | questions=questions, |
| | documents=documents, |
| | max_new_tokens=64 |
| | ) |
| | |
| | print('Generated answer:', output[0]) |
| | print('Top-k selected document indices:', topk_indices) |
| | ``` |
| |
|
| | ## Key Parameters |
| |
|
| | - `max_new_tokens`: Maximum number of tokens to generate (default: 128) |
| | - `generation_top_k`: Number of top documents to select (configured in model config) |
| |
|
| | ## Model Methods |
| |
|
| | - `generate_from_paraphrase()` - Stage 1: Generate paraphrases |
| | - `generate_from_text()` - Stage 2: Generate answers from compressed docs |
| | - `generate_from_questions()` - Stage 3: Generate with retrieval and reranking |
| |
|
| |
|
| |
|
| |
|