---
license: llama2
language:
- en
library_name: transformers
tags:
- llama2
- self-rag
- retrieval-augmented-generation
- rag
- text-generation
base_model:
- meta-llama/Llama-2-7b-hf
- facebook/contriever-msmarco
pipeline_tag: text-generation
datasets:
- selfrag/selfrag_train_data
metrics:
- accuracy
- exact-match
---
# Self-RAG Model Repository

This directory holds the pretrained model weights used for Self-RAG (Self-Reflective Retrieval-Augmented Generation).

> **Note**: All model weights here are the official pretrained models, **downloaded directly from the HuggingFace Hub**. They were **not** trained locally.

---

## Directory Structure

```
models/
├── README.md               ← this file
├── selfrag_llama2_7b/      ← Self-RAG 7B generator model (fine-tuned from Llama-2-7B)
│   ├── pytorch_model-*.bin ← model weights (~13.5GB)
│   ├── config.json         ← model config (vocab_size=32016, including the reflection special tokens)
│   ├── tokenizer.model     ← SentencePiece tokenizer
│   ├── added_tokens.json   ← added reflection tokens ([Retrieval], [Relevant], [Fully supported], etc.)
│   └── ...
└── contriever-msmarco/     ← Contriever-MSMARCO retriever model (Self-RAG's original default retriever)
    ├── pytorch_model.bin
    ├── config.json
    └── ...
```
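
The generator's `config.json` declares `vocab_size=32016`, and the Llama-2 base vocabulary has 32000 entries. That makes it possible to sanity-check a downloaded copy by comparing the two JSON files listed in the tree. The sketch below does this. The helper names `vocab_report` and `load_report` are hypothetical, and the code assumes the usual Transformers layout, in which `added_tokens.json` maps each added token string to its integer id. For an intact download we would expect `load_report("models/selfrag_llama2_7b")["consistent"]` to be `True`, with the added tokens (presumably 16) listed in id order.

```python
import json
from pathlib import Path

LLAMA2_BASE_VOCAB = 32000  # size of the original Llama-2 SentencePiece vocabulary


def vocab_report(config, added_tokens):
    """Check that vocab_size == base vocab + number of added tokens.

    `config` is the parsed config.json; `added_tokens` is the parsed
    added_tokens.json (token string -> integer id).
    """
    expected = LLAMA2_BASE_VOCAB + len(added_tokens)
    # Added ids should fill the contiguous range right after the base vocab.
    ids_contiguous = sorted(added_tokens.values()) == list(range(LLAMA2_BASE_VOCAB, expected))
    return {
        "vocab_size": config["vocab_size"],
        "expected": expected,
        "consistent": config["vocab_size"] == expected and ids_contiguous,
        "tokens_by_id": sorted(added_tokens, key=added_tokens.get),
    }


def load_report(model_dir):
    """Read config.json and added_tokens.json from a downloaded model directory."""
    model_dir = Path(model_dir)
    config = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
    added = json.loads((model_dir / "added_tokens.json").read_text(encoding="utf-8"))
    return vocab_report(config, added)
```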
---

## Relationship to the GitHub Repositories

### Upstream official repository

- **Repository**: [AkariAsai/self-rag](https://github.com/AkariAsai/self-rag)
- **Paper**: [Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection](https://arxiv.org/abs/2310.11511) (ICLR 2024, Oral, top 1%)
- **Authors**: Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi

### Our fork

- **Repository**: [1571859588/self-rag](https://github.com/1571859588/self-rag)
- **Relationship to upstream**: forked from `AkariAsai/self-rag`. The `MAEDA` branch adds integration with the **MAEDA Benchmark** (question answering in the OpenROAD EDA domain).
- **Main extensions**:
  - Retrieval with the BGE embedding model (replacing the original Contriever), for a fair evaluation on the OpenROAD knowledge base
  - Data-format conversion, post-processing, and evaluation pipelines for MAEDA
  - Fixes for vLLM 0.8+ compatibility issues

### Local code location

In the MAEDA project, Self-RAG serves as a **baseline**. The code lives at:

```
/mnt/public/sichuan_a/nyt/MAEDA/huada-docqa-demo/huada-docqa-demo/experiments/regression_test/MAEDA-DATE26/baselines/self-rag/
```

This directory is a clone of the `1571859588/self-rag` repository (checked out on the `MAEDA` branch). Its git remotes are configured as:

- `origin` → `git@github.com:AkariAsai/self-rag.git` (the official upstream repository)
- `upstream` → `git@github.com:1571859588/self-rag.git` (our fork)
---

## Downloading the Models

### selfrag_llama2_7b (generator)

Download from the HuggingFace Hub with `git clone` (requires `git-lfs`):

```bash
# Install git-lfs (if not already installed)
apt install git-lfs   # or: conda install git-lfs
git lfs install
# Clone the model
git clone https://huggingface.co/selfrag/selfrag_llama2_7b \
    /mnt/public/sichuan_a/nyt/models/Self-RAG/models/selfrag_llama2_7b
```

- **HuggingFace page**: [selfrag/selfrag_llama2_7b](https://huggingface.co/selfrag/selfrag_llama2_7b)
- **Base model**: `meta-llama/Llama-2-7b-hf`
- **Architecture**: `LlamaForCausalLM`, vocab_size=32016 (the original 32000 plus 16 reflection special tokens)
- **Precision**: `bfloat16`
- **Size**: ~13.5GB
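
If `git-lfs` was not installed before cloning, `git clone` still succeeds. It leaves small text pointer files in place of the ~13.5GB of weights. The check below is a minimal Python sketch for catching this case. It relies on the fact that Git LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`. The function names and the default glob patterns are our own assumptions and are not part of either repository.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line (Git LFS spec v1).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def looks_like_lfs_pointer(head: bytes) -> bool:
    """True if the first bytes of a file match a Git LFS pointer."""
    return head.startswith(LFS_POINTER_PREFIX)


def is_lfs_pointer(path) -> bool:
    """Read just enough bytes from `path` to recognise a pointer file."""
    with open(path, "rb") as f:
        return looks_like_lfs_pointer(f.read(len(LFS_POINTER_PREFIX)))


def find_lfs_pointers(model_dir, patterns=("*.bin", "*.model", "*.safetensors")):
    """List files under `model_dir` that are still un-fetched LFS pointers.

    The default patterns are an assumption covering weights and the tokenizer.
    """
    root = Path(model_dir)
    found = set()
    for pattern in patterns:
        for p in root.glob(pattern):
            if p.is_file() and is_lfs_pointer(p):
                found.add(p)
    return sorted(found)
```

For a complete download, `find_lfs_pointers("/mnt/public/sichuan_a/nyt/models/Self-RAG/models/selfrag_llama2_7b")` should return an empty list. Any paths it does return can usually be fetched by running `git lfs pull` inside the clone once `git-lfs` is installed.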
### contriever-msmarco (retriever)

```bash
git clone https://huggingface.co/facebook/contriever-msmarco \
    /mnt/public/sichuan_a/nyt/models/Self-RAG/models/contriever-msmarco
```

- **HuggingFace page**: [facebook/contriever-msmarco](https://huggingface.co/facebook/contriever-msmarco)
- **Purpose**: the retriever used in the original Self-RAG paper (in the MAEDA evaluation we use BGE instead)

> **Note**: The 13B model is also available on HuggingFace: [selfrag/selfrag_llama2_13b](https://huggingface.co/selfrag/selfrag_llama2_13b)

---
## Environment Setup

### Recommended environment (preconfigured Conda environment)

The conda environment already set up for the project can be used directly:

```bash
conda activate huada_docqa_demo_release_v1
```

This environment includes all required dependencies, among them `vllm`, `torch`, `transformers`, `spacy`, `sentence_transformers`, and `faiss`.

### Creating an environment from scratch

#### Option 1: environment.yml (original Self-RAG environment)

```bash
cd /mnt/public/sichuan_a/nyt/MAEDA/huada-docqa-demo/huada-docqa-demo/experiments/regression_test/MAEDA-DATE26/baselines/self-rag
# Create the conda environment
conda env create -f environment.yml
# Activate it
conda activate selfrag
# Install flash-attn (requires a CUDA environment)
pip3 install flash-attn==2.3.6
# Install faiss-gpu
conda install -c conda-forge faiss-gpu
```

- Python: 3.8
- PyTorch: 2.1.2
- vLLM: 0.2.6 (the original version; the MAEDA pipeline needs 0.8+ together with the compatibility patches)
- CUDA: 12.1

#### Option 2: requirements.txt

```bash
pip install -r requirements.txt
```

### Key dependency versions

| Package | Version | Notes |
|---|---|---|
| `vllm` | 0.2.6 (original) / 0.8+ (MAEDA pipeline) | Inference acceleration engine |
| `torch` | 2.1.2 | |
| `transformers` | 4.36.2 | |
| `deepspeed` | 0.12.6 | Used for training |
| `flash-attn` | 2.3.6 | Attention optimization |
| `faiss-gpu` | - | Vector retrieval (required by the MAEDA pipeline) |
| `sentence_transformers` | - | BGE embedding model (required by the MAEDA pipeline) |
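
Before running anything, you can confirm that the active environment matches the pins above by reading installed versions with the standard-library `importlib.metadata`. The sketch below is illustrative, and the helper names are hypothetical. `EXPECTED` holds the original Self-RAG pins from the table, so `vllm` will be reported as a mismatch in the MAEDA environment, which needs 0.8+. Local build suffixes such as `+cu121` are ignored. `compare_versions(EXPECTED)` returns an empty dict when everything matches.

```python
from importlib.metadata import PackageNotFoundError, version

# Original Self-RAG pins from the dependency table (vllm differs for MAEDA).
EXPECTED = {
    "vllm": "0.2.6",
    "torch": "2.1.2",
    "transformers": "4.36.2",
    "deepspeed": "0.12.6",
    "flash-attn": "2.3.6",
}


def installed_version(pkg):
    """Installed distribution version, or None if the package is absent."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None


def compare_versions(expected, lookup=installed_version):
    """Return {package: (expected, installed)} for every package that differs."""
    mismatches = {}
    for pkg, want in expected.items():
        have = lookup(pkg)
        # Drop local build tags such as "+cu121" before comparing.
        base = have.split("+")[0] if have else None
        if base != want:
            mismatches[pkg] = (want, have)
    return mismatches
```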
---

## Training Code and Data

> **Note**: The models in this directory are **official pretrained weights downloaded directly from HuggingFace**; we did not train anything locally. The official training procedure is summarized below for reference.

### Official training data

- **Training dataset**: [selfrag/selfrag_train_data](https://huggingface.co/datasets/selfrag/selfrag_train_data) (150K training instances)
- **Google Drive mirror**: [download link](https://drive.google.com/file/d/10G_FozUV4u27EX0NjwVe-3YMUMeTwuLk/view?usp=share_link)

### Official training procedure

Self-RAG is trained in four steps:

1. **Critic data generation**: GPT-4 generates training data for the reflection tokens.
   - Code: `data_creation/critic/`
   - Pre-generated data: [download](https://drive.google.com/file/d/1IN1XcIOYtRIGWITJ4LKRgfITT-uUwk_W/view?usp=share_link)
2. **Critic model training**: Llama-2-7B is fine-tuned to predict reflection tokens.
   ```bash
   cd data_creation
   torchrun --nproc_per_node=2 --master_port=2568 train_special_tokens.py \
     --model_name_or_path meta-llama/Llama-2-7b-hf \
     --data_path PATH_TO_TRAIN_DATA_FILE \
     --bf16 True --output_dir PATH_TO_CRITIC_MODEL \
     --num_train_epochs 3 --per_device_train_batch_size 1 \
     --gradient_accumulation_steps 8 --learning_rate 2e-5 \
     --lr_scheduler_type cosine --fsdp "full_shard auto_wrap"
   ```
3. **Generator data generation**: the critic and the retriever are used to create training data for the generator.
   - Code: `data_creation/generator/`
4. **Generator model training**: the generator is trained with DeepSpeed.
   ```bash
   cd retrieval_lm
   bash script_finetune_7b.sh
   ```
   - Hardware: 8× A100 40GB (7B model) / 4× A100 80GB (13B model)
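
The critic command in step 2 trains with FSDP, which is a form of data parallelism. Each optimizer step therefore sees `nproc_per_node × per_device_train_batch_size × gradient_accumulation_steps` sequences, which is 2 × 1 × 8 = 16 with the values shown. The hypothetical helpers below spell out that product for checking alternative launch settings. For example, keeping the global batch at 16 on 8 GPUs would mean lowering `--gradient_accumulation_steps` to 2.

```python
def global_batch_size(nproc_per_node, per_device_batch, grad_accum_steps, nnodes=1):
    """Sequences per optimizer step under data parallelism (FSDP or DDP)."""
    return nnodes * nproc_per_node * per_device_batch * grad_accum_steps


def accum_steps_for(target_global_batch, nproc_per_node, per_device_batch, nnodes=1):
    """Gradient-accumulation steps needed to reach a target global batch size."""
    per_step = nnodes * nproc_per_node * per_device_batch
    if target_global_batch % per_step:
        raise ValueError("target batch is not a multiple of the per-step batch")
    return target_global_batch // per_step
```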
---

## Usage

### Quick inference test

```python
from vllm import LLM, SamplingParams

# Load the model
model = LLM(
    "/mnt/public/sichuan_a/nyt/models/Self-RAG/models/selfrag_llama2_7b",
    dtype="half",
)
sampling_params = SamplingParams(
    temperature=0.0, top_p=1.0,
    max_tokens=100,
    skip_special_tokens=False,  # keep reflection tokens in the decoded output
)


def format_prompt(instruction, paragraph=None):
    prompt = "### Instruction:\n{0}\n\n### Response:\n".format(instruction)
    if paragraph is not None:
        prompt += "[Retrieval]<paragraph>{0}</paragraph>".format(paragraph)
    return prompt


# A query that does not need retrieval
query = "Leave odd one out: twitter, instagram, whatsapp."
preds = model.generate([format_prompt(query)], sampling_params)
print(preds[0].outputs[0].text)

# A query that needs retrieval (the retrieved passage is inserted into the prompt)
query = "Can you tell me the difference between llamas and alpacas?"
paragraph = "The alpaca (Lama pacos) is a species of South American camelid mammal..."
preds = model.generate([format_prompt(query, paragraph)], sampling_params)
print(preds[0].outputs[0].text)
```
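
The MAEDA configuration uses `ndocs=5`, so several passages are retrieved per question. As far as we can tell, the upstream short-form code does not concatenate these passages into a single prompt. Instead it builds one prompt per passage and lets the model's reflection tokens rank the resulting candidates. Below is a sketch of that prompt construction using the same template as the quick test. The `title`/`text` field names and the `candidate_prompts` helper are assumptions for illustration.

```python
def format_prompt(query, paragraph=None):
    """Same instruction template as the quick inference test."""
    prompt = "### Instruction:\n{0}\n\n### Response:\n".format(query)
    if paragraph is not None:
        prompt += "[Retrieval]<paragraph>{0}</paragraph>".format(paragraph)
    return prompt


def candidate_prompts(query, passages, ndocs=5):
    """Build one prompt per retrieved passage (top `ndocs` only).

    `passages` is assumed to be a list of dicts with "title" and "text" keys.
    """
    return [
        format_prompt(query, "{0}\n{1}".format(p["title"], p["text"]))
        for p in passages[:ndocs]
    ]
```

The returned list can be passed directly to `model.generate(prompts, sampling_params)`, which runs all candidates in one batched call.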
### MAEDA Benchmark evaluation (full pipeline)

In the MAEDA project, Self-RAG is evaluated as a baseline, with the BGE retriever replacing the original Contriever:

```bash
# Enter the working directory
cd /mnt/public/sichuan_a/nyt/MAEDA/huada-docqa-demo/huada-docqa-demo/experiments/regression_test/MAEDA-DATE26/baselines/self-rag
# Activate the environment
conda activate huada_docqa_demo_release_v1
# Run the full pipeline (5 steps: retrieve → convert → inference → post-process → evaluate)
CUDA_VISIBLE_DEVICES=4 bash run_selfrag_maeda.sh
# Or run a single step
bash run_selfrag_maeda.sh retrieve      # Step 1: BGE retrieval
bash run_selfrag_maeda.sh convert       # Step 2: data-format conversion
bash run_selfrag_maeda.sh inference     # Step 3: Self-RAG inference
bash run_selfrag_maeda.sh postprocess   # Step 4: post-processing
bash run_selfrag_maeda.sh eval          # Step 5: MAEDA evaluation
```

### Pipeline steps

| Step | Script | Description |
|------|--------|-------------|
| 1. Retrieval | `retrieve_with_bge.py` | Builds a FAISS index over the OpenROAD knowledge base with BGE-large-en-v1.5 and retrieves from it |
| 2. Conversion | `convert_maeda_to_selfrag.py` | Converts the MAEDA benchmark JSON into the Self-RAG input format |
| 3. Inference | `retrieval_lm/run_short_form.py` | Runs adaptive-retrieval inference with the Self-RAG 7B model |
| 4. Post-processing | `postprocess_selfrag_output.py` | Removes reflection tokens and formats the output as MAEDA evaluator input |
| 5. Evaluation | `run_eval.py` | Uses an external LLM API as a judge to assess answer quality |
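
Step 4 removes the reflection markup before answers are sent to the judge. The regex-based sketch below illustrates the idea. The token strings are our assumption, based on the reflection tokens described for Self-RAG, and should be checked against `added_tokens.json`. `strip_reflection` is a hypothetical name, not the function used in `postprocess_selfrag_output.py`. Tokens are replaced with a space rather than deleted, so words on either side of a token are not glued together.

```python
import re

# Echoed retrieval passages, if any appear in the generated text.
_PARAGRAPH = re.compile(r"<paragraph>.*?</paragraph>", re.DOTALL)
# Assumed reflection-token strings (compare with added_tokens.json).
_REFLECTION = re.compile(
    r"\[(?:No Retrieval|Retrieval|Continue to Use Evidence|Irrelevant|Relevant|"
    r"Fully supported|Partially supported|No support / Contradictory|Utility:[1-5])\]"
)


def strip_reflection(text):
    """Drop passages, reflection tokens and the EOS marker; tidy whitespace."""
    text = _PARAGRAPH.sub(" ", text)
    text = _REFLECTION.sub(" ", text)
    text = text.replace("</s>", "")
    return re.sub(r"\s+", " ", text).strip()
```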
### Key configuration

- **Generator model path**: `/mnt/public/sichuan_a/nyt/models/Self-RAG/models/selfrag_llama2_7b`
- **BGE embedding model path**: `/mnt/public/sichuan_a/nyt/models/RAG-EDA/models/finetuned-models/embedding/bge-large-en-v1.5/output_flagembedding`
- **Inference parameters**: `mode=adaptive_retrieval`, `ndocs=5`, `threshold=0.2`, `max_new_tokens=300`
- **GPU requirement**: at least 15GB of GPU memory for inference
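
Our reading of the upstream `run_short_form.py` is that, in adaptive retrieval mode, `threshold=0.2` is compared against a normalized probability. That probability is P(`[Retrieval]`) divided by P(`[Retrieval]`) + P(`[No Retrieval]`), taken at the first generated position. This would also explain why per-token logprobs over the full vocabulary are requested (compare the `max_logprobs=32016` change listed under the vLLM compatibility changes). Below is a numerically stable sketch of that decision; the function name is hypothetical.

```python
import math


def retrieval_decision(logprob_retrieval, logprob_no_retrieval, threshold=0.2):
    """Return (should_retrieve, score) from the two first-token log-probs.

    score = P([Retrieval]) / (P([Retrieval]) + P([No Retrieval])).
    """
    # Shift by the max so exp() cannot underflow to 0 for both terms.
    m = max(logprob_retrieval, logprob_no_retrieval)
    p_ret = math.exp(logprob_retrieval - m)
    p_no = math.exp(logprob_no_retrieval - m)
    score = p_ret / (p_ret + p_no)
    return score > threshold, score
```

For example, log-probabilities of `math.log(0.1)` and `math.log(0.3)` give a score of 0.25. That exceeds 0.2, so the model would retrieve.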
### Resuming interrupted inference

If inference is interrupted (OOM, a vLLM crash, etc.), progress is saved to a temporary file, and the run can be resumed with `--resume_file`:

```bash
CUDA_VISIBLE_DEVICES=4 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_USE_V1=0 \
python3 retrieval_lm/run_short_form.py \
    --model_name /mnt/public/sichuan_a/nyt/models/Self-RAG/models/selfrag_llama2_7b \
    --input_file maeda_selfrag_data/selfrag_input.json \
    --mode adaptive_retrieval --max_new_tokens 300 --threshold 0.2 \
    --output_file results/selfrag_raw_results.json \
    --metric match --ndocs 5 --use_groundness --use_utility --use_seqscore \
    --dtype half --world_size 1 \
    --resume_file results/selfrag_raw_results.json_tmp
```
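
The modified `run_short_form.py` defines the exact format of `selfrag_raw_results.json_tmp`, and that format is not documented here. The sketch below only illustrates the general checkpoint-and-resume pattern that such a feature relies on. The file layout (a JSON list of finished results, in input order) and all function names are hypothetical. Writing to a side file and then renaming it with `os.replace` keeps the checkpoint intact even if the process dies mid-write.

```python
import json
import os


def save_progress(path, results):
    """Write finished results to a side file, then atomically rename it."""
    tmp = path + ".part"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(results, f)
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def load_progress(path):
    """Return previously finished results, or [] when starting fresh."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def run_with_resume(items, infer, progress_path, save_every=10):
    """Process `items` in order, skipping those already in the progress file."""
    results = load_progress(progress_path)
    for i in range(len(results), len(items)):
        results.append(infer(items[i]))
        if (i + 1) % save_every == 0:
            save_progress(progress_path, results)
    save_progress(progress_path, results)
    return results
```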
---

## MAEDA Evaluation Results (BGE Retrieval)

| Metric | Agent | Value |
|--------|-------|-------|
| Accuracy (judge=True) | - | 3.33% (10/300) |
| Retrieval error | Ret-Agent | 30.3% (91/300) |
| Missing key points | SC-Agent | 85.3% (256/300) |
| Contradiction | SC-Agent | 13.0% (39/300) |
| False refusal | RF-Agent | 0.0% (0/300) |
| Failure to refuse | RF-Agent | 0.0% (0/300) |
| Instruction hallucination | Halluc-Agent | 23.0% (69/300) |
| Example hallucination | Halluc-Agent | 11.0% (33/300) |

> Results directory: `experiments/regression_test/MAEDA-DATE26/baselines/self-rag/results/`
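
Each percentage in the results table is a count divided by the 300 evaluated questions. The short check below recomputes the percentages from the counts. It also shows that the error counts alone add up to 488, which is more than 300. The error categories are therefore not mutually exclusive: a single answer can be flagged by more than one agent.

```python
TOTAL = 300  # number of evaluated questions (the denominator in every row)

ACCURATE = 10  # judge=True
# Error counts copied from the results table.
ERRORS = {
    "Retrieval error": 91,
    "Missing key points": 256,
    "Contradiction": 39,
    "False refusal": 0,
    "Failure to refuse": 0,
    "Instruction hallucination": 69,
    "Example hallucination": 33,
}


def pct(count, total=TOTAL):
    """Percentage of `total`, rounded to two decimals."""
    return round(100 * count / total, 2)


accuracy = pct(ACCURATE)
error_rates = {name: pct(n) for name, n in ERRORS.items()}
```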
---

## vLLM Compatibility Changes

To support vLLM 0.8+, the following changes were made to `retrieval_lm/run_short_form.py`:

- Added a `_get_logprob()` helper (vLLM ≥ 0.8 returns `Logprob` objects rather than `float` values)
- Added the `max_logprobs=32016` argument (vLLM ≥ 0.8 limits logprobs to 20 by default)
- Added `enforce_eager=True` (avoids msgspec serialization errors related to `torch.compile`)
- Added the `--start_from` and `--resume_file` arguments (to resume after an interruption)
- Set `VLLM_USE_V1=0` (disables the V1 engine to avoid a serialization bug with large logprobs)
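
The first change is needed because the type of the per-position logprob entries changed across vLLM releases. Older versions such as 0.2.x store plain floats in the `{token_id: logprob}` dicts. Newer ones store `Logprob` objects that carry the value in a `.logprob` attribute. The fork's actual `_get_logprob()` is not reproduced here. The following is a minimal sketch of a version-agnostic accessor that uses duck typing, so it runs without vLLM installed. The names and the `-100.0` fallback for token ids missing from the dict are illustrative assumptions.

```python
def get_logprob(entry):
    """Return a float log-probability from either vLLM return style.

    Older vLLM: the dict value is already a float.
    Newer vLLM: the value is a Logprob object with a `.logprob` attribute.
    """
    return float(entry.logprob) if hasattr(entry, "logprob") else float(entry)


def token_logprob(position_logprobs, token_id, missing=-100.0):
    """Look up one token id in a {token_id: float | Logprob} dict.

    `missing` is a hypothetical fallback for ids absent from the dict.
    """
    entry = position_logprobs.get(token_id)
    return missing if entry is None else get_logprob(entry)
```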
---

## Citation

```bibtex
@inproceedings{asai2024selfrag,
  author={Asai, Akari and Wu, Zeqiu and Wang, Yizhong and Sil, Avirup and Hajishirzi, Hannaneh},
  title={Self-{RAG}: Learning to Retrieve, Generate, and Critique through Self-Reflection},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=hSyW5go0v8}
}
```