Commit 9a4828d
Parent(s): 4af5c8c
Add Weaviate index builder Gradio app
Co-authored-by: Cursor <cursoragent@cursor.com>
- HF_SPACE_SETUP.md +106 -0
- README.md +74 -50
- app.py +209 -310
- requirements.txt +5 -2
HF_SPACE_SETUP.md
ADDED
|
@@ -0,0 +1,106 @@
+# Building the Weaviate index on a Hugging Face Space
+
+## Overview
+
+Because the local network may run into SSL connection problems, the index build can run on a Hugging Face Space instead, which usually has a more stable network.
+
+## Steps
+
+### 1. Prepare the GenAICoursesDB Space
+
+If you have not created this Space yet:
+
+1. Go to https://huggingface.co/spaces
+2. Click "Create new Space"
+3. Configure:
+   - **Space name**: `GenAICoursesDB` (or any name you prefer)
+   - **SDK**: `Gradio`
+   - **Hardware**: `CPU basic` (sufficient)
+   - **Visibility**: `Public` or `Private`
+
+### 2. Upload the code and files
+
+#### Option A: via Git (recommended)
+
+```bash
+# Clone your Space (if you have not already)
+git clone https://huggingface.co/spaces/YOUR_USERNAME/GenAICoursesDB
+cd GenAICoursesDB
+
+# Copy files from the local project
+cp /path/to/AI_Agent_Clare-main/hf_space/GenAICoursesDB_space/app.py .
+cp /path/to/AI_Agent_Clare-main/hf_space/GenAICoursesDB_space/requirements.txt .
+cp /path/to/AI_Agent_Clare-main/hf_space/GenAICoursesDB_space/README.md .
+
+# Upload GENAI COURSES (using Git LFS, since the files may be large)
+git lfs install
+git lfs track "GENAI COURSES/**"
+cp -r /path/to/AI_Agent_Clare-main/GENAI\ COURSES .
+
+git add .
+git commit -m "Add Weaviate index builder app"
+git push
+```
+
+#### Option B: upload through the web interface
+
+1. Open your Space page
+2. Click the "Files" tab
+3. Upload the following files:
+   - `app.py`
+   - `requirements.txt`
+   - `README.md`
+   - the `GENAI COURSES` folder (it may need to be compressed into a zip)
+
+### 3. Configure Secrets
+
+Go to Space Settings → Secrets and add:
+
+| Secret name | Value | Description |
+|------------|-----|------|
+| `OPENAI_API_KEY` | `your-openai-api-key` | OpenAI API Key |
+| `WEAVIATE_URL` | `https://iydyvd4wqnekotfiftma.c0.us-west3.gcp.weaviate.cloud` | Weaviate Cloud REST endpoint |
+| `WEAVIATE_API_KEY` | `your-weaviate-api-key` | Weaviate API Key |
+| `WEAVIATE_COLLECTION` | `GenAICourses` | Collection name (optional, default value) |
+| `EMBEDDING_PROVIDER` | `openai` | Embedding provider (optional, default value) |
+
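As a hedged illustration of the Secrets listed above, the sketch below reports which required settings are missing before a build starts. The helper name `missing_secrets` is hypothetical, and treating `OPENAI_API_KEY` as required only when `EMBEDDING_PROVIDER` is `openai` is an assumption of this sketch; the `app.py` in this commit checks `OPENAI_API_KEY` unconditionally.

```python
import os
from typing import List, Mapping


def missing_secrets(env: Mapping[str, str]) -> List[str]:
    """Return the names of required settings that are unset or blank.

    Hypothetical helper: the provider-dependent rule is an assumption,
    not the behaviour of the committed app.py.
    """
    provider = (env.get("EMBEDDING_PROVIDER") or "openai").strip().lower()
    required = ["WEAVIATE_URL", "WEAVIATE_API_KEY"]
    if provider == "openai":
        # Only the OpenAI embedding path needs an OpenAI key in this sketch.
        required.insert(0, "OPENAI_API_KEY")
    return [name for name in required if not (env.get(name) or "").strip()]


if __name__ == "__main__":
    missing = missing_secrets(os.environ)
    print("missing:", ", ".join(missing) if missing else "none")
```

Running it inside the Space (or locally with the same variables exported) gives a quick check that the Secret names match exactly before any embedding cost is incurred.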
+### 4. Run the index build
+
+1. The Space builds and starts automatically
+2. Open the Space page; the Gradio interface appears
+3. Click the "🚀 开始构建索引" (Start building the index) button
+4. Wait for the build to finish (this may take 5-15 minutes)
+
+### 5. Verify the result
+
+When the build finishes, the interface shows the following (in Chinese: "Index built successfully! Current object count = [count]"):
+```
+✅ 索引构建成功!
+当前 object count = [count]
+```
+
+You can also verify in the Weaviate Console:
+1. Open your Weaviate Cloud Console
+2. Inspect the `GenAICourses` collection
+3. Confirm that the object count matches the build result
+
+## Advantages
+
+✅ **Stable network**: the HF Space network is usually more stable than a local one
+✅ **No download needed**: embedding and upload happen directly on the HF Space
+✅ **Easy to use**: Gradio interface, one-click operation
+✅ **Live progress**: build progress and status are visible
+
+## Notes
+
+⚠️ **File size**: if the `GENAI COURSES` folder is large (>1GB), Git LFS is recommended
+⚠️ **Build time**: 768 document chunks take roughly 5-15 minutes
+⚠️ **API cost**: using the OpenAI API incurs a charge (about $0.01-0.05)
+
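The cost range quoted above can be sanity-checked with a small calculation. This is a rough sketch only: the price of $0.02 per million tokens for `text-embedding-3-small` and the average chunk size are assumptions rather than figures from this commit, so check current OpenAI pricing before relying on the result.

```python
def embedding_cost_usd(
    num_chunks: int,
    avg_tokens_per_chunk: int,
    usd_per_million_tokens: float = 0.02,  # assumed price, not from the commit
) -> float:
    """Estimate the embedding cost for a batch of document chunks."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    # 768 chunks is the figure quoted in the notes; chunk sizes are guesses.
    for avg_tokens in (512, 1024):
        print(avg_tokens, round(embedding_cost_usd(768, avg_tokens), 4))
```

Under these assumptions, 768 chunks of about 1,024 tokens each come to roughly $0.016, which sits inside the quoted range.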
+## Next steps
+
+Once the index is built, the ClareVoice Space can use Weaviate for retrieval directly. Make sure the ClareVoice Space's Secrets also include:
+- `WEAVIATE_URL`
+- `WEAVIATE_API_KEY`
+- `WEAVIATE_COLLECTION`
+- `OPENAI_API_KEY` (used for embedding at query time)
README.md
CHANGED
|
@@ -1,71 +1,95 @@
|
|
-
-title: GenAICoursesDB
-emoji: 🏆
-colorFrom: purple
-colorTo: yellow
-sdk: gradio
-sdk_version: 6.5.1
-app_file: app.py
-pinned: false
----

-

-##

-
-- **Free option**: in the Space's **Settings → Variables**, add `EMBEDDING_PROVIDER` = `huggingface` to use local **sentence-transformers** for embedding at **no OpenAI cost**; queries then return only the retrieved source text (no LLM call). The optional variable `HF_EMBEDDING_MODEL` (default `sentence-transformers/all-MiniLM-L6-v2`) can be changed to, for example, `BAAI/bge-small-en-v1.5`.

-

-
-

-##

-
-- **Free option**: `EMBEDDING_PROVIDER` = `huggingface` (Settings → Variables); indexing and retrieval work without an API key
-- **Optional**: `GENAI_COURSES_DATASET_ID` (default: `claudqunwang/genai-courses-data`)
-- **Optional**: `GENAI_COURSES_DATASET_SUBDIR` (default: `GENAI COURSES`)
-- **Optional**: `HF_EMBEDDING_MODEL` (applies to the free option; default: `sentence-transformers/all-MiniLM-L6-v2`)
-- **Optional (Option A)**: `INDEX_DATASET_ID` (prebuilt index Dataset; when set, the index is downloaded at startup and no embedding is needed)

-##

-

-#

-

-

-1.

-
-```bash
-cd hf_space/GenAICoursesDB_space
-# Match the Space's settings (free embedding recommended)
-export EMBEDDING_PROVIDER=huggingface
-# If using the default Dataset, set INDEX_DATASET_ID=your-username/genai-courses-index
-python build_and_upload_index.py
-```

-
-

-

-

-##

-

-
-
-
-
-

-
+# Weaviate Index Builder (Hugging Face Space edition)

+Embed with the OpenAI API on a Hugging Face Space and upload directly to Weaviate Cloud.

+## 🚀 Quick start

+### 1. Configure Secrets in the Hugging Face Space

+Go to your Space Settings → Secrets and add the following environment variables:

+- **`OPENAI_API_KEY`**: `your-openai-api-key`
+- **`WEAVIATE_URL`**: your Weaviate Cloud REST endpoint (for example: `https://xxx.c0.us-west3.gcp.weaviate.cloud`)
+- **`WEAVIATE_API_KEY`**: your Weaviate API Key
+- **`WEAVIATE_COLLECTION`**: collection name (default: `GenAICourses`)
+- **`EMBEDDING_PROVIDER`**: `openai` or `huggingface` (default: `openai`)

+### 2. Upload the GENAI COURSES folder

+There are two options:

+#### Option A: upload via Git LFS (recommended)

+```bash
+# In the local project directory
+cd hf_space/GenAICoursesDB_space

+# Copy GENAI COURSES into the Space directory
+cp -r ../../GENAI\ COURSES .

+# Commit and push
+git add GENAI\ COURSES
+git commit -m "Add GENAI COURSES for indexing"
+git push
+```

+#### Option B: use the HF Space file upload feature

+1. Open your Space page
+2. Click the "Files" tab
+3. Upload the `GENAI COURSES` folder (you may need to compress it into a zip, upload it, then extract it in the Space)

+### 3. Run the index build

+1. Open your Space page
+2. In the Gradio interface:
+   - Choose whether to clear the old index (recommended)
+   - Click the "🚀 开始构建索引" (Start building the index) button
+   - Wait for the build to finish (this may take a few minutes)

+## 📋 Features

+- ✅ Embeds with OpenAI `text-embedding-3-small`
+- ✅ Automatically reads every document under the `GENAI COURSES` directory (.md, .pdf, .txt, .py, .ipynb, .docx)
+- ✅ Uploads directly to Weaviate Cloud (no download needed)
+- ✅ Shows build progress in real time
+- ✅ Automatically verifies the index build result

+## 🔧 Technical details

+- **Embedding model**: OpenAI `text-embedding-3-small` (1536 dimensions)
+- **Vector database**: Weaviate Cloud
+- **Document processing**: LlamaIndex SimpleDirectoryReader
+- **Interface**: Gradio

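Before starting a build, it can help to confirm that the uploaded folder actually contains files with the extensions listed above. The following is a minimal sketch using only the standard library; the function name `count_course_files` is hypothetical, and matching extensions case-insensitively is a choice of this sketch that may differ from how the reader in `app.py` filters files.

```python
from pathlib import Path
from typing import Dict, Iterable

# Extensions taken from the feature list; the constant name is illustrative.
COURSE_EXTS = (".md", ".pdf", ".txt", ".py", ".ipynb", ".docx")


def count_course_files(root: Path, exts: Iterable[str] = COURSE_EXTS) -> Dict[str, int]:
    """Count files under root (recursively) for each listed extension."""
    wanted = {ext.lower() for ext in exts}
    counts = {ext: 0 for ext in sorted(wanted)}
    for path in root.rglob("*"):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in wanted:
            counts[suffix] += 1
    return counts


if __name__ == "__main__":
    folder = Path("GENAI COURSES")
    if folder.exists():
        print(count_course_files(folder))
    else:
        print(f"folder not found: {folder}")
```

If every count is zero, the build would have nothing to embed, which is one possible cause of an empty collection.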
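+## ⚠️ Notes
+
+1. **File size limits**: Hugging Face Spaces limit file sizes; if `GENAI COURSES` is too large, Git LFS may be required
+2. **Build time**: 768 document chunks take roughly 5-15 minutes, depending on network speed
+3. **Network stability**: the HF Space network is usually more stable than a local one, which suits processing many documents
+4. **Cost**: using the OpenAI API incurs a charge; 768 document chunks cost roughly $0.01-0.05 (depending on document length)
+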
+## 🐛 Troubleshooting
+
+### Error: course directory does not exist
+- Make sure the `GENAI COURSES` folder is uploaded to the Space root directory
+- Check that the folder name is correct (case-sensitive)
+
+### Error: OPENAI_API_KEY is not set
+- Check that `OPENAI_API_KEY` has been added under Space Settings → Secrets
+- Make sure the Secret name matches exactly (case-sensitive)
+
+### Error: Weaviate connection failed
+- Check that `WEAVIATE_URL` is well-formed (it should start with `https://`)
+- Verify that `WEAVIATE_API_KEY` is valid
+- Confirm that the network connection works
+
+### Build succeeds but object count = 0
+- Check that the collection name in the Weaviate Console matches
+- Confirm that you are using the same Weaviate cluster and account
+- Wait a few seconds and check again (there may be a delay)
+
+## 📚 Related documentation

+- `build_weaviate_index.py`: command-line version of the index build script (for local runs)
+- `app.py`: Gradio app (for the HF Space)
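The `WEAVIATE_URL` check from the troubleshooting list can be sketched as a small helper. It mirrors the prefixing that `app.py` in this commit applies (prepend `https://` when the value does not start with `http`); the function name `normalize_weaviate_url` and the extra whitespace and trailing-slash trimming are illustrative assumptions.

```python
def normalize_weaviate_url(raw: str) -> str:
    """Return the URL with an explicit scheme, or raise if the value is empty.

    Hypothetical helper: trimming whitespace and a trailing slash is an
    addition of this sketch, not something the committed app.py does.
    """
    url = raw.strip().rstrip("/")
    if not url:
        raise ValueError("WEAVIATE_URL is empty")
    if not url.startswith("http"):
        # Same rule as app.py: assume HTTPS when no scheme was given.
        url = "https://" + url
    return url


if __name__ == "__main__":
    print(normalize_weaviate_url("xxx.c0.us-west3.gcp.weaviate.cloud"))
```

Failing early on an empty value gives a clearer message than a connection error raised later by the client.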
app.py
CHANGED
|
@@ -1,331 +1,230 @@
|
|
| 1 |
import os
|
| 2 |
-
from pathlib import Path
|
| 3 |
-
from typing import Optional, Tuple
|
| 4 |
-
|
| 5 |
import gradio as gr
|
| 6 |
-
from
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
Settings,
|
| 10 |
-
SimpleDirectoryReader,
|
| 11 |
-
StorageContext,
|
| 12 |
-
VectorStoreIndex,
|
| 13 |
-
load_index_from_storage,
|
| 14 |
-
)
|
| 15 |
-
from llama_index.embeddings.openai import OpenAIEmbedding
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
load_dotenv()
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
DATASET_ID = (os.getenv("GENAI_COURSES_DATASET_ID") or "claudqunwang/genai-courses-data").strip()
|
| 22 |
-
DATASET_SUBDIR = (os.getenv("GENAI_COURSES_DATASET_SUBDIR") or "GENAI COURSES").strip()
|
| 23 |
|
| 24 |
-
#
|
| 25 |
-
|
| 26 |
|
| 27 |
-
#
|
| 28 |
-
|
|
|
|
| 29 |
|
| 30 |
-
#
|
| 31 |
-
|
| 32 |
-
HF_EMBEDDING_MODEL = (os.getenv("HF_EMBEDDING_MODEL") or "sentence-transformers/all-MiniLM-L6-v2").strip()
|
| 33 |
|
| 34 |
|
| 35 |
-
def
|
| 36 |
-
|
| 37 |
-
|
| 38 |
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
raise RuntimeError(
|
| 42 |
-
f"无法加载 Hugging Face 免费 embedding({HF_EMBEDDING_MODEL}):{e!r}\n"
|
| 43 |
-
"请确认已安装: pip install llama-index-embeddings-huggingface sentence-transformers"
|
| 44 |
)
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
key = (os.getenv("OPENAI_API_KEY") or "").strip()
|
| 56 |
-
if not key:
|
| 57 |
-
raise RuntimeError("OPENAI_API_KEY 未设置。请到 Space: Settings → Secrets 添加 OPENAI_API_KEY;或设置 EMBEDDING_PROVIDER=huggingface 使用免费 embedding。")
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
def _get_courses_dir() -> Path:
|
| 61 |
-
"""
|
| 62 |
-
从 Dataset 下载课程文件到本地临时目录,并返回实际目录路径。
|
| 63 |
-
这里**完全绕开 snapshot_download**,避免某些环境下出现的
|
| 64 |
-
“No files found ... GENAI COURSES” 之类缓存异常。
|
| 65 |
-
|
| 66 |
-
实现思路:
|
| 67 |
-
1. 用 HfApi.list_repo_files 列出 Dataset 中的所有文件路径;
|
| 68 |
-
2. 过滤出属于 DATASET_SUBDIR 下的文件;
|
| 69 |
-
3. 通过 hf_hub_download 逐个拉到 /tmp/genai_courses_data,并还原子目录结构。
|
| 70 |
-
"""
|
| 71 |
-
api = HfApi()
|
| 72 |
-
try:
|
| 73 |
-
all_files = api.list_repo_files(repo_id=DATASET_ID, repo_type="dataset")
|
| 74 |
-
except Exception as e:
|
| 75 |
-
raise RuntimeError(f"无法列出 Dataset 文件({DATASET_ID}):{e!r}")
|
| 76 |
-
|
| 77 |
-
if not all_files:
|
| 78 |
-
raise RuntimeError(f"Dataset {DATASET_ID!r} 为空,请确认上传了课程文件。")
|
| 79 |
-
|
| 80 |
-
# 归一化子目录名,兼容空格/大小写差异
|
| 81 |
-
sub_norm = "".join(DATASET_SUBDIR.strip().lower().split("/")).replace(" ", "")
|
| 82 |
-
|
| 83 |
-
def _belongs_to_subdir(path: str) -> bool:
|
| 84 |
-
# path 形如 "GENAI COURSES/Module 1/...docx"
|
| 85 |
-
if "/" not in path:
|
| 86 |
-
return False
|
| 87 |
-
top = path.split("/", 1)[0]
|
| 88 |
-
top_norm = "".join(top.strip().lower().split("/")).replace(" ", "")
|
| 89 |
-
return top_norm == sub_norm
|
| 90 |
-
|
| 91 |
-
course_files = [p for p in all_files if _belongs_to_subdir(p)]
|
| 92 |
-
if not course_files:
|
| 93 |
-
raise RuntimeError(
|
| 94 |
-
"在 Dataset 中没有找到课程子目录。\n"
|
| 95 |
-
f"- Dataset: {DATASET_ID!r}\n"
|
| 96 |
-
f"- 期望子目录: {DATASET_SUBDIR!r}\n"
|
| 97 |
-
f"- 实际顶层内容示例: {[p.split('/',1)[0] for p in all_files[:20]]!r}"
|
| 98 |
)
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
for rel_path in course_files:
|
| 104 |
-
# 将文件下载到对应的本地路径(保持目录结构)
|
| 105 |
-
local_path = local_root / rel_path
|
| 106 |
-
local_path.parent.mkdir(parents=True, exist_ok=True)
|
| 107 |
try:
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
)
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
#
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
PERSIST_DIR.mkdir(parents=True, exist_ok=True)
|
| 149 |
-
from huggingface_hub import snapshot_download
|
| 150 |
-
snapshot_download(
|
| 151 |
-
repo_id=INDEX_DATASET_ID,
|
| 152 |
-
repo_type="dataset",
|
| 153 |
-
local_dir=str(PERSIST_DIR),
|
| 154 |
-
local_dir_use_symlinks=False,
|
| 155 |
-
)
|
| 156 |
-
storage_context = StorageContext.from_defaults(persist_dir=str(PERSIST_DIR))
|
| 157 |
-
idx = load_index_from_storage(storage_context)
|
| 158 |
-
print(f"[index] 已从预构建索引加载: {INDEX_DATASET_ID}")
|
| 159 |
-
return True, idx
|
| 160 |
except Exception as e:
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
try:
|
| 183 |
-
import shutil
|
| 184 |
-
shutil.rmtree(PERSIST_DIR)
|
| 185 |
-
except Exception:
|
| 186 |
-
pass
|
| 187 |
-
PERSIST_DIR.mkdir(parents=True, exist_ok=True)
|
| 188 |
-
|
| 189 |
-
courses_dir = _get_courses_dir()
|
| 190 |
-
print(f"[index] building from: {courses_dir}")
|
| 191 |
-
|
| 192 |
-
reader = SimpleDirectoryReader(
|
| 193 |
-
input_dir=str(courses_dir),
|
| 194 |
-
recursive=True,
|
| 195 |
-
required_exts=[".md", ".pdf", ".txt", ".py", ".ipynb", ".docx"],
|
| 196 |
)
|
| 197 |
-
|
| 198 |
-
|
| 199 |
-
|
| 200 |
-
|
| 201 |
-
|
| 202 |
-
|
| 203 |
-
|
| 204 |
-
|
| 205 |
-
|
| 206 |
-
|
| 207 |
-
|
| 208 |
-
|
| 209 |
-
|
| 210 |
-
|
| 211 |
-
|
| 212 |
-
|
| 213 |
-
|
| 214 |
-
|
| 215 |
-
|
| 216 |
-
|
| 217 |
-
|
| 218 |
-
|
| 219 |
-
|
| 220 |
-
|
| 221 |
-
|
| 222 |
-
|
| 223 |
-
|
| 224 |
-
|
| 225 |
-
|
| 226 |
-
|
| 227 |
-
|
| 228 |
-
|
| 229 |
-
|
| 230 |
-
|
| 231 |
-
|
| 232 |
-
|
| 233 |
-
|
| 234 |
-
|
| 235 |
-
|
| 236 |
-
|
| 237 |
-
|
| 238 |
-
|
| 239 |
-
|
| 240 |
-
|
| 241 |
-
|
| 242 |
-
global INDEX, INDEX_ERR
|
| 243 |
-
if not question or not question.strip():
|
| 244 |
-
return "请先输入一个问题。"
|
| 245 |
-
|
| 246 |
-
if rebuild or INDEX is None:
|
| 247 |
-
try:
|
| 248 |
-
INDEX = get_index(force_rebuild=True)
|
| 249 |
-
INDEX_ERR = None
|
| 250 |
-
except Exception as e:
|
| 251 |
-
INDEX = None
|
| 252 |
-
INDEX_ERR = repr(e)
|
| 253 |
-
|
| 254 |
-
if INDEX is None:
|
| 255 |
-
return f"索引不可用:{INDEX_ERR or 'unknown error'}"
|
| 256 |
-
|
| 257 |
-
# 使用免费 HuggingFace embedding 时,用 Retriever 直接检索,不创建 QueryEngine,避免触发 Settings.llm(无需安装 llama-index-llms-openai)
|
| 258 |
-
if EMBEDDING_PROVIDER == "huggingface":
|
| 259 |
-
nodes = _retrieve_nodes(question, top_k=5)
|
| 260 |
-
parts = [node.get_content() for node in nodes]
|
| 261 |
-
return "---\n\n".join(parts) if parts else "未检索到相关内容。"
|
| 262 |
-
qe = INDEX.as_query_engine()
|
| 263 |
-
resp = qe.query(question)
|
| 264 |
-
return str(resp)
|
| 265 |
-
|
| 266 |
-
|
| 267 |
-
def retrieve_chunks(question: str, top_k: int = 5) -> str:
|
| 268 |
-
"""
|
| 269 |
-
仅检索,不生成回答。供 Clare 等外部调用:返回检索到的课程片段,作为 RAG context。
|
| 270 |
-
Gradio api_name="retrieve" 暴露此接口。
|
| 271 |
-
"""
|
| 272 |
-
nodes = _retrieve_nodes(question, top_k=top_k)
|
| 273 |
-
parts = [node.get_content() for node in nodes]
|
| 274 |
-
return "\n\n---\n\n".join(parts) if parts else ""
|
| 275 |
-
|
| 276 |
-
|
| 277 |
-
def status_md() -> str:
|
| 278 |
-
emb_line = f"- **Embedding**: `{EMBEDDING_PROVIDER}` (免费)" if EMBEDDING_PROVIDER == "huggingface" else f"- **Embedding**: OpenAI (付费)"
|
| 279 |
-
idx_src = f"- **索引来源**: 预构建 `{INDEX_DATASET_ID}`" if INDEX_DATASET_ID else "- **索引来源**: 运行时构建"
|
| 280 |
-
if INDEX is not None:
|
| 281 |
-
return (
|
| 282 |
-
"✅ **Index ready**\n\n"
|
| 283 |
-
f"- **Dataset**: `{DATASET_ID}`\n"
|
| 284 |
-
f"- **Subdir**: `{DATASET_SUBDIR}`\n"
|
| 285 |
-
f"{idx_src}\n"
|
| 286 |
-
f"{emb_line}\n"
|
| 287 |
-
f"- **Index dir**: `{str(PERSIST_DIR)}`\n"
|
| 288 |
)
|
| 289 |
-
|
| 290 |
-
|
| 291 |
-
|
| 292 |
-
|
| 293 |
-
|
| 294 |
-
|
| 295 |
-
|
| 296 |
-
f"Error: `{INDEX_ERR or 'unknown'}`"
|
| 297 |
)
|
| 298 |
-
|
| 299 |
-
|
| 300 |
-
|
| 301 |
-
|
| 302 |
-
|
| 303 |
-
|
|
|
|
| 304 |
)
|
| 305 |
-
|
| 306 |
-
|
| 307 |
-
|
| 308 |
-
|
| 309 |
-
|
| 310 |
-
rebuild = gr.Checkbox(label="强制重建索引(慢,会重新做 Embedding)", value=False)
|
| 311 |
-
|
| 312 |
-
out = gr.Markdown(label="回答")
|
| 313 |
-
btn = gr.Button("提问")
|
| 314 |
-
btn.click(fn=ask, inputs=[question, rebuild], outputs=out).then(fn=status_md, inputs=None, outputs=status)
|
| 315 |
-
|
| 316 |
-
# Clare 调用:仅检索,不生成回答。gradio_client 用 api_name="retrieve" 调用
|
| 317 |
-
with gr.Accordion("API(Clare 等外部调用)", open=False):
|
| 318 |
-
api_question = gr.Textbox(label="检索问题", placeholder="输入问题,返回检索到的课程片段")
|
| 319 |
-
api_out = gr.Textbox(label="检索结果(原始文本)", lines=8)
|
| 320 |
-
api_btn = gr.Button("Retrieve")
|
| 321 |
-
api_btn.click(
|
| 322 |
-
fn=retrieve_chunks,
|
| 323 |
-
inputs=[api_question],
|
| 324 |
-
outputs=api_out,
|
| 325 |
-
api_name="retrieve",
|
| 326 |
)
|
| 327 |
|
| 328 |
|
| 329 |
if __name__ == "__main__":
|
| 330 |
-
|
| 331 |
-
|
|
|
|
+"""
+Hugging Face Space app: build the Weaviate index on an HF Space.
+Uses the OpenAI API for embedding and uploads directly to Weaviate Cloud.
+"""
 import os
 import gradio as gr
+from pathlib import Path
+import threading
+import time

+# Read configuration from environment variables (HF Space Secrets)
+OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "").strip()
+WEAVIATE_URL = os.getenv("WEAVIATE_URL", "").strip()
+WEAVIATE_API_KEY = os.getenv("WEAVIATE_API_KEY", "").strip()
+WEAVIATE_COLLECTION = os.getenv("WEAVIATE_COLLECTION", "GenAICourses").strip()
+EMBEDDING_PROVIDER = os.getenv("EMBEDDING_PROVIDER", "openai").strip().lower()

+# Course document path (the folder must be uploaded to the HF Space)
+SCRIPT_DIR = Path(__file__).resolve().parent
+COURSES_DIR = SCRIPT_DIR / "GENAI COURSES"

+# Global state
+build_status = {"running": False, "progress": "", "error": None, "result": None}


+def build_index_worker(clear_first: bool, progress_callback=None):
+    """Background worker thread: build the index"""
+    global build_status
+
+    try:
+        build_status["running"] = True
+        build_status["error"] = None
+        build_status["progress"] = "开始构建索引..."
+
+        # Check configuration
+        if not OPENAI_API_KEY:
+            raise RuntimeError("请在 HF Space Settings → Secrets 中添加 OPENAI_API_KEY")
+        if not WEAVIATE_URL or not WEAVIATE_API_KEY:
+            raise RuntimeError("请在 HF Space Settings → Secrets 中添加 WEAVIATE_URL 和 WEAVIATE_API_KEY")
+
+        # Check the course directory
+        if not COURSES_DIR.exists():
+            raise FileNotFoundError(
+                f"课程目录不存在:{COURSES_DIR}\n"
+                "请将 GENAI COURSES 文件夹上传到 Space 的根目录"
+            )
+
+        # Import dependencies
+        build_status["progress"] = "加载依赖库..."
+        from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
+        from llama_index.core import StorageContext
+        from llama_index.vector_stores.weaviate import WeaviateVectorStore
+        import weaviate
+        from weaviate.classes.init import Auth
+
+        # Configure the embedding model
+        build_status["progress"] = "配置 embedding 模型..."
+        if EMBEDDING_PROVIDER == "openai":
+            from llama_index.embeddings.openai import OpenAIEmbedding
+            Settings.embed_model = OpenAIEmbedding(
+                model="text-embedding-3-small",
+                api_key=OPENAI_API_KEY,
+            )
+        else:
             from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+            Settings.embed_model = HuggingFaceEmbedding(
+                model_name="sentence-transformers/all-MiniLM-L6-v2"
             )
+
+        # Connect to Weaviate
+        build_status["progress"] = "连接 Weaviate Cloud..."
+        url = WEAVIATE_URL
+        if not url.startswith("http"):
+            url = "https://" + url
+
+        client = weaviate.connect_to_weaviate_cloud(
+            cluster_url=url,
+            auth_credentials=Auth.api_key(WEAVIATE_API_KEY),
         )
+
+        if not client.is_ready():
+            raise RuntimeError("Weaviate 连接失败")
+
         try:
+            # Clear the old collection (if requested)
+            if clear_first:
+                build_status["progress"] = f"删除旧 collection: {WEAVIATE_COLLECTION}..."
+                try:
+                    if hasattr(client.collections, "delete"):
+                        client.collections.delete(WEAVIATE_COLLECTION)
+                        build_status["progress"] = "旧 collection 已删除"
+                except Exception as e:
+                    if "404" not in str(e) and "not found" not in str(e).lower():
+                        build_status["progress"] = f"删除旧 collection 时警告: {e}"
+
+            # Read the documents
+            build_status["progress"] = f"读取课程目录: {COURSES_DIR}..."
+            reader = SimpleDirectoryReader(
+                input_dir=str(COURSES_DIR),
+                recursive=True,
+                required_exts=[".md", ".pdf", ".txt", ".py", ".ipynb", ".docx"],
             )
+            documents = reader.load_data()
+            build_status["progress"] = f"已加载 {len(documents)} 个文档块"
+
+            # Create the vector store
+            build_status["progress"] = "创建 Weaviate vector store..."
+            vector_store = WeaviateVectorStore(
+                weaviate_client=client,
+                index_name=WEAVIATE_COLLECTION,
+            )
+            storage_context = StorageContext.from_defaults(vector_store=vector_store)
+
+            # Build the index (this embeds and uploads automatically)
+            build_status["progress"] = f"正在 embedding 并上传到 Weaviate (collection={WEAVIATE_COLLECTION})...\n这可能需要几分钟时间,请耐心等待..."
+            index = VectorStoreIndex.from_documents(
+                documents,
+                storage_context=storage_context,
+            )
+
+            # Wait for the batch submission to finish
+            time.sleep(3)
+
+            # Verify
+            build_status["progress"] = "验证索引..."
+            coll = client.collections.get(WEAVIATE_COLLECTION)
+            agg = coll.aggregate.over_all(total_count=True)
+            n = agg.total_count
+
+            build_status["result"] = f"✅ 索引构建成功!\n当前 object count = {n}"
+            build_status["progress"] = build_status["result"]
+
+        finally:
+            client.close()
+
     except Exception as e:
+        build_status["error"] = str(e)
+        build_status["progress"] = f"❌ 错误: {str(e)}"
+    finally:
+        build_status["running"] = False
+
+
+def start_build(clear_first: bool):
+    """Start the index build"""
+    global build_status
+
+    if build_status["running"]:
+        return "⚠️ 索引构建正在进行中,请等待完成..."
+
+    # Reset the state
+    build_status = {"running": False, "progress": "", "error": None, "result": None}
+
+    # Start the background thread
+    thread = threading.Thread(
+        target=build_index_worker,
+        args=(clear_first,),
+        daemon=True
     )
+    thread.start()
+
+    return "🚀 索引构建已启动,请查看下方进度..."
+
+
+def get_progress():
+    """Return the current progress"""
+    if build_status["running"]:
+        return build_status["progress"] or "处理中..."
+    elif build_status["error"]:
+        return f"❌ 错误: {build_status['error']}"
+    elif build_status["result"]:
+        return build_status["result"]
+    else:
+        return "等待开始..."
+
+
+# Gradio UI
+with gr.Blocks(title="Weaviate 索引构建工具") as app:
+    gr.Markdown("""
+    # 🔍 Weaviate 索引构建工具
+
+    在 Hugging Face Space 上使用 OpenAI API 进行 embedding,并直接上传到 Weaviate Cloud。
+
+    ## 配置要求
+
+    请在 **Settings → Secrets** 中添加以下环境变量:
+    - `OPENAI_API_KEY`: OpenAI API Key(用于 embedding)
+    - `WEAVIATE_URL`: Weaviate Cloud REST 地址
+    - `WEAVIATE_API_KEY`: Weaviate API Key
+    - `WEAVIATE_COLLECTION`: Collection 名称(默认: GenAICourses)
+    - `EMBEDDING_PROVIDER`: openai 或 huggingface(默认: openai)
+
+    ## 使用步骤
+
+    1. 确保已将 `GENAI COURSES` 文件夹上传到 Space 根目录
+    2. 点击下方按钮开始构建索引
+    3. 等待构建完成(可能需要几分钟)
+    """)
+
+    with gr.Row():
+        clear_first = gr.Checkbox(
+            label="清空旧索引后重建",
+            value=True,
+            info="如果勾选,会先删除旧的 collection 再重建"
         )
+        build_btn = gr.Button("🚀 开始构建索引", variant="primary", size="lg")
+
+    progress_output = gr.Textbox(
+        label="构建进度",
+        lines=10,
+        interactive=False,
+        value="等待开始..."
     )
+
+    # Refresh the progress automatically
+    app.load(
+        fn=get_progress,
+        inputs=[],
+        outputs=progress_output,
+        every=2,  # refresh every 2 seconds
     )
+
+    build_btn.click(
+        fn=start_build,
+        inputs=[clear_first],
+        outputs=progress_output,
     )


 if __name__ == "__main__":
+    app.launch()
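In the new `app.py`, `build_status` is a plain dict shared between the Gradio handlers and the worker thread, and `start_build` checks `running` before the worker sets it. A minimal sketch of one way to guard that shared state with a lock follows; the `BuildStatus` class and its method names are hypothetical and are not part of this commit.

```python
import threading
from typing import Optional


class BuildStatus:
    """Hypothetical lock-protected replacement for the build_status dict."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._running = False
        self._progress = ""
        self._error: Optional[str] = None

    def try_start(self) -> bool:
        """Mark a build as running; return False if one is already active."""
        with self._lock:
            if self._running:
                return False
            self._running = True
            self._progress = ""
            self._error = None
            return True

    def update(self, message: str) -> None:
        """Record the latest progress message."""
        with self._lock:
            self._progress = message

    def finish(self, error: Optional[str] = None) -> None:
        """Mark the build as finished, optionally recording an error."""
        with self._lock:
            self._running = False
            self._error = error

    def snapshot(self) -> dict:
        """Return a consistent copy of the current state for the UI."""
        with self._lock:
            return {
                "running": self._running,
                "progress": self._progress,
                "error": self._error,
            }
```

Because `try_start` checks and sets the flag under one lock, two quick clicks cannot both start a worker, which the check-then-start sequence in `start_build` does not rule out.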
requirements.txt
CHANGED
|
@@ -1,7 +1,6 @@
 gradio>=5.0.0
 python-dotenv>=1.0.0
 openai>=1.44.0
-huggingface_hub>=0.23.0

 llama-index-core>=0.10.0
 llama-index-embeddings-openai>=0.1.0
@@ -9,7 +8,11 @@ llama-index-embeddings-openai>=0.1.0
 llama-index-embeddings-huggingface>=0.1.0
 sentence-transformers>=2.2.0

-#
+# Weaviate Cloud vector store
+llama-index-vector-stores-weaviate>=0.2.0
+weaviate-client>=4.0.0
+
+# Readers for common course files (only needed by build_weaviate_index.py)
 pypdf
 python-docx
 nbformat