Commit fb6c7c6
Parent(s): 02cf9c7
Switch to OpenAI embedding (text-embedding-3-small) for Weaviate retrieval
- OPENAI_EMBEDDING_SETUP.md +77 -0
- app.py +8 -4
- server.py +9 -2
OPENAI_EMBEDDING_SETUP.md
ADDED
@@ -0,0 +1,77 @@
# OpenAI Embedding Setup

ClareVoice now uses **OpenAI `text-embedding-3-small`** for embedding and retrieval against the Weaviate vector database.

## 1. Configure the API key in the Hugging Face Space

### Steps:
1. Open your ClareVoice Space: https://huggingface.co/spaces/claudqunwang/ClareVoice
2. Click **Settings** in the top-right corner.
3. Find **Secrets** in the left-hand menu.
4. Add or update the following Secret:
   - **Key**: `OPENAI_API_KEY`
   - **Value**: `your-openai-api-key`
5. Click **Save**.

### Important:
- The Space restarts automatically after you save.
- If the Space is already running, the new API key takes effect only after the restart.
- Make sure `WEAVIATE_URL` and `WEAVIATE_API_KEY` are also set correctly in Secrets.

## 2. Local development setup

If you run ClareVoice locally, create a `.env` file in the project root:

```bash
OPENAI_API_KEY=your-openai-api-key
WEAVIATE_URL=https://your-weaviate-cluster-url
WEAVIATE_API_KEY=your-weaviate-api-key
WEAVIATE_COLLECTION=GenAICourses
```

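The variables above can be validated once at startup so that a missing value fails early with a clear message. The following is a minimal sketch, not the project's actual `config` module: the function name `load_settings` and the `REQUIRED_VARS` tuple are hypothetical, and falling back to `GenAICourses` for the collection name is an assumption taken from the example `.env` file.

```python
import os

# Variables this guide lists as required (assumption: all three must be non-empty).
REQUIRED_VARS = ("OPENAI_API_KEY", "WEAVIATE_URL", "WEAVIATE_API_KEY")


def load_settings(env=None):
    """Return a dict of settings, raising RuntimeError if a required one is missing.

    `env` defaults to os.environ; passing a plain dict makes the function easy to test.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED_VARS}
    # Hypothetical default, mirroring the value shown in the example .env file.
    settings["WEAVIATE_COLLECTION"] = env.get("WEAVIATE_COLLECTION", "GenAICourses")
    return settings
```

Note that this sketch only reads variables that are already in the process environment; loading the `.env` file itself is left to whatever mechanism the project uses.
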
## 3. Rebuild the Weaviate index (if needed)

To rebuild the Weaviate vector index so that it uses OpenAI embeddings:

1. Change into the `hf_space/GenAICoursesDB_space/` directory.
2. Set the environment variables:
   ```bash
   export OPENAI_API_KEY=your-openai-api-key
   export EMBEDDING_PROVIDER=openai
   export WEAVIATE_URL=your-weaviate-url
   export WEAVIATE_API_KEY=your-weaviate-key
   ```
3. Run the index build script:
   ```bash
   python build_weaviate_index.py
   ```

## 4. Technical details

### Embedding model
- **Model**: `text-embedding-3-small`
- **Dimensions**: 1536
- **Provider**: OpenAI API

### Code locations
- `app.py`: the `_get_weaviate_embed_model()` function
- `server.py`: the `_get_weaviate_embed_model()` function
- Both use the same OpenAI embedding configuration, so query-time embeddings match the ones used when the index was built.

### Advantages
- **Consistency**: retrieval and indexing use the same embedding model.
- **Quality**: OpenAI embeddings typically give better retrieval quality than local HuggingFace models.
- **Performance**: embeddings are computed through API calls, so no model has to be loaded locally.

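To illustrate why the query-time model has to match the index-time model, here is a small sketch of a dimension check and a cosine similarity computation in plain Python. It is an illustration only: the helper names are hypothetical, and it assumes the collection compares vectors with cosine similarity, which this guide does not state. Vectors from a different model may have a different length (or live in an unrelated vector space), so comparing them is either impossible or meaningless.

```python
import math

# Dimension listed above for text-embedding-3-small.
EXPECTED_DIM = 1536


def check_dimension(vector, expected=EXPECTED_DIM):
    """Raise ValueError if the vector does not have the expected length."""
    if len(vector) != expected:
        raise ValueError(f"expected {expected} dimensions, got {len(vector)}")


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```
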
## 5. Verify the configuration

After starting ClareVoice, check the log output:
- You should see `[ClareVoice] Weaviate 直连: 已配置` ("Weaviate direct connection: configured").
- If you see the error `OPENAI_API_KEY is required for Weaviate embedding`, the API key is not configured correctly.

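The log check above could be automated with a few lines of Python. This is a sketch under assumptions: the function name `check_startup_log` and its return values are hypothetical, and it simply searches captured log text for the two strings quoted in this section.

```python
# Both marker strings are quoted from the verification steps in this guide.
OK_MARKER = "[ClareVoice] Weaviate 直连: 已配置"
ERROR_MARKER = "OPENAI_API_KEY is required for Weaviate embedding"


def check_startup_log(log_text):
    """Classify captured startup logs as 'missing_api_key', 'ok', or 'unknown'."""
    # The error takes priority: a configured connection is useless without the key.
    if ERROR_MARKER in log_text:
        return "missing_api_key"
    if OK_MARKER in log_text:
        return "ok"
    return "unknown"
```
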
## Notes

⚠️ **Security**:
- Do not hardcode API keys in code.
- Use Hugging Face Secrets or a `.env` file (already listed in `.gitignore`).
- Rotate API keys regularly to improve security.

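In the same spirit as the security notes above, a key should also stay out of logs and error messages. The helper below is a hypothetical sketch (the name `mask_secret` and the choice to show the last four characters are assumptions, not part of ClareVoice) for printing a redacted form when debugging configuration.

```python
def mask_secret(value, visible=4):
    """Return a redacted form of a secret that is safe to print in logs."""
    if not value:
        return "<unset>"
    if len(value) <= visible:
        # Too short to reveal any part of it safely.
        return "*" * len(value)
    # Fixed-width mask so the output does not leak the secret's length.
    return "*" * 8 + value[-visible:]
```

For example, `mask_secret(os.environ.get("OPENAI_API_KEY"))` can be logged to confirm that a key is present without exposing it.
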
app.py
CHANGED
@@ -115,12 +115,16 @@ def _warmup_weaviate_embed():


 def _get_weaviate_embed_model():
-    """Lazily load and cache the embedding model (same as the one used to build the index)."""
+    """Lazily load and cache the embedding model (OpenAI text-embedding-3-small, same as the one used to build the index)."""
     global _WEAVIATE_EMBED_MODEL
     if _WEAVIATE_EMBED_MODEL is None:
-        from llama_index.embeddings.
-
-
+        from llama_index.embeddings.openai import OpenAIEmbedding
+        from config import OPENAI_API_KEY
+        if not OPENAI_API_KEY:
+            raise RuntimeError("OPENAI_API_KEY is required for Weaviate embedding")
+        _WEAVIATE_EMBED_MODEL = OpenAIEmbedding(
+            model="text-embedding-3-small",
+            api_key=OPENAI_API_KEY,
         )
     return _WEAVIATE_EMBED_MODEL

server.py
CHANGED
@@ -62,10 +62,17 @@ if not preloaded_topics:
 _WEAVIATE_EMBED_MODEL = None

 def _get_weaviate_embed_model():
+    """Use OpenAI text-embedding-3-small (same as the one used to build the index)."""
     global _WEAVIATE_EMBED_MODEL
     if _WEAVIATE_EMBED_MODEL is None:
-        from llama_index.embeddings.
-
+        from llama_index.embeddings.openai import OpenAIEmbedding
+        from config import OPENAI_API_KEY
+        if not OPENAI_API_KEY:
+            raise RuntimeError("OPENAI_API_KEY is required for Weaviate embedding")
+        _WEAVIATE_EMBED_MODEL = OpenAIEmbedding(
+            model="text-embedding-3-small",
+            api_key=OPENAI_API_KEY,
+        )
     return _WEAVIATE_EMBED_MODEL

 def _retrieve_from_weaviate(question: str, top_k: int = 5, timeout_sec: float = 45.0) -> str:
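Both `_get_weaviate_embed_model()` functions follow a lazy-initialization pattern: the model is built on first use, cached in a module-level variable, and the call fails early if no API key is set. The sketch below isolates that pattern so it can be exercised without network access. It is not the code from this commit: `get_embed_model` and the injectable `factory` argument are hypothetical, with `factory` standing in for the `OpenAIEmbedding` constructor.

```python
# Module-level cache, mirroring the role of _WEAVIATE_EMBED_MODEL in the hunks.
_EMBED_MODEL = None


def get_embed_model(factory, api_key):
    """Build the embedding model on first call and return the cached one afterwards.

    `factory` is any callable taking the API key (hypothetical stand-in for the
    real embedding class), which keeps this sketch runnable offline.
    """
    global _EMBED_MODEL
    if _EMBED_MODEL is None:
        if not api_key:
            # Same message as the RuntimeError raised in the hunks.
            raise RuntimeError("OPENAI_API_KEY is required for Weaviate embedding")
        _EMBED_MODEL = factory(api_key)
    return _EMBED_MODEL
```

One consequence of this design is that a missing key surfaces on the first retrieval rather than at import time, and a failed attempt leaves the cache empty so a later call can retry.
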