	# Fixing the onnxruntime-node Installation Failure

	## Problem

	The postinstall script of `onnxruntime-node@1.23.2` does not handle HTTP 302 redirects when it downloads its binaries, so the installation fails:

	```
	Error: Failed to download build list. HTTP status code = 302
	```

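	## Root Cause

	The built-in Node.js `https.get()` does not follow HTTP redirects automatically. The `downloadFile()` and `downloadJson()` functions in `install-utils.js` of `onnxruntime-node` have no redirect handling, so a 302 response is treated as a failure.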

	## Solutions

	### Option A: Apply the Patch (Recommended)

	A patch file has been created that fixes the redirect handling:

	```bash
	cd ~/work/trigoRL/third_party/trigo/trigo-web/node_modules/onnxruntime-node/script
	patch -p0 < ~/work/trigoRL/third_party/trigo/trigo-web/onnxruntime-node-redirect-fix.patch
	```

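	The patch makes the following changes:
	1. Adds a `maxRedirects` parameter (default 5) to `downloadFile()` and `downloadJson()`
	2. Detects HTTP status codes 300-399 and reads the `Location` header
	3. Calls the function recursively to follow the redirect until a 200 status code is returned or the redirect limit is reached

	### Option B: Skip the Install Script (Only When ONNX Inference Is Not Needed)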

	```bash
	yarn install --ignore-scripts
	```

	Limitation: the CUDA and TensorRT binaries of onnxruntime-node are not downloaded, so GPU inference in Node.js is unavailable.

	## Verifying the Fix

	### 1. Test the Install Script

	```bash
	cd ~/work/trigoRL/third_party/trigo/trigo-web/node_modules/onnxruntime-node
	node ./script/install
	```

	Expected output:
	```
	Following redirect to https://nuget.azure.cn/v3/index.json
	Following redirect to https://nuget.azure.cn/v3-flatcontainer/...
	Downloading https://api.nuget.org/v3-flatcontainer/...
	Extracting runtimes/linux-x64/native/libonnxruntime_providers_cuda.so to ...
	Extracting runtimes/linux-x64/native/libonnxruntime_providers_shared.so to ...
	Extracting runtimes/linux-x64/native/libonnxruntime_providers_tensorrt.so to ...
	```

	### 2. Verify That the Module Loads

	```bash
	cd ~/work/trigoRL/third_party/trigo/trigo-web
	node -e "const ort = require('onnxruntime-node'); console.log('Supported backends:', ort.listSupportedBackends());"
	```

	Expected output:
	```javascript
	Supported backends: [
	{ name: 'cpu', bundled: true },
	{ name: 'webgpu', bundled: true },
	{ name: 'cuda', bundled: false },
	{ name: 'tensorrt', bundled: false }
	]
	```

	### 3. Check the Binaries

	```bash
	ls -lh ~/work/trigoRL/third_party/trigo/trigo-web/node_modules/onnxruntime-node/bin/napi-v6/linux/x64/*.so
	```

	Expected output:
	```
	-rw-rw-rw- 1 camus camus 352M libonnxruntime_providers_cuda.so
	-rw-rw-rw- 1 camus camus 15K libonnxruntime_providers_shared.so
	-rw-rw-rw- 1 camus camus 811K libonnxruntime_providers_tensorrt.so
	```

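	## Technical Details

	### Modified Functions

	#### downloadFile()
	Recursive redirect handling was added. The excerpt below is simplified: the surrounding `Promise` and `https.get()` call are shown for context, and the unchanged download logic is elided:
	```javascript
	const https = require('https');

	async function downloadFile(url, dest, maxRedirects = 5) {
	  return new Promise((resolve, reject) => {
	    // Give up once the redirect budget is exhausted
	    if (maxRedirects < 0) {
	      reject(new Error('Too many redirects'));
	      return;
	    }

	    https.get(url, (res) => {
	      // Follow any 3xx response that carries a Location header
	      if (res.statusCode >= 300 && res.statusCode < 400 && res.headers.location) {
	        res.resume(); // discard the body of the redirect response
	        // Resolve relative Location values against the current URL
	        const redirectUrl = new URL(res.headers.location, url).toString();
	        downloadFile(redirectUrl, dest, maxRedirects - 1).then(resolve).catch(reject);
	        return;
	      }
	      // ...
	    }).on('error', reject);
	  });
	}
	```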

	#### downloadJson()
	The same redirect handling is applied to JSON downloads.
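
For illustration, a redirect-following JSON download could look like the sketch below. It is not the code from the patch: the error messages, the response-body handling, and the choice of `http` or `https` by URL protocol (added so that the sketch can be tried against a local test server) are all assumptions.

```javascript
const http = require('http');
const https = require('https');

// Sketch only: fetch `url`, follow up to `maxRedirects` redirects, parse the body as JSON.
function downloadJson(url, maxRedirects = 5) {
  return new Promise((resolve, reject) => {
    if (maxRedirects < 0) {
      reject(new Error('Too many redirects'));
      return;
    }
    // Assumption for this sketch: pick the client that matches the URL protocol.
    const client = new URL(url).protocol === 'http:' ? http : https;
    client.get(url, (res) => {
      const { statusCode, headers } = res;
      if (statusCode >= 300 && statusCode < 400 && headers.location) {
        res.resume(); // discard the body of the redirect response
        // Location may be relative, so resolve it against the current URL
        const redirectUrl = new URL(headers.location, url).toString();
        downloadJson(redirectUrl, maxRedirects - 1).then(resolve, reject);
        return;
      }
      if (statusCode !== 200) {
        res.resume();
        reject(new Error(`Failed to download JSON. HTTP status code = ${statusCode}`));
        return;
      }
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('error', reject);
      res.on('end', () => {
        try {
          resolve(JSON.parse(Buffer.concat(chunks).toString('utf8')));
        } catch (err) {
          reject(err); // the body was not valid JSON
        }
      });
    }).on('error', reject);
  });
}
```

With the default of 5, at most five redirects are followed. A sixth redirect response rejects with `Too many redirects`, which also stops redirect loops.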

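	### Why This Fix Is Needed

	1. NuGet CDN redirects: the onnxruntime binaries are hosted as NuGet packages, and some CDN nodes respond with a 302 redirect
	2. Proxy environments: redirects are more common on networks that go through a proxy
	3. Geographic routing: NuGet redirects requests to the nearest mirror server based on location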

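	## Future Maintenance

	When `onnxruntime-node` is updated to a new version:

	1. Check whether the issue is already fixed: the new version may include redirect handling
	2. Reapply the patch: if it is not fixed, the patch must be reapplied to the new version's `install-utils.js`
	3. Submit an upstream PR: a pull request to the onnxruntime project would fix the problem permanently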

	## Related Files

	- Patch file: `onnxruntime-node-redirect-fix.patch`
	- Original script: `node_modules/onnxruntime-node/script/install-utils.js`
	- Backup file: `node_modules/onnxruntime-node/script/install-utils.js.backup`

	## Environment

	- Fix date: 2025-11-21
	- onnxruntime-node version: 1.23.2
	- Node.js version: v24.11.1
	- Operating system: Ubuntu 22.04 (LXD container)
	- GPU: 8x NVIDIA H20

	## Status

	✅ Fully resolved: the CUDA and TensorRT backends of onnxruntime-node are installed and available for ONNX model inference in Node.js.