	# judgment_partition_infer

	Standalone inference bundle for partitioning a Chinese judgment document (or a truncated excerpt) into 7 zones (Z1..Z7) by predicting 6 boundaries.

	## Install

	```bash
	pip install -r requirements.txt
	```

	If you want to download this bundle from Hugging Face Hub:
	```bash
	pip install huggingface_hub
	python - <<'PY'
	from huggingface_hub import snapshot_download
	snapshot_download(
	    repo_id="<USER_OR_ORG>/<REPO_NAME>",
	    repo_type="model",
	    local_dir="judgment_partition_infer_bundle",
	    local_dir_use_symlinks=False,
	)
	print("Downloaded -> judgment_partition_infer_bundle")
	PY
	```

	## Input format (JSONL)

	One JSON object per line. Required field: `text` (or `full_text`).

	Example:
	```json
	{"sample_id":"demo_1","text":"...全文..."}
	```

	Optional fields `case_no` and `case_name` are passed through to outputs.
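As a minimal sketch of preparing and checking an input file in this format, the helpers below write records as JSONL and verify that each one carries `text` or `full_text`. The function names are hypothetical and not part of this bundle, and preferring `text` when both fields are present is an assumption.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line; ensure_ascii=False keeps Chinese text readable."""
    with Path(path).open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_text_field(record):
    """Return the document text from `text`, falling back to `full_text`.

    Preferring `text` over `full_text` is an assumption of this sketch.
    """
    text = record.get("text") or record.get("full_text")
    if not isinstance(text, str) or not text:
        raise ValueError("record needs a non-empty 'text' or 'full_text' field")
    return text


def load_jsonl(path):
    """Read a JSONL file, skipping blank lines and validating each record."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                rec = json.loads(line)
                read_text_field(rec)  # raises if the required field is missing
                records.append(rec)
    return records
```

Running `load_jsonl` on a file before passing it to `infer_cli.py` can catch malformed lines early.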

	## Run inference (CLI)

	From this folder:
	```bash
	python infer_cli.py --input examples/input.jsonl
	```

	Outputs are written under `output/<YYYYMMDD_HHMMSS>/` by default:
	- `predictions.jsonl`
	- `run_meta.json`
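Because each default run lands in its own timestamped directory, downstream scripts usually need to locate the newest one. The following is a sketch under that assumption only; `latest_run_dir` and `load_predictions` are hypothetical helpers, and no field names inside `predictions.jsonl` are assumed.

```python
import json
import re
from pathlib import Path

# Matches directory names of the form YYYYMMDD_HHMMSS.
RUN_DIR_PATTERN = re.compile(r"^\d{8}_\d{6}$")


def latest_run_dir(output_root="output"):
    """Return the newest timestamped run directory under output_root, or None."""
    root = Path(output_root)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir() and RUN_DIR_PATTERN.match(p.name)]
    # Zero-padded timestamps sort chronologically as plain strings.
    return max(runs, key=lambda p: p.name, default=None)


def load_predictions(run_dir):
    """Load every non-blank line of predictions.jsonl as a dict."""
    with (Path(run_dir) / "predictions.jsonl").open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```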

	You can also write to an explicit path:
	```bash
	python infer_cli.py \
	  --input examples/input.jsonl \
	  --output examples/output.example.jsonl \
	  --device cpu
	```

	## Anchor behavior

	Default: `--anchor auto`
	- If anchors are detected, enforce:
	  - `boundary[0]` = Z1 anchor (the last "号" within the first 100 characters)
	  - `boundary[3]` = Z4 anchor ("判决如下" or "如下判决")
	- If anchors are missing or invalid, keep the model's boundaries and set `anchor_status` accordingly.
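The anchor rules above can be illustrated with a small sketch. It is not the bundle's implementation: it assumes boundaries are character offsets, that the earliest Z4 phrase match is used, that both anchors must be present for any override, and that the status labels (`"applied"`, `"missing"`, `"invalid"`) are placeholders for whatever `anchor_status` actually contains.

```python
Z1_WINDOW = 100                       # search window for the Z1 anchor
Z4_PHRASES = ("判决如下", "如下判决")  # phrases that mark the Z4 anchor


def find_z1_anchor(text):
    """Index of the last "号" within the first 100 characters, or None."""
    idx = text[:Z1_WINDOW].rfind("号")
    return idx if idx >= 0 else None


def find_z4_anchor(text):
    """Index of the earliest Z4 phrase match, or None (earliest is an assumption)."""
    hits = [i for i in (text.find(p) for p in Z4_PHRASES) if i >= 0]
    return min(hits) if hits else None


def apply_anchors(text, boundaries):
    """Return (boundaries, status), overriding boundary[0] and boundary[3] when possible.

    The status strings are hypothetical labels used only in this sketch.
    """
    z1, z4 = find_z1_anchor(text), find_z4_anchor(text)
    if z1 is None or z4 is None:
        return list(boundaries), "missing"
    patched = list(boundaries)
    patched[0], patched[3] = z1, z4
    # Reject overrides that would break the left-to-right order of boundaries.
    if patched != sorted(patched):
        return list(boundaries), "invalid"
    return patched, "applied"
```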

	## Python API

	```python
	from judgment_partition_infer import Predictor

	pred = Predictor() # loads ./assets/best_model.pt + ./assets/vocab.json (or env override)
	out = pred.predict_text("...全文/片段...")
	print(out["boundaries"])
	```
	Note: to stay within the Hub's file-size and binary-file limits, the model may be stored as text shards named
	`assets/best_model.pt.b64.part-*`.
	`Predictor()` reassembles, decodes, and loads them automatically; no manual step is needed.
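If you ever need the raw `best_model.pt` file outside of `Predictor()`, the shard handling can be approximated as follows. This is a sketch under assumptions, not the bundle's loader: it assumes the shards are pieces of a single base64 stream and that their suffixes sort lexically in the right order (for example, zero-padded numbers). `reassemble_shards` is a hypothetical name.

```python
import base64
from pathlib import Path


def reassemble_shards(assets_dir, out_path, stem="best_model.pt"):
    """Concatenate `<stem>.b64.part-*` text shards and decode them into one binary file."""
    shards = sorted(Path(assets_dir).glob(f"{stem}.b64.part-*"))
    if not shards:
        raise FileNotFoundError(f"no {stem}.b64.part-* shards found in {assets_dir}")
    # Strip trailing newlines so the shards join into one continuous base64 string.
    encoded = "".join(p.read_text(encoding="ascii").strip() for p in shards)
    out = Path(out_path)
    out.write_bytes(base64.b64decode(encoded))
    return out
```

The decoded file can then be passed to `Predictor(model_path=...)`.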

	If you `pip install` only the code (without assets), pass explicit paths:
	```python
	from judgment_partition_infer import Predictor
	pred = Predictor(
	    model_path="path/to/best_model.pt",
	    vocab_path="path/to/vocab.json",
	    device="cpu",
	)
	```
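Once `predict_text` has returned its 6 boundaries, the 7 zones follow by slicing the text between consecutive cut points. The sketch below assumes the boundaries are sorted character offsets into the input string; if the bundle reports them in another unit (for example, sentence indices), the slicing would need to be adapted. `split_zones` is a hypothetical helper.

```python
def split_zones(text, boundaries):
    """Split text into zones Z1..Z7 using 6 sorted character offsets (an assumption)."""
    if len(boundaries) != 6:
        raise ValueError("expected exactly 6 boundaries")
    if list(boundaries) != sorted(boundaries):
        raise ValueError("boundaries must be in non-decreasing order")
    if boundaries[0] < 0 or boundaries[-1] > len(text):
        raise ValueError("boundaries must lie within the text")
    # Add the start and end of the text so 6 cuts yield 7 slices.
    cuts = [0, *boundaries, len(text)]
    return {f"Z{i + 1}": text[cuts[i]:cuts[i + 1]] for i in range(7)}
```

A zone may be empty when two adjacent boundaries coincide, which can happen for truncated excerpts.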

	## Publish to Hugging Face Hub (maintainers)

	1) Install publishing dependency:
	```bash
	pip install -r requirements-publish.txt
	```

	2) Log in (recommended) or set a token through an environment variable:
	- `huggingface-cli login`
	- or `export HF_TOKEN=...` / `export HUGGINGFACE_HUB_TOKEN=...`

	3) Create + upload (model repo):
	```bash
	python publish_to_hf.py --repo-id <USER_OR_ORG>/<REPO_NAME> --public
	```
	If you already created the repo on the website:
	```bash
	python publish_to_hf.py --repo-id <USER_OR_ORG>/<REPO_NAME> --skip-create
	```

	### If HTTPS to huggingface.co is blocked

	You can push via SSH (host `hf.co`) instead:
	```bash
	chmod +x push_to_hf_ssh.sh
	./push_to_hf_ssh.sh <USER_OR_ORG>/<REPO_NAME>
	```
	(The script prints an SSH public key; add it at https://huggingface.co/settings/keys, then rerun.)

	# judgment_partition_infer

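	A standalone inference toolkit that predicts 6 boundary positions to split the full text of a Chinese judgment document (or a truncated excerpt) into 7 fixed structural zones (Z1 to Z7).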

	## Install dependencies

	```bash
	pip install -r requirements.txt
	```

	To download this inference bundle (weights and vocabulary included) from the Hugging Face Hub:

	```bash
	pip install huggingface_hub
	python - <<'PY'
	from huggingface_hub import snapshot_download
	snapshot_download(
	    repo_id="<USER_OR_ORG>/<REPO_NAME>",
	    repo_type="model",
	    local_dir="judgment_partition_infer_bundle",
	    local_dir_use_symlinks=False,
	)
	print("Downloaded -> judgment_partition_infer_bundle")
	PY
	```

	## Input format (JSONL)

	The input file must be JSONL (one standalone JSON object per line).
	Required field: `text` (the `full_text` field is also accepted).

	Example record:

	```json
	{"sample_id":"demo_1","text":"...全文..."}
	```

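	> Note: optional metadata fields such as `case_no` (case number) and `case_name` (case name) are not modified during processing and are passed through to the output unchanged.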

	## Run inference (command line)

	Run the following command from the root of this bundle:

	```bash
	python infer_cli.py --input examples/input.jsonl
	```

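	By default, results are saved to an automatically generated timestamped directory, `output/<YYYYMMDD_HHMMSS>/`, which contains two files:

	* `predictions.jsonl`: the final predictions, including boundary coordinates and the text of each zone.
	* `run_meta.json`: run metadata and statistics for this inference job.

	You can also write directly to a fixed file path (convenient for integrating with other systems or producing examples):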

	```bash
	python infer_cli.py \
	  --input examples/input.jsonl \
	  --output examples/output.example.jsonl \
	  --device cpu
	```

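	## Anchor behavior

	The automatic anchor strategy is enabled by default: `--anchor auto`

	* When anchors are detected, the following constraints are enforced:
	  * `boundary[0]` (the 1st boundary) is aligned to the Z1 anchor (the last "号" within the first 100 characters of the text).
	  * `boundary[3]` (the 4th boundary) is aligned to the Z4 anchor (a match of "判决如下" or "如下判决").
	* When anchors are missing or invalid:
	  * The model's original predicted sentence boundaries are kept, and the `anchor_status` field in the output is updated accordingly (indicating that the anchor is missing).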

	## Python API (embedding in code)

	To call the model directly from your own Python code, use the following interface:

	```python
	from judgment_partition_infer import Predictor

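	# Initialize the predictor (loads ./assets/best_model.pt and ./assets/vocab.json by default, or an environment-variable override)
	pred = Predictor()

	# Run inference on a full document or an excerpt
	out = pred.predict_text("...全文/片段...")

	# Print the 6 predicted boundary positions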
	print(out["boundaries"])
	```
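	Note: to accommodate the Hub's file-size and binary-file restrictions, the model may be stored as text shards named
	`assets/best_model.pt.b64.part-*`;
	`Predictor()` concatenates, decodes, and loads them automatically, so no manual handling is needed.

	If you installed only the code (without downloading the assets locally), pass the paths explicitly: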

	```python
	from judgment_partition_infer import Predictor
	pred = Predictor(
	    model_path="path/to/best_model.pt",
	    vocab_path="path/to/vocab.json",
	    device="cpu",
	)
	```

	## Publish to Hugging Face Hub (maintainers)

	1) Install the publishing dependency:
	```bash
	pip install -r requirements-publish.txt
	```

	2) Log in (recommended) or provide a token through an environment variable:
	- `huggingface-cli login`
	- or `export HF_TOKEN=...` / `export HUGGINGFACE_HUB_TOKEN=...`

	3) Create and upload (model repo):
	```bash
	python publish_to_hf.py --repo-id <USER_OR_ORG>/<REPO_NAME> --public
	```
	If you have already created the repo on the website:
	```bash
	python publish_to_hf.py --repo-id <USER_OR_ORG>/<REPO_NAME> --skip-create
	```

	### If your network cannot reach huggingface.co

	You can push over SSH (host `hf.co`) instead:
	```bash
	chmod +x push_to_hf_ssh.sh
	./push_to_hf_ssh.sh <USER_OR_ORG>/<REPO_NAME>
	```
	(The script prints an SSH public key; copy it to https://huggingface.co/settings/keys, then run the script again.)