# judgment_partition_infer

Standalone inference bundle for partitioning a Chinese judgment document (or a truncated excerpt) into 7 fixed structural zones (Z1..Z7) by predicting 6 boundaries.

## Install

```bash
pip install -r requirements.txt
```

If you want to download this bundle (including the weights and vocabulary) from Hugging Face Hub:

```bash
pip install huggingface_hub
python - <<'PY'
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="/",
    repo_type="model",
    local_dir="judgment_partition_infer_bundle",
    local_dir_use_symlinks=False,
)
print("Downloaded -> judgment_partition_infer_bundle")
PY
```

## Input format (JSONL)

One JSON object per line. Required field: `text` (`full_text` is also accepted).

Example:

```json
{"sample_id":"demo_1","text":"...全文..."}
```

Optional metadata fields such as `case_no` (case number) and `case_name` (case name) are passed through to the outputs unchanged.

## Run inference (CLI)

From this folder:

```bash
python infer_cli.py --input examples/input.jsonl
```

By default, outputs are written under an automatically generated, timestamp-named `output//` directory:

- `predictions.jsonl`: the final predictions, including boundary positions and the text of each zone
- `run_meta.json`: run metadata and statistics

You can also write to an explicit path, which is convenient for integration with other systems or for examples:

```bash
python infer_cli.py \
  --input examples/input.jsonl \
  --output examples/output.example.jsonl \
  --device cpu
```

## Anchor behavior

Default: `--anchor auto`

- If anchors are detected, enforce:
  - `boundary[0]` (the first boundary) = Z1 anchor (the last "号" within the first 100 characters)
  - `boundary[3]` (the fourth boundary) = Z4 anchor ("判决如下" or "如下判决")
- If anchors are missing or invalid, keep the boundaries predicted by the model and set `anchor_status` accordingly.

## Python API

```python
from judgment_partition_infer import Predictor

# Loads ./assets/best_model.pt and ./assets/vocab.json (or an env override)
pred = Predictor()

# Pass the full document text or an excerpt
out = pred.predict_text("...全文/片段...")

# The 6 predicted boundary positions
print(out["boundaries"])
```

Note: to stay within the Hub's file-size and binary-file restrictions, the model may be stored as `assets/best_model.pt.b64.part-*` (text shards). `Predictor()` automatically reassembles, decodes, and loads the model, so no manual handling is needed.
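Manual reassembly is normally unnecessary, but if you need a plain `best_model.pt` file, the step can be sketched as below. This is a minimal sketch and not the bundle's actual loader: the helper name `reassemble_b64_shards` is hypothetical, and it assumes the shards are consecutive pieces of a single base64 stream whose file names sort in split order.

```python
import base64
from pathlib import Path


def reassemble_b64_shards(assets_dir: str, out_name: str = "best_model.pt") -> Path:
    """Join base64 text shards and decode them back into one binary file."""
    assets = Path(assets_dir)
    # Assumption: sorting the shard names lexicographically restores the split order.
    shards = sorted(assets.glob(f"{out_name}.b64.part-*"))
    if not shards:
        raise FileNotFoundError(f"no shards matching {out_name}.b64.part-* in {assets}")
    # Concatenate the shard text, then drop all whitespace so that
    # trailing newlines inside individual shards do not matter.
    encoded = "".join(p.read_text(encoding="ascii") for p in shards)
    encoded = "".join(encoded.split())
    out_path = assets / out_name
    out_path.write_bytes(base64.b64decode(encoded))
    return out_path
```

Calling `reassemble_b64_shards("assets")` would write `assets/best_model.pt` next to the shards, provided the assumptions above hold for your copy of the bundle.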
If you `pip install` only the code (without the assets), pass explicit paths:

```python
from judgment_partition_infer import Predictor

pred = Predictor(
    model_path="path/to/best_model.pt",
    vocab_path="path/to/vocab.json",
    device="cpu",
)
```

## Publish to Hugging Face Hub (maintainers)

1) Install the publishing dependency:

```bash
pip install -r requirements-publish.txt
```

2) Log in (recommended) or provide a token through an environment variable:

- `huggingface-cli login`
- or `export HF_TOKEN=...` / `export HUGGINGFACE_HUB_TOKEN=...`

3) Create and upload (model repo):

```bash
python publish_to_hf.py --repo-id / --public
```

If you already created the repo on the website:

```bash
python publish_to_hf.py --repo-id / --skip-create
```

### If HTTPS to huggingface.co is blocked

You can push via SSH (host `hf.co`) instead:

```bash
chmod +x push_to_hf_ssh.sh
./push_to_hf_ssh.sh /
```

(The script prints an SSH public key; add it at https://huggingface.co/settings/keys, then run the script again.)
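For readers who want to check by hand whether a document contains the anchors described under the anchor rules, the detection step can be approximated as follows. This is an illustrative sketch only, not the bundle's implementation: the function names are hypothetical, picking the earliest Z4 phrase is an assumption, and how an anchor index is converted into a `boundary[...]` value is not specified here.

```python
from typing import Optional

Z1_WINDOW = 100  # the Z1 anchor must fall within the first 100 characters
Z4_PHRASES = ("判决如下", "如下判决")


def find_z1_anchor(text: str) -> Optional[int]:
    """Index of the last "号" within the first 100 characters, or None."""
    pos = text[:Z1_WINDOW].rfind("号")
    return pos if pos != -1 else None


def find_z4_anchor(text: str) -> Optional[int]:
    """Index where the earliest Z4 phrase starts, or None (earliest is an assumption)."""
    hits = [p for p in (text.find(phrase) for phrase in Z4_PHRASES) if p != -1]
    return min(hits) if hits else None
```

A result of `None` from either function corresponds to the "anchor missing" case, in which the model's own boundaries are kept.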