corpus-refiner-jp

corpus-refiner-jp is a Japanese line-level keep classifier for corpus cleanup. Given a multi-line text, it predicts whether each line should be kept.

This model is fine-tuned from sbintuitions/modernbert-ja-130m. The model uses ModernBERT hidden states at special line-token positions and applies a binary classifier to each line.

Quick Start

import torch
from transformers import AutoModel, AutoTokenizer


LINE_TOKEN = "<line>"
THRESHOLD = 0.6
TEXT = """富士山は日本で最も高い山で、標高は3,776メートルである。
山頂付近は夏でも気温が低く、天候が急に変化することがある。
外部リンク: https://example.com/fuji
この記事は検証可能な参考文献が不足しています。
登山道は複数あり、利用者は体力や経験に応じて経路を選ぶ。
カテゴリ: 日本の山 | 火山 | 世界遺産"""


def main() -> None:
    # ---------------------------------------------------------
    # Load tokenizer and AutoModel-compatible line classifier.
    # ---------------------------------------------------------
    tokenizer = AutoTokenizer.from_pretrained("MK0727/corpus-refiner-jp")
    model = AutoModel.from_pretrained("MK0727/corpus-refiner-jp", trust_remote_code=True)
    model.eval()

    # ---------------------------------------------------------
    # Add the line marker before each input line.
    # ---------------------------------------------------------
    lines = TEXT.split("\n")
    text = "".join(f"{LINE_TOKEN}{line}" for line in lines)
    inputs = tokenizer(text, return_tensors="pt")

    # ---------------------------------------------------------
    # Predict the keep probability for each line marker.
    # ---------------------------------------------------------
    with torch.no_grad():
        logits = model(**inputs).logits
    probabilities = torch.softmax(logits, dim=-1)[:, 1].detach().cpu().tolist()

    # ---------------------------------------------------------
    # Print each line with its predicted label and probability.
    # ---------------------------------------------------------
    for line_number, (line, probability) in enumerate(zip(lines, probabilities, strict=True), start=1):
        label = "KEEP" if probability >= THRESHOLD else "DELETE"
        print(f"{line_number:02d} [{label:<6}] {probability:.4f} {line}")


if __name__ == "__main__":
    main()

Example output:

01 [KEEP  ] 0.9673 富士山は日本で最も高い山で、標高は3,776メートルである。
02 [KEEP  ] 0.9888 山頂付近は夏でも気温が低く、天候が急に変化することがある。
03 [DELETE] 0.0240 外部リンク: https://example.com/fuji
04 [DELETE] 0.0817 この記事は検証可能な参考文献が不足しています。
05 [KEEP  ] 0.8815 登山道は複数あり、利用者は体力や経験に応じて経路を選ぶ。
06 [DELETE] 0.0447 カテゴリ: 日本の山 | 火山 | 世界遺産

Intended Use

This model is intended for preprocessing Japanese web corpora before language model training. It is useful when a dataset contains boilerplate text, navigation fragments, repeated links, low-value fragments, or other content that should be removed while preserving useful body text.
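As a minimal sketch of the cleanup step itself, the per-line keep probabilities can be turned into a filtered document as shown here. The helper name `filter_lines` is hypothetical and not part of the model's API; the default threshold of 0.6 is taken from the Quick Start and may need tuning for a given corpus.

```python
from typing import Sequence


def filter_lines(
    lines: Sequence[str],
    keep_probabilities: Sequence[float],
    threshold: float = 0.6,
) -> str:
    """Join the lines whose keep probability reaches the threshold.

    Hypothetical helper: it only post-processes probabilities that were
    already produced by the model (for example by the Quick Start script).
    """
    if len(lines) != len(keep_probabilities):
        raise ValueError("expected exactly one probability per line")
    kept = [
        line
        for line, probability in zip(lines, keep_probabilities)
        if probability >= threshold
    ]
    return "\n".join(kept)


if __name__ == "__main__":
    # Probabilities copied from three rows of the example output above.
    demo_lines = [
        "富士山は日本で最も高い山で、標高は3,776メートルである。",
        "外部リンク: https://example.com/fuji",
        "登山道は複数あり、利用者は体力や経験に応じて経路を選ぶ。",
    ]
    print(filter_lines(demo_lines, [0.9673, 0.0240, 0.8815]))
```

Raising the threshold removes more borderline lines, and lowering it keeps more of them.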

Model Details

  • Base model: sbintuitions/modernbert-ja-130m
  • Task: binary line-level classification
  • Positive label: line should be kept
  • Negative label: line should be deleted
  • Input length: up to 4096 tokens per window
  • Long documents: split into overlapping line windows
  • Architecture: ModernBERT encoder plus a classification head over line-token hidden states

Each input line must be prefixed with the special line token, <line>. The model classifies the hidden state at each line-token position, so one forward pass produces predictions for every line in the window.
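The idea of classifying hidden states at line-token positions can be sketched in PyTorch as follows. This illustrates the general technique only and is an assumption about how such a head might work: the model's actual implementation is loaded through trust_remote_code and may differ. The function name `classify_line_tokens`, the toy tensor shapes, and the token id 5 are all hypothetical.

```python
import torch
from torch import nn


def classify_line_tokens(
    hidden_states: torch.Tensor,  # [batch, seq_len, hidden_size]
    input_ids: torch.Tensor,      # [batch, seq_len]
    line_token_id: int,
    head: nn.Module,
) -> torch.Tensor:
    """Apply a classification head only at the line-token positions.

    Returns logits of shape [num_line_tokens, num_labels], ordered by
    batch index and then by position in the sequence.
    """
    line_mask = input_ids == line_token_id   # boolean mask, [batch, seq_len]
    line_states = hidden_states[line_mask]   # [num_line_tokens, hidden_size]
    return head(line_states)


if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy stand-ins for encoder output: 1 sequence, 6 tokens, hidden size 8.
    hidden = torch.randn(1, 6, 8)
    ids = torch.tensor([[5, 1, 2, 5, 3, 5]])  # 5 plays the role of <line>
    head = nn.Linear(8, 2)                    # two labels: delete, keep
    logits = classify_line_tokens(hidden, ids, line_token_id=5, head=head)
    print(logits.shape)  # one row of logits per line token
```

Because the head runs only on the gathered positions, the cost of the classifier scales with the number of lines rather than the number of tokens.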

Training Data

The model is trained on MK0727/noise-line-label-jp, a Japanese line-level dataset with lines_to_keep annotations.

Performance

The model was evaluated on a test split from MK0727/noise-line-label-jp.

The most important metric is F1 keep, because the model is intended to preserve useful corpus lines while removing noisy ones.

Metric         | Score (higher is better, max 1.0) | Meaning
F1 keep        | 0.927                             | Balance between keeping useful lines and avoiding noisy lines
Precision keep | 0.946                             | How often kept lines are actually useful
Recall keep    | 0.908                             | How many useful lines the model keeps

Precision (0.946) is higher than recall (0.908), so the model is conservative about keeping lines: some useful short lines, metadata-like lines, or code-like lines may be removed.
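As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 keep score can be recomputed from the other two figures. The variable names below are illustrative only.

```python
# Reported scores for the "keep" class.
precision_keep = 0.946
recall_keep = 0.908

# F1 is the harmonic mean of precision and recall.
f1_keep = 2 * precision_keep * recall_keep / (precision_keep + recall_keep)

print(f"{f1_keep:.3f}")  # prints 0.927, matching the reported F1 keep
```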

Inference Notes

For short texts, all lines can be processed in a single 4096-token window. For longer texts, split the document into overlapping windows and average probabilities for lines that appear in more than one window.

Limitations

This model is specialized for Japanese web-text cleanup. It may perform poorly on other languages, code, and tables.
