corpus-refiner-jp
corpus-refiner-jp is a Japanese line-level keep classifier for corpus cleanup. Given a multi-line text, it predicts whether each line should be kept.
This model is fine-tuned from sbintuitions/modernbert-ja-130m. The model uses ModernBERT hidden states at special line-token positions and applies a binary classifier to each line.
Quick Start
import torch
from transformers import AutoModel, AutoTokenizer
LINE_TOKEN = "<line>"
THRESHOLD = 0.6
TEXT = """富士山は日本で最も高い山で、標高は3,776メートルである。
山頂付近は夏でも気温が低く、天候が急に変化することがある。
外部リンク: https://example.com/fuji
この記事は検証可能な参考文献が不足しています。
登山道は複数あり、利用者は体力や経験に応じて経路を選ぶ。
カテゴリ: 日本の山 | 火山 | 世界遺産"""
def main() -> None:
    # ---------------------------------------------------------
    # Load tokenizer and AutoModel-compatible line classifier.
    # ---------------------------------------------------------
    tokenizer = AutoTokenizer.from_pretrained("MK0727/corpus-refiner-jp")
    model = AutoModel.from_pretrained("MK0727/corpus-refiner-jp", trust_remote_code=True)
    model.eval()

    # ---------------------------------------------------------
    # Add the line marker before each input line.
    # ---------------------------------------------------------
    lines = TEXT.split("\n")
    text = "".join(f"{LINE_TOKEN}{line}" for line in lines)
    inputs = tokenizer(text, return_tensors="pt")

    # ---------------------------------------------------------
    # Predict the keep probability for each line marker.
    # ---------------------------------------------------------
    with torch.no_grad():
        logits = model(**inputs).logits
    probabilities = torch.softmax(logits, dim=-1)[:, 1].detach().cpu().tolist()

    # ---------------------------------------------------------
    # Print each line with its predicted label and probability.
    # ---------------------------------------------------------
    for line_number, (line, probability) in enumerate(zip(lines, probabilities, strict=True), start=1):
        label = "KEEP" if probability >= THRESHOLD else "DELETE"
        print(f"{line_number:02d} [{label:<6}] {probability:.4f} {line}")


if __name__ == "__main__":
    main()
Example output:
01 [KEEP ] 0.9673 富士山は日本で最も高い山で、標高は3,776メートルである。
02 [KEEP ] 0.9888 山頂付近は夏でも気温が低く、天候が急に変化することがある。
03 [DELETE] 0.0240 外部リンク: https://example.com/fuji
04 [DELETE] 0.0817 この記事は検証可能な参考文献が不足しています。
05 [KEEP ] 0.8815 登山道は複数あり、利用者は体力や経験に応じて経路を選ぶ。
06 [DELETE] 0.0447 カテゴリ: 日本の山 | 火山 | 世界遺産
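Once every line has a keep probability, the cleaned text is simply the kept lines joined back together. The following is a minimal sketch of that last step, not part of the model's API: `filter_lines` is a hypothetical helper name, and the probabilities are the rounded values copied from the example output above.

```python
def filter_lines(lines, probabilities, threshold=0.6):
    """Return only the lines whose keep probability meets the threshold."""
    if len(lines) != len(probabilities):
        raise ValueError("expected one probability per line")
    return [line for line, p in zip(lines, probabilities) if p >= threshold]


# Lines and rounded probabilities copied from the example output above.
lines = [
    "富士山は日本で最も高い山で、標高は3,776メートルである。",
    "山頂付近は夏でも気温が低く、天候が急に変化することがある。",
    "外部リンク: https://example.com/fuji",
    "この記事は検証可能な参考文献が不足しています。",
    "登山道は複数あり、利用者は体力や経験に応じて経路を選ぶ。",
    "カテゴリ: 日本の山 | 火山 | 世界遺産",
]
probabilities = [0.9673, 0.9888, 0.0240, 0.0817, 0.8815, 0.0447]

cleaned = "\n".join(filter_lines(lines, probabilities))
print(cleaned)
```

With the default threshold of 0.6, only the three body-text lines remain. Raising the threshold trades recall for precision, so it is worth tuning on a sample of your own corpus.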
Intended Use
This model is intended for preprocessing Japanese web corpora before language model training. It is useful when a dataset contains boilerplate text, navigation fragments, repeated links, low-value fragments, or other content that should be removed while preserving useful body text.
Model Details
- Base model: sbintuitions/modernbert-ja-130m
- Task: binary line-level classification
- Positive label: line should be kept
- Negative label: line should be deleted
- Input length: up to 4096 tokens per window
- Long documents: split into overlapping line windows
- Architecture: ModernBERT encoder plus a classification head over line-token hidden states
Each input line has to be prefixed with a special line token, <line>. The model classifies the hidden state corresponding to each line token, so one forward pass can produce predictions for multiple lines.
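To make the per-line classification concrete, here is a dependency-free toy version of the idea: find the positions of the line token, take the hidden state at each position, and score it with a two-class head. Everything in it is an illustrative assumption (the token IDs, the 2-dimensional hidden states, the weights, and the helper names), and plain Python lists stand in for tensors. The real head lives in the model's remote code and is not reproduced here.

```python
import math

LINE_TOKEN_ID = 5  # hypothetical ID for "<line>"; the real ID comes from the tokenizer


def line_positions(input_ids, line_token_id=LINE_TOKEN_ID):
    """Indices of every line marker in the token sequence."""
    return [i for i, tok in enumerate(input_ids) if tok == line_token_id]


def keep_probability(hidden, weights, bias):
    """Two-class linear head followed by a softmax; returns P(keep)."""
    logits = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weights, bias)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[1] / sum(exps)  # index 1 = keep, as in the Quick Start


# Toy sequence: two lines, each introduced by the line marker.
input_ids = [5, 11, 12, 5, 13]
hidden_states = [[2.0, 0.0], [0.1, 0.1], [0.3, 0.2], [-2.0, 0.0], [0.0, 0.5]]
weights = [[-1.0, 0.0], [1.0, 0.0]]  # row 0 = delete, row 1 = keep
bias = [0.0, 0.0]

positions = line_positions(input_ids)
probs = [keep_probability(hidden_states[i], weights, bias) for i in positions]
print(positions, [round(p, 3) for p in probs])
```

Only the hidden states at the marker positions are scored, which is why the number of predictions equals the number of lines rather than the number of tokens.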
Training Data
The model was trained on MK0727/noise-line-label-jp, a Japanese line-level dataset with lines_to_keep annotations.
Performance
The model was evaluated on a test split from MK0727/noise-line-label-jp.
The most important metric is F1 keep, because the model is intended to preserve useful corpus lines while removing noisy ones.
| Metric | Score (higher is better, max 1.0) | Meaning |
|---|---|---|
| F1 keep | 0.927 | Balance between keeping useful lines and avoiding noisy lines |
| Precision keep | 0.946 | How often kept lines are actually useful |
| Recall keep | 0.908 | How many useful lines the model keeps |
Precision is higher than recall, so the model leans conservative about keeping lines: it rarely keeps noise, but some useful short lines, metadata-like lines, or code-like lines may be removed.
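As a quick consistency check, F1 is the harmonic mean of precision and recall, and the three reported scores agree with each other:

```python
precision_keep = 0.946
recall_keep = 0.908

# F1 is the harmonic mean of precision and recall.
f1_keep = 2 * precision_keep * recall_keep / (precision_keep + recall_keep)
print(round(f1_keep, 3))  # 0.927
```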
Inference Notes
For short texts, all lines can be processed in a single 4096-token window. For longer texts, split the document into overlapping windows and average probabilities for lines that appear in more than one window.
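The windowing and averaging described above could be sketched roughly as follows. This is an assumption-laden illustration, not the author's implementation: the model card does not state how windows are built or how many lines overlap, so `make_line_windows`, `average_probabilities`, and the `overlap_lines` parameter are hypothetical. The per-line token counts are assumed to already include the `<line>` marker, and the budget should leave room for any special tokens the tokenizer adds.

```python
def make_line_windows(token_counts, max_tokens=4096, overlap_lines=2):
    """Group consecutive line indices into windows that fit the token budget.

    Each new window re-includes up to `overlap_lines` lines from the end of
    the previous one. A single line longer than the budget still gets its
    own window and would need truncation by the caller.
    """
    windows = []
    start = 0
    n = len(token_counts)
    while start < n:
        end = start
        total = 0
        # Always take at least one line, then add lines while they fit.
        while end < n and (end == start or total + token_counts[end] <= max_tokens):
            total += token_counts[end]
            end += 1
        windows.append(list(range(start, end)))
        if end >= n:
            break
        # Step back to create the overlap, but always advance by at least one line.
        start = max(end - overlap_lines, start + 1)
    return windows


def average_probabilities(num_lines, windows, window_probs):
    """Average keep probabilities for lines scored in more than one window."""
    sums = [0.0] * num_lines
    counts = [0] * num_lines
    for line_ids, probs in zip(windows, window_probs):
        for line_id, p in zip(line_ids, probs):
            sums[line_id] += p
            counts[line_id] += 1
    return [s / c for s, c in zip(sums, counts)]


# Toy example: five lines of 3 tokens each and a 9-token budget.
token_counts = [3, 3, 3, 3, 3]
windows = make_line_windows(token_counts, max_tokens=9, overlap_lines=1)
# Stand-in scores; in practice each list comes from one model forward pass.
window_probs = [[0.9, 0.8, 0.6], [0.4, 0.1, 0.7]]
merged = average_probabilities(len(token_counts), windows, window_probs)
print(windows, merged)
```

Here line index 2 appears in both windows, so its two scores are averaged before the threshold is applied. Splitting on line boundaries rather than token boundaries keeps every `<line>` marker together with its full line.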
Limitations
This model is specialized for Japanese web-text cleanup. It may perform poorly on other languages, code, and tables.