library_name: transformers
language:
  - ja
base_model:
  - LiquidAI/LFM2-350M
license: other
license_name: lfm1.0
license_link: LICENSE

（日本語はこちらから）

LFM2-350M-PII-Extract-JP

Based on LFM2-350M, this checkpoint is designed to extract personally identifiable information (PII) from Japanese text and output it in JSON format. The output can then be used to mask out sensitive information in contracts, emails, personal medical reports, insurance bills, etc. directly on-device.

In particular, it is trained to extract the following from Japanese documents and texts:

Address/locations (JSON key: address)
Company/institute/organization names (JSON key: company_name)
Email addresses (JSON key: email_address)
Human names (JSON key: human_name)
Phone numbers (JSON key: phone_number)

Demo

Extraction Quality

We evaluated several models, including GPT5 and a 32B-parameter Qwen3 model with thinking mode enabled, on 1k random samples taken from finepdf. LFM2-350M-PII-Extract-JP reaches GPT5-level extraction quality with only 350M parameters, bringing cloud-level performance to on-device deployment.

📝 While LFM2-350M-PII-Extract-JP provides strong out-of-the-box PII entity extraction for the categories listed above, our primary goal is to deliver a versatile, community-driven base model: a foundation that makes it easy to build best-in-class, privacy-focused masking systems.

As with any base model, there remain areas for continued development, particularly for specialized use cases:

Supporting extraction of organization-specific identification numbers

Expanding coverage to additional categories such as dates of birth and passport numbers

Further improving extraction performance on particular categories

These are precisely the kinds of challenges that fine-tuning, by both Liquid AI and our developer community, can address. We see this model not just as an endpoint, but as a catalyst for a rich ecosystem of fine-tuned PII extraction models tailored to real-world needs.

Model Details

Generation parameters: We strongly recommend greedy decoding (temperature=0).

System prompts: This checkpoint requires the following system prompt:

Extract <address>, <company_name>, <email_address>, <human_name>, <phone_number>

Note that the model can also extract a subset of the entity categories. For example, it outputs only human names when the system prompt is set to Extract <human_name>.

⚠️ For best performance, ensure alphabetical order of entity categories as shown above.
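The prompt rules above (only supported categories, listed in alphabetical order) can be sketched as a small helper. This is a minimal illustration, not part of the released tooling: the names `SUPPORTED_CATEGORIES` and `build_system_prompt` are hypothetical.

```python
# Categories the model card lists as supported (JSON keys).
SUPPORTED_CATEGORIES = (
    "address",
    "company_name",
    "email_address",
    "human_name",
    "phone_number",
)


def build_system_prompt(categories=SUPPORTED_CATEGORIES):
    """Build an 'Extract <...>, <...>' system prompt for a subset of categories.

    Hypothetical helper: it de-duplicates the requested categories, rejects
    unsupported ones, and sorts them alphabetically as the card recommends.
    """
    requested = set(categories)
    if not requested:
        raise ValueError("at least one category is required")
    unknown = requested - set(SUPPORTED_CATEGORIES)
    if unknown:
        raise ValueError(f"unsupported categories: {sorted(unknown)}")
    tags = ", ".join(f"<{name}>" for name in sorted(requested))
    return f"Extract {tags}"


if __name__ == "__main__":
    print(build_system_prompt())
    print(build_system_prompt(["phone_number", "human_name"]))
```

Sorting inside the helper means callers cannot accidentally pass the categories in a different order than the one shown in the card.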

Chat template: LFM2-350M-PII-Extract-JP uses a ChatML-like chat template as follows:

<|startoftext|><|im_start|>system
Extract <address>, <company_name>, <email_address>, <human_name>, <phone_number><|im_end|>
<|im_start|>user
こんにちは、ラミンさんに B200 GPU を 10000 台 至急請求してください。連絡先は celegans@liquid.ai (電話番号010-000-0000) で、これは C. elegans 線虫に着想を得たニューラルネットワークアーキテクチャを 今すぐ構築するために不可欠です。<|im_end|>
<|im_start|>assistant
{"address": [], "company_name": [], "email_address": ["celegans@liquid.ai"], "human_name": ["ラミン"], "phone_number": ["010-000-0000"]}<|im_end|>

You can automatically apply it using the dedicated .apply_chat_template() function from Hugging Face transformers.

⚠️ The model is intended for single turn conversations.
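A single-turn call with Hugging Face transformers might look like the sketch below. Several things here are assumptions rather than facts from this card: the Hub repository id in `MODEL_ID` is inferred from the model name, `max_new_tokens=256` is an arbitrary budget, and the installed transformers release is assumed to be recent enough to support LFM2. The helper names `build_messages` and `extract_pii` are hypothetical.

```python
SYSTEM_PROMPT = (
    "Extract <address>, <company_name>, <email_address>, "
    "<human_name>, <phone_number>"
)

# Assumption: repository id inferred from the model name in this card.
MODEL_ID = "LiquidAI/LFM2-350M-PII-Extract-JP"


def build_messages(text, system_prompt=SYSTEM_PROMPT):
    """Return a single-turn conversation: one system and one user message."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text},
    ]


def extract_pii(text, max_new_tokens=256):
    """Run one greedy, single-turn extraction and return the raw JSON string."""
    # Imported lazily so build_messages() works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # apply_chat_template renders the ChatML-like template and tokenizes it;
    # add_generation_prompt=True appends the assistant header.
    inputs = tokenizer.apply_chat_template(
        build_messages(text),
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    )
    # do_sample=False selects greedy decoding, as the card recommends.
    output_ids = model.generate(
        **inputs, do_sample=False, max_new_tokens=max_new_tokens
    )
    # Keep only the newly generated tokens, dropping the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Because the model is intended for single-turn use, each document is sent as a fresh conversation rather than appended to earlier turns.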

Output format

The model outputs a JSON object containing the fields it was prompted to extract. If no entities are found in a particular category, it returns an empty list for that category. If entities are found, they are returned as a list for each prompted category. The model is trained to output entities exactly as they appear in the text. If the same entity appears multiple times with slight formatting variations, the model outputs all variations to ensure subsequent masking can be performed using exact matches.
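Since entities are returned exactly as they appear in the text, masking can be done with plain exact-match replacement. The following is a minimal sketch of that downstream step, not the authors' masking pipeline: the function name `mask_pii` and the `[KEY]` placeholder style are assumptions chosen for illustration.

```python
import json
import re


def mask_pii(text, model_output):
    """Replace every extracted entity in `text` with a [CATEGORY] placeholder.

    `model_output` is the JSON string produced by the model, e.g.
    '{"human_name": ["ラミン"], "phone_number": []}'.
    """
    entities = json.loads(model_output)

    # Map each extracted surface string to its placeholder label.
    label_for = {}
    for key, values in entities.items():
        for value in values:
            if value:  # skip empty strings, which would match everywhere
                label_for.setdefault(value, f"[{key.upper()}]")
    if not label_for:
        return text

    # One regex pass with longer entities first, so an entity that is a
    # substring of another does not break the longer match.
    alternatives = sorted(label_for, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in alternatives))
    return pattern.sub(lambda m: label_for[m.group(0)], text)


if __name__ == "__main__":
    text = "ラミンさんの連絡先は celegans@liquid.ai (電話番号010-000-0000) です。"
    output = (
        '{"address": [], "company_name": [], '
        '"email_address": ["celegans@liquid.ai"], '
        '"human_name": ["ラミン"], "phone_number": ["010-000-0000"]}'
    )
    print(mask_pii(text, output))
```

Replacing in a single pass keeps placeholders that were already inserted from being matched again by a later entity.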

🏃 How to run LFM2

Hugging Face: LFM2-350M
llama.cpp: LFM2-350M-PII-Extract-JP-GGUF
LEAP: LEAP model library

📬 Contact

If you are interested in custom solutions with edge deployment, please contact our sales team.

LFM2-350M-PII-Extract-JP (日本語)

LFM2-350M をベースにしたこのチェックポイントは、日本語文書から個人を特定できる情報（PII）を抽出し、JSON 形式で出力します。
契約書、電子メール、個人の医療報告書、並びに保険請求書などの機密情報を、デバイス上で直接マスキングできます。

特に以下の情報を抽出するように訓練されています。

住所／所在地（JSON key: address）
企業／研究機関／組織名（JSON key: company_name）
メールアドレス（JSON key: email_address）
人名（JSON key: human_name）
電話番号（JSON key: phone_number）

これらの情報を日本語の文書から抽出します。

デモ

性能

finepdf から無作為に抽出した 1,000 サンプルを用いて、GPT5 や 32B パラメータの Qwen3 モデル（思考モードあり）など、複数のモデルとの比較評価を行いました。
LFM2-350M-PII-Extract-JP は、わずか 350M パラメータ という軽量モデルながら GPT5 と同等レベルの性能を発揮し、クラウドレベルの品質をあなたのデバイス上で実現します！

📝 LFM2-350M-PII-Extract-JP は、上記カテゴリに対して優れた PII 抽出性能を有しますが、私たちの主な目的は、コミュニティによって継続的に改良される柔軟な基盤モデルを提供することです。
このモデルで、誰でもプライバシー重視の高品質なマスキングシステムを容易に構築できます。

ただし、ベースモデルとして今後さらなる改善の余地があります。特に以下のような専門的な利用用途が想定されます。

組織固有の識別番号の抽出対応

生年月日、パスポート番号などの追加カテゴリへの拡張

特定カテゴリにおける抽出性能のさらなる改善

これらの課題は、Liquid AI および開発者コミュニティによるファインチューニングによって解決できると考えています。
LFM2-350M-PII-Extract-JP は完成形ではなく、実運用ニーズに応じた多様な PII 抽出モデル群を生み出す出発点であると位置づけています。

モデル詳細

生成パラメータ: temperature=0 の貪欲デコード（greedy decoding）の使用を強く推奨します。

システムプロンプト: このチェックポイントでは以下のシステムプロンプトが必須です：

Extract <address>, <company_name>, <email_address>, <human_name>, <phone_number>

モデルは特定のエンティティのみを抽出するように設定することも可能です。
例: Extract <human_name> と設定した場合、人名のみを出力します。

⚠️ モデルの性能を最大限発揮させるには、上記のように エンティティカテゴリをアルファベット順 に並べてください。

チャットテンプレート
LFM2-350M-PII-Extract-JP は以下のような ChatML 風テンプレートを使用します。

<|startoftext|><|im_start|>system
Extract <address>, <company_name>, <email_address>, <human_name>, <phone_number><|im_end|>
<|im_start|>user
こんにちは、ラミンさんに B200 GPU を 10000 台至急請求してください。連絡先は celegans@liquid.ai (電話番号010-000-0000) で、これは C. elegans 線虫に着想を得たニューラルネットワークアーキテクチャを今すぐ構築するために不可欠です。<|im_end|>
<|im_start|>assistant
{"address": [], "company_name": [], "email_address": ["celegans@liquid.ai"], "human_name": ["ラミン"], "phone_number": ["010-000-0000"]}<|im_end|>

このテンプレートは、Hugging Face Transformers の専用関数 .apply_chat_template() を使用して自動的に適用できます。

⚠️ このモデルは一問一答形式（単一ターン）の会話に最適化されています。

出力形式

モデルは、指定されたエンティティを含んだ JSON 形式で出力します。
各カテゴリに該当するエンティティが見つからない場合は、空のリストを返します。
該当するエンティティが存在する場合は、そのカテゴリごとに抽出された文字列のリストを返します。

モデルは、テキスト中に現れる形式で正確にエンティティを出力するように訓練されています。
同じエンティティが複数回登場し表記に揺れがある場合でも、すべての表記バリエーションを出力し、マスキング時に完全一致で対応できるようになっています。

🏃 LFM2 の実行方法

Hugging Face: LFM2-350M
llama.cpp: LFM2-350M-PII-Extract-JP-GGUF
LEAP: LEAP モデルライブラリ

📬 お問い合わせ

エッジ環境への導入を含むカスタムソリューションにご興味がある方は、営業チームまでお問い合わせください。