+---
+base_model: models/gemma-2-9b
+library_name: peft
+license: cc-by-sa-4.0
+datasets:
+- universal-dependencies/universal_dependencies
+language:
+- en
+- ja
+- zh
+- ko
+- fr
+- de
+- sl
+metrics:
+- LAS
+- UAS
+- UPOS
+pipeline_tag: text-generation
+---
+# Model Card for prompt-based-parsing-gemma-2-9b-lora-v1
+[megagonlabs/prompt-based-parsing-gemma-2-9b-lora-v1](https://huggingface.co/megagonlabs/prompt-based-parsing-gemma-2-9b-lora-v1) is a dependency parsing model that analyzes a gold token sequence given in the user prompt, step by step.
+The model was trained on Universal Dependencies datasets for 7 languages and achieves SoTA-level accuracy on UPOS, UAS, and LAS.
+## Terms of Use
+This LoRA adapter package is released under CC BY-SA 4.0.
+However, please note the following important conditions on its use:
+- This package **does not contain any part of the original Gemma 2 model**.
+- To use this package, you must obtain and use the base model distributed by Google:
+  [Gemma 2 9B base on Hugging Face](https://huggingface.co/google/gemma-2-9b)
+- **Use of the Gemma models requires agreement to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms)**.
+## Usage
+- Install
+```Console
+pip install -U vllm==0.7.2 sudachipy sudachidict-core
+```
+This first release provides a code example only for Japanese. It uses the [sudachipy](https://github.com/WorksApplications/SudachiPy) tokenizer, whose token boundaries closely match those of the UD Japanese datasets.
+Code examples for other languages will be provided in upcoming releases.
+- Code example
+```Python
+import json
+import sudachipy
+from vllm import LLM, SamplingParams
+from vllm.lora.request import LoRARequest
+base_model = "google/gemma-2-9b"
+adapter_model = "megagonlabs/prompt-based-parsing-gemma-2-9b-lora-v1"
+input_language = "Japanese"
+input_sentences = ["銀座でランチをご一緒しましょう。", "この時代から、日本列島に人類が住んだ遺跡や遺物が多く発見されている。"]
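+# Split mode A produces Sudachi's shortest token units.
+tokenizer = sudachipy.Dictionary().create(mode=sudachipy.Tokenizer.SplitMode.A)
+# Tokenize a sentence; a trailing space is appended to any token followed by whitespace and to the last token.
+def tokenize_japanese_space_after(sentence) -> list[str]: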
+    tokens = []
+    for m in tokenizer.tokenize(sentence):
+        surface = m.surface()
+        if surface in [" ", "　"]:
+            if tokens and tokens[-1][-1] != " ":
+                tokens[-1] += " "
+        else:
+            tokens.append(surface)
+    if tokens and tokens[-1][-1] != " ":
+        tokens[-1] += " "
+    return tokens
+# Build the user prompt by filling in the language, token count, sentence, and token list.
+def apply_template(language: str, sentence: str, tokens: list[str]) -> str:
+    return """You are an <<<LANGUAGE>>> linguist and specialize in <<<LANGUAGE>>> dependency analysis based on Universal Dependencies.
+We will now perform dependency parsing on <<<LANGUAGE>>> sentence.
+After splitting the input sentence into words as shown below, execute following three tasks:
+- Task 1
+Create a TSV with three fields: word index from 1 to <<<TOKEN_NUM>>> + word + part of speech.
+- Task 2
+Add a field for the dependent word indexes to each row to the output of Task 1.
+However, for the word that is the main predicate of the sentence, the dependent word index should be 0.
+- Task 3
+Add a field for the Universal Dependencies relation labels to the output of Task 2.
+input sentence:
+<<<SENTENCE>>>
+words:
+<<<TOKENS>>>
+""".replace("<<<LANGUAGE>>>", language).replace("<<<TOKEN_NUM>>>", str(len(tokens))).replace("<<<SENTENCE>>>", sentence).replace("<<<TOKENS>>>", "\n".join(tokens))
+input_prompts = [
+    [
+        {
+            "role": "user",
+            "content": apply_template(input_language, s, tokenize_japanese_space_after(s)),
+        }
+    ] for s in input_sentences
+]
+llm = LLM(
+    model=base_model,
+    enable_lora=True,
+    tokenizer=adapter_model,
+    dtype="bfloat16",
+    gpu_memory_utilization=0.9,
+    tensor_parallel_size=1,
+    enforce_eager=True,
+)
+sampling_params = SamplingParams(
+    temperature=0.,
+    max_tokens=1024,  # <= 8192
+)
+lora_request = LoRARequest("adapter", 1, adapter_model)
+results = llm.chat(
+    messages=input_prompts,
+    sampling_params=sampling_params,
+    use_tqdm=False,
+    lora_request=lora_request,
+)
+for sentence, result in zip(input_sentences, results):
+    print("# text =", sentence)
+    print(result.outputs[0].text)
+```
+- Output of code example
+```
+# text = 銀座でランチをご一緒しましょう。
+- Task 1
+1	銀座	PROPN
+2	で	ADP
+3	ランチ	NOUN
+4	を	ADP
+5	ご	NOUN
+6	一緒	NOUN
+7	し	AUX
+8	ましょう	AUX
+9	。 	PUNCT
+- Task 2
+1	銀座	PROPN	6
+2	で	ADP	1
+3	ランチ	NOUN	6
+4	を	ADP	3
+5	ご	NOUN	6
+6	一緒	NOUN	0
+7	し	AUX	6
+8	ましょう	AUX	6
+9	。 	PUNCT	6
+- Task 3
+1	銀座	PROPN	6	nmod
+2	で	ADP	1	case
+3	ランチ	NOUN	6	obj
+4	を	ADP	3	case
+5	ご	NOUN	6	compound
+6	一緒	NOUN	0	root
+7	し	AUX	6	aux
+8	ましょう	AUX	6	aux
+9	。 	PUNCT	6	punct
+# text = この時代から、日本列島に人類が住んだ遺跡や遺物が多く発見されている。
+- Task 1
+1	この	DET
+2	時代	NOUN
+3	から	ADP
+4	、	PUNCT
+5	日本	PROPN
+6	列島	NOUN
+7	に	ADP
+8	人類	NOUN
+9	が	ADP
+10	住ん	VERB
+11	だ	AUX
+12	遺跡	NOUN
+13	や	ADP
+14	遺物	NOUN
+15	が	ADP
+16	多く	ADJ
+17	発見	VERB
+18	さ	AUX
+19	れ	AUX
+20	て	SCONJ
+21	いる	VERB
+22	。 	PUNCT
+- Task 2
+1	この	DET	2
+2	時代	NOUN	17
+3	から	ADP	2
+4	、	PUNCT	2
+5	日本	PROPN	6
+6	列島	NOUN	10
+7	に	ADP	6
+8	人類	NOUN	10
+9	が	ADP	8
+10	住ん	VERB	12
+11	だ	AUX	10
+12	遺跡	NOUN	14
+13	や	ADP	12
+14	遺物	NOUN	17
+15	が	ADP	14
+16	多く	ADJ	17
+17	発見	VERB	0
+18	さ	AUX	17
+19	れ	AUX	17
+20	て	SCONJ	17
+21	いる	VERB	20
+22	。 	PUNCT	17
+- Task 3
+1	この	DET	2	det
+2	時代	NOUN	17	obl
+3	から	ADP	2	case
+4	、	PUNCT	2	punct
+5	日本	PROPN	6	compound
+6	列島	NOUN	10	obl
+7	に	ADP	6	case
+8	人類	NOUN	10	nsubj
+9	が	ADP	8	case
+10	住ん	VERB	12	acl
+11	だ	AUX	10	aux
+12	遺跡	NOUN	14	nmod
+13	や	ADP	12	case
+14	遺物	NOUN	17	nsubj
+15	が	ADP	14	case
+16	多く	ADJ	17	advcl
+17	発見	VERB	0	root
+18	さ	AUX	17	aux
+19	れ	AUX	17	aux
+20	て	SCONJ	17	mark
+21	いる	VERB	20	fixed
+22	。 	PUNCT	17	punct
+```
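The Task 3 section of the output above carries the complete analysis (index, word, UPOS, head index, relation label). The following is a minimal sketch of how that section might be read into rows and written out as CoNLL-U lines. The helper names `parse_task3` and `to_conllu` are hypothetical and not part of this release, and the sketch assumes that a trailing space on a word marks following whitespace, as in the tokenizer example above. It simply skips malformed rows and is not the recovery process used for the reported evaluation.

```python
def parse_task3(output_text: str) -> list[dict]:
    """Collect the five-field rows that follow the '- Task 3' marker."""
    rows = []
    in_task3 = False
    for line in output_text.splitlines():
        if line.startswith("- Task"):
            in_task3 = line.strip() == "- Task 3"
            continue
        fields = line.split("\t")
        if not in_task3 or len(fields) != 5:
            continue  # outside Task 3, or a malformed row: skip it
        index, word, upos, head, deprel = fields
        if not (index.isdigit() and head.isdigit()):
            continue  # non-numeric index or head: skip instead of guessing
        rows.append({
            "id": int(index),
            "form": word.rstrip(" "),
            "space_after": word.endswith(" "),  # assumption: trailing space = whitespace follows
            "upos": upos,
            "head": int(head),
            "deprel": deprel.strip(),
        })
    return rows


def to_conllu(rows: list[dict]) -> str:
    """Render rows as 10-column CoNLL-U lines; unknown columns are '_'."""
    lines = []
    for r in rows:
        misc = "_" if r["space_after"] else "SpaceAfter=No"
        lines.append("\t".join([
            str(r["id"]), r["form"], "_", r["upos"], "_", "_",
            str(r["head"]), r["deprel"], "_", misc,
        ]))
    return "\n".join(lines)


# Excerpt of the first example output (last two Task 3 rows only).
sample_output = (
    "- Task 1\n8\tましょう\tAUX\n9\t。 \tPUNCT\n"
    "- Task 3\n8\tましょう\tAUX\t6\taux\n9\t。 \tPUNCT\t6\tpunct\n"
)
rows = parse_task3(sample_output)
print(to_conllu(rows))
```

In the code example above, the same functions could be applied to `result.outputs[0].text` for each sentence.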
+## Training and Evaluation
+### Training Data and Hyper-parameters
+We used the training sets of the following UD datasets for LoRA SFT:
+- [UD_English-EWT](https://github.com/UniversalDependencies/UD_English-EWT) r2.15
+- [UD_Japanese-GSD](https://github.com/UniversalDependencies/UD_Japanese-GSD) r2.15
+- [UD_Chinese-GSDSimp](https://github.com/UniversalDependencies/UD_Chinese-GSDSimp) r2.15
+- [UD_Korean-GSD](https://github.com/UniversalDependencies/UD_Korean-GSD) r2.15
+- [UD_French-GSD](https://github.com/UniversalDependencies/UD_French-GSD) r2.15
+- [UD_German-GSD](https://github.com/UniversalDependencies/UD_German-GSD) r2.15
+- [UD_Slovenian-SSJ](https://github.com/UniversalDependencies/UD_Slovenian-SSJ) r2.15
+We used the following training hyper-parameters:
+- lr: 5e-5
+- num_train_epochs: 2
+- lora_target_modules: "all-linear"
+- lora_r: 8
+- lora_alpha: 8
+- lora_dropout: 0.05
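As a small worked example of what these values imply: in the standard LoRA formulation the low-rank update is multiplied by `lora_alpha / lora_r`, so with both set to 8 the update is applied at scale 1.0. The sketch below only restates the listed values in a plain dictionary and computes that ratio. It assumes the default (non rank-stabilized) scaling, and the variable names `lora_hparams` and `scaling` are illustrative, not taken from the training code.

```python
# Hyper-parameters as listed above, restated for illustration.
lora_hparams = {
    "lr": 5e-5,
    "num_train_epochs": 2,
    "lora_target_modules": "all-linear",
    "lora_r": 8,
    "lora_alpha": 8,
    "lora_dropout": 0.05,
}

# Standard LoRA scaling factor applied to the low-rank update (assumed default).
scaling = lora_hparams["lora_alpha"] / lora_hparams["lora_r"]
print(scaling)  # 1.0
```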
+Details of the experimental conditions will be released later.
+### Evaluation Results
+The accuracies in the table below were computed after applying a simple recovery process to the TSV output of Task 3.
+| dataset | UPOS | UAS | LAS |
+| ---- | ---- | ---- | ---- |
+| [UD_English-EWT](https://github.com/UniversalDependencies/UD_English-EWT) | 0.982 | 0.951 | 0.937 |
+| [UD_Japanese-GSD](https://github.com/UniversalDependencies/UD_Japanese-GSD) | 0.987 | 0.952 | 0.939 |
+| [UD_Chinese-GSDSimp](https://github.com/UniversalDependencies/UD_Chinese-GSDSimp) | 0.972 | 0.889 | 0.862 |
+| [UD_Korean-GSD](https://github.com/UniversalDependencies/UD_Korean-GSD) | 0.970 | 0.898 | 0.868 |
+| [UD_French-GSD](https://github.com/UniversalDependencies/UD_French-GSD) | 0.981 | 0.956 | 0.943 |
+| [UD_German-GSD](https://github.com/UniversalDependencies/UD_German-GSD) | 0.974 | 0.908 | 0.873 |
+| [UD_Slovenian-SSJ](https://github.com/UniversalDependencies/UD_Slovenian-SSJ) | 0.989 | 0.954 | 0.939 |
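For readers unfamiliar with the three metrics, here is a minimal sketch of their usual token-level definitions: UPOS is the fraction of tokens with the correct part-of-speech tag, UAS the fraction with the correct head, and LAS the fraction with both the correct head and the correct relation label. The function name `score` and the toy data are hypothetical. The sketch assumes gold tokenization and exact label matching, and it is not the evaluation script used to produce the table.

```python
def score(gold: list[tuple], pred: list[tuple]) -> dict:
    """gold/pred: one (upos, head, deprel) tuple per token, in the same order."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and aligned token by token")
    n = len(gold)
    upos = sum(g[0] == p[0] for g, p in zip(gold, pred))
    uas = sum(g[1] == p[1] for g, p in zip(gold, pred))
    las = sum(g[1] == p[1] and g[2] == p[2] for g, p in zip(gold, pred))
    return {"UPOS": upos / n, "UAS": uas / n, "LAS": las / n}


# Toy example: token 1 has a wrong label, token 3 has a wrong head.
gold = [("PROPN", 6, "nmod"), ("ADP", 1, "case"), ("NOUN", 6, "obj"), ("ADP", 3, "case")]
pred = [("PROPN", 6, "obl"), ("ADP", 1, "case"), ("NOUN", 4, "obj"), ("ADP", 3, "case")]
metrics = score(gold, pred)
print(metrics)  # {'UPOS': 1.0, 'UAS': 0.75, 'LAS': 0.5}
```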
+### Framework versions
+- TRL v0.15.2 (for training)
+- vLLM 0.7.2 (for inference)
+## Citation
+```bibtex
+@article{matsuda-nl263,
+  title={大規模言語モデルによる対話型依存構造解析},
+  author={松田寛},
+  journal={研究報告自然言語処理 (NL)},
+  volume={2025},
+  number={17},
+  pages={1--7},
+  year={2025},
+  publisher={情報処理学会}
+}
+```