---
license: gemma
datasets:
- jslin09/LegalElements
language:
- zh
base_model:
- google/gemma-2-2b
library_name: transformers
widget:
- text: "Is this review positive or negative? Review: Best cast iron skillet you will ever buy."
  example_title: "Sentiment analysis"
---
# Model Card for Gemma2-2b-ner

<!-- Provide a quick summary of what the model is/does. -->

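This model is fine-tuned from [Gemma2:2b](https://huggingface.co/google/gemma-2-2b). Following the three-stage theory of crime (刑法三階理論) commonly used in Taiwanese criminal law, it annotates the "criminal facts" paragraphs of fraud cases generated by large language models with the constituent elements defined in the fraud statute. Models that can generate such fraud "criminal facts" include the BLOOM 560M-based [BLOOM 560M Fraud](https://huggingface.co/jslin09/bloom-560m-finetuned-fraud) fine-tuned model and the Gemma2-based [Gemma2:2b Fraud](https://huggingface.co/jslin09/gemma2-2b-fraud) fine-tuned model. To see how the model performs in practice, try the [demo platform](https://huggingface.co/spaces/jslin09/LE-NER).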

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->
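The model currently identifies the constituent elements in fraud criminal facts with an average precision of 0.98 and a recall of 0.75. Training started from a corpus of 979 records and followed a reinforcement-learning workflow: the annotations generated by the model were corrected through manual alignment and then added back to the corpus for further training. The final training corpus contains 2,577 records, and the model was fine-tuned on it for 3 epochs. The table below shows how precision and recall changed across versions.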

|Version|Records|Precision|Recall|
|---|---|---|---|
|v1|979|0.272727273|0.218623482|
|v2|1538|0.725888325|0.581300813|
|v3|1886|0.717277487|0.465986395|
|v4|2173|0.826086957|0.550724638|
|v5|2577|0.983606557|0.75|

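The table reports precision and recall separately. As a small worked example, the two can be combined into an F1 score (their harmonic mean). F1 is not reported in this card, so the values computed below are derived from the table rather than measured by the author; the names `reported` and `f1_score` are illustrative only.

```python
# Precision and recall per version, copied from the table in this card.
reported = {
    "v1": (0.272727273, 0.218623482),
    "v2": (0.725888325, 0.581300813),
    "v3": (0.717277487, 0.465986395),
    "v4": (0.826086957, 0.550724638),
    "v5": (0.983606557, 0.75),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Derived values, not figures reported by the model author.
f1_by_version = {v: f1_score(p, r) for v, (p, r) in reported.items()}

for version, f1 in f1_by_version.items():
    print(f"{version}: F1 = {f1:.3f}")
```

For v5 this works out to roughly 0.85, which shows that recall is still the weaker of the two metrics.
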
- **Developed by:** [Chun-Hsien Lin](https://huggingface.co/jslin09)
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** Traditional Chinese
- **License:** gemma
- **Finetuned from model [optional]:** [Gemma2-2b](https://huggingface.co/google/gemma-2-2b)

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [LE-NER](https://huggingface.co/spaces/jslin09/LE-NER)

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
The model currently recognizes the following seven named-entity labels. For constituent elements that it cannot recognize, it returns None.

<pre>
  <code>
from colorama import Fore, Back, Style

elements = {'LEO_SOC': ('犯罪主體', 'Subject of Crime'),
            'LEO_VIC': ('客體', 'Victim'),
            'LEO_ACT': ('不法行為', 'Behavior'),
            'LEO_SLE': ('主觀要件', 'Subjective Legal Element of the Offense'),
            'LEO_CAU': ('因果關係', 'Causation'),
            'LEO_ROH': ('危害結果', 'Result of Hazard'),
            'LEO_ATP': ('未遂', 'Attempted')
           }
tag_color = {'LEO_SOC': Fore.BLACK + Back.RED,
             'LEO_VIC': Fore.BLACK + Back.YELLOW,
             'LEO_ACT': Fore.BLACK + Back.GREEN,
             'LEO_SLE': Fore.BLACK + Back.MAGENTA,
             'LEO_CAU': Fore.BLACK + Back.CYAN,
             'LEO_ROH': Fore.BLACK + Back.BLUE,
             'LEO_ATP': Fore.WHITE + Back.BLACK,
            }
  </code>
</pre>

To make the annotations easier to see, pass the annotated output generated by this model, together with a label name, to the function below. It colors the tagged text using colorama.

<pre>
  <code>
import re
from colorama import Fore, Back, Style

def tag_in_color(response_content, tag):
    '''
    Description:
        Color the annotated text that belongs to the given label.
        Relies on the tag_color dictionary defined above.
    Parameters:
        response_content (str): Annotated content that already contains label tags.
        tag (str): Label name, in English, without brackets.
    Return:
        result (str): The text with the tags removed and colorama color codes inserted.
    '''
    # The annotated text follows the "標註結果:\n" marker in the model output.
    response_body = response_content.split("標註結果:\n")[1]
    start_index = 0
    # Use regular expressions to find where every open and close tag starts.
    # re.escape() keeps the brackets in the tag from being read as regex syntax.
    findall_open_tags = [m.start() for m in re.finditer(re.escape(f"[{tag}]"), response_body)]
    findall_close_tags = [m.start() for m in re.finditer(re.escape(f"[/{tag}]"), response_body)]
    try:
        parts = [response_body[start_index:findall_open_tags[0]]] # text before the first tag
    except IndexError:
        parts = []
    # Walk through each tag position, take the tagged text, and color it.
    for j, idx in enumerate(findall_open_tags):
        tag_text = response_body[idx + len(tag) + 2:findall_close_tags[j]]
        parts.append(f"{tag_color[tag]}" + tag_text + Style.RESET_ALL) # color the text inside the tag
        closed_tag = findall_close_tags[j] + len(tag) + 3
        try:
            next_open_tag = findall_open_tags[j+1]
            parts.append(response_body[closed_tag: next_open_tag]) # text between this close tag and the next open tag
        except IndexError:
            parts.append(response_body[findall_close_tags[-1] + len(tag) + 3 :]) # text after the last tag
    result = ''.join(parts)
    if result == '':
        color_result = f"{tag_color[tag]}{tag}" + Fore.RESET + Back.RESET + " " + Fore.YELLOW + Back.RED + "*** No annotation result ***" + Fore.RESET + Back.RESET
    else:
        color_result = Fore.RED + Back.YELLOW + "Colored annotation result:\n" + Fore.RESET + Back.RESET + result
    return color_result
  </code>
</pre>
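
The `tag_in_color` function above implies that the model wraps each element as `[LEO_XXX]text[/LEO_XXX]` after a `標註結果:\n` marker. Under that assumption, the following minimal sketch collects the tagged spans as (label, text) pairs instead of coloring them. The helper name `extract_elements` and the sample string are hypothetical and not part of this model's published code; the sketch also assumes that tags are well formed and not nested.

```python
import re
from typing import List, Tuple

# Label codes taken from the `elements` dictionary in this card.
LABELS = ("LEO_SOC", "LEO_VIC", "LEO_ACT", "LEO_SLE", "LEO_CAU", "LEO_ROH", "LEO_ATP")

# Assumed marker that precedes the annotated text in the model output.
RESULT_MARKER = "標註結果:\n"

# Matches [LABEL]text[/LABEL]; the backreference \1 makes the close tag match the open tag.
TAG_PATTERN = re.compile(r"\[(" + "|".join(LABELS) + r")\](.*?)\[/\1\]", re.DOTALL)


def extract_elements(response_content: str) -> List[Tuple[str, str]]:
    """Return (label, text) pairs in the order they appear in the output."""
    _, marker, body = response_content.partition(RESULT_MARKER)
    if not marker:
        # No marker found: fall back to scanning the whole string.
        body = response_content
    return [(m.group(1), m.group(2)) for m in TAG_PATTERN.finditer(body)]


# Made-up sample for illustration only, not real model output.
sample = (
    "標註結果:\n"
    "[LEO_SOC]王小明[/LEO_SOC]意圖為自己不法之所有，"
    "[LEO_ACT]向被害人佯稱可代購手機[/LEO_ACT]。"
)

for label, text in extract_elements(sample):
    print(label, text)
```

Returning plain pairs keeps the parsing step separate from presentation, so the same result can feed either a colored display or an evaluation script.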

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

[More Information Needed]

### Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
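The model can currently only annotate the constituent elements in "criminal facts" that are drafted (or generated by a language model) under the fraud offense (詐欺罪) of the Criminal Code of the Republic of China. Annotating the constituent elements of other offenses is a direction for future development and for expanding the corpus.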

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
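This model was produced by fine-tuning Gemma2:2b with a reinforcement-learning approach, iterating over multiple rounds in which generated data was manually aligned and fed back into training. The training data is the [Legal Elements dataset](https://huggingface.co/datasets/jslin09/LegalElements). Users can download it and continue iterating to correct and extend its contents.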

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing [optional]

[More Information Needed]


#### Training Hyperparameters

- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

[More Information Needed]

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

### Results

[More Information Needed]

#### Summary



## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->

[More Information Needed]

## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]