---
tags:
- autotrain
- text-generation
- transformers
- named entity recognition
widget:
- text: 'I love AutoTrain because '
license: mit
datasets:
- conll2012_ontonotesv5
language:
- en
---

# Phi-2 model fine-tuned for named entity recognition task

The model was fine-tuned on one quarter of the CoNLL-2012 OntoNotes v5 dataset.

- Dataset Source: [conll2012_ontonotesv5](https://huggingface.co/datasets/conll2012_ontonotesv5)
- Subset Used: English_v12
- Number of Examples: 87,265

The prompts and expected outputs were constructed as described in [1]. Each prompt names one entity category, gives a few solved examples, and ends with the sentence to label. In the output, every entity of that category is wrapped in `@@` and `##`.

Example input:

```md
Instruct: I am an excelent linquist. The task is to label organization entities in the given sentence. Below are some examples

Input: A spokesman for B. A. T said of the amended filings that,`` It would appear that nothing substantive has changed.
Output: A spokesman for @@B. A. T## said of the amended filings that,`` It would appear that nothing substantive has changed.

Input: Since NBC's interest in the Qintex bid for MGM / UA was disclosed, Mr. Wright has n't been available for comment.
Output: Since @@NBC##'s interest in the @@Qintex## bid for @@MGM / UA## was disclosed, Mr. Wright has n't been available for comment.

Input: You know news organizations demand total transparency whether you're General Motors or United States government /.
Output: You know news organizations demand total transparency whether you're @@General Motors## or United States government /.

Input: We respectfully invite you to watch a special edition of Across China.
Output:
```

Expected output:

```md
We respectfully invite you to watch a special edition of @@Across China##.
```
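
Since the model returns the sentence with entities wrapped in `@@` and `##`, a small post-processing step is usually needed to recover the entity strings. The following is a minimal sketch, not part of the released model: the helper names `extract_entities`, `strip_markers`, and `entity_spans` are hypothetical, and the code assumes that markers are well formed and never nested.

```python
import re
from typing import List, Tuple

# Assumption: entities appear as @@entity## and markers are never nested.
ENTITY_PATTERN = re.compile(r"@@(.+?)##")


def extract_entities(marked: str) -> List[str]:
    """Return the entity strings found between @@ and ##, in order."""
    return [match.group(1) for match in ENTITY_PATTERN.finditer(marked)]


def strip_markers(marked: str) -> str:
    """Return the sentence with the @@ and ## markers removed."""
    return ENTITY_PATTERN.sub(r"\1", marked)


def entity_spans(marked: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) offsets relative to the unmarked sentence."""
    spans = []
    removed = 0  # marker characters seen before the current match
    for match in ENTITY_PATTERN.finditer(marked):
        start = match.start() - removed
        text = match.group(1)
        spans.append((start, start + len(text), text))
        removed += 4  # two characters for "@@" plus two for "##"
    return spans


marked = "We respectfully invite you to watch a special edition of @@Across China##."
print(extract_entities(marked))
print(strip_markers(marked))
```

The character offsets from `entity_spans` index into the string returned by `strip_markers`, which can be convenient when the predictions have to be aligned with token-level labels.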

This model is trained to recognize the following named entity categories:

- person
- nationalities or religious or political groups
- facility
- organization
- geopolitical entity
- location
- product
- date
- time expression
- percentage
- monetary value
- quantity
- event
- work of art
- law/legal reference
- language name
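
Because each prompt targets a single category, labeling a sentence for several categories means building one prompt per category. The sketch below assembles a prompt in the layout of the example input shown in this card. The function name `build_prompt` is hypothetical, and the exact whitespace used during fine-tuning is an assumption; the instruction text is copied verbatim from the example, including its spelling.

```python
from typing import List, Tuple


def build_prompt(category: str, examples: List[Tuple[str, str]], sentence: str) -> str:
    """Assemble a few-shot prompt for one entity category.

    `examples` holds (input sentence, marked output) pairs used as demonstrations.
    """
    # Instruction copied verbatim from the card's example input.
    header = (
        "Instruct: I am an excelent linquist. "
        f"The task is to label {category} entities in the given sentence. "
        "Below are some examples"
    )
    shots = [f"Input: {source}\nOutput: {target}" for source, target in examples]
    # The final Output: is left empty for the model to complete.
    query = f"Input: {sentence}\nOutput:"
    return "\n\n".join([header, *shots, query])


demo = [
    (
        "You know news organizations demand total transparency whether you're General Motors or United States government /.",
        "You know news organizations demand total transparency whether you're @@General Motors## or United States government /.",
    )
]
print(build_prompt("organization", demo, "We respectfully invite you to watch a special edition of Across China."))
```

The `category` argument would be one of the names in the list above, for example `"person"` or `"geopolitical entity"`.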

# Model Trained Using AutoTrain

This model was trained using the AutoTrain **SFT** trainer. For more information, please visit [AutoTrain](https://hf.co/docs/autotrain).

Hyperparameters:

```json
{
  "model": "microsoft/phi-2",
  "valid_split": null,
  "add_eos_token": false,
  "block_size": 1024,
  "model_max_length": 1024,
  "padding": "right",
  "trainer": "sft",
  "use_flash_attention_2": false,
  "disable_gradient_checkpointing": false,
  "evaluation_strategy": "epoch",
  "save_total_limit": 1,
  "save_strategy": "epoch",
  "auto_find_batch_size": false,
  "mixed_precision": "bf16",
  "lr": 0.0002,
  "epochs": 1,
  "batch_size": 1,
  "warmup_ratio": 0.1,
  "gradient_accumulation": 4,
  "optimizer": "adamw_torch",
  "scheduler": "linear",
  "weight_decay": 0.01,
  "max_grad_norm": 1.0,
  "seed": 42,
  "apply_chat_template": false,
  "quantization": "int4",
  "target_modules": null,
  "merge_adapter": false,
  "peft": true,
  "lora_r": 16,
  "lora_alpha": 32,
  "lora_dropout": 0.05,
  "dpo_beta": 0.1
}
```
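
Two derived quantities may help when reading these settings: the effective batch size per optimizer step and the LoRA scaling factor. The arithmetic below is only a sketch. It assumes training on a single device (the card does not state the device count) and the standard LoRA scaling of `lora_alpha / lora_r`.

```python
# Values copied from the hyperparameters listed in this card.
config = {
    "batch_size": 1,
    "gradient_accumulation": 4,
    "lora_r": 16,
    "lora_alpha": 32,
}

# Assumption: one device, so no extra multiplication by a device count.
effective_batch = config["batch_size"] * config["gradient_accumulation"]

# Standard LoRA multiplies the low-rank update by alpha / r.
lora_scale = config["lora_alpha"] / config["lora_r"]

print(f"effective batch size: {effective_batch}")
print(f"LoRA scaling factor: {lora_scale}")
```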

# Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "pahautelman/phi2-ner-v1"

# Load the tokenizer and the fine-tuned model, then switch to inference mode.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path
).eval()

prompt = 'Label the person entities in the given sentence: Russian President Vladimir Putin is due to arrive in Havana a few hours from now to become the first post-Soviet leader to visit Cuba.'

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt')
# Greedy decoding; max_new_tokens caps the length of the generated answer.
outputs = model.generate(
    inputs.to(model.device),
    max_new_tokens=9,
    do_sample=False,
)
# The decoded string contains the prompt followed by the generated tokens.
output = tokenizer.batch_decode(outputs)[0]

# Model response: "Output: Russian President, Vladimir Putin"
print(output)
```

# References

[1] Wang et al., GPT-NER: Named Entity Recognition via Large Language Models, 2023.