---
library_name: transformers
license: apache-2.0
language:
  - ko
  - en
  - ja
base_model:
  - Qwen/Qwen3-0.6B
---

# Dynamic_NER

## Model Detail

### Goal

- Perform dynamic NER: given a sentence and a runtime schema of entity types, extract all matching entities.
- Support multilingual input (English, Korean, Japanese, etc.).

### Limitations

- The model tends to extract only one entity per type and may miss additional mentions of the same type.
- Overlapping or nested entities (e.g., "New York" vs. "York") may be handled inconsistently, since no explicit overlap policy is defined.
- Because the model is generative, words from the input may be modified or paraphrased in the output.
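Given these limitations, downstream code may want to validate the model's output rather than trust it verbatim. The sketch below (the `filter_entities` helper is illustrative, not part of this repository) drops entities whose type was not requested or whose text does not occur verbatim in the input:

```python
def filter_entities(entities, sentence, schema):
    """Keep only entities whose type is in the requested schema and whose
    text appears verbatim in the sentence (guards against paraphrased
    spans and invented types)."""
    allowed = {spec["type"] for spec in schema}
    return [
        ent for ent in entities
        if ent.get("type") in allowed and ent.get("text", "") in sentence
    ]

schema = [{"type": "PERSON", "description": "Names of individuals"}]
raw = [
    {"text": "Tim", "type": "PERSON"},      # valid
    {"text": "Timothy", "type": "PERSON"},  # paraphrased, not in the input
    {"text": "park", "type": "LOCATION"},   # type was not requested
]
print(filter_entities(raw, "Tim went to the park.", schema))
# -> [{'text': 'Tim', 'type': 'PERSON'}]
```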

### Example (en)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model (repo id taken from this model card).
model_name = "jaeyong2/Dynamic_NER"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

system = """
You are an AI that dynamically performs Named Entity Recognition (NER).
You receive a sentence and a list of entity types the user wants to extract, and then identify all entities of those types within the sentence.
If you cannot find any suitable entities within the sentence, return an empty list.
"""

text = """
Once upon a time, a little boy named Tim went to the park with his mom. They saw a big fountain with water going up and down. Tim was very happy to see it.
Tim asked his mom, "Can I go near the fountain?" His mom answered, "Yes, but hold my hand tight." Tim held his mom's hand very tight and they walked closer to the fountain. They saw fish in the water and Tim laughed.
A little girl named Sue came to the fountain too. She asked Tim, "Do you like the fish?" Tim said, "Yes, I like them a lot!" Sue and Tim became friends and played near the fountain until it was time to go home.
""".strip()

named_entity = """
[{'type': 'PERSON', 'description': 'Names of individuals'}, {'type': 'LOCATION', 'description': 'Specific places or structures'}, {'type': 'ANIMAL', 'description': 'Names or types of animals'}]
""".strip()

user = f"<sentence>\n{text}\n</sentence>\n\n<entity_list>\n{named_entity}\n</entity_list>\n\n"
chat = [{"role": "system", "content": system}, {"role": "user", "content": user}]
chat_text = tokenizer.apply_chat_template(
    chat,
    enable_thinking=False,
    add_generation_prompt=True,
    tokenize=False,
)

model_inputs = tokenizer([chat_text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)

# Strip the prompt tokens so only the completion is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

### Result (en)

```
<entities>
[{'text': 'Tim', 'type': 'PERSON'}, {'text': 'mom', 'type': 'PERSON'}, {'text': 'Sue', 'type': 'PERSON'}, {'text': 'park', 'type': 'LOCATION'}, {'text': 'fountain', 'type': 'LOCATION'}, {'text': 'fish', 'type': 'ANIMAL'}]
</entities>
```
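The completion wraps a Python-style list in `<entities>` tags. A small parser like the sketch below (`parse_entities` is an illustrative helper, not part of this repository) turns it into structured data, returning an empty list when the output does not match the expected format:

```python
import ast
import re

def parse_entities(response):
    """Extract the list between <entities> tags and parse the
    Python-literal list the model emits; return [] on any mismatch."""
    match = re.search(r"<entities>\s*(\[.*?\])\s*</entities>", response, re.DOTALL)
    if not match:
        return []
    try:
        return ast.literal_eval(match.group(1))
    except (ValueError, SyntaxError):
        return []

sample = "<entities>\n[{'text': 'Tim', 'type': 'PERSON'}]\n</entities>"
print(parse_entities(sample))  # -> [{'text': 'Tim', 'type': 'PERSON'}]
```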

### Example (ko)

```python
system = """
You are an AI that dynamically performs Named Entity Recognition (NER).
You receive a sentence and a list of entity types the user wants to extract, and then identify all entities of those types within the sentence.
If you cannot find any suitable entities within the sentence, return an empty list.
"""

text = """
수진이는 지난주 토요일에 스타필드 하남에 갔어요.
그들은 애플 스토어에서 새로 나온 아이폰 16을 구경하고, 카페 노티드에서 도넛을 먹었어요.
그날 저녁엔 방탄소년단 콘서트 실황 영화를 봤어요. 정말 신났죠!
""".strip()

named_entity = """
[
  {"type": "PERSON", "description": "사람 이름"},
  {"type": "LOCATION", "description": "지명 또는 장소"},
  {"type": "ORGANIZATION", "description": "조직, 회사, 단체"},
  {"type": "PRODUCT", "description": "제품명"},
  {"type": "WORK_OF_ART", "description": "예술 작품, 영화, 책, 노래 등"},
  {"type": "DATE", "description": "날짜, 요일, 시점"}
]
""".strip()

user = f"<sentence>\n{text}\n</sentence>\n\n<entity_list>\n{named_entity}\n</entity_list>\n\n"
chat = [{"role": "system", "content": system}, {"role": "user", "content": user}]
chat_text = tokenizer.apply_chat_template(
    chat,
    enable_thinking=False,
    add_generation_prompt=True,
    tokenize=False,
)

model_inputs = tokenizer([chat_text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)

generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

### Result (ko)

```
<entities>
[{'text': '수진이', 'type': 'PERSON'}, {'text': '스타필드 하남', 'type': 'LOCATION'}, {'text': '아이폰 16', 'type': 'PRODUCT'}, {'text': '방탄소년단', 'type': 'ORGANIZATION'}, {'text': '콘서트 실황 영화', 'type': 'WORK_OF_ART'}, {'text': '토요일', 'type': 'DATE'}, {'text': '카페 노티드', 'type': 'LOCATION'}]
</entities>
```
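This result contains two LOCATION entities, so multiple spans per type do occur. When downstream code needs the results keyed by type, a small grouping helper works (a sketch; `group_by_type` is not part of this repository):

```python
from collections import defaultdict

def group_by_type(entities):
    """Group extracted entity texts under their entity type."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["type"]].append(ent["text"])
    return dict(grouped)

result = [
    {"text": "수진이", "type": "PERSON"},
    {"text": "스타필드 하남", "type": "LOCATION"},
    {"text": "카페 노티드", "type": "LOCATION"},
]
print(group_by_type(result))
# -> {'PERSON': ['수진이'], 'LOCATION': ['스타필드 하남', '카페 노티드']}
```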

### Example (ja)

```python
system = """
You are an AI that dynamically performs Named Entity Recognition (NER).
You receive a sentence and a list of entity types the user wants to extract, and then identify all entities of those types within the sentence.
If you cannot find any suitable entities within the sentence, return an empty list.
"""

text = """
リナは4月の終わりに東京ディズニーランドへ行きました。
彼女はスパイファミリーのショーを見て、スターバックスで抹茶ラテを飲みました。
夜には「千と千尋の神隠し」の特別上映会にも参加しました。
""".strip()

named_entity = """
[
  {"type": "PERSON", "description": "個人名"},
  {"type": "LOCATION", "description": "地名や施設名"},
  {"type": "ORGANIZATION", "description": "会社や団体名"},
  {"type": "WORK_OF_ART", "description": "映画、音楽、アニメ、書籍など"},
  {"type": "PRODUCT", "description": "商品やブランド名"},
  {"type": "DATE", "description": "日付や時期"}
]
""".strip()

user = f"<sentence>\n{text}\n</sentence>\n\n<entity_list>\n{named_entity}\n</entity_list>\n\n"
chat = [{"role": "system", "content": system}, {"role": "user", "content": user}]
chat_text = tokenizer.apply_chat_template(
    chat,
    enable_thinking=False,
    add_generation_prompt=True,
    tokenize=False,
)

model_inputs = tokenizer([chat_text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)

generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

### Result (ja)

```
<entities>
[{'text': 'リナ', 'type': 'PERSON'}, {'text': '東京', 'type': 'LOCATION'}, {'text': 'スパイファミリー', 'type': 'ORGANIZATION'}, {'text': 'スターバックス', 'type': 'ORGANIZATION'}, {'text': '千と千尋の神隠し', 'type': 'WORK_OF_ART'}, {'text': '厚茶ラテ', 'type': 'PRODUCT'}, {'text': '4月', 'type': 'DATE'}]
</entities>
```

Note that the input span 抹茶ラテ comes back as 厚茶ラテ, an instance of the paraphrasing limitation described above.

## License

This model is released under the Apache 2.0 license.

## Acknowledgement

This research was supported by Google's TPU Research Cloud (TRC) program.