---
library_name: transformers
license: apache-2.0
language:
- ko
- en
- ja
base_model:
- Qwen/Qwen3-0.6B
---
## Model Detail
### Goal
- Perform dynamic NER: given a sentence and a runtime schema of entity types, extract all matching entities.
- Support multilingual input (English, Korean, Japanese, etc.).
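
The runtime schema can be turned into a user message with a small helper. The sketch below is illustrative only: `build_user_prompt` is a hypothetical name, and serializing the schema as JSON is an assumption (the examples in this card pass the schema as a hand-written string, in both Python-style and JSON-style quoting).

```python
import json


def build_user_prompt(text: str, entity_types: list[dict]) -> str:
    """Wrap a sentence and an entity schema in the tags used by this card's examples."""
    # ensure_ascii=False keeps Korean/Japanese descriptions readable in the prompt.
    schema = json.dumps(entity_types, ensure_ascii=False)
    return f"<sentence>\n{text}\n</sentence>\n\n<entity_list>\n{schema}\n</entity_list>\n\n"


schema = [
    {"type": "PERSON", "description": "Names of individuals"},
    {"type": "LOCATION", "description": "Specific places or structures"},
]
user = build_user_prompt("Tim went to the park with his mom.", schema)
print(user)
```

Because the schema is ordinary data, entity types can be added or removed per request without retraining.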

### Limitations
- The model tends to extract only one entity per type, so it may miss additional mentions of the same type.
- Overlapping or nested entities (e.g., “New York” vs. “York”) may be resolved inconsistently, because no explicit overlap policy is defined.
- Because the model is generative, it may modify or paraphrase the original input words in its output.
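
Some of these limitations can be softened in post-processing. The sketch below is one possible approach, not part of the model: it searches the input for every occurrence of each extracted string, then resolves overlaps with a longest-match policy. The function names and the longest-match choice are assumptions made for illustration.

```python
import re


def find_mentions(text: str, entities: list[dict]) -> list[dict]:
    """Locate every occurrence of each extracted entity string in the input text."""
    mentions, seen = [], set()
    for entity in entities:
        if not entity["text"]:
            continue  # an empty pattern would match at every position
        for m in re.finditer(re.escape(entity["text"]), text):
            key = (m.start(), m.end(), entity["type"])
            if key not in seen:
                seen.add(key)
                mentions.append({"text": entity["text"], "type": entity["type"],
                                 "start": m.start(), "end": m.end()})
    return mentions


def keep_longest(mentions: list[dict]) -> list[dict]:
    """Longest-match policy: drop any span that overlaps a longer span already kept."""
    kept = []
    for cand in sorted(mentions, key=lambda m: (-(m["end"] - m["start"]), m["start"])):
        if all(cand["end"] <= k["start"] or cand["start"] >= k["end"] for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda m: m["start"])


text = "Tim flew to New York. Tim loved New York."
entities = [
    {"text": "Tim", "type": "PERSON"},
    {"text": "New York", "type": "LOCATION"},
    {"text": "York", "type": "LOCATION"},
]
spans = keep_longest(find_mentions(text, entities))
for span in spans:
    print(span)
```

Entities that the model paraphrased will not be found by the string search, so they drop out of `spans`.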

### example (en)
```
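from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with this model's Hub repository ID (not stated in this card).
model_id = "<this-model-repo-id>"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

system = """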
You are an AI that dynamically performs Named Entity Recognition (NER).
You receive a sentence and a list of entity types the user wants to extract, and then identify all entities of those types within the sentence.
If you cannot find any suitable entities within the sentence, return an empty list.
"""

text = """
Once upon a time, a little boy named Tim went to the park with his mom. They saw a big fountain with water going up and down. Tim was very happy to see it.
Tim asked his mom, "Can I go near the fountain?" His mom answered, "Yes, but hold my hand tight." Tim held his mom's hand very tight and they walked closer to the fountain. They saw fish in the water and Tim laughed.
A little girl named Sue came to the fountain too. She asked Tim, "Do you like the fish?" Tim said, "Yes, I like them a lot!" Sue and Tim became friends and played near the fountain until it was time to go home.
""".strip()

named_entity = """
[{'type': 'PERSON', 'description': 'Names of individuals'}, {'type': 'LOCATION', 'description': 'Specific places or structures'}, {'type': 'ANIMAL', 'description': 'Names or types of animals'}]
""".strip()


# Wrap the sentence and the entity schema in <sentence> and <entity_list> tags.
user = f"<sentence>\n{text}\n</sentence>\n\n<entity_list>\n{named_entity}\n</entity_list>\n\n"
chat = [{"role": "system", "content": system}, {"role": "user", "content": user}]
chat_text = tokenizer.apply_chat_template(
    chat,
    enable_thinking=False,  # turn off Qwen3 thinking mode
    add_generation_prompt=True,
    tokenize=False,
)

model_inputs = tokenizer([chat_text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)

# Drop the prompt tokens so that only newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

### result (en)
```
<entities>
[{'text': 'Tim', 'type': 'PERSON'}, {'text': 'mom', 'type': 'PERSON'}, {'text': 'Sue', 'type': 'PERSON'}, {'text': 'park', 'type': 'LOCATION'}, {'text': 'fountain', 'type': 'LOCATION'}, {'text': 'fish', 'type': 'ANIMAL'}]
</entities>
```
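
The response is text, so it has to be parsed before use. The list in the result above uses single-quoted strings, which is Python literal syntax rather than strict JSON, so this sketch uses `ast.literal_eval` instead of `json.loads`. `parse_entities` is a hypothetical helper name, and returning an empty list on malformed output is an assumption about the desired fallback.

```python
import ast
import re


def parse_entities(response: str) -> list[dict]:
    """Return the list inside <entities>...</entities>, or [] if it is missing or malformed."""
    match = re.search(r"<entities>\s*(.*?)\s*</entities>", response, re.DOTALL)
    if match is None:
        return []
    try:
        parsed = ast.literal_eval(match.group(1))
    except (ValueError, SyntaxError, TypeError):
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only well-formed items with both expected keys.
    return [e for e in parsed if isinstance(e, dict) and "text" in e and "type" in e]


response = """<entities>
[{'text': 'Tim', 'type': 'PERSON'}, {'text': 'park', 'type': 'LOCATION'}]
</entities>"""
entities = parse_entities(response)
print(entities)
```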
----------
### example (ko)
```
system = """
You are an AI that dynamically performs Named Entity Recognition (NER).
You receive a sentence and a list of entity types the user wants to extract, and then identify all entities of those types within the sentence.
If you cannot find any suitable entities within the sentence, return an empty list.
"""

text = """
수진이는 지난주 토요일에 스타필드 하남에 갔어요.  
그들은 애플 스토어에서 새로 나온 아이폰 16을 구경하고, 카페 노티드에서 도넛을 먹었어요.  
그날 저녁엔 방탄소년단 콘서트 실황 영화를 봤어요. 정말 신났죠!
""".strip()

named_entity = """
[
  {"type": "PERSON", "description": "사람 이름"},
  {"type": "LOCATION", "description": "지명 또는 장소"},
  {"type": "ORGANIZATION", "description": "조직, 회사, 단체"},
  {"type": "PRODUCT", "description": "제품명"},
  {"type": "WORK_OF_ART", "description": "예술 작품, 영화, 책, 노래 등"},
  {"type": "DATE", "description": "날짜, 요일, 시점"}
]
""".strip()


# Assumes `tokenizer` and `model` are already loaded with transformers
# (AutoTokenizer / AutoModelForCausalLM).
user = f"<sentence>\n{text}\n</sentence>\n\n<entity_list>\n{named_entity}\n</entity_list>\n\n"
chat = [{"role": "system", "content": system}, {"role": "user", "content": user}]
chat_text = tokenizer.apply_chat_template(
    chat,
    enable_thinking=False,  # turn off Qwen3 thinking mode
    add_generation_prompt=True,
    tokenize=False,
)

model_inputs = tokenizer([chat_text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)

# Drop the prompt tokens so that only newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

### result (ko)
```
<entities>
[{'text': '수진이', 'type': 'PERSON'}, {'text': '스타필드 하남', 'type': 'LOCATION'}, {'text': '아이폰 16', 'type': 'PRODUCT'}, {'text': '방탄소년단', 'type': 'ORGANIZATION'}, {'text': '콘서트 실황 영화', 'type': 'WORK_OF_ART'}, {'text': '토요일', 'type': 'DATE'}, {'text': '카페 노티드', 'type': 'LOCATION'}]
</entities>
```
-------

### example (ja)
```
system = """
You are an AI that dynamically performs Named Entity Recognition (NER).
You receive a sentence and a list of entity types the user wants to extract, and then identify all entities of those types within the sentence.
If you cannot find any suitable entities within the sentence, return an empty list.
"""

text = """
リナは4月の終わりに東京ディズニーランドへ行きました。  
彼女はスパイファミリーのショーを見て、スターバックスで抹茶ラテを飲みました。  
夜には「千と千尋の神隠し」の特別上映会にも参加しました。
""".strip()

named_entity = """
[
  {"type": "PERSON", "description": "個人名"},
  {"type": "LOCATION", "description": "地名や施設名"},
  {"type": "ORGANIZATION", "description": "会社や団体名"},
  {"type": "WORK_OF_ART", "description": "映画、音楽、アニメ、書籍など"},
  {"type": "PRODUCT", "description": "商品やブランド名"},
  {"type": "DATE", "description": "日付や時期"}
]
""".strip()


# Assumes `tokenizer` and `model` are already loaded with transformers
# (AutoTokenizer / AutoModelForCausalLM).
user = f"<sentence>\n{text}\n</sentence>\n\n<entity_list>\n{named_entity}\n</entity_list>\n\n"
chat = [{"role": "system", "content": system}, {"role": "user", "content": user}]
chat_text = tokenizer.apply_chat_template(
    chat,
    enable_thinking=False,  # turn off Qwen3 thinking mode
    add_generation_prompt=True,
    tokenize=False,
)

model_inputs = tokenizer([chat_text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)

# Drop the prompt tokens so that only newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

### result (ja)
```
<entities>
[{'text': 'リナ', 'type': 'PERSON'}, {'text': '東京', 'type': 'LOCATION'}, {'text': 'スパイファミリー', 'type': 'ORGANIZATION'}, {'text': 'スターバックス', 'type': 'ORGANIZATION'}, {'text': '千と千尋の神隠し', 'type': 'WORK_OF_ART'}, {'text': '厚茶ラテ', 'type': 'PRODUCT'}, {'text': '4月', 'type': 'DATE'}]
</entities>
```
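
In the result above, the PRODUCT entity is not a verbatim substring of the input sentence: its first character differs from the input. A simple substring check can flag such cases. This is a minimal sketch with a hypothetical helper name, `split_verbatim`, and a shortened copy of the example data.

```python
def split_verbatim(text: str, entities: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate entities whose text occurs verbatim in the input from those that do not."""
    found, altered = [], []
    for entity in entities:
        (found if entity["text"] in text else altered).append(entity)
    return found, altered


text = "彼女はスパイファミリーのショーを見て、スターバックスで抹茶ラテを飲みました。"
entities = [
    {"text": "スターバックス", "type": "ORGANIZATION"},
    {"text": "厚茶ラテ", "type": "PRODUCT"},  # differs from the input's 抹茶ラテ
]
found, altered = split_verbatim(text, entities)
print("verbatim:", found)
print("altered:", altered)
```

Entities in `altered` could be dropped, or mapped back to the input with fuzzy matching, depending on how strict the downstream use is.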

## License
- Qwen/Qwen3-0.6B : https://choosealicense.com/licenses/apache-2.0/

## Acknowledgement
This research is supported by the **TPU Research Cloud program**.