Subword Token Splitting Issue in Named Entity Recognition Output
#7
by marmikp58 - opened
Hi team,
I am testing the ai4bharat/IndicNER model for a personal project and came across an issue while analyzing the model output. Here's an example to illustrate:
Input Text: मदर टेरेसा को जो इनाम में पैसे मिले, उन्होंने उन पैसों को भारत और विदोश दोनों में अपने काम में लगाया।
Model Output:
[
{'entity': 'B-PER', 'score': 0.9815, 'index': 1, 'word': 'म', 'start': 0, 'end': 1},
{'entity': 'I-PER', 'score': 0.5210, 'index': 2, 'word': '##दर', 'start': 1, 'end': 3},
{'entity': 'I-PER', 'score': 0.9766, 'index': 3, 'word': 'ट', 'start': 4, 'end': 5},
{'entity': 'I-PER', 'score': 0.9309, 'index': 4, 'word': '##रस', 'start': 6, 'end': 9},
{'entity': 'B-LOC', 'score': 0.9637, 'index': 21, 'word': 'भारत', 'start': 58, 'end': 62},
{'entity': 'B-LOC', 'score': 0.6438, 'index': 23, 'word': 'वि', 'start': 66, 'end': 68},
{'entity': 'B-LOC', 'score': 0.5717, 'index': 24, 'word': '##द', 'start': 68, 'end': 69}
]
It seems the model is splitting words like "मदर" into ['म', '##दर'] and "टेरेसा" into ['ट', '##रस'], which makes mapping the NER tags back to the original text challenging.
Could you please help me understand:
- Why is the model splitting the tokens this way?
- What is the best way to align these subword predictions back to the original word-level tokens and entities?
- Is there a recommended post-processing method to handle such cases for better word-level output?
Any insights or suggestions would be greatly appreciated. Thanks!
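In case it helps others hitting the same issue: this splitting is normal WordPiece behavior, where out-of-vocabulary words are broken into subword pieces marked with `##`. The `transformers` token-classification pipeline can regroup them for you via the `aggregation_strategy` argument (e.g. `"simple"`, `"first"`, `"average"`, `"max"`). If you need to do it by hand, below is a minimal post-processing sketch (my own code, not an official IndicNER utility) that merges the subword predictions from the output above into word- and entity-level spans using the `##` marker, the BIO tags, and the character offsets. Note that the model's offsets can clip trailing matras (here `##रस` ends at 9, dropping the final `ा` of टेरेसा), so you may want to snap span ends to word boundaries in the original text afterwards.

```python
# Minimal sketch: merge subword NER predictions into word-level entities.
def group_entities(text, predictions):
    # 1) Collapse '##' continuation pieces into the word they belong to,
    #    keeping the first piece's BIO tag and the minimum piece score.
    words = []
    for p in predictions:
        if p["word"].startswith("##") and words:
            words[-1]["end"] = p["end"]
            words[-1]["score"] = min(words[-1]["score"], p["score"])
        else:
            words.append({"tag": p["entity"], "score": p["score"],
                          "start": p["start"], "end": p["end"]})

    # 2) Merge words under the BIO scheme: an I-X word extends a running
    #    X entity; anything else starts a new entity.
    entities = []
    for w in words:
        prefix, _, label = w["tag"].partition("-")
        if prefix == "I" and entities and entities[-1]["label"] == label:
            entities[-1]["end"] = w["end"]
            entities[-1]["score"] = min(entities[-1]["score"], w["score"])
        else:
            entities.append({"label": label, "score": w["score"],
                             "start": w["start"], "end": w["end"]})

    # 3) Recover the surface text from the original string via offsets.
    for e in entities:
        e["text"] = text[e["start"]:e["end"]]
    return entities


# The example from the post above.
text = ("मदर टेरेसा को जो इनाम में पैसे मिले, उन्होंने उन पैसों को "
        "भारत और विदोश दोनों में अपने काम में लगाया।")
predictions = [
    {"entity": "B-PER", "score": 0.9815, "word": "म", "start": 0, "end": 1},
    {"entity": "I-PER", "score": 0.5210, "word": "##दर", "start": 1, "end": 3},
    {"entity": "I-PER", "score": 0.9766, "word": "ट", "start": 4, "end": 5},
    {"entity": "I-PER", "score": 0.9309, "word": "##रस", "start": 6, "end": 9},
    {"entity": "B-LOC", "score": 0.9637, "word": "भारत", "start": 58, "end": 62},
    {"entity": "B-LOC", "score": 0.6438, "word": "वि", "start": 66, "end": 68},
    {"entity": "B-LOC", "score": 0.5717, "word": "##द", "start": 68, "end": 69},
]

for e in group_entities(text, predictions):
    print(e["label"], e["start"], e["end"], e["text"], round(e["score"], 4))
```

On this example it yields three word-level entities: one PER span covering "मदर टेरेस…" and two LOC spans ("भारत" and the split pieces of the last word regrouped).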
I'm experiencing the same issue. I tried a simple aggregator, but the problem persists. If you have solved this, could you please share your method?