Subword Token Splitting Issue in Named Entity Recognition Output
#7
by marmikp58 - opened
Hi team,
I am testing the ai4bharat/IndicNER model for a personal project and came across an issue while analyzing the model output. Here's an example to illustrate:
Input Text: मदर टेरेसा को जो इनाम में पैसे मिले, उन्होंने उन पैसों को भारत और विदोश दोनों में अपने काम में लगाया।
Model Output:
[
{'entity': 'B-PER', 'score': 0.9815, 'index': 1, 'word': 'म', 'start': 0, 'end': 1},
{'entity': 'I-PER', 'score': 0.5210, 'index': 2, 'word': '##दर', 'start': 1, 'end': 3},
{'entity': 'I-PER', 'score': 0.9766, 'index': 3, 'word': 'ट', 'start': 4, 'end': 5},
{'entity': 'I-PER', 'score': 0.9309, 'index': 4, 'word': '##रस', 'start': 6, 'end': 9},
{'entity': 'B-LOC', 'score': 0.9637, 'index': 21, 'word': 'भारत', 'start': 58, 'end': 62},
{'entity': 'B-LOC', 'score': 0.6438, 'index': 23, 'word': 'वि', 'start': 66, 'end': 68},
{'entity': 'B-LOC', 'score': 0.5717, 'index': 24, 'word': '##द', 'start': 68, 'end': 69}
]
It seems the model is splitting words like "मदर" into ['म', '##दर'] and "टेरेसा" into ['ट', '##रस'], which makes mapping the NER tags back to the original text challenging.
Could you please help me understand:
- Why is the model splitting the tokens this way?
- What is the best way to align these subword predictions back to the original word-level tokens and entities?
- Is there a recommended post-processing method to handle such cases for better word-level output?
Any insights or suggestions would be greatly appreciated. Thanks!
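In case it helps others hitting the same issue: this splitting is normal WordPiece behavior, where out-of-vocabulary words are broken into subword pieces marked with `##`. The `transformers` token-classification pipeline can regroup them for you via the `aggregation_strategy` argument (e.g. `"simple"`, `"first"`, `"average"`, `"max"`). If you need to do it by hand, below is a minimal post-processing sketch (my own code, not an official IndicNER utility) that merges the subword predictions from the output above into word- and entity-level spans using the `##` marker, the BIO tags, and the character offsets. Note that the model's offsets can clip trailing matras (here `##रस` ends at 9, dropping the final `ा` of टेरेसा), so you may want to snap span ends to word boundaries in the original text afterwards.

```python
# Minimal sketch: merge subword NER predictions into word-level entities.
def group_entities(text, predictions):
    # 1) Collapse '##' continuation pieces into the word they belong to,
    #    keeping the first piece's BIO tag and the minimum piece score.
    words = []
    for p in predictions:
        if p["word"].startswith("##") and words:
            words[-1]["end"] = p["end"]
            words[-1]["score"] = min(words[-1]["score"], p["score"])
        else:
            words.append({"tag": p["entity"], "score": p["score"],
                          "start": p["start"], "end": p["end"]})

    # 2) Merge words under the BIO scheme: an I-X word extends a running
    #    X entity; anything else starts a new entity.
    entities = []
    for w in words:
        prefix, _, label = w["tag"].partition("-")
        if prefix == "I" and entities and entities[-1]["label"] == label:
            entities[-1]["end"] = w["end"]
            entities[-1]["score"] = min(entities[-1]["score"], w["score"])
        else:
            entities.append({"label": label, "score": w["score"],
                             "start": w["start"], "end": w["end"]})

    # 3) Recover the surface text from the original string via offsets.
    for e in entities:
        e["text"] = text[e["start"]:e["end"]]
    return entities


# The example from the post above.
text = ("मदर टेरेसा को जो इनाम में पैसे मिले, उन्होंने उन पैसों को "
        "भारत और विदोश दोनों में अपने काम में लगाया।")
predictions = [
    {"entity": "B-PER", "score": 0.9815, "word": "म", "start": 0, "end": 1},
    {"entity": "I-PER", "score": 0.5210, "word": "##दर", "start": 1, "end": 3},
    {"entity": "I-PER", "score": 0.9766, "word": "ट", "start": 4, "end": 5},
    {"entity": "I-PER", "score": 0.9309, "word": "##रस", "start": 6, "end": 9},
    {"entity": "B-LOC", "score": 0.9637, "word": "भारत", "start": 58, "end": 62},
    {"entity": "B-LOC", "score": 0.6438, "word": "वि", "start": 66, "end": 68},
    {"entity": "B-LOC", "score": 0.5717, "word": "##द", "start": 68, "end": 69},
]

for e in group_entities(text, predictions):
    print(e["label"], e["start"], e["end"], e["text"], round(e["score"], 4))
```

On this example it yields three word-level entities: one PER span covering "मदर टेरेस…" and two LOC spans ("भारत" and the split pieces of the last word regrouped).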
I'm experiencing the same issue. I tried a simple aggregator, but the problem persists. If you have solved this, could you please share your method?