Is there a maximum character limit for the trained models?
Hi all,
I'm using this model to recognize persons, organizations, and locations in text, and I've noticed that it fails to produce reliable results for lengthy texts.
I'm using a locally downloaded model, not the Hosted Inference API on Hugging Face.
For example, in the text below (copied from a financial news article), the obvious names in the last line ("Jim", "James", "Ben" and "Toby") are missed entirely by the model.
HSBC has swooped to buy the UK arm of collapsed US Silicon Valley Bank (SVB), bringing relief to UK tech firms who warned they could go bust without help.
Customers and businesses who had been unable to withdraw their money will now be able to access it as normal.
The government and the Bank of England led the talks and worked through the night to scramble together the deal, which involves no taxpayer money.
HSBC said it paid just £1 for SVB's UK arm.
Silicon Valley Bank - which specialised in lending to technology companies - was shut down by US regulators on Friday in what was the largest failure of a US bank since 2008.
Its collapse sent shockwaves across the tech industry over the possible impact it could have on businesses, with some firms telling the BBC they could go bust if deposits were not secured.
With fears over how firms would be able to access cash on Monday morning, frantic talks were held between Chancellor Jeremy Hunt, the prime minister, the Bank of England governor, HSBC bosses and civil servants to find a solution.
The Bank of England said no other UK banks had been "materially affected" by SVB's collapse and said the banking system remained "safe, sound, and well capitalised".
Although the UK arm of SVB was small with just over 3,000 business customers, its collapse would have presented a risk for a sector which the government views as pivotal to the UK's future economic success.
Mr Hunt said some of the firms only had bank accounts with SVB UK, "so for that reason we were faced with a situation where we could have seen some of our most important companies, our most strategic companies, wiped out and that would have been extremely dangerous".
However, he added there was "never a systemic risk to our financial stability in the UK".
Toby Mather, chief executive and co-founder of Lingumi, an education technology start-up, said 85% of its cash was tied up with the bank and he had had a very "anxious weekend".
"We had enough money in bank accounts outside the UK and enough revenue coming through each week from our customers that we could look our staff in the eyes at nine o'clock this morning and say we can make payroll in two weeks, but it would have been very uncertain from then", Mr Mather said.
Sebastian Weidt, chief executive of Universal Quantum, a tech company which employs about 40 people and held all its funds with SVB, said the deal was a "huge relief" after an "unbelievably stressful" few days.
Although its US parent was in financial trouble, Silicon Valley Bank UK was in reasonable financial health when it was bought for £1 by HSBC.
It had adequate capital and was making reasonable profits. Bank of England sources confirmed this weekend's intervention was more a preventive strike before the collapse of its US parent sparked mass withdrawals from the UK business.
What that means is that HSBC got one hell of a deal which it owed to its size and strength - with regulators confident that Europe's largest bank could easily take on any risk from SVB UK's customers.
It seems the only thing wrong with SVB UK was its name. While not a Lehman Brothers moment, what the collapse of SVB US has highlighted is that many banks are riskier than they look on paper as they have all sustained losses on their investments in government bonds as interest rates have soared - pushing their value down.
One reason why bank shares are lower again on Monday as that thought sinks in with jittery investors. Nice to meet you Jim. Nice to meet you James. Nice to meet you Ben. Hello Toby.
But if I reduce the text by half, or move the last paragraph to the top, then the names are recognized as "PER" as expected.
Is this a known issue, that for lengthy texts the model just gives up after a certain number of characters and fails to detect anything after that point?
Thanks
So, I can pretty much confirm that there is a maximum limit on the model's input. For long articles, after a certain number of words the model simply stops detecting entities.
I solved this by splitting my text into chunks of 300 words and running NER on each chunk separately. This gave much better results, and I no longer had the problem of the model giving up mid-text.
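For reference, here is a minimal, self-contained demonstration of that chunking regex on dummy text (no model involved), showing that each chunk is capped at 300 words. One quirk worth knowing: the pattern requires each word to be followed by a non-word character, so a final word with no trailing whitespace or punctuation is silently dropped; appending a space to the input avoids that.

```python
import re

# dummy text: 650 "words", well over one 300-word chunk
text = "word " * 650

# split into chunks of at most 300 words each
split_text = re.findall(r'\W*(?:\w+\W+){1,300}', text)

print(len(split_text))                       # number of chunks
print([len(c.split()) for c in split_text])  # words per chunk

# edge case: a trailing word with no following whitespace is dropped
print(re.findall(r'\W*(?:\w+\W+){1,300}', "alpha beta gamma"))
```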
Below is my code for the solution explained above for anyone coming across this in future:
import re
from transformers import pipeline

nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# slice the text into chunks of at most 300 words each
split_text = re.findall(r'\W*(?:\w+\W+){1,300}', text)

# run NER on each chunk separately
all_ner = []
for text_seg in split_text:
    ner_results = nlp(text_seg)
    all_ner.append(ner_results)

# all_ner is a list of lists, so flatten it
flat_all_ner = [item for sublist in all_ner for item in sublist]
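As a side note, the list-of-lists flattening at the end can equivalently be done with itertools.chain. A small sketch with dummy per-chunk results (the entity dicts here are hypothetical, just to illustrate the shape of the pipeline's output):

```python
from itertools import chain

# hypothetical per-chunk NER output: one list of entity dicts per chunk
all_ner = [
    [{"word": "HSBC", "entity_group": "ORG"}],
    [{"word": "Jim", "entity_group": "PER"}, {"word": "Toby", "entity_group": "PER"}],
]

# flatten the list of lists into one list of entities
flat_all_ner = list(chain.from_iterable(all_ner))
```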