Critical Hallucinations and Cross-Script Confusion in Multilingual Stress Test (CJK + Cyrillic)

#42

by The1Just - opened Feb 26

Feb 26

We conducted a stress test using a custom "Multilingual Complexity Document" designed to evaluate the OCR engine's robustness against edge cases. The document contains archaic characters, high-stroke-count CJK ideograms, and long Cyrillic compound words.
The current OCR model demonstrates significant failure modes, including semantic hallucinations, radical decomposition errors, and cross-lingual script contamination.

Observed Failures & Analysis:

Russian (Cyrillic) - Segmentation Failure & Hallucination
The engine struggles with long compound words, breaking them into nonsensical tokens.

Original: человеконенавистничества (misanthropy - single word).
OCR Output: человека конечнастичства
Error: The model hallucinated a space and invented a non-existent suffix, changing the meaning entirely.
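
To localize this kind of failure, the reference and the OCR output can be aligned character by character. The snippet below is a minimal sketch using Python's standard `difflib`; the helper name `edit_ops` is made up for illustration and is not part of the OCR engine or of our test harness.

```python
import difflib

def edit_ops(reference: str, hypothesis: str):
    """Return the non-matching edit operations between two strings."""
    matcher = difflib.SequenceMatcher(None, reference, hypothesis, autojunk=False)
    return [
        (tag, reference[i1:i2], hypothesis[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

reference = "человеконенавистничества"   # ground truth from the test document
hypothesis = "человека конечнастичства"  # OCR output reported above

# Print every replaced, inserted, or deleted span.
for tag, ref_part, hyp_part in edit_ops(reference, hypothesis):
    print(f"{tag:8} {ref_part!r} -> {hyp_part!r}")

# A space in the output that the reference lacks signals a spurious word break.
has_spurious_space = " " in hypothesis and " " not in reference
print("spurious word break:", has_spurious_space)
```

The exact spans reported depend on how `SequenceMatcher` aligns the two strings, so the output is best read as a hint about where the damage sits, not as a canonical diff.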

Chinese (Traditional/Complex) - Radical Decomposition Error
Instead of recognizing a complex character, the engine tries to describe its visual components or splits it into separate characters.

Original: 龘 (A rare character composed of three dragons).
OCR Output: 龍龍之龍 (Literal translation: "Dragon Dragon of Dragon").
Error: The engine failed to map the glyph to its Unicode code point and instead output a description of its components.

Original: 爨 (30 strokes).
OCR Output: 翼 (Wing).
Error: Visual hallucination. The characters are visually distinct; the model guessed based on density.
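
The two Chinese failures are different in kind: the first returns several characters for one glyph, the second returns one wrong glyph. A rough way to tell them apart in an automated check is sketched below. The function names are hypothetical, and the length test is only a heuristic that assumes the ground truth is a single character.

```python
def code_points(text: str) -> list[str]:
    """Format each character as a U+XXXX code point label."""
    return [f"U+{ord(ch):04X}" for ch in text]

def looks_decomposed(reference: str, hypothesis: str) -> bool:
    """True when a single reference glyph came back as several characters."""
    return len(reference) == 1 and len(hypothesis) > 1

# (ground truth, OCR output) pairs taken from the report above
cases = [("龘", "龍龍之龍"), ("爨", "翼")]

for reference, hypothesis in cases:
    kind = "decomposition" if looks_decomposed(reference, hypothesis) else "substitution"
    print(reference, code_points(reference), "->", hypothesis, code_points(hypothesis), kind)
```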

Korean (Hangul) - Cross-Lingual Contamination
The engine fails to recognize archaic Korean letters (Jamo) and substitutes them with characters from completely different languages (Japanese/Chinese).

Original: ㅿ (Bansiot - Archaic Korean), ㆆ (Yeorinhieut).
OCR Output: 人 (Chinese 'Person'), レ (Japanese Katakana 'Re').
Error: The model appears to lack training data for archaic Hangul and falls back to visually similar shapes from other CJK scripts, rendering the text meaningless.
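
Contamination like this can be caught after the fact by checking which script each output character belongs to. The sketch below uses only the standard `unicodedata` module and takes the first word of the Unicode character name as a rough script label. That labeling is an assumption made for illustration, not a full implementation of the Unicode Script property, and the helper names are hypothetical.

```python
import unicodedata

def script_of(ch: str) -> str:
    """Rough script label: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def foreign_chars(text: str, expected: str = "HANGUL"):
    """List (char, label) pairs for letters whose label differs from `expected`."""
    return [
        (ch, script_of(ch))
        for ch in text
        if unicodedata.category(ch).startswith("L") and script_of(ch) != expected
    ]

# The archaic jamo in the original are still Hangul letters in Unicode.
print(foreign_chars("ㅿ ㆆ"))
# The OCR output for the same positions is not.
print(foreign_chars("人 レ"))
```

Legitimate Korean text can contain Hanja, so a real check would need an allow-list instead of treating every non-Hangul letter as an error.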

Conclusion:
The OCR engine is currently unreliable for non-standard or high-complexity texts. It prioritizes producing any output over accuracy, leading to hallucinations that are harder to detect than simple "no text found" errors.

The1Just

Feb 26

Detailed Error Analysis:

ENGLISH (Section A) — ✅ Excellent
The English block was recognized perfectly. Even linguistic monstrosities like floccinaucinihilipilification and antidisestablishmentarianism were read without a single error.

RUSSIAN (Section B) — ⚠️ Average (Critical Errors Present)
The OCR "stumbled" on the longest multi-morpheme words:

Original: субстанциализирующийся
OCR: субстанциал**ь**изирующийся (Inserted an extra soft sign).
Original: человеконенавистничества
OCR: человека конечнастичства (Gross error: the word was torn into two tokens, and the ending was turned into gibberish).
Original: психоневрологическим
OCR: психоневро**по**логическим (Added the syllable "po," turning it into a non-existent word).
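
To put a number on these three errors, one option is the character error rate (CER): the Levenshtein edit distance between reference and output, divided by the reference length. The implementation below is a plain dynamic-programming sketch written for this thread and is not the metric used by the OCR project. The emphasis markers from the listing above are stripped from the strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

# (ground truth, OCR output) pairs from the Russian section
pairs = [
    ("субстанциализирующийся", "субстанциальизирующийся"),
    ("человеконенавистничества", "человека конечнастичства"),
    ("психоневрологическим", "психоневропологическим"),
]

for reference, hypothesis in pairs:
    dist = levenshtein(reference, hypothesis)
    print(f"{reference}: distance={dist}, CER={cer(reference, hypothesis):.3f}")
```

CER treats a one-letter insertion and a torn word on the same scale, so it complements the qualitative notes above without replacing them.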

CHINESE (Section C) — ❌ Poor (Hallucinations and Simplifications)
The system failed the character complexity stress test:

Original: 魍魎 (Monsters) -> OCR: 魍魍 (Repetition of the first character; the second was not recognized).
Original: 糾纏 (Entanglement) -> OCR: 糾緟 (The second character was replaced by a visually similar but incorrect one).
Original: 龘 (Triple Dragon character) -> OCR: 龍龍之龍 (Interesting glitch: The OCR saw three dragons and, instead of the single symbol, wrote the phrase "Dragon Dragon of Dragon." It decomposed the character into its components).
Original: 「爨」 (Stove, 30 strokes) -> OCR:「翼」 (Wing). Complete hallucination; the symbols are not visually similar.
Original: 灩與齉 -> OCR: 豐與鮋. Absolutely different characters. Complex symbols were simply replaced by random simpler ones.

JAPANESE (Section D) — ⚠️ Acceptable (With Nuances)

Original: 薔薇 (Rose) -> OCR: **蔷**薇 (The first Kanji was replaced by its Chinese Simplified variant).
Original: 的 (Suffix/Particle) -> OCR: ” (Closing quote). This is a frequent error, as 的 in small font looks like a closing quote mark.
Original: 鬱 (Depression) -> OCR: 鬰 (A very rare variation of this character was used).

KOREAN (Section E) — ☠️ Total Failure
The Korean section suffered the most. The OCR failed to understand archaic letters and complex syllables, resorting to inventing text or inserting Japanese characters:

Original: 고독할한 -> OCR: 교동할한 (Meaning changed).
Original: 눍어적 (Complex syllable) -> OCR: 법어데서 (Complete hallucination; text was invented).
Original: 닭볶음탕 (Spicy Chicken Stew) -> OCR: 당류음탕 (Sugar soup? Nonsense).
Original: 맛을 논하다 (Discuss taste) -> OCR: 막을 느끗하다 (Nonsense).
Archaic letters: ㅿ, ㆁ, ㆆ -> OCR: 人, レ, ド. The funniest moment: The system could not find these symbols in its Korean database and inserted the Chinese character for "Person" and the Japanese Katakana for "Re" and "Do."
Original: 뻅뗏하기 -> OCR: 빛냉하기. Complex syllables were completely replaced.

iyuge2

Z.ai org Feb 27

@The1Just Hi, thank you for your careful feedback. The current version still has limitations in multilingual scenarios: recognition quality for text other than Chinese and English is not yet fully optimized. Improving multilingual text performance is one of our key priorities for the next release.

aoiandroid

Mar 7

Where can I check the language coverage for the case where each document contains only one language?
