Critical Hallucinations and Cross-Script Confusion in Multilingual Stress Test (CJK + Cyrillic)

#42
by The1Just - opened

We conducted a stress test using a custom "Multilingual Complexity Document" designed to evaluate the OCR engine's robustness against edge cases. The document contains archaic characters, high-stroke-count CJK ideographs, and very long Russian words.
The current OCR model demonstrates significant failure modes, including semantic hallucinations, radical decomposition errors, and cross-lingual script contamination.

Observed Failures & Analysis:

  1. Russian (Cyrillic) - Segmentation Failure & Hallucination
    The engine struggles with long compound words, breaking them into nonsensical tokens.
  • Original: человеконенавистничества (misanthropy - single word).
  • OCR Output: человека конечнастичства
  • Error: The model hallucinated a space and invented a non-existent suffix, changing the meaning entirely.
  2. Chinese (Traditional/Complex) - Radical Decomposition Error
    Instead of recognizing a complex character, the engine tries to describe its visual components or splits it into separate characters.
  • Original: (A rare character composed of three dragons).
  • OCR Output: 龍龍之龍 (Literal translation: "Dragon Dragon of Dragon").
  • Error: The engine failed to map the glyph to its Unicode code point and instead output a description of its components.
  • Original: 爨 (30 strokes).
  • OCR Output: 翼 (Wing).
  • Error: Visual hallucination. The characters are visually distinct; the model guessed based on density.
  3. Korean (Hangul) - Cross-Lingual Contamination
    The engine fails to recognize archaic Korean letters (Jamo) and substitutes them with characters from completely different languages (Japanese/Chinese).
  • Original: ㅿ (Bansiot - Archaic Korean), ㆆ (Yeorinhieut).
  • OCR Output: 人 (Chinese 'Person'), レ (Japanese Katakana 'Re').
  • Error: The model lacks training data for archaic Hangul and defaults to visually similar shapes from other CJK languages, rendering the text meaningless.

Conclusion:
The OCR engine is currently unreliable for non-standard or high-complexity texts. It prioritizes producing any output over accuracy, leading to hallucinations that are harder to detect than simple "no text found" errors.
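
One way to make the cross-script substitutions above detectable automatically is to check which scripts appear in the output of a section whose script is known in advance. The following is a minimal sketch, not part of the OCR engine: the helper names `script_of` and `unexpected_scripts` are hypothetical, and matching on Unicode character names is only a rough stand-in for the Unicode Script property, which the Python standard library does not expose.

```python
import unicodedata

# Ordered (name prefix, label) pairs. Matching on character names is a
# heuristic assumption; it is not a full Script property lookup.
_PREFIXES = (
    ("HANGUL", "Hangul"),
    ("CJK", "Han"),
    ("KATAKANA", "Katakana"),
    ("HIRAGANA", "Hiragana"),
    ("CYRILLIC", "Cyrillic"),
    ("LATIN", "Latin"),
)


def script_of(ch: str) -> str:
    """Return a rough script label for one character, or "Other"."""
    name = unicodedata.name(ch, "")
    for prefix, label in _PREFIXES:
        if name.startswith(prefix):
            return label
    return "Other"


def unexpected_scripts(text: str, expected: set[str]) -> list[tuple[str, str]]:
    """List (character, script) pairs whose script is not in `expected`.

    Punctuation, digits and whitespace fall under "Other" and are ignored.
    """
    hits = []
    for ch in text:
        label = script_of(ch)
        if label != "Other" and label not in expected:
            hits.append((ch, label))
    return hits


if __name__ == "__main__":
    # OCR output reported for the archaic Korean letters.
    print(unexpected_scripts("人, レ, ド", {"Hangul"}))
    # The original archaic letters, which are all Hangul.
    print(unexpected_scripts("ㅿ, ㆁ, ㆆ", {"Hangul"}))
```

A check like this only flags script contamination. Same-script substitutions such as 爨 being read as 翼 pass through unnoticed and still need a comparison against ground truth.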

Detailed Error Analysis:

  1. ENGLISH (Section A) — ✅ Excellent
    The English block was recognized perfectly. Even linguistic monstrosities like floccinaucinihilipilification and antidisestablishmentarianism were read without a single error.
  2. RUSSIAN (Section B) — ⚠️ Average (Critical Errors Present)
    The OCR "stumbled" on the longest words:
  • Original: субстанциализирующийся
  • OCR: субстанциал**ь**изирующийся (Inserted an extra soft sign).
  • Original: человеконенавистничества
  • OCR: человека конечнастичства (Gross error: the word was torn into two tokens, and the ending was turned into gibberish).
  • Original: психоневрологическим
  • OCR: психоневро**по**логическим (Added the syllable "po," turning it into a non-existent word).
  3. CHINESE (Section C) — ❌ Poor (Hallucinations and Simplifications)
    The system failed the character complexity stress test:
  • Original: 魍魎 (Monsters) -> OCR: 魍魍 (Repetition of the first character; the second was not recognized).
  • Original: 糾纏 (Entanglement) -> OCR: 糾緟 (The second character was replaced by a visually similar but incorrect one).
  • Original: (Triple Dragon character) -> OCR: 龍龍之龍 (Interesting glitch: The OCR saw three dragons and, instead of the single symbol, wrote the phrase "Dragon Dragon of Dragon." It decomposed the character into its components).
  • Original: 「爨」 (Stove, 30 strokes) -> OCR:「翼」 (Wing). Complete hallucination; the symbols are not visually similar.
  • Original: 灩與齉 -> OCR: 豐與鮋. Absolutely different characters. Complex symbols were simply replaced by random simpler ones.
  4. JAPANESE (Section D) — ⚠️ Acceptable (With Nuances)
  • Original: 薔薇 (Rose) -> OCR: **蔷**薇 (The first Kanji was replaced by its Simplified Chinese variant).
  • Original: (Suffix/Particle) -> OCR: (Closing quote). This is a frequent error, since in a small font the character looks like a closing quote mark.
  • Original: (Depression) -> OCR: (A very rare variation of this character was used).
  5. KOREAN (Section E) — ☠️ Total Failure
    The Korean section suffered the most. The OCR failed to understand archaic letters and complex syllables, resorting to inventing text or inserting Japanese characters:
  • Original: 고독할한 -> OCR: 교동할한 (Meaning changed).
  • Original: 눍어적 (Complex syllable) -> OCR: 법어데서 (Complete hallucination; text was invented).
  • Original: 닭볶음탕 (Spicy Chicken Stew) -> OCR: 당류음탕 (Sugar soup? Nonsense).
  • Original: 맛을 논하다 (Discuss taste) -> OCR: 막을 느끗하다 (Nonsense).
  • Archaic letters: ㅿ, ㆁ, ㆆ -> OCR: 人, レ, ド. The funniest moment: The system could not find these symbols in its Korean database and inserted the Chinese character for "Person" and the Japanese Katakana for "Re" and "Do."
  • Original: 뻅뗏하기 -> OCR: 빛냉하기. Complex syllables were completely replaced.
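
The per-section ratings above are qualitative. To put a number on each original/OCR pair, a character error rate (CER) can be computed as the edit distance between the two strings divided by the length of the original. This is a minimal sketch under stated assumptions: the function names are illustrative, both strings are NFC-normalized first so that Hangul syllables count as single characters, and length is measured in code points rather than grapheme clusters.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of `hypothesis` against a non-empty `reference`."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # A few (original, OCR output) pairs taken from the report above.
    pairs = [
        ("человеконенавистничества", "человека конечнастичства"),
        ("魍魎", "魍魍"),
        ("닭볶음탕", "당류음탕"),
    ]
    for original, ocr in pairs:
        print(f"{original} -> {ocr}: CER = {cer(original, ocr):.2f}")
```

CER treats every wrong character equally, so a single substituted ideograph in a two-character word already scores 0.5, while one inserted letter in a long Russian word scores much lower even though both change the meaning.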
Z.ai org

@The1Just Hi, thank you for your careful feedback. The current version still has limitations in multilingual scenarios, and the recognition quality for non-Chinese/English text is not yet fully optimized. Improving multilingual text performance is one of our key priorities for the next release.
