Title: TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition

URL Source: https://arxiv.org/html/2412.01137

Xingsong Ye 1, Yongkun Du 1, Yunbo Tao 1, Zhineng Chen 2

1 College of Computer Science and Artificial Intelligence, Fudan University, China

2 Institute of Trustworthy Embodied AI, Fudan University, China

{xsye20, zhinchen}@fudan.edu.cn, {ykdu23, ybtao24}@m.fudan.edu.cn

Abstract

Scene text recognition (STR) suffers from challenges of either less realistic synthetic training data or the difficulty of collecting sufficient high-quality real-world data, limiting the effectiveness of trained models. Meanwhile, despite producing holistically appealing text images, diffusion-based visual text generation methods struggle to synthesize accurate and realistic instance-level text at scale. To tackle this, we introduce TextSSR: a novel pipeline for Synthesizing Scene Text Recognition training data. TextSSR targets three key synthesizing characteristics: accuracy, realism, and scalability. It achieves accuracy through a proposed region-centric text generation with position-glyph enhancement, ensuring proper character placement. It maintains realism by guiding style and appearance generation using contextual hints from surrounding text or background. This character-aware diffusion architecture enjoys precise character-level control and semantic coherence preservation, without relying on natural language prompts. Therefore, TextSSR supports large-scale generation through combinatorial text permutations. Based on these, we present TextSSR-F, a dataset of 3.55 million quality-screened text instances. Extensive experiments show that STR models trained on TextSSR-F outperform those trained on existing synthetic datasets by clear margins on common benchmarks, and further improvements are observed when mixed with real-world training data. Code is available at https://github.com/YesianRohn/TextSSR.

1 Introduction

Image 1: Refer to caption

Figure 1: Top: Text instances detected by an end-to-end STR model[23], and synthesized by our TextSSR based on the detected examples. Bottom: Accuracy of MAERec[20] across common benchmarks. Peach indicates training on ST[15] and TextOCR[39], while blue denotes that TextSSR-F is additionally included as training data.

Scene text images are a unique type of image captured in the wild. They differ from general images in the presence of text, which appears in diverse sizes and fonts, often with varying degrees of distortion and occlusion. At the same time, these texts usually carry high-level semantic information closely related to the captured scene, making STR a crucial task for complex scene understanding, image search, and beyond.

Recent STR practices[20, 9, 10] have shown that besides advanced modeling techniques, high-quality training data also plays a critical role. Training STR models also follows the scaling law from the data perspective[31], and the current data size is far from saturated even for English. Enlarging real-world datasets is a feasible way to further enhance recognition performance. However, it is challenging to collect real-world scene text images on an even larger scale. On one hand, diversified data collection is costly and time-consuming. On the other hand, real-world scenes predominantly contain high-frequency, semantically meaningful words, while low-frequency words are difficult to collect. Synthesizing high-quality text images seems to be an effective alternative. However, research in this direction lags behind. The two most popular synthetic datasets, MJ[18] and ST[15], were constructed with traditional rendering techniques nearly a decade ago. Although they can be easily scaled up in quantity, recent studies[9, 11] indicate that there are significant accuracy gaps between models trained on MJ&ST and those trained on Union14M-L[20], mainly because MJ&ST fail to comprehensively describe the challenges of real-world text instances.

Recently, diffusion models have shown impressive performance in generating text-rich images[50, 41, 4, 5, 57]. However, most of them focus on generating holistically aesthetic text images, e.g., posters. Using data generated by these methods for STR model training encounters several issues. First, these methods are designed for full-image generation; when switched to synthesizing text in instance-level regions, they become less effective, and ensuring generation accuracy becomes challenging. Second, current generation methods mainly rely on well-crafted natural language prompts. With some probability they produce overly homogeneous and unrealistic text instances, which are unsuitable as training data. Moreover, constructing a diverse set of prompts is inherently difficult, which hinders the ability to generate data at scale.

Nevertheless, given the exceptional performance of diffusion models across various tasks, it is natural to expect that diffusion-based techniques can produce satisfactory training data for STR. Before proceeding, we first outline three key characteristics that an improved data synthesis method should exhibit:

  • Accurate: The synthesized text content should match the given text, ensuring that the synthesized words are readable and free from character misplacement or duplication errors. This is fundamental for maintaining the correctness of the trained STR models.
  • Realistic: The generated text instances should visually resemble real-world text, simulating scene conditions in the wild. This includes being harmoniously blended into the background, ensuring consistent generation at different sizes, and capturing the diversity of text presentation, especially in complex scenes. Achieving realism is essential for enhancing the performance of text recognizers.
  • Scalable: The method should support extensive training data production. In other words, it should abandon complex control inputs and design processes, enabling large-scale data generation using only easily accessible and processable resources.

To satisfy these characteristics, we propose TextSSR, a novel diffusion-driven pipeline for Synthesizing training data for Scene Text Recognition. Our approach introduces the following innovations to overcome persistent limitations in text synthesis. First, we implement region-centric text generation, which employs an end-to-end OCR model to acquire text location and size. Moreover, we devise a position-glyph enhancement that further resolves character duplication and positional disorder, ensuring precise text placement and arrangement. Second, we leverage the detected text, or only the background near the designated region, as contextual hints for font style and appearance generation, guiding the generated text to better match real-world scenarios. Third, we design a character-aware diffusion architecture that employs glyph priors rather than natural language prompts as injected conditions. It enables precise character-level control while maintaining full-word semantic coherence. To expand the generation quantity, we use the detected text to generate a series of anagrams by rearranging characters within the original bounding boxes, thus supporting large-scale generation. As a result, TextSSR achieves accurate, realistic, and scalable text instance synthesis, as illustrated in Fig.2.

Through these attempts, we construct the TextSSR-F dataset with 3.55 million quality-screened samples, which are automatically validated by checking recognition results against their pseudo-labels. Extensive experiments demonstrate the superiority of TextSSR-F over existing synthetic data, as well as its complementarity with real-world training data.

Our contributions are summarized as follows:

  • We propose TextSSR, a diffusion-based text synthesis method centered on seamlessly blending text instances into real scenes. It develops a region-centric process that enables accurate instance-level generation.
  • TextSSR fully exploits the positional and glyph priors of text, achieving accurate, diverse, and realistic text synthesis using only the detected text as input. Moreover, it can be easily scaled up, and we construct the TextSSR-F dataset with 3.55 million text instances.
  • Extensive evaluations validate that STR models trained on TextSSR-F surpass counterparts trained on existing synthetic data by 3.1% on common benchmarks, and even better STR models are obtained when further combined with real-world training data, highlighting the utility of the proposed TextSSR.

Image 2: Refer to caption

Figure 2: Showcases of TextSSR’s capacity to synthesize accurate, realistic and scalable text instances.

2 Related Work

2.1 Scene Text Recognition

STR research can be categorized from both the model and the data perspectives. From the model point of view, early approaches employed CTC-based pipelines[36, 52]: a CNN extracts visual features, an RNN models the sequence, and a CTC loss serves as the training constraint. Subsequently, attention mechanisms gradually gained traction in the STR field[35, 26, 38, 24]. Inspired by developments in NLP, many methods[12, 2, 13, 44, 46, 58, 48] further refined recognizers from a language-modeling perspective, while some approaches[8, 56, 11] continued to explore solutions based on a single vision model and CTC decoding, owing to their fast inference. Accuracy has steadily improved through these efforts.

From a data perspective, the limited availability of well-annotated real-world training data led to a practical STR training and evaluation protocol: training solely on synthetic data, or a combination of synthetic and real data, followed by evaluation on real-world benchmarks. Besides the six common benchmarks[27, 43, 21, 22, 29, 32], Union14M-L[20] aggregated publicly available real-world datasets (including TextOCR[39]) to create a real-world training dataset with nearly 4 million instances. However, recent studies on the OCR scaling law[31] suggest that even this substantial volume has not reached the saturation point. This observation implies that STR models can still benefit from additional training data, and high-quality synthetic data could be a feasible supplement, which motivates our study.

2.2 Scene Text Synthesis

Early synthesis methods[18, 15, 25] employed rule-based techniques to overlay text onto backgrounds, with transformations simulating real-world appearance. To be more realistic, later research shifted toward deep learning-based generation. Among these, GAN-based scene text editing methods[47, 49, 30] aimed to expand data by editing text directly within specified regions, but they were limited by the lack of real paired training data. Recent advances leveraged diffusion models for both editing[19, 17, 51] and synthesis, with fine-grained control over glyphs[3, 50], as well as positioning at the word[41] and character[4, 57, 45] levels. The latest methods[5, 60] incorporate large language or multi-modal models to enhance performance. However, these approaches primarily target visually appealing generation for aesthetic purposes, rather than generating data for better STR models. Some work has explored diffusion-based text image synthesis for recognition. For example, CTIG-DM[59] employed character sequences as the generation condition; however, it is less controllable and faces a diversity-realism tradeoff that limits the quality of the generated text. SceneVTG[60] developed a first-erase-then-synthesize pipeline, but it requires an erasing model to remove existing text completely, which is difficult to guarantee in large-scale synthesis scenarios.

3 Method

To achieve accurate, realistic, and scalable text instance generation, TextSSR adopts a four-step pipeline (detailed in Fig.3(b)). To handle text instances with varying sizes and locations, we first focus on text region processing, discussed in Sec.3.1. To enhance accuracy through fine-grained extraction and decoupling of text glyphs and positions, Sec.3.2 and Sec.3.3 detail the specific module design and training strategy, respectively. Sec.3.4 introduces our data extension method, which enables data synthesis at scale.

Image 3: Refer to caption

Figure 3: Overall architecture of TextSSR’s (a) generative model and (b) synthesis pipeline. (a) Given a scene text instance (marked in red) and its content (e.g., “kills”), it is preprocessed into the smallest local image containing the text. A text-size mask is then applied to identify non-text regions, while the text content is rendered as priors at both word and character levels. These different conditions are subsequently injected into a diffusion model as training guidance. (b) An end-to-end OCR model extracts the text location and content within the image (Step 1). All possible text words are generated through anagram (Step 2). The generative model then synthesizes text instances in batches (Step 3), and finally, a SOTA STR model filters the samples, retaining only the correctly recognized instances for use (Step 4).

3.1 Text Region-centric Processing

Instance-level text generation presents significant challenges, as we must handle varying sizes and diverse potential positions, and determine how to select appropriate surrounding information for effective prompting. Synthesis methods that edit the entire image are not suitable for this task: such approaches create an imbalance between text and background, and small text regions lack sufficient pixels for accurate reconstruction. Furthermore, information from the entire scene is often redundant for text region synthesis. For example, in Fig.3(a), the text “kills” to be synthesized shares a similar style and context with the nearby upper text and background, while regions farther away are inconsistent in appearance.

Therefore, we propose a generative scheme based on a locality assumption: text styles are more likely to correlate with adjacent text and background information. This method focuses generation at the region level, synthesizing the text using only the surrounding local context rather than the entire scene. Specifically, we use the annotated text box to extract the smallest enclosing square around the text region and resize it to a fixed size (e.g., 256×256). For practical considerations, we uniformly employ an end-to-end OCR tool[23] to obtain text instance size, position, and textual labels, which well reflects real-world scenarios where annotations are often unavailable.
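The cropping step above can be sketched as follows. This is a minimal NumPy illustration of extracting the smallest enclosing square around a detected text box and resizing it to a fixed input size; the function name, padding strategy, and nearest-neighbour resize are our assumptions, not the TextSSR implementation:

```python
import numpy as np

def crop_text_square(image, box, out_size=256):
    """Crop the smallest enclosing square around a text box and resize
    it to out_size x out_size. `box` is an (N, 2) array of (x, y)
    polygon points; names here are illustrative only."""
    xs, ys = box[:, 0], box[:, 1]
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    side = max(x1 - x0, y1 - y0)              # smallest square covering the box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    x0s = int(round(cx - side / 2))
    y0s = int(round(cy - side / 2))
    side = int(round(side))
    # pad so the square may extend past the image border
    pad = side
    padded = np.pad(image, ((pad, pad), (pad, pad)) + ((0, 0),) * (image.ndim - 2))
    crop = padded[y0s + pad:y0s + pad + side, x0s + pad:x0s + pad + side]
    # nearest-neighbour resize to the fixed model input size
    idx = (np.arange(out_size) * side / out_size).astype(int)
    return crop[idx][:, idx]
```

In practice the square crop keeps nearby text and background (the contextual hints of Sec.3.2) while discarding the rest of the scene.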

3.2 Fine-grained Condition Mechanisms

Visual text rendering needs to consider two key fine-grained features, glyph structure and positional arrangement, at both the word (holistic) and character (individual) levels. Word-level features concern entire-word appearance and region placement, while character-level features address individual characters and their relative ordering. To guide the generative process, we implement the following approach:

Word-level Position and Glyph Guidance. For a text $t$ with position $P_t$ in the original image, we extract a square region $I_s$ in which $P_t$ becomes $P_t'$ (Fig.3(a), red dashed rectangle). This establishes our desired word-level position guidance. We then create a binary mask $I_M$, of the same size as $I_s$, that separates text (0) from non-text (1) regions, indicating where the text is present. For glyph guidance, we render the text in a standard font and apply an affine transformation to align it with the position and orientation of $P_t'$, resulting in $I_G$. As noted in previous work[3, 41, 50], when we close the door on the original styled text, we must at least provide a window with a basic visual style of the text at that location.

Character-level Position and Glyph Guidance. Word-level information alone limits the model's understanding to whole words; we observe issues such as repeated or incorrect characters when using only word-level features. We address this by encoding character-level ordering through pixel values, enabling the model to learn spatial relationships between characters. Specifically, we define a maximum character length $L$. For characters at positions 1 to $l$ ($l \leq L$, with truncation if $l > L$ and zero-padding in channels $l+1$ to $L$ if $l < L$), the $i$-th character is rendered into the $i$-th channel of an $L$-channel image $I_g$. Each character glyph is rendered with pixel intensity $i \times \lfloor P_M / L \rfloor$, where $P_M$ denotes the maximum number of pixel values available to distinguish a character from the background. We represent the glyph image tensor as $I_g \in \mathbb{R}^{L \times S \times S}$, where $L$ is the number of characters and $S$ is the image size for each channel. These position-dependent pixel values serve as positional encodings that help the model distinguish character information at each specific position.
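The character-level encoding can be illustrated with a small sketch. Real glyph rasterization with a standard font is replaced here by filled boxes, so only the per-channel layout and the $i \times \lfloor P_M / L \rfloor$ intensity scheme follow the paper; the function name and slot layout are our own:

```python
import numpy as np

def char_position_prior(word, L=12, S=64, P_M=255):
    """Encode the i-th character of `word` into the i-th channel of an
    L-channel S x S image, using intensity i * floor(P_M / L) as a
    positional prior. A filled box stands in for the rendered glyph."""
    word = word[:L]                              # truncate if longer than L
    I_g = np.zeros((L, S, S), dtype=np.uint8)    # channels i >= len(word) stay zero-padded
    step = P_M // L
    slot = S // max(len(word), 1)                # horizontal slot per character
    for i, ch in enumerate(word, start=1):
        x0 = (i - 1) * slot
        # stand-in "glyph": a filled box where the character would sit
        I_g[i - 1, S // 4: 3 * S // 4, x0: x0 + slot] = i * step
    return I_g
```

Because each channel carries a distinct intensity tied to its position index, the model can separate "which character" from "where in the word" without any language prompt.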

A Vision Transformer (ViT)[7] is employed as the Glyph Encoder $G_\phi$ to take $I_g$ as input and output features compatible with the diffusion model's condition inputs:

$$F_g = G_\phi(I_g) \in \mathbb{R}^{(N+1) \times D} \tag{1}$$

Here, $N$ represents the number of patches, the additional token is a class token, and the dimension $D$ is set to match the input requirements of the diffusion process.
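The shape bookkeeping of Eq. (1) can be checked with a toy stand-in for $G_\phi$: random projection weights and no attention layers, so only the patchify-and-project layout and the $(N+1) \times D$ output shape mirror the paper:

```python
import numpy as np

def glyph_encoder(I_g, patch=16, D=768, seed=0):
    """Toy stand-in for the ViT glyph encoder G_phi: patchify the
    L-channel glyph image, project patches to D dims, and prepend a
    class token, giving (N+1) x D features. Weights are placeholders."""
    rng = np.random.default_rng(seed)
    L, S, _ = I_g.shape
    n = S // patch                                # patches per side, N = n * n
    # (L, S, S) -> (N, L * patch * patch)
    patches = I_g.reshape(L, n, patch, n, patch).transpose(1, 3, 0, 2, 4)
    patches = patches.reshape(n * n, L * patch * patch).astype(np.float32)
    W = rng.standard_normal((patches.shape[1], D)).astype(np.float32) * 0.02
    tokens = patches @ W                          # patch embeddings
    cls = np.zeros((1, D), dtype=np.float32)      # class token
    return np.concatenate([cls, tokens], axis=0)  # (N+1, D)
```

For a 12-channel 64×64 glyph image with 16-pixel patches, this yields 16 patch tokens plus one class token, matching $N + 1$.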

3.3 Adapting to Stable Diffusion

We leverage a pretrained model, fine-tuning its Variational Auto-Encoder (VAE) and retraining its Conditional Diffusion Model (CDM) component for our specific task.

VAE Fine-tuning. The standard VAE, pretrained on general datasets, may not be sensitive enough to capture the fine-grained spatial and glyph features of scaled text instances. As highlighted by DiffUTE[3] and confirmed by our ablation experiments, fine-tuning the VAE is essential when working with text images. We therefore fine-tune it to enhance its capacity for accurately representing text regions at the local level. The VAE consists of an encoder $E_\theta$ and a decoder $D_\theta$; let $V_\theta$ denote the reconstruction after the encoding and decoding process. Following the pre-defined $I_s$, we minimize the following loss function:

$$\mathcal{L}_{\text{TextSSR-VAE}} = \left\| V_\theta(I_s) - I_s \right\|_2^2 \tag{2}$$
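Eq. (2) is a plain squared-L2 reconstruction objective; a minimal sketch, where the VAE round-trip is abstracted as any encode-decode callable (the function names are ours, not the fine-tuning loop itself):

```python
import numpy as np

def vae_recon_loss(V_theta, I_s):
    """Reconstruction objective of Eq. (2): squared L2 distance between
    the VAE round-trip V_theta(I_s) and the local text image I_s."""
    recon = V_theta(I_s)
    return float(np.sum((recon - I_s) ** 2))
```

A perfect round-trip gives zero loss; any blurring of fine glyph strokes by the frozen general-purpose VAE shows up directly in this term, which is what the fine-tuning targets.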

Table 1: Comparison of TextSSR (differences between the pre-training and fine-tuning versions are detailed in Appendix 6.3.1; we use the fine-tuned version, abbreviated as TextSSR, in the rest of the paper) and existing synthesis methods on regular text (IC13), irregular text (IC15), and multi-language text (ShopSign) datasets, alongside results from real datasets (Real). SeqAcc and NED denote sequence recognition accuracy and normalized edit distance, respectively. The FID-R score is the FID computed on the text region.

CDM Retraining. To mask the actual text-instance region while retaining surrounding contextual information, we apply $I_M$ to $I_s$ via $I_m = I_M \cdot I_s$. $I_m$, $I_G$, and $I_s$ are fed into the frozen VAE, yielding latent representations $Z_m$, $Z_G$, and $Z_s$. Additionally, $Z_s$ is further processed with added noise to produce $Z_T$. Moreover, we downsample $I_M$ to match their spatial dimensions (e.g., height = width = 16 in SD 2.1[33]), giving $Z_M$.

These latent features, $Z_M$, $Z_T$, $Z_m$, and $Z_G$, are concatenated and passed through a Conv2D layer to match the channel dimension required by the U-Net (e.g., 4 channels in SD 2.1). Meanwhile, the Glyph Encoder (see Fig.3(a)) that produces $F_g$ is also integrated into the CDM to extract features from $I_g$. Finally, the CDM processes these inputs and produces the denoised latent representation $Z_{T-1}$, progressively refining the noisy input to recover the clean latent space. Let $\epsilon$ denote the added noise and $\epsilon_\theta$ the denoising network that predicts it; the modified CDM is trained by minimizing the following objective:

$$\mathcal{L}_{\text{TextSSR-CDM}} = \left\| \epsilon - \epsilon_\theta(z_t, t, Z_M, Z_m, Z_G, I_g) \right\|_2^2 \tag{3}$$
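The conditioning plumbing described above, concatenating $Z_M$, $Z_T$, $Z_m$, $Z_G$ along channels and projecting to the U-Net's channel count, can be sketched as follows. The 1×1 convolution uses placeholder weights, and names and shapes are illustrative, not the trained layer:

```python
import numpy as np

def cdm_input(Z_M, Z_T, Z_m, Z_G, out_ch=4, seed=0):
    """Sketch of the CDM input path: concatenate the downsampled mask
    Z_M, noised latent Z_T, masked-context latent Z_m, and glyph latent
    Z_G along channels, then apply a 1x1 convolution (a channel-mixing
    matmul at every spatial position) to match the U-Net's expected
    channel count (4 in SD 2.1)."""
    x = np.concatenate([Z_M, Z_T, Z_m, Z_G], axis=0)   # (C_total, H, W)
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((out_ch, x.shape[0])).astype(np.float32) * 0.02
    # 1x1 conv == per-pixel linear map over channels
    return np.einsum('oc,chw->ohw', W, x.astype(np.float32))
```

With SD 2.1 latents of 4 channels each plus a 1-channel mask, the concatenation carries 13 channels before projection back to 4.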

3.4 Anagram-based Data Synthesis

As mentioned above, our pipeline uses the text region and content detected by an OCR engine as inputs. The recognized content is regarded as the pseudo-label of the text. Our generation produces a synthesized text instance on the detected region for each pseudo-label rather than the ground truth, which is typically unknown; therefore, errors in OCR do not affect the generation. To expand the generation in quantity, we propose an anagram-based data expansion strategy. Unlike previous approaches that arbitrarily edit text[4, 41, 5] or substitute equal-length strings[57, 45], we manipulate the internal character order to maintain better region-content coherence. Meanwhile, the diffusion process adaptively handles the varying space requirements of different characters, e.g., ‘o’ is typically wider than ‘l’. Therefore, TextSSR leads to more flexible generation. By changing only the internal character order, the regions remain unchanged while many samples with different character arrangements are generated.

Our model’s character-level rendering capability enables correct generation of text from this permutation-based operation. We define the total number of possible permutations for a word of length $l$ as:

$$P(l) = l! \tag{4}$$

Note that some permutations may coincide due to duplicate characters; they still represent different samples, as the diffusion model is inherently stochastic and the cropped regions also vary.
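The anagram expansion of Sec.3.4 reduces to enumerating distinct character permutations of each detected word, which can be sketched as follows; the `limit` cap on the $l!$ blow-up for long words is our own addition:

```python
from itertools import permutations

def anagram_labels(word, limit=24):
    """Generate distinct character rearrangements of a detected word;
    each serves as a new pseudo-label rendered in the same region."""
    seen, out = set(), []
    for perm in permutations(word):
        cand = ''.join(perm)
        if cand not in seen:         # duplicate characters can repeat strings
            seen.add(cand)
            out.append(cand)
        if len(out) >= limit:
            break
    return out
```

For example, "kills" has 5! = 120 permutations but only 60 distinct strings because of the repeated ‘l’.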

4 Experiments

4.1 TextSSR Data Evaluation

We first evaluate the accuracy and realism of the generated text, as well as TextSSR's scaling behavior, at a small scale. Then we construct large-scale training data and assess its effectiveness at larger scales. Fig.4 shows examples from TextSSR compared to other methods.

Image 4: Refer to caption

Figure 4: Examples synthesized by TextSSR and other existing open-source methods (UDiffText only supports English text due to its limited text encoder). TextSSR consistently produces accurate and realistic text instances. Please refer to Appendix 7.1 for more analysis.

Accuracy Evaluation. Evaluating the accuracy of generated scene text is challenging, as manually verifying alignment with target text is costly. Previous benchmarks typically employ an STR model to automatically assess text consistency. We also follow this scheme. To mitigate potential OCR recognition errors and more effectively reflect the accuracy and difficulty level of our synthetic data, we choose SVTRv2[11], a SOTA STR model upgraded from SVTR[8], as the recognition model.

Tab.1 gives the accuracy assessment results across different datasets, where IC13 denotes that all methods use text instances from the IC13 dataset[21] to initialize their generation, and the others are defined similarly. As can be seen, a large portion of text instances generated by TextSSR are correctly recognized. On the more irregular and difficult IC15[22] and the less linguistically regular ShopSign[54] datasets, despite the observed accuracy declines, TextSSR still outperforms the others by significant margins. The results provide strong evidence of TextSSR's correctness.

Compared with AnyText[41], the current SOTA, TextSSR outperforms it by 30.1% and 56.7% in accuracy on IC13 and IC15, respectively. This is mainly because TextSSR operates at the instance level, while AnyText generates the full image. Note that methods based on full-image synthesis struggle significantly in complex, challenging scenarios, e.g., IC15. This dataset includes many text instances affected by blur, low resolution, perspective distortion, and curvature, challenges that previous methods do not address effectively. In contrast, TextSSR benefits from a region-centered generation pipeline; it excels at handling challenging text instances and maintains reasonable synthesis accuracy. Moreover, TextSSR is conditioned on glyph shapes, which extend naturally to other languages such as Chinese. The results on ShopSign confirm TextSSR's multilingual capability: it outperforms the purportedly multilingual-focused AnyText by 18.7%. More details on accuracy evaluation are provided in Appendix 6.3.2.

| Type | Method | Regular SeqAcc(%)↑ | Irregular SeqAcc(%)↑ | Avg SeqAcc(%)↑ |
|---|---|---|---|---|
| Rendering | SynthText[15] | 48.55 | 26.08 | 39.74 |
| | VISD[53] | 38.02 | 26.88 | 33.65 |
| | UnrealText[25] | 32.06 | 19.85 | 27.28 |
| Diffusion | TextDiffuser[4] | 30.08 | 8.79 | 21.74 |
| | GlyphControl[50] | 40.80 | 13.21 | 29.99 |
| | AnyText[41] | 49.80 | 19.62 | 37.97 |
| | TextDiffuser-2[5] | 40.26 | 10.48 | 28.59 |
| | SceneVTG[60] | 54.97 | 35.50 | 47.34 |
| | TextSSR (Ours) | 59.76 | 35.71 | 50.33 |
| Real | Real-L[1] | 57.97 | 37.66 | 50.01 |
| | TextOCR[39] | 57.83 | 41.56 | 54.05 |

Table 2: Comparison of different data synthetic methods, where CRNN[36] trained on different synthetic data and real data of 30k is employed to obtain the results. “Regular” denotes the average on IIIT[27], SVT[43], and IC13[21] datasets, “Irregular” represents the average on IC15[22], SVTP[29], and CUTE[32] datasets, while “Avg” is the average of the six datasets.

| Model | Dataset | Volume | IIIT5k[27] | SVT[43] | IC13[21] | IC15[22] | SVTP[29] | CUTE80[32] | Avg | Contextless[20] |
|---|---|---|---|---|---|---|---|---|---|---|
| CRNN[36] | ST | 4m | 90.13 | 82.84 | 90.90 | 72.45 | 72.09 | 82.99 | 81.90 | 33.50 |
| | ST | 6.98m | 90.80 | 83.00 | 91.83 | 72.34 | 73.80 | 81.60 | 82.06 | 32.61 |
| | ST+SynthAdd | 4m+1.05m | 90.43 | 83.93 | 91.13 | 73.55 | 72.71 | 80.21 | 81.99 | 34.53 |
| | ST+TextSSR | 4m+1.05m | 92.13 | 87.02 | 93.35 | 77.42 | 77.67 | 84.37 | 85.33 | 55.97 |
| | ST+TextSSR | 4m+3.55m | 93.57 | 89.18 | 94.28 | 78.58 | 78.45 | 82.99 | 86.17 | 57.89 |
| | ST+TextOCR | 4m+1.05m | 93.13 | 88.10 | 92.53 | 79.68 | 79.22 | 83.68 | 86.06 | 44.93 |
| | ST+TextOCR+TextSSR | 5.05m+3.55m | 94.57 | 89.34 | 94.40 | 80.23 | 79.22 | 84.72 | 87.08 | 58.15 |
| | gain from +TextSSR | | +1.44 | +1.24 | +1.87 | +0.55 | +0.00 | +1.04 | +1.02 | +13.22 |
| MAERec[20] | ST | 4m | 96.00 | 91.96 | 95.80 | 83.88 | 85.74 | 89.58 | 90.49 | 59.43 |
| | ST | 6.98m | 96.90 | 92.89 | 95.80 | 84.70 | 86.82 | 92.36 | 91.58 | 61.62 |
| | ST+SynthAdd | 4m+1.05m | 95.97 | 93.20 | 95.80 | 85.04 | 86.98 | 90.28 | 91.21 | 60.46 |
| | ST+TextSSR | 4m+1.05m | 97.53 | 93.82 | 96.85 | 86.75 | 89.46 | 94.10 | 93.08 | 76.77 |
| | ST+TextSSR | 4m+3.55m | 98.13 | 94.74 | 97.32 | 87.85 | 89.61 | 94.10 | 93.63 | 78.17 |
| | ST+TextOCR | 4m+1.05m | 98.10 | 95.52 | 97.20 | 89.34 | 90.39 | 96.87 | 94.57 | 74.33 |
| | ST+TextOCR+TextSSR | 5.05m+3.55m | 98.20 | 96.75 | 97.20 | 89.95 | 92.71 | 97.92 | 95.46 | 82.03 |
| | gain from +TextSSR | | +0.10 | +1.23 | +0.00 | +0.61 | +2.32 | +1.05 | +0.89 | +7.70 |

Table 3: Performance of models trained on different data combinations, where CRNN and MAERec are selected as model representatives. Bold and underline indicate the highest and second highest values, respectively.

Realism Evaluation. Following the validation presented in SceneVTG[60], we generate a fixed amount (30k) of data to train an identical model under fair settings and then test it on common STR benchmarks[27, 43, 21, 22, 29, 32]. This approach avoids the need for separate verification of text correctness or synthesis quality. If the generated text is inaccurate or unrealistic, this will be reflected in poor recognition performance during testing.

As shown in Tab.2, the average SeqAcc of the compared methods is below 40%, except for SceneVTG[60]. This result is closely linked to their low generation accuracy. Note that rendering-based methods generally perform better than traditional diffusion-based methods on irregular benchmarks, highlighting the challenge of generating highly controllable text images. SceneVTG benefits from its two-stage pipeline of text erasure followed by rendering, reaching a SeqAcc of 47.3%. Nevertheless, TextSSR, featuring glyph-based condition injection and anagram-based generation, surpasses SceneVTG by nearly 3%. Moreover, the gap between models trained on 30k TextSSR samples and on real-world TextOCR is only 3.7%, attributable mainly to irregular text. These results convincingly demonstrate the realism of TextSSR data, especially for regular text synthesis. However, further effort is still needed to improve the synthesis of irregular text.

Scalability Evaluation. TextSSR data can be easily scaled up to a large quantity, so it is necessary to assess whether the increase in data volume translates into performance improvements. We therefore use the model trained on the 30k data from the realism evaluation as the baseline (TextSSR ×1) and re-train CRNN under the same experimental setup while doubling the data each time. Results of this scalability evaluation are presented in Tab.4. When scaled up to ×4 and ×8, accuracy improvements of 4.7% and 14.1% are observed, respectively. This indicates that increasing the data volume indeed contributes to better STR models. Note that the models in these two settings also surpass the model trained on 30k real-world data, by 1.0% and 10.4%, respectively. TextSSR ×8 also yields a much better result than the real-data model on irregular text, highlighting that the challenge of irregularity can be overcome to some extent by increasing data volume. Meanwhile, we also conduct a comparison similar to [57], i.e., substituting text instances with equal-length strings (TextSSR EL), where the trained model reports considerably worse results. This again demonstrates the effectiveness of our anagram-based generation pipeline.

Table 4: TextSSR scalability evaluation. ×n refers to performing the anagram-based rendering n times. EL denotes equal-length editing.

Usability Assessment. We further assess the effectiveness of our synthesis method in supporting large-scale training of STR models. To ensure data quality, we apply a quality-screening step to every sample generated by TextSSR: we use SVTRv2[11] to recognize the sample and compare the recognized content with its generation label; the sample is retained if they are consistent and discarded otherwise. The results on different training data combinations are shown in Tab.3. The observations can be summarized as follows. First, the rendering-based ST dataset reaches a saturation point in accuracy: 6.98m samples show only marginal improvement (0.16% on CRNN) over the 4m subset on the six common benchmarks. This suggests that we have to move toward more realistic data synthesis. Second, when incorporating 1.05m of our synthetic data, the two STR models (CRNN and MAERec) gain improvements of 3.4% and 2.6% over the 4m ST dataset on the six common benchmarks, respectively, and are only 0.7% and 1.5% lower than ST combined with an equivalent amount of real-world TextOCR data. This suggests that TextSSR data approaches real-world data in quality. Third, when the TextSSR data is further increased to 3.55m, the purely synthetic training data (ST + 3.55m TextSSR) approaches the performance of the synthetic-real mixture (ST + 1.05m TextOCR). Moreover, combining the 3.55m synthetic data with this mixture yields further improvements of 1.02% on CRNN and 0.89% on MAERec. This implies that TextSSR data is a valuable supplement to existing training data: used alone or in combination, it benefits STR model training. We therefore release this 3.55m dataset as TextSSR-F, an accurate and realistic large-scale synthetic dataset constructed by TextSSR, and make it publicly available.
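The quality-screening step can be sketched as a simple filter; here `recognize` stands in for the SVTRv2 recognizer, and exact string equality between prediction and generation label is our assumption:

```python
def screen_samples(samples, recognize):
    """Quality screening (Step 4 of the pipeline): keep a synthesized
    sample only when the recognizer's output matches its generation
    label. `samples` are dicts with 'image' and 'label' keys; the
    schema is illustrative."""
    return [s for s in samples if recognize(s['image']) == s['label']]
```

Applied to the raw TextSSR output, a filter of this kind is what reduces the pool to the 3.55 million retained TextSSR-F instances.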

Since our anagram-based synthesis generates a large number of contextless words, and contextless word generation is also discussed in SynthAdd[24], we conduct a comparison between SynthAdd and our TextSSR. As shown in Tab.3, there are clear performance margins (3.3% on CRNN and 1.9% on MAERec) between models trained on traditional SynthAdd and on TextSSR. This further indicates the superiority of TextSSR as an STR training-data synthesis method.

Table 5: Ablation study of TextSSR components using the TextSSR pre-training model in Tab.1 on the IC13 dataset.

4.2 Ablation Study

To assess the proposed components, we start with the TextSSR pre-training model in Tab.1 (for simplicity, trained for only 5k steps), where all components are included. Then, we remove one or more components at a time to assess their respective necessity and effectiveness. From the results in Tab.5, we can see that removing any component brings a clear decline in performance, while the absence of “region-centric” processing leads to a catastrophic drop of 35.4% in accuracy. Additionally, fine-tuning the VAE offers a trade-off, slightly sacrificing visual quality for an improvement in accuracy. This is because real-world scene text also suffers from poor visual quality; simulating this to some extent provides valuable samples for model training.


Figure 5: Performance of TextSSR fine-tuning model with and without region-centric processing across text regions of different sizes (in pixels).

We perform another experiment to verify the effectiveness of region-centric processing. The results in Fig.5 show that without this processing, TextSSR has to synthesize text instances on the whole image, and its performance first rises and then falls as the region size increases. When the region is small, the entire image is scaled down severely, leaving the text region blurred with too few pixels for text rendering. As the region size increases, more pixels become available and performance improves. When the region is too large, it constrains the shape of the generated text and performance declines. In contrast, incorporating region-centric processing allows TextSSR to achieve superior and consistent performance across all sizes.

In Fig.6, we also illustratively ablate the character-level position and glyph components. The results indicate that omitting either component leads to issues such as character deformation, incorrect characters, and duplication errors.


Figure 6: Visualizations of TextSSR with and without character-level prior. More examples are provided in Appendix 7.3.

5 Conclusion

We have proposed TextSSR to provide STR model training with high-quality synthetic data. It employs the glyphs of existing text and the surrounding context as the prompt of the diffusion-based synthesis model, and combines character-level constraints with permutations. TextSSR successfully synthesizes accurate and realistic text instances at a large scale, from which we construct the TextSSR-F dataset with 3.55 million diverse and realistic instances. We have conducted extensive experiments to assess TextSSR-F: STR models trained solely on TextSSR-F, and on the combination of TextSSR-F and real-world data, both show performance improvements. While the effectiveness of TextSSR has been broadly confirmed, for irregular text there is still an accuracy gap between models trained on TextSSR-F and on real-world data. Thus, future work includes exploring more controllable ways to further improve synthesis quality, especially for challenging instances such as curved and multi-oriented text. We are also interested in extending TextSSR to generate large-scale data in other languages such as Chinese.

Acknowledgement This work was supported by the National Natural Science Foundation of China (Nos. 32341012, 62172103).

References

  • Baek et al. [2021] J. Baek, Y. Matsui, and K. Aizawa. What if we only use real datasets for scene text recognition? toward scene text recognition with fewer labels. In CVPR, pages 3113–3122, 2021.
  • Bautista and Atienza [2022] D. Bautista and R. Atienza. Scene text recognition with permuted autoregressive sequence models. In ECCV, pages 178–196, 2022.
  • Chen et al. [2023a] H. Chen, Z. Xu, Z. Gu, J. Lan, X. Zheng, Y. Li, C. Meng, H. Zhu, and W. Wang. DiffUTE: Universal text editing diffusion model. In NeurIPS, pages 63062–63074, 2023a.
  • Chen et al. [2023b] J. Chen, Y. Huang, T. Lv, L. Cui, Q. Chen, and F. Wei. TextDiffuser: Diffusion models as text painters. In NeurIPS, pages 9353–9387, 2023b.
  • Chen et al. [2024] J. Chen, Y. Huang, T. Lv, L. Cui, Q. Chen, and F. Wei. TextDiffuser-2: Unleashing the power of language models for text rendering. In ECCV, pages 386–402, 2024.
  • Chng et al. [2019] C.K. Chng, Y. Liu, Y. Sun, C.C. Ng, C. Luo, Z. Ni, C.M. Fang, S. Zhang, J. Han, E. Ding, J. Liu, D. Karatzas, C.S. Chan, and L. Jin. ICDAR2019 robust reading challenge on arbitrary-shaped text-rrc-art. In ICDAR, pages 1571–1576, 2019.
  • Dosovitskiy et al. [2021] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • Du et al. [2022] Y. Du, Z. Chen, C. Jia, X. Yin, T. Zheng, C. Li, Y. Du, and Y. Jiang. SVTR: Scene text recognition with a single visual model. In IJCAI, pages 884–890, 2022.
  • Du et al. [2025a] Y. Du, Z. Chen, C. Jia, X. Yin, C. Li, Y. Du, and Y.-G. Jiang. Context perception parallel decoder for scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell., 47(6):4668–4683, 2025a.
  • Du et al. [2025b] Y. Du, Z. Chen, Y. Su, C. Jia, and Y.-G. Jiang. Instruction-guided scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell., 47(4):2723–2738, 2025b.
  • Du et al. [2025c] Y. Du, Z. Chen, H. Xie, C. Jia, and Y. Jiang. SVTRv2: CTC beats encoder-decoder models in scene text recognition. In ICCV, 2025c.
  • Fang et al. [2021] S. Fang, H. Xie, Y. Wang, Z. Mao, and Y. Zhang. Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition. In CVPR, pages 7098–7107, 2021.
  • Fang et al. [2022] S. Fang, Z. Mao, H. Xie, Y. Wang, C. Yan, and Y. Zhang. Abinet++: Autonomous, bidirectional and iterative language modeling for scene text spotting. IEEE Trans. Pattern Anal. Mach. Intell., 45(6):7123–7141, 2022.
  • Gu et al. [2022] J. Gu, X. Meng, G. Lu, L. Hou, N. Minzhe, X. Liang, L. Yao, R. Huang, W. Zhang, X. Jiang, C. Xu, and H. Xu. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. In NeurIPS, pages 26418–26431, 2022.
  • Gupta et al. [2016] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In CVPR, pages 2315–2324, 2016.
  • He et al. [2018] M. He, Y. Liu, Z. Yang, S. Zhang, C. Luo, F. Gao, Q. Zheng, Y. Wang, X. Zhang, and L. Jin. ICPR2018 contest on robust reading for multi-type web images. In ICPR, pages 7–12, 2018.
  • Santoso et al. [2024] J. Santoso, C. Simon, and Williem. On manipulating scene text in the wild with diffusion models. In WACV, pages 5202–5211, 2024.
  • Jaderberg et al. [2014] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. In Workshop on Deep Learning, NeurIPS, 2014.
  • Ji et al. [2024] J. Ji, G. Zhang, Z. Wang, B. Hou, Z. Zhang, B. Price, and S. Chang. Improving diffusion models for scene text editing with dual encoders. TMLR, 2024.
  • Jiang et al. [2023] Q. Jiang, J. Wang, D. Peng, C. Liu, and L. Jin. Revisiting scene text recognition: A data perspective. In ICCV, pages 20543–20554, 2023.
  • Karatzas et al. [2013] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. Gomez i Bigorda, S. Robles Mestre, J. Mas, D. Fernandez Mota, J. Almazàn, and L.P. de las Heras. ICDAR 2013 robust reading competition. In ICDAR, pages 1484–1493, 2013.
  • Karatzas et al. [2015] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V.R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015.
  • Li et al. [2022] C. Li, W. Liu, R. Guo, X. Yin, K. Jiang, Y. Du, Y. Du, L. Zhu, B. Lai, X. Hu, D. Yu, and Y. Ma. PP-OCRv3: More attempts for the improvement of ultra lightweight ocr system. arXiv preprint arXiv:2206.03001, 2022.
  • Li et al. [2019] H. Li, P. Wang, C. Shen, and G. Zhang. Show, attend and read: A simple and strong baseline for irregular text recognition. In AAAI, pages 8610–8617, 2019.
  • Long and Yao [2020] S. Long and C. Yao. Unrealtext: Synthesizing realistic scene text images from the unreal world. In CVPR, 2020.
  • Luo et al. [2019] C. Luo, L. Jin, and Z. Sun. Moran: A multi-object rectified attention network for scene text recognition. Pattern Recognit., 90:109–118, 2019.
  • Mishra et al. [2012] A. Mishra, K. Alahari, and CV Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012.
  • Nayef et al. [2019] N. Nayef, C. Liu, J. Ogier, Y. Patel, M. Busta, P.N. Chowdhury, D. Karatzas, W. Khlif, J. Matas, U. Pal, and J. Burie. ICDAR2019 robust reading challenge on multi-lingual scene text detection and recognition-rrc-mlt-2019. In ICDAR, pages 1582–1587, 2019.
  • Phan et al. [2013] T.Q. Phan, P. Shivakumara, S. Tian, and C.L. Tan. Recognizing text with perspective distortion in natural scenes. In ICCV, pages 569–576, 2013.
  • Qu et al. [2023] Y. Qu, Q. Tan, H. Xie, J. Xu, Y. Wang, and Y. Zhang. Exploring stroke-level modifications for scene text editing. In AAAI, pages 2119–2127, 2023.
  • Rang et al. [2024] M. Rang, Z. Bi, C. Liu, Y. Wang, and K. Han. An empirical study of scaling law for scene text recognition. In CVPR, pages 15619–15629, 2024.
  • Risnumawan et al. [2014] A. Risnumawan, P. Shivakumara, C.S. Chan, and C.L. Tan. A robust arbitrary text detection system for natural scene images. Expert Syst. Appl., 41(18):8027–8048, 2014.
  • Rombach et al. [2022] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
  • Schuhmann et al. [2021] C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T. Coombes, J. Jitsev, and A. Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  • Sheng et al. [2019] F. Sheng, Z. Chen, and B. Xu. NRTR: A no-recurrence sequence-to-sequence model for scene text recognition. In ICDAR, pages 781–786, 2019.
  • Shi et al. [2016] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell., 39(11):2298–2304, 2016.
  • Shi et al. [2017] B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. Belongie, S. Lu, and X. Bai. ICDAR2017 competition on reading chinese text in the wild (rctw-17). In ICDAR, pages 1429–1434, 2017.
  • Shi et al. [2018] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai. ASTER: An attentional scene text recognizer with flexible rectification. IEEE Trans. Pattern Anal. Mach. Intell., 41(9):2035–2048, 2018.
  • Singh et al. [2021] A. Singh, G. Pang, M. Toh, J. Huang, W. Galuba, and T. Hassner. TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text. In CVPR, pages 8802–8812, 2021.
  • Sun et al. [2019] Y. Sun, Z. Ni, C.K. Chng, Y. Liu, C. Luo, C.C. Ng, J. Han, E. Ding, J. Liu, D. Karatzas, C.S. Chan, and L. Jin. ICDAR 2019 competition on large-scale street view text with partial labeling-rrc-lsvt. In ICDAR, pages 1557–1562, 2019.
  • Tuo et al. [2024] Y. Tuo, W. Xiang, J.Y. He, Y. Geng, and X. Xie. AnyText: Multilingual visual text generation and editing. In ICLR, 2024.
  • Veit et al. [2016] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie. COCO-Text: Dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140, 2016.
  • Wang et al. [2011] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, pages 1457–1464, 2011.
  • Wang et al. [2022] Y. Wang, H. Xie, S. Fang, M. Xing, J. Wang, S. Zhu, and Y. Zhang. Petr: Rethinking the capability of transformer-based language model in scene text recognition. IEEE Trans. Image Process., 31:5585–5598, 2022.
  • Wang et al. [2025] Y. Wang, W. Zhang, H. Xu, and C. Jin. Dreamtext: High fidelity scene text synthesis. In CVPR, pages 28555–28563, 2025.
  • Wei et al. [2024] J. Wei, H. Zhan, Y. Lu, X. Tu, B. Yin, C. Liu, and U. Pal. Image as a language: Revisiting scene text recognition via balanced, unified and synchronized vision-language reasoning network. In AAAI, pages 5885–5893, 2024.
  • Wu et al. [2019] L. Wu, C. Zhang, J. Liu, J. Han, J. Liu, E. Ding, and X. Bai. Editing text in the wild. In ACM MM, pages 1500–1508, 2019.
  • Xu et al. [2024] J. Xu, Y. Wang, H. Xie, and Y. Zhang. OTE: Exploring accurate scene text recognition using one token. In CVPR, pages 28327–28336, 2024.
  • Yang et al. [2020] Q. Yang, J. Huang, and W. Lin. SwapText: Image based texts transfer in scenes. In CVPR, pages 14700–14709, 2020.
  • Yang et al. [2023] Y. Yang, D. Gui, Y. Yuan, W. Liang, H. Ding, H. Hu, and K. Chen. GlyphControl: Glyph conditional control for visual text generation. In NeurIPS, pages 44050–44066, 2023.
  • Zeng et al. [2024] W. Zeng, Y. Shu, Z. Li, D. Yang, and Y. Zhou. TextCtrl: Diffusion-based scene text editing with prior guidance control. In NeurIPS, pages 138569–138594, 2024.
  • Zhai et al. [2016] C. Zhai, Z. Chen, J. Li, and B. Xu. Chinese image text recognition with blstm-ctc: a segmentation-free method. In CCPR, pages 525–536, 2016.
  • Zhan et al. [2018] F. Zhan, S. Lu, and C. Xue. Verisimilar image synthesis for accurate detection and recognition of texts in scenes. In ECCV, pages 249–266, 2018.
  • Zhang et al. [2021] C. Zhang, W. Ding, G. Peng, F. Fu, and W. Wang. Street view text recognition with deep learning for urban scene understanding in intelligent transportation systems. IEEE Trans. Intell. Transp., 22(7):4727–4743, 2021.
  • Zhang et al. [2019] R. Zhang, Y. Zhou, Q. Jiang, Q. Song, N. Li, K. Zhou, L. Wang, D. Wang, M. Liao, M. Yang, X. Bai, B. Shi, D. Karatzas, S. Lu, and C.V. Jawahar. ICDAR 2019 robust reading challenge on reading chinese text on signboard. In ICDAR, pages 1577–1581, 2019.
  • Zhang et al. [2024] Z. Zhang, N. Lu, M. Liao, Y. Huang, C. Li, M. Wang, and W. Peng. Self-distillation regularized connectionist temporal classification loss for text recognition: A simple yet effective approach. In AAAI, pages 7441–7449, 2024.
  • Zhao and Lian [2024] Y. Zhao and Z. Lian. UDiffText: A unified framework for high-quality text synthesis in arbitrary images via character-aware diffusion models. In ECCV, pages 217–233, 2024.
  • Zheng et al. [2024] T. Zheng, Z. Chen, S. Fang, H. Xie, and Y.-G. Jiang. CDistNet: Perceiving multi-domain character distance for robust text recognition. Int. J. Comput. Vis., 132(2):300–318, 2024.
  • Zhu et al. [2023] Y. Zhu, Z. Li, T. Wang, M. He, and C. Yao. Conditional text image generation with diffusion models. In CVPR, pages 14235–14245, 2023.
  • Zhu et al. [2024] Y. Zhu, J. Liu, F. Gao, W. Liu, X. Wang, P. Wang, F. Huang, C. Yao, and Z. Yang. Visual text generation in the wild. In ECCV, pages 89–106, 2024.


Supplementary Material

6 More Implementation Details

6.1 Model specific settings

In the setup of Sec.3.2, we consider the specific characteristics of scene text: L is set to 25, and P_M is configured to 128, with the background pixel value assumed to be 255. It is worth noting that our method is not limited to this configuration, which theoretically enables the generation of text up to 255 characters in length, with the background value set to 255 and character values ranging from 0 to 254. Each character image is set to a 64×64 square, resulting in a character glyph image of size 25×64×64. The deformed ViT is configured with a patch size of 8, generating a latent feature vector of size 65×1024, where 1024 is the dimensionality of the control information required by the CDM.
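Assembling the fixed-size glyph tensor under this configuration (L=25, 64×64 glyphs, background value 255) can be sketched as follows. The helper `build_glyph_tensor` is illustrative; real glyphs would be rendered with the Puhui font rather than the blank placeholders used here.

```python
import numpy as np

L, GLYPH = 25, 64     # max characters and glyph side length from the paper's setup
BACKGROUND = 255      # background pixel value reserved for padding/empty slots

def build_glyph_tensor(char_images):
    """Stack per-character glyph images into a fixed (25, 64, 64) tensor,
    filling unused character slots with the background value."""
    assert len(char_images) <= L, "only the first 25 characters are rendered"
    tensor = np.full((L, GLYPH, GLYPH), BACKGROUND, dtype=np.uint8)
    for i, img in enumerate(char_images):
        tensor[i] = img
    return tensor

# Three dummy all-black glyphs standing in for rendered characters.
glyphs = [np.zeros((GLYPH, GLYPH), dtype=np.uint8) for _ in range(3)]
t = build_glyph_tensor(glyphs)
print(t.shape)  # (25, 64, 64); slots 3..24 stay at the background value 255
```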

After passing through the VAE in Sec.3.3, the outputs uniformly have dimensions [4, 16, 16], while Z_M is formatted as [1, 16, 16]. These are concatenated to form [13, 16, 16]. Therefore, the Conv2d layer has an input dimension of 13 and an output dimension of 4, producing an output of [4, 16, 16] that matches the original input dimensions of the U-Net.
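The channel bookkeeping above can be sketched with plain NumPy. Note the assumptions: the kernel size of the Conv2d layer is not stated in the paper, so the channel-mixing step is shown as a 1×1 convolution (a per-pixel linear map over channels) purely to illustrate the 13→4 dimension change.

```python
import numpy as np

# Three VAE latents of shape (4, 16, 16) plus the mask latent Z_M of shape
# (1, 16, 16), concatenated along the channel axis: 4+4+4+1 = 13 channels.
latents = [np.random.randn(4, 16, 16) for _ in range(3)]
z_m = np.random.randn(1, 16, 16)
x = np.concatenate(latents + [z_m], axis=0)
print(x.shape)  # (13, 16, 16)

# Illustrative 1x1 convolution: a learned (out_channels, in_channels) matrix
# applied at every spatial position, mapping 13 channels down to 4.
w = np.random.randn(4, 13)
y = np.einsum("oc,chw->ohw", w, x)
print(y.shape)  # (4, 16, 16), matching the U-Net's expected input channels
```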

To meet the rendering requirements for visible characters in the majority of languages, we employ Puhui, an open-source font that is free for commercial use. It adheres to the latest Chinese national standard, GB18030-2022, and supports 178 languages.

6.2 More Details for Datasets

We will further elaborate on the data processing related to training and generation.

Training Data. To train our generative model, we utilize the large-scale multilingual text image dataset AnyWord-3M[41]. It contains real annotated text boxes and text contents designed for scene text detection and recognition tasks, which we collectively call AnyWord-Scene. This collection includes a range of popular datasets such as ArT[6], COCO-Text[42], RCTW[37], LSVT[40], MLT[28], MTWI[16], and ReCTS[55]. In addition, two larger datasets are included: AnyWord-Wukong[14] and AnyWord-Laion[34], which provide a large collection of images with bounding boxes and text content obtained by the PP-OCRv3[23] detection and recognition model. We filter out anomalous images with pure white backgrounds from the AnyWord-3M dataset and, when cropping local images, minimize the inclusion of white borders so as to reflect real-world conditions. The processed AnyWord-Wukong and AnyWord-Laion datasets together contain a total of 3,430,412 complete images, from which we crop 14,856,392 local regions containing text instances for the first training stage. In the second training stage, we utilize 78,395 full images from the processed AnyWord-Scene, cropping 201,599 text regions from them.

Computational Overhead and Runtime Efficiency. The training times for our three stages (VAE fine-tuning, UNet pretraining, and UNet fine-tuning) are 192, 400, and 200 GPU-hours, respectively. For inference, using a single RTX-3090 GPU with diffusion_steps=20 and batch_size=32, 5k batches take 26 hours, averaging 0.59 seconds per image.
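The reported per-image inference time follows directly from the batch figures, as this quick check shows (0.585 s, quoted above as roughly 0.59 s):

```python
# Verify the inference throughput: 5k batches of 32 images in 26 hours.
batches, batch_size = 5_000, 32
hours = 26
seconds_per_image = hours * 3600 / (batches * batch_size)
print(seconds_per_image)  # 0.585, i.e. ~0.59 s per image as reported
```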


Figure 7: The pipeline comparison between TextSSR and previous methods. In (c), “optional” indicates that either or both options may be provided; in (d), it indicates that the base image can be used with or without text.

Table 6: Quantitative comparison of multilingual text image generation methods. For each language, 100 images are used for test.

Data for Accuracy Evaluation. In the Accuracy Evaluation, we employ datasets that the model has not previously encountered. They represent a range of image difficulty levels and cover multiple languages. Specifically, we use the following datasets for evaluation:

  • IC13[21]: This benchmark is designed for relatively regular text detection and recognition. We use 233 full images and 917 cropped text images from the test set for evaluation.
  • IC15[22]: This dataset contains more challenging real-world scene text, derived from incidental scene captures where the text was not the primary focus. We employ 500 full images and 2,077 cropped text images for evaluation.
  • Shopsign[54]: This dataset consists of Chinese scene text, primarily from shop signs. We select 183 full images and 932 cropped text images from this dataset.

Base Data for Scalable Generation. We utilize TextOCR[39], a large-scale scene text dataset containing 25,119 images, as the source of base images for synthetic data generation. To mimic realistic conditions where unlabeled data is abundant, we use pseudo-labels generated by the PP-OCRv3 model, thus simulating a scenario without human annotation.

Generation of TextSSR-F. After the previous step, we obtain 188,526 text regions annotated by the PP-OCRv3[23] model. Based on the anagram method (described in Section3.4), we expand the data and filter with the SVTRv2[11] model, yielding a final dataset of 3,551,396 fully usable text instances, referred to as TextSSR-F.
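The anagram-based expansion can be sketched as below. The function `anagram_expand` and its `limit` cap are illustrative only; the paper's actual Section 3.4 procedure may sample permutations differently.

```python
from itertools import permutations

def anagram_expand(word, limit=5):
    """Sketch of anagram-based expansion: derive new target strings for a
    text region by permuting the characters of its pseudo-label.
    `limit` is an illustrative cap on the number of variants."""
    seen, out = {word}, []
    for perm in permutations(word):
        cand = "".join(perm)
        if cand not in seen:   # skip the original label and duplicates
            seen.add(cand)
            out.append(cand)
        if len(out) >= limit:
            break
    return out

print(anagram_expand("text"))  # five character-level permutations of "text"
```

Since every variant reuses the same character set and region geometry, this is what lets generation scale combinatorially before the SVTRv2 filter prunes unusable samples.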

Quality Filtering Bias. Our filtering process employs a double-check mechanism: the generation model attempts to render the given label, and SVTRv2 is then used to verify that the recognized text matches that label. An error slips through only if both the generation model and SVTRv2 fail simultaneously. This pipeline ensures the correctness of most TextSSR-F instances. To validate this, we randomly sample 300 instances from TextSSR-F and recruit three assessors, each checking 100 instances. The average accuracy is 98.67% (98%, 98%, 100%).

Impact of Pseudo-Labeling. When OCR pseudo-label errors occur, the generation process still follows the pseudo-labels (see examples in Fig.8), so the side effect is relatively limited.


Figure 8: Examples of correct rendering despite incorrect English and Chinese pseudo-labels.

6.3 Training and Evaluation Details

6.3.1 Training Details

We fine-tune our generative model based on Stable Diffusion 2.1 (SD 2.1)[33] on eight NVIDIA 3090 GPUs. First, we train the VAE on the full AnyWord dataset with a total batch size of 512 and 256×256 image patches for 150k steps. Then, we freeze the VAE and train the CDM in two stages: 50k pre-training steps on the AnyWord-Wukong and AnyWord-Laion datasets, followed by 25k fine-tuning steps on the AnyWord-Scene dataset, using a total batch size of 256.

6.3.2 Accuracy Evaluation Details

For a fair comparison in our accuracy evaluation, we render all visible bounding boxes and contents annotated in the test datasets. In cases where certain models cannot render longer texts or handle multiple text instances per image, we restrict the input information to within their acceptable ranges while padding the missing portions. Our model is likewise limited: only the first 25 characters are rendered as single-character features. For all models, the number of timesteps in the sampling process is set to 20. The evaluation code for generated results is based on the open-source evaluation scripts from AnyText[41] and UDiffText[57]. Except for GlyphControl[50], which requires additional image descriptions to function properly, the other methods only use their predefined text prompts.


Figure 9: Visualization of synthesized multilingual examples.

6.3.3 Expanded Multilingual Evaluation

We have added four languages (French, German, Japanese, and Traditional Chinese) and use the multilingual version of SVTRv2 for evaluation (see Tab.6). We also provide illustrative examples in Fig.9 to validate the generalization to non-English languages. TextSSR generates correct instances while others mostly fail.

6.3.4 Realism and Scalability Evaluation Details

In the Realism and extended experiments, CRNN is trained with the codebase of [1], using a batch size of 64 on a single 3090 GPU for 10k steps. The data augmentation configurations preset in the codebase are used throughout training.

6.3.5 Usability Assessment Details

In the Usability experiments, we train two widely-used STR models, CRNN[36] and MAERec[20], on the generated data to assess the effectiveness of our synthetic data in enhancing STR performance. All models are trained using the OpenOCR framework with a total batch size of 1024 on four 3090 GPUs for 20 epochs.

To demonstrate that TextSSR significantly enhances the performance of STR models in challenging scenarios, we conduct a small-scale validation experiment. We limit the dataset size to 429k and train an identical NRTR[35] model under the same configuration for comparison. The results indicate that the model trained on TextSSR-F performs more robustly under challenging text conditions such as perspective distortion and blurring. We provide visual comparisons in Fig.10, showcasing TextSSR's superior performance in recognizing low-resolution and perspective-distorted text.


Figure 10: Visualization of recognition results on NRTR.

6.3.6 Ablation Study Details

The ablation study can be considered a simplified version of the second training stage, with all settings kept consistent except that training is reduced to 5k steps. To align with the full-image inference process used in other methods, the image size is set to 512×512, although training is conducted at a resolution of 256. The “Char-Glyph” ablation removes the glyph condition from CDM training, while the “Char-Position” ablation renders all characters uniformly at a pixel value of 127.


Figure 11: Visualization results of TextSSR with and without character-level glyph prior.

7 Visualization

7.1 Visualization Analysis

Fig.4 sequentially simulates various situations, including English text in a regular scene, text under challenging conditions, and Chinese text in a natural setting. TextSSR consistently generates accurate and high-quality visual text, demonstrating several powerful capabilities: (1) it can synthesize arbitrary text with standard glyphs from any language, as shown in examples of both Chinese and English; (2) it learns font style information from surrounding context, such as the font color in Sample 1, which is derived from the horizontal line below; (3) it synthesizes correct text even without strong background information, as illustrated in Sample 2, where the local image provides no usable information for imitation; and (4) it exhibits scale invariance, allowing for text synthesis in scenes of any size, with the three samples representing large, small, and medium text sizes, respectively.


Figure 12: Users select a scene image as the base, mark a mask in the designated area, and then input the text content to be written. After processing, the desired text region and the edited original image are obtained.

7.2 Function Demonstration Platform

We have implemented the inference process and built a demonstration platform. To ensure that the user input matches the label format used during training, we recalculate text boxes aligned with the input text location after the user applies the mask. As shown in Fig.12, the text is roughly displayed within the user-specified area, though it does not follow the mask strictly.

7.3 Ablation Visualization Results

Fig.13 and Fig.11 illustrate the impact of character-level position and glyph on the rendering results of TextSSR. The results indicate that omitting either component leads to issues such as character deformation, incorrect characters, and duplication errors in some cases, further supporting the findings of the ablation study.


Figure 13: Visualization results of TextSSR with and without character-level position prior.

7.4 More Visualization Results

To provide a detailed illustration of TextSSR's synthesis process and effectiveness, Fig.14 shows how it reconstructs local images from original regions and crops them to obtain final results. Comparisons with ground truth demonstrate TextSSR's strong synthesis capabilities across diverse scenarios, including regular, low-resolution, curved, perspective, multilingual, and multi-oriented text.

7.5 Failure Cases

It is important to note that our synthesis method is not flawless and has certain limitations. Fig.15 presents several common failure cases, which can be attributed to the following reasons:

  1. Long text: Excessively long text can confuse the model, resulting in disordered text images. This issue is exacerbated by the limited amount of training data for such cases.
  2. Blurred regions: When the text region itself is excessively blurred, the model struggles to accurately reconstruct and synthesize the text.
  3. Multi-directional text: The model, primarily trained on horizontally aligned text, faces challenges with multi-directional text, especially vertical text. Applying rotation-based post-processing, as used in STR methods, could be a potential solution.
  4. Incorrect text labels: Errors in manual labeling can lead to mismatches between the rendered regions and their corresponding labels.
  5. Language characteristics: Performance on Chinese text is generally worse than on English, due to the larger character set and the complexity of Chinese characters.

Although challenging text instances are a minority in quantity, handling them is also important. We plan to tackle these instances as follows: splitting long text into shorter segments, simulating blur by adding noise, augmenting multi-directional text via rotation of common instances, leveraging rendering-based data for pretraining on multilingual characters, etc.
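The first mitigation idea, splitting long text into shorter segments, could look like the sketch below. The function `split_long_text` and its word-first strategy are our own illustration; the cap matches the model's 25-character limit but is configurable.

```python
MAX_LEN = 25  # matches the model's per-instance character limit

def split_long_text(text, max_len=MAX_LEN):
    """Split on spaces first, then hard-split any chunk still longer
    than max_len, so every piece fits the generator's length budget."""
    pieces = []
    for word in text.split():
        while len(word) > max_len:
            pieces.append(word[:max_len])
            word = word[max_len:]
        pieces.append(word)
    return pieces

# A small max_len makes the hard-split behavior visible.
print(split_long_text("supercalifragilisticexpialidocious example", max_len=10))
```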

8 Discussion

Our study has limitations and avenues for further research, including the following: (1) The text location and the text content must be paired. While we utilize the anagram-based method to mitigate this issue, we will design methods for reasonable, large-scale usable pairings for broader synthesis. (2) Currently, large-scale synthesis post-processing relies on an STR model; we aim to integrate a self-checking mechanism into the framework to verify the correctness of the synthesized output. This could further enhance learning and adjust the arrangement of text locations until usable text is correctly generated. (3) Since the available large-scale scene text images have already been used for training, we plan to collect a larger dataset of unseen text images as base images, creating a more extensive synthetic dataset to benefit the STR community. (4) Although generation is related to the surrounding context, TextSSR currently does not fully exploit it due to the lack of a customized design. However, by substituting certain TextSSR components, e.g., replacing the anagram expansion or using an LLM to recommend contextually appropriate content, TextSSR can largely alleviate this while the remaining components are reused. We will improve semantic diversity and contextual realism along these lines in future work. (5) While our primary focus is on STR, our approach can also benefit other downstream tasks. For example, by directly writing text onto the background or editing original text, our method can generate new data for text detection and document understanding. We also agree that domain generalization is a valuable topic, and we will investigate other downstream applications and broader specialized domains in future work.


Figure 14: More visualization results for TextSSR.


Figure 15: Failure Cases. We show several unsatisfactory synthesis results produced by TextSSR.
