carminezacc committed on
Commit 23758fa · verified · 1 Parent(s): acb9229

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +131 -100
  2. configuration_eruku.py +52 -0
  3. modeling_eruku.py +418 -0
README.md CHANGED
@@ -1,127 +1,159 @@
  ---
- license: mit
  tags:
  - handwriting-generation
  - text-to-image
  - autoregressive
  - pytorch
- - eruku
- - text-image-generation
- pipeline_tag: image-to-text
  ---

  # Eruku - Autoregressive Styled Text Image Generation

- **Eruku** is a state-of-the-art autoregressive model for styled text image generation, particularly excelling at handwritten text generation (HTG).

- 📄 **Paper**: ["Autoregressive Styled Text Image Generation, but Make it Reliable"](https://arxiv.org/abs/2510.23240)
- 🎮 **Demo**: [HuggingFace Space](https://huggingface.co/spaces/blowing-up-groundhogs/eruku-demo)

- ## Model Description

- Eruku addresses key limitations of previous handwriting generation methods while maintaining their strengths:

- - **No Style Text Required**: Unlike previous methods, Eruku doesn't require transcriptions of style images
- - 🎯 **Reliable Generation**: Proper stop mechanism prevents repetition loops and visual artifacts
- - 🔤 **Special Token Alignment**: Introduces special textual tokens (SOG/EOG) for better alignment between text and visual representations
- - ⚡ **Classifier-Free Guidance**: Implements CFG for improved control over style adherence and text fidelity
- - 📏 **Arbitrary Length**: Can generate text images of any length without architectural constraints

- ## Architecture
-
- The model combines:
-
- - **T5 Transformer**: Autoregressive text encoder for understanding and generation control
- - **VAE (Variational Autoencoder)**: Efficient image tokenizer (from `blowing-up-groundhogs/emuru_vae`)
- - **OrigamiNet OCR**: For auxiliary OCR loss during training
-
- ## Model Files
-
- - `000073688.pth` - Main trained model weights (8.0 GB)
- - `origami.pth` - OCR model checkpoint (OrigamiNet, 41 MB)

- ## Usage

  ```python
  import torch
- from huggingface_hub import hf_hub_download
- from eruku_continuous_inf import Emuru

- # Download checkpoints
- model_checkpoint = hf_hub_download(
-     repo_id="blowing-up-groundhogs/eruku",
-     filename="000073688.pth"
  )

- ocr_checkpoint = hf_hub_download(
-     repo_id="blowing-up-groundhogs/eruku",
-     filename="origami.pth"
- )

- # Initialize model
- model = Emuru(
-     t5_checkpoint='google-t5/t5-base',
-     vae_checkpoint='blowing-up-groundhogs/emuru_vae',
-     ocr_checkpoint=ocr_checkpoint,
-     slices_per_query=1,
-     channels=1
  )

- # Load trained weights
- checkpoint = torch.load(model_checkpoint, map_location='cpu')
- model.load_state_dict(checkpoint, strict=False)
- model.eval()

- # Generate handwriting
- style_text = ""  # Optional
- gen_text = "Hello World!"

- # Prepare inputs
- inputs = model.get_model_inputs(
-     style_img=[torch.ones(1, 1, 64)],  # Minimal style image
-     gen_img=None,
-     style_len=64,
-     gen_len=None,
-     max_img_len=128 * 8
- )

- # Generate
- output_img, _ = model.generate(
-     decoder_inputs_embeds_vae=inputs['decoder_inputs_embeds'],
-     style_text=[style_text],
-     gen_text=[gen_text],
-     cfg_scale=1.5,
-     max_new_tokens=128
- )
- ```

- ## Performance Highlights

- From the paper, Eruku demonstrates:

- - **Superior Text Adherence**: Lower Character Error Rate (CER) compared to previous methods
- - **Better Generalization**: Excellent performance on both handwritten and typewritten styles
- - **Style Consistency**: High-fidelity style replication while maintaining readability
- - **Efficient Training**: Simpler training process without requiring auxiliary networks

- ## Training

- The model was trained in two stages:

- 1. **Stage 1**: Pre-training on large-scale synthetic and real handwriting datasets
- 2. **Stage 2**: Fine-tuning with longer text sequences and dropout strategies

- Training details are available in the paper.

- ## Limitations

- - Best performance on English text (model trained primarily on English datasets)
- - Very long texts (>200 tokens) may require chunking
- - Style transfer quality depends on the style reference provided

- ## Citation

- If you use Eruku in your research, please cite:

  ```bibtex
  @InProceedings{pippi2025zeroshot,
@@ -137,26 +169,25 @@ If you use Eruku in your research, please cite:
  author = {Carmine Zaccagnino and Fabio Quattrini and Vittorio Pippi and Silvia Cascianelli and Alessio Tonioni and Rita Cucchiara},
  title = {Autoregressive Styled Text Image Generation, but Make it Reliable},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
- month=3,
- year = 2026
  }
  ```

- ## Authors

- - Carmine Zaccagnino (University of Modena and Reggio Emilia)
- - Fabio Quattrini (University of Modena and Reggio Emilia)
- - Vittorio Pippi (University of Modena and Reggio Emilia)
- - Silvia Cascianelli (University of Modena and Reggio Emilia)
- - Alessio Tonioni (Google)
- - Rita Cucchiara (University of Modena and Reggio Emilia)

- ## License

- MIT License

- ## Related

- - **VAE Model**: [blowing-up-groundhogs/emuru_vae](https://huggingface.co/blowing-up-groundhogs/emuru_vae)
- - **Demo Space**: [HuggingFace Space](https://huggingface.co/spaces/blowing-up-groundhogs/eruku-demo)
- - **Paper**: [arXiv:2510.23240](https://arxiv.org/abs/2510.23240)
  ---
+ license: apache-2.0
  tags:
  - handwriting-generation
+ - styled-text-generation
  - text-to-image
  - autoregressive
+ - vision
+ - transformers
  - pytorch
+ language:
+ - en
+ pipeline_tag: image-to-image
+ library_name: transformers
  ---

  # Eruku - Autoregressive Styled Text Image Generation

+ <p align="center">
+   <img src="https://img.shields.io/badge/CVPR-2025-blue" alt="CVPR 2025">
+   <img src="https://img.shields.io/badge/WACV-2026-green" alt="WACV 2026">
+   <img src="https://img.shields.io/badge/License-Apache%202.0-yellow" alt="License">
+ </p>

+ **Eruku** is a state-of-the-art autoregressive model for styled handwritten and typewritten text image generation. Given a style reference image and text to generate, it produces high-quality text images that faithfully replicate the input style.

+ ## 🌟 Key Features

+ - **Zero-shot style transfer**: No training required for new styles
+ - **No transcription required**: Works with just a style image (transcription optional but helps)
+ - **Reliable generation**: Proper EOG (End of Generation) mechanism prevents artifacts
+ - **Arbitrary length**: Generate text of any length
+ - **High fidelity**: Excellent style consistency and text readability
+ - **Classifier-Free Guidance**: Fine control over generation quality
+
+ ## 📦 Installation
+
+ ```bash
+ pip install torch torchvision transformers diffusers einops pillow
+ ```
+
+ ## 🚀 Quick Start
+
  ```python
+ from transformers import AutoModel
+ from PIL import Image
  import torch
+
+ # Load model
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model = AutoModel.from_pretrained(
+     "blowing-up-groundhogs/eruku",
+     trust_remote_code=True
  )
+ model.to(device)
+ model.eval()
+
+ # Load a style image (handwritten/typewritten text sample)
+ style_image = Image.open("style_sample.png")
+
+ # Generate text in that style
+ result = model.generate_handwriting(
+     style_image=style_image,
+     gen_text="Hello, World!",
+     style_text="",   # Optional: transcription of style image
+     cfg_scale=1.25,  # Classifier-free guidance scale
  )
+
+ # Save the result
+ result.save("generated.png")
+ ```
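+
+ `generate_handwriting` returns only the newly generated portion: internally it crops away the style-image region and trims surrounding whitespace before converting the result to a PIL image (see `modeling_eruku.py` below).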
 
+ ## 📖 Detailed Usage
+
+ ### Input Format
+
+ The model takes three inputs:
+
+ 1. **Style Image** (`style_image`): A PIL Image containing handwritten or typewritten text that serves as the style reference. The model will replicate this style.
+
+ 2. **Generation Text** (`gen_text`): The text you want to render in the extracted style.
+
+ 3. **Style Text** (`style_text`, optional): The transcription of the text in the style image. Providing this helps the model better understand the style, but it's not required (see the sketch below).
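+
+ For instance, a minimal sketch passing all three inputs (file names and texts are placeholders; `model` as loaded in the Quick Start):
+
+ ```python
+ from PIL import Image
+
+ style = Image.open("note_sample.png")       # style reference whose text is known
+ result = model.generate_handwriting(
+     style_image=style,
+     gen_text="meet me at noon",             # text to render in the reference style
+     style_text="Dear John, thank you for",  # transcription of the style image
+ )
+ result.save("styled_note.png")
+ ```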
+
+ ### Parameters
+
+ | Parameter | Type | Default | Description |
+ |-----------|------|---------|-------------|
+ | `style_image` | PIL.Image | Required | Reference style image |
+ | `gen_text` | str | Required | Text to generate |
+ | `style_text` | str | `""` | Optional transcription of style image |
+ | `cfg_scale` | float | `1.25` | Classifier-free guidance scale |
+ | `max_new_tokens` | int | `512` | Maximum generation tokens |
+
+ ### CFG Scale Guide
+
+ - `1.0`: No guidance (faster but may drift from prompt)
+ - `1.25`: Recommended default - good balance
+ - `1.5-2.0`: Stronger adherence to prompt
+ - `>2.0`: May cause artifacts
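+
+ To choose a value empirically, a quick sweep can help (a sketch reusing `model` and `style_image` from the Quick Start; file names are illustrative):
+
+ ```python
+ # Render the same text at several guidance scales and compare the outputs
+ for scale in (1.0, 1.25, 1.5, 2.0):
+     img = model.generate_handwriting(
+         style_image=style_image,
+         gen_text="The quick brown fox",
+         cfg_scale=scale,
+     )
+     img.save(f"sample_cfg_{scale}.png")
+ ```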
+
+ ## 🖼️ Example Results
+
+ The model excels at:
+ - Handwritten text in various styles (cursive, print, mixed)
+ - Typewritten text with different fonts
+ - Multi-language text (trained primarily on English)
+ - Long text sequences
+
+ ## 📊 Model Architecture
+
+ Eruku combines:
+ - **T5-Large encoder-decoder** for text understanding and autoregressive generation
+ - **VAE (Variational Autoencoder)** for image encoding and decoding
+ - **Custom embeddings** for style transfer and special tokens (SOS, SOG, EOG)
+
+ The model generates images autoregressively, predicting one latent slice at a time until it produces an EOG (End of Generation) token, as sketched below.
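+
+ A simplified sketch of that loop, condensed from `generate()` in `modeling_eruku.py` below (helper names here are illustrative, not the actual API):
+
+ ```python
+ # The T5 decoder emits one hidden state per step; a small linear head
+ # classifies it as SOG (0), EOG (1), or a regular image slice (2).
+ z_sequence = [style_latents]                  # VAE latents of the style image
+ for _ in range(max_new_tokens):
+     hidden = t5_decode(text, z_sequence)      # last decoder hidden state
+     token = special_head(hidden).argmax(-1)   # 0 = SOG, 1 = EOG, 2 = IMG
+     z_sequence.append(to_vae_latent(hidden))  # project back to latent space
+     if token == 1:                            # EOG: generation is complete
+         break
+ image = vae.decode(concat(z_sequence))        # decode all latents to pixels
+ ```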
+
+ ## 🔧 Advanced Usage
+
+ ### Lower-level API
+
+ For more control, you can use the lower-level methods (`model` and `device` as in the Quick Start):
+
+ ```python
+ import torch
+ from PIL import Image
+ from torchvision import transforms as T
+
+ # Prepare style image manually: resize to height 64, keeping the aspect ratio
+ style_img = Image.open("style.png").convert('RGB')
+ width, height = style_img.size
+ new_width = int(64 * width / height)
+ style_img = style_img.resize((new_width, 64), Image.LANCZOS)
+ style_tensor = T.ToTensor()(style_img).to(device)
+
+ # Get model inputs (VAE embeddings of the style image)
+ inputs = model.get_model_inputs(
+     style_img=[style_tensor],
+     style_len=style_tensor.shape[-1],
+     max_img_len=1024 * 1024
+ )
+
+ # Generate with full control
+ with torch.inference_mode():
+     output_img, special_sequence = model.generate(
+         decoder_inputs_embeds_vae=inputs['decoder_inputs_embeds'],
+         style_text=["Style text here"],
+         gen_text=["Text to generate"],
+         cfg_scale=1.25,
+         max_new_tokens=512
+     )
+ ```
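+
+ Here `output_img` is an image tensor in `[-1, 1]` that still includes the style region, and `special_sequence` records the predicted class of each latent slice; the high-level `generate_handwriting` wrapper performs the cropping, whitespace trimming, and PIL conversion for you.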
+
+ ## 📚 Citation
+
+ If you use Eruku in your research, please cite both papers:

  ```bibtex
  @InProceedings{pippi2025zeroshot,
@@ -137,26 +169,25 @@
  author = {Carmine Zaccagnino and Fabio Quattrini and Vittorio Pippi and Silvia Cascianelli and Alessio Tonioni and Rita Cucchiara},
  title = {Autoregressive Styled Text Image Generation, but Make it Reliable},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
+ month = {March},
+ year = {2026}
  }
  ```

+ ## 🔗 Links
+
+ - 📄 **Paper**: [arXiv:2510.23240](https://arxiv.org/abs/2510.23240)
+ - 🌐 **Project Website**: [eruku.carminezacc.com](https://eruku.carminezacc.com)
+ - 🤗 **Demo**: [Hugging Face Space](https://huggingface.co/spaces/carminezacc/eruku)
+ - 🎨 **VAE Model**: [blowing-up-groundhogs/emuru_vae](https://huggingface.co/blowing-up-groundhogs/emuru_vae)
+
+ ## 📜 License
+
+ This model is released under the Apache 2.0 License.
+
+ ## 🙏 Acknowledgments
+
+ - T5: google-t5/t5-large
+ - VAE: blowing-up-groundhogs/emuru_vae
+ - Training datasets: IAM, CVL, RIMES, FontSquare
configuration_eruku.py ADDED
@@ -0,0 +1,52 @@
+ """
+ Eruku Configuration
+
+ Configuration class for the Eruku autoregressive styled text image generation model.
+ """
+
+ from transformers import PretrainedConfig
+
+
+ class ErukuConfig(PretrainedConfig):
+     """
+     Configuration class for Eruku model.
+
+     Args:
+         t5_name_or_path (`str`, *optional*, defaults to `"google-t5/t5-large"`):
+             The name or path of the T5 model to use as the backbone.
+         vae_name_or_path (`str`, *optional*, defaults to `"blowing-up-groundhogs/emuru_vae"`):
+             The name or path of the VAE model for image encoding/decoding.
+         tokenizer_name_or_path (`str`, *optional*, defaults to `"google/byt5-small"`):
+             The name or path of the tokenizer (character-level).
+         slices_per_query (`int`, *optional*, defaults to 1):
+             Number of VAE latent slices per query token.
+         channels (`int`, *optional*, defaults to 1):
+             Number of channels in the VAE latent space.
+         vae_latent_dim (`int`, *optional*, defaults to 8):
+             Dimension of the VAE latent space.
+         cfg_scale (`float`, *optional*, defaults to 1.25):
+             Default classifier-free guidance scale for generation.
+     """
+
+     model_type = "eruku"
+
+     def __init__(
+         self,
+         t5_name_or_path: str = "google-t5/t5-large",
+         vae_name_or_path: str = "blowing-up-groundhogs/emuru_vae",
+         tokenizer_name_or_path: str = "google/byt5-small",
+         slices_per_query: int = 1,
+         channels: int = 1,
+         vae_latent_dim: int = 8,
+         cfg_scale: float = 1.25,
+         **kwargs
+     ):
+         super().__init__(**kwargs)
+         self.t5_name_or_path = t5_name_or_path
+         self.vae_name_or_path = vae_name_or_path
+         self.tokenizer_name_or_path = tokenizer_name_or_path
+         self.slices_per_query = slices_per_query
+         self.channels = channels
+         self.vae_latent_dim = vae_latent_dim
+         self.cfg_scale = cfg_scale
+
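
A quick sketch of instantiating this configuration (standard `PretrainedConfig` behavior; the local import assumes the file is on your `PYTHONPATH`):

```python
from configuration_eruku import ErukuConfig

# Defaults mirror the signature above
config = ErukuConfig()
assert config.t5_name_or_path == "google-t5/t5-large"
assert config.cfg_scale == 1.25

# Fields can be overridden and round-tripped to disk (inherited from PretrainedConfig)
config = ErukuConfig(slices_per_query=2)
config.save_pretrained("./eruku-config")
config = ErukuConfig.from_pretrained("./eruku-config")
```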
modeling_eruku.py ADDED
@@ -0,0 +1,418 @@
+ """
+ Eruku Model - Autoregressive Styled Text Image Generation
+
+ This module implements the Eruku model for autoregressive styled text image generation.
+ Based on the papers:
+ - "Zero-Shot Styled Text Image Generation, but Make It Autoregressive" (CVPR 2025)
+ - "Autoregressive Styled Text Image Generation, but Make it Reliable" (WACV 2026)
+ """
+
+ import torch
+ import torch.nn as nn
+ from typing import Optional, Tuple, List, Union
+ from transformers import PreTrainedModel, T5ForConditionalGeneration, T5Config, AutoTokenizer
+ from diffusers import AutoencoderKL
+ from einops import rearrange, repeat
+ from torch.nn.utils.rnn import pad_sequence
+ from torchvision.transforms import Normalize
+ from PIL import Image
+ import numpy as np
+
+ from .configuration_eruku import ErukuConfig
+
+
+ # Number of special tokens: SOG, EOG, IMG
+ SPECIAL_TOKEN_COUNT = 3
+
+
+ def pad_images(images: List[torch.Tensor], padding_value: float = 1.0) -> torch.Tensor:
+     """Pad a list of images to the same width."""
+     images = [rearrange(img, 'c h w -> w c h') for img in images]
+     padded = rearrange(pad_sequence(images, padding_value=padding_value), 'w b c h -> b c h w')
+     return padded.contiguous()
+
+
+ class ErukuPreTrainedModel(PreTrainedModel):
+     """
+     Base class for Eruku models.
+     """
+     config_class = ErukuConfig
+     base_model_prefix = "eruku"
+     supports_gradient_checkpointing = True
+
+     def _init_weights(self, module):
+         """Initialize weights - handled by sub-components."""
+         pass
+
+
+ class ErukuForConditionalGeneration(ErukuPreTrainedModel):
+     """
+     Eruku model for conditional styled text image generation.
+
+     The model takes a style image (handwritten/typewritten text sample),
+     optional style text (transcription of the style image), and generation
+     text (text to render), and produces an image of the generation text
+     in the style of the reference image.
+
+     Example usage:
+     ```python
+     from transformers import AutoModel
+     from PIL import Image
+     import torch
+
+     # Load model
+     model = AutoModel.from_pretrained("blowing-up-groundhogs/eruku", trust_remote_code=True)
+     model.eval()
+
+     # Generate handwriting
+     result = model.generate_handwriting(
+         style_image=Image.open("style.png"),
+         style_text="Hello",  # optional - text in style image
+         gen_text="World",    # text to generate
+     )
+     result.save("output.png")
+     ```
+     """
+
+     def __init__(self, config: ErukuConfig):
+         super().__init__(config)
+         self.config = config
+
+         # Character-level tokenizer
+         self.tokenizer = AutoTokenizer.from_pretrained(config.tokenizer_name_or_path)
+         self.tokenizer.add_tokens(["<sog>"])
+
+         # T5 backbone
+         t5_config = T5Config.from_pretrained(config.t5_name_or_path)
+         t5_config.vocab_size = len(self.tokenizer)
+         self.T5 = T5ForConditionalGeneration(t5_config)
+         self.T5.lm_head = nn.Identity()
+
+         # Image normalization
+         self.normalize = Normalize(0.5, 0.5)
+
+         # Special token embeddings
+         self.sos = nn.Embedding(1, t5_config.d_model)
+         self.sog = nn.Embedding(1, t5_config.d_model)
+         self.eog = nn.Embedding(1, t5_config.d_model)
+
+         # VAE for image encoding/decoding
+         self.vae = AutoencoderKL.from_pretrained(config.vae_name_or_path)
+         self._freeze_module(self.vae)
+
+         # Projection layers
+         vae_dim = config.vae_latent_dim * config.channels * config.slices_per_query
+         self.query_emb = nn.Linear(vae_dim, t5_config.d_model)
+         self.t5_to_vae = nn.Linear(t5_config.d_model, vae_dim)
+         self.t5_to_special = nn.Linear(t5_config.d_model, SPECIAL_TOKEN_COUNT)
+
+         # Unconditional embedding for CFG
+         self.uncond_embedding = nn.Embedding(1, t5_config.d_model)
+
+         # CFG configuration
+         self.drop_text = False
+         self.drop_img = False
+
+         # Reassemble a sequence of latent slices into a VAE latent image
+         self.z_rearrange = lambda x: rearrange(x, 'b w (q c h) -> b c h (w q)',
+                                                c=config.channels, q=config.slices_per_query)
+
+         self.post_init()
+
+     def _freeze_module(self, module: nn.Module):
+         """Freeze all parameters in a module."""
+         module.eval()
+         for param in module.parameters():
+             param.requires_grad = False
+
+     def _img_encode(self, img: torch.Tensor) -> torch.Tensor:
+         """Encode image to VAE latent space."""
+         img = self.normalize(img)
+         img = img.contiguous()
+         return self.vae.encode(img.float()).latent_dist.sample()
+
+     @torch.no_grad()
+     def get_model_inputs(
+         self,
+         style_img: List[torch.Tensor],
+         style_len: Union[int, List[int]],
+         max_img_len: int = 1024 * 1024
+     ) -> dict:
+         """
+         Prepare model inputs from style images.
+
+         Args:
+             style_img: List of style image tensors [C, H, W]
+             style_len: Width(s) of style images
+             max_img_len: Maximum image length in pixels
+
+         Returns:
+             Dictionary with decoder_inputs_embeds
+         """
+         bs = len(style_img)
+         decoder_inputs_embeds_list = []
+
+         # Pad images to same width
+         style_img_padded = pad_images([el.to(self.T5.device) for el in style_img])
+         style_img_embeds = self._img_encode(style_img_padded)
+
+         for el in range(bs):
+             if isinstance(style_len, int):
+                 sl = style_len
+             else:
+                 sl = int(style_len[el])
+
+             # Ensure width is within bounds
+             sl = max(64, min(sl, style_img_embeds.shape[-1] * 8))
+
+             # Style image embeddings + SOG marker
+             sample_embeds = torch.cat([
+                 style_img_embeds[el, :, :, :sl // 8],
+                 torch.ones(1, 8, 1).to(self.T5.device),  # SOG placeholder
+             ], dim=-1)
+
+             sample_embeds = rearrange(sample_embeds, 'c h w -> w (h c)', h=8, c=1)
+             decoder_inputs_embeds_list.append(sample_embeds)
+
+         decoder_inputs_embeds = pad_sequence(
+             decoder_inputs_embeds_list,
+             padding_value=1,
+             batch_first=True
+         )[:, :max_img_len // 8]
+
+         return {'decoder_inputs_embeds': decoder_inputs_embeds}
+
+     @torch.inference_mode()
+     def generate(
+         self,
+         decoder_inputs_embeds_vae: torch.Tensor,
+         style_text: List[str],
+         gen_text: List[str],
+         cfg_scale: float = 1.25,
+         max_new_tokens: int = 512
+     ) -> Tuple[torch.Tensor, torch.Tensor]:
+         """
+         Generate styled text image autoregressively.
+
+         Args:
+             decoder_inputs_embeds_vae: VAE embeddings of style image
+             style_text: List of style text strings (can be empty)
+             gen_text: List of generation text strings
+             cfg_scale: Classifier-free guidance scale (1.25 recommended)
+             max_new_tokens: Maximum tokens to generate
+
+         Returns:
+             Tuple of (generated_image, special_sequence)
+         """
+         # Encode text
+         encoded_text = self.tokenizer(
+             [f"{style}<sog>{gen}" for style, gen in zip(style_text, gen_text)],
+             padding=True,
+             return_tensors="pt"
+         )
+         text_input_ids = encoded_text['input_ids'].to(self.T5.device)
+         text_mask = encoded_text['attention_mask'].to(self.T5.device)
+
+         # Initialize generation
+         sog = repeat(self.sog.weight, '1 d -> b 1 d', b=1)
+         sos = repeat(self.sos.weight, '1 d -> b 1 d', b=1)
+
+         z_sequence = [decoder_inputs_embeds_vae]
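+         # Track the predicted class of each latent slice in the output sequence:
+         # 3 = style-image slice, 0 = SOG, 1 = EOG, 2 = generated image slice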
+         special_sequence = torch.ones(decoder_inputs_embeds_vae.size(1)) * 3
+
+         # Build initial decoder inputs
+         decoder_inputs_embeds = self.query_emb(torch.cat(z_sequence, dim=1))
+         if len(style_text[0]) != 0:
+             decoder_inputs_embeds = torch.cat([sos, decoder_inputs_embeds], dim=1)
+         else:
+             decoder_inputs_embeds = torch.cat([sos, decoder_inputs_embeds, sog], dim=1)
+             vae_latent = self.t5_to_vae(sog)
+             special_sequence = torch.cat([special_sequence, torch.zeros(1)])
+             z_sequence.append(vae_latent)
+
+         # Autoregressive generation
+         for i in range(max_new_tokens):
+             if cfg_scale != 1.0:
+                 # Classifier-free guidance
+                 conditional_text_embeds = self.T5.shared(text_input_ids)
+                 if self.drop_text:
+                     unconditional_text_embeds = self.uncond_embedding.weight.expand_as(conditional_text_embeds)
+                 else:
+                     unconditional_text_embeds = conditional_text_embeds
+
+                 if self.drop_img:
+                     unconditional_decoder_embeds = self.uncond_embedding.weight.expand_as(decoder_inputs_embeds)
+                 else:
+                     unconditional_decoder_embeds = decoder_inputs_embeds
+
+                 output_uncond = self.T5(
+                     inputs_embeds=unconditional_text_embeds,
+                     attention_mask=text_mask,
+                     decoder_inputs_embeds=unconditional_decoder_embeds
+                 ).logits[:, -1:]
+
+                 output_cond = self.T5(
+                     input_ids=text_input_ids,
+                     attention_mask=text_mask,
+                     decoder_inputs_embeds=decoder_inputs_embeds
+                 ).logits[:, -1:]
+
+                 output = output_uncond + (output_cond - output_uncond) * cfg_scale
+             else:
+                 output = self.T5(
+                     input_ids=text_input_ids,
+                     attention_mask=text_mask,
+                     decoder_inputs_embeds=decoder_inputs_embeds
+                 ).logits[:, -1:]
+
+             # Predict special token
+             special_prediction = self.t5_to_special(output)
+             predicted_special = torch.argmax(special_prediction, dim=-1).item()
+
+             if predicted_special == 0:  # SOG
+                 decoder_inputs_embeds = torch.cat([decoder_inputs_embeds, sog], dim=1)
+                 vae_latent = self.t5_to_vae(output)
+                 special_sequence = torch.cat([special_sequence, torch.zeros(1)])
+             elif predicted_special == 1:  # EOG - stop generation
+                 special_sequence = torch.cat([special_sequence, torch.ones(1)])
+                 vae_latent = self.t5_to_vae(output)
+                 z_sequence.append(vae_latent)
+                 break
+             else:  # IMG token
+                 vae_latent = self.t5_to_vae(output)
+                 decoder_inputs_embeds = torch.cat([
+                     decoder_inputs_embeds,
+                     self.query_emb(vae_latent)
+                 ], dim=1)
+                 special_sequence = torch.cat([special_sequence, torch.ones(1) * 2])
+
+             z_sequence.append(vae_latent)
+
+         # Decode to image
+         z_sequence = [el.to(self.vae.device) for el in z_sequence]
+         z_sequence = torch.cat(z_sequence, dim=1)
+         z_sequence = self.z_rearrange(z_sequence)
+         img = torch.clamp(self.vae.decode(z_sequence).sample, -1, 1)
+
+         return img, special_sequence.to(self.T5.device)
+
+     def generate_handwriting(
+         self,
+         style_image: Image.Image,
+         gen_text: str,
+         style_text: str = "",
+         cfg_scale: float = 1.25,
+         max_new_tokens: int = 512,
+         device: Optional[str] = None
+     ) -> Image.Image:
+         """
+         High-level API for generating handwriting.
+
+         This is the recommended entry point for inference.
+
+         Args:
+             style_image: PIL Image containing handwriting style reference
+             gen_text: Text to generate in the style
+             style_text: Optional transcription of text in style_image
+             cfg_scale: Classifier-free guidance scale (default: 1.25)
+             max_new_tokens: Maximum generation length
+             device: Device to use (auto-detected if None)
+
+         Returns:
+             PIL Image of generated handwriting
+         """
+         import torchvision.transforms as T
+
+         if device is None:
+             device = next(self.parameters()).device
+
+         # Preprocess style image
+         style_img = style_image.convert('RGB')
+
+         # Resize to height 64 maintaining aspect ratio
+         width, height = style_img.size
+         new_width = int(64 * width / height)
+         style_img = style_img.resize((new_width, 64), Image.LANCZOS)
+
+         # Convert to tensor
+         style_tensor = T.ToTensor()(style_img).to(device)
+         style_len = style_tensor.shape[-1]
+
+         # Get model inputs
+         inputs = self.get_model_inputs(
+             style_img=[style_tensor],
+             style_len=style_len,
+             max_img_len=1024 * 1024
+         )
+
+         # Generate
+         output_img, _ = self.generate(
+             decoder_inputs_embeds_vae=inputs['decoder_inputs_embeds'],
+             style_text=[style_text],
+             gen_text=[gen_text],
+             cfg_scale=cfg_scale,
+             max_new_tokens=max_new_tokens
+         )
+
+         # Crop out the style image part (keep only generated portion)
+         style_width_latent = style_len // 8 + 1  # +1 for SOG token
+         output_img = output_img[:, :, :, style_width_latent * 8:]
+
+         # Trim whitespace
+         output_img = self._trim_white(output_img)
+
+         # Convert to PIL
+         output_img = (torch.clamp(output_img, -1, 1) + 1) * 127.5
+         output_img = output_img.byte().squeeze().cpu().numpy()
+
+         if len(output_img.shape) == 2:
+             return Image.fromarray(output_img, mode='L')
+         elif output_img.shape[0] == 3:
+             output_img = np.transpose(output_img, (1, 2, 0))
+             return Image.fromarray(output_img, mode='RGB')
+         else:
+             return Image.fromarray(output_img[0], mode='L')
+
+     @staticmethod
+     def _trim_white(img: torch.Tensor, threshold: float = 0.9, padding: int = 8) -> torch.Tensor:
+         """Trim white margins from generated image."""
+         start_idx, end_idx = 0, img.size(-1)
+         vertical_min = img[0, 0].min(-2).values.tolist()
+
+         # Skip initial non-white columns
+         for v in vertical_min:
+             if v >= threshold:
+                 break
+             start_idx += 1
+
+         # Skip initial white columns, continuing from where the previous scan stopped
+         for v in vertical_min[start_idx:]:
+             if v < threshold:
+                 break
+             start_idx += 1
+
+         # Skip trailing white columns
+         for v in vertical_min[::-1]:
+             if v < threshold:
+                 break
+             end_idx -= 1
+
+         start_idx = max(start_idx - padding, 0)
+         end_idx = min(end_idx + padding, img.size(-1))
+
+         if start_idx >= end_idx:
+             return img
+
+         return img[..., start_idx:end_idx]
+
+     def forward(self, **kwargs):
+         """Forward pass - mainly for training compatibility."""
+         raise NotImplementedError(
+             "Direct forward() is not supported. Use generate_handwriting() for inference."
+         )
+
+
+ # Register for AutoModel
+ ErukuConfig.register_for_auto_class()
+ ErukuForConditionalGeneration.register_for_auto_class("AutoModel")
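
The two `register_for_auto_class` calls are what make this custom architecture loadable through the Auto classes with `trust_remote_code=True`, for example:

```python
from transformers import AutoConfig, AutoModel

# Both resolve to the custom classes defined in this repository
config = AutoConfig.from_pretrained("blowing-up-groundhogs/eruku", trust_remote_code=True)
model = AutoModel.from_pretrained("blowing-up-groundhogs/eruku", trust_remote_code=True)
print(type(model).__name__)  # ErukuForConditionalGeneration
```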