---
license: apache-2.0
language:
  - ar
base_model:
  - zai-org/GLM-OCR
pipeline_tag: image-text-to-text
new_version: sherif1313/Arabic-GLM-OCR-v2
---

💜 Github   |   🤗 Hugging Face   |   📚 Cookbooks
🖥️ Demo

# ๐Ÿ† sherif1313/Arabic-GLM-OCR-v1 ### High-Quality AI Model for Arabic Documents

🚀 Ready-to-Use Arabic OCR Engine

Arabic-GLM-OCR-v1 is a production-optimized model for Arabic OCR, developed from GLM-OCR for high-accuracy document understanding.

Designed specifically for real-world Arabic documents, it delivers strong performance in extracting printed and handwritten Arabic text from structured and semi-structured documents, with a particular focus on Arabic handwriting recognition.


💎 Key Strengths

✅ Highly accurate Arabic text reconstruction

✅ Preserves punctuation well

✅ Clear spacing and consistent formatting

✅ Fine-tuned decoding strategy

✅ Safe generation settings for production environments


🧠 Technical Architecture

  • Base Model: GLM-OCR (vision-language model)
  • Fine-tuning precision: FP16
  • Loss strategy: supervised training on answer tokens only
  • Prompt masking: enabled (instruction tokens excluded from the loss)
  • Learning method: curriculum training, progressing from easy to difficult samples
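The answer-only loss with prompt masking can be sketched as follows (a minimal illustration, not the actual training code; the token ids and `build_labels` helper are assumptions):

```python
# Minimal sketch of answer-only supervision: prompt (image + instruction)
# tokens are masked out of the loss with IGNORE_INDEX, so only the target
# transcription tokens are supervised. Token ids are illustrative.
IGNORE_INDEX = -100  # the label value PyTorch's cross_entropy ignores

def build_labels(prompt_ids, answer_ids):
    """Labels for one sample: mask the prompt, supervise the answer."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)

prompt = [151644, 8948, 1024]   # hypothetical instruction tokens
answer = [3456, 789, 151645]    # hypothetical transcription tokens
labels = build_labels(prompt, answer)
assert labels[:len(prompt)] == [IGNORE_INDEX] * len(prompt)
```

Only the final `len(answer)` positions contribute to the loss, which is what keeps the model from learning to parrot the instruction.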

Engineering Outcomes

  • Stable convergence
  • Minimal overfitting
  • Robust generalization
  • Correct prompt-masking behavior

📊 Training Stability Analysis

| Metric | Final Value |
|--------|-------------|
| Training Loss | ~0.35 |
| Evaluation Loss | ~0.34 |
| Gap | <10% |
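The gap row is the relative difference between the two losses:

```python
# Relative train/eval gap from the values reported above (~0.35 vs ~0.34).
train_loss, eval_loss = 0.35, 0.34
gap = abs(train_loss - eval_loss) / train_loss
print(f"{gap:.1%}")  # about 2.9%, comfortably under the 10% threshold
```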

✅ Observations

  • Stable loss curves
  • No overfitting observed
  • Balanced behavior between training and evaluation
  • Gradual improvement on the evaluation set during training

🔎 Overfitting Evaluation

| Indicator | Assessment |
|-----------|------------|
| Training Stability | ★★★★★ |
| Generalization | ★★★★★ |
| Overfitting Risk | Very Low |
| Methodology Efficiency | Excellent |
| Masking Accuracy | Verified |


📈 Performance Characteristics

| Scenario | Performance |
|----------|-------------|
| Clear Printed Text | ★★★★★ |
| Medium Quality Scan | ★★★★☆ |
| Significant Distortion | ★★☆☆☆ |
| Arabic Handwriting | Excellent |

Strengths

  • Handles multi-line Arabic text
  • Preserves punctuation well
  • Reconstructs spacing stably
  • Works best on scanned documents with clear text


Prioritizes:

  • Accuracy
  • Consistency
  • Stability
  • Ease of deployment

⚠️ The model performs with high efficiency but is still in a testing phase, with ongoing work to improve output formatting.

โš ๏ธ Known Limitations

โš™๏ธ Implementation Methodology

The official inference pipeline and the modified pipeline differ significantly in processing strategy.

The modified implementation provides:

  • Better generation length control
  • Improved repetition handling
  • Cleaner post-processing
  • More stable decoding behavior

The official pipeline requires adjustments to fully suit structured Arabic OCR tasks.

For this reason, development continues using the optimized modified pipeline, with ongoing stability and formatting improvements.
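The repetition handling and post-processing in the modified pipeline are not published; as one hedged example of the kind of clean-up involved, consecutive duplicate lines (a common OCR repetition artifact) can be collapsed like this (`collapse_repeated_lines` is an illustrative helper, not part of the actual pipeline):

```python
def collapse_repeated_lines(text: str) -> str:
    """Drop consecutive duplicate lines, a typical OCR repetition artifact."""
    cleaned = []
    for line in text.splitlines():
        # Keep a line only if it differs from the previous kept line.
        if not cleaned or line.strip() != cleaned[-1].strip():
            cleaned.append(line)
    return "\n".join(cleaned)

print(collapse_repeated_lines("سطر أول\nسطر أول\nسطر ثان"))
```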


โš ๏ธ Chat Template Compatibility

The model sherif1313/Arabic-GLM-OCR-v1 may not be fully aligned with the default `apply_chat_template` behavior.

Improper usage may lead to:

  • Incorrect image token encoding
  • Minor text token misalignment
  • Reduced OCR extraction accuracy

It is recommended to verify prompt formatting and ensure correct image-text separation during preprocessing.
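One way to verify the formatting is to build the message in the same structure the scripts in this README use, then decode the tokenized prompt back to text to check where the image placeholder lands (the decode step is shown as comments because it requires the downloaded processor; `"page.png"` is a placeholder path):

```python
# Build the message in the structure used throughout this README.
# "page.png" is a placeholder; a PIL.Image object also works here.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "page.png"},
            {"type": "text", "text": "Text Recognition:"},
        ],
    }
]

# With the processor loaded, round-trip the prompt to inspect the layout:
# inputs = processor.apply_chat_template(
#     messages, add_generation_prompt=True, tokenize=True,
#     return_dict=True, return_tensors="pt")
# print(processor.decode(inputs["input_ids"][0]))  # check image/text order
```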

๐Ÿ” Generation Control Notice

Using excessively large generation limits such as:

may cause:

  • Repetitive outputs
  • Failure to stop at eos_token
  • Duplicate or unstructured text
  • Recommended settings:
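As a hedged sketch of conservative settings (the `repetition_penalty` value is an illustrative assumption, not an official recommendation from the model author):

```python
# Conservative generation settings for structured Arabic OCR.
# max_new_tokens matches the scripts in this README; repetition_penalty
# is an illustrative value to discourage duplicated spans.
GEN_KWARGS = {
    "max_new_tokens": 2048,    # avoid much larger caps
    "do_sample": False,        # deterministic decoding suits OCR
    "repetition_penalty": 1.1,
}
# Usage: generated_ids = model.generate(**inputs, **GEN_KWARGS)
```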

## Install

```shell
git clone https://github.com/zai-org/glm-ocr.git
cd glm-ocr
uv venv --python 3.12 --seed && source .venv/bin/activate
uv pip install -e .
```

## CLI

```python
import argparse
import sys
import os
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
from PIL import Image

MODEL_PATH = "sherif1313/Arabic-GLM-OCR-v1"

# Select the device (GPU if available)
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
print(f"[INFO] Device: {device}", file=sys.stderr)

# Load the model and the processor
try:
    processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
    model = AutoModelForImageTextToText.from_pretrained(
        MODEL_PATH,
        dtype=dtype,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
        device_map="auto"
    )
    model.eval()
except Exception as e:
    print(f"[ERROR] Failed to load the model: {e}", file=sys.stderr)
    sys.exit(1)

def ocr_image(image_path):
    try:
        image = Image.open(image_path).convert("RGB")
    except Exception as e:
        return f"[ERROR] Cannot open the image: {e}"

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "Text Recognition:"}
            ],
        }
    ]

    try:
        inputs = processor.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt"
        ).to(model.device)

        with torch.no_grad():
            generated_ids = model.generate(
                **inputs,
                max_new_tokens=2048,
                do_sample=False
            )

        hasil = generated_ids[0][len(inputs["input_ids"][0]):]
        teks_final = processor.decode(hasil, skip_special_tokens=True)
        return teks_final
    except Exception as e:
        return f"[ERROR] Recognition failed: {e}"

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Recognize text in multiple images with Arabic-GLM-OCR-v1")
    parser.add_argument("paths", nargs="*", help="File or directory paths (a directory is scanned for all images inside it)")
    parser.add_argument("--ext", default=".jpg,.jpeg,.png,.bmp,.tiff", help="Comma-separated image extensions")
    args = parser.parse_args()

    # If the user gave no path, fall back to the default directory
    if not args.paths:
        default_dir = "/home/sheriff/Desktop/5"
        if os.path.isdir(default_dir):
            print(f"[INFO] No path given; using the default directory: {default_dir}", file=sys.stderr)
            args.paths = [default_dir]
        else:
            print("[ERROR] No path specified and the default directory does not exist!", file=sys.stderr)
            sys.exit(1)

    # Collect the list of image files to process
    image_files = []
    extensions = [ext.strip().lower() for ext in args.ext.split(",")]

    for path in args.paths:
        if os.path.isfile(path):
            # Add the file if it has a supported image extension
            ext = os.path.splitext(path)[1].lower()
            if ext in extensions:
                image_files.append(path)
            else:
                print(f"[WARNING] {path} is not a supported image (extensions: {extensions})", file=sys.stderr)
        elif os.path.isdir(path):
            # For a directory, collect all images inside it (non-recursive)
            for file in os.listdir(path):
                filepath = os.path.join(path, file)
                if os.path.isfile(filepath):
                    ext = os.path.splitext(file)[1].lower()
                    if ext in extensions:
                        image_files.append(filepath)
        else:
            print(f"[ERROR] Path does not exist: {path}", file=sys.stderr)

    if not image_files:
        print("[ERROR] No valid images to process!", file=sys.stderr)
        sys.exit(1)

    # Process each image
    for img_path in image_files:
        print(f"\n--- File: {img_path} ---")
        result = ocr_image(img_path)
        print(result)
```



## Web


```python

import gradio as gr
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
from PIL import Image

# --- MODEL CONFIGURATION ---
MODEL_PATH = "sherif1313/Arabic-GLM-OCR-v1"

# Auto-detect the device
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
print(f"🚀 OCR engine starting: Device={device} | Dtype={dtype}")

# --- MODEL INITIALIZATION (with error checking) ---
try:
    print("⏳ Loading processor...")
    processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)

    print("⏳ Loading model (this may take a few minutes)...")
    model = AutoModelForImageTextToText.from_pretrained(
        MODEL_PATH,
        torch_dtype=dtype,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
        device_map="auto"
    )
    model.eval()
    print("✅ Model ready!")
except Exception as e:
    print(f"❌ Failed to load the model: {e}")
    raise  # Stop execution if the model fails to load

# --- EXAMPLE IMAGES (make sure these files are in the same folder as the script) ---
EXAMPLE_IMAGES = [
    "train_22062.jpg",
    "train_22057.jpg",
    "BULAC_MS_ARA_417_0006_0011.jpg",
    "00025.png",
    "AHTD3A0074_Para4_1.jpg",
    "00060.png",
]

# --- OCR FUNCTION ---
def proses_intelijen(image):
    if image is None:
        return "⚠️ Please upload an image first."

    # Format the message according to the model's chat template
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "Text Recognition:"}
            ],
        }
    ]

    try:
        # Apply the chat template and tokenize
        inputs = processor.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt"
        ).to(model.device)

        # Generate the text
        with torch.no_grad():
            generated_ids = model.generate(
                **inputs,
                max_new_tokens=2048,
                do_sample=False
            )

        # Keep only the generated part (without the prompt)
        hasil = generated_ids[0][len(inputs["input_ids"][0]):]
        teks_final = processor.decode(hasil, skip_special_tokens=True)
        return teks_final

    except Exception as e:
        return f"🚨 An error occurred: {str(e)}"

# --- GRADIO INTERFACE ---
css_custom = """
.container { max-width: 1200px; margin: auto; padding-top: 20px; }
h1 { text-align: center; color: #3b82f6; }
"""

with gr.Blocks(css=css_custom, title="Arabic GLM-OCR") as app:
    with gr.Column(elem_classes="container"):
        gr.Markdown("# Arabic GLM-OCR")
        gr.Markdown("Arabic OCR powered by GLM-OCR.")

        with gr.Row():
            with gr.Column(scale=1):
                input_img = gr.Image(type="pil", label="Upload Image", height=450)
                scan_btn = gr.Button("🚀 START SCAN", variant="primary", size="lg")

            with gr.Column(scale=1):
                output_txt = gr.Textbox(label="Recognized Text", lines=24)

        # Add clickable example images
        gr.Examples(
            examples=EXAMPLE_IMAGES,
            inputs=input_img,
            outputs=output_txt,
            fn=proses_intelijen,
            cache_examples=False,  # Set to True to speed this up (requires disk space)
            label="Example images (click to load)"
        )

    # Wire the button to the function
    scan_btn.click(fn=proses_intelijen, inputs=input_img, outputs=output_txt)

if __name__ == "__main__":
    app.launch()
```

Careful verification of message formatting is recommended when using custom paths.

📜 License

Apache 2.0